Closed Bug 1142538 Opened 9 years ago Closed 9 years ago

Create a separate production ES cluster for Orange Factor

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cliang, Assigned: cliang)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/736] )

The upgrade of ES in the dev environment (0.90.x -> 1.x) showed that there were changes that broke existing code.  Given that there is a re-write of OF in the works, we're going to move the production ES indexes off of the (shared) ES cluster and onto a separate one, so as not to hold up the planned upgrade.
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/736]
Blocks: 1062342
The cluster is up and servicing requests sent to of-elasticsearch-zlb.webapp.scl3.mozilla.com.  It should be available for wider testing once ACLs have been added.

I'm testing how long it takes to copy indexes.  Currently, the bugs index takes about 1-2 minutes; the logs index takes about 20 minutes.
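
For anyone following along, one way to do this kind of index copy is a scan + bulk between the two clusters with the Python elasticsearch client.  A minimal sketch (the source hostname is a placeholder, and this isn't necessarily the exact script being used):

from elasticsearch import Elasticsearch, helpers

# Placeholder hostname for the shared cluster; the OF VIP is the real one.
src = Elasticsearch(["http://SHARED-ES-CLUSTER:9200"])
dst = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])

def copy_index(index):
    # Stream every document out of the source index and bulk-index it,
    # keeping the same _id, into the destination cluster.
    actions = (
        {"_index": index, "_type": d["_type"], "_id": d["_id"], "_source": d["_source"]}
        for d in helpers.scan(src, index=index)
    )
    helpers.bulk(dst, actions)

copy_index("bugs")   # ~1-2 minutes
copy_index("logs")   # ~20 minutes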

* Initial configs added to puppet (r102171)
* Added collectd monitoring of ES
* Added New Relic monitoring of ES
ACLs allowing servers to talk to the new VIP were added in bug 1145745.

@mcote / edmorley: If there are other servers that need access to the VIP, please let me know.

After Ed gets back in, I'll start poking RE: the actual cutover process. =)
@edmorley: Looking to see if:
   1. There are any other servers that will need access to the new Orange Factor ES VIP and 
   2. When might be a good time to attempt a cutover (stopping ingestion of logs for about 30 minutes as I do a fresh copy of the indexes).
Flags: needinfo?(emorley)
Hi - sorry I've not had a chance to get to this today; the tab has been open in my browser staring at me, but we had tree-closing issues with treeherder log parsing today and dev-blocking breakage after a pip point release yesterday, and I'm still catching up after PTO/public holiday.

(In reply to C. Liang [:cyliang] from comment #3)
> @edmorley: Looking to see if:
>    1. There are any other servers that will need access to the new Orange
> Factor ES VIP and 

I'm pretty sure those aren't the correct nodes in bug 1145745 comment 0 - I think rabbitmq is the one that needs access, since the submissions are made via:
https://github.com/mozilla/treeherder-service/blob/c51409d34efe4cab7d925cb5906538e319a8da85/treeherder/model/derived/jobs.py#L674

And the high_priority queue runs only here:
https://github.com/mozilla/treeherder-service/blob/0d5b02daf52fa9c2637309989798e9518f3a7f80/bin/run_celery_worker_hp#L30

And:
[emorley@treeherderadm.private.scl3 ~]$ multi treeherder 'ps ax | grep "[h]igh_priority"'
[2015-04-09 19:23:56] [treeherder1.webapp.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder2.webapp.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder3.webapp.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-processor1.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-processor2.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-processor3.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-etl1.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-etl2.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-rabbitmq1.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:57] [treeherder-processor1.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.110s)

[2015-04-09 19:23:57] [treeherder-processor3.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.113s)

[2015-04-09 19:23:57] [treeherder-processor2.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.121s)

[2015-04-09 19:23:57] [treeherder3.webapp.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.131s)
[2015-04-09 19:23:57] [treeherder-etl1.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.120s)
[2015-04-09 19:23:57] [treeherder-etl2.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.110s)
[2015-04-09 19:23:57] [treeherder-rabbitmq1.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.109s) [treeherder-rabbitmq1.private.scl3.mozilla.com] out: 6086 ?        Sl     0:05 /usr/bin/python2.7 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log -l INFO -n hp.%h
[treeherder-rabbitmq1.private.scl3.mozilla.com] out: 30530 ?        S      1:53 /usr/bin/python2.7 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log -l INFO -n hp.%h
[2015-04-09 19:23:57] [treeherder1.webapp.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.229s)
[2015-04-09 19:23:57] [treeherder2.webapp.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.229s)
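
For context on why only that host matters, here's the rough shape of the submission path (illustrative names only, not the actual treeherder code): a celery task is routed to the high_priority queue, and only the worker consuming that queue ever opens a connection to the OF ES VIP.

from celery import Celery
from elasticsearch import Elasticsearch

app = Celery("sketch")  # made-up app/task names for this sketch

@app.task
def submit_elasticsearch_doc(index, doc_type, doc):
    # Executed only on the host running the high_priority worker
    # (treeherder-rabbitmq1 above), so that's the host needing the flow.
    es = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])
    es.index(index=index, doc_type=doc_type, body=doc)

# Producers just enqueue; they never talk to ES themselves:
# submit_elasticsearch_doc.apply_async(args=[...], routing_key="high_priority")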

>    2. When might be a good time to attempt a cutover (stopping ingestion of
> logs for about 30 minutes as I do a fresh copy of the indexes).

I'm not too concerned about OF being down for a few hours if that helps?

I'm in the UK, so I'm mostly only around earlier in the day than this.
Flags: needinfo?(emorley)
My sympathies: it's never fun trying to field emergencies when digging out from underneath a backlog. =\

I've put in a bug for a network flow for the rabbitmq server (and verified that everything else in the treeherder commanderconfig group can reach the new ES VIP/port).  Depending on when that bug is done and your availability, I'd like to tentatively schedule the cut-over for *next* Thursday, April 16th, at around 2PM UTC (3PM BST / 7AM PDT).
Depends on: 1153186
Bug 1153186 will allow us to more easily pause just the Bugzilla/ES data mirroring (rather than also pausing whatever else is currently on the same queue).
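
To illustrate the idea (the task and queue names below are made up for the sketch, not necessarily what the patch uses): give the mirroring task its own celery queue, so pausing ingestion is just stopping the one worker that consumes it.

# settings sketch -- route only the Bugzilla/ES mirroring task to its own queue
CELERY_ROUTES = {
    "treeherder.etl.tasks.submit_elasticsearch_doc": {"queue": "classification_mirroring"},
}

# A dedicated worker consumes just that queue, so it can be stopped on its own:
#   celery -A treeherder worker -Q classification_mirroring -c 1 ...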

That said, could we perhaps switch the data submitters over to the new cluster before the rest of the data is migrated, to avoid needing to do everything at once?
(Removing infra group since there are no IPs in this bug, and it means I can CC the watcher email for the OrangeFactor component).
Group: infra
OS: Mac OS X → All
Hardware: x86 → All
Do you have a (rough) ETA on how long it will take to get the "pause" button put into play?

(If the answer is something like "end of the quarter", I'll need to go back to some other folks and make sure I'm not holding up some of their objectives. =) )
The bug in comment 6 has a patch written today, awaiting review :-)

Also, can we avoid the need to pause at all? (My question in comment 6)
I've carved out some time today to test doing a merge of an older index into a new one, so we'll see what the answer is. =)
I *think* that as long as the IDs are being generated in a way that won't cause conflicts, we should be okay to point the submitters at the new cluster and then copy over the old data.

It looks like reads/writes to the index were sometimes a tad slower while I was re-indexing the old data, but nothing that sent up a red flag (at most, about 4% slower).
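
For concreteness, the merge I have in mind looks roughly like the sketch below: copy each old document with its original _id and use op_type=create so anything the submitters have already written under the same _id is left alone.  (The old-cluster hostname is a placeholder; this is the shape of the operation, not a polished script.)

from elasticsearch import Elasticsearch, helpers

old = Elasticsearch(["http://OLD-SHARED-CLUSTER:9200"])   # placeholder host
new = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])

def merge_index(index):
    actions = (
        {
            "_op_type": "create",   # skip docs whose _id already exists in the new index
            "_index": index,
            "_type": d["_type"],
            "_id": d["_id"],
            "_source": d["_source"],
        }
        for d in helpers.scan(old, index=index)
    )
    # raise_on_error=False: "document already exists" conflicts are expected
    helpers.bulk(new, actions, raise_on_error=False)

merge_index("logs")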

Do you want to try to cut over the data submitters at the beginning of next week?
(As an FYI: I'm going to be highly distracted on Thursday, April 23rd and away from Friday, April 24th and back on Wednesday, April 29th.)
Flags: needinfo?(emorley)
Depends on: 1156399
I'll see if I can switch them over now.
Flags: needinfo?(emorley)
Depends on: 1156448
Everything's migrated over to the new cluster now (as in, both the data submitters and the queries), so we're good to migrate the old data :-)
A temporary backup copy of the new logs index (before merge) was taken (logs-20150421); merging of the old data has started.
Merging of old logs is done.  Backups taken of bugs and bzcache indexes (bugs-20150421, bzcache-20150421); merging of the old data has started.
Old data merged.  Going to http://brasstacks.mozilla.com/orangefactor/ and setting the start date to before the changeover (March 1st) brought up "old" data.

I've tweaked the backup jobs:
  1. One set will do a local index copy of the bugs and logs indexes, on the OF ES cluster (<index>-<date in YYYYMMDD format).
  2. One set will do an index copy of the bugs and logs indexes to the general dev cluster.

I've not actively tested doing a "copy restore" from an ES 1.x cluster to an ES 0.90 cluster, which is why backup set #1 exists.  Backup set #2 exists in case of complete failure or unrecoverable damage to the new cluster.  I'm leaving this bug open until I've verified the backups have taken place. =)

[For future reference, the bzcache index is populated fresh, every hour.]
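
In case it helps future-us, the backup jobs amount to roughly the following (sketch only, not the actual job definitions; the dev-cluster hostname is a placeholder):

from datetime import date
from elasticsearch import Elasticsearch, helpers

of_es = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])
dev_es = Elasticsearch(["http://DEV-ES-CLUSTER:9200"])   # placeholder

def backup(index):
    dated = "%s-%s" % (index, date.today().strftime("%Y%m%d"))  # e.g. logs-20150421
    for target in (of_es, dev_es):
        actions = (
            {"_index": dated, "_type": d["_type"], "_id": d["_id"], "_source": d["_source"]}
            for d in helpers.scan(of_es, index=index)
        )
        # set 1: local dated copy on the OF cluster; set 2: copy on the dev cluster
        helpers.bulk(target, actions)

for idx in ("bugs", "logs"):
    backup(idx)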
We're not using the dev cluster at all, if that saves having to sync there.
And everything looks good - thank you for sorting this :-)
Backups look like they fired off correctly.  W00t.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1179575