Closed Bug 1142538 Opened 9 years ago Closed 9 years ago

Create a separate production ES cluster for Orange Factor

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cliang, Assigned: cliang)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/736] )

The upgrade of ES in the dev environment (0.90.x -> 1.x) showed that there were changes that broke existing code.  Given that there is a re-write of OF in the works, we're going to move the production ES indexes off of the (shared) ES cluster and onto a separate one, so as not to hold up the planned upgrade.
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/736]
Blocks: 1062342
The cluster is up and servicing requests sent to of-elasticsearch-zlb.webapp.scl3.mozilla.com.  It should be available for wider testing once ACLs have been added.

I'm testing how long it takes to copy indexes.  Currently, the bugs index takes about 1-2 minutes; the logs index takes about 20 minutes.
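
For anyone following along, one way to do this kind of index copy is a scan + bulk between the two clusters with the Python elasticsearch client.  A minimal sketch (the source hostname is a placeholder, and this isn't necessarily the exact script being used):

from elasticsearch import Elasticsearch, helpers

# Placeholder hostname for the shared cluster; the OF VIP is the real one.
src = Elasticsearch(["http://SHARED-ES-CLUSTER:9200"])
dst = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])

def copy_index(index):
    # Stream every document out of the source index and bulk-index it,
    # keeping the same _id, into the destination cluster.
    actions = (
        {"_index": index, "_type": d["_type"], "_id": d["_id"], "_source": d["_source"]}
        for d in helpers.scan(src, index=index)
    )
    helpers.bulk(dst, actions)

copy_index("bugs")   # ~1-2 minutes
copy_index("logs")   # ~20 minutes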

* Initial configs added to puppet (r102171)
* Added collectd monitoring of ES
* Added New Relic monitoring of ES
ACLs allowing servers to talk to the new VIP were added in bug 1145745.

@mcote / edmorley: If there are other servers that need access to the VIP, please let me know.

After Ed gets back in, I'll start poking RE: the actual cutover process. =)
@edmorley: Looking to see if:
   1. There are any other servers that will need access to the new Orange Factor ES VIP and 
   2. When might be a good time to attempt a cutover (stopping ingestion of logs for about 30 minutes as I do a fresh copy of the indexes).
Flags: needinfo?(emorley)
Hi - sorry I've not had a chance to get to this today; the tab has been open in my browser staring at me, but we had tree-closing issues with treeherder log parsing today and dev-blocking breakage after a pip point release yesterday, and I'm still catching up after PTO/public holiday.

(In reply to C. Liang [:cyliang] from comment #3)
> @edmorley: Looking to see if:
>    1. There are any other servers that will need access to the new Orange
> Factor ES VIP and 

I'm pretty sure those aren't the correct nodes in bug 1145745 comment 0 - I think rabbitmq is the one that needs access, since the submissions are made via:
https://github.com/mozilla/treeherder-service/blob/c51409d34efe4cab7d925cb5906538e319a8da85/treeherder/model/derived/jobs.py#L674

And the high_priority queue runs only here:
https://github.com/mozilla/treeherder-service/blob/0d5b02daf52fa9c2637309989798e9518f3a7f80/bin/run_celery_worker_hp#L30

And:
[emorley@treeherderadm.private.scl3 ~]$ multi treeherder 'ps ax | grep "[h]igh_priority"'
[2015-04-09 19:23:56] [treeherder1.webapp.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder2.webapp.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder3.webapp.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-processor1.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-processor2.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-processor3.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-etl1.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-etl2.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:56] [treeherder-rabbitmq1.private.scl3.mozilla.com] running: ps ax | grep "[h]igh_priority"
[2015-04-09 19:23:57] [treeherder-processor1.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.110s)

[2015-04-09 19:23:57] [treeherder-processor3.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.113s)

[2015-04-09 19:23:57] [treeherder-processor2.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.121s)

[2015-04-09 19:23:57] [treeherder3.webapp.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.131s)
[2015-04-09 19:23:57] [treeherder-etl1.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.120s)
[2015-04-09 19:23:57] [treeherder-etl2.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.110s)
[2015-04-09 19:23:57] [treeherder-rabbitmq1.private.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.109s) [treeherder-rabbitmq1.private.scl3.mozilla.com] out: 6086 ?        Sl     0:05 /usr/bin/python2.7 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log -l INFO -n hp.%h
[treeherder-rabbitmq1.private.scl3.mozilla.com] out: 30530 ?        S      1:53 /usr/bin/python2.7 /usr/bin/celery -A treeherder worker -c 1 -Q high_priority -E --maxtasksperchild=500 --logfile=/var/log/celery/celery_worker_hp.log -l INFO -n hp.%h
[2015-04-09 19:23:57] [treeherder1.webapp.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.229s)
[2015-04-09 19:23:57] [treeherder2.webapp.scl3.mozilla.com] finished: ps ax | grep "[h]igh_priority" (0.229s)
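
For context on why only that host matters, here's the rough shape of the submission path (illustrative names only, not the actual treeherder code): a celery task is routed to the high_priority queue, and only the worker consuming that queue ever opens a connection to the OF ES VIP.

from celery import Celery
from elasticsearch import Elasticsearch

app = Celery("sketch")  # made-up app/task names for this sketch

@app.task
def submit_elasticsearch_doc(index, doc_type, doc):
    # Executed only on the host running the high_priority worker
    # (treeherder-rabbitmq1 above), so that's the host needing the flow.
    es = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])
    es.index(index=index, doc_type=doc_type, body=doc)

# Producers just enqueue; they never talk to ES themselves:
# submit_elasticsearch_doc.apply_async(args=[...], routing_key="high_priority")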

>    2. When might be a good time to attempt a cutover (stopping ingestion of
> logs for about 30 minutes as I do a fresh copy of the indexes).

I'm not too concerned about OF being down for a few hours if that helps?

I'm in the UK, so I'm mostly only around earlier in the day than this.
Flags: needinfo?(emorley)
My sympathies: it's never fun trying to field emergencies when digging out from underneath a backlog. =\

I've put in a bug for a network flow for the rabbitmq server (and verified that everything else in the treeherder commanderconfig group can reach the new ES VIP/port).  Depending on when that bug is done and your availability, I'd like to tentatively schedule the cut-over for *next* Thursday, April 16th, at around 2PM UTC (3PM BST / 7AM PDT).
Depends on: 1153186
Bug 1153186 will allow us to more easily pause just the Bugzilla/ES data mirroring (rather than also pausing whatever else is currently on the same queue).
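
To illustrate the idea (the task and queue names below are made up for the sketch, not necessarily what the patch uses): give the mirroring task its own celery queue, so pausing ingestion is just stopping the one worker that consumes it.

# settings sketch -- route only the Bugzilla/ES mirroring task to its own queue
CELERY_ROUTES = {
    "treeherder.etl.tasks.submit_elasticsearch_doc": {"queue": "classification_mirroring"},
}

# A dedicated worker consumes just that queue, so it can be stopped on its own:
#   celery -A treeherder worker -Q classification_mirroring -c 1 ...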

That said, could we perhaps switch the data submitters over to the new cluster before the rest of the data is migrated, to avoid needing to do everything at once?
(Removing infra group since there are no IPs in this bug, and it means I can CC the watcher email for the OrangeFactor component).
Group: infra
OS: Mac OS X → All
Hardware: x86 → All
Do you have a (rough) ETA on how long it will take to get the "pause" button put into play?

(If the answer is something like "end of the quarter", I'll need to go back to some other folks and make sure I'm not holding up some of their objectives. =) )
The bug in comment 6 has a patch written today, awaiting review :-)

Also, can we avoid the need to pause at all? (My question in comment 6)
I've carved out some time today to test doing a merge of an older index into a new one, so we'll see what the answer is. =)
I *think* that as long as the IDs are being generated in a way that won't cause conflicts, we should be okay to point the submitters at the new cluster and then copy over the old data.

It looks like reads/writes to the index were sometimes a tad slower while I was re-indexing the old data, but nothing that sent up a red flag (at most, about 4% slower).
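
For concreteness, the merge I have in mind looks roughly like the sketch below: copy each old document with its original _id and use op_type=create so anything the submitters have already written under the same _id is left alone.  (The old-cluster hostname is a placeholder; this is the shape of the operation, not a polished script.)

from elasticsearch import Elasticsearch, helpers

old = Elasticsearch(["http://OLD-SHARED-CLUSTER:9200"])   # placeholder host
new = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])

def merge_index(index):
    actions = (
        {
            "_op_type": "create",   # skip docs whose _id already exists in the new index
            "_index": index,
            "_type": d["_type"],
            "_id": d["_id"],
            "_source": d["_source"],
        }
        for d in helpers.scan(old, index=index)
    )
    # raise_on_error=False: "document already exists" conflicts are expected
    helpers.bulk(new, actions, raise_on_error=False)

merge_index("logs")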

Do you want to try to cut over the data submitters at the beginning of next week?
(As an FYI: I'm going to be highly distracted on Thursday, April 23rd and away from Friday, April 24th and back on Wednesday, April 29th.)
Flags: needinfo?(emorley)
Depends on: 1156399
I'll see if I can switch them over now.
Flags: needinfo?(emorley)
Depends on: 1156448
Everything's migrated over to the new cluster now (as in, both the data submitters and the queries), so we're good to migrate the old data :-)
A temporary backup copy of the new logs index (before merge) was taken (logs-20150421); merging of the old data has started.
Merging of old logs is done.  Backups taken of bugs and bzcache indexes (bugs-20150421, bzcache-20150421); merging of the old data has started.
Old data merged.  Going to http://brasstacks.mozilla.com/orangefactor/ and setting the start date to before the changeover (March 1st) brought up "old" data.

I've tweaked the backup jobs:
  1. One set will do a local index copy of the bugs and logs indexes, on the OF ES cluster (<index>-<date in YYYYMMDD format).
  2. One set will do an index copy of the bugs and logs indexes to the general dev cluster.

I've not actively tested doing a "copy restore" from an ES 1.x cluster to an ES 0.90 cluster, which is why backup set #1 exists.  Backup set #2 exists in case of complete failure or unrecoverable damage to the new cluster.  I'm leaving this bug open until I've verified the backups have taken place. =)

[For future reference, the bzcache index is populated fresh, every hour.]
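
In case it helps future-us, the backup jobs amount to roughly the following (sketch only, not the actual job definitions; the dev-cluster hostname is a placeholder):

from datetime import date
from elasticsearch import Elasticsearch, helpers

of_es = Elasticsearch(["http://of-elasticsearch-zlb.webapp.scl3.mozilla.com:9200"])
dev_es = Elasticsearch(["http://DEV-ES-CLUSTER:9200"])   # placeholder

def backup(index):
    dated = "%s-%s" % (index, date.today().strftime("%Y%m%d"))  # e.g. logs-20150421
    for target in (of_es, dev_es):
        actions = (
            {"_index": dated, "_type": d["_type"], "_id": d["_id"], "_source": d["_source"]}
            for d in helpers.scan(of_es, index=index)
        )
        # set 1: local dated copy on the OF cluster; set 2: copy on the dev cluster
        helpers.bulk(target, actions)

for idx in ("bugs", "logs"):
    backup(idx)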
We're not using the dev cluster at all, if that saves having to sync there.
And everything looks good - thank you for sorting this :-)
Backups look like they fired off correctly.  W00t.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1179575