Closed Bug 946334 Opened 11 years ago Closed 11 years ago

switch buildapi from old to new rabbit instances

Categories

(Release Engineering :: General, defect)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

We have a new RabbitMQ instance in scl3 (bug 934593) and will shortly have flows to it (bug 945940).  I'd like to switch both the producer (buildapi01) and consumer (selfserve-agent) sides of that from the old to the new instances.

The easy way to do this would be to declare a short self-serve downtime, make the switch, and be done with it.

The harder way is to get something to transfer messages from the relevant queues on the old rabbit cluster into the relevant queues on the new cluster, then atomically switch selfserve-agent to read from the new cluster, then do the same for buildapi01.

I don't know how to do the latter, but I'm sure it's relatively easy, at worst with a simple read-and-write Python script.  I'll be happy to help if you'd like to pursue that option.
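The "simple read-and-write Python script" could look something like the sketch below. It assumes the third-party `pika` AMQP client; the host names are illustrative placeholders, not the actual releng hosts (the queue names match the ones that show up in the rabbitmqctl output later in this bug).

```python
# Sketch of a one-shot queue-drain script (hypothetical host names;
# assumes the third-party `pika` AMQP client for the real run).

def drain(src_ch, dst_ch, queue):
    """Move every ready message from `queue` on the old cluster to the
    same queue on the new one, acking only after a successful republish."""
    moved = 0
    while True:
        method, props, body = src_ch.basic_get(queue)
        if method is None:  # queue is empty
            break
        # Publish via the default exchange, which routes on queue name.
        dst_ch.basic_publish(exchange="", routing_key=queue,
                             body=body, properties=props)
        src_ch.basic_ack(method.delivery_tag)
        moved += 1
    return moved

if __name__ == "__main__":
    import pika  # third-party; pip install pika
    src = pika.BlockingConnection(pika.ConnectionParameters(
        host="old-rabbit.example.com", virtual_host="/buildapi")).channel()
    dst = pika.BlockingConnection(pika.ConnectionParameters(
        host="new-rabbit.example.com", virtual_host="/buildapi")).channel()
    for q in ("buildapi-agent-rabbit2", "buildapi-web2"):
        print(q, drain(src, dst, q))
```

The "atomically switch" part is the hard bit: the producer has to be stopped or repointed before the final drain, or anything published after the drain is stranded on the old cluster.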

One potential complicating factor is authentication and exchange/queue setup.  I *think* the clients set up what they need, but that hasn't been verified in a long time.
Blocks: 863268
Depends on: 934593
buildduty can monitor this process and deal with developer fallout when the switch happens, but this will need an actual non-buildduty owner to drive it forward.
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
I can do that.  Any thoughts on which method I should plan to use?
Assignee: nobody → dustin
Flags: needinfo?(coop)
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #2)
> I can do that.  Any thoughts on which method I should plan to use?

Let's do the easy way. 

I presume it won't take too long, so we can coordinate something with the sheriffs in the early EST work day easily enough.
Flags: needinfo?(coop)
OK -- let's plan to do this tomorrow morning.  It will likely be a few minutes' buildapi outage, but let's plan an hour for the inevitable.  I'll pre-flight things today, and if that's not successful, call off the change tomorrow.
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #4)
> OK -- let's plan to do this tomorrow morning.  It will likely be a few
> minutes' buildapi outage, but let's plan an hour for the inevitable.  I'll
> pre-flight things today, and if that's not successful, call off the change
> tomorrow.

Per IRC, Ryan and I are OK with this; cc'ing the sheriffs to make them aware of the one-hour self-serve outage tomorrow morning EST.
Oh, never mind the plan for tomorrow.  The flow bug isn't closed yet.  We'll wait until next week.
OK, I've confirmed that the necessary queues, exchanges, and so on are automatically created on connection, by running a test instance of buildapi against the new rabbit servers.

There was an issue with an incorrectly named virtual host (buildapi instead of /buildapi) that would have been annoying to track down during a downtime, but it's fixed now.
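A pre-flight check along these lines can confirm that a client is able to declare what it needs on the new cluster. This is only a sketch, not buildapi's actual setup code; the exchange and queue names and the host are illustrative, and the /buildapi vhost is the one mentioned above.

```python
# Pre-flight sketch: verify a client can declare its exchange and queue
# on the new cluster (hypothetical names; assumes `pika` for the real run).

def preflight(ch, exchange="buildapi-test", queue="buildapi-test-q"):
    """Declare an exchange and queue, bind them, and return their names."""
    ch.exchange_declare(exchange=exchange, exchange_type="topic", durable=True)
    ch.queue_declare(queue=queue, durable=True)
    ch.queue_bind(queue=queue, exchange=exchange, routing_key="#")
    return [exchange, queue]

if __name__ == "__main__":
    import pika  # third-party; pip install pika
    conn = pika.BlockingConnection(pika.ConnectionParameters(
        host="new-rabbit.example.com", virtual_host="/buildapi"))
    print(preflight(conn.channel()))
```

If the credentials lack configure permission on the vhost, the declares fail immediately here instead of mid-downtime.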

So, plan is:

* stop buildapi service
* land the puppetagain change to point self-serve to new servers
* change buildapi config to point to new servers
* restart buildapi
* force a puppet run on the affected masters.

Tomcat, barring other complications, how do you feel about doing this Monday EST morning?
Flags: needinfo?(cbook)
Blocks: 950135
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #7)
> OK, I've confirmed that the necessary queues, exchanges, and so on are
> automatically created on connection, by running a test instance of buildapi
> against the new rabbit servers.
> 
> There was an issue with an incorrectly-named virtualhost (buildapi instead
> of /buildapi) that would have been annoying to track down during a downtime,
> but it's fixed now.
> 
> So, plan is:
> 
> * stop buildapi service
> * land the puppetagain change to point self-serve to new servers
> * change buildapi config to point to new servers
> * restart buildapi
> * force a puppet run on the affected masters.
> 
> Tomcat, barring other complications, how do you feel about doing this Monday
> EST morning?

Basically OK for me, but Ryan is normally on duty during that time, so cc'ing him :)
Flags: needinfo?(cbook) → needinfo?(ryanvm)
Sounds fine. The earlier the better.
Flags: needinfo?(ryanvm)
Done and done, with no trouble.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 951558
From http://www.rabbitmq.com/ha.html it looks like we need to set a policy on these queues to make them mirror correctly.  It doesn't really say what the default policy is, but presumably it's something like "none".

I applied this policy:

[root@rabbit2.releng.webapp.scl3 dmitchell]# rabbitmqctl list_policies -p /buildapi
Listing policies ...
/buildapi       HA      queues  .*      {"ha-mode":"all","ha-sync-mode":"automatic"}    0
...done.

[root@rabbit2.releng.webapp.scl3 dmitchell]# rabbitmqctl list_queues -p /buildapi name slave_pids synchronised_slave_pids policy
Listing queues ...
buildapi-agent-rabbit2  [<rabbit@rabbit1.3.15184.2>]    [<rabbit@rabbit1.3.15184.2>]    HA
buildapi-web2   [<rabbit@rabbit1.3.15186.2>]    [<rabbit@rabbit1.3.15186.2>]    HA
...done.

so queues are now HA.  If this was the cause of bug 951558 (although I'm not convinced, since rabbit showed zero consumers of that queue on either node), then hopefully this will fix it.
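For the record, a policy like the one listed above would be applied with a command along these lines (the exact `set_policy` syntax varies slightly across RabbitMQ 3.x releases, and `--apply-to` needs 3.2+):

```
rabbitmqctl set_policy -p /buildapi --apply-to queues HA ".*" \
    '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```

The pattern `.*` mirrors every queue in the /buildapi vhost, and `ha-sync-mode: automatic` means a newly joined mirror syncs existing messages instead of waiting for them to drain.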
Component: General Automation → General