Closed Bug 946334 Opened 11 years ago Closed 11 years ago

switch buildapi from old to new rabbit instances

Categories

(Release Engineering :: General, defect)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

We have a new RabbitMQ instance in scl3 (bug 934593) and will shortly have flows to it (bug 945940).  I'd like to switch both the producer (buildapi01) and consumer (selfserve-agent) sides of that from the old to the new instances.

The easy way to do this would be to declare a short self-serve downtime, make the switch, and be done with it.

The harder way is to get something to transfer messages from the relevant queues on the old rabbit cluster into the relevant queues on the new cluster, then atomically switch selfserve-agent to read from the new cluster, then do the same for buildapi01.

I don't know how to do the latter, but I'm sure it's relatively easy, at worst with a simple read-and-write Python script.  I'll be happy to help if you'd like to pursue that option.
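The "simple read-and-write Python script" could look something like the sketch below. It assumes the third-party `pika` AMQP client; the host names are illustrative placeholders, not the actual releng hosts (the queue names match the ones that show up in the rabbitmqctl output later in this bug).

```python
# Sketch of a one-shot queue-drain script (hypothetical host names;
# assumes the third-party `pika` AMQP client for the real run).

def drain(src_ch, dst_ch, queue):
    """Move every ready message from `queue` on the old cluster to the
    same queue on the new one, acking only after a successful republish."""
    moved = 0
    while True:
        method, props, body = src_ch.basic_get(queue)
        if method is None:  # queue is empty
            break
        # Publish via the default exchange, which routes on queue name.
        dst_ch.basic_publish(exchange="", routing_key=queue,
                             body=body, properties=props)
        src_ch.basic_ack(method.delivery_tag)
        moved += 1
    return moved

if __name__ == "__main__":
    import pika  # third-party; pip install pika
    src = pika.BlockingConnection(pika.ConnectionParameters(
        host="old-rabbit.example.com", virtual_host="/buildapi")).channel()
    dst = pika.BlockingConnection(pika.ConnectionParameters(
        host="new-rabbit.example.com", virtual_host="/buildapi")).channel()
    for q in ("buildapi-agent-rabbit2", "buildapi-web2"):
        print(q, drain(src, dst, q))
```

The "atomically switch" part is the hard bit: the producer has to be stopped or repointed before the final drain, or anything published after the drain is stranded on the old cluster.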

One potential complicating factor is authentication and exchange/queue setup.  I *think* the clients set up what they need, but that hasn't been verified in a long time.
Blocks: 863268
Depends on: 934593
buildduty can monitor this process and deal with developer fallout when the switch happens, but this will need an actual non-buildduty owner to drive it forward.
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
I can do that.  Any thoughts on which method I should plan to use?
Assignee: nobody → dustin
Flags: needinfo?(coop)
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #2)
> I can do that.  Any thoughts on which method I should plan to use?

Let's do the easy way. 

I presume it won't take too long, so we can coordinate something with the sheriffs in the early EST work day easily enough.
Flags: needinfo?(coop)
OK -- let's plan to do this tomorrow morning.  It will likely be a few minutes' buildapi outage, but let's plan an hour for the inevitable.  I'll pre-flight things today, and if that's not successful, call off the change tomorrow.
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #4)
> OK -- let's plan to do this tomorrow morning.  It will likely be a few
> minutes' buildapi outage, but let's plan an hour for the inevitable.  I'll
> pre-flight things today, and if that's not successful, call off the change
> tomorrow.

Per IRC, Ryan and I are OK with this; cc'ing the sheriffs to make them aware of the one-hour self-serve outage tomorrow morning EST.
Oh, never mind the plan for tomorrow.  The flow bug isn't closed yet.  We'll wait until next week.
OK, I've confirmed that the necessary queues, exchanges, and so on are automatically created on connection, by running a test instance of buildapi against the new rabbit servers.

There was an issue with an incorrectly named virtual host (buildapi instead of /buildapi) that would have been annoying to track down during a downtime, but it's fixed now.
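A pre-flight check along these lines can confirm that a client is able to declare what it needs on the new cluster. This is only a sketch, not buildapi's actual setup code; the exchange and queue names and the host are illustrative, and the /buildapi vhost is the one mentioned above.

```python
# Pre-flight sketch: verify a client can declare its exchange and queue
# on the new cluster (hypothetical names; assumes `pika` for the real run).

def preflight(ch, exchange="buildapi-test", queue="buildapi-test-q"):
    """Declare an exchange and queue, bind them, and return their names."""
    ch.exchange_declare(exchange=exchange, exchange_type="topic", durable=True)
    ch.queue_declare(queue=queue, durable=True)
    ch.queue_bind(queue=queue, exchange=exchange, routing_key="#")
    return [exchange, queue]

if __name__ == "__main__":
    import pika  # third-party; pip install pika
    conn = pika.BlockingConnection(pika.ConnectionParameters(
        host="new-rabbit.example.com", virtual_host="/buildapi"))
    print(preflight(conn.channel()))
```

If the credentials lack configure permission on the vhost, the declares fail immediately here instead of mid-downtime.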

So, plan is:

* stop buildapi service
* land the puppetagain change to point self-serve to new servers
* change buildapi config to point to new servers
* restart buildapi
* force a puppet run on the affected masters.

Tomcat, barring other complications, how do you feel about doing this Monday EST morning?
Flags: needinfo?(cbook)
Blocks: 950135
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #7)
> OK, I've confirmed that the necessary queues, exchanges, and so on are
> automatically created on connection, by running a test instance of buildapi
> against the new rabbit servers.
> 
> There was an issue with an incorrectly-named virtualhost (buildapi instead
> of /buildapi) that would have been annoying to track down during a downtime,
> but it's fixed now.
> 
> So, plan is:
> 
> * stop buildapi service
> * land the puppetagain change to point self-serve to new servers
> * change buildapi config to point to new servers
> * restart buildapi
> * force a puppet run on the affected masters.
> 
> Tomcat, barring other complications, how do you feel about doing this Monday
> EST morning?

Basically OK for me, but Ryan is normally on duty during that time, so cc'ing him :)
Flags: needinfo?(cbook) → needinfo?(ryanvm)
Sounds fine. The earlier the better.
Flags: needinfo?(ryanvm)
Done and done, with no trouble.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 951558
From http://www.rabbitmq.com/ha.html it looks like we need to set a policy on these queues to make them mirror correctly.  It doesn't really say what the default policy is, but presumably it's something like "none".

I applied this policy:

[root@rabbit2.releng.webapp.scl3 dmitchell]# rabbitmqctl list_policies -p /buildapi
Listing policies ...
/buildapi       HA      queues  .*      {"ha-mode":"all","ha-sync-mode":"automatic"}    0
...done.

[root@rabbit2.releng.webapp.scl3 dmitchell]# rabbitmqctl list_queues -p /buildapi name slave_pids synchronised_slave_pids policy
Listing queues ...
buildapi-agent-rabbit2  [<rabbit@rabbit1.3.15184.2>]    [<rabbit@rabbit1.3.15184.2>]    HA
buildapi-web2   [<rabbit@rabbit1.3.15186.2>]    [<rabbit@rabbit1.3.15186.2>]    HA
...done.

so queues are now HA.  If this was the cause of bug 951558 (although I'm not convinced, since rabbit showed zero consumers of that queue on either node), then hopefully this will fix it.
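For the record, a policy like the one listed above would be applied with a command along these lines (the exact `set_policy` syntax varies slightly across RabbitMQ 3.x releases, and `--apply-to` needs 3.2+):

```
rabbitmqctl set_policy -p /buildapi --apply-to queues HA ".*" \
    '{"ha-mode":"all","ha-sync-mode":"automatic"}'
```

The pattern `.*` mirrors every queue in the /buildapi vhost, and `ha-sync-mode: automatic` means a newly joined mirror syncs existing messages instead of waiting for them to drain.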
Component: General Automation → General