Connection to CloudAMQP and Pulse not working

Status: RESOLVED FIXED
Component: Tree Management :: Treeherder: Data Ingestion
Priority: --
Severity: blocker
Opened: 11 months ago
Last modified: 11 months ago
Reporter: camd; Assignee: camd

(2 attachments)
Description (Assignee), 11 months ago

Our connection to CloudAMQP and Pulse is down. Trying to resolve with support now.
Severity: normal → blocker
Comment 1 (Assignee), 11 months ago

I rotated the RabbitMQ password and that seems to have helped.  Checking now...
Comment 2 (Assignee), 11 months ago

OK, queues appear to be draining now and things are getting back to normal.  I'm seeing new resultsets and jobs show up.
Comment 3 (Assignee), 11 months ago

While things are returning to normal, there was data loss.  The queue grew past 16,000 messages and was automatically deleted, so we lost those jobs.
Comment 4 (Assignee), 11 months ago

I'm going to modify that tomorrow so that some applications (like Treeherder) don't have their queues deleted when they grow too large.
Updated (Assignee), 11 months ago

See Also: → bug 1321704
Comment 5 (Assignee), 11 months ago

Rotating the RabbitMQ passwords worked.  I did so on stage and prod, and the queues are returning to normal.
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED
Comment 6 (Assignee), 11 months ago

Just a summary of what happened:

* CloudAMQP did an upgrade on their servers around 4pm my time, which caused us to lose our connection with them.
* Advised Sheriffs to close the trees.
* We contacted their support and went back and forth over email to figure out what was going on. They then advised me to rotate the password.
* I did so and restarted all the dynos. The queues then began to drain and everything returned to normal.

However, the delay between me noticing the Pulse Guardian emails and rotating the passwords let the queue size exceed the maximum of 16,000 messages.  That is Pulse Guardian's default threshold for queue deletion, and those messages were lost.  After the first deletion, Mark Cote bumped the deletion threshold up to 50,000, which gave us enough time to reach the solution with support and resolve the issue.
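The deletion behaviour described above amounts to a simple threshold check. Below is an illustrative sketch only, not Pulse Guardian's actual code; the queue stats mimic the per-queue `name`/`messages` fields that RabbitMQ's management API reports, and the sample queue names are made up.

```python
# Illustrative sketch of a Pulse Guardian-style deletion threshold check.
# Not the real implementation; queue stats mimic the "name"/"messages"
# fields from RabbitMQ's management API, and the sample names are invented.

DELETION_THRESHOLD = 16000  # Pulse Guardian's default at the time (later 50K)

def queues_over_threshold(queues, threshold=DELETION_THRESHOLD):
    """Return the names of queues whose backlog exceeds the threshold."""
    return [q["name"] for q in queues if q["messages"] > threshold]

# Fabricated stats: one queue past the threshold, one well under it.
stats = [
    {"name": "queue/treeherder/jobs", "messages": 16500},
    {"name": "queue/treeherder/resultsets", "messages": 1200},
]
print(queues_over_threshold(stats))  # ['queue/treeherder/jobs']
```

With the default of 16,000, any connection outage longer than the time it takes to accumulate that many messages is enough to trigger deletion, which is why bumping the threshold to 50,000 bought time during the incident.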

I opened bug 1321704 to modify Pulse Guardian so that Treeherder queues are never deleted again.  I'll start on that tomorrow.

Comment 7, 11 months ago

Thank you for sorting this out!

It's hard to tell without the email transcripts (could you attach them here with secrets redacted, or forward them to treeherder-internal?), but I presume we're talking about the pulse.m.o CloudAMQP instance, not the Treeherder Heroku add-on one?

Re password rotation, do you mean the internal Pulse/Pulse Guardian passwords, or the credentials Treeherder uses to connect to pulse.m.o? I have lots more questions, but I'll hold off until we get the emails, since I'm sure many of them will be answered there :-)

Could you also attach the #moc chat log? (That channel doesn't have logging; I've filed bug 1321779 to add it.)
Flags: needinfo?(cdawson)
Comment 8 (Assignee), 11 months ago

Created attachment 8816479 [details]
#moc channel transcript of the event
Flags: needinfo?(cdawson)
Comment 9 (Assignee), 11 months ago

Created attachment 8816486 [details]
rabbitmq outage email transcript.txt

Here is the email transcript.  Cleaned up a bit.

Comment 10, 11 months ago

Ah awesome, that clears things up a bit. Would it be OK to always CC treeherder-internal@ on any support emails sent in the future? :-)

So my understanding is:
* CloudAMQP did some update that required that the user/password/host environment variable (CLOUDAMQP_URL) be updated.
* To do that on Heroku they use Heroku's API to push out the change, but for some reason that resulted in a CLOUDAMQP_URL being set on treeherder stage+prod that had different credentials from those actually set on the rabbitmq instance side.
* This meant the Treeherder tasks were unable to connect to rabbitmq, which meant that the "fetch things from pulse.m.o and put them in Treeherder's rabbitmq queue" task would perma-fail.
* However, the incorrect password wasn't obvious, because the default behaviour (until py-amqp 2.1.1, only very recently released) is to just close the socket instead of giving an explicit "auth failed" message (see bug 1287404 comment 19). (This confusing behaviour got us scratching our heads at least 3 times when we were on SCL3 too.)

Had there been a clearer "incorrect username/password" error message, I suspect the debugging steps would have gone like:
1) Check Heroku activity page -> see updated CLOUDAMQP_URL env 
2) Check CloudAMQP Heroku status panel
3) Compare username/password there with that in CLOUDAMQP_URL -> discover they are different
4) Manually update CLOUDAMQP_URL or hit password rotate whilst waiting for support to reply
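Step 3 above is easy to script. A minimal sketch, assuming CLOUDAMQP_URL is a standard amqp:// URL of the form amqp://user:password@host/vhost; the example URL credentials below are made up:

```python
from urllib.parse import urlsplit

def credentials_match(cloudamqp_url, panel_user, panel_password):
    """Compare the user/password embedded in CLOUDAMQP_URL with the
    credentials currently shown on the CloudAMQP panel."""
    parts = urlsplit(cloudamqp_url)
    return (parts.username, parts.password) == (panel_user, panel_password)

# Made-up credentials for illustration:
url = "amqp://olduser:oldpass@beige-hedgehog.rmq.cloudamqp.com/vhost"
print(credentials_match(url, "olduser", "oldpass"))  # True
print(credentials_match(url, "olduser", "rotated"))  # False
```

A mismatch here would have pointed straight at stale credentials in the config var, without waiting for a support reply.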

Follow-up items:
* Follow up with CloudAMQP to see why their update pushed out the wrong environment variable.
* Stop the auto-deletion of treeherder queues on pulse.m.o since it's a massive footgun (bug 1321704).
* Update to py-amqp 2.1.1 to get the more helpful auth message (however, it requires celery 4.0, which has significant changes and was only recently released, so I had been waiting for a point release before updating them all).
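The difference the py-amqp 2.1.1 behaviour makes to calling code can be sketched roughly as follows. This is a hypothetical diagnosis helper, not Treeherder code: py-amqp's real access-refused exception is `amqp.exceptions.AccessRefused`, which the sketch matches by class name so it stays stdlib-only.

```python
# Hypothetical helper (not Treeherder code). Before py-amqp 2.1.1 a wrong
# password surfaced as an abrupt socket close; from 2.1.1 an explicit
# access-refused exception is raised. Matching the exception by class name
# keeps this sketch free of a py-amqp dependency.

def diagnose(exc):
    if type(exc).__name__ == "AccessRefused":  # py-amqp >= 2.1.1 behaviour
        return "auth failed: check CLOUDAMQP_URL credentials"
    if isinstance(exc, (ConnectionResetError, OSError)):  # older behaviour
        return "socket closed: network problem OR bad credentials (pre-2.1.1)"
    return "unknown failure"

print(diagnose(ConnectionResetError()))
# socket closed: network problem OR bad credentials (pre-2.1.1)
```

The pre-2.1.1 case is exactly the ambiguity described above: a bare socket error cannot distinguish a broker outage from stale credentials, which is why the team burned time on the wrong hypothesis.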
Comment 11 (Assignee), 11 months ago

(In reply to Ed Morley [:emorley] from comment #10)
> Ah awesome, that clears things up a bit. Would it be ok to always CC
> treeherder-internal@ if there are support emails sent in the future? :-)

Absolutely, I'm kicking myself a bit for not doing that in the first place.  Definitely will do so in the future.  :)

> 
> So my understanding is:
> * CloudAMQP did some update that required that the user/password/host
> environment variable (CLOUDAMQP_URL) be updated.
> * To do that on Heroku they use Heroku's API to push out the change, but for
> some reason that resulted in a CLOUDAMQP_URL being set on treeherder
> stage+prod that had different credentials from those actually set on the
> rabbitmq instance side.

I'm not sure whether their update changed our credentials in Heroku or not.  I'm pretty sure the required credentials changed on their end, but they were perhaps not updated on our end until I did the rotate.  It's really hard to tell the order there.

There's no indication I can see in our Heroku activity feed that CLOUDAMQP_URL was changed, though I did rotate the password.  Perhaps an activity item is only created when the change is made directly in the Heroku admin, as opposed to via the CloudAMQP plugin.

> * This meant the Treeherder tasks were unable to connect to rabbitmq, which
> meant that the "fetch things from pulse.m.o and put them in Treeherder's
> rabbitmq queue" task would perma-fail.

I'm actually not convinced that the password rotation was the ONLY thing that fixed this.  I mentioned in IRC and in the email thread that I was trying to reach beige-hedgehog.rmq.cloudamqp.com in a browser and was getting a 503.  Later, stephend tried it and got a 404.  Eventually I got the RabbitMQ login screen.  Once I got that and rotated the password, things were fixed.

So this tells me there was more going on than just the password issue.  Support kept saying "no servers were affected", but I don't believe that, tbh.  That domain was unavailable for much of the outage, so Carl must have done something to bring it back up.

> * However this incorrect password wasn't obviously clear because the default
> behaviour (until py-amqp 2.1.1, only very recently released) is to just
> close the socket instead of giving an obvious "auth failed" message (see bug
> 1287404 comment 19). (This confusing error message got us scratching our
> heads at least 3 times when we were on SCL3 too.)
> 
> Had there been a clearer "incorrect username/password" error message, I
> suspect the debug step would have gone like:
> 1) Check Heroku activity page -> see updated CLOUDAMQP_URL env 
> 2) Check CloudAMQP Heroku status panel
> 3) Compare username/password there with that in CLOUDAMQP_URL -> discover
> they are different
> 4) Manually update CLOUDAMQP_URL or hit password rotate whilst waiting for
> support to reply
> 
> Follow-up items:
> * Follow up with CloudAMQP to see why their update pushed out the wrong
> environment variable.
> * Stop the auto-deletion of treeherder queues on pulse.m.o since it's a
> massive footgun (bug 1321704).
> * Update to py-amqp 2.1.1 to get the more helpful auth message (however it
> requires celery 4.0 which has significant changes and only recently
> released, so I had been waiting until a point release to update them all).