Closed
Bug 1321697
Opened 8 years ago
Closed 8 years ago
Connection to CloudAMQP and Pulse not working
Categories
(Tree Management :: Treeherder: Data Ingestion, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: camd, Assigned: camd)
References
Details
Attachments
(2 files)
Trying to resolve with support now.
Updated•8 years ago
Severity: normal → blocker
Assignee | ||
Comment 1•8 years ago
I rotated the RabbitMQ password and that seems to have helped. Checking now...
Assignee | ||
Comment 2•8 years ago
OK, queues appear to be draining now and things are getting back to normal. I'm seeing new resultsets and jobs show up.
Assignee | ||
Comment 3•8 years ago
While things are returning to normal, there was data loss. The queue grew past 16000 messages and was automatically deleted, so we lost that many jobs.
Assignee | ||
Comment 4•8 years ago
I'm going to modify Pulse Guardian tomorrow so that some applications (like Treeherder) don't have their queues deleted when they grow too large.
Assignee | ||
Comment 5•8 years ago
Rotating the RabbitMQ passwords worked. I did so on stage and prod, and the queues are returning to normal.
Assignee: nobody → cdawson
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 6•8 years ago
Just a summary of what happened:
* Cloud AMQP did an upgrade on their servers around 4pm my time, and that caused us to lose connection with them.
* Advised Sheriffs to close the trees.
* We contacted their support and went back and forth in email a bit to figure out what was going on. Then they advised me to rotate the password.
* I did so and restarted all the dynos. Then the queues began to drain and everything returned to normal.

However, the delay between me noticing the Pulse Guardian emails and rotating the passwords let the queue size exceed the max of 16000 messages. This is the Pulse guardian default threshold for queue deletion, and those messages were lost. After the first deletion, Mark Cote bumped the deletion size up to 50K. This gave us enough time to come to the solution with support and resolve the issue.

I opened Bug 1321704 to modify Pulse guardian to ensure we never delete Treeherder queues again. I'll start on that tomorrow.
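
For illustration, the kind of exemption proposed for Bug 1321704 could look roughly like the sketch below. This is not PulseGuardian's actual code: `DeletionPolicy`, `protected_owners`, and `action_for` are hypothetical names, and only the 16000 default is taken from this bug.

```python
from dataclasses import dataclass, field

# Hypothetical sketch; names do not come from PulseGuardian itself.
DEFAULT_DELETE_THRESHOLD = 16000  # default deletion threshold cited in this bug


@dataclass
class DeletionPolicy:
    delete_threshold: int = DEFAULT_DELETE_THRESHOLD
    # Queue owners that must never be deleted automatically (assumed config).
    protected_owners: frozenset = field(default_factory=frozenset)

    def action_for(self, owner: str, queue_size: int) -> str:
        """Return 'ok', 'warn', or 'delete' for a queue of the given size."""
        if queue_size <= self.delete_threshold:
            return "ok"
        if owner in self.protected_owners:
            # Over the limit but exempt: alert instead of dropping messages.
            return "warn"
        return "delete"


policy = DeletionPolicy(protected_owners=frozenset({"treeherder"}))
```

With a policy like this, an overgrown Treeherder queue would only trigger a warning, while other owners would keep the existing deletion behaviour. Raising `delete_threshold` (as was done here, to 50K) stays a separate knob.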
Comment 7•8 years ago
Thank you for sorting this out!

It's hard to tell without the email transcripts (could you attach them here if secrets redacted, or forward them to treeherder-internal?) - but I presume we're talking about the pulse.m.o CloudAMQP instance, not the Treeherder Heroku add-on one?

Re password rotation, do you mean for internal Pulse/Pulseguardian passwords, or for the credentials Treeherder uses to connect to pulse.m.o?

I have lots more questions, but I'll hold off until we get the emails since I'm sure many of them will be answered there :-)

Could you also attach the #moc chat log? (That channel doesn't have logging; have filed bug 1321779 to add it).
Flags: needinfo?(cdawson)
Assignee | ||
Comment 8•8 years ago
Flags: needinfo?(cdawson)
Assignee | ||
Comment 9•8 years ago
Here is the email transcript. Cleaned up a bit.
Comment 10•8 years ago
Ah awesome, that clears things up a bit. Would it be ok to always CC treeherder-internal@ if there are support emails sent in the future? :-)

So my understanding is:
* CloudAMQP did some update that required that the user/password/host environment variable (CLOUDAMQP_URL) be updated.
* To do that on Heroku they use Heroku's API to push out the change, but for some reason that resulted in a CLOUDAMQP_URL being set on treeherder stage+prod that had different credentials from those actually set on the rabbitmq instance side.
* This meant the Treeherder tasks were unable to connect to rabbitmq, which meant that the "fetch things from pulse.m.o and put them in Treeherder's rabbitmq queue" task would perma-fail.
* However this incorrect password wasn't obviously clear because the default behaviour (until py-amqp 2.1.1, only very recently released) is to just close the socket instead of giving an obvious "auth failed" message (see bug 1287404 comment 19). (This confusing error message got us scratching our heads at least 3 times when we were on SCL3 too.)

Had there been a clearer "incorrect username/password" error message, I suspect the debug step would have gone like:
1) Check Heroku activity page -> see updated CLOUDAMQP_URL env
2) Check CloudAMQP Heroku status panel
3) Compare username/password there with that in CLOUDAMQP_URL -> discover they are different
4) Manually update CLOUDAMQP_URL or hit password rotate whilst waiting for support to reply

Follow-up items:
* Follow up with CloudAMQP to see why their update pushed out the wrong environment variable.
* Stop the auto-deletion of treeherder queues on pulse.m.o since it's a massive footgun (bug 1321704).
* Update to py-amqp 2.1.1 to get the more helpful auth message (however it requires celery 4.0 which has significant changes and only recently released, so I had been waiting until a point release to update them all).
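
The credential comparison in debug step 3 could be scripted along these lines. This is a minimal sketch using only the Python standard library; `amqp_credentials` and `describe_mismatch` are hypothetical helper names, and the second URL is assumed to be copied by hand from the CloudAMQP panel.

```python
import hmac
from urllib.parse import urlsplit


def amqp_credentials(url: str) -> tuple:
    """Extract (username, password, host) from an amqp:// or amqps:// URL."""
    parts = urlsplit(url)
    return (parts.username or "", parts.password or "", parts.hostname or "")


def describe_mismatch(env_url: str, panel_url: str) -> list:
    """Name the fields that differ, without ever printing the secret values.

    Values are compared as they appear in the URL (no percent-decoding),
    so both URLs should use the same encoding.
    """
    names = ("username", "password", "host")
    env, panel = amqp_credentials(env_url), amqp_credentials(panel_url)
    return [
        name
        for name, a, b in zip(names, env, panel)
        if not hmac.compare_digest(a.encode(), b.encode())
    ]
```

In practice the first argument would be the value of `CLOUDAMQP_URL` from the app's environment. An empty list means the two sets of credentials agree; a result such as `["password"]` would point at the kind of mismatch described above.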
Assignee | ||
Comment 11•8 years ago
(In reply to Ed Morley [:emorley] from comment #10)
> Ah awesome, that clears things up a bit. Would it be ok to always CC
> treeherder-internal@ if there are support emails sent in the future? :-)

Absolutely, I'm kicking myself a bit for not doing that in the first place. Definitely will do so in the future. :)

>
> So my understanding is:
> * CloudAMQP did some update that required that the user/password/host
> environment variable (CLOUDAMQP_URL) be updated.
> * To do that on Heroku they use Heroku's API to push out the change, but for
> some reason that resulted in a CLOUDAMQP_URL being set on treeherder
> stage+prod that had different credentials from those actually set on the
> rabbitmq instance side.

I'm not sure if their update changed our credentials in Heroku or not. I'm pretty sure the credentials required changed on their end, but were perhaps not updated on our end till I did the rotate. Really hard to tell the order there. There's no indication I can see in our Heroku Activity feed that the CLOUDAMQP_URL was changed. Though I did the "rotate" password. Perhaps it only creates an activity item if it's done directly in the Heroku admin, as opposed to the CloudAMQP plugin.

> * This meant the Treeherder tasks were unable to connect to rabbitmq, which
> meant that the "fetch things from pulse.m.o and put them in Treeherder's
> rabbitmq queue" task would perma-fail.

I'm actually not convinced that the password rotation was the ONLY thing that fixed this. I alluded to this in IRC and the email that I was trying to reach beige-hedgehog.rmq.cloudamqp.com in a browser and was getting a 503. Later, stephend tried it and got a 404. Then, eventually I got the RabbitMQ login screen. Once I got that, and rotated the password, things were fixed. So this would tell me there was more going on than just the password issue. Support kept saying "No servers were affected" but I don't believe that, tbh.
That domain was unavailable for much of the outage, so Carl must have done something to bring it back up.

> * However this incorrect password wasn't obviously clear because the default
> behaviour (until py-amqp 2.1.1, only very recently released) is to just
> close the socket instead of giving an obvious "auth failed" message (see bug
> 1287404 comment 19). (This confusing error message got us scratching our
> heads at least 3 times when we were on SCL3 too.)
>
> Had there been a clearer "incorrect username/password" error message, I
> suspect the debug step would have gone like:
> 1) Check Heroku activity page -> see updated CLOUDAMQP_URL env
> 2) Check CloudAMQP Heroku status panel
> 3) Compare username/password there with that in CLOUDAMQP_URL -> discover
> they are different
> 4) Manually update CLOUDAMQP_URL or hit password rotate whilst waiting for
> support to reply
>
> Follow-up items:
> * Follow up with CloudAMQP to see why their update pushed out the wrong
> environment variable.
> * Stop the auto-deletion of treeherder queues on pulse.m.o since it's a
> massive footgun (bug 1321704).
> * Update to py-amqp 2.1.1 to get the more helpful auth message (however it
> requires celery 4.0 which has significant changes and only recently
> released, so I had been waiting until a point release to update them all).