Closed Bug 1538914 Opened 5 years ago Closed 5 years ago

[Tracking] Pulse is down - along with every service that uses pulse.

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bcrisan, Unassigned)

References

Details

Tracking bug for pulse issues on March 25, 2019.
It affects every service that is based on/uses Pulse.

<dustin> signs are pointing toward having pulse issues again
21:07 starting just before the top of the hour
21:07 kmoir: glob: ^^
21:07 
<kmoir> I will go look
21:07 
<dustin> https://irccloud.mozilla.com/file/Dz0xpNVa/image.png (image.png, 24.27 KB, image/png)
21:07 
— •bcrisan|ciduty good to know
21:07 
<ahal> catlee: might be better to move tier 2 jobs into --full instead?
21:08 
<catlee> maybe
21:08 
<ahal> I get many complaints around how confusing fuzzy is already
21:08 
<catlee> dustin: looks like something starts leaking around 14:45
21:08 
<ahal> having both --full and --tier 2 filtering out things seems like a footgun
21:08 
<dustin> yeah
21:09 "The management statistics database currently has a queue of 2070 events to process. If this number keeps increasing, so will the memory used by the management plugin. You may find it useful to set the rates_mode config item to none."
21:09 
<catlee> ahal: what does --full do right now?
21:09 
<ahal> lets you pick any task from the full task set
21:09 
<dustin> kmoir: -01 seems to be the least sick right now
21:09 
<kmoir> okay
21:10 
<catlee> ahal: vs the target task set?
21:10 
<ahal> catlee: yeah, right now it's target task set - ccov - a handful of things people requested not be there
21:11 so we could definitely hardcode a rule to get tier 2 things excluded
21:12 ⇐ Standard8 quit (Standard8@moz-1hu.vrc.166.195.IP) Quit: ZNC - http://znc.in
21:15 
<bstack> dustin: the pulse stuff is causing users trouble with task submission timing out now. I wonder how widespread this is
21:15 perhaps we should think about closing trees
21:15 
<dustin> I suspect we should
21:15 
<•bcrisan|ciduty> should I tell the sheriffs to close the trees?
21:16 
<dustin> I think we'll find ourselves unable to do so like last time, actually
21:16 
<•bcrisan|ciduty> oh, yeah, that tool is based on pulse..
21:17 
<dustin> huh, auth0 seems down too
21:17 
<bstack> created https://app.google.stackdriver.com/metrics-explorer?project=heroku-logging&metric=logging.googleapis.com%2Fuser%2FsendDeadlineExceeded to track send deadline exceeded in queue
21:17 
<catlee> how does treestatus fail when pulse is down?
21:18 
— •bcrisan|ciduty is going to file a tracking bug
21:18 
<dustin> I suspect I can close the trees by putting the treestatus app in maintenance mode
21:18 catlee: it uses TC auth, and pulse being down pretty much decimates Taskcluster
21:18 
<catlee> ah
21:18 does TC auth need pulse?
21:19 → smacleod joined (sid301@moz-q6a57d.n4vs.eb63.0101.2620.IP)
21:20 
<dustin> yeah
21:20 
— dustin turns on maintenance mode for treestatus
21:21 
<dustin> at least, last time this happened, auth crashed when pulse went down
21:21 which means logins don't work anymore
21:21 I couldn't figure out why it crashed from the logs (stackdriver, ugh)
21:22 (it has no way to search for the first matching log entry)
21:22 
<catlee> treestatus may also send out pulse messages when the state changes...
21:22 
<dustin> yeah
21:22 
<kmoir>  Do we know if the messages that are not being acked are valid messages or just that the messages are not being processed in a timely fashion?  Is it just a load issue?
21:23 and apologizing in advance again for my lack of pulse knowledge
21:23 
<dustin> kmoir: no idea, I can't see anything right now :(
21:23 
<aki> hm, wondering if my periodic hook is firing too much
21:23 
<dustin> aki: did that start at about :45?
21:23 
<catlee> or someone starting up a bunch of listeners?
21:23 
<aki> i think at 12:02
21:24 
<dustin> catlee: so at around :10, I managed to load a list of connections on the cluster, and didn't see anything unusual in the # of channels
21:24 or usernames
21:26 so i see a few channels belonging to tc-queue that have unconfirmed messages
21:27 
<catlee> so yeah, treestatus does try and send out pulse messages on status change, but it's wrapped in a try/except, so shouldn't render the app inoperable on its own - https://github.com/mozilla/release-services/blob/master/src/treestatus/api/treestatus_api/api.py#L59
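For context, the pattern catlee is pointing at looks roughly like the sketch below (Python, with hypothetical save_status/publish_message callables standing in for the real release-services code): the Pulse publish is best-effort, so a broker outage should not fail the status change itself.

    import logging

    logger = logging.getLogger(__name__)

    def set_tree_status(tree, status, save_status, publish_message):
        """Persist a tree status change, then notify Pulse on a best-effort basis.

        save_status and publish_message are hypothetical callables standing in for
        the real database write and Pulse publisher in release-services.
        """
        save_status(tree, status)  # the authoritative change; must succeed
        try:
            # If the Pulse cluster is unreachable, log and move on instead of
            # turning the whole API request into a 500.
            publish_message("treestatus.change", {"tree": tree, "status": status})
        except Exception:
            logger.exception("failed to publish Pulse message for tree %r", tree)

    if __name__ == "__main__":
        logging.basicConfig()

        def broken_publish(routing_key, payload):
            raise ConnectionError("pulse is down")

        # The status change still goes through even though publishing fails.
        set_tree_status("autoland", "closed", lambda t, s: None, broken_publish)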
21:28 
<dustin> hmm
21:28 hassan: I see a lot more of those taskcluster-user-haali queues
21:29 
<smacleod> CPU usage jumped like crazy on the pulse cluster about 30 minutes ago
21:29 
<hassan> hm, looking
21:30 
<smacleod> dustin / kmoir: according to the dashboard I'm looking at, about 30 minutes ago the number of consumers from pulse went up by an order of magnitude
21:30 
<dustin> consumers = queues, connections, or channels?
21:30 
<smacleod> ~100 to ~1000
21:30 
<dustin> yeah
21:30 
<smacleod> "consumers" being the name of the chart I'm looking at, I'll try and figure out what that means exactly
21:31 
<hassan> dustin: I'm testing task subscriptions in Task Group. it's possible those listeners are not being closed...
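To illustrate what that kind of leak looks like from the broker's side, here is a hedged Python/pika sketch (the real tc-web-server code is JavaScript and differs; exchange and routing-key names below are only indicative): each subscription declares a queue and starts a consumer, and if nothing cancels the consumer or deletes the queue when the client goes away, consumer counts and unacked messages grow with every reload.

    import pika  # Python AMQP client, used here purely for illustration

    def subscribe_to_task_group(channel, task_group_id):
        """Declare a per-subscription queue and start consuming from it.

        If the caller never runs channel.basic_cancel() or queue_delete() when the
        client disappears, every call leaves another consumer behind on the broker,
        and with auto_ack=False the undelivered messages sit there unacked.
        """
        result = channel.queue_declare(queue="", auto_delete=True)
        queue_name = result.method.queue
        channel.queue_bind(
            queue=queue_name,
            exchange="exchange/taskcluster-queue/v1/task-completed",  # indicative name
            routing_key=f"#.{task_group_id}.#",  # indicative pattern
        )
        channel.basic_consume(
            queue=queue_name,
            on_message_callback=lambda ch, method, props, body: None,
            auto_ack=False,
        )
        return queue_name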
21:31 
<smacleod> dustin: connections and channels remained about the same
21:31 dustin: queues jumped a bit, but "consumers" skyrocketed
21:31 
<dustin> hassan: ahh, and is that user configured in your tc-web-server user-config.yml?
21:31 
<smacleod> unacked messages went crazy high too
21:31 ya, 330,000 unacked messages
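For reference, the totals on that dashboard are also exposed by the RabbitMQ management HTTP API; a quick way to watch them from a script, assuming admin credentials for the cluster:

    import requests

    def broker_totals(base_url, user, password):
        """Fetch GET /api/overview from the RabbitMQ management plugin and return
        the object totals (connections, channels, queues, consumers) plus the
        cluster-wide message counts."""
        resp = requests.get(f"{base_url}/api/overview", auth=(user, password), timeout=30)
        resp.raise_for_status()
        overview = resp.json()
        return overview["object_totals"], overview["queue_totals"]

    # Hypothetical credentials; host name as it appears later in this log.
    # totals, messages = broker_totals(
    #     "https://orange-antelope-01.rmq.cloudamqp.com", "admin", "secret")
    # print(totals["consumers"], messages["messages_unacknowledged"])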
21:32 
<dustin> hassan: can you kill tc-web-server?
21:32 the one you're developing against, wherever that runs..
21:32 
<hassan> dustin: yeah...killed tc-web-server

The status remains the same; other observations:
  • Roller reported "Internal server error" for almost 10 minutes
  • We can observe a drop in pending counts across the whole infrastructure
  • Machines reported “Generic worker on machine … returned exit code 69” and most of the
  • Many tasks report as failed/exception

IRC conversation (continued):

<hassan> taskcluster-user-haali
21:33 dustin: ^
21:33 
<dustin> ok
21:35 hm, -01 is down for me again
21:36 
<smacleod> dustin: "vm_memory_high_watermark clear. Memory used:3064217688 allowed:3195988254"
21:37 =WARNING EVENT==== Mon, 25 Mar 2019 19:36:37 GMT ===
21:37 memory resource limit alarm cleared on node 'rabbit@orange-antelope-01'
21:37 ⇐ mayhemer quit (Miranda@moz-9mslpd.broadband6.iol.cz) Quit: Miranda NG! Smaller, Faster, Easier. http://miranda-ng.org/
21:37 
<smacleod> dustin: it appears there's something like 1 million messages sitting around too
21:37 
<dustin> yeah
21:38 and growing quickly it seems - it was 250k just a few minutes ago
21:38 the fact that the first thing to die in rabbitmq is the mgmt interface is *insanely* frustrating
21:39 
<kmoir> I'm writing support about that, it's a huge design flaw
21:41 
<dustin> ok, I'm looking at one of these queues
21:41 it has auto-delete: true
21:41 docs: "Auto-delete (queue that has had at least one consumer is deleted when last consumer unsubscribes)"
21:41 and for that queue
21:41 Consumers: 0
21:41 so.. why is it still here?
21:41 
<tomprince> Did it ever have a consumer though?
21:42 The doc quote suggests that it is edge- rather than level-triggered.
21:42 
<dustin> true
21:42 
<tomprince> (Maybe it misses the edge sometimes and so hangs around, though)
21:42 
<dustin> perhaps not, if it was ctrl-c'd between declaring the queue and consuming
21:43 there are a lot of them in that state though
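A small Python/pika sketch of the behaviour being discussed here, assuming a reachable broker and an illustrative queue name: auto-delete is edge-triggered, so a queue that is declared but never consumed from (for example because the client was interrupted first) never qualifies for deletion.

    import pika

    # Assumes a local RabbitMQ purely for illustration; the Pulse cluster differs.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    # Declared with auto_delete=True but never consumed from: the queue is only
    # deleted "when the last consumer unsubscribes", and with zero consumers ever
    # attached that never happens, so it sits on the broker until deleted
    # explicitly or expired by a policy.
    ch.queue_declare(queue="queue/taskcluster-user-haali/example", auto_delete=True)

    conn.close()  # closing the connection does not remove a non-exclusive queue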
21:43 ok, I think the fix is to delete all queues belonging to that user
21:44 not sure how best to do that...
21:44 I can write a script..
21:49 
<smacleod> dustin: do you know what the actual cause of this was? Who that user is / what they were running?
21:49 
<dustin> no
21:49 I'm guessing it's these tc-web-server queues created with prefix queue/taskcluster-user-haali
21:49 I'm not sure why though
21:50 but if deleting those fixes it, great, and then we can investigate
21:56 if you know a faster way than writing a few hundred lines of JS, please say something
21:59 ok, script is running
21:59 bstack: let's delay meeting until this is done
22:03 Kwan → Kwan|dinner
22:04 
<dustin> hm, of course this is taking a few minutes per queue
22:04 since the api is dead
22:05 hassan: this is running tc-web-server locally? any local changes to PulseEngine or is it all in how that's called?
22:06 ..and the UI managed to crash my browser
22:07 it hasn't even managed to delete one queue yet
22:07 smacleod: ideas?
22:07 
<hassan> dustin: this was local. no changes to PulseEngine
22:08 
<dustin> ok
22:08 thanks
22:10 
<smacleod> dustin: what are you running to try and delete the queues?
22:10 
<dustin> https://irccloud.mozilla.com/pastebin/bDSTNunh/
22:10 RabbitManager is https://github.com/taskcluster/taskcluster-pulse/blob/master/src/rabbitmanager.js
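The pastebin contents are not preserved here. As a rough sketch of the same list-then-delete approach (not dustin's actual script, which goes through the RabbitManager wrapper), this is roughly what it amounts to against the management HTTP API, assuming admin credentials and the default vhost:

    import requests
    from urllib.parse import quote

    def delete_user_queues(base_url, auth, prefix, vhost="/"):
        """Delete every queue in `vhost` whose name starts with `prefix`.

        Lists queues with GET /api/queues/<vhost> and calls
        DELETE /api/queues/<vhost>/<name> for each match. Deletes can be slow, or
        return 404 if the queue's home node is down.
        """
        vh = quote(vhost, safe="")
        queues = requests.get(f"{base_url}/api/queues/{vh}", auth=auth, timeout=60).json()
        for q in queues:
            if not q["name"].startswith(prefix):
                continue
            r = requests.delete(
                f"{base_url}/api/queues/{vh}/{quote(q['name'], safe='')}",
                auth=auth, timeout=60)
            print(q["name"], r.status_code)

    # delete_user_queues("https://orange-antelope-02.rmq.cloudamqp.com",
    #                    ("admin", "secret"), "queue/taskcluster-user-haali")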
22:12 I think some of them are deleted
22:13 I suspect the large number of connections is due to reconnects
22:13 and probably rabbitmq not correctly syncing disconnections
22:14 
<catlee> maybe it uses amqp itself to synchronize across the cluster
22:14 
<dustin> oh, apparently also admin connections count as "channels"
22:14 haha
22:15 hassan: how are you finding the taskGroupId to listen for (I admit this is 99% idle curiosity)
22:19 
<smacleod> dustin: what about https://www.cloudamqp.com/blog/2016-06-21-how-to-delete-queues-in-rabbitmq.html
22:19 see the "policy" section
22:19 Could we add a policy with a wildcard on that user to expire all the queues?
22:19 
<dustin> "Note that this will only work for unused queues" not sure what that means
22:19 
<smacleod> that's what I was wondering too, heh
22:21 dustin: I wonder if using a policy to max length those queues would prune them?
22:21 speeding things up enough to properly delete them
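What smacleod is proposing would look roughly like the sketch below against the management API: a policy whose pattern matches only that user's queues, with expires and/or max-length in the definition so the broker prunes them itself. Policy name and values are illustrative; in RabbitMQ terms a queue counts as "unused" once it has had no consumers and no basic.get for the expires period.

    import requests
    from urllib.parse import quote

    def add_cleanup_policy(base_url, auth, vhost="/"):
        """PUT /api/policies/<vhost>/<name>: expire the leaked queues and cap their
        length so the broker drops the backlog instead of holding it in memory."""
        body = {
            "pattern": "^queue/taskcluster-user-haali",  # only that user's queues
            "definition": {"expires": 60000, "max-length": 0},  # ms idle before delete; drop contents
            "apply-to": "queues",
            "priority": 10,
        }
        resp = requests.put(
            f"{base_url}/api/policies/{quote(vhost, safe='')}/prune-haali",
            auth=auth, json=body, timeout=60)
        resp.raise_for_status()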
22:21 
<dustin> maybe
22:22 i really have no idea what's up here :(
22:22 
<hassan> dustin: https://github.com/taskcluster/taskcluster/pull/477/files#diff-5a4aea30ce8e1175a0de706664adbdb1R178
22:22 
<smacleod> dustin: want me to give it a shot?
22:22 
<dustin> smacleod: actually at the moment if you can restart -03 and -01 that would probably be useful
22:23 looks like -03 is not coming back, and -01 is over its fd limit
22:23 → erahm joined (sid214@moz-crtbdd.n4vs.eb63.0101.2620.IP)
22:24 
<smacleod> dustin: do you think just shutting down all but one of them might actually help? a lot of these queues aren't synchronized, maybe a huge portion of the load is it attempting to sync them?
22:24 
<dustin> yeah, I suspect that's the case
22:24 maybe
22:25 maybe keep -02 since it's doing OK
22:25 
<smacleod> k, will try
22:26 dustin: k, they should be shut down
22:27 
<dustin> hassan: and then just finding an "active" taskgroup? via treeherder?
22:27 
<hassan> dustin: yeah
22:27 
<dustin> kk
22:27 smacleod: looks like they're all down?
22:28 
<smacleod> dustin: says -02 is running in the console?
22:28 
<dustin> huh
22:28 https://orange-antelope-02.rmq.cloudamqp.com/#/ gives 502
22:29 smacleod: ok, let's try the policy to delete all queues
22:31 
<smacleod> dustin: trying to get the mngmt interface back, that's where you enter them :/
22:31 rebooting 2
22:31 
<dustin> yeah
22:31 ok
22:34 ⇐ Usul quit (Ludovic@moz-vn82b0.v98s.ddh0.0e34.2a01.IP) Ping timeout: 121 seconds
22:34 
<smacleod> dustin: alright, looks like it's back
22:34 
<dustin> oo, my script is cruisin'
22:35 huh
22:35 so I listed the queues with one API method, then loop over the results calling DELETE
22:35 and most of those DELETEs (but not all) are 404
22:35 I wonder if that's because the queues are on another host
22:35 
<smacleod> dustin: probably the queues are homed on -01 or -03, which are down
22:36 
<dustin> yeah
22:36 
<smacleod> dustin: let me know when it's finished with all it can on -02 and we can bring up one of the others?
22:36 
<dustin> sounds like a good plan
22:37 hassan: (still trying to figure out how this went sideways) is it possible that these subscriptions happened in some kind of loop?
22:37 there are thousands of them -- I suspect you haven't hit reload in the UI thousands of times today
22:38 smacleod: ok, script is done, let's bring up another node
22:38 
<smacleod> dustin: okay, rebooting 01
22:38 ⇐ Aryx quit (Archaeopter@moz-e3srbg.cust.telecolumbus.net) Connection closed
22:39 
<erahm> Hi ci folks, is all of my try runs going awol expected? https://treeherder.mozilla.org/#/jobs?repo=try&revision=edf3141cf1fd6ce05f9083d6d88ec06b399bd037
22:39 
<tomprince> erahm: Yes. There is an issue with the underlying infra.
22:39 
<smacleod> dustin: looks like you managed to delete about 300 queues I think, like 1200 left
22:39 
<erahm> tomprince: thanks!
22:39 
<dustin> ok
22:39 I'll run the same against -01
22:40 
<smacleod> dustin: -01 should be running now
22:41 ping when script is finished there and I'll bring -03 up
Severity: normal → blocker
Priority: -- → P1
See Also: → 1537756

Updates:

  • Treestatus is working again.
  • Not a blocker anymore.
  • Trees are open again.
Severity: blocker → normal
Priority: P1 → P3

Updates 2:

  • We started to see an increase in pending tasks across the whole infrastructure (expected).
  • All of the affected services have recovered and are working OK.

Post-mortem notes from the last week will include this outage too.

Later edit: The root cause of today's problems is tracked in bug 1538961.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard