Closed Bug 1538914 Opened 5 years ago Closed 5 years ago

[Tracking] Pulse is down - along with every service that uses pulse.

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bcrisan, Unassigned)

References

Details

Tracking bug for pulse issues on March 25, 2019.
It affects every service that is based on/uses Pulse.

<dustin> signs are pointing toward having pulse issues again
21:07 starting just before the top of the hour
21:07 kmoir: glob: ^^
21:07 
<kmoir> I will go look
21:07 
<dustin> https://irccloud.mozilla.com/file/Dz0xpNVa/image.png (image.png, 24.27 KB, image/png)
21:07 
— •bcrisan|ciduty good to know
21:07 
<ahal> catlee: might be better to move tier 2 jobs into --full instead?
21:08 
<catlee> maybe
21:08 
<ahal> I get many complaints around how confusing fuzzy is already
21:08 
<catlee> dustin: looks like something starts leaking around 14:45
21:08 
<ahal> having both --full and --tier 2 filtering out things seems like a footgun
21:08 
<dustin> yeah
21:09 "The management statistics database currently has a queue of 2070 events to process. If this number keeps increasing, so will the memory used by the management plugin. You may find it useful to set the rates_mode config item to none."
21:09 
<catlee> ahal: what does --full do right now?
21:09 
<ahal> lets you pick any task from the full task set
21:09 
<dustin> kmoir: -01 seems to be the least sick right now
21:09 
<kmoir> okay
21:10 
<catlee> ahal: vs the target task set?
21:10 
<ahal> catlee: yeah, right now it's target task set - ccov - a handful of things people requested not be there
21:11 so we could definitely hardcode a rule to get tier 2 things excluded
21:12 ⇐ Standard8 quit (Standard8@moz-1hu.vrc.166.195.IP) Quit: ZNC - http://znc.in
21:15 
<bstack> dustin: the pulse stuff is causing users trouble with task submission timing out now. I wonder how widespread this is
21:15 perhaps we should think about closing trees
21:15 
<dustin> I suspect we should
21:15 
<•bcrisan|ciduty> should I tell the sheriffs to close the trees?
21:16 
<dustin> I think we'll find ourselves unable to do so like last time, actually
21:16 
<•bcrisan|ciduty> oh, yeah, that tool is based on pulse..
21:17 
<dustin> huh, auth0 seems down too
21:17 
<bstack> created https://app.google.stackdriver.com/metrics-explorer?project=heroku-logging&metric=logging.googleapis.com%2Fuser%2FsendDeadlineExceeded to track send deadline exceeded in queue
21:17 
<catlee> how does treestatus fail when pulse is down?
21:18 
— •bcrisan|ciduty is going to file a tracking bug
21:18 
<dustin> I suspect I can close the trees by putting the treestatus app in maintenance mode
21:18 catlee: it uses TC auth, and pulse being down pretty much decimates Taskcluster
21:18 
<catlee> ah
21:18 does TC auth need pulse?
21:19 → smacleod joined (sid301@moz-q6a57d.n4vs.eb63.0101.2620.IP)
21:20 
<dustin> yeah
21:20 
— dustin turns on maintenance mode for treestatus
21:21 
<dustin> at least, last time this happened, auth crashed when pulse went down
21:21 which means logins don't work anymore
21:21 I couldn't figure out why it crashed from the logs (stackdriver, ugh)
21:22 (it has no way to search for the first matching log entry)
21:22 
<catlee> treestatus may also send out pulse messages when the state changes...
21:22 
<dustin> yeah
21:22 
<kmoir>  Do we know if the messages that are not being acked are valid messages or just that the messages are not being processed in a timely fashion?  Is it just a load issue?
21:23 and apologizing in advance again for my lack of pulse knowledge
21:23 
<dustin> kmoir: no idea, I can't see anything right now :(
21:23 
<aki> hm, wondering if my periodic hook is firing too much
21:23 
<dustin> aki: did that start at about :45?
21:23 
<catlee> or someone starting up a bunch of listeners?
21:23 
<aki> i think at 12:02
21:24 
<dustin> catlee: so at around :10, I managed to load a list of connections on the cluster, and didn't see anything unusual in the # of channels
21:24 or usernames
21:26 so i see a few channels belonging to tc-queue that have unconfirmed messages
21:27 
<catlee> so yeah, treestatus does try and send out pulse messages on status change, but it's wrapped in a try/except, so shouldn't render the app inoperable on its own - https://github.com/mozilla/release-services/blob/master/src/treestatus/api/treestatus_api/api.py#L59
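For context, the pattern catlee is pointing at looks roughly like the sketch below (Python, with hypothetical save_status/publish_message callables standing in for the real release-services code): the Pulse publish is best-effort, so a broker outage should not fail the status change itself.

    import logging

    logger = logging.getLogger(__name__)

    def set_tree_status(tree, status, save_status, publish_message):
        """Persist a tree status change, then notify Pulse on a best-effort basis.

        save_status and publish_message are hypothetical callables standing in for
        the real database write and Pulse publisher in release-services.
        """
        save_status(tree, status)  # the authoritative change; must succeed
        try:
            # If the Pulse cluster is unreachable, log and move on instead of
            # turning the whole API request into a 500.
            publish_message("treestatus.change", {"tree": tree, "status": status})
        except Exception:
            logger.exception("failed to publish Pulse message for tree %r", tree)

    if __name__ == "__main__":
        logging.basicConfig()

        def broken_publish(routing_key, payload):
            raise ConnectionError("pulse is down")

        # The status change still goes through even though publishing fails.
        set_tree_status("autoland", "closed", lambda t, s: None, broken_publish)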
21:28 
<dustin> hmm
21:28 hassan: I see a lot more of those taskcluster-user-haali queues
21:29 
<smacleod> CPU usage jumped like crazy on the pulse cluster about 30 minutes ago
21:29 
<hassan> hm, looking
21:30 
<smacleod> dustin / kmoir: according to the dashboard I'm looking at, about 30 minutes ago the number of consumers from pulse went up by an order of magnitude
21:30 
<dustin> consumers = queues, connections, or channels?
21:30 
<smacleod> ~100 to ~1000
21:30 
<dustin> yeah
21:30 
<smacleod> "consumers" being the name of the chart I'm looking at, I'll try and figure out what that means exactly
21:31 
<hassan> dustin: I'm testing task subscriptions in Task Group. it's possible those listeners are not being closed...
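To illustrate what that kind of leak looks like from the broker's side, here is a hedged Python/pika sketch (the real tc-web-server code is JavaScript and differs; exchange and routing-key names below are only indicative): each subscription declares a queue and starts a consumer, and if nothing cancels the consumer or deletes the queue when the client goes away, consumer counts and unacked messages grow with every reload.

    import pika  # Python AMQP client, used here purely for illustration

    def subscribe_to_task_group(channel, task_group_id):
        """Declare a per-subscription queue and start consuming from it.

        If the caller never runs channel.basic_cancel() or queue_delete() when the
        client disappears, every call leaves another consumer behind on the broker,
        and with auto_ack=False the undelivered messages sit there unacked.
        """
        result = channel.queue_declare(queue="", auto_delete=True)
        queue_name = result.method.queue
        channel.queue_bind(
            queue=queue_name,
            exchange="exchange/taskcluster-queue/v1/task-completed",  # indicative name
            routing_key=f"#.{task_group_id}.#",  # indicative pattern
        )
        channel.basic_consume(
            queue=queue_name,
            on_message_callback=lambda ch, method, props, body: None,
            auto_ack=False,
        )
        return queue_name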
21:31 
<smacleod> dustin: connections and channels remained about the same
21:31 dustin: queues jumped a bit, but "consumers" skyrocketed
21:31 
<dustin> hassan: ahh, and is that user configured in your tc-web-server user-config.yml?
21:31 
<smacleod> unacked messages went crazy high too
21:31 ya, 330,000 unacked messages
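For reference, the totals on that dashboard are also exposed by the RabbitMQ management HTTP API; a quick way to watch them from a script, assuming admin credentials for the cluster:

    import requests

    def broker_totals(base_url, user, password):
        """Fetch GET /api/overview from the RabbitMQ management plugin and return
        the object totals (connections, channels, queues, consumers) plus the
        cluster-wide message counts."""
        resp = requests.get(f"{base_url}/api/overview", auth=(user, password), timeout=30)
        resp.raise_for_status()
        overview = resp.json()
        return overview["object_totals"], overview["queue_totals"]

    # Hypothetical credentials; host name as it appears later in this log.
    # totals, messages = broker_totals(
    #     "https://orange-antelope-01.rmq.cloudamqp.com", "admin", "secret")
    # print(totals["consumers"], messages["messages_unacknowledged"])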
21:32 
<dustin> hassan: can you kill tc-web-server?
21:32 the one you're developing against, wherever that runs..
21:32 
<hassan> dustin: yeah...killed tc-web-server

The status remains the same; other observations:
  • Roller reported "Internal server error" for almost 10 minutes
  • We can observe a drop in pending counts across the whole infrastructure
  • Machines reported “Generic worker on machine … returned exit code 69” and most of the
  • Many tasks report as failed/exception

IRC conversation (continued):

<hassan> taskcluster-user-haali
21:33 dustin: ^
21:33 
<dustin> ok
21:35 hm, -01 is down for me again
21:36 
<smacleod> dustin: "vm_memory_high_watermark clear. Memory used:3064217688 allowed:3195988254"
21:37 =WARNING EVENT==== Mon, 25 Mar 2019 19:36:37 GMT ===
21:37 memory resource limit alarm cleared on node 'rabbit@orange-antelope-01'
21:37 ⇐ mayhemer quit (Miranda@moz-9mslpd.broadband6.iol.cz) Quit: Miranda NG! Smaller, Faster, Easier. http://miranda-ng.org/
21:37 
<smacleod> dustin: it appears there's something like 1 million messages sitting around too
21:37 
<dustin> yeah
21:38 and growing quickly it seems - it was 250k just a few minutes ago
21:38 the fact that the first thing to die in rabbitmq is the mgmt interface is *insanely* frustrating
21:39 
<kmoir> I'm writing support about that, it's a huge design flaw
21:41 
<dustin> ok, I'm looking at one of these queues
21:41 it has auto-delete: true
21:41 docs: "Auto-delete (queue that has had at least one consumer is deleted when last consumer unsubscribes)"
21:41 and for that queue
21:41 Consumers: 0
21:41 so.. why is it still here?
21:41 
<tomprince> Did it ever have a consumer though?
21:42 The doc quote suggests that it is edge- rather than level-triggered.
21:42 
<dustin> true
21:42 
<tomprince> (Maybe it misses the edge sometimes and so hangs around, though)
21:42 
<dustin> perhaps not, if it was ctrl-c'd between declaring the queue and consuming
21:43 there are a lot of them in that state though
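A small Python/pika sketch of the behaviour being discussed here, assuming a reachable broker and an illustrative queue name: auto-delete is edge-triggered, so a queue that is declared but never consumed from (for example because the client was interrupted first) never qualifies for deletion.

    import pika

    # Assumes a local RabbitMQ purely for illustration; the Pulse cluster differs.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    # Declared with auto_delete=True but never consumed from: the queue is only
    # deleted "when the last consumer unsubscribes", and with zero consumers ever
    # attached that never happens, so it sits on the broker until deleted
    # explicitly or expired by a policy.
    ch.queue_declare(queue="queue/taskcluster-user-haali/example", auto_delete=True)

    conn.close()  # closing the connection does not remove a non-exclusive queue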
21:43 ok, I think the fix is to delete all queues belonging to that user
21:44 not sure how best to do that...
21:44 I can write a script..
21:49 
<smacleod> dustin: do you know what the actual cause of this was? Who that user is / what they were running?
21:49 
<dustin> no
21:49 I'm guessing it's these tc-web-server queues created with prefix queue/taskcluster-user-haali
21:49 I'm not sure why though
21:50 but if deleting those fixes it, great, and then we can investigate
21:56 if you know a faster way than writing a few hundred lines of JS, please say something
21:59 ok, script is running
21:59 bstack: let's delay meeting until this is done
22:03 Kwan → Kwan|dinner
22:04 
<dustin> hm, of course this is taking a few minutes per queue
22:04 since the api is dead
22:05 hassan: this is running tc-web-server locally? any local changes to PulseEngine or is it all in how that's called?
22:06 ..and the UI managed to crash my browser
22:07 it hasn't even managed to delete one queue yet
22:07 smacleod: ideas?
22:07 
<hassan> dustin: this was local. no changes to PulseEngine
22:08 
<dustin> ok
22:08 thanks
22:10 
<smacleod> dustin: what are you running to try and delete the queues?
22:10 
<dustin> https://irccloud.mozilla.com/pastebin/bDSTNunh/
22:10 RabbitManager is https://github.com/taskcluster/taskcluster-pulse/blob/master/src/rabbitmanager.js
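The pastebin contents are not preserved here. As a rough sketch of the same list-then-delete approach (not dustin's actual script, which goes through the RabbitManager wrapper), this is roughly what it amounts to against the management HTTP API, assuming admin credentials and the default vhost:

    import requests
    from urllib.parse import quote

    def delete_user_queues(base_url, auth, prefix, vhost="/"):
        """Delete every queue in `vhost` whose name starts with `prefix`.

        Lists queues with GET /api/queues/<vhost> and calls
        DELETE /api/queues/<vhost>/<name> for each match. Deletes can be slow, or
        return 404 if the queue's home node is down.
        """
        vh = quote(vhost, safe="")
        queues = requests.get(f"{base_url}/api/queues/{vh}", auth=auth, timeout=60).json()
        for q in queues:
            if not q["name"].startswith(prefix):
                continue
            r = requests.delete(
                f"{base_url}/api/queues/{vh}/{quote(q['name'], safe='')}",
                auth=auth, timeout=60)
            print(q["name"], r.status_code)

    # delete_user_queues("https://orange-antelope-02.rmq.cloudamqp.com",
    #                    ("admin", "secret"), "queue/taskcluster-user-haali")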
22:12 I think some of them are deleted
22:13 I suspect the large number of connections is due to reconnects
22:13 and probably rabbitmq not correctly syncing disconnections
22:14 
<catlee> maybe it uses amqp itself to synchronize across the cluster
22:14 
<dustin> oh, apparently also admin connections count as "channels"
22:14 haha
22:15 hassan: how are you finding the taskGroupId to listen for (I admit this is 99% idle curiosity)
22:19 
<smacleod> dustin: what about https://www.cloudamqp.com/blog/2016-06-21-how-to-delete-queues-in-rabbitmq.html
22:19 see the "policy" section
22:19 Could we add a policy with a wildcard on that user to expire all the queues?
22:19 
<dustin> "Note that this will only work for unused queues" not sure what that means
22:19 
<smacleod> that's what I was wondering too, heh
22:21 dustin: I wonder if using a policy to max length those queues would prune them?
22:21 speeding things up enough to properly delete them
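What smacleod is proposing would look roughly like the sketch below against the management API: a policy whose pattern matches only that user's queues, with expires and/or max-length in the definition so the broker prunes them itself. Policy name and values are illustrative; in RabbitMQ terms a queue counts as "unused" once it has had no consumers and no basic.get for the expires period.

    import requests
    from urllib.parse import quote

    def add_cleanup_policy(base_url, auth, vhost="/"):
        """PUT /api/policies/<vhost>/<name>: expire the leaked queues and cap their
        length so the broker drops the backlog instead of holding it in memory."""
        body = {
            "pattern": "^queue/taskcluster-user-haali",  # only that user's queues
            "definition": {"expires": 60000, "max-length": 0},  # ms idle before delete; drop contents
            "apply-to": "queues",
            "priority": 10,
        }
        resp = requests.put(
            f"{base_url}/api/policies/{quote(vhost, safe='')}/prune-haali",
            auth=auth, json=body, timeout=60)
        resp.raise_for_status()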
22:21 
<dustin> maybe
22:22 i really have no idea what's up here :(
22:22 
<hassan> dustin: https://github.com/taskcluster/taskcluster/pull/477/files#diff-5a4aea30ce8e1175a0de706664adbdb1R178
22:22 
<smacleod> dustin: want me to give it a shot?
22:22 
<dustin> smacleod: actually at the moment if you can restart -03 and -01 that would probably be useful
22:23 looks like -03 is not coming back, and -01 is over its fd limit
22:23 → erahm joined (sid214@moz-crtbdd.n4vs.eb63.0101.2620.IP)
22:24 
<smacleod> dustin: do you think just shutting down all but one of them might actually help? a lot of these queues aren't synchronized, maybe a huge portion of the load is it attempting to sync them?
22:24 
<dustin> yeah, I suspect that's the case
22:24 maybe
22:25 maybe keep -02 since it's doing OK
22:25 
<smacleod> k, will try
22:26 dustin: k, they should be shut down
22:27 
<dustin> hassan: and then just finding an "active" taskgroup? via treeherder?
22:27 
<hassan> dustin: yeah
22:27 
<dustin> kk
22:27 smacleod: looks like they're all down?
22:28 
<smacleod> dustin: says -02 is running in the console?
22:28 
<dustin> huh
22:28 https://orange-antelope-02.rmq.cloudamqp.com/#/ gives 502
22:29 smacleod: ok, let's try the policy to delete all queues
22:31 
<smacleod> dustin: trying to get the mngmt interface back, that's where you enter them :/
22:31 rebooting 2
22:31 
<dustin> yeah
22:31 ok
22:34 ⇐ Usul quit (Ludovic@moz-vn82b0.v98s.ddh0.0e34.2a01.IP) Ping timeout: 121 seconds
22:34 
<smacleod> dustin: alright, looks like it's back
22:34 
<dustin> oo, my script is cruisin'
22:35 huh
22:35 so I listed the queues with one API method, then loop over the results calling DELETE
22:35 and most of those DELETEs (but not all) are 404
22:35 I wonder if that's because the queues are on another host
22:35 
<smacleod> dustin: probably the queues are homed on -01 or -03, which are down
22:36 
<dustin> yeah
22:36 
<smacleod> dustin: let me know when it's finished with all it can on -02 and we can bring up one of the others?
22:36 
<dustin> sounds like a good plan
22:37 hassan: (still trying to figure out how this went sideways) is it possible that these subscriptions happened in some kind of loop?
22:37 there are thousands of them -- I suspect you haven't hit reload in the UI thousands of times today
22:38 smacleod: ok, script is done, let's bring up another node
22:38 
<smacleod> dustin: okay, rebooting 01
22:38 ⇐ Aryx quit (Archaeopter@moz-e3srbg.cust.telecolumbus.net) Connection closed
22:39 
<erahm> Hi ci folks, is all of my try runs going awol expected? https://treeherder.mozilla.org/#/jobs?repo=try&revision=edf3141cf1fd6ce05f9083d6d88ec06b399bd037
22:39 
<tomprince> erahm: Yes. There is an issue with the underlying infra.
22:39 
<smacleod> dustin: looks like you managed to delete about 300 queues I think, like 1200 left
22:39 
<erahm> tomprince: thanks!
22:39 
<dustin> ok
22:39 I'll run the same against -01
22:40 
<smacleod> dustin: -01 should be running now
22:41 ping when script is finished there and I'll bring -03 up
Severity: normal → blocker
Priority: -- → P1
See Also: → 1537756

Updates:

  • Treestatus is working again.
  • Not a blocker anymore.
  • Trees are open again.
Severity: blocker → normal
Priority: P1 → P3

Updates 2:

  • We started to see an increase in pending tasks across the whole infrastructure (expected).
  • All of the affected services have recovered and are working OK.

Post-mortem notes from the last week will include this outage too.

Later edit: The root cause of today's problems is tracked in bug 1538961.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard