Closed
Bug 964804
Opened 9 years ago
Closed 9 years ago
Jobs not being scheduled
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Unassigned)
References
Details
Tests are not being scheduled once builds are complete. Philor reports this happened ~10 hrs ago too. All trees closed.
Reporter
Comment 1•9 years ago
philor: https://tbpl.mozilla.org/?showall=1&tree=B2g-Inbound&onlyunstarred=1
philor: where's the orange?
philor: yeah, G is permaorange, Gi is over 50% red, so... where's the orange?
philor: 04:36 push got its linux desktop coalesced to a 04:51 push? 04:51 push got no tests scheduled at all?
philor: that looks like either totally busted scheduling, or, a reconfig that made all my dreams come true, not sure which ...
philor: edmorley|sheriffduty: the former, you should close every tree, we haven't scheduled tests on builds that finished 40 minutes ago
philor: and judging by my experience last night when I noticed it and started screaming one minute before it unstuck, closing should be enough to unstick it
edmorley|sheriffduty: philor: was there a bug for last night, or should I file a new one? (I didn't see anything in the bugmail)
philor: edmorley|sheriffduty: nope, I started screaming "urgent:" precisely one minute before it unstuck :)
Comment 2•9 years ago
Looking at https://secure.pub.build.mozilla.org/buildapi/self-serve/try/rev/4a4d24c24588

The change was pushed at 07:15:57. Buildbot picked up the change at 07:16:27:

2014-01-28 07:16:27-0800 [HTTPPageGetter,client] adding change, who bjacob@mozilla.com, 1 files, rev=4a4d24c24588827cc045b49ed35a0d4f8155d144, branch=try, repository=, comments try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none, category None, project

However, the submit time for these builds isn't until much later, 07:32:06, where we see a few entries like this:

2014-01-28 07:32:05-0800 [-] b2g_try-b2g: propfunc returned {'builduid': '742a333baa8a498798ec59f68bd97bc5'}

which matches the builduid in the request here: https://secure.pub.build.mozilla.org/buildapi/self-serve/try/build/36050468
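For reference, this kind of lookup can be scripted against the self-serve API rather than eyeballed. A minimal sketch only: the ?format=json parameter and the "buildername"/"submitted_at" field names are assumptions about the buildapi JSON layout, not verified against the live service.

import json
import urllib.request

# Hypothetical helper: list the requests self-serve knows about for the try
# revision from comment 0, so the push-time -> submit-time gap is visible.
URL = ("https://secure.pub.build.mozilla.org/buildapi/self-serve/"
       "try/rev/4a4d24c24588?format=json")  # format=json is assumed

with urllib.request.urlopen(URL) as resp:
    builds = json.load(resp)

for build in builds:
    # Field names are assumptions about the self-serve JSON, adjust as needed.
    print(build.get("buildername"), build.get("submitted_at"))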
Reporter
Comment 3•9 years ago
Trees reopened; leaving open for further investigation :-)

philor: is there any way that nagios can watch the scheduler?
catlee: philor: to notice when it's stuck?
philor: or is there no way to see a scheduling backlog except by being the scheduler that has a 45 minute backlog right now?
philor: like the 30 minute backlog it had last night
philor: and the one that I now wonder if we have every single night, right before we schedule all the things at once and cause the nightly closures
catlee: philor: I'll take a peek
catlee: edmorley|sheriffduty: just tests?
RyanVM: catlee: well, my push to c-c from 6 minutes ago isn't even showing pending builds, so I would guess "everything"
philor: catlee: probably over, ended at 07:48
edmorley|sheriffduty: catlee: other than old cruft, https://secure.pub.build.mozilla.org/buildapi/pending is looking pretty empty
catlee: philor: ok. how about last night?
philor: https://secure.pub.build.mozilla.org/buildapi/self-serve/b2g-inbound/rev/7bc0d1e21cb7 - emulator-debug build finished at 07:25, scheduled tests at 07:48
philor: catlee: both, I think, since my try build taking forever to show up was what first made me start looking
philor: catlee: yeah, both: https://secure.pub.build.mozilla.org/buildapi/self-serve/try/rev/4a4d24c24588 was pushed at 07:15
catlee: philor: so I see tons of pending jobs piling up until around 6:49, when they start to go down again
catlee: it looks like jobs were getting scheduled...
cyborgshadow: catlee: 6:49 what timezone?
catlee: cyborgshadow: PT
cyborgshadow: catlee: so... 1 hour 5 minutes ago?
catlee: cyborgshadow: yeah
cyborgshadow: catlee: I have a huge spike at around 6:35 PT - 6:50 PT
cyborgshadow: catlee: but it started pretty precisely around 6:35
catlee: cyborgshadow: spike of what?
cyborgshadow: catlee: db traffic
catlee: philor: so I see the sendchange at 07:24:19
catlee: cyborgshadow: any other spikes you see?
cyborgshadow: catlee: I'm only the database guy. New Relic isn't monitoring the buildbot application, so I can see db traffic and queries.
cyborgshadow: catlee: I can try and dig to see what those queries were.
catlee: just wondering if it's happened before
cyborgshadow: catlee: note however though, I'm finding several potential improvements to buildbot today and seeking approval to make them happier.
cyborgshadow: catlee: that I can check.
cyborgshadow: catlee: yes it has. It tends to happen 1 or 2 times a day, but the last 3 days in particular it's been at a high spike all day.
catlee: cyborgshadow: around what times?
catlee: buildapi is really slow right now...
edmorley|sheriffduty: catlee: seems down for retriggers again
bhearsum: catlee: slaverebooter is running, that may be affecting it...
cyborgshadow: catlee: the last 3 days one started around 14:00-16:00 on the 25th. Prior spikes align with that time: 14:00-16:00 they start ramping up. They don't come down until 05:00 usually.
cyborgshadow: catlee: fwiw: after implementing the changes I did a few minutes ago, we're at an all-time low on utilization on the box currently.
cyborgshadow: catlee: but network traffic is up, so it's a positive indication surrounding the changes. (utilization down + network traffic up = change was positive)
dustin: cyborgshadow++ awesome :)
catlee: nice
cyborgshadow: dustin: fwiw: we went from a new entry in the slow query log every 1-2 seconds to now only having 2 in the last 40 minutes.
dustin: awesome!
dustin: we're still slow though :(
dustin: I wish I could tell which bit of buildapi is causing the slowness
dustin: I strongly suspect reporter.py
dustin: but it's possible that memcache's hit rate is lower than redis
dustin: or that the web UI, now that it's on a faster host, is causing more hits
dustin: or that slaverebooter is bludgeoning it
bhearsum: it wouldn't shock me if it's that
bhearsum: slaverebooter runs for just over an hour on a 43 */4 * * * schedule
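The nagios-style check philor asks about could start as a simple poll of buildapi's pending endpoint. This is only a sketch: it assumes /buildapi/pending?format=json returns a {"pending": {branch: {revision: [jobs, ...]}}} structure, and the threshold is an arbitrary example, not a tuned value.

import json
import urllib.request

PENDING_URL = "https://secure.pub.build.mozilla.org/buildapi/pending?format=json"
WARN_THRESHOLD = 2000  # arbitrary example threshold

with urllib.request.urlopen(PENDING_URL) as resp:
    data = json.load(resp)

# Count every pending job across all branches and revisions.
total = sum(
    len(jobs)
    for revisions in data.get("pending", {}).values()
    for jobs in revisions.values()
)

status = "WARNING" if total > WARN_THRESHOLD else "OK"
print("%s: %d pending jobs" % (status, total))
raise SystemExit(1 if status == "WARNING" else 0)

A real check would probably want to alert on the age of the oldest pending request rather than the raw count, since the symptom here is builds finishing without any tests being scheduled at all.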
Severity: blocker → critical
Comment 4•9 years ago
Looks like it didn't get around to noticing the changes until this:

2014-01-28 07:21:26-0800 [-] Looking at changes: [<buildbot.changes.changes.Change instance at 0x5a89f38>, <buildbot.changes.changes.Change instance at 0x117d4248>, <buildbot.changes.changes.Change instance at 0x6f32c68>, <buildbot.changes.changes.Change instance at 0x989d3b0>, <buildbot.changes.changes.Change instance at 0xba1d2d8>, <buildbot.changes.changes.Change instance at 0xc3a2128>]
2014-01-28 07:21:26-0800 [-] TryChooser OPTIONS : MESSAGE Namespace(build=['opt', 'debug'], talos=u'none', test=u'mochitests,reftest', user_platforms=set([u'win32', u'android', u'macosx64', u'emulator', u'linux'])) : try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none
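For anyone not used to TryChooser output, the Namespace(...) line above looks like argparse applied to the trychooser message. A toy re-creation of that parse, not the real buildbotcustom parser (which handles many more flags):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-b", "--build", default="do")
parser.add_argument("-p", "--platforms", default="all")
parser.add_argument("-u", "--test", default="none")
parser.add_argument("-t", "--talos", default="none")

message = ("try: -b do -p linux,macosx64,win32,android,emulator "
           "-u mochitests,reftest -t none")
args = parser.parse_args(message.split()[1:])  # drop the leading "try:"

# "-b do" means build both opt and debug, matching build=['opt', 'debug'].
builds = ["opt", "debug"] if args.build == "do" else [args.build]
platforms = set(args.platforms.split(","))
print(builds, platforms, args.test, args.talos)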
Comment 5•9 years ago
I don't know what else I can do without catching it in action again.
Updated•9 years ago
Summary: Tests not being scheduled → Jobs not being scheduled
Comment 6•9 years ago
Possibly related, we're seeing bad latency to the db (mtr output, times in ms):

                                         Loss%  Snt   Last    Avg   Best   Wrst  StDev
 1. 10.26.48.1                            0.0%   96  253.5  240.2  210.0  278.8   15.1
 2. v-1034.border2.scl3.mozilla.net       1.1%   96  532.2  482.6  412.3  554.6   29.8
 3. v-1030.fw1.scl3.mozilla.net           0.0%   95  512.5  481.9  418.7  563.2   30.3
 4. buildbot-rw-vip.db.scl3.mozilla.com   0.0%   95  493.7  486.4  424.6  554.6   28.0
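If anyone wants a comparable signal without mtr, timing a few TCP connects to the DB VIP from the affected host gives a rough picture. A minimal stdlib-only sketch, assuming the MySQL port (3306) is reachable from where you run it:

import socket
import time

HOST = "buildbot-rw-vip.db.scl3.mozilla.com"
PORT = 3306  # assumed MySQL port

for _ in range(5):
    start = time.time()
    try:
        sock = socket.create_connection((HOST, PORT), timeout=5)
        sock.close()
        print("connect: %.1f ms" % ((time.time() - start) * 1000))
    except OSError as exc:
        print("connect failed: %s" % exc)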
Comment 7•9 years ago
Trees were closed for this again at 09:24.
Comment 8•9 years ago
Some trees were reopened at 13:12. In order to avoid a big load spike, the remaining trees were reopened at 15:09.
Comment 9•9 years ago
It's gotten bad enough again (really, has been bad enough for the last hour or two, but I'm slow) that I've closed all trees.
Comment 10•9 years ago
And closed again.
Comment 11•9 years ago
The releng firewall upgrade and other measures have corrected the ping times that were the root cause.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•5 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•3 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard