Bug 964804 (Closed): Opened 11 years ago, Closed 11 years ago

Jobs not being scheduled

Categories: Infrastructure & Operations Graveyard :: CIDuty
Type: task
Priority: Not set
Severity: critical
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: emorley
Assignee: Unassigned
Tests are not being scheduled once builds are complete.
Philor reports this happened ~10 hrs ago too.

All trees closed.
philor: https://tbpl.mozilla.org/?showall=1&tree=B2g-Inbound&onlyunstarred=1
philor: where's the orange?
philor: yeah, G is permaorange, Gi is over 50% red, so... where's the orange?
philor: 04:36 push got its linux desktop coalesced to a 04:51 push? 04:51 push got no tests scheduled at all?
philor: that looks like either totally busted scheduling, or, a reconfig that made all my dreams come true, not sure which
...
philor: edmorley|sheriffduty: the former, you should close every tree, we haven't scheduled tests on builds that finished 40 minutes ago
philor: and judging by my experience last night when I noticed it and started screaming one minute before it unstuck, closing should be enough to unstick it
edmorley|sheriffduty: philor: was there a bug for last night, or should I file new? (I didn't see anything in the bugmail)
philor: edmorley|sheriffduty: nope, I started screaming "urgent:" precisely one minute before it unstuck :)
Looking at https://secure.pub.build.mozilla.org/buildapi/self-serve/try/rev/4a4d24c24588

The change was pushed at 07:15:57. Buildbot picked up the change at 07:16:27:
2014-01-28 07:16:27-0800 [HTTPPageGetter,client] adding change, who bjacob@mozilla.com, 1 files, rev=4a4d24c24588827cc045b49ed35a0d4f8155d144, branch=try, repository=, comments try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none, category None, project

The submittime for these builds isn't until much later, 07:32:06, when we see a few entries like this:
2014-01-28 07:32:05-0800 [-] b2g_try-b2g: propfunc returned {'builduid': '742a333baa8a498798ec59f68bd97bc5'}

which matches the builduid in the request here https://secure.pub.build.mozilla.org/buildapi/self-serve/try/build/36050468
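
As a side note for anyone redoing this kind of analysis: the scheduling delay for this push can be read straight off the master log timestamps quoted above. A minimal sketch (the log lines are the ones from this comment; the helper itself is purely illustrative):

# Sketch only: compute the gap between the "adding change" entry and the
# "propfunc returned" / submittime entry for the same push by parsing the
# buildbot master log timestamps quoted in this comment.
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_log_timestamp(line):
    # e.g. "2014-01-28 07:16:27-0800 [-] ..." - take the first 19 chars,
    # ignoring the UTC offset since both lines use the same one.
    return datetime.strptime(line[:19], LOG_TS_FORMAT)

picked_up = parse_log_timestamp(
    "2014-01-28 07:16:27-0800 [HTTPPageGetter,client] adding change, ...")
submitted = parse_log_timestamp(
    "2014-01-28 07:32:05-0800 [-] b2g_try-b2g: propfunc returned "
    "{'builduid': '742a333baa8a498798ec59f68bd97bc5'}")

# ~15-16 minutes between pickup and submit for this push
print("scheduling delay: %s" % (submitted - picked_up))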
Trees reopened; leaving open for further investigation :-)


philor: is there any way that nagios can watch the scheduler?
catlee: philor: to notice when it's stuck?
philor: or is there no way to see a scheduling backlog except by being the scheduler that has a 45 minute backlog right now?
philor: like the 30 minute backlog it had last night
philor: and the one that I now wonder if we have every single night, right before we schedule all the things at once and cause the nightly closures
catlee: philor: I'll take a peek
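
On the nagios question: no such check existed at the time. The following is only a rough sketch of what one could look like, polling the buildapi pending endpoint and alerting when the oldest pending request gets too old. The JSON shape assumed here (a "pending" map of branch -> revision -> list of requests, each with a "submitted_at" epoch timestamp) is an assumption about buildapi's response, not verified against production, and it would only catch a backlog of already-created build requests, not the scheduler-internal change backlog philor describes.

#!/usr/bin/env python
# Rough sketch of a Nagios-style check against buildapi's pending endpoint.
# Field names and JSON layout are assumptions, not verified.
import json
import sys
import time
import urllib2

PENDING_URL = "https://secure.pub.build.mozilla.org/buildapi/pending?format=json"
WARN_MINUTES = 20
CRIT_MINUTES = 40

def oldest_pending_age_minutes():
    data = json.load(urllib2.urlopen(PENDING_URL, timeout=60))
    oldest = None
    for branch in data.get("pending", {}).values():
        for requests in branch.values():
            for request in requests:
                submitted = request.get("submitted_at")
                if submitted and (oldest is None or submitted < oldest):
                    oldest = submitted
    if oldest is None:
        return 0.0
    return (time.time() - oldest) / 60.0

if __name__ == "__main__":
    age = oldest_pending_age_minutes()
    if age >= CRIT_MINUTES:
        print("CRITICAL: oldest pending request is %.0f minutes old" % age)
        sys.exit(2)
    if age >= WARN_MINUTES:
        print("WARNING: oldest pending request is %.0f minutes old" % age)
        sys.exit(1)
    print("OK: oldest pending request is %.0f minutes old" % age)
    sys.exit(0)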

catlee: edmorley|sheriffduty: just tests?
RyanVM: catlee: well, my push to c-c from 6 minutes ago isn't even showing pending builds, so I would guess "everything"
philor: catlee: probably over, ended at 07:48
edmorley|sheriffduty: catlee: other than old cruft, https://secure.pub.build.mozilla.org/buildapi/pending is looking pretty empty
catlee: philor: ok. how about last night?
philor: https://secure.pub.build.mozilla.org/buildapi/self-serve/b2g-inbound/rev/7bc0d1e21cb7 - emulator-debug build finished at 07:25, scheduled tests at 07:48
philor: catlee: both, I think, since my try build taking forever to show up was what first made me start looking
philor: catlee: yeah, both: https://secure.pub.build.mozilla.org/buildapi/self-serve/try/rev/4a4d24c24588 was pushed at 07:15
catlee: philor: so I see tons of pending jobs piling up until around 6:49, when they start to go down again
catlee: it looks like jobs were getting scheduled...
cyborgshadow: catlee: 6:49 what timezone?
catlee: cyborgshadow: PT
cyborgshadow: catlee: so...1 hour 5 minutes ago?
catlee: cyborgshadow: yeah
cyborgshadow: catlee: I have a huge spike at around 6:35 PT - 6:50 PT
cyborgshadow: catlee: but it started pretty precisely around 6:35
catlee: cyborgshadow: spike of what?
cyborgshadow: catlee: db traffic
catlee: philor: so I see the sendchange at 07:24:19
catlee: cyborgshadow: any other spikes you see?
cyborgshadow: catlee: I'm only the database guy. New relic isn't monitoring the buildbot application, so I can see db traffic and queries.
cyborgshadow: catlee: I can try and dig to see what those queries were.
catlee: just wondering if it's happened before
cyborgshadow: catlee: note however though, I'm finding several potential improvements to buildbot today and seeking approval to make them happier.
cyborgshadow: catlee: that I can check.
cyborgshadow: catlee: yes it has. It tends to happen 1 or 2 times a day, but the last 3 days in particular it's been at a high spike all day.
catlee: cyborgshadow: around what times?
catlee: buildapi is really slow right now...
edmorley|sheriffduty: catlee: seems down for retriggers again
bhearsum: catlee: slaverebooter is running, that may be affecting it...
cyborgshadow: catlee: the last 3 days one started around 14:00-16:00 on the 25th. Prior spikes align with that time: 14:00-16:00 they start ramping up. They don't come down until 05:00 usually.
cyborgshadow: catlee: fwiw: after implementing the changes I did a few minutes ago, we're at an alltime low on utilization on the box currently.
cyborgshadow: catlee: but network traffic is up, so it's a positive indication surrounding the changes. (utilization down +network traffic up = change was positive)
dustin: cyborgshadow++ awesome :)
catlee: nice
cyborgshadow: dustin: fwiw: we went from a new entry in the slow query log every 1-2 seconds to now only having 2 in the last 40 minutes.
dustin: awesome!
dustin: we're still slow though :(
dustin: I wish I could tell which bit of buildapi is causing the slowness
dustin: I strongly suspect reporter.py
dustin: but it's possible that memcache's hit rate is lower than redis
dustin: or that the web UI, now that it's on a faster host, is causing more hits
dustin: or that slaverebooter is bludgeoning it
bhearsum: it wouldn't shock me if it's that
bhearsum: slave rebooter runs for just over an hour on a 43 */4 * * * schedule
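
For reference on that schedule: "43 */4 * * *" fires at minute 43 of every fourth hour, i.e. six runs a day, each lasting a bit over an hour per the comment above. A quick way to expand it (croniter is just an assumed convenience here, not something used in this bug):

# Expand slaverebooter's cron schedule to concrete run times, to help
# correlate its runs with buildapi slowness. Schedule string comes from
# bhearsum's comment above; croniter is an assumed third-party helper.
from datetime import datetime
from croniter import croniter

itr = croniter("43 */4 * * *", datetime(2014, 1, 28, 0, 0))
for _ in range(6):
    print(itr.get_next(datetime))
# -> 00:43, 04:43, 08:43, 12:43, 16:43, 20:43 local time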
Severity: blocker → critical
Looks like the scheduler didn't get around to noticing the changes until this:

2014-01-28 07:21:26-0800 [-] Looking at changes: [<buildbot.changes.changes.Change instance at 0x5a89f38>, <buildbot.changes.changes.Change instance at 0x117d4248>, <buildbot.changes.changes.Change instance at 0x6f32c68>, <buildbot.changes.changes.Change instance at 0x989d3b0>, <buildbot.changes.changes.Change instance at 0xba1d2d8>, <buildbot.changes.changes.Change instance at 0xc3a2128>]
2014-01-28 07:21:26-0800 [-] TryChooser OPTIONS : MESSAGE Namespace(build=['opt', 'debug'], talos=u'none', test=u'mochitests,reftest', user_platforms=set([u'win32', u'android', u'macosx64', u'emulator', u'linux'])) : try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none
I don't know what else I can do without catching it in action again.
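
For context on the TryChooser log line: the scheduler parses the try syntax from the commit comment into the Namespace shown above. The sketch below is not the real trychooser parser, just a simplified illustration of that mapping, using only the flags visible in the log:

# Simplified, hypothetical sketch of mapping a try syntax string to options.
# It mirrors the flags in the "TryChooser OPTIONS" log line above and nothing more.
import argparse

def parse_try_message(comment):
    try_part = comment[comment.index("try:"):]
    parser = argparse.ArgumentParser()
    parser.add_argument("--build", "-b", default="do")
    parser.add_argument("--platform", "-p", default="all")
    parser.add_argument("--unittests", "-u", default="none")
    parser.add_argument("--talos", "-t", default="none")
    args, _ignored = parser.parse_known_args(try_part.split()[1:])
    return args

opts = parse_try_message(
    "try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none")
print(opts.platform.split(","))  # ['linux', 'macosx64', 'win32', 'android', 'emulator']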
Blocks: 957502
Summary: Tests not being scheduled → Jobs not being scheduled
Possibly related, we're seeing bad latency to the db (mtr-style report; columns are Loss%, Snt, Last, Avg, Best, Wrst, StDev, times in ms):
 1. 10.26.48.1                                       0.0%    96  253.5 240.2 210.0 278.8  15.1
 2. v-1034.border2.scl3.mozilla.net                  1.1%    96  532.2 482.6 412.3 554.6  29.8
 3. v-1030.fw1.scl3.mozilla.net                      0.0%    95  512.5 481.9 418.7 563.2  30.3
 4. buildbot-rw-vip.db.scl3.mozilla.com              0.0%    95  493.7 486.4 424.6 554.6  28.0
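
Given that the eventual fix (below) was about ping times to the db, a trivial RTT watch on the db VIP would have surfaced this earlier. A sketch only; the threshold and the reliance on the system ping binary are illustrative assumptions:

# Sketch only: warn when average round-trip time to the buildbot db VIP
# exceeds a threshold, by parsing the summary line of the system ping binary.
import re
import subprocess

DB_HOST = "buildbot-rw-vip.db.scl3.mozilla.com"
WARN_AVG_MS = 50.0

def average_rtt_ms(host, count=10):
    out = subprocess.check_output(["ping", "-c", str(count), "-q", host])
    # Linux ping summary line: "rtt min/avg/max/mdev = 0.4/0.5/0.7/0.1 ms"
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    return float(match.group(1)) if match else None

avg = average_rtt_ms(DB_HOST)
if avg is not None and avg > WARN_AVG_MS:
    print("WARNING: average RTT to %s is %.1f ms" % (DB_HOST, avg))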
Depends on: 964853
Trees were closed for this again at 09:24.
Some trees were reopened at 13:12. In order to avoid a big load spike, the remaining trees were reopened at 15:09.
It's gotten bad enough again (really, has been bad enough for the last hour or two, but I'm slow) that I've closed all trees.
And closed again.
The releng firewall upgrade and other measures have corrected the bad ping times that were the root cause.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard