Bug 964804 (Closed): Opened 11 years ago, Closed 11 years ago

Jobs not being scheduled

Categories: Infrastructure & Operations Graveyard :: CIDuty
Type: task
Priority: Not set
Severity: critical
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: emorley
Assignee: Unassigned
Tests are not being scheduled once builds are complete.
Philor reports this happened ~10 hrs ago too.

All trees closed.
philor: https://tbpl.mozilla.org/?showall=1&tree=B2g-Inbound&onlyunstarred=1
philor: where's the orange?
philor: yeah, G is permaorange, Gi is over 50% red, so... where's the orange?
philor: 04:36 push got its linux desktop coalesced to a 04:51 push? 04:51 push got no tests scheduled at all?
philor: that looks like either totally busted scheduling, or, a reconfig that made all my dreams come true, not sure which
...
philor: edmorley|sheriffduty: the former, you should close every tree, we haven't scheduled tests on builds that finished 40 minutes ago
philor: and judging by my experience last night when I noticed it and started screaming one minute before it unstuck, closing should be enough to unstick it
edmorley|sheriffduty: philor: was there a bug for last night, or should I file new? (I didn't see anything in the bugmail)
philor: edmorley|sheriffduty: nope, I started screaming "urgent:" precisely one minute before it unstuck :)
Looking at https://secure.pub.build.mozilla.org/buildapi/self-serve/try/rev/4a4d24c24588

The change was pushed at 07:15:57. Buildbot picked up the change at 07:16:27:
2014-01-28 07:16:27-0800 [HTTPPageGetter,client] adding change, who bjacob@mozilla.com, 1 files, rev=4a4d24c24588827cc045b49ed35a0d4f8155d144, branch=try, repository=, comments try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none, category None, project

The submittime for these builds isn't until much later, 07:32:06, when we see a few entries like this:
2014-01-28 07:32:05-0800 [-] b2g_try-b2g: propfunc returned {'builduid': '742a333baa8a498798ec59f68bd97bc5'}

which matches the builduid in the request here https://secure.pub.build.mozilla.org/buildapi/self-serve/try/build/36050468
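
As a side note for anyone redoing this kind of analysis: the scheduling delay for this push can be read straight off the master log timestamps quoted above. A minimal sketch (the log lines are the ones from this comment; the helper itself is purely illustrative):

# Sketch only: compute the gap between the "adding change" entry and the
# "propfunc returned" / submittime entry for the same push by parsing the
# buildbot master log timestamps quoted in this comment.
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_log_timestamp(line):
    # e.g. "2014-01-28 07:16:27-0800 [-] ..." - take the first 19 chars,
    # ignoring the UTC offset since both lines use the same one.
    return datetime.strptime(line[:19], LOG_TS_FORMAT)

picked_up = parse_log_timestamp(
    "2014-01-28 07:16:27-0800 [HTTPPageGetter,client] adding change, ...")
submitted = parse_log_timestamp(
    "2014-01-28 07:32:05-0800 [-] b2g_try-b2g: propfunc returned "
    "{'builduid': '742a333baa8a498798ec59f68bd97bc5'}")

# ~15-16 minutes between pickup and submit for this push
print("scheduling delay: %s" % (submitted - picked_up))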
Trees reopened; leaving open for further investigation :-)


philor: is there any way that nagios can watch the scheduler?
catlee: philor: to notice when it's stuck?
philor: or is there no way to see a scheduling backlog except by being the scheduler that has a 45 minute backlog right now?
philor: like the 30 minute backlog it had last night
philor: and the one that I now wonder if we have every single night, right before we schedule all the things at once and cause the nightly closures
catlee: philor: I'll take a peek
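
On the nagios question: no such check existed at the time. The following is only a rough sketch of what one could look like, polling the buildapi pending endpoint and alerting when the oldest pending request gets too old. The JSON shape assumed here (a "pending" map of branch -> revision -> list of requests, each with a "submitted_at" epoch timestamp) is an assumption about buildapi's response, not verified against production, and it would only catch a backlog of already-created build requests, not the scheduler-internal change backlog philor describes.

#!/usr/bin/env python
# Rough sketch of a Nagios-style check against buildapi's pending endpoint.
# Field names and JSON layout are assumptions, not verified.
import json
import sys
import time
import urllib2

PENDING_URL = "https://secure.pub.build.mozilla.org/buildapi/pending?format=json"
WARN_MINUTES = 20
CRIT_MINUTES = 40

def oldest_pending_age_minutes():
    data = json.load(urllib2.urlopen(PENDING_URL, timeout=60))
    oldest = None
    for branch in data.get("pending", {}).values():
        for requests in branch.values():
            for request in requests:
                submitted = request.get("submitted_at")
                if submitted and (oldest is None or submitted < oldest):
                    oldest = submitted
    if oldest is None:
        return 0.0
    return (time.time() - oldest) / 60.0

if __name__ == "__main__":
    age = oldest_pending_age_minutes()
    if age >= CRIT_MINUTES:
        print("CRITICAL: oldest pending request is %.0f minutes old" % age)
        sys.exit(2)
    if age >= WARN_MINUTES:
        print("WARNING: oldest pending request is %.0f minutes old" % age)
        sys.exit(1)
    print("OK: oldest pending request is %.0f minutes old" % age)
    sys.exit(0)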

catlee: edmorley|sheriffduty: just tests?
RyanVM: catlee: well, my push to c-c from 6 minutes ago isn't even showing pending builds, so I would guess "everything"
philor: catlee: probably over, ended at 07:48
edmorley|sheriffduty: catlee: other than old cruft, https://secure.pub.build.mozilla.org/buildapi/pending is looking pretty empty
catlee: philor: ok. how about last night?
philor: https://secure.pub.build.mozilla.org/buildapi/self-serve/b2g-inbound/rev/7bc0d1e21cb7 - emulator-debug build finished at 07:25, scheduled tests at 07:48
philor: catlee: both, I think, since my try build taking forever to show up was what first made me start looking
philor: catlee: yeah, both: https://secure.pub.build.mozilla.org/buildapi/self-serve/try/rev/4a4d24c24588 was pushed at 07:15
catlee: philor: so I see tons of pending jobs piling up until around 6:49, when they start to go down again
catlee: it looks like jobs were getting scheduled...
cyborgshadow: catlee: 6:49 what timezone?
catlee: cyborgshadow: PT
cyborgshadow: catlee: so...1 hour 5 minutes ago?
catlee: cyborgshadow: yeah
cyborgshadow: catlee: I have a huge spike at around 6:35 PT - 6:50 PT
cyborgshadow: catlee: but it started pretty precisely around 6:35
catlee: cyborgshadow: spike of what?
cyborgshadow: catlee: db traffic
catlee: philor: so I see the sendchange at 07:24:19
catlee: cyborgshadow: any other spikes you see?
cyborgshadow: catlee: I'm only the database guy. New relic isn't monitoring the buildbot application, so I can see db traffic and queries.
cyborgshadow: catlee: I can try and dig to see what those queries were.
catlee: just wondering if it's happened before
cyborgshadow: catlee: note however though, I'm finding several potential improvements to buildbot today and seeking approval to make them happier.
cyborgshadow: catlee: that I can check.
cyborgshadow: catlee: yes it has. It tends to happen 1 or 2 times a day, but the last 3 days in particular it's been at a high spike all day.
catlee: cyborgshadow: around what times?
catlee: buildapi is really slow right now...
edmorley|sheriffduty: catlee: seems down for retriggers again
bhearsum: catlee: slaverebooter is running, that may be affecting it...
cyborgshadow: catlee: the last 3 days one started around 14:00-16:00 on the 25th. Prior spikes align with that time: 14:00-16:00 they start ramping up. They don't come down until 05:00 usually.
cyborgshadow: catlee: fwiw: after implementing the changes I did a few minutes ago, we're at an alltime low on utilization on the box currently.
cyborgshadow: catlee: but network traffic is up, so it's a positive indication surrounding the changes. (utilization down +network traffic up = change was positive)
dustin: cyborgshadow++ awesome :)
catlee: nice
cyborgshadow: dustin: fwiw: we went from a new entry in the slow query log every 1-2 seconds to now only having 2 in the last 40 minutes.
dustin: awesome!
dustin: we're still slow though :(
dustin: I wish I could tell which bit of buildapi is causing the slowness
dustin: I strongly suspect reporter.py
dustin: but it's possible that memcache's hit rate is lower than redis
dustin: or that the web UI, now that it's on a faster host, is causing more hits
dustin: or that slaverebooter is bludgeoning it
bhearsum: it wouldn't shock me if it's that
bhearsum: slave rebooter runs for just over an hour on a 43 */4 * * * schedule
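
For reference on that schedule: "43 */4 * * *" fires at minute 43 of every fourth hour, i.e. six runs a day, each lasting a bit over an hour per the comment above. A quick way to expand it (croniter is just an assumed convenience here, not something used in this bug):

# Expand slaverebooter's cron schedule to concrete run times, to help
# correlate its runs with buildapi slowness. Schedule string comes from
# bhearsum's comment above; croniter is an assumed third-party helper.
from datetime import datetime
from croniter import croniter

itr = croniter("43 */4 * * *", datetime(2014, 1, 28, 0, 0))
for _ in range(6):
    print(itr.get_next(datetime))
# -> 00:43, 04:43, 08:43, 12:43, 16:43, 20:43 local time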
Severity: blocker → critical
Looks like the scheduler didn't get around to noticing the changes until this:

2014-01-28 07:21:26-0800 [-] Looking at changes: [<buildbot.changes.changes.Change instance at 0x5a89f38>, <buildbot.changes.changes.Change instance at 0x117d4248>, <buildbot.changes.changes.Change instance at 0x6f32c68>, <buildbot.changes.changes.Change instance at 0x989d3b0>, <buildbot.changes.changes.Change instance at 0xba1d2d8>, <buildbot.changes.changes.Change instance at 0xc3a2128>]
2014-01-28 07:21:26-0800 [-] TryChooser OPTIONS : MESSAGE Namespace(build=['opt', 'debug'], talos=u'none', test=u'mochitests,reftest', user_platforms=set([u'win32', u'android', u'macosx64', u'emulator', u'linux'])) : try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none
I don't know what else I can do without catching it in action again.
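
For context on the TryChooser log line: the scheduler parses the try syntax from the commit comment into the Namespace shown above. The sketch below is not the real trychooser parser, just a simplified illustration of that mapping, using only the flags visible in the log:

# Simplified, hypothetical sketch of mapping a try syntax string to options.
# It mirrors the flags in the "TryChooser OPTIONS" log line above and nothing more.
import argparse

def parse_try_message(comment):
    try_part = comment[comment.index("try:"):]
    parser = argparse.ArgumentParser()
    parser.add_argument("--build", "-b", default="do")
    parser.add_argument("--platform", "-p", default="all")
    parser.add_argument("--unittests", "-u", default="none")
    parser.add_argument("--talos", "-t", default="none")
    args, _ignored = parser.parse_known_args(try_part.split()[1:])
    return args

opts = parse_try_message(
    "try: -b do -p linux,macosx64,win32,android,emulator -u mochitests,reftest -t none")
print(opts.platform.split(","))  # ['linux', 'macosx64', 'win32', 'android', 'emulator']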
Blocks: 957502
Summary: Tests not being scheduled → Jobs not being scheduled
Possibly related, we're seeing bad latency to the db (mtr-style report; columns are Loss%, Snt, Last, Avg, Best, Wrst, StDev, times in ms):
 1. 10.26.48.1                                       0.0%    96  253.5 240.2 210.0 278.8  15.1
 2. v-1034.border2.scl3.mozilla.net                  1.1%    96  532.2 482.6 412.3 554.6  29.8
 3. v-1030.fw1.scl3.mozilla.net                      0.0%    95  512.5 481.9 418.7 563.2  30.3
 4. buildbot-rw-vip.db.scl3.mozilla.com              0.0%    95  493.7 486.4 424.6 554.6  28.0
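
Given that the eventual fix (below) was about ping times to the db, a trivial RTT watch on the db VIP would have surfaced this earlier. A sketch only; the threshold and the reliance on the system ping binary are illustrative assumptions:

# Sketch only: warn when average round-trip time to the buildbot db VIP
# exceeds a threshold, by parsing the summary line of the system ping binary.
import re
import subprocess

DB_HOST = "buildbot-rw-vip.db.scl3.mozilla.com"
WARN_AVG_MS = 50.0

def average_rtt_ms(host, count=10):
    out = subprocess.check_output(["ping", "-c", str(count), "-q", host])
    # Linux ping summary line: "rtt min/avg/max/mdev = 0.4/0.5/0.7/0.1 ms"
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    return float(match.group(1)) if match else None

avg = average_rtt_ms(DB_HOST)
if avg is not None and avg > WARN_AVG_MS:
    print("WARNING: average RTT to %s is %.1f ms" % (DB_HOST, avg))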
Depends on: 964853
Trees were closed for this again at 09:24.
Some trees were reopened at 13:12. In order to avoid a big load spike, the remaining trees were reopened at 15:09.
It's gotten bad enough again (really, has been bad enough for the last hour or two, but I'm slow) that I've closed all trees.
And closed again.
The releng firewall upgrade and other measures have corrected the bad ping times that were the root cause.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard