Pending jobs count has gone through the roof

Status: RESOLVED WORKSFORME
Product: Release Engineering
Component: General
Priority: --
Severity: blocker
Reported: 7 years ago
Last modified: 5 years ago

People: philor (Reporter); Assignee: Unassigned
Firefox Tracking Flags: (Not tracked)

(Reporter)

Description

7 years ago
May very well wind up invalid, caused by too much load, but:

mozilla-central is closed because we currently have 2192 jobs pending vs. 502 running overall, and 258 pending vs. 48 running for m-c, with some of the m-c jobs having waited over 3 hours now.

At the moment, it seems this is, indeed, load-related. There's a backlog of Linux try builds, but that's clearing out; all of those builds, though, have created a big backlog of test runs.

I'm not sure closing m-c will help much, since most of the pending jobs are from try, but perhaps MFBT will ride in to the rescue?
(Reporter)

Comment 2

7 years ago
In theory, m-c is prioritized above try, so try can have a trillion jobs pending if it wants. There also seems to be another theory, that m-c, m-1.9.2, and m-1.9.1 are prioritized above the project branches, but the mix of pending/running on all the other branches (which you would then expect to be "nnn/0" whenever m-c has anything pending) doesn't support that theory.

But I didn't say to close m-c to solve the problem so much as to stop piling crap on crap: we have no real idea how badly broken the patches that were pushed four or five hours ago are, so we don't need to be throwing more broken patches on top of them.
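
For reference, the kind of branch priority theorized about above is usually expressed on a buildbot master through a prioritizeBuilders callback in master.cfg. This is only a minimal sketch; it assumes branch names appear in builder names, which is an illustration rather than the actual buildbotcustom logic:

# master.cfg sketch (hypothetical): consider m-c builders before try
# builders when deciding whose pending requests get the next slave.
def prioritizeBuilders(buildmaster, builders):
    def rank(builder):
        if 'mozilla-central' in builder.name:
            return 0   # highest priority
        if 'try' in builder.name:
            return 2   # lowest priority
        return 1       # project branches and everything else
    builders.sort(key=rank)
    return builders

c['prioritizeBuilders'] = prioritizeBuilders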
(Reporter)

Comment 3

7 years ago
Try still has a thousand jobs to chew through, but thanks to the power of massive coalescing everything else is caught up, and only three or four new bugs managed to sneak in without blame.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WORKSFORME
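
The "massive coalescing" mentioned above is buildbot's request merging: once a backlog forms, compatible pending requests for the same builder are collapsed so one run covers several pushes. A minimal sketch of the hook involved, assuming the stock mergeRequests config key rather than the exact buildbotcustom policy:

# master.cfg sketch: merge pending requests on the same branch so a
# single test run covers several queued pushes.
def mergeRequests(builder, req1, req2):
    return req1.source.branch == req2.source.branch

c['mergeRequests'] = mergeRequests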
For posterity, here are some data I collected from buildbot-master{04,06} while this was going on:

* The reactor was very slow to get around to doing scheduled tasks. Running the snippet below from the manhole logged delays of up to a minute, when slightly more than 1 second is expected.

from buildbot import util
from twisted.python import log
from twisted.internet import reactor

def cb(then):
    # Log how long the reactor actually took to run us (roughly 1s plus any lag).
    now = util.now()
    log.msg("delay: %s" % (now - then))

# Ask the reactor to run cb 1 second from now, passing the scheduling time.
reactor.callLater(1, cb, util.now())
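
For ongoing monitoring, a periodic variant of the same check can be left running from the manhole. This is just a sketch using Twisted's LoopingCall, and the "reactor-lag" log prefix is made up for illustration:

from buildbot import util
from twisted.python import log
from twisted.internet import task

last = [util.now()]

def check_lag():
    # Runs (nominally) once a second; anything well above 1s between
    # ticks means the reactor is too busy to get to scheduled work.
    now = util.now()
    log.msg("reactor-lag: %s" % (now - last[0]))
    last[0] = now

task.LoopingCall(check_lag).start(1)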

* Symptoms of this include slaves being connected but not getting jobs. One slave I looked at connected at :17 past the hour and didn't get a job until :29 past the hour, even though we had several hundred pending jobs.

* The number of established TCP connections spiked on both machines:

http://people.mozilla.org/~catlee/sattap/ffadb220.png

This is unusual for the week:

http://people.mozilla.org/~catlee/sattap/57e84712.png

If/when this happens again, I'd like to see a list of open files/sockets for the master process.
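
One low-effort way to grab that next time, runnable from the master's manhole (a sketch only; Linux-specific, since it reads /proc for the running process):

import os
from twisted.python import log

def log_open_fds():
    # List this process's open descriptors via /proc; sockets show up
    # as links like 'socket:[12345]'.
    fd_dir = '/proc/self/fd'
    links = []
    for fd in os.listdir(fd_dir):
        try:
            links.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            pass  # the fd may have closed between listdir and readlink
    sockets = [l for l in links if l.startswith('socket:')]
    log.msg("open fds: %d (%d sockets)" % (len(links), len(sockets)))

log_open_fds()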

Incidentally, buildbot-master5 was behaving just fine during this period.
We also had 15,762 jobs for the test pool yesterday, which I think is an all-time high.
buildbot-master5 was gracefully restarted (and disabled via slavealloc) yesterday, which explains the jump in TCP connections.

Does this mean we need 3 masters to handle all our test load?
My gut feeling is that 5-6 testing masters would let us distribute the jobs in a timely manner; 3 masters seems to be the minimum.
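
As a rough sanity check on those gut feelings (a back-of-envelope sketch only; it assumes yesterday's 15,762 test jobs were spread evenly across the day, which they certainly were not):

# Back-of-envelope: average jobs per master per hour at yesterday's volume.
jobs_per_day = 15762              # test-pool jobs reported above
jobs_per_hour = jobs_per_day / 24.0

for masters in (3, 4, 5, 6):
    print("%d masters: ~%.0f jobs/hour each" % (masters, jobs_per_hour / masters))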
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering