Bug 613191 (Closed)
Opened 14 years ago · Closed 14 years ago
mozilla-central Mac and Windows nightlies did not fire this morning
Categories: Release Engineering :: General, defect, P1
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: bhearsum, Assigned: catlee
I had to kick these manually after beltzner noted that he wasn't getting a nightly update. Both Linux nightlies fired successfully; I haven't checked other branches.
Reporter | Comment 1 • 14 years ago
May have been an issue yesterday, too:

    9:31 AM <Pike> bhearsum: they might not have gotten their trigger yesterday, either. There's quite some empty on http://l10n.mozilla.org/~axel/nightlies/?date=2010-11-17&buildbot=true, which I blamed to downtime
Comment 2 • 14 years ago
The Android R7 nightlies also were not triggered.
Assignee | Comment 3 • 14 years ago
This was caused by a bug fix in Buildbot: https://github.com/buildbot/buildbot/commit/91c884e5faee567e2dd66b65c40736e69b5c8657

That fix cleans out an intermediate table (scheduler_changes) that Buildbot uses to track which changes are important to each scheduler. The bug was that for nightly schedulers, these entries were never cleaned out. At this point we have 7.6 million unneeded rows in this table across all the various nightly schedulers. We didn't hit this in staging because our databases there are generally quite small. This also wasn't a problem with buildbot-0.8.2 until the nightly schedulers started firing this morning; at that point the entire scheduler loop was blocked waiting for each scheduler in turn to clear out all of its accumulated changes.

I've manually deleted all these rows from the DB, and normal operations seem to have continued:

    delete from scheduler_changes where schedulerid in
      (select schedulerid from schedulers
       where class_name='buildbot.schedulers.timed.Nightly');

A few notes about debugging this:

* Shutting down the master would usually hang, i.e. the process wouldn't exit for a long time, if at all. Running strace -p `cat twistd.pid` showed the process blocked waiting on a futex. This usually points at DB-related code, since that's the only code that runs in a thread on the scheduler master.
* The output of 'show processlist' in MySQL showed the master doing a lot of queries to fetch change objects.
* I added lots of debug logging in schedulers/base.py, basic.py, and timed.py to narrow down where we were getting blocked.

Restarting the scheduler master is relatively safe, and I thought it necessary here to be able to figure out what was going on.
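(For illustration, here is a minimal runnable sketch of that one-shot cleanup. The table and column names match the DELETE above; everything else about the schema is a simplified assumption, not the real schedulerdb, and sqlite3 stands in for MySQL.)

```python
import sqlite3

# In-memory sketch of the relevant slice of the schedulerdb.
# Schema is an assumption based only on the DELETE statement above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schedulers (
    schedulerid INTEGER PRIMARY KEY,
    class_name  TEXT
);
CREATE TABLE scheduler_changes (
    schedulerid INTEGER,
    changeid    INTEGER
);
""")

# One nightly scheduler and one ordinary scheduler.
conn.execute("INSERT INTO schedulers VALUES (1, 'buildbot.schedulers.timed.Nightly')")
conn.execute("INSERT INTO schedulers VALUES (2, 'buildbot.schedulers.basic.Scheduler')")

# Simulate accumulated classified changes (7.6M in production; a few here).
conn.executemany("INSERT INTO scheduler_changes VALUES (?, ?)",
                 [(1, n) for n in range(5)] + [(2, n) for n in range(3)])

# The one-shot cleanup: drop every row belonging to a Nightly scheduler
# in a single statement, leaving other schedulers' rows alone.
conn.execute("""
DELETE FROM scheduler_changes WHERE schedulerid IN
    (SELECT schedulerid FROM schedulers
     WHERE class_name = 'buildbot.schedulers.timed.Nightly')
""")

remaining = conn.execute("SELECT COUNT(*) FROM scheduler_changes").fetchone()[0]
print(remaining)  # -> 3: only the non-Nightly scheduler's rows survive
```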
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 14 years ago
Priority: -- → P1
Resolution: --- → FIXED
Comment 4 • 14 years ago
This is a remnant of the "don't delete anything" approach to the schedulerdb. If I recall correctly, that table's only used to cache changes that the scheduler feels are important while it waits for the treeStableTimer to expire. It looks like the DELETE fixed the immediate problem, but won't this occur again? If so, is the fix as simple as just flushing those rows once the corresponding buildset has been created?
Assignee | Comment 5 • 14 years ago
Yeah, we shouldn't hit this again. The table is getting cleared out every time a build runs.
Comment 6 • 14 years ago
For the record, then, since this took me a minute to figure out: as of 0.8.2 the nightly schedulers take care to clean out their classified changes even when they're not paying attention to them (http://buildbot.net/trac/ticket/939).

In MySQL, deleting rows is an expensive operation, and in this case we had lots of conditional deletions from the table, one for each nightly scheduler. So it was taking forever and blocking the scheduler loop. Catlee's shortcut was to just delete *all* of the classified changes generated by Nightly schedulers, then restart the master.
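The contrast between the two cleanup paths described here can be sketched as follows. This is a hypothetical illustration, not the actual Buildbot scheduler code: the schema is a simplified assumption from the DELETE in comment 3, and sqlite3 stands in for MySQL.

```python
import sqlite3

NIGHTLY = "buildbot.schedulers.timed.Nightly"

# Simplified schema (an assumption, not the real schedulerdb).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schedulers (schedulerid INTEGER PRIMARY KEY, class_name TEXT);
CREATE TABLE scheduler_changes (schedulerid INTEGER, changeid INTEGER);
""")
conn.executemany("INSERT INTO schedulers VALUES (?, ?)",
                 [(i, NIGHTLY) for i in range(1, 4)])
conn.executemany("INSERT INTO scheduler_changes VALUES (?, ?)",
                 [(sid, c) for sid in range(1, 4) for c in range(10)])

def per_scheduler_cleanup(db):
    """Slow path: one conditional DELETE per Nightly scheduler,
    issued from inside the scheduler loop."""
    sids = db.execute("SELECT schedulerid FROM schedulers WHERE class_name = ?",
                      (NIGHTLY,)).fetchall()
    for (sid,) in sids:
        db.execute("DELETE FROM scheduler_changes WHERE schedulerid = ?", (sid,))

def bulk_cleanup(db):
    """The shortcut: a single statement covering every Nightly scheduler."""
    db.execute("DELETE FROM scheduler_changes WHERE schedulerid IN "
               "(SELECT schedulerid FROM schedulers WHERE class_name = ?)",
               (NIGHTLY,))

bulk_cleanup(conn)
count = conn.execute("SELECT COUNT(*) FROM scheduler_changes").fetchone()[0]
print(count)  # -> 0: all Nightly rows gone in one statement
```

With three schedulers and thirty rows either path is instant; the point of the shortcut is that with hundreds of schedulers and millions of rows, one statement avoids repeating the expensive delete once per scheduler.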
Updated • 11 years ago
Product: mozilla.org → Release Engineering