Closed Bug 613191 Opened 14 years ago Closed 14 years ago

mozilla-central Mac and Windows nightlies did not fire this morning

Categories

(Release Engineering :: General, defect, P1)

All
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: catlee)

I had to kick these manually after beltzner noted that he wasn't getting a nightly update. Both Linux nightlies fired successfully; I haven't checked other branches.
May have been an issue yesterday, too:
9:31 AM <Pike> bhearsum: they might not have gotten their trigger yesterday, either. There's quite some empty on http://l10n.mozilla.org/~axel/nightlies/?date=2010-11-17&buildbot=true, which I blamed to downtime
The Android R7 nightlies also were not triggered.
This was caused by a bug fix in buildbot: https://github.com/buildbot/buildbot/commit/91c884e5faee567e2dd66b65c40736e69b5c8657

This fix was to clean out an intermediate table (scheduler_changes) that buildbot uses to track which changes are important to each scheduler.  The bug was that for nightly schedulers, these entries were never being cleaned out.  At this point we have 7.6 million unneeded rows in this table for all the various nightly schedulers.
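
For anyone wanting to check how much has piled up before touching anything, a per-class row count is one way to look. This is only a sketch against a toy in-memory SQLite database, not the production MySQL schedulerdb: the `changeid` column, the second scheduler class, and the helper names `make_toy_db` and `backlog_by_class` are assumptions for illustration. Only `scheduler_changes`, `schedulers`, `schedulerid` and `class_name` come from this bug.

```python
import sqlite3


def make_toy_db():
    """Build a tiny stand-in for the schedulerdb (schema is assumed, not the real one)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE schedulers (schedulerid INTEGER PRIMARY KEY, class_name TEXT);
        CREATE TABLE scheduler_changes (schedulerid INTEGER, changeid INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO schedulers VALUES (?, ?)",
        [
            (1, "buildbot.schedulers.timed.Nightly"),
            (2, "buildbot.schedulers.basic.Scheduler"),  # illustrative second class
        ],
    )
    # The nightly scheduler has accumulated far more rows than the other one.
    rows = [(1, c) for c in range(1000)] + [(2, c) for c in range(5)]
    conn.executemany("INSERT INTO scheduler_changes VALUES (?, ?)", rows)
    return conn


def backlog_by_class(conn):
    """Return {class_name: number of rows queued in scheduler_changes}."""
    rows = conn.execute(
        "SELECT s.class_name, COUNT(*) "
        "FROM scheduler_changes sc "
        "JOIN schedulers s ON s.schedulerid = sc.schedulerid "
        "GROUP BY s.class_name"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    for class_name, count in sorted(backlog_by_class(make_toy_db()).items()):
        print(class_name, count)
```

On a real master the same SELECT could be run from the mysql client directly; the Python wrapper is only there to make the example self-contained.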

We didn't hit this in staging because our staging databases are generally small.  Also, this wasn't a problem with buildbot-0.8.2 until the nightly schedulers started firing this morning.  At that point the entire scheduler loop was blocked waiting for each scheduler in turn to clear out all the accumulated changes.

I've manually deleted all these rows from the DB.  Normal operations seem to have resumed.

   delete from scheduler_changes where schedulerid in (select schedulerid from schedulers where class_name='buildbot.schedulers.timed.Nightly');
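
To see what the statement above does and does not touch, here is a minimal sketch that runs the same DELETE against a toy SQLite database and reports how many rows went away. The schema beyond `schedulerid` and `class_name`, the sample data, and the function names are assumptions; the real cleanup was a one-off statement run by hand against MySQL.

```python
import sqlite3

NIGHTLY = "buildbot.schedulers.timed.Nightly"


def purge_nightly_changes(conn):
    """Delete scheduler_changes rows owned by Nightly schedulers; return rows removed."""
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "DELETE FROM scheduler_changes WHERE schedulerid IN "
            "(SELECT schedulerid FROM schedulers WHERE class_name = ?)",
            (NIGHTLY,),
        )
        return cur.rowcount


def demo():
    """Run the purge on a toy database and return (rows removed, rows left)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE schedulers (schedulerid INTEGER PRIMARY KEY, class_name TEXT);
        CREATE TABLE scheduler_changes (schedulerid INTEGER, changeid INTEGER);
        INSERT INTO schedulers VALUES (1, 'buildbot.schedulers.timed.Nightly');
        INSERT INTO schedulers VALUES (2, 'some.other.Scheduler');  -- hypothetical class
        INSERT INTO scheduler_changes VALUES (1, 100), (1, 101), (1, 102), (2, 10);
        """
    )
    removed = purge_nightly_changes(conn)
    remaining = conn.execute(
        "SELECT schedulerid, changeid FROM scheduler_changes"
    ).fetchall()
    return removed, remaining


if __name__ == "__main__":
    print(demo())
```

Rows belonging to non-Nightly schedulers are left alone, which matters because those schedulers may still be waiting on their pending changes.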

A few notes about debugging this:

* Shutting down the master would usually hang (i.e. the process wouldn't exit for a long time, if at all).  strace -p `cat twistd.pid` revealed that the process was blocked waiting on a futex.  This usually points to DB-related code, since that's the only code that runs in a thread on the scheduler master.

* Looking at the output of 'show processlist' from mysql revealed that the master was doing a lot of queries to fetch change objects.

* I added lots of debug logging in schedulers/base.py, basic.py, timed.py to narrow down where we were getting blocked.  Restarting the scheduler master is relatively safe, and I thought necessary here to be able to figure out what was going on.
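
The kind of debug logging described in the last note can be sketched as a small timing wrapper placed around each suspect step. This is not the logging that was actually added to the scheduler code; it uses only the Python standard library, and the names `timed_step` and `scheduler_debug` and the `slow_after` threshold are made up for illustration.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("scheduler_debug")  # hypothetical logger name


@contextmanager
def timed_step(name, slow_after=1.0):
    """Log when a step starts and how long it took; warn if it exceeds slow_after seconds."""
    start = time.monotonic()
    log.debug("starting %s", name)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        level = logging.WARNING if elapsed >= slow_after else logging.DEBUG
        log.log(level, "finished %s in %.3fs", name, elapsed)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    # Stand-in for one scheduler's turn in the loop.
    with timed_step("nightly scheduler run", slow_after=0.05):
        time.sleep(0.1)
```

A "starting" line with no matching "finished" line shows which step the loop is stuck in, and the warning level makes slow steps easy to grep for.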
Assignee: nobody → catlee
Status: NEW → RESOLVED
Closed: 14 years ago
Priority: -- → P1
Resolution: --- → FIXED
This is a remnant of the "don't delete anything" approach to the schedulerdb.  If I recall correctly, that table's only used to cache changes that the scheduler feels are important while it waits for the treeStableTimer to expire.

It looks like the DELETE fixed the immediate problem, but won't this occur again?  If so, is the fix as simple as just flushing those rows once the corresponding buildset has been created?
Yeah, we shouldn't hit this again.  The table is getting cleared out every time a build runs.
For the record, then - since this took me a minute to figure out - as of 0.8.2 the nightly schedulers take care to clean out their classified changes even when they're not paying attention to them (http://buildbot.net/trac/ticket/939).  In MySQL, deleting rows is an expensive operation, and in this case we had lots of conditional deletions from the table - one for each nightly scheduler.  So it was taking forever, and blocking the scheduler loop.  Catlee's shortcut was to just delete *all* of the classified changes generated by Nightly schedulers, then restart the master.
Depends on: 613832
Product: mozilla.org → Release Engineering