Closed Bug 592244 Opened 9 years ago Closed 9 years ago
race condition in schedulerdb
I'm seeing two nightlies for the same locale on the same builder, but different masters. http://production-master01.build.mozilla.org:8010/builders/Firefox%20mozilla-central%20macosx64%20l10n%20nightly/builds/1707 http://production-master03.build.mozilla.org:8010/builders/Firefox%20mozilla-central%20macosx64%20l10n%20nightly/builds/1670 Both with the same start time, Tue Aug 31 04:41:31 2010. Is there a race of masters grabbing build jobs?
Looks possible...There's only one build request for these, which has two build rows. There are 4 builds that started at exactly that time: select buildrequests.id,buildername from buildrequests, builds where buildrequests.id = builds.brid and start_time = 1283254891; +--------+-----------------------------------------------+ | id | buildername | +--------+-----------------------------------------------+ | 487448 | Firefox mozilla-central macosx64 l10n nightly | | 487449 | Firefox mozilla-central macosx64 l10n nightly | | 487448 | Firefox mozilla-central macosx64 l10n nightly | | 487524 | Firefox mozilla-central linux64 l10n nightly | +--------+-----------------------------------------------+ The two that share the same request id are the problematic ones.
Assignee: nobody → catlee
Nothing obvious in the logs....
So it looks like both masters are seeing the same unclaimed build request, and are able to both update the same row without the database complaining. P3 for now. If serious issues come up because of this, please let me know.
Priority: -- → P3
Checking the effects of the code against an InnoDB MySQL table, concurrent updates are handled just fine - the last transaction to commit takes precedence. To the extent it handles this at all, the db code expects to get an error about conflicts from the DB server. SQLite gives such an error. MySQL and Postgres do not, whether with MyISAM or InnoDB. Incidentally, catlee reports we're using MyISAM tables on the backend. So again, I'm not sure how this has *ever* worked.
(In reply to comment #6) > Checking the effects of the code against an InnoDB MySQL table, concurrent > updates are handled just fine - the last transaction to commit takes > precedence. To the extent it handles this at all, the db code expects to get > an error about conflicts from the DB server. SQLite gives such an error. > MySQL and Postgres do not, whether with MyISAM or InnoDB. > > Incidentally, catlee reports we're using MyISAM tables on the backend. > > So again, I'm not sure how this has *ever* worked. Until recently we had at most 2 masters configured to run the same builders. Now we've got 4 or 6 in some cases. To mitigate this, we could consolidate masters again...or split them up so that they're not running all the same builders. Or come up with a scheme to ensure that masters aren't all checking for builds at the same time.
Wouldn't that kinda defeat the purpose of schedulerdb, though?
Yes, but it's a short-term solution while we work out the schema changes required to fix it. Catlee, do you want to get things squared away short-term, and then hand this over to me?
(In reply to comment #9) > Yes, but it's a short-term solution while we work out the schema changes > required to fix it. > > Catlee, do you want to get things squared away short-term, and then hand this > over to me? Sure. I'll start by writing a tool to figure out how often this is happening. If it's fairly rare, I'm ok to leave it be for now.
Summary: race condition in nightly schedulers? → race condition in schedulerdb
There have been 58 occurrences of duplicate builds since I started counting on Nov 3rd at 1pm PDT. The vast majority of these are l10n repack jobs, the rest being a mix of various test jobs. I haven't seen any regular builds get duplicated so far. Axel, what's the impact of duplicate l10n repacks? Have we seen any failures as a result?
I guess that you're in a better position to tell, as you can work on first-class data. Pretty hard to tell from here. http://l10n.mozilla.org/~axel/nightlies/?date=2010-11-05&buildbot=true looks more dupe-dense that usual, and no obvious failures this time. I quite often see failed builds and passing next to each other though. See ga-IE fx36 yesterday, http://l10n.mozilla.org/~axel/nightlies/?date=2010-11-04&buildbot=true. Given that we're still seeing upload errors much more frequently than anything else, it's hard to figure out what fails when and why from here.
We could fix this temporarily by restricting the l10n repacks to happen on only one master.
This needs to be fixed upstream.
Assignee: catlee → nobody
We're doing a *ton* of this right now (way more than I thought just by looking at tbpl, since for the most part the double runs start at the same time, and thus Tinderbox and tbpl wind up only knowing about one of them), which seems unlikely to do good things for our shortage of slaves and consequent wait times. From a quick look through the tip TraceMonkey push (on buildapi, where they're visible), looks like it had 28 jobs that were double-run, and 1 that was triple-run.
It's definitely gotten worse lately, not sure why. Maybe the change in bug 649734 made the builder loop more efficient, so masters are racing each other more often?
Actually, this might explain why wait times are bad if we're duplicating a lot of test jobs.
This seems to be intermittently breaking Android updates (bug 651925).
Android updates again: bug 652667
The approach here is similar to what's used in 0.8.4 I think. The UPDATE query to claim the buildrequest rows is qualified by conditions that should match only unclaimed requests. If the master claims fewer rows than is in the batch, the entire set of buildrequests is then unclaimed again. I've seen no duplicated builds with this code. I think there's a small risk that if re-claiming a buildrequest fails, then it could be subsequently claimed by another master. I haven't actually seen this happen, and I think that the master would have to fail to re-claim the request 6 times for this to occur (reclaims happen every 10 minutes, and buildrequests aren't stolen by other masters until an hour has gone by)
Comment on attachment 528306 [details] [diff] [review] Try to handle race conditions + missed = t.rowcount This seems an odd variable name - it's the number of build requests that were *hit*, not missed, right? If a reclaim fails, this will unclaim the builds that should have been reclaimed. But I suppose that's a very narrow failure case. So, barring moving to a "claims" table, I think this looks good.
This builds on the previous patch by adding the 'am_reclaiming' flag to the claim buildrequests method. If we fail to reclaim the buildrequests, we log a warning and exit instead of trying to unclaim them.
Comment on attachment 529712 [details] [diff] [review] Try to handle race conditions, v2 Review of attachment 529712 [details] [diff] [review]: lgtm ::: master/buildbot/db/connector.py @@ +872,5 @@ + qargs = [now, master_name, master_incarnation] + list(batch) + [old, master_name] + t.execute(q, qargs) + matched = t.rowcount + if matched != len(batch): + log.msg("Failed to claim everything; unclaiming (got %s, expected %s)" % (matched, len(batch))) You should probably change this log message, since you won't necessarily claim here.
Attachment #529712 - Flags: feedback?(dustin) → feedback+
3rd time's the charm? Basically the same as before; adjusted the log message you pointed out, and fix claim_buildrequests's return value in the case that we're not reclaiming anything.
Attachment #531158 - Flags: review?(dustin) → review+
Comment on attachment 531158 [details] [diff] [review] Try to handle race conditions, v3 http://hg.mozilla.org/build/buildbot/rev/ed19e0d5b2e5
Attachment #531158 - Flags: checked-in+
Optimistically calling this fixed. It's been deployed to all of the production build masters, and I haven't seen any of the duplicated builds on masters where this has been deployed. The last duplicate builds I've found were at 4am, just prior to this fix being deployed on pm03.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [buildbot][waiting-for-upstream] → [buildbot]
Confirming. The latest of AMO's performance measurement runs created the expected 404 results, not 430 or 450 as it used to be (see bug 650744).
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.