Closed Bug 811708 Opened 13 years ago Closed 12 years ago

race condition in aggregating scheduler

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

Details

(Whiteboard: [schedulers][buildbot])

Attachments

(1 file)

On Wednesday morning, I noticed that updates still weren't ready. I looked through my e-mail and confirmed that I had received all of the "all $platfrom builds now available" mails. Then I went poking around the database. I found that two of the schedulers with "release-mozilla-beta-linux64_build" as an upstream hadn't been properly updated when it had completed: | release-mozilla-beta-firefox_deliverables_ready | {"remainingBuilders": ["release-mozilla-beta-linux64_build"], "upstreamBuilders": ["release-mozilla-beta-firefox_source", "release-mozilla-beta-linux_build", "release-mozilla-beta-linux_repack_complete", "release-mozilla-beta-linux64_build", "release-mozilla-beta-linux64_repack_complete", "release-mozilla-beta-win32_build", "release-mozilla-beta-win32_repack_complete", "release-mozilla-beta-macosx64_build", "release-mozilla-beta-macosx64_repack_complete"], "lastReset": 1352817804.972043, "lastCheck": 1352843283} | | release-mozilla-beta-firefox_signing_done | {"remainingBuilders": ["release-mozilla-beta-linux64_build"], "upstreamBuilders": ["release-mozilla-beta-linux_build", "release-mozilla-beta-linux_repack_complete", "release-mozilla-beta-linux64_build", "release-mozilla-beta-linux64_repack_complete", "release-mozilla-beta-win32_build", "release-mozilla-beta-win32_repack_complete", "release-mozilla-beta-macosx64_build", "release-mozilla-beta-macosx64_repack_complete"], "lastReset": 1352817804.9652419, "lastCheck": 1352843283} | However, the "release-mozilla-beta-linux64_repack" scheduler worked fine. | release-mozilla-beta-linux64_repack | {"remainingBuilders": ["release-mozilla-beta-linux64_build", "release-mozilla-beta-firefox_tag_l10n"], "upstreamBuilders": ["release-mozilla-beta-linux64_build", "release-mozilla-beta-firefox_tag_l10n"], "lastCheck": 1335386373} This is very confusing, because these schedulers would've been updated right at the same time after the linux64 build completed. I poked around the twistd.log of both the scheduler and the build master and neither had an errors or anything out of the ordinary around the time that the linux64 build completed. The buildbot-master12:~cltbld/twistd.log.6 is the builders log. The schedulers had absolutely nothing in it related to this builder, so I didn't back it up.
Catlee looked into this and it's looking like this was caused by a race condition in AggregatingScheduler. At Tue Nov 13 08:08:59 2012, release-mozilla-beta-linux64_build completed on bm12. At Tue Nov 13 08:09:01 2012, release-mozilla-beta-macosx64_build completed on bm32. So it looks like both masters grabbed the current state, then bm12 marked linux64 as done, and then bm32 marked macosx64 as done - overwriting the update that linux64 had done. The reason that the repacks still fired correctly is because that scheduler is a Triggerable, not an AggregatingScheduler!
Summary: post deliverables and signing done schedulers didn't fire correctly for firefox 17.0b6 → race condition in aggregating scheduler
(In reply to Ben Hearsum [:bhearsum] from comment #1) > Catlee looked into this and it's looking like this was caused by a race > condition in AggregatingScheduler. > > At Tue Nov 13 08:08:59 2012, release-mozilla-beta-linux64_build completed on > bm12. > At Tue Nov 13 08:09:01 2012, release-mozilla-beta-macosx64_build completed > on bm32. > > So it looks like both masters grabbed the current state, then bm12 marked > linux64 as done, and then bm32 marked macosx64 as done - overwriting the > update that linux64 had done. I believe there's only one instance of this scheduler, and that what happened was that the macosx64's build row was updated first. Then the scheduler ran, found the new macosx64 build that finished, and updated its internal "lastCheck" variable. This made lastCheck be past the date when the linux64 build finished, and so the scheduler never noticed that linux64 had completed. 2012-11-13 08:09:02-0800 [-] AggregatingScheduler(release-mozilla-beta-firefox_deliverables_ready) <id=406413536>: new builds: ((u'release-mozilla-beta-macosx64_build', 17503906L, 1352822943L),) since 1352820361 1352820361 is 07:26:01 pacific time. 1352822943 is 08:09:03 pacific time. so it looked for new builds since 07:26:01, and only found the macosx64 build. It then updated lastCheck to 1352822943 (08:09:03), and so didn't see any builds that finished before that.
Component: Release Engineering: Automation (Release Automation) → Release Engineering: Automation (General)
jhopkins and I chatted a bit about this. one solution would be to add a timestamp for "lastReset" which records the time when the scheduler was last reset. Then we could add an offset to the findNewBuilds function to look for builds that finished > lastCheck-offset. This lets us find builds that finish close together, but prevents us from having those builds re-discovered after a reset.
Whiteboard: [schedulers][buildbot]
To fix this, manhole into build master, then execute something like this: >>> schedulers = list(master.scheduler_manager) >>> for s in schedulers: ... if s.name == 'release-mozilla-beta-linux_repack_complete': ... break ... >>> >>> s <buildbotcustom.scheduler.AggregatingScheduler instance at 0x2c662dd0> >>> state = master.db.runInteractionNow(s.get_state) >>> state {u'remainingBuilders': [u'release-mozilla-beta-linux_repack_4/6'], u'upstreamBuilders': [u'release-mozilla-beta-linux_repack_1/6', u'release-mozilla-beta-linux_repack_2/6', u'release-mozilla-beta-linux_repack_3/6', u'release-mozilla-beta-linux_repack_4/6', u'release-mozilla-beta-linux_repack_5/6', u'release-mozilla-beta-linux_repack_6/6'], u'lastReset': 1367377370.384453, u'lastCheck': 1367391427} >>> state['remainingBuilders'] = [] >>> master.db.runInteractionNow(s.set_state, state) >>> state = master.db.runInteractionNow(s.get_state) >>> state {u'remainingBuilders': [], u'upstreamBuilders': [u'release-mozilla-beta-linux_repack_1/6', u'release-mozilla-beta-linux_repack_2/6', u'release-mozilla-beta-linux_repack_3/6', u'release-mozilla-beta-linux_repack_4/6', u'release-mozilla-beta-linux_repack_5/6', u'release-mozilla-beta-linux_repack_6/6'], u'lastReset': 1367377370.384453, u'lastCheck': 1367391427}
Attachment #747743 - Flags: review?(rail)
Attachment #747743 - Flags: review?(nthomas)
Comment on attachment 747743 [details] [diff] [review] add tests, and fix the race condition Review of attachment 747743 [details] [diff] [review]: ----------------------------------------------------------------- lgtm
Attachment #747743 - Flags: review?(rail) → review+
Attachment #747743 - Flags: review?(nthomas) → review+
Attachment #747743 - Flags: checked-in+
in production
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: