Closed
Bug 719260
Opened 11 years ago
Closed 11 years ago
Investigate why updates builder triggered twice for 10.0b5
Categories
(Release Engineering :: General, defect, P5)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rail, Assigned: rail)
References
Details
(Whiteboard: [releases][automation][updates])
Attachments
(5 files, 4 obsolete files)
- 950 bytes, patch (catlee: review+, catlee: feedback+, rail: checked-in+)
- 2.54 KB, patch (rail: review+, dustin: feedback+, Callek: checked-in+)
- 305.52 KB, application/x-bzip
- 1.82 KB, patch (bhearsum: review+, rail: checked-in+)
- 5.84 KB, patch (rail: review+, catlee: checked-in+)
+ l10n verify builders
Assignee
Comment 1•11 years ago

2012-01-18 15:10:01-0800 [HTTPPageGetter,client] Stopping factory <HTTPClientFactory: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log>
2012-01-18 15:20:01-0800 [-] Starting factory <HTTPClientFactory: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log>
2012-01-18 15:20:02-0800 [HTTPPageGetter,client] Stopping UrlPoller for http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log
2012-01-18 15:20:02-0800 [HTTPPageGetter,client] adding change, who http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log, 0 files, rev=None, branch=release-mozilla-beta-post_signing, repository=, comments success, category None, project
2012-01-18 15:20:02-0800 [HTTPPageGetter,client] Stopping factory <HTTPClientFactory: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log>
Assignee
Comment 2•11 years ago

| 3743 | release-mozilla-beta-updates_done | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-updates"], "lastReset": 1326902444.759733, "lastCheck": 1326932071.5622721} |
| 3746 | release-mozilla-beta-firefox_deliverables_ready | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-updates"], "lastReset": 1326902444.87379, "lastCheck": 1326932071.5725319} |
| 3752 | release-mozilla-beta-firefox_signing_done | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-signing_done"], "lastReset": 1326902444.570888, "lastCheck": 1326932071.6682479} |
| 3755 | release-mozilla-beta-av_done | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-antivirus"], "lastReset": 1326902445.1268611, "lastCheck": 1326932071.6440589} |
Assignee
Comment 3•11 years ago

2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates> using slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave20'>
2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates>.. pinging the slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave20'>
2012-01-18 15:20:07-0800 [-] sending ping
2012-01-18 15:20:07-0800 [Broker,1603,10.250.49.168] ping finished: success
2012-01-18 15:20:07-0800 [Broker,1603,10.250.49.168] <Build release-mozilla-beta-updates>.startBuild
2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates> using slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave10'>
2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates>.. pinging the slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave10'>
2012-01-18 15:20:07-0800 [-] sending ping
2012-01-18 15:20:07-0800 [Broker,1733,10.250.49.158] ping finished: success
2012-01-18 15:20:07-0800 [Broker,1733,10.250.49.158] <Build release-mozilla-beta-updates>.startBuild
Assignee
Comment 4•11 years ago
Maybe it's AggregatingScheduler.run() which runs async and allows 2 parallel checks? Running self.parent.db.runInteractionNow(self._run) should help in this case. I've never hit this issue in staging which uses sqlite...
Assignee
Comment 5•11 years ago

FTR:

mysql> select * from buildrequests where buildername='release-mozilla-beta-updates' limit 41,2;
| id | buildsetid | buildername | priority | claimed_at | claimed_by_name | claimed_by_incarnation | complete | results | submitted_at | complete_at |
| 8054156 | 3310332 | release-mozilla-beta-updates | 0 | 1326944392 | buildbot-master08.build.sjc1.mozilla.com:/builds/buildbot/build1/master | pid11990-boot1326487464 | 0 | NULL | 1326928807 | NULL |
| 8054161 | 3310333 | release-mozilla-beta-updates | 0 | 1326928808 | buildbot-master08.build.sjc1.mozilla.com:/builds/buildbot/build1/master | pid11990-boot1326487464 | 1 | 2 | 1326928807 | 1326928841 |

mysql> select * from buildsets where id in (3310332, 3310333);
| id | external_idstring | reason | sourcestampid | submitted_at | complete | complete_at | results |
| 3310332 | NULL | downstream | 2663870 | 1326928807 | 0 | NULL | NULL |
| 3310333 | NULL | downstream | 2663871 | 1326928807 | 1 | 1326929131 | 2 |

mysql> select * from sourcestamps where id in (2663870, 2663871);
| id | branch | revision | patchid | repository | project |
| 2663870 | release-mozilla-beta-firefox_signing_done | NULL | NULL | | |
| 2663871 | release-mozilla-beta-firefox_signing_done | NULL | NULL | | |
Assignee
Comment 6•11 years ago

Other AggregatingScheduler instances (post updates, deliverables ready, post AV) didn't trigger the downstream builders twice... It would probably be better to fix bug 719261 and watch what happens in beta6.
Assignee
Comment 7•11 years ago

There was no such problem with 10.0b6. I think the root cause was the MySQL problems (bug 719010). I also believe that the patch from bug 719261 should fix the problem if AggregatingScheduler._run() is run simultaneously.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Comment 8•11 years ago
3.6.26 worked fine as well
Assignee
Comment 9•11 years ago

3.6.26 build2

grep AggregatingScheduler twistd.log*
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=441124336>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=441125272>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=46912572231264>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=313373040>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=691277552>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=441126712>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=46912572230832>: reloaded
twistd.log.5:2012-01-28 22:33:03-0800 [Broker,2667,10.2.90.51] AggregatingScheduler <id=313373040>: reset state: {'remainingBuilders': ['release-mozilla-1.9.2-signing_done'], 'lastReset': 1327818783.7543161, 'lastCheck': 1327818783.7543161}
twistd.log.5:2012-01-28 22:33:03-0800 [Broker,2667,10.2.90.51] AggregatingScheduler <id=46912572230832>: reset state: {'remainingBuilders': ['release-mozilla-1.9.2-updates'], 'lastReset': 1327818783.756583, 'lastCheck': 1327818783.756583}
twistd.log.5:2012-01-28 22:33:03-0800 [Broker,2667,10.2.90.51] AggregatingScheduler <id=46912572231264>: reset state: {'remainingBuilders': ['release-mozilla-1.9.2-firefox_source', 'release-mozilla-1.9.2-linux_build', 'release-mozilla-1.9.2-linux_repack_complete', 'release-mozilla-1.9.2-win32_build', 'release-mozilla-1.9.2-win32_repack_complete', 'release-mozilla-1.9.2-macosx_build', 'release-mozilla-1.9.2-macosx_repack_complete', 'release-mozilla-1.9.2-updates'], 'lastReset': 1327818783.7584741, 'lastCheck': 1327818783.7584741}
twistd.log.5:2012-01-29 01:59:46-0800 [-] AggregatingScheduler <id=313373040>: new buildset: name=release-mozilla-1.9.2-firefox_signing_done, branch=release-mozilla-1.9.2-firefox_signing_done, ssid=2711412, builders: release-mozilla-1.9.2-linux_l10n_verification, release-mozilla-1.9.2-win32_l10n_verification, release-mozilla-1.9.2-macosx_l10n_verification, release-mozilla-1.9.2-updates
twistd.log.5:2012-01-29 01:59:47-0800 [-] AggregatingScheduler <id=313373040>: new buildset: name=release-mozilla-1.9.2-firefox_signing_done, branch=release-mozilla-1.9.2-firefox_signing_done, ssid=2711413, builders: release-mozilla-1.9.2-linux_l10n_verification, release-mozilla-1.9.2-win32_l10n_verification, release-mozilla-1.9.2-macosx_l10n_verification, release-mozilla-1.9.2-updates
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=905263152>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902428072>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=905262720>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902427208>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=905253448>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902427640>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902438424>: reloaded
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 11•11 years ago
I suspect that the second run() was called too fast, before the first create_buildset actually created the DB entries.
Attachment #592533 -
Flags: feedback?(catlee)
Assignee
Comment 12•11 years ago
s/frequency/interval/
Attachment #592533 -
Attachment is obsolete: true
Attachment #592533 -
Flags: feedback?(catlee)
Attachment #592534 -
Flags: feedback?(catlee)
Comment 13•11 years ago

Comment on attachment 592534 [details] [diff] [review]
AggregatingScheduler: don't check too fast

Review of attachment 592534 [details] [diff] [review]:

I generally don't like this approach, it doesn't really fix the problem but rather makes it less likely.

I suspect the problem here is that the check for self.working is done in _run() which is in a DB thread. So it's possible for run() to be called twice, creating two threads, both of which see self.working = False, and then they both set it to True.

Can we move the check for self.working and setting it to True in run()? Setting it back to False in _run() is fine.

Also, is this reproducible in staging? We have a staging mysql db we can use.
Attachment #592534 -
Flags: feedback?(catlee) → feedback-
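The check-then-set race catlee describes in comment 13 can be sketched in a few lines. This is a minimal illustration, not the actual buildbotcustom code: the class and method names are hypothetical, and a threading.Event stands in for the slow DB interaction. Because run() both checks and sets `working` in the single main thread before the worker starts, a second run() cannot slip through the window:

```python
import threading

class SchedulerSketch:
    """Sketch only: if the `working` check lived inside the threaded
    _run(), two threads could both observe False before either set it
    to True. Checking and setting in the single-threaded run() closes
    that window."""

    def __init__(self):
        self.working = False
        self.runs = 0
        self.release = threading.Event()  # holds _run open for the demo

    def run(self):
        # run() executes on one thread, so this check-then-set cannot
        # interleave with another run() call.
        if self.working:
            return None
        self.working = True
        t = threading.Thread(target=self._run)
        t.start()
        return t

    def _run(self):
        # stands in for the threaded DB interaction / create_buildset()
        try:
            self.release.wait()
            self.runs += 1
        finally:
            self.working = False
```

While the first "DB interaction" is still in flight, a second run() returns None instead of spawning a duplicate worker.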
Assignee
Comment 14•11 years ago

(In reply to Chris AtLee [:catlee] from comment #13)
> I generally don't like this approach, it doesn't really fix the problem but
> rather makes it less likely.

yeah.

> I suspect the problem here is that the check for self.working is done in
> _run() which is in a DB thread. So it's possible for run() to be called
> twice, creating two threads, both of which see self.working = False, and
> then they both set it to True.
>
> Can we move the check for self.working and setting it to True in run()?
> Setting it back to False in _run() is fine.

Good idea. Something like this?

> Also, is this reproducible in staging? We have a staging mysql db we can use.

It happens only under heavy load when run() is called too fast. :(
Attachment #592534 -
Attachment is obsolete: true
Attachment #592542 -
Flags: feedback?(catlee)
Updated•11 years ago
Attachment #592542 -
Flags: feedback?(catlee) → feedback+
Assignee
Comment 15•11 years ago

Comment on attachment 592542 [details] [diff] [review]
AggregatingScheduler: set self.working in run()

This should work (©), since all calls from _run() look synchronous. It passes minimal staging tests (check and reconfig). I'd like to use it for 11.0b1.
Attachment #592542 -
Flags: review?(catlee)
Updated•11 years ago
Attachment #592542 -
Flags: review?(catlee) → review+
Assignee
Comment 16•11 years ago
Comment on attachment 592542 [details] [diff] [review] AggregatingScheduler: set self.working in run() http://hg.mozilla.org/build/buildbotcustom/rev/48f2538d3c70
Attachment #592542 -
Flags: checked-in+
Comment 17•11 years ago

Landed in production with tonight's reconfig.
Assignee
Comment 18•11 years ago
Worked fine in 11.0b1
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Assignee
Comment 19•11 years ago
3.6.27 updates were triggered twice :/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 20•11 years ago

The same for 10.0.2esr.
Assignee
Comment 21•11 years ago

2012-02-16 16:21:47-0800 [-] AggregatingScheduler <id=46913060110208>: new buildset: name=release-mozilla-esr10-firefox_signing_done, branch=release-mozilla-esr10-firefox_signing_done, ssid=2795834, builders: release-mozilla-esr10-linux_l10n_verification, release-mozilla-esr10-linux64_l10n_verification, release-mozilla-esr10-win32_l10n_verification, release-mozilla-esr10-macosx64_l10n_verification, release-mozilla-esr10-updates
2012-02-16 16:21:53-0800 [-] AggregatingScheduler <id=46913060110208>: new buildset: name=release-mozilla-esr10-firefox_signing_done, branch=release-mozilla-esr10-firefox_signing_done, ssid=2795835, builders: release-mozilla-esr10-linux_l10n_verification, release-mozilla-esr10-linux64_l10n_verification, release-mozilla-esr10-win32_l10n_verification, release-mozilla-esr10-macosx64_l10n_verification, release-mozilla-esr10-updates
Comment 22•11 years ago

I have a /theory/ that the use of self.working as a poor-man's Lock is still too racy for our uses here. Passing False to acquire does not mark it as owned by the thread, so a release() on the thread _run() runs on should be ok.

With your most recent comment it's obvious it is running processRequest twice (two different ssid's).

This should not bitrot bug 728104.
Attachment #598217 -
Flags: review?(rail)
Comment 23•11 years ago
(actually attach the patch)
Attachment #598217 -
Attachment is obsolete: true
Attachment #598217 -
Flags: review?(rail)
Attachment #598218 -
Flags: review?(rail)
Assignee
Comment 24•11 years ago
CCing Dustin, maybe he has some brilliant ideas. :)
Comment 25•11 years ago

So, chatting with tomprince (a buildbot maintainer as well):

<tomprince> lock would probably work. But why not just set working=False in the callback chain..
<Callek> tomprince: its set as a finally in _run
<Callek> since if the _run() finishes its ok to re-enter
<Callek> we just can't reenter that
<tomprince> Yeah, but then you are acessing it from two different threads. (Same as how you are doing the locking, which seems strange)
<Callek> tomprince: something like |d.addBoth(lambda _: self.working=False)| ?
<tomprince> Yes, except you can't do assignemnts in lambda.
<tomprince> I guess you don't need to propagate a result since you don't care about one.
<Callek> ahh so then a def final(_): above it
<tomprince> yep.

So I'm going to write that up for us instead of using threading.
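The pattern tomprince suggests, resetting `working` from the deferred's callback chain on the main thread rather than from inside the threaded _run(), can be sketched without Twisted. MiniDeferred below is a tiny hypothetical stand-in for a real Deferred (its add_both mimics d.addBoth, firing on success or failure), and the scheduler names are illustrative only:

```python
class MiniDeferred:
    """Stand-in for a Twisted Deferred: callbacks registered with
    add_both() fire whether the wrapped operation succeeds or fails,
    like Deferred.addBoth()."""
    def __init__(self):
        self._both = []
    def add_both(self, fn):
        self._both.append(fn)
    def fire(self, result):
        # In real Twisted the reactor fires this on the main thread
        # once the threaded DB interaction completes.
        for fn in self._both:
            result = fn(result)

class Scheduler:
    def __init__(self):
        self.working = False
    def run(self):
        if self.working:
            return None
        self.working = True
        d = MiniDeferred()  # would wrap the threaded DB interaction
        def final(res):
            # a named function, since lambdas can't contain assignments
            self.working = False
            return res
        d.add_both(final)
        return d
```

Both the set and the reset of `working` now happen on the main thread, so no cross-thread flag handling is needed.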
Comment 26•11 years ago

So in the old scheduler implementation, it wasn't very clear what things were and weren't run in a thread. However, note that deferreds can *only* be used in the main thread, so http://hg.mozilla.org/build/buildbotcustom/file/6fe26978479a/scheduler.py#l416 is not threaded. And _run and anything it calls executes in a thread, so it is synchronous. I don't see any violations of these constraints.

Callek, the locking that you've added attempts to acquire the lock in one thread, and release it in another -- not a good plan. Also, since you're using acquire(False), it amounts to just checking a boolean flag. In other words, what you've implemented is (a) thread-unsafe in form but (b) degenerates into the self.working check that's already implemented, so it won't fail. In fact, setting self.working in the main thread and then unsetting it in the thread doesn't make a lot of sense (but still isn't a bug). Rather, run() should set the flag, and then use d.addBoth to un-set it when the DB interaction is complete.

In comment 21, the two buildsets are created 6 seconds apart. Finally, schedulers are run in a buildbot.util.loop.Loop, so no more than one run() method on a scheduler will be executing at any time. All things considered, I think that the race condition you're looking at is a false lead.

However, I do see a race condition between _trigger and _run -- both try to load, modify, and write back out the state. If these happen in the following order:

_run: get_initial_state
_trigger: get_initial_state
_run: set_state
_trigger: set_state

then the state changes made by _run will be overwritten by _trigger's set_state, and presumably the next invocation of _run will follow the same pattern as the previous, and re-submit the buildset. Since _run and _trigger are both run in threads, a lock is not a bad idea.

Incidentally, in newer versions of Buildbot all code that runs in threads is annotated as such with 'thd' somewhere in its function name. That can help disambiguate the two contexts.
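The load-modify-store race dustin describes can be made concrete with a small sketch. This is an illustration under assumed names (StateStore and mark_done are hypothetical, not the buildbotcustom API): two threads each read the state, modify their own copy, and write it back, so without a lock the later write can discard the earlier one. Holding a lock across the whole read-modify-write makes the update atomic:

```python
import threading

class StateStore:
    """Sketch of the _trigger/_run race: each caller does
    get_state -> modify copy -> set_state. The lock ensures no two
    read-modify-write cycles interleave, so no update is lost."""

    def __init__(self):
        self._state = {"remainingBuilders": ["updates", "l10n"]}
        self.lock = threading.Lock()

    def mark_done(self, builder):
        # Without the lock, two threads could both read the original
        # list and each write back a copy missing only its own builder.
        with self.lock:
            state = dict(self._state)            # get_state()
            remaining = list(state["remainingBuilders"])
            if builder in remaining:
                remaining.remove(builder)
            state["remainingBuilders"] = remaining
            self._state = state                  # set_state()
```

With the lock in place, concurrent mark_done calls for different builders always leave the state reflecting both removals.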
Comment 27•11 years ago
Attachment #598218 -
Attachment is obsolete: true
Attachment #598218 -
Flags: review?(rail)
Attachment #598281 -
Flags: review?(rail)
Attachment #598281 -
Flags: feedback?(dustin)
Updated•11 years ago
Attachment #598281 -
Flags: feedback?(dustin) → feedback+
Assignee
Updated•11 years ago
Attachment #598281 -
Flags: review?(rail) → review+
Comment 28•11 years ago

Comment on attachment 598281 [details] [diff] [review]
try to do better.

http://hg.mozilla.org/build/buildbotcustom/rev/dbee78ea932f

Note to buildduty: symptoms of this blowing up would be the scheduler/db locking and not actually running (either all schedulers across trees, or just AggregatingSchedulers in a release). At that point I would recommend backing out to get stuff in a good shape, then looking in twistd.log for information relating to AggregatingScheduler (specifically locking) and following up here. I don't actually *expect* this to blow up, just want to be thorough.
Attachment #598281 -
Flags: checked-in+
Assignee
Comment 29•11 years ago
Bustage fix: http://hg.mozilla.org/build/buildbotcustom/rev/e56221a93eff
Assignee
Comment 30•11 years ago
"Should" be fixed. I'll reopen if something goes wrong.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 31•11 years ago

We got two jobs per update_verify builder for 11.0b4, does that count?
Assignee
Comment 32•11 years ago

(In reply to Nick Thomas [:nthomas] from comment #31)
> We got two jobs per update_verify builder for 11.0b4, does that count ?

I believe so. :/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 33•11 years ago

2012-02-22 16:51:42-0800 [-] AggregatingScheduler(release-mozilla-beta-updates_done) <id=46912811128160>: new buildset: branch=release-mozilla-beta-updates_done, ssid=2821970, builders: release-mozilla-beta-linux_update_verify_1/3, release-mozilla-beta-linux_update_verify_2/3, release-mozilla-beta-linux_update_verify_3/3, release-mozilla-beta-linux64_update_verify_1/3, release-mozilla-beta-linux64_update_verify_2/3, release-mozilla-beta-linux64_update_verify_3/3, release-mozilla-beta-macosx64_update_verify_1/3, release-mozilla-beta-macosx64_update_verify_2/3, release-mozilla-beta-macosx64_update_verify_3/3, release-mozilla-beta-win32_update_verify_1/3, release-mozilla-beta-win32_update_verify_2/3, release-mozilla-beta-win32_update_verify_3/3
2012-02-22 16:51:44-0800 [-] AggregatingScheduler(release-mozilla-beta-updates_done) <id=46912811128160>: new buildset: branch=release-mozilla-beta-updates_done, ssid=2821971, builders: release-mozilla-beta-linux_update_verify_1/3, release-mozilla-beta-linux_update_verify_2/3, release-mozilla-beta-linux_update_verify_3/3, release-mozilla-beta-linux64_update_verify_1/3, release-mozilla-beta-linux64_update_verify_2/3, release-mozilla-beta-linux64_update_verify_3/3, release-mozilla-beta-macosx64_update_verify_1/3, release-mozilla-beta-macosx64_update_verify_2/3, release-mozilla-beta-macosx64_update_verify_3/3, release-mozilla-beta-win32_update_verify_1/3, release-mozilla-beta-win32_update_verify_2/3, release-mozilla-beta-win32_update_verify_3/3
Comment 34•11 years ago

(In reply to Rail Aliiev [:rail] from comment #33)
...

How's the twistd log look about Aggregating* here?
Assignee
Comment 35•11 years ago
Assignee
Comment 36•11 years ago
I suspect that something wrong may be with lastCheck... I added more logging and reset state['lastCheck'] to n after creating a new buildset.
Attachment #600982 -
Flags: review?(bhearsum)
Updated•11 years ago
Attachment #600982 -
Flags: review?(bhearsum) → review+
Assignee
Comment 37•11 years ago
Comment on attachment 600982 [details] [diff] [review] moar logging http://hg.mozilla.org/build/buildbotcustom/rev/40b591eb6656
Attachment #600982 -
Flags: checked-in+
Comment 38•11 years ago
I think the problem was that lastCheck ended up being before the complete_at value from the build. The next time the scheduler was run, it was possible that it would discover the same completed upstream build as before, and so trigger new builds again. This patch sets lastCheck to the latest of the completed builds we find, so we're guaranteed not to find the same ones again.
Attachment #601308 -
Flags: review?(rail)
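The fix catlee describes in comment 38 can be illustrated with a small sketch. This is not the actual patch; the function and field names are hypothetical. The key point is that lastCheck advances to the newest complete_at actually seen, so a later poll cannot rediscover the same completed builds and re-trigger downstreams:

```python
def find_new_completions(builds, last_check):
    """Return builds completed after last_check, and the advanced
    last_check. Advancing to the max complete_at seen guarantees the
    same completed builds are never returned twice."""
    new = [b for b in builds if b["complete_at"] > last_check]
    if new:
        last_check = max(b["complete_at"] for b in new)
    return new, last_check
```

If lastCheck were instead set to the time the scheduler ran (which can lag complete_at, as this bug showed), a subsequent poll over the same rows would find them "new" again.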
Assignee
Updated•11 years ago
Attachment #601308 -
Flags: review?(rail) → review+
Updated•11 years ago
Attachment #601308 -
Flags: checked-in+
Assignee
Comment 39•11 years ago

I hope that catlee's patch fixes it (it worked fine in 11.0b5). I'll reopen if something goes wrong.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Priority: P1 → P5
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → Release Engineering