Closed Bug 719260 Opened 8 years ago Closed 8 years ago

Investigate why updates builder triggered twice for 10.0b5

Categories

(Release Engineering :: General, defect, P5, major)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

References

Details

(Whiteboard: [releases][automation][updates])

Attachments

(5 files, 4 obsolete files)

+ l10n verify builders
2012-01-18 15:10:01-0800 [HTTPPageGetter,client] Stopping factory <HTTPClientFactory: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log>
2012-01-18 15:20:01-0800 [-] Starting factory <HTTPClientFactory: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log>
2012-01-18 15:20:02-0800 [HTTPPageGetter,client] Stopping UrlPoller for http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log
2012-01-18 15:20:02-0800 [HTTPPageGetter,client] adding change, who http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log, 0 files, rev=None, branch=release-mozilla-beta-post_signing, repository=, comments success, category None, project 
2012-01-18 15:20:02-0800 [HTTPPageGetter,client] Stopping factory <HTTPClientFactory: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/10.0b5-candidates/build1//win32_signing_build1.log>
|        3743 | release-mozilla-beta-updates_done               | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-updates"], "lastReset": 1326902444.759733, "lastCheck": 1326932071.5622721}      | 
|        3746 | release-mozilla-beta-firefox_deliverables_ready | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-updates"], "lastReset": 1326902444.87379, "lastCheck": 1326932071.5725319}       | 
|        3752 | release-mozilla-beta-firefox_signing_done       | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-signing_done"], "lastReset": 1326902444.570888, "lastCheck": 1326932071.6682479} | 
|        3755 | release-mozilla-beta-av_done                    | buildbotcustom.scheduler.AggregatingScheduler | {"remainingBuilders": ["release-mozilla-beta-antivirus"], "lastReset": 1326902445.1268611, "lastCheck": 1326932071.6440589}   |
2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates> using slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave20'>
2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates>.. pinging the slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave20'>
2012-01-18 15:20:07-0800 [-] sending ping
2012-01-18 15:20:07-0800 [Broker,1603,10.250.49.168] ping finished: success
2012-01-18 15:20:07-0800 [Broker,1603,10.250.49.168] <Build release-mozilla-beta-updates>.startBuild


2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates> using slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave10'>
2012-01-18 15:20:07-0800 [-] starting build <Build release-mozilla-beta-updates>.. pinging the slave <SlaveBuilder builder='release-mozilla-beta-updates' slave='mv-moz2-linux-ix-slave10'>
2012-01-18 15:20:07-0800 [-] sending ping
2012-01-18 15:20:07-0800 [Broker,1733,10.250.49.158] ping finished: success
2012-01-18 15:20:07-0800 [Broker,1733,10.250.49.158] <Build release-mozilla-beta-updates>.startBuild
Maybe it's AggregatingScheduler.run() which runs async and allows 2 parallel checks? Running 
self.parent.db.runInteractionNow(self._run) should help in this case.

I've never hit this issue in staging which uses sqlite...
FTR,

mysql> select * from buildrequests where buildername='release-mozilla-beta-updates' limit 41,2;
+---------+------------+------------------------------+----------+------------+-------------------------------------------------------------------------+-------------------------+----------+---------+--------------+-------------+
| id      | buildsetid | buildername                  | priority | claimed_at | claimed_by_name                                                         | claimed_by_incarnation  | complete | results | submitted_at | complete_at |
+---------+------------+------------------------------+----------+------------+-------------------------------------------------------------------------+-------------------------+----------+---------+--------------+-------------+
| 8054156 |    3310332 | release-mozilla-beta-updates |        0 | 1326944392 | buildbot-master08.build.sjc1.mozilla.com:/builds/buildbot/build1/master | pid11990-boot1326487464 |        0 |    NULL |   1326928807 |        NULL | 
| 8054161 |    3310333 | release-mozilla-beta-updates |        0 | 1326928808 | buildbot-master08.build.sjc1.mozilla.com:/builds/buildbot/build1/master | pid11990-boot1326487464 |        1 |       2 |   1326928807 |  1326928841 | 
+---------+------------+------------------------------+----------+------------+-------------------------------------------------------------------------+-------------------------+----------+---------+--------------+-------------+




mysql> select * from buildsets where id in (3310332, 3310333);
+---------+-------------------+------------+---------------+--------------+----------+-------------+---------+
| id      | external_idstring | reason     | sourcestampid | submitted_at | complete | complete_at | results |
+---------+-------------------+------------+---------------+--------------+----------+-------------+---------+
| 3310332 | NULL              | downstream |       2663870 |   1326928807 |        0 |        NULL |    NULL | 
| 3310333 | NULL              | downstream |       2663871 |   1326928807 |        1 |  1326929131 |       2 | 
+---------+-------------------+------------+---------------+--------------+----------+-------------+---------+





mysql> select * from sourcestamps where id in (2663870, 2663871);
+---------+-------------------------------------------+----------+---------+------------+---------+
| id      | branch                                    | revision | patchid | repository | project |
+---------+-------------------------------------------+----------+---------+------------+---------+
| 2663870 | release-mozilla-beta-firefox_signing_done | NULL     |    NULL |            |         | 
| 2663871 | release-mozilla-beta-firefox_signing_done | NULL     |    NULL |            |         | 
+---------+-------------------------------------------+----------+---------+------------+---------+
Other AggregatingSheduler instances (post updates, deliverables ready, post AV) didn't trigger the downstream builders twice... Probably it would be better to fix bug 719261 and watch what happens in beta6.
There was no such problem 10.0b6. I think the root clause of the problem was problems with mysql (bug 719010). I also believe that the patch from bug 719261 should fix the problem if  AggregatingScheduler._run() is run simultaneously.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
3.6.26 worked fine as well
3.6.26 build2
grep AggregatingScheduler twistd.log*
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=441124336>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=441125272>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=46912572231264>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=313373040>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=691277552>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=441126712>: reloaded
twistd.log.5:2012-01-28 22:33:02-0800 [-] AggregatingScheduler <id=46912572230832>: reloaded
twistd.log.5:2012-01-28 22:33:03-0800 [Broker,2667,10.2.90.51] AggregatingScheduler <id=313373040>: reset state: {'remainingBuilders': ['release-mozilla-1.9.2-signing_done'], 'lastReset': 1327818783.7543161, 'lastCheck': 1327818783.7543161}
twistd.log.5:2012-01-28 22:33:03-0800 [Broker,2667,10.2.90.51] AggregatingScheduler <id=46912572230832>: reset state: {'remainingBuilders': ['release-mozilla-1.9.2-updates'], 'lastReset': 1327818783.756583, 'lastCheck': 1327818783.756583}
twistd.log.5:2012-01-28 22:33:03-0800 [Broker,2667,10.2.90.51] AggregatingScheduler <id=46912572231264>: reset state: {'remainingBuilders': ['release-mozilla-1.9.2-firefox_source', 'release-mozilla-1.9.2-linux_build', 'release-mozilla-1.9.2-linux_repack_complete', 'release-mozilla-1.9.2-win32_build', 'release-mozilla-1.9.2-win32_repack_complete', 'release-mozilla-1.9.2-macosx_build', 'release-mozilla-1.9.2-macosx_repack_complete', 'release-mozilla-1.9.2-updates'], 'lastReset': 1327818783.7584741, 'lastCheck': 1327818783.7584741}
twistd.log.5:2012-01-29 01:59:46-0800 [-] AggregatingScheduler <id=313373040>: new buildset: name=release-mozilla-1.9.2-firefox_signing_done, branch=release-mozilla-1.9.2-firefox_signing_done, ssid=2711412, builders: release-mozilla-1.9.2-linux_l10n_verification, release-mozilla-1.9.2-win32_l10n_verification, release-mozilla-1.9.2-macosx_l10n_verification, release-mozilla-1.9.2-updates
twistd.log.5:2012-01-29 01:59:47-0800 [-] AggregatingScheduler <id=313373040>: new buildset: name=release-mozilla-1.9.2-firefox_signing_done, branch=release-mozilla-1.9.2-firefox_signing_done, ssid=2711413, builders: release-mozilla-1.9.2-linux_l10n_verification, release-mozilla-1.9.2-win32_l10n_verification, release-mozilla-1.9.2-macosx_l10n_verification, release-mozilla-1.9.2-updates
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=905263152>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902428072>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=905262720>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902427208>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=905253448>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902427640>: reloaded
twistd.log.6:2012-01-28 22:29:43-0800 [-] AggregatingScheduler <id=902438424>: reloaded
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Duplicate of this bug: 722182
I suspect that the second run() was called too fast, before the first create_buildset actually created the DB entries.
Attachment #592533 - Flags: feedback?(catlee)
s/frequency/interval/
Attachment #592533 - Attachment is obsolete: true
Attachment #592533 - Flags: feedback?(catlee)
Attachment #592534 - Flags: feedback?(catlee)
Comment on attachment 592534 [details] [diff] [review]
AggregatingScheduler: don't check too fast

Review of attachment 592534 [details] [diff] [review]:
-----------------------------------------------------------------

I generally don't like this approach, it doesn't really fix the problem but rather makes it less likely.

I suspect the problem here is that the check for self.working is done in _run() which is in a DB thread. So it's possible for run() to be called twice, creating two threads, both of which see self.working = False, and then they both set it to True.

Can we move the check for self.working and setting it to True in run()? Setting it back to False in _run() is fine.

Also, is this reproducible in staging? We have a staging mysql db we can use.
Attachment #592534 - Flags: feedback?(catlee) → feedback-
(In reply to Chris AtLee [:catlee] from comment #13)
> I generally don't like this approach, it doesn't really fix the problem but
> rather makes it less likely.

yeah.

> I suspect the problem here is that the check for self.working is done in
> _run() which is in a DB thread. So it's possible for run() to be called
> twice, creating two threads, both of which see self.working = False, and
> then they both set it to True.
> 
> Can we move the check for self.working and setting it to True in run()?
> Setting it back to False in _run() is fine.

Good idea. Something like this?

> Also, is this reproducible in staging? We have a staging mysql db we can use.

It happens only under heavy load when run() is called to fast. :(
Attachment #592534 - Attachment is obsolete: true
Attachment #592542 - Flags: feedback?(catlee)
Attachment #592542 - Flags: feedback?(catlee) → feedback+
Comment on attachment 592542 [details] [diff] [review]
AggregatingScheduler: set self.working in run()

This should work (©), since all calls from _run() look synchronous. IT passes minimal staging tests (check and reconfig). I'd like to use it for 11.0b1.
Attachment #592542 - Flags: review?(catlee)
Attachment #592542 - Flags: review?(catlee) → review+
Comment on attachment 592542 [details] [diff] [review]
AggregatingScheduler: set self.working in run()

http://hg.mozilla.org/build/buildbotcustom/rev/48f2538d3c70
Attachment #592542 - Flags: checked-in+
landed in production with tonights reconfig
Worked fine in 11.0b1
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
3.6.27 updates were triggered twice :/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
the same for 10.0.2esr
2012-02-16 16:21:47-0800 [-] AggregatingScheduler <id=46913060110208>: new buildset: name=release-mozilla-esr10-firefox_signing_done, branch=release-mozilla-esr10-firefox_signing_done, ssid=2795834, builders: release-mozilla-esr10-linux_l10n_verification, release-mozilla-esr10-linux64_l10n_verification, release-mozilla-esr10-win32_l10n_verification, release-mozilla-esr10-macosx64_l10n_verification, release-mozilla-esr10-updates

2012-02-16 16:21:53-0800 [-] AggregatingScheduler <id=46913060110208>: new buildset: name=release-mozilla-esr10-firefox_signing_done, branch=release-mozilla-esr10-firefox_signing_done, ssid=2795835, builders: release-mozilla-esr10-linux_l10n_verification, release-mozilla-esr10-linux64_l10n_verification, release-mozilla-esr10-win32_l10n_verification, release-mozilla-esr10-macosx64_l10n_verification, release-mozilla-esr10-updates
Attached patch use Lock (obsolete) — Splinter Review
I have a /theory/ that the use of self.working as a poor-mans-Lock is still too racy for our uses here.

Passing False to aquire does not mark it as owned by the thread, so a release() on the thread _run() runs on should be ok. 

With your most recent comment its obvious it is running processRequest twice (two different ssid's)

This should not bitrot 728104
Attachment #598217 - Flags: review?(rail)
Attached patch use Lock (obsolete) — Splinter Review
(actually attach the patch)
Attachment #598217 - Attachment is obsolete: true
Attachment #598217 - Flags: review?(rail)
Attachment #598218 - Flags: review?(rail)
CCing Dustin, maybe he has some brilliant ideas. :)
So, chatting with tomprince (a buildbot maintainer as well)

<tomprince> lock would probably work. But why not just set working=False in the callback chain..
<Callek> tomprince: its set as a finally in _run
<Callek> since if the _run() finishes its ok to re-enter
<Callek> we just can't reenter that
<tomprince> Yeah, but then you are acessing it from two different threads. (Same as how you are doing the locking, which seems strange)
<Callek> tomprince: something like |d.addBoth(lambda _: self.working=False)| ?
<tomprince> Yes, except you can't do assignemnts in lambda.
<tomprince> I guess you don't need to propagate a result since you don't care about one.
<Callek> ahh so then a def final(_): above it
<tomprince> yep.

So I'm going to write that up for us instead of using threading.
So in the old scheduler implementation, it wasn't very clear what things were and weren't run in a thread.  However, note that deferreds can *only* be used in the main thread, so http://hg.mozilla.org/build/buildbotcustom/file/6fe26978479a/scheduler.py#l416 is not threaded.  And _run and anything it calls executes in a thread, so it is synchronous.  I don't see any violations of these constraints.

Callek, the locking that you've added attempts to acquire the lock in one thread, and release it in another -- not a good plan.  Also, since you're using acquire(False), it amounts to just checking a boolean flag.  In other words, what you've implemented is (a) thread-unsafe in form but (b) degenerates into the self.working check that's already implemented, so it won't fail.

In fact, setting self.working in the main thread and then unsetting it in the thread doesn't make a lot of sense (but still isn't a bug).  Rather, run() should set the flag, and then use d.addBoth to un-set it when the DB interaction is complete.

In comment 21, the two buildsets are created 6 seconds apart. 

Finally, schedulers are run in a buildbot.util.loop.Loop, so no more than one run() method on a scheduler will be executing at any time.

All things considered, I think that the race condition you're looking at is a false lead.

However, I do see a race condition between _trigger and _run -- both try to load, modify, and write back out the state.  If these happen in the following order:
  _run: get_initial_state
  _trigger: get_initial_state
  _run: set_state
  _trigger: set_state
then the state changes made by _run will be overwritten by _trigger's set_state, and presumably the next invocation of _run will follow the same pattern as the previous, and re-submit the buildset.

Since _run and _trigger are both run in threads, a lock is not a bad idea.

Incidentally, in newer versions of Buildbot all code that runs in threads is annotated as such with 'thd' somewhere in its function name.  That can help disambiguate the two contexts.
Attachment #598218 - Attachment is obsolete: true
Attachment #598218 - Flags: review?(rail)
Attachment #598281 - Flags: review?(rail)
Attachment #598281 - Flags: feedback?(dustin)
Attachment #598281 - Flags: feedback?(dustin) → feedback+
Attachment #598281 - Flags: review?(rail) → review+
Comment on attachment 598281 [details] [diff] [review]
try to do better.

http://hg.mozilla.org/build/buildbotcustom/rev/dbee78ea932f

Note to buildduty: Symptoms of this blowing up would be scheduler/db locking and not actually running. (Either all schedulers across trees or just AggregatingSchedulers in a release) And at which point I would recommend backing out, to get stuff in a good shape and then looking in twistd.log for information relating to AggregatingScheduler (specifically locking) And following up here.

I don't actually *expect* this to blow up, just want to be thorough
Attachment #598281 - Flags: checked-in+
"Should" be fixed. I'll reopen if something goes wrong.
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
We got two jobs per update_verify builder for 11.0b4, does that count ?
(In reply to Nick Thomas [:nthomas] from comment #31)
> We got two jobs per update_verify builder for 11.0b4, does that count ?

I believe so. :/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
2012-02-22 16:51:42-0800 [-] AggregatingScheduler(release-mozilla-beta-updates_done) <id=46912811128160>: new buildset: branch=release-mozilla-beta-updates_done, ssid=2821970, builders: release-mozilla-beta-linux_update_verify_1/3, release-mozilla-beta-linux_update_verify_2/3, release-mozilla-beta-linux_update_verify_3/3, release-mozilla-beta-linux64_update_verify_1/3, release-mozilla-beta-linux64_update_verify_2/3, release-mozilla-beta-linux64_update_verify_3/3, release-mozilla-beta-macosx64_update_verify_1/3, release-mozilla-beta-macosx64_update_verify_2/3, release-mozilla-beta-macosx64_update_verify_3/3, release-mozilla-beta-win32_update_verify_1/3, release-mozilla-beta-win32_update_verify_2/3, release-mozilla-beta-win32_update_verify_3/3

2012-02-22 16:51:44-0800 [-] AggregatingScheduler(release-mozilla-beta-updates_done) <id=46912811128160>: new buildset: branch=release-mozilla-beta-updates_done, ssid=2821971, builders: release-mozilla-beta-linux_update_verify_1/3, release-mozilla-beta-linux_update_verify_2/3, release-mozilla-beta-linux_update_verify_3/3, release-mozilla-beta-linux64_update_verify_1/3, release-mozilla-beta-linux64_update_verify_2/3, release-mozilla-beta-linux64_update_verify_3/3, release-mozilla-beta-macosx64_update_verify_1/3, release-mozilla-beta-macosx64_update_verify_2/3, release-mozilla-beta-macosx64_update_verify_3/3, release-mozilla-beta-win32_update_verify_1/3, release-mozilla-beta-win32_update_verify_2/3, release-mozilla-beta-win32_update_verify_3/3
(In reply to Rail Aliiev [:rail] from comment #33)

... How's the twistd log look about Aggregating* here?
Attached patch moar loggingSplinter Review
I suspect that something wrong may be with lastCheck...

I added more logging and reset state['lastCheck'] to n after creating a new buildset.
Attachment #600982 - Flags: review?(bhearsum)
Attachment #600982 - Flags: review?(bhearsum) → review+
I think the problem was that lastCheck ended up being before the complete_at value from the build. The next time the scheduler was run, it was possible that it would discover the same completed upstream build as before, and so trigger new builds again.

This patch sets lastCheck to the latest of the completed builds we find, so we're guaranteed not to find the same ones again.
Attachment #601308 - Flags: review?(rail)
Attachment #601308 - Flags: review?(rail) → review+
Attachment #601308 - Flags: checked-in+
I hope that catlee's patches fixes it (worked fine in 11.0b5). Will reopen if something goes wrong.
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Priority: P1 → P5
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.