Closed
Bug 826707
Opened 13 years ago
Closed 12 years ago
sometimes aggregatingscheduler double-triggers jobs
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: catlee)
Details
(Whiteboard: [schedulers][buildbot])
Attachments
(2 files)
3.96 KB, patch | bhearsum: review+ | catlee: checked-in+
1.70 KB, patch | bhearsum: review+ | catlee: checked-in+
We hit this in Firefox 10.0.12esr. I originally thought it might be bug 811708, but it looks like that bug has symptoms that involve needing to manually fix up the database. In this case, we had updates run, succeed, and then two jobs for each update verify builder were triggered. The scheduler master had no useful logs, but I found this on bm30 (where updates ran, and where two jobs of the same update verify builder ran):
2013-01-03 15:14:58-0800 [-] <Build release-mozilla-esr10-updates>: build finished
2013-01-03 15:15:01-0800 [-] AggregatingScheduler(release-mozilla-esr10-updates_done) <id=655871224>: new builds: ((u'release-mozilla-esr10-updates', 19172862L, 1357254901L),) since 1357233899.56
2013-01-03 15:15:01-0800 [-] AggregatingScheduler(release-mozilla-esr10-updates_done) <id=655871224>: new buildset: branch=releases/mozilla-esr10, ssid=4408660, builders: release-mozilla-esr10-linux_update_verify_1/4, release-mozilla-esr10-linux_update_verify_2/4, release-mozilla-esr10-linux_update_verify_3/4, release-mozilla-esr10-linux_update_verify_4/4, release-mozilla-esr10-linux64_update_verify_1/4, release-mozilla-esr10-linux64_update_verify_2/4, release-mozilla-esr10-linux64_update_verify_3/4, release-mozilla-esr10-linux64_update_verify_4/4, release-mozilla-esr10-macosx64_update_verify_1/4, release-mozilla-esr10-macosx64_update_verify_2/4, release-mozilla-esr10-macosx64_update_verify_3/4, release-mozilla-esr10-macosx64_update_verify_4/4, release-mozilla-esr10-win32_update_verify_1/4, release-mozilla-esr10-win32_update_verify_2/4, release-mozilla-esr10-win32_update_verify_3/4, release-mozilla-esr10-win32_update_verify_4/4
2013-01-03 15:15:01-0800 [-] AggregatingScheduler(release-mozilla-esr10-updates_done) <id=655871224>: get_initial_state()
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4> using slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave15'>
2013-01-03 15:15:01-0800 [-] acquireLocks(slave <BuildSlave 'mv-moz2-linux-ix-slave15'>, locks [])
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4>.. pinging the slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave15'>
2013-01-03 15:15:01-0800 [-] sending ping
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4> using slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave04'>
2013-01-03 15:15:01-0800 [-] acquireLocks(slave <BuildSlave 'mv-moz2-linux-ix-slave04'>, locks [])
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4>.. pinging the slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave04'>
2013-01-03 15:15:01-0800 [-] sending ping
2013-01-03 15:15:01-0800 [Broker,25104,10.250.49.152] ping finished: success
2013-01-03 15:15:01-0800 [Broker,24884,10.250.49.163] ping finished: success
2013-01-03 15:15:01-0800 [Broker,25104,10.250.49.152] <Build release-mozilla-esr10-linux_update_verify_4/4>.startBuild
2013-01-03 15:15:01-0800 [Broker,24884,10.250.49.163] <Build release-mozilla-esr10-linux_update_verify_4/4>.startBuild
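The double trigger shows up in the log above as two "starting build ... using slave" lines for the same builder within the same second. A minimal sketch of how such pairs could be pulled out of a master log follows. The regular expression and the function name `find_double_starts` are illustrative assumptions, not part of any Buildbot tooling, and the sample lines are copied from the excerpt above.

```python
import re
from collections import defaultdict

# Matches lines of the form seen in the excerpt above:
#   <timestamp><tz> [-] starting build <Build NAME> using slave
#   <SlaveBuilder builder='NAME' slave='SLAVE'>
# The ".. pinging the slave" lines are deliberately not matched.
START_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[-+]\d{4} \[-\] "
    r"starting build <Build (?P<builder>[^>]+)> using slave "
    r"<SlaveBuilder builder='[^']+' slave='(?P<slave>[^']+)'>"
)


def find_double_starts(lines):
    """Return {(timestamp, builder): [slaves]} for builders started
    more than once within the same second."""
    starts = defaultdict(list)
    for line in lines:
        match = START_RE.match(line)
        if match:
            key = (match.group("ts"), match.group("builder"))
            starts[key].append(match.group("slave"))
    return {key: slaves for key, slaves in starts.items() if len(slaves) > 1}


sample = [
    "2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4> using slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave15'>",
    "2013-01-03 15:15:01-0800 [-] sending ping",
    "2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4> using slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave04'>",
]
doubles = find_double_starts(sample)
```

Grouping by timestamp and builder name is a heuristic: it flags candidates for inspection rather than proving that a scheduler fired twice.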
Comment 1 • 13 years ago • Reporter
Not sure if this is expected or not, but I don't see any "new builds" message for release-comm-esr10-updates.
Comment 2 • 13 years ago • Assignee
Does it actually run the build twice on the slave? It looks like the duplicate builds happened on the same slave. I've seen this on regular builds on my staging masters at times.
Updated • 13 years ago • Assignee
Assignee: nobody → catlee
Comment 3 • 13 years ago • Assignee
It looks like the aggregating schedulers are running on more than one master. In particular, they're active on bm36 and bm30.
This is not a Good Thing.
Comment 4 • 13 years ago • Reporter
(In reply to Chris AtLee [:catlee] from comment #3)
> It looks like the aggregating schedulers are running on more than one
> master. In particular, they're active on bm36 and bm30.
>
> This is not a Good Thing.
Ouch! That could certainly cause this!!
Comment 5 • 13 years ago • Reporter
Sounds like we need a special case for AggregatingScheduler here:
https://github.com/mozilla/build-buildbot-configs/blob/master/mozilla/builder_master.cfg#L150
Comment 6 • 13 years ago • Assignee
(In reply to Ben Hearsum [:bhearsum] from comment #5)
> Sounds like we need a special case for AggregatingScheduler here:
> https://github.com/mozilla/build-buildbot-configs/blob/master/mozilla/builder_master.cfg#L150
Yup, for sure. AggregatingScheduler inherits from Triggerable, so they're being instantiated on all masters.
Comment 7 • 13 years ago • Assignee
(In reply to Chris AtLee [:catlee] from comment #6)
> (In reply to Ben Hearsum [:bhearsum] from comment #5)
> > Sounds like we need a special case for AggregatingScheduler here:
> > https://github.com/mozilla/build-buildbot-configs/blob/master/mozilla/builder_master.cfg#L150
>
> Yup, for sure. AggregatingScheduler inherits from Triggerable, so they're
> being instantiated on all masters.
Except we need them on all masters so that the reset_schedulers builders can trigger them...
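The inheritance problem described in comments 6 and 7 can be sketched in a few lines. The classes below are stand-ins rather than the real Buildbot or buildbotcustom classes, and the `isinstance` filter is only an assumption about how a builder master config might choose which schedulers to keep; the actual logic at builder_master.cfg#L150 is not reproduced here.

```python
class Scheduler:
    """Stand-in base class; not the real Buildbot scheduler."""

    def __init__(self, name):
        self.name = name


class Nightly(Scheduler):
    """Stand-in for a scheduler that should only run on scheduler masters."""


class Triggerable(Scheduler):
    """Stand-in for a scheduler that builders fire explicitly."""


class AggregatingScheduler(Triggerable):
    """Stand-in: triggerable, but it also watches for finished builds."""


def schedulers_for_builder_master(schedulers):
    # Hypothetical filter: keep anything a builder may need to trigger.
    # isinstance() also matches subclasses, so AggregatingScheduler passes.
    return [s for s in schedulers if isinstance(s, Triggerable)]


all_schedulers = [
    Nightly("nightly"),
    Triggerable("l10n_trigger"),
    AggregatingScheduler("release-mozilla-esr10-updates_done"),
]
kept = [s.name for s in schedulers_for_builder_master(all_schedulers)]
```

Under these assumptions `kept` includes the aggregating scheduler, so it would be instantiated on every builder master as well as on the scheduler master. Filtering it out is not an option either, because the reset_schedulers builders need to be able to trigger it.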
Comment 8 • 13 years ago • Reporter
A couple of bad ideas on how to fix this:
1) Put the reset scheduler builders on the scheduler masters. This would mean we need a slave (maybe a self-hosted one) attached to them, and maybe WebStatus so we can double check things.
2) Modify the AggregatingScheduler instances on the builder masters to have no builders to trigger.
Updated • 13 years ago • Assignee
Whiteboard: [schedulers][buildbot]
Comment 9 • 13 years ago • Assignee
This patch adds an enable_service flag to AggregatingScheduler. When it is True (the default), the regular scheduler polling is enabled. This is the part that monitors for completed builds and triggers new ones.
When it is False, only the triggerable part is active, so that the scheduler can still be reset by a builder.
Attachment #700736 -
Flags: review?(bhearsum)
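The patch itself is not included in this thread, so the following is only a rough sketch of the behaviour described in comment 9. The class `AggregatingSchedulerSketch`, its `polling` and `reset_count` attributes, and the method bodies are all hypothetical; only the `enable_service` flag name and the `startService`/`stopService` method names come from the bug.

```python
class AggregatingSchedulerSketch:
    """Hypothetical stand-in showing the effect of an enable_service flag."""

    def __init__(self, name, enable_service=True):
        self.name = name
        self.enable_service = enable_service
        self.polling = False   # True while watching for finished builds
        self.reset_count = 0   # how many times a builder has reset us

    def startService(self):
        # Only begin polling for completed builds when the flag is set.
        if self.enable_service:
            self.polling = True

    def stopService(self):
        # Harmless when polling never started.
        self.polling = False

    def trigger(self):
        # The triggerable part stays available regardless of the flag,
        # so a reset_schedulers builder can still reach this scheduler.
        self.reset_count += 1


sched = AggregatingSchedulerSketch(
    "release-mozilla-esr10-updates_done", enable_service=False
)
sched.startService()
sched.trigger()
sched.stopService()
```

With the flag off, the sketch never starts polling but still accepts the reset, which is the split between the two roles that comment 9 describes.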
Comment 10 • 13 years ago • Assignee
Set enable_service to False for AggregatingScheduler instances on build masters (but not on universal or scheduler masters).
Attachment #700737 -
Flags: review?(bhearsum)
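As a sketch of what that config change amounts to, assuming each master knows its own role: the role strings, the `apply_master_role` helper, and the stand-in class below are hypothetical and are not taken from the actual buildbot-configs patch.

```python
class AggregatingScheduler:
    """Stand-in carrying only the flag discussed in this bug."""

    def __init__(self, name, enable_service=True):
        self.name = name
        self.enable_service = enable_service


def apply_master_role(schedulers, role):
    """Disable polling on build masters; leave other roles untouched.

    `role` is assumed to be one of "build", "scheduler" or "universal".
    """
    for sched in schedulers:
        if isinstance(sched, AggregatingScheduler) and role == "build":
            sched.enable_service = False
    return schedulers


def flags_for(role):
    # Build a fresh scheduler list and report the resulting flag values.
    scheds = [AggregatingScheduler("release-mozilla-esr10-updates_done")]
    return [s.enable_service for s in apply_master_role(scheds, role)]


build_flags = flags_for("build")
scheduler_flags = flags_for("scheduler")
universal_flags = flags_for("universal")
```

Only the build master ends up with polling disabled, so exactly one kind of master is left watching for completed builds while every master can still accept the reset trigger.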
Comment 11 • 13 years ago • Reporter
Comment on attachment 700736
Add enable_service flag to Aggregating scheduler
Review of attachment 700736:
-----------------------------------------------------------------
Do we get any exceptions when stopService gets called at shutdown because the service was never started? If so, it'd be good to null that method out. Otherwise looks fine.
Attachment #700736 -
Flags: review?(bhearsum) → review+
Updated • 13 years ago • Reporter
Attachment #700737 -
Flags: review?(bhearsum) → review+
Comment 12 • 13 years ago • Assignee
(In reply to Ben Hearsum [:bhearsum] from comment #11)
> Comment on attachment 700736
> Add enable_service flag to Aggregating scheduler
>
> Review of attachment 700736:
> -----------------------------------------------------------------
>
> Do we get any exceptions when stopService gets called at shutdown because
> the service was never started? If so, it'd be good to null that method out.
> Otherwise looks fine.
Nope, I can't make it generate an exception when stopService is called, so I think it's safe.
Updated • 12 years ago • Assignee
Attachment #700736 -
Flags: checked-in+
Updated • 12 years ago • Assignee
Attachment #700737 -
Flags: checked-in+
Comment 13 • 12 years ago • Assignee
There haven't been any doubly run jobs since this went into production last week.
select * from buildrequests as b1, buildrequests as b2 where b1.id > b2.id and abs(b1.submitted_at - b2.submitted_at) < 5 and b1.submitted_at > unix_timestamp("2013-01-21") and b1.buildername like "release-%" and b1.buildername = b2.buildername;
Empty set (0.16 sec)
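The query above self-joins buildrequests to find pairs of release requests for the same builder submitted within 5 seconds of each other. A small sketch of the same check against an in-memory SQLite table follows. The three-column table layout and the sample rows are made up for illustration, and MySQL's unix_timestamp("2013-01-21") is replaced by a UTC epoch value passed as a parameter, which may differ from the production server's time zone.

```python
import calendar
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildrequests ("
    "id INTEGER PRIMARY KEY, buildername TEXT, submitted_at INTEGER)"
)

# 2013-01-21 00:00:00 UTC as epoch seconds (an assumption about time zone).
cutoff = calendar.timegm((2013, 1, 21, 0, 0, 0))

# Made-up rows: ids 1 and 2 form a double trigger; ids 4 and 5 are close
# together too, but are not release builders and must be ignored.
rows = [
    (1, "release-mozilla-esr10-linux_update_verify_4/4", cutoff + 100),
    (2, "release-mozilla-esr10-linux_update_verify_4/4", cutoff + 101),
    (3, "release-mozilla-esr10-linux_update_verify_1/4", cutoff + 100),
    (4, "mozilla-central-linux", cutoff + 100),
    (5, "mozilla-central-linux", cutoff + 100),
]
conn.executemany("INSERT INTO buildrequests VALUES (?, ?, ?)", rows)

DUPLICATE_QUERY = """
SELECT b1.id, b2.id, b1.buildername
FROM buildrequests AS b1, buildrequests AS b2
WHERE b1.id > b2.id
  AND abs(b1.submitted_at - b2.submitted_at) < 5
  AND b1.submitted_at > ?
  AND b1.buildername LIKE 'release-%'
  AND b1.buildername = b2.buildername
"""
duplicates = conn.execute(DUPLICATE_QUERY, (cutoff,)).fetchall()
```

The `b1.id > b2.id` condition keeps each pair from being reported twice and stops a row from matching itself, so an empty result means no two matching requests landed within the 5 second window.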
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 14 • 12 years ago
* [release] Firefox 23.0b2 build1: failed at push_to_mirrors
** double triggered
** good: http://buildbot-master63.srv.releng.use1.mozilla.com:8001/builders/release-mozilla-beta-push_to_mirrors/builds/0
** bad: http://buildbot-master57.srv.releng.use1.mozilla.com:8001/builders/release-mozilla-beta-push_to_mirrors/builds/2
Updated • 12 years ago
Product: mozilla.org → Release Engineering
Updated • 7 years ago
Component: General Automation → General