AggregatingScheduler sometimes double-triggers jobs

Status

RESOLVED FIXED
6 years ago
5 months ago

People

(Reporter: bhearsum, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [schedulers][buildbot])

Attachments

(2 attachments)

(Reporter)

Description

6 years ago
We hit this in Firefox 10.0.12esr. I originally thought it might be bug 811708, but it looks like that bug has symptoms that involve needing to manually fix up the database. In this case, we had updates run, succeed, and then two jobs for each update verify builder were triggered. The scheduler master had no useful logs, but I found this on bm30 (where updates ran, and where two jobs of the same update verify builder ran):
2013-01-03 15:14:58-0800 [-]  <Build release-mozilla-esr10-updates>: build finished
2013-01-03 15:15:01-0800 [-] AggregatingScheduler(release-mozilla-esr10-updates_done) <id=655871224>: new builds: ((u'release-mozilla-esr10-updates', 19172862L, 1357254901L),) since 1357233899.56
2013-01-03 15:15:01-0800 [-] AggregatingScheduler(release-mozilla-esr10-updates_done) <id=655871224>: new buildset: branch=releases/mozilla-esr10, ssid=4408660, builders: release-mozilla-esr10-linux_update_verify_1/4, release-mozilla-esr10-linux_update_verify_2/4, release-mozilla-esr10-linux_update_verify_3/4, release-mozilla-esr10-linux_update_verify_4/4, release-mozilla-esr10-linux64_update_verify_1/4, release-mozilla-esr10-linux64_update_verify_2/4, release-mozilla-esr10-linux64_update_verify_3/4, release-mozilla-esr10-linux64_update_verify_4/4, release-mozilla-esr10-macosx64_update_verify_1/4, release-mozilla-esr10-macosx64_update_verify_2/4, release-mozilla-esr10-macosx64_update_verify_3/4, release-mozilla-esr10-macosx64_update_verify_4/4, release-mozilla-esr10-win32_update_verify_1/4, release-mozilla-esr10-win32_update_verify_2/4, release-mozilla-esr10-win32_update_verify_3/4, release-mozilla-esr10-win32_update_verify_4/4
2013-01-03 15:15:01-0800 [-] AggregatingScheduler(release-mozilla-esr10-updates_done) <id=655871224>: get_initial_state()
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4> using slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave15'>
2013-01-03 15:15:01-0800 [-] acquireLocks(slave <BuildSlave 'mv-moz2-linux-ix-slave15'>, locks [])
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4>.. pinging the slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave15'>
2013-01-03 15:15:01-0800 [-] sending ping
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4> using slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave04'>
2013-01-03 15:15:01-0800 [-] acquireLocks(slave <BuildSlave 'mv-moz2-linux-ix-slave04'>, locks [])
2013-01-03 15:15:01-0800 [-] starting build <Build release-mozilla-esr10-linux_update_verify_4/4>.. pinging the slave <SlaveBuilder builder='release-mozilla-esr10-linux_update_verify_4/4' slave='mv-moz2-linux-ix-slave04'>
2013-01-03 15:15:01-0800 [-] sending ping
2013-01-03 15:15:01-0800 [Broker,25104,10.250.49.152] ping finished: success
2013-01-03 15:15:01-0800 [Broker,24884,10.250.49.163] ping finished: success
2013-01-03 15:15:01-0800 [Broker,25104,10.250.49.152] <Build release-mozilla-esr10-linux_update_verify_4/4>.startBuild
2013-01-03 15:15:01-0800 [Broker,24884,10.250.49.163] <Build release-mozilla-esr10-linux_update_verify_4/4>.startBuild
(Reporter)

Comment 1

6 years ago
Not sure if this is expected or not, but I don't see any "new builds" message for release-comm-esr10-updates.
(Assignee)

Comment 2

6 years ago
Does it actually run the build twice on the slave? It looks like the duplicate builds happened on the same slave. I've seen this on regular builds on my staging masters at times.
(Assignee)

Updated

6 years ago
Assignee: nobody → catlee
(Assignee)

Comment 3

6 years ago
It looks like the aggregating schedulers are running on more than one master. In particular, they're active on bm36 and bm30.

This is not a Good Thing.
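
To illustrate why this leads to duplicate jobs, here is a toy model of the failure. It is not buildbot code: `SharedDB`, `PollingScheduler`, and the assumption that each instance tracks its own last-check time are all hypothetical simplifications. It only shows that two independent pollers watching the same finished build will each create a buildset.

```python
# Toy model only; every name here is hypothetical, not buildbot's API.
class SharedDB:
    """Stands in for the database that every master reads and writes."""

    def __init__(self):
        self.finished_builds = []  # (builder_name, finish_time)
        self.buildsets = []        # (created_by_master, triggered_builder)


class PollingScheduler:
    """Polls for upstream builds that finished since its own last check."""

    def __init__(self, master, db, upstream, downstream):
        self.master = master
        self.db = db
        self.upstream = upstream
        self.downstream = downstream
        # Assumption: this state is per instance, so masters do not
        # know about each other's polls.
        self.last_check = 0

    def poll(self, now):
        new = [b for b in self.db.finished_builds
               if b[0] == self.upstream and self.last_check < b[1] <= now]
        if new:
            self.db.buildsets.append((self.master, self.downstream))
        self.last_check = now


db = SharedDB()
schedulers = [PollingScheduler(m, db, "updates", "update_verify")
              for m in ("bm30", "bm36")]

db.finished_builds.append(("updates", 100))  # one upstream build finishes
for s in schedulers:
    s.poll(now=101)

print(db.buildsets)
# → [('bm30', 'update_verify'), ('bm36', 'update_verify')]
```

One finished upstream build yields two buildsets, one per master running the scheduler.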
(Reporter)

Comment 4

6 years ago
(In reply to Chris AtLee [:catlee] from comment #3)
> It looks like the aggregating schedulers are running on more than one
> master. In particular, they're active on bm36 and bm30.
> 
> This is not a Good Thing.

Ouch! That could certainly cause this!!
(Reporter)

Comment 5

6 years ago
Sounds like we need a special case for AggregatingScheduler here:
https://github.com/mozilla/build-buildbot-configs/blob/master/mozilla/builder_master.cfg#L150
(Assignee)

Comment 6

6 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #5)
> Sounds like we need a special case for AggregatingScheduler here:
> https://github.com/mozilla/build-buildbot-configs/blob/master/mozilla/builder_master.cfg#L150

Yup, for sure. AggregatingScheduler inherits from Triggerable, so they're being instantiated on all masters.
(Assignee)

Comment 7

6 years ago
(In reply to Chris AtLee [:catlee] from comment #6)
> (In reply to Ben Hearsum [:bhearsum] from comment #5)
> > Sounds like we need a special case for AggregatingScheduler here:
> > https://github.com/mozilla/build-buildbot-configs/blob/master/mozilla/builder_master.cfg#L150
> 
> Yup, for sure. AggregatingScheduler inherits from Triggerable, so they're
> being instantiated on all masters.

Except we need them on all masters so that the reset_schedulers builders can trigger them...
(Reporter)

Comment 8

6 years ago
A couple of bad ideas on how to fix this:
1) Put the reset scheduler builders on the scheduler masters. This would mean we need a slave (maybe a self-hosted one) attached to them, and maybe WebStatus so we can double check things.
2) Modify the AggregatingScheduler instances on the builder masters to have no builders to trigger.
(Assignee)

Updated

6 years ago
Whiteboard: [schedulers][buildbot]
(Assignee)

Comment 9

6 years ago
Created attachment 700736
Add enable_service flag to Aggregating scheduler

When this is True (the default), regular scheduler polling is enabled. This is the part that monitors for completed builds and triggers new ones.

When this is False, only the triggerable bit is active so that the scheduler can be reset with a builder.
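
A minimal sketch of the idea, assuming a scheduler whose polling starts in `startService`. This is not the attached patch: the class names, the `polling` and `triggered` attributes, and the method bodies are hypothetical stand-ins that only mirror the behaviour described above.

```python
# Hypothetical stand-ins; the real classes live in buildbot and the patch.
class Triggerable:
    def __init__(self, name, builder_names):
        self.name = name
        self.builder_names = builder_names
        self.triggered = 0

    def trigger(self):
        # In this sketch, triggering just counts reset requests.
        self.triggered += 1


class AggregatingSchedulerSketch(Triggerable):
    def __init__(self, name, builder_names, enable_service=True):
        super().__init__(name, builder_names)
        self.enable_service = enable_service
        self.polling = False

    def startService(self):
        # Only watch for finished upstream builds when the flag is set.
        if self.enable_service:
            self.polling = True

    def stopService(self):
        # Harmless to call even if polling was never started.
        self.polling = False


sched = AggregatingSchedulerSketch(
    "updates_done", ["update_verify_1/4"], enable_service=False)
sched.startService()
sched.trigger()      # the triggerable side still works
sched.stopService()
print(sched.polling, sched.triggered)
# → False 1
```

With `enable_service=False` the instance never polls, but it can still be triggered, which is what a reset builder needs.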
Attachment #700736 - Flags: review?(bhearsum)
(Assignee)

Comment 10

6 years ago
Created attachment 700737
Disable AggregatingScheduler service in build masters

Set enable_service to False for AggregatingScheduler in build masters (not universal or scheduler masters).
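
As a rough sketch of what such a config step might look like (the helper `configure_schedulers`, the fake scheduler classes, and the role strings are assumptions for illustration, not the contents of the attachment):

```python
# Hypothetical config helper; not taken from build-buildbot-configs.
class FakeAggregatingScheduler:
    def __init__(self, name, enable_service=True):
        self.name = name
        self.enable_service = enable_service


class FakeOtherScheduler:
    def __init__(self, name):
        self.name = name


def configure_schedulers(schedulers, master_role):
    """Turn off polling for aggregating schedulers on build masters only."""
    for s in schedulers:
        if isinstance(s, FakeAggregatingScheduler):
            # Universal and scheduler masters keep the default (True).
            s.enable_service = master_role != "build"
    return schedulers


build = configure_schedulers(
    [FakeAggregatingScheduler("updates_done"), FakeOtherScheduler("nightly")],
    master_role="build")
sched_master = configure_schedulers(
    [FakeAggregatingScheduler("updates_done")], master_role="scheduler")

print(build[0].enable_service, sched_master[0].enable_service)
# → False True
```

The scheduler is still instantiated on every master, so reset builders can trigger it anywhere; only the polling is switched off on build masters.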
Attachment #700737 - Flags: review?(bhearsum)
(Reporter)

Comment 11

6 years ago
Comment on attachment 700736
Add enable_service flag to Aggregating scheduler

Review of attachment 700736:
-----------------------------------------------------------------

Do we get any exceptions when stopService gets called at shutdown because the service was never started? If so, it'd be good to null that method out. Otherwise looks fine.
Attachment #700736 - Flags: review?(bhearsum) → review+
(Reporter)

Updated

6 years ago
Attachment #700737 - Flags: review?(bhearsum) → review+
(Assignee)

Comment 12

6 years ago
(In reply to Ben Hearsum [:bhearsum] from comment #11)
> Comment on attachment 700736
> Add enable_service flag to Aggregating scheduler
> 
> Review of attachment 700736:
> -----------------------------------------------------------------
> 
> Do we get any exceptions when stopService gets called at shutdown because
> the service was never started? If so, it'd be good to null that method out.
> Otherwise looks fine.

Nope, I can't make it generate an exception when stopService is called, so I think it's safe.
(Assignee)

Updated

6 years ago
Attachment #700736 - Flags: checked-in+
(Assignee)

Updated

6 years ago
Attachment #700737 - Flags: checked-in+
(Assignee)

Comment 13

6 years ago
There haven't been any doubly run jobs since this went into production last week.

select * from buildrequests as b1, buildrequests as b2
  where b1.id > b2.id
    and abs(b1.submitted_at - b2.submitted_at) < 5
    and b1.submitted_at > unix_timestamp("2013-01-21")
    and b1.buildername like "release-%"
    and b1.buildername = b2.buildername;
Empty set (0.16 sec)
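
For anyone who wants to see what the check catches, here is a small self-contained sketch of the same self-join against an in-memory SQLite database. The table layout is reduced to the three columns the query uses, the rows are invented sample data, and the date cutoff is replaced by a plain integer parameter because SQLite has no `unix_timestamp()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildrequests ("
    "id INTEGER PRIMARY KEY, buildername TEXT, submitted_at INTEGER)")

# Invented sample rows, not real data from the production database.
rows = [
    (1, "release-a_update_verify_1/4", 1000),
    (2, "release-a_update_verify_1/4", 1001),  # same builder, 1 s apart
    (3, "release-a_update_verify_2/4", 1000),
    (4, "release-a_update_verify_2/4", 2000),  # same builder, far apart
    (5, "nightly-build", 1000),
    (6, "nightly-build", 1001),                # not a release builder
]
conn.executemany("INSERT INTO buildrequests VALUES (?, ?, ?)", rows)

# Pairs of requests for the same release builder submitted within 5 s.
query = """
SELECT b1.id, b2.id, b1.buildername
FROM buildrequests AS b1, buildrequests AS b2
WHERE b1.id > b2.id
  AND abs(b1.submitted_at - b2.submitted_at) < 5
  AND b1.submitted_at > ?
  AND b1.buildername LIKE 'release-%'
  AND b1.buildername = b2.buildername
"""
duplicates = conn.execute(query, (0,)).fetchall()
print(duplicates)
# → [(2, 1, 'release-a_update_verify_1/4')]
```

The `b1.id > b2.id` condition reports each suspicious pair once rather than twice.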
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General
Product: Release Engineering → Release Engineering