Closed Bug 557268 Opened 10 years ago Closed 8 years ago

release dependent schedulers sometimes don't fire

Categories

(Release Engineering :: General, defect, P3)

All
macOS

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

(Whiteboard: [automation][buildmasters])

We've had a few cases recently where the source/build dep scheduler didn't fire, for no discernible reason. The only commonality I've noticed is that it has always happened when there were multiple reconfigs close together -- for a build2 in every case, IIRC.
There are two possible work items here:
* Figure out why the dep schedulers broke, and fix that issue
or
* Switch to Triggerable, giving careful thought to whether or not it's a bad thing that there's *no* easy way to prevent subsequent builds from firing like there is with Dependent (a rough sketch of both scheduler styles follows below).
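For context, here is a rough, hypothetical sketch of the two scheduler styles being weighed, assuming the Buildbot 0.8.x-era classes; the scheduler and builder names are illustrative and are not the actual release.py configuration:

# Illustrative sketch only -- not the real buildbotcustom release config.
# Assumes Buildbot 0.8.x-style scheduler classes.
from buildbot.schedulers.basic import Scheduler, Dependent
from buildbot.schedulers.triggerable import Triggerable

# Current style: Dependent fires its builders only after every build started
# by the upstream scheduler has succeeded.  The upstream link is part of the
# scheduler's in-master state, which is what appears to get lost across
# back-to-back reconfigs.
updates = Scheduler(name='updates',
                    branch='releases/mozilla-1.9.2',
                    treeStableTimer=None,
                    builderNames=['release-mozilla-1.9.2-updates'])
update_verify = Dependent(name='update_verify',
                          upstream=updates,
                          builderNames=['release-mozilla-1.9.2-update_verify'])

# Alternative style: Triggerable fires whenever a Trigger step in an upstream
# build asks for it.  It is unaffected by reconfigs, but by itself offers no
# easy way to run the upstream builder without also firing the downstream
# builds.
update_verify_triggerable = Triggerable(
    name='update_verify',
    builderNames=['release-mozilla-1.9.2-update_verify'])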
Whiteboard: [automation][buildmasters]
How about a Triggerable with a config flag controlling whether to trigger them?
Changing the flag would require a reconfig, but so does a lot of other release automation recovery.
Rather than using a flag, we could use a property that defaults to True. That way, we can override it without a reconfig.
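A minimal sketch of that property-gated approach, assuming Buildbot 0.8.x's Trigger step and doStepIf support; the 'trigger_dependents' property name and the factory/scheduler names are made up for illustration:

# Illustrative sketch only.  'trigger_dependents' is a made-up property name.
from buildbot.process.factory import BuildFactory
from buildbot.steps.trigger import Trigger

factory = BuildFactory()
# ... normal steps for the upstream builder go here ...

factory.addStep(Trigger(
    schedulerNames=['update_verify'],
    waitForFinish=False,
    # Only fire the downstream Triggerable when the 'trigger_dependents'
    # property is not explicitly set to False; the default is True, so
    # normal automation is unaffected.
    doStepIf=lambda step: step.build.getProperties().getProperty(
        'trigger_dependents', True),
))

With something like this, a force build could presumably set trigger_dependents=False via the force-build form's property fields to skip the downstream jobs, without touching the master config.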
I hit this issue for the 3.6.14 release.

I filed http://trac.buildbot.net/ticket/1777 to keep track of it.

#######################
We triggered a release on Friday, and between then and today two reconfigurations happened.
On Monday the "updates" builder [1] was triggered by an ftpPoller [2], and it was supposed to trigger the "update_verify" builders [3].

The problem is that a reconfigure happened before that, and it caused the Dependent scheduler to forget what to trigger.

We could switch to Trigger steps, but then we would lose the ability to do a "force build" without also triggering the dependent jobs.

The release was triggered at 14:32 on Friday.
The updates builder was triggered at 13:01 on Monday.
2 reconfigures happened in between.

Could this be the point where the factory is considered changed and the memory loss happens?

2011-01-21 21:56:03-0800 [-] updating builder release-mozilla-1.9.2-updates: factory changed

    nextSlave changed from <function _nextFastReservedSlave at 0x1953e64c> to <function _nextFastReservedSlave at 0x1d747b1c>

2011-01-21 21:56:03-0800 [-] consumeTheSoulOfYourPredecessor: <Builder release-mozilla-1.9.2-updates at 500572364> feeding upon <Builder release-mozilla-1.9.2-updates at 240793324>

Reconfigs on masters, listed chronologically (the first two happened before the "updates" job was triggered):
twistd.log.228:2011-01-21 21:56:05-0800 [-] configuration update started
twistd.log.228:2011-01-21 21:56:37-0800 [-] configuration update complete
twistd.log.65:2011-01-24 12:48:34-0800 [-] configuration update started
twistd.log.65:2011-01-24 12:48:55-0800 [-] configuration update complete
twistd.log.52:2011-01-24 16:29:40-0800 [-] configuration update started
twistd.log.52:2011-01-24 16:31:24-0800 [-] configuration update complete

[1] Updates scheduler -  http://hg.mozilla.org/build/buildbotcustom/file/tip/process/release.py#l313
[2] ftpPoller -  http://hg.mozilla.org/build/buildbotcustom/file/tip/process/release.py#l212
[3] Update_verify builder -  http://hg.mozilla.org/build/buildbotcustom/file/tip/process/release.py#l327
I would suggest wontfixing this:
* We are going to use AggregatingScheduler instead of Dependent for some builders.
* The other Dependent schedulers (tag and build) fire within a very short time after the release sendchange, so it will be safe to reconfig ~30-60 minutes after we start a release.

If you don't want to wontfix, I can grab the bug.
Assignee: nobody → bhearsum
Yeah, let's WONTFIX.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering