Closed Bug 782874 Opened 13 years ago Closed 9 years ago

builds-running.js doesn't know what rev a coalesced job actually runs on

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Assigned: catlee)

References

Details

(Keywords: sheriffing-P1, Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2554] [capacity])

For months, I've been bitching about how we coalesce test jobs backward, to an older push which happened to do the sendchange later than a newer push. Tonight, I was watching just two jobs on two pushes right next to each other, and finally saw what's really happening.
Given
    push B: build started 08:00, test scheduled 09:00
    push A: build started 07:00, test scheduled 10:00
the RequestSortingBuildFactory will see (when it finally gets to schedule the test at 11:00) that the buildid for push B is a higher number, and so the test will get correctly scheduled against push B. However, builds-running.js either didn't get the memo about request sorting, or gets confused by something, and it says that the test is running on push A.
tbpl displays running jobs based on what https://secure.pub.build.mozilla.org/builddata/buildjson/builds-running.js tells it, and finished jobs based on what http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz tells it, so in that case it goes from showing a pending test on each push, to showing a running test on push A, to showing the completed test on push B, leading to much wailing and gnashing of teeth when the running display says that some useless old push stole away the job you needed to have run on a newer push (or to premature and incorrect celebration when you think your old suspect push got the test, turning to rage when the completed job disappears and jumps to some newer push that you already knew was bad).
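To make the mismatch concrete, here is a minimal sketch of the two-push scenario described above. It is not the buildbot or buildapi code: the `Request` class, both function names, and the buildid values (including the date in them) are made up for illustration, and the "latest request wins" rule in the second function is only a guess at what would produce the display seen in tbpl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for a pending test request (not a real buildbot class)."""
    request_id: int  # grows in the order the tests were scheduled (sendchange order)
    push: str        # which push the request belongs to
    buildid: int     # timestamp-style buildid taken from when the build started


# Push B built at 08:00 and scheduled its test at 09:00 (so it gets the lower request id).
# Push A built at 07:00 but scheduled its test later, at 10:00.
# The date part of the buildids is invented for the example.
requests = [
    Request(request_id=1, push="B", buildid=20120101080000),
    Request(request_id=2, push="A", buildid=20120101070000),
]


def push_job_runs_on(reqs):
    """What the report says request sorting does: the highest buildid wins."""
    return max(reqs, key=lambda r: r.buildid).push


def push_shown_as_running(reqs):
    """Assumed display behaviour: the most recently scheduled request wins."""
    return max(reqs, key=lambda r: r.request_id).push


print("job actually runs on push", push_job_runs_on(requests))
print("running display shows push", push_shown_as_running(requests))
```

With these inputs the two rules disagree, which matches the jump from push A (running) to push B (completed) seen in tbpl.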
It's true that the buildapi code for generating builds-running.js doesn't know about the 'take the latest buildID' logic used when starting builds. Specifically, in the block starting at http://hg.mozilla.org/build/buildapi/file/6a71c42d0895/buildapi/model/query.py#l135 it will take the highest buildrequest id (i.e. the latest request) to look up the revision (i.e. push A). To mimic the job-starting code we'd need to teach buildapi to look up the buildID for a request, which we could do by querying the change_properties table, after schlepping from buildrequests.buildsetid through buildsets, sourcestamps, sourcestamp_changes, and changes.
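The table walk described above might look roughly like the following sketch. It runs against an in-memory SQLite database with a heavily simplified schema: only the table names come from the comment, while every column name, the `revision_for_coalesced` helper, and the sample rows are assumptions made for illustration. The real scheduler database may store property values in an encoded form that would need decoding before comparison.

```python
import sqlite3

# Simplified, assumed schema: only the table names are taken from the comment above.
SCHEMA = """
CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, buildsetid INTEGER);
CREATE TABLE buildsets (id INTEGER PRIMARY KEY, sourcestampid INTEGER);
CREATE TABLE sourcestamps (id INTEGER PRIMARY KEY, revision TEXT);
CREATE TABLE sourcestamp_changes (sourcestampid INTEGER, changeid INTEGER);
CREATE TABLE changes (changeid INTEGER PRIMARY KEY);
CREATE TABLE change_properties (changeid INTEGER, property_name TEXT, property_value TEXT);
"""

# Invented sample data: request 1 is push B (newer build), request 2 is push A.
SAMPLE = """
INSERT INTO buildrequests VALUES (1, 1), (2, 2);
INSERT INTO buildsets VALUES (1, 1), (2, 2);
INSERT INTO sourcestamps VALUES (1, 'revB'), (2, 'revA');
INSERT INTO sourcestamp_changes VALUES (1, 1), (2, 2);
INSERT INTO changes VALUES (1), (2);
INSERT INTO change_properties VALUES
    (1, 'buildid', '20120101080000'),
    (2, 'buildid', '20120101070000');
"""


def revision_for_coalesced(conn, request_ids):
    """Return the revision of the coalesced request with the highest buildid."""
    placeholders = ",".join("?" for _ in request_ids)
    row = conn.execute(
        f"""
        SELECT ss.revision
        FROM buildrequests br
        JOIN buildsets bs ON bs.id = br.buildsetid
        JOIN sourcestamps ss ON ss.id = bs.sourcestampid
        JOIN sourcestamp_changes sc ON sc.sourcestampid = ss.id
        JOIN changes c ON c.changeid = sc.changeid
        JOIN change_properties cp ON cp.changeid = c.changeid
        WHERE cp.property_name = 'buildid'
          AND br.id IN ({placeholders})
        ORDER BY CAST(cp.property_value AS INTEGER) DESC
        LIMIT 1
        """,
        list(request_ids),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executescript(SAMPLE)

print(revision_for_coalesced(conn, [1, 2]))
```

Ordering by buildid rather than by request id is the whole point of the sketch: with the sample rows, the highest request id belongs to push A, but the lookup returns push B's revision.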
I wonder whether this is the reason why when we cancel running jobs (because we know they are going to fail, and we want to see that job pass on the backout push up above), we frequently wind up accidentally cancelling the job on some push up above rather than on the push where we actually cancelled it. That would be the reason that we never cancel running jobs on trees other than try, which is a pretty nasty chunk of [capacity] when we have 10 known-worthless pushes full of running tests that we leave running for fear of killing the job on the backout instead.
Whiteboard: [sheriff-want] → [sheriff-want][capacity]
Keywords: sheriffing-P1
Whiteboard: [sheriff-want][capacity] → [capacity]
This caught me out on m-c again this morning. I was looking at m-c tip, since it appeared that a completed OS X build there hadn't triggered jobs. In fact, the running job was just being displayed on the prior push (confirmed by viewing the buildbot master job entry for that job, which listed the tip cset).
So I think we need to modify this code: http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l301 and have it add a new property like 'coalesced_id' with the id of the request that ends up being sorted last.
Assignee: nobody → catlee
Getting worse now that we decided it was okay to cancel running builds on not-try because they now clobber - if you don't obsessively make sure that the build you are going to cancel is in fact shown as running on every single push above the one where you cancel it, you generally actually cancel the one good one you wanted to keep, and have to retrigger it.
(In reply to Chris AtLee [:catlee] from comment #4)
> So I think we need to modify this code:
> http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l301
>
> and have it add a new property like 'coalesced_id' with the id of the
> request that ends up being sorted last.
Actually, that won't work, since any property changes at this point won't appear in the schedulerdb, which is where builds-running gets its data from.
Product: mozilla.org → Release Engineering
Whiteboard: [capacity] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2554] [capacity]
Blocks: 1164545
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: Tools → General