Closed
Bug 1093743
Opened 10 years ago
Closed 10 years ago
Job ingestion sometimes picks the wrong request_id & calculates the guid incorrectly
Categories
(Tree Management :: Treeherder: Data Ingestion, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RyanVM, Assigned: emorley)
References
Details
(Keywords: regression)
Attachments
(1 file, 2 obsolete files)
https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&fromchange=3629ea724086&tochange=f7999c1b6030
I attempted to retrigger the Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-e10s-2 job on revision f7999c1b6030. It started on revision 3629ea724086 instead.
Updated•10 years ago
Blocks: treeherder-dev-transition
Assignee
Comment 1•10 years ago
I reproduced this just now on that job.
Filtered results link:
https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&fromchange=3629ea724086&tochange=f7999c1b6030&searchQuery=Ubuntu%20VM%2012.04%20x64%20mozilla-inbound%20debug%20test%20mochitest-e10s-2
The job on which Ryan used retrigger:
https://treeherder.mozilla.org/api/project/mozilla-inbound/jobs/3411383/
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=3411383&name=buildapi&type=json
[{"blob": {"buildername": "Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-e10s-2", "request_id": 53835378}, "type": "json", "id": 18363937, "name": "buildapi", "job_id": 3411383}]
Pressing retrigger via the UI makes a request to buildapi, with request body:
request_id=53835378
However the job as found by buildapi:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/f7999c1b6030
-> https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/54696365
...has:
"request_id": 53835405
So:
a) We're storing the wrong request_id for the job
b) TBPL doesn't use request_id, it uses build_id (and uses the /build endpoint rather than /request). What's the difference between the two, and which should we be using?
Workaround for now:
On jobs where we have the wrong request_id associated with the job, use the buildapi self-serve page to retrigger the job, rather than the treeherder UI.
Priority: -- → P2
Assignee
Comment 2•10 years ago
Using the /request endpoint...
The request_id we used:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/53835378
-> has build_id of 54696364
The request_id we should have used:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/53835405
-> has build_id of 54696365
Perhaps significant that the build_ids are adjacent to each other?
Reporter
Comment 3•10 years ago
Hit it again this morning. This basically means we're going to have to do all retriggers through buildapi so we aren't wasting our time expecting things to start that never will.
Vote we bump this to P1.
Comment 4•10 years ago
Yeah, agreed on making this a P1.
Assignee
Comment 5•10 years ago
I wonder why it only just started happening?
(Unless it's a case of us noticing now we're paying more attention to it)
Priority: P2 → P1
Reporter
Comment 6•10 years ago
I'm pretty sure I've seen it in the past but dismissed it as user error on my part.
Updated•10 years ago
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Comment 7•10 years ago
I had thought this would be an issue on the client in how we retrigger. But after talking to Ed, I believe this is actually an ingestion issue. The ``request_id`` value we keep when there are multiple ``request_ids`` in the list should be the LATEST one. I was saving the EARLIEST one. So, for coalesced jobs, I would try to retrigger the wrong job. In Ed's example above, the list of ``request_ids`` was:
[53835378, 53835405]. I was saving element 0, where I should have been saving the latest one: the max value.
Catlee: can you confirm this is the right ``request_id`` to use for retriggers?
Flags: needinfo?(catlee)
Comment 8•10 years ago
It's probably safer to use build_id for retriggers if you want to re-run exactly the same thing as ran before. We don't always coalesce to the highest request ids (see bugs 555664, 627603 for why), so I'm not sure if it's possible to determine which request in the end was chosen. For this reason retriggering based on the build_id is safer for what I think you're trying to do.
OTOH if you're trying to retrigger a job that was coalesced away, then you need to use the request_id.
Flags: needinfo?(catlee)
Comment 9•10 years ago
I confirmed this fix with these two resultsets:
http://local.treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&tochange=a2551ec63d3c&searchQuery=os%20x%2010.8%20debug&fromchange=49d26513f6be
On the old code, when I retriggered the red build from change a2551ec63d3c, the retrigger showed in change 49d26513f6be. With the new code, it showed in the first change a2551ec63d3c.
Comment 10•10 years ago
Note: this won't fix old jobs that already have the wrong ``request_id`` in the artifact. This fixes new jobs coming in so that their ``buildapi`` artifact is now correct.
Comment 11•10 years ago
Updated•10 years ago
Attachment #8521056 - Flags: review?(emorley)
Assignee
Comment 12•10 years ago
Comment on attachment 8521056 [details] [review]
PR286
Given comment 8, I think we may have to go with the more complicated approach (split build_id/request_id depending on job state) - thoughts?
Either way, I guess we should ingest the correct request_id for completed jobs (or even the whole list) rather than the current wrong one - if we're going to keep it in the data model.
Attachment #8521056 - Flags: review?(emorley)
Comment 13•10 years ago
Catlee-- Oops, I just saw your comment. So, wrt builds4hr.js, running.js, pending.js: which field is the right build_id? It seems to work with the ``id`` field from running.js. But I haven't had luck with any of the fields from builds4hr.js.
Here's an example of those structures for a single job:
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=3778396&type=json
in each of those objects of the array, the ``blob`` field is the one from one of these files:
1. buildapi_complete: builds4hr.js
2. buildapi_running: running.js
3. buildapi_pending: pending.js
4. buildapi: our internal summary artifact
So, I know if I have a pending job, I'd retrigger with the request_id. If it's running, I could use the /build/ endpoint and pass the build_id=55791253 (in the link provided). But for a complete job, I can't find what id from ``buildapi_complete`` to send to the /build/ endpoint. So I've been using the /request/ endpoint.
Which id would be the right one?
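(Illustration - a minimal sketch of that per-state choice, using only the self-serve URL patterns quoted in this bug; the helper and the ``job`` field names are assumptions for illustration, not Treeherder's actual code:)
```python
SELF_SERVE = "https://secure.pub.build.mozilla.org/buildapi/self-serve"

def self_serve_target(project, job):
    """Pick the self-serve resource to act on, based on the job's state."""
    state = job["state"]  # assumed values: 'pending', 'running', 'completed'
    if state == "pending":
        # Pending jobs only ever have a single request id.
        return "%s/%s/request/%s" % (SELF_SERVE, project, job["request_id"])
    if state == "running":
        # running.js exposes an `id` that works with the /build/ endpoint.
        return "%s/%s/build/%s" % (SELF_SERVE, project, job["build_id"])
    # Completed jobs: builds4hr.js has no obvious build id to hand to
    # /build/, so fall back to the /request/ endpoint.
    return "%s/%s/request/%s" % (SELF_SERVE, project, job["request_id"])
```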
Assignee
Updated•10 years ago
Keywords: regression
Comment 14•10 years ago
Commits pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/b189b22fe9f2ac86240db055c67f32a020e84a94
bug 1093743 - correct request id for retrigger/cancel on coalesced jobs
https://github.com/mozilla/treeherder-service/commit/52fff1c94f993dd88cd392dbb57e96276d5962fc
bug 1093743 - revert change line that shouldn't be in this fix
https://github.com/mozilla/treeherder-service/commit/22f3482143da0850ec30da4e57f4dc3c65c4aab2
Merge pull request #286 from mozilla/bug-1093743-retrigger
Bug 1093743 retrigger fix
Comment 15•10 years ago
I tried to retrigger (amongst others) win xp debug bc1 on:
https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=ca998f9e1b71
Nothing is showing up there for me. However, https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=b8613576f657 seems to have more xp debug bc1 jobs than I retriggered there...
Still more confusingly, self serve claims there are bc1 runs ongoing for ca998f9e1b71:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/ca998f9e1b71
Comment 16•10 years ago
(In reply to :Gijs Kruitbosch from comment #15)
> I tried to retrigger (amongst others) win xp debug bc1 on:
>
> https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=ca998f9e1b71
>
> Nothing is showing up there for me. However,
> https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=b8613576f657
> seems to have more xp debug bc1 jobs than I retriggered there...
>
> Still more confusingly, self serve claims there are bc1 runs ongoing for
> ca998f9e1b71:
>
> https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/ca998f9e1b71
Just realized those are linux runs, not windows... so it seems like this is essentially just not fixed.
Comment 17•10 years ago
Retriggered the windows builds using self-serve instead, which seems to have worked.
Assignee
Updated•10 years ago
Component: Treeherder → Treeherder: Data Ingestion
Assignee
Updated•10 years ago
No longer blocks: treeherder-dev-transition
Comment 18•10 years ago
Hey Catlee: I'm back to working on this bug now. I never added a needinfo request. My bad...
Would you take a look at my question in Comment 13 above and see if you can help me determine the right ``build id`` to use?
Thanks!
Flags: needinfo?(catlee)
Assignee
Comment 19•10 years ago
Broken testcase (a builds-4hr entry):
{
"builder_id": 248158,
"buildnumber": 346,
"endtime": 1422472064,
"id": 57666910,
"master_id": 80,
"properties": {
"basedir": "/builds/slave/test",
"blobber_files": "{\"qemu.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/75dc184489708c4ef6edbcf30fd5ebc9b95b4d000034a48d131d77e29489defed4cee754333676544e3563df0b3e6b582867d5a70f3816e2ac50e03cfb2ef006\", \"emulator-5554.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5f3dc4a19484d52abde92c8e01aebaf0d2a155799ef3a5a48263f6bab7b62c237613a5601238f280d250aca2605d69749665b2247b82f5ee924c394b5af67627\"}",
"branch": "mozilla-inbound",
"build_url": "http://pvtbuilds.pvt.build.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/20150128084950/emulator.tar.gz",
"buildername": "b2g_emulator_vm mozilla-inbound opt test reftest-15",
"buildid": "20150128084950",
"buildnumber": 346,
"builduid": "55f2a8ac293f411aac247203b95f8ec8",
"log_url": "http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1422463790/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-reftest-15-bm67-tests1-linux64-build346.txt.gz",
"master": "http://buildbot-master67.srv.releng.use1.mozilla.com:8201/",
"pgo_build": "False",
"platform": "emulator",
"product": "b2g",
"project": "",
"repo_path": "integration/mozilla-inbound",
"repository": "",
"request_ids": [
60457511,
60457203
],
"request_times": {
"60457203": 1422465787,
"60457511": 1422465962
},
"revision": "2c010ad45bd386f2e80e458c15a6f2d6e6d56255",
"scheduler": "tests-mozilla-inbound-ubuntu64_vm-b2g-emulator-opt-unittest",
"script_repo_revision": "bafcff7355f4",
"script_repo_url": "https://hg.mozilla.org/build/mozharness",
"slavebuilddir": "test",
"slavename": "tst-linux64-spot-996",
"stage_platform": "emulator"
},
"reason": "scheduler",
"request_ids": [
60457203,
60457511
],
"requesttime": 1422465787,
"result": 0,
"slave_id": 7990,
"starttime": 1422466140
},
Using:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/60457203
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/60457511
Request id 60457511 (the one with the later request time) was the one actually used (compare the revision against the builds4hr example above) - ie the last item in request_ids was used.
Spot checking a bunch of other >1 length request_ids properties shows the last value in request_ids is the one that gets used there too.
So it seems like the solution here is not to use the max request_id:
https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/buildapi.py#L256
'request_id': max(request_ids)
...but the last one in the list :-)
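(A minimal sketch of that proposed change to the summary artifact - only the ``max(request_ids)`` line is from the real buildapi.py; the surrounding dict and the ``props`` name are illustrative:)
```python
request_ids = props["request_ids"]  # e.g. [60457203, 60457511]

buildapi_artifact = {
    "buildername": props["buildername"],
    # was: "request_id": max(request_ids)
    # builds-4hr appears to place the request that actually ran at the end
    # of the list, so take the final element rather than the numeric maximum:
    "request_id": request_ids[-1],
}
```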
Assignee
Comment 20•10 years ago
Just to follow on from the Vidyo call - I think maybe a good way to test the theory in comment 19 would be to log at the point of completed-job ingestion, and actually make two API calls to:
https://secure.pub.build.mozilla.org/buildapi/self-serve/<project>/request/<request_id>
...one where we use max(request_ids) and one where we use request_ids[-1] - and then compare to the revision in builds-4hr and dump to the log which approach was correct (if they didn't both match).
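(A sketch of that check; ``get_request_revision()`` is a hypothetical helper wrapping the self-serve /request/ URL above, passed in so the sketch stays self-contained:)
```python
import logging

logger = logging.getLogger(__name__)

def compare_request_id_choices(project, request_ids, builds4hr_revision,
                               get_request_revision):
    """Log which of max(request_ids) / request_ids[-1] matches builds-4hr."""
    candidates = {"max": max(request_ids), "last": request_ids[-1]}
    for label, request_id in candidates.items():
        revision = get_request_revision(project, request_id)
        if revision != builds4hr_revision:
            logger.warning("%s choice %s -> revision %s, but builds-4hr has %s",
                           label, request_id, revision, builds4hr_revision)
```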
Comment 21•10 years ago
So I did this manually with a builds-4hr job that had 3 request ids:
"request_ids": [
60472268,
60472306,
60472287
],
and checked the revisions with the url in comment 20. The first two ids point to different revisions, whereas the last points to the correct revision of: d975577991caa3a7f5aaff02f6ba5dbe16c66a7a
I've tried this on several and it's always the last in the list that points to the right revision. So, Ed, I believe your hypothesis is confirmed. It's always the last in that array. A nice easy fix. Thanks so much for your help on this!
Comment 22•10 years ago
I ingested this job locally and was able to retrigger correctly on the right revision, as expected. So we are good here.
Comment 23•10 years ago
Attachment #8556213 - Flags: review?(emorley)
Updated•10 years ago
Attachment #8556213 - Attachment description: UI pr 259 → service pr 259
Assignee
Comment 24•10 years ago
Comment on attachment 8556213 [details] [review]
service pr 359
Commented on the PR - I think one more change is needed - and it's probably worth testing that it's the right choice for running jobs too.
Attachment #8556213 - Flags: review?(emorley)
Updated•10 years ago
Attachment #8556213 - Attachment description: service pr 259 → service pr 359
Updated•10 years ago
Attachment #8556213 - Flags: review?(emorley)
Assignee
Comment 25•10 years ago
Comment on attachment 8556213 [details] [review]
service pr 359
Left a few comments on the PR, looks great overall :-)
Attachment #8556213 - Flags: review?(emorley)
Comment 26•10 years ago
hm, looking at the code, I don't think we publish the build_id in builds-4hr.js :(
do you need it added at this point?
Flags: needinfo?(catlee)
Assignee
Comment 27•10 years ago
(In reply to Chris AtLee [:catlee] from comment #26)
> hm, looking at the code, I don't think we publish the build_id in
> builds-4hr.js :(
>
> do you need it added at this point?
I think we've got an approach working using just the request id (we pick the last one in the request_ids list, and it seems to always be the correct one) - so don't need it for now.
Assignee
Comment 28•10 years ago
Tweaking the summary, since the PR not only means we choose the correct request_id when there are several in request_ids (eg for a job that subsumed several others), but also corrects how we calculate the job guid for certain job states. :-D
tl;dr: when there was coalescing, we would (a) store the wrong request_id, so retriggers/cancellations wouldn't work, and (b) merge the wrong jobs, leading to things like bug 1113142 and bug 1097328.
Comment 31•10 years ago
Tried this with a running job and it works just fine.
Comment 32•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/d400a8cb56add5f51b973162b78d06ab251c102c
Bug 1093743 - use correct request id when multiple exist
Bug 1097328 - fixed coalesced job_guid for running/complete jobs
Updated•10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 33•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-service
https://github.com/mozilla/treeherder-service/commit/fd7cd206830bde2d5d0b882d838c1c7a196dacc3
Revert "Bug 1093743 - use correct request id when multiple exist"
This reverts commit d400a8cb56add5f51b973162b78d06ab251c102c.
Assignee
Comment 34•10 years ago
Unfortunately, whilst the changes here were technically more correct than the previous approach, due to inconsistencies in builds-4hr we're seeing some issues with running jobs being orphaned/duplicated.
For now, I've reverted the commit here (and its followup in bug 1133455 to fix the exception) so we can deploy the other changes in the meantime.
For more info, see bug 1133455 comment 8 onwards.
I think the best way forward is to file bugs against builds-4hr - I have performed some analysis locally and the inconsistencies are horrendous, meaning there isn't actually a correct approach we can take that will not leave orphaned jobs around. I will file a release engineering bug blocking this one soon.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Updated•10 years ago
Flags: needinfo?(emorley)
Comment 35•10 years ago
FWIW, I've been hitting what appears to be this bug today.
In case it helps to have another data point: I've been trying (unsuccessfully) to retrigger the opt "mochitest-9" job on this Try run:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=50913131140c&filter-ref_data_name=b2g_emulator_vm
...but the retriggers haven't been showing up (after repeated waiting, reloading, retriggering again, waiting, etc)
Comment 36•10 years ago
what does self-serve think about these requests?
Reporter
Comment 37•10 years ago
Retriggers via self-serve work correctly.
Comment 38•10 years ago
(In reply to Chris AtLee [:catlee] from comment #36)
> what does self-serve think about these requests?
(My missing TreeHerder-spawned retriggers do not show up on self-serve)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #37)
> Retriggers via self-serve work correctly.
I'll try a few of those now. So anyone looking at my treeherder link may see a few successful retriggers, which just come from me + self-serve.
Assignee
Comment 39•10 years ago
I've finally filed bug 1141603 based on my analysis of builds-4hr, though things seem to have improved slightly since bug 1140612 has landed (will need to wait until we have a few days of data to analyse, to be sure).
Flags: needinfo?(emorley)
Assignee
Comment 41•10 years ago
Cameron's on PTO and I've been digging into the builds-4hr inconsistencies, so I'll drive this for the moment :-)
Assignee: cdawson → emorley
Assignee
Comment 43•10 years ago
This is blocked on bug 1141603 so not actionable for the moment.
Assignee: emorley → nobody
Priority: P1 → P2
Assignee
Comment 44•10 years ago
Going to have another go at driving answers in bug 1141603 for this.
Assignee: nobody → emorley
Assignee
Comment 45•10 years ago
It sounds like some of the symptoms here could also be due to the job being missing from builds-4hr and not just the interaction between treeherder and builds-4hr. See bug 1159279.
Depends on: 1159279
Assignee
Comment 46•10 years ago
Here goes nothing..! :-)
Attachment #8521056 - Attachment is obsolete: true
Attachment #8556213 - Attachment is obsolete: true
Attachment #8601829 - Flags: review?(cdawson)
Assignee
Updated•10 years ago
Status: REOPENED → ASSIGNED
Comment 47•10 years ago
Sorry for the delay in reviewing this. I will get to it before EOD.
Updated•10 years ago
Attachment #8601829 - Flags: review?(cdawson) → review+
Comment 48•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef
Bug 1093743 - Attempt to generate job_guids more consistently
In order that Treeherder can track a buildbot ingested job as it moves
from pending to running to completed, a unique job_guid must be
generated, since buildbot doesn't directly give us one. To make things
worse, each of the three data sources (builds-{pending,running,4hr})
makes slightly different properties available to us, and the data in them
is not always correct. Until now we've been calculating the guid in such
a way that as some jobs (particularly those that are coalesced) change
state, the guid generated changes, causing us to think it's a brand new
job, thereby leaving behind an orphaned pending/running job.
The new method of generating a guid uses only the single request_id
corresponding to the job that ran, rather than all of the request_ids.
This is because when in the pending state, there is only ever one
request id, so if we use the full list later on, there is no way it will
match, if the job had others coalesced into it.
In addition, the request_times properties are known to be unreliable
(see bug 1141603), so we've stopped using them. This isn't a problem,
since they didn't help us with the buildbot retry case anyway (where the
request id is recycled), and we already handle that by using endtime.
We instead now use the buildername to add uniqueness rather than the
request_times.
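(A rough sketch of the scheme the commit message describes; the hashing details here are illustrative rather than a copy of the Treeherder code:)
```python
import hashlib

def generate_job_guid(request_id, buildername, endtime=None):
    """Derive a stable guid from the single request_id that actually ran.

    buildername adds uniqueness instead of the unreliable request_times;
    endtime is appended only for completed jobs, so a buildbot retry
    (which recycles the request_id) still gets a distinct guid.
    """
    guid = hashlib.sha1(
        ("%s_%s" % (request_id, buildername)).encode("utf-8")).hexdigest()
    if endtime:
        guid = "%s_%s" % (guid, endtime)
    return guid
```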
Comment 49•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/54727c2994d9f937bd6297dadd000580bfcfc891
Bug 1093743 - Don't overwrite the value of request_id
It turns out that the variable used for the index in a loop persists
outside of the loop in Python, so previously we were overwriting the
original value of request_id with that of the index.
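(For illustration - not the actual Treeherder loop - this is the Python behaviour the commit refers to: the loop variable is not scoped to the loop body:)
```python
request_id = 53835405                       # the value we meant to keep

request_ids = [53835378, 53835405]
for request_id in range(len(request_ids)):  # reuses the same name as an index
    pass

# After the loop, request_id is the last index (1), not 53835405, because
# Python leaves the loop variable bound in the enclosing scope.
print(request_id)  # -> 1
```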
Assignee
Comment 50•10 years ago
Looks fine so far, we'll have to see how it goes (this is a pretty risky change, but no real way to be 100% sure when testing locally) :-)
The comment 49 commit fixes bug 1163496, and in bug 1163496 I've tweaked how the UI determines whether a job is coalesced, to work around the bad data that was ingested for the few hours before the comment 49 commit landed. The long-term fix for how we handle coalesced jobs is bug 1132546.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED