Closed Bug 1093743 Opened 10 years ago Closed 9 years ago

Job ingestion sometimes picks the wrong request_id & calculates the guid incorrectly

Categories

(Tree Management :: Treeherder: Data Ingestion, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RyanVM, Assigned: emorley)

References

Details

(Keywords: regression)

Attachments

(1 file, 2 obsolete files)

https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&fromchange=3629ea724086&tochange=f7999c1b6030

I attempted to retrigger the Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-e10s-2 job on revision f7999c1b6030. It started on revision 3629ea724086 instead.
I reproduced this just now on that job.

Filtered results link:
https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&fromchange=3629ea724086&tochange=f7999c1b6030&searchQuery=Ubuntu%20VM%2012.04%20x64%20mozilla-inbound%20debug%20test%20mochitest-e10s-2

The job on which Ryan used retrigger:
https://treeherder.mozilla.org/api/project/mozilla-inbound/jobs/3411383/
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=3411383&name=buildapi&type=json

[{"blob": {"buildername": "Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-e10s-2", "request_id": 53835378}, "type": "json", "id": 18363937, "name": "buildapi", "job_id": 3411383}]

Pressing retrigger via the UI makes a request to buildapi, with request body:
request_id=53835378

However the job as found by buildapi:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/f7999c1b6030
-> https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/54696365

...has:
"request_id": 53835405

So:
a) We're storing the wrong request_id for the job
b) TBPL doesn't use request_id, it uses build_id (and uses the /build endpoint rather than /request). What's the difference between the two, and which should we be using?

Workaround for now:
On jobs where we have the wrong request_id associated with the job, use the buildapi self-serve page to retrigger the job, rather than the treeherder UI.
Priority: -- → P2
Using the /request endpoint...

The request_id we used:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/53835378
-> has build_id of 54696364

The request_id we should have used: 
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/53835405
-> has build_id of 54696365

Perhaps significant that the build_ids are adjacent to each other?
Hit it again this morning. Basically means we're going to have to do all retriggers through buildapi so we aren't wasting our time expecting things to start that won't.

Vote we bump this to P1.
Yeah, agreed on making this a P1.
I wonder why it only just started happening? 
(Unless it's a case of us noticing now we're paying more attention to it)
Priority: P2 → P1
I'm pretty sure I've seen it in the past but dismissed it as user error on my part.
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
I had thought this would be an issue on the client in how we retrigger.  But after talking to Ed, I believe this is actually an ingestion issue.  The ``request_id`` value we keep when there are multiple ``request_ids`` in the list should be the LATEST one.  I was saving the EARLIEST one.  So, for coalesced jobs, I would try to retrigger the wrong job.  In Ed's example above, the list of ``request_ids`` was:
[53835378, 53835405].  I was saving element 0, where I should have been saving the latest one: the max value.

Catlee: can you confirm this is the right ``request_id`` to use for retriggers?
Flags: needinfo?(catlee)
It's probably safer to use build_id for retriggers if you want to re-run exactly the same thing as ran before. We don't always coalesce to the highest request ids (see bugs 555664, 627603 for why), so I'm not sure if it's possible to determine which request in the end was chosen. For this reason retriggering based on the build_id is safer for what I think you're trying to do.

OTOH if you're trying to retrigger a job that was coalesced away, then you need to use the request_id.
Flags: needinfo?(catlee)
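To make the two cases concrete, here is a minimal Python sketch of choosing a self-serve URL along those lines; the URL shapes mirror the links earlier in this bug, but the helper name and the coalesced_away flag are illustrative assumptions rather than actual Treeherder code:

    SELF_SERVE = "https://secure.pub.build.mozilla.org/buildapi/self-serve"

    def retrigger_url(repo, build_id, request_id, coalesced_away=False):
        """Pick which self-serve resource a retrigger should target.

        Per the guidance above: a job that actually ran should be re-run via
        its build_id (/build/<id>), while a job that was coalesced away only
        has a request, so its request_id (/request/<id>) must be used.
        """
        if coalesced_away:
            return "%s/%s/request/%s" % (SELF_SERVE, repo, request_id)
        return "%s/%s/build/%s" % (SELF_SERVE, repo, build_id)

    # e.g. retrigger_url("mozilla-inbound", 54696365, 53835405)
    # -> .../buildapi/self-serve/mozilla-inbound/build/54696365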
I confirmed this fix with these two resultsets:
http://local.treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&tochange=a2551ec63d3c&searchQuery=os%20x%2010.8%20debug&fromchange=49d26513f6be

On the old code, when I retriggered the red build from change a2551ec63d3c, the retrigger showed in change 49d26513f6be.  With the new code, it showed in the first change a2551ec63d3c.
Note: this won't fix old jobs that already have the wrong ``request_id`` in the artifact.  This fixes new jobs coming in so that their ``buildapi`` artifact is now correct.
Attached file PR286 (obsolete) —
Attachment #8521056 - Flags: review?(emorley)
Comment on attachment 8521056 [details] [review]
PR286

Given comment 8, I think we may have to go with the more complicated approach (split build_id/request_id depending on job state) - thoughts?

Either way, guess we should ingest the correct request_id for completed jobs (or even the whole list) rather than the current wrong one - if we're going to keep it in the data model.
Attachment #8521056 - Flags: review?(emorley)
Catlee-- Oops, I just saw your comment.  So, wrt builds4hr.js, running.js, pending.js: which field is the right build_id?  It seems to work with the ``id`` field from running.js.  But I haven't had luck with any of the fields from builds4hr.js.

Here's an example of those structures for a single job:
https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=3778396&type=json

in each of those objects of the array, the ``blob`` field is the one from one of these files:
1. buildapi_complete: builds4hr.js
2. buildapi_running: running.js
3. buildapi_pending: pending.js
4. buildapi: our internal summary artifact

So, I know if I have a pending job, I'd retrigger with the request_id.  If it's running, I could use the /build/ endpoint and pass the build_id=55791253 (in the link provided).  But for a complete job, I can't find what id from ``buildapi_complete`` to send to the /build/ endpoint.  So I've been using the /request/ endpoint.

Which id would be the right one?
Keywords: regression
I tried to retrigger (amongst others) win xp debug bc1 on:

https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=ca998f9e1b71

Nothing is showing up there for me. However, https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=b8613576f657 seems to have more xp debug bc1 jobs than I retriggered there...

Still more confusingly, self serve claims there are bc1 runs ongoing for ca998f9e1b71:

https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/ca998f9e1b71
(In reply to :Gijs Kruitbosch from comment #15)
> I tried to retrigger (amongst others) win xp debug bc1 on:
> 
> https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-
> inbound&revision=ca998f9e1b71
> 
> Nothing is showing up there for me. However,
> https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-
> inbound&revision=b8613576f657 seems to have more xp debug bc1 jobs than I
> retriggered there...
> 
> Still more confusingly, self serve claims there are bc1 runs ongoing for
> ca998f9e1b71:
> 
> https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/
> ca998f9e1b71

Just realized those are linux runs, not windows... so it seems like this is essentially just not fixed.
Retriggered the windows builds using self-serve instead, which seems to have worked.
Component: Treeherder → Treeherder: Data Ingestion
No longer blocks: treeherder-dev-transition
Hey Catlee: I'm back to working on this bug now.  I never added a needinfo request.  My bad...

Would you take a look at my question in Comment 13 above and see if you can help me determine the right ``build id`` to use?

Thanks!
Flags: needinfo?(catlee)
Depends on: 1127101
A testcase from builds-4hr that demonstrates the breakage:

    {
      "builder_id": 248158, 
      "buildnumber": 346, 
      "endtime": 1422472064, 
      "id": 57666910, 
      "master_id": 80, 
      "properties": {
        "basedir": "/builds/slave/test", 
        "blobber_files": "{\"qemu.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/75dc184489708c4ef6edbcf30fd5ebc9b95b4d000034a48d131d77e29489defed4cee754333676544e3563df0b3e6b582867d5a70f3816e2ac50e03cfb2ef006\", \"emulator-5554.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5f3dc4a19484d52abde92c8e01aebaf0d2a155799ef3a5a48263f6bab7b62c237613a5601238f280d250aca2605d69749665b2247b82f5ee924c394b5af67627\"}", 
        "branch": "mozilla-inbound", 
        "build_url": "http://pvtbuilds.pvt.build.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/20150128084950/emulator.tar.gz", 
        "buildername": "b2g_emulator_vm mozilla-inbound opt test reftest-15", 
        "buildid": "20150128084950", 
        "buildnumber": 346, 
        "builduid": "55f2a8ac293f411aac247203b95f8ec8", 
        "log_url": "http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1422463790/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-reftest-15-bm67-tests1-linux64-build346.txt.gz", 
        "master": "http://buildbot-master67.srv.releng.use1.mozilla.com:8201/", 
        "pgo_build": "False", 
        "platform": "emulator", 
        "product": "b2g", 
        "project": "", 
        "repo_path": "integration/mozilla-inbound", 
        "repository": "", 
        "request_ids": [
          60457511, 
          60457203
        ], 
        "request_times": {
          "60457203": 1422465787, 
          "60457511": 1422465962
        }, 
        "revision": "2c010ad45bd386f2e80e458c15a6f2d6e6d56255", 
        "scheduler": "tests-mozilla-inbound-ubuntu64_vm-b2g-emulator-opt-unittest", 
        "script_repo_revision": "bafcff7355f4", 
        "script_repo_url": "https://hg.mozilla.org/build/mozharness", 
        "slavebuilddir": "test", 
        "slavename": "tst-linux64-spot-996", 
        "stage_platform": "emulator"
      }, 
      "reason": "scheduler", 
      "request_ids": [
        60457203, 
        60457511
      ], 
      "requesttime": 1422465787, 
      "result": 0, 
      "slave_id": 7990, 
      "starttime": 1422466140
    }, 

Using:
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/60457203
https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/60457511

Request id 60457511 was the one actually used (compare the revision against the builds-4hr example above) - ie the last item in request_ids was used.

Spot-checking a bunch of other jobs whose request_ids property has more than one entry shows that the last value in request_ids is the one that gets used there too.

So seems like the solution here is to not use the max request_id:
https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/buildapi.py#L256
                            'request_id': max(request_ids)

...but the last one in the list :-)
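For clarity, a minimal sketch of the proposed change (the helper name is illustrative; the only point is the switch from max() to the last list element):

    def pick_request_id(request_ids):
        """Choose which request id to store in the buildapi artifact.

        Spot-checking builds-4hr suggests the request a build actually ran
        for is the *last* entry in request_ids, not necessarily the
        numerically largest one (coalescing doesn't always keep the highest
        request id).
        """
        # Old approach (buildapi.py#L256): max(request_ids)
        # New approach:
        return request_ids[-1]

    # Example from the builds-4hr entry above: the build ran for 60457511.
    assert pick_request_id([60457203, 60457511]) == 60457511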
Just to follow on from the Vidyo call - I think a good way to test the theory in comment 19 would be to log at the point of completed job ingestion, and actually make two API calls to:
https://secure.pub.build.mozilla.org/buildapi/self-serve/<project>/request/<request_id>

...one where we use max(request_ids) and one where we use request_ids[-1] - and then compare to the revision in builds-4hr and dump to the log which approach was correct (if they didn't both match).
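A rough sketch of what that logging check might look like; it assumes the self-serve request endpoint can return JSON containing the revision when passed format=json (an assumption here), and the function and variable names are illustrative:

    import logging
    import requests

    logger = logging.getLogger(__name__)
    SELF_SERVE = "https://secure.pub.build.mozilla.org/buildapi/self-serve"

    def compare_request_id_strategies(project, request_ids, expected_revision):
        """Compare max(request_ids) vs request_ids[-1] against builds-4hr.

        For each candidate, fetch the self-serve request, compare its
        revision to the one builds-4hr reported for the completed job,
        and log which strategy (if either) matched.
        """
        candidates = {"max": max(request_ids), "last": request_ids[-1]}
        for label, request_id in candidates.items():
            url = "%s/%s/request/%s" % (SELF_SERVE, project, request_id)
            # Assumption: format=json yields a JSON body including the revision.
            resp = requests.get(url, params={"format": "json"})
            revision = resp.json().get("revision", "")
            matched = revision[:12] == expected_revision[:12]
            logger.info("strategy=%s request_id=%s revision=%s matched=%s",
                        label, request_id, revision, matched)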
So I did this manually with a builds-4hr job that had 3 request ids:
 "request_ids": [
          60472268, 
          60472306, 
          60472287
        ], 
and checked the revisions with the url in comment 20.  The first two ids point to different revisions, whereas the last points to the correct revision of: d975577991caa3a7f5aaff02f6ba5dbe16c66a7a

I've tried this on several and it's always the last in the list that points to the right revision.  So, Ed, I believe your hypothesis is confirmed.  It's always the last in that array.  A nice easy fix.  Thanks so much for your help on this!
I ingested this job locally and was able to retrigger correctly on the right revision, as expected.  So we are good here.
Attached file service pr 359 (obsolete) —
Attachment #8556213 - Flags: review?(emorley)
Attachment #8556213 - Attachment description: UI pr 259 → service pr 259
Comment on attachment 8556213 [details] [review]
service pr 359

Commented on the PR - I think one more change is needed - and it's probably worth testing that it's the right choice for running jobs too.
Attachment #8556213 - Flags: review?(emorley)
Attachment #8556213 - Attachment description: service pr 259 → service pr 359
Attachment #8556213 - Flags: review?(emorley)
Comment on attachment 8556213 [details] [review]
service pr 359

Left a few comments on the PR, looks great overall :-)
Attachment #8556213 - Flags: review?(emorley)
hm, looking at the code, I don't think we publish the build_id in builds-4hr.js :(

do you need it added at this point?
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #26)
> hm, looking at the code, I don't think we publish the build_id in
> builds-4hr.js :(
> 
> do you need it added at this point?

I think we've got an approach working using just the request id (we pick the last one in the request_ids list, and it seems to always be the correct one) - so don't need it for now.
Tweaking the summary, since the PR not only means we choose the correct request_id when there are several in request_ids (eg for a job that subsumed several others), but also corrects how we calculate the job guid for certain job states. :-D

tl;dr for times when there was coalescing, we would (a) have a wrong request_id, so retriggers/cancellations wouldn't work, and (b) merge the wrong jobs, leading to things like bug 1113142 and bug 1097328.
Blocks: 1059909
No longer depends on: 1127101
Summary: Treeherder occasionally retriggers jobs on the wrong push → Job ingestion sometimes picks the wrong request_id & calculates the guid incorrectly
Tried this with a running job and it works just fine.
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/d400a8cb56add5f51b973162b78d06ab251c102c
Bug 1093743 - use correct request id when multiple exist

Bug 1097328 - fixed coalesced job_guid for running/complete jobs
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Depends on: 1133455
Commit pushed to master at https://github.com/mozilla/treeherder-service

https://github.com/mozilla/treeherder-service/commit/fd7cd206830bde2d5d0b882d838c1c7a196dacc3
Revert "Bug 1093743 - use correct request id when multiple exist"

This reverts commit d400a8cb56add5f51b973162b78d06ab251c102c.
Unfortunately whilst the changes here were technically more correct than the previous approach, due to inconsistencies in builds-4hr, we're seeing some issues with running jobs being orphaned/duplicated.

For now, I've reverted the commit here (and its followup in bug 1133455 to fix the exception) so we can deploy the other changes in the meantime.

For more info, see bug 1133455 comment 8 onwards.

I think the best way forward is to file bugs against builds-4hr - I have performed some analysis locally and the inconsistencies are horrendous, meaning there isn't actually a correct approach we can take that will not leave orphaned jobs around. Will file a Release Engineering bug blocking this one soon.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(emorley)
FWIW, I've been hitting what appears to be this bug today.

In case it helps to have another data point: I've been trying (unsuccessfully) to retrigger the opt "mochitest-9" job on this Try run:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=50913131140c&filter-ref_data_name=b2g_emulator_vm
...but the retriggers haven't been showing up (after repeated waiting, reloading, retriggering again, waiting, etc)
what does self-serve think about these requests?
Retriggers via self-serve work correctly.
(In reply to Chris AtLee [:catlee] from comment #36)
> what does self-serve think about these requests?

(My missing TreeHerder-spawned retriggers do not show up on self-serve)

(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #37)
> Retriggers via self-serve work correctly.

I'll try a few of those now. So anyone looking at my treeherder link may see a few successful retriggers, which just come from me + self-serve.
Depends on: 1141603
I've finally filed bug 1141603 based on my analysis of builds-4hr, though things seem to have improved slightly since bug 1140612 has landed (will need to wait until we have a few days of data to analyse, to be sure).
Flags: needinfo?(emorley)
Cameron's on PTO and I've been digging into the builds-4hr inconsistencies, so I'll drive this for the moment :-)
Assignee: cdawson → emorley
This is blocked on bug 1141603 so not actionable for the moment.
Assignee: emorley → nobody
Priority: P1 → P2
Going to have another go at driving answers in bug 1141603 for this.
Assignee: nobody → emorley
It sounds like some of the symptoms here could also be due to the job being missing from builds-4hr and not just the interaction between treeherder and builds-4hr. See bug 1159279.
Depends on: 1159279
Here goes nothing..! :-)
Attachment #8521056 - Attachment is obsolete: true
Attachment #8556213 - Attachment is obsolete: true
Attachment #8601829 - Flags: review?(cdawson)
Status: REOPENED → ASSIGNED
Sorry for the delay in reviewing this.  I will get to it before EOD.
Attachment #8601829 - Flags: review?(cdawson) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef
Bug 1093743 - Attempt to generate job_guids more consistently

In order that Treeherder can track a buildbot ingested job as it moves
from pending to running to completed, a unique job_guid must be
generated, since buildbot doesn't directly give us one. To make things
worse, each of the three data sources (builds-{pending,running,4hr})
makes slightly different properties available to us, and the data in them
is not always correct. Until now we've been calculating the guid in such
a way that as some jobs (particularly those that are coalesced) change
state, the guid generated changes, causing us to think it's a brand new
job, thereby leaving behind an orphaned pending/running job.

The new method of generating a guid uses only the single request_id
corresponding to the job that ran, rather than all of the request_ids.
This is because when in the pending state, there is only ever one
request id, so if we use the full list later on, there is no way it will
match, if the job had others coalesced into it.

In addition, the request_times properties are known to be unreliable
(see bug 1141603), so we've stopped using them. This isn't a problem,
since they didn't help us with the buildbot retry case anyway (where the
request id is recycled), and we already handle that by using endtime.
We instead now use the buildername to add uniqueness rather than the
request_times.
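A minimal sketch of guid generation along those lines (hash a single request_id plus the buildername, appending endtime only for buildbot retries); the function name and exact formatting are illustrative assumptions rather than the actual Treeherder implementation:

    import hashlib

    def generate_job_guid(request_id, buildername, endtime=None):
        """Build a stable guid for a buildbot-ingested job.

        Using only the request id the build actually ran for, plus the
        buildername, keeps the guid identical as the job moves from
        pending to running to completed. endtime is appended only for
        retried jobs, where buildbot recycles the same request id.
        """
        guid = hashlib.sha1(
            ("%s%s" % (request_id, buildername)).encode("utf-8")
        ).hexdigest()
        if endtime:
            guid = "%s_%s" % (guid, endtime)
        return guid

    # e.g. generate_job_guid(60457511,
    #                        "b2g_emulator_vm mozilla-inbound opt test reftest-15")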
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/54727c2994d9f937bd6297dadd000580bfcfc891
Bug 1093743 - Don't overwrite the value of request_id

It turns out that the variable used for the index in a loop persists
outside of the loop in Python, so previously we were overwriting the
original value of request_id with that of the index.
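A self-contained illustration of the Python behaviour being described (names are illustrative):

    request_id = 60457511  # the value we want to keep

    # A for-loop variable is not scoped to the loop body in Python...
    for request_id in range(3):
        pass

    # ...so the original value has been clobbered by the final index.
    print(request_id)  # 2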
Depends on: 1163496
Looks fine so far, we'll have to see how it goes (this is a pretty risky change, but no real way to be 100% sure when testing locally) :-)

The comment 49 commit fixes bug 1163496, and in bug 1163496 I've tweaked how the UI determines whether a job is coalesced, to work around the bad data that was ingested for the few hours before the comment 49 commit landed. The long term fix for how we handle coalesced jobs is bug 1132546.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED