<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Assignee

Comment 1

•

11 years ago

I reproduced this just now on that job. Filtered results link: https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&fromchange=3629ea724086&tochange=f7999c1b6030&searchQuery=Ubuntu%20VM%2012.04%20x64%20mozilla-inbound%20debug%20test%20mochitest-e10s-2 The job on which Ryan used retrigger: https://treeherder.mozilla.org/api/project/mozilla-inbound/jobs/3411383/ https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=3411383&name=buildapi&type=json [{"blob": {"buildername": "Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-e10s-2", "request_id": 53835378}, "type": "json", "id": 18363937, "name": "buildapi", "job_id": 3411383}] Pressing retrigger via the UI makes a request to buildapi, with request body: request_id=53835378 However the job as found by buildapi: https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/f7999c1b6030 -> https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/build/54696365 ...has: "request_id": 53835405 So: a) We're storing the wrong request_id for the job b) TBPL doesn't use request_id, it uses build_id (and uses the /build endpoint rather than /request). What's the difference between the two, and which should we be using? Workaround for now: On jobs where we have the wrong request_id associated with the job, use the buildapi self-serve page to retrigger the job, rather than the treeherder UI.

Priority: -- → P2

Ryan VanderMeulen [:RyanVM]

Assignee

Comment 2

•

11 years ago

Using the /request endpoint... The request_id we used: https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/53835378 -> has build_id of 54696364 The request_id we should have used: https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/53835405 -> has build_id of 54696365 Perhaps significant that the build_ids are adjacent to each other?

Reporter

Comment 3

•

11 years ago

Hit it again this morning. Basically means we're going to have to do all retriggers through buildapi so we aren't wasting our time expecting things to start that won't. Vote we bump this to P1.

Carsten Book [:Tomcat]

Comment 4

•

11 years ago

yeah agree on having this a p1

Ryan VanderMeulen [:RyanVM]

Assignee

Comment 5

•

11 years ago

I wonder why it only just started happening? (Unless it's a case of us noticing now we're paying more attention to it)

Priority: P2 → P1

Reporter

Comment 6

•

11 years ago

I'm pretty sure I've seen it in the past but dismissed it as user error on my part.

Updated

•

11 years ago

Assignee: nobody → cdawson

Status: NEW → ASSIGNED

Comment 7

•

11 years ago

I had thought this would be an issue on the client in how we retrigger. But after talking to Ed, I believe this is actually an ingestion issue. The ``request_id`` value we keep when there are multiple ``request_ids`` in the list should be the LATEST one. I was saving the EARLIEST one. So, for coalesced jobs, I would try to retrigger the wrong job. In Ed's example above, the list of ``request_ids`` was: [53835378, 53835405]. I was saving element 0, where I should have been saving the latest one: the max value. Catlee: can you confirm this is the right ``request_id`` to use for retriggers?

Flags: needinfo?(catlee)

Chris AtLee [:catlee]

Comment 8

•

11 years ago

It's probably safer to use build_id for retriggers if you want to re-run exactly the same thing as ran before. We don't always coalesce to the highest request ids (see bugs 555664, 627603 for why), so I'm not sure if it's possible to determine which request in the end was chosen. For this reason retriggering based on the build_id is safer for what I think you're trying to do. OTOH if you're trying to retrigger a job that was coalesced away, then you need to use the request_id.

Flags: needinfo?(catlee)

Comment 9

•

11 years ago

I confirmed this fix with these two resultsets: http://local.treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&tochange=a2551ec63d3c&searchQuery=os%20x%2010.8%20debug&fromchange=49d26513f6be On the old code, when I retriggered the red build from change a2551ec63d3c, the retrigger showed in change 49d26513f6be. With the new code, it showed in the first change a2551ec63d3c.

Comment 10

•

11 years ago

Note: this won't fix old jobs that already have the wrong ``request_id`` in the artifact. This fixes new jobs coming in so that their ``buildapi`` artifact is now correct.

Comment 11

•

11 years ago

Attached file PR286 (obsolete) — Details

Updated

•

11 years ago

Attachment #8521056 - Flags: review?(emorley)

Assignee

Comment 12

•

11 years ago

Comment on attachment 8521056 [details] [review] PR286 Given comment 8, I think we may have to go with the more complicated approach (split build_id/request_id depending on job state) - thoughts? Either way, guess we should ingest the correct request_id for completed jobs (or even the whole list) rather than the current wrong one - if we're going to keep it in the data model.

Attachment #8521056 - Flags: review?(emorley)

Comment 13

•

11 years ago

Catlee-- Oops, I just saw your comment. So, wrt builds4hr.js, running.js, pending.js: which field is the right build_id? It seems to work with the ``id`` field from running.js. But I haven't had luck with any of the fields from builds4hr.js. Here's an example of those structures for a single job: https://treeherder.mozilla.org/api/project/mozilla-inbound/artifact/?job_id=3778396&type=json in each of those objects of the array, the ``blob`` field is the one from one of these files: 1. buildapi_complete: builds4hr.js 2. buildapi_running: running.js 3. buildapi_pending: pending.js 4. buildapi: our internal summary artifact So, I know if I have a pending job, I'd retrigger with the request_id. If it's running, I could use the /build/ endpoint and pass the build_id=55791253 (in the link provided). But for a complete job, I can't find what id from ``buildapi_complete`` to send to the /build/ endpoint. So I've been using the /request/ endpoint. Which id would be the right one?

Assignee

Updated

•

11 years ago

Keywords: regression

Comment 14

•

11 years ago

Commits pushed to master at https://github.com/mozilla/treeherder-service https://github.com/mozilla/treeherder-service/commit/b189b22fe9f2ac86240db055c67f32a020e84a94 bug 1093743 - correct request id for retrigger/cancel on coalesced jobs https://github.com/mozilla/treeherder-service/commit/52fff1c94f993dd88cd392dbb57e96276d5962fc bug 1093743 - revert change line that shouldn't be in this fix https://github.com/mozilla/treeherder-service/commit/22f3482143da0850ec30da4e57f4dc3c65c4aab2 Merge pull request #286 from mozilla/bug-1093743-retrigger Bug 1093743 retrigger fix

:Gijs (he/him)

Comment 15

•

11 years ago

I tried to retrigger (amongst others) win xp debug bc1 on: https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=ca998f9e1b71 Nothing is showing up there for me. However, https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla-inbound&revision=b8613576f657 seems to have more xp debug bc1 jobs than I retriggered there... Still more confusingly, self serve claims there are bc1 runs ongoing for ca998f9e1b71: https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/ca998f9e1b71

:Gijs (he/him)

Comment 16

•

11 years ago

(In reply to :Gijs Kruitbosch from comment #15) > I tried to retrigger (amongst others) win xp debug bc1 on: > > https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla- > inbound&revision=ca998f9e1b71 > > Nothing is showing up there for me. However, > https://treeherder.mozilla.org/ui/#/jobs?repo=mozilla- > inbound&revision=b8613576f657 seems to have more xp debug bc1 jobs than I > retriggered there... > > Still more confusingly, self serve claims there are bc1 runs ongoing for > ca998f9e1b71: > > https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/rev/ > ca998f9e1b71 Just realized those are linux runs, not windows... so it seems like this is essentially just not fixed.

:Gijs (he/him)

Comment 17

•

11 years ago

Retriggered the windows builds using self-serve instead, which seems to have worked.

Assignee

Updated

•

11 years ago

Component: Treeherder → Treeherder: Data Ingestion

Assignee

Updated

•

11 years ago

No longer blocks: treeherder-dev-transition

Comment 18

•

11 years ago

Hey Catlee: I'm back to working on this bug now. I never added a needinfo request. My bad... Would you take a look at my question in Comment 13 above and see if you can help me determine the right ``build id`` to use? Thanks!

Flags: needinfo?(catlee)

Updated

•

11 years ago

Depends on: 1127101

Assignee

Comment 19

•

11 years ago

Testcase broken: { "builder_id": 248158, "buildnumber": 346, "endtime": 1422472064, "id": 57666910, "master_id": 80, "properties": { "basedir": "/builds/slave/test", "blobber_files": "{\"qemu.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/75dc184489708c4ef6edbcf30fd5ebc9b95b4d000034a48d131d77e29489defed4cee754333676544e3563df0b3e6b582867d5a70f3816e2ac50e03cfb2ef006\", \"emulator-5554.log\": \"http://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/5f3dc4a19484d52abde92c8e01aebaf0d2a155799ef3a5a48263f6bab7b62c237613a5601238f280d250aca2605d69749665b2247b82f5ee924c394b5af67627\"}", "branch": "mozilla-inbound", "build_url": "http://pvtbuilds.pvt.build.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/20150128084950/emulator.tar.gz", "buildername": "b2g_emulator_vm mozilla-inbound opt test reftest-15", "buildid": "20150128084950", "buildnumber": 346, "builduid": "55f2a8ac293f411aac247203b95f8ec8", "log_url": "http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1422463790/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-reftest-15-bm67-tests1-linux64-build346.txt.gz", "master": "http://buildbot-master67.srv.releng.use1.mozilla.com:8201/", "pgo_build": "False", "platform": "emulator", "product": "b2g", "project": "", "repo_path": "integration/mozilla-inbound", "repository": "", "request_ids": [ 60457511, 60457203 ], "request_times": { "60457203": 1422465787, "60457511": 1422465962 }, "revision": "2c010ad45bd386f2e80e458c15a6f2d6e6d56255", "scheduler": "tests-mozilla-inbound-ubuntu64_vm-b2g-emulator-opt-unittest", "script_repo_revision": "bafcff7355f4", "script_repo_url": "https://hg.mozilla.org/build/mozharness", "slavebuilddir": "test", "slavename": "tst-linux64-spot-996", "stage_platform": "emulator" }, "reason": "scheduler", "request_ids": [ 60457203, 60457511 ], "requesttime": 1422465787, "result": 0, "slave_id": 7990, "starttime": 1422466140 }, Using: https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/60457203 https://secure.pub.build.mozilla.org/buildapi/self-serve/mozilla-inbound/request/60457511 request id 60457511 even though it was lower, was the one used (compare the revision against the builds4hr example above) - ie the last item in request_ids was used. Spot checking a bunch of other >1 length request_ids properties shows the last value in request_ids is the one that gets used there too. So seems like the solution here is to not use the max request_id: https://github.com/mozilla/treeherder-service/blob/master/treeherder/etl/buildapi.py#L256 'request_id': max(request_ids) ...but the last one in the list :-)

Assignee

Comment 20

•

11 years ago

Just to follow on from the Vidyo call - I think maybe a good way to test the theory in comment 19 would be to log at point of completed job ingestion - and actually do an two api calls to: https://secure.pub.build.mozilla.org/buildapi/self-serve/<project>/request/<request_id> ...one where we use max(request_ids) and one where we use request_ids[-1] - and then compare to the revision in builds4-hr and dump to the log which approach was correct (if they didn't both match).

Comment 21

•

11 years ago

So I did this manually with a builds-4hr job that had 3 request ids: "request_ids": [ 60472268, 60472306, 60472287 ], and checked the revisions with the url in comment 20. The first two ids point to different revisions, whereas the last points to the correct revision of: d975577991caa3a7f5aaff02f6ba5dbe16c66a7a I've tried this on several and it's always the last in the list that points to the right revision. So, Ed, I believe your hypothesis is confirmed. It's always the last in that array. A nice easy fix. Thanks so much for your help on this!

Comment 22

•

11 years ago

I ingested this job locally and was able to retrigger correctly on the right revision, as expected. So we are good here.

Comment 23

•

11 years ago

Attached file service pr 359 (obsolete) — Details

Attachment #8556213 - Flags: review?(emorley)

Updated

•

11 years ago

Attachment #8556213 - Attachment description: UI pr 259 → service pr 259

Assignee

Comment 24

•

11 years ago

Comment on attachment 8556213 [details] [review] service pr 359 Commented on the PR - thing one more change needed - and probably worth testing that's the right choice for running too.

Attachment #8556213 - Flags: review?(emorley)

Updated

•

11 years ago

Attachment #8556213 - Attachment description: service pr 259 → service pr 359

Updated

•

11 years ago

Attachment #8556213 - Flags: review?(emorley)

Assignee

Comment 25

•

11 years ago

Comment on attachment 8556213 [details] [review] service pr 359 Left a few comments on the PR, looks great overall :-)

Attachment #8556213 - Flags: review?(emorley)

Chris AtLee [:catlee]

Comment 26

•

11 years ago

hm, looking at the code, I don't think we publish the build_id in builds-4hr.js :( do you need it added at this point?

Flags: needinfo?(catlee)

Assignee

Comment 27

•

11 years ago

(In reply to Chris AtLee [:catlee] from comment #26) > hm, looking at the code, I don't think we publish the build_id in > builds-4hr.js :( > > do you need it added at this point? I think we've got an approach working using just the request id (we pick the last one in the request_ids list, and it seems to always be the correct one) - so don't need it for now.

Assignee

Comment 28

•

11 years ago

Tweaking summary since the PR not only means we choose the correct request_id, when there are several in request_ids (eg for a job that subsumed several others) but now also corrects how we calculate the job guid for certain job states. :-D tl;dr for times when there was coalescing, we would (a) have a wrong request_id so retriggers/cancellations wouldn't work, but (b) we'd merge the wrong jobs, leading to things like bug 1113142 and bug 1097328.

Blocks: 1059909

No longer depends on: 1127101

Summary: Treeherder occasionally retriggers jobs on the wrong push → Job ingestion sometimes picks the wrong request_id & calculates the guid incorrectly

Comment 31

•

11 years ago

Tried this with a running job and it works just fine.

Comment 32

•

10 years ago

Commit pushed to master at https://github.com/mozilla/treeherder-service https://github.com/mozilla/treeherder-service/commit/d400a8cb56add5f51b973162b78d06ab251c102c Bug 1093743 - use correct request id when multiple exist Bug 1097328 - fixed coalesced job_guid for running/complete jobs

Updated

•

10 years ago

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Assignee

Updated

•

10 years ago

Depends on: 1133455

Comment 33

•

10 years ago

Commit pushed to master at https://github.com/mozilla/treeherder-service https://github.com/mozilla/treeherder-service/commit/fd7cd206830bde2d5d0b882d838c1c7a196dacc3 Revert "Bug 1093743 - use correct request id when multiple exist" This reverts commit d400a8cb56add5f51b973162b78d06ab251c102c.

Assignee

Comment 34

•

10 years ago

Unfortunately whilst the changes here were technically more correct than the previous approach, due to inconsistencies in builds-4hr, we're seeing some issues with running jobs being orphaned/duplicated. For now, I've reverted the commit here (and its followup in bug 1133455 to fix the exception) so we can deploy the other changes in the meantime. For more info, see bug 1133455 comment 8 onwards. I think the best way forwards is to file bugs against builds-4hr - since I have performed some analysis locally and the inconsistencies are horrendous & mean there isn't actually a correct approach we can take, that will not leave orphaned jobs around. Will file a release engineering bug blocking this one soon.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Daniel Holbert [:dholbert]

Assignee

Updated

•

10 years ago

Flags: needinfo?(emorley)

Comment 35

•

10 years ago

FWIW, I've been hitting what appears to be this bug today. In case it helps to have another data point: I've been trying (unsuccessfully) to retrigger the opt "mochitest-9" job on this Try run: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=50913131140c&filter-ref_data_name=b2g_emulator_vm ...but the retriggers haven't been showing up (after repeated waiting, reloading, retriggering again, waiting, etc)

Chris AtLee [:catlee]

Comment 36

•

10 years ago

what does self-serve think about these requests?

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 37

•

10 years ago

Retriggers via self-serve work correctly.

Daniel Holbert [:dholbert]

Comment 38

•

10 years ago

(In reply to Chris AtLee [:catlee] from comment #36) > what does self-serve think about these requests? (My missing TreeHerder-spawned retriggers do not show up on self-serve) (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #37) > Retriggers via self-serve work correctly. I'll try a few of those now. So anyone looking at my treeherder link may see a few successful retriggers, which just come from me + self-serve.

Assignee

Updated

•

10 years ago

Depends on: 1141603

Assignee

Comment 39

•

10 years ago

I've finally filed bug 1141603 based on my analysis of builds-4hr, though things seem to have improved slightly since bug 1140612 has landed (will need to wait until we have a few days of data to analyse, to be sure).

Flags: needinfo?(emorley)

Assignee

Comment 41

•

10 years ago

Cameron's on PTO and I've been digging into the builds-4hr inconsistencies, so I'll drive this for the moment :-)

Assignee: cdawson → emorley

Assignee

Comment 43

•

10 years ago

This is blocked on bug 1141603 so not actionable for the moment.

Assignee: emorley → nobody

Priority: P1 → P2

Assignee

Comment 44

•

10 years ago

Going to have another go at driving answers in bug 1141603 for this.

Assignee: nobody → emorley

Assignee

Comment 45

•

10 years ago

It sounds like some of the symptoms here could also be due to the job being missing from builds-4hr and not just the interaction between treeherder and builds-4hr. See bug 1159279.

Depends on: 1159279

Assignee

Comment 46

•

10 years ago

Attached file Attempt to generate job_guids more consistently — Details

Here goes nothing..! :-)

Attachment #8521056 - Attachment is obsolete: true

Attachment #8556213 - Attachment is obsolete: true

Attachment #8601829 - Flags: review?(cdawson)

Assignee

Updated

•

10 years ago

Status: REOPENED → ASSIGNED

Comment 47

•

10 years ago

Sorry for the delay in reviewing this. I will get to it before EOD.

Updated

•

10 years ago

Attachment #8601829 - Flags: review?(cdawson) → review+

Comment 48

•

10 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/bd5cf6d775bf9a2562e778d08043bf49cabdb8ef Bug 1093743 - Attempt to generate job_guids more consistently In order that Treeherder can track a buildbot ingested job as it moves from pending to running to completed, a unique job_guid must be generated, since buildbot doesn't directly give us one. To make things worse, each of the three data sources (builds-{pending,running,4hr}) make slightly different properties available to us, and the data in them is not always correct. Until now we've been calculating the guid in such a way that as some jobs (particularly those that are coalesced) change state, the guid generated changes, causing us to think it's a brand new job, thereby leaving behind an orphaned pending/running job. The new method of generating a guid uses only the single request_id corresponding to the job that ran, rather than all of the request_ids. This is because when in the pending state, there is only ever one request id, so if we use the full list later on, there is no way it will match, if the job had others coalesced into it. In addition, the request_times properties are known to be unreliable (see bug 1141603), so we've stopped using them. This isn't a problem, since they didn't help us with the buildbot retry case anyway (where the request id is recycled), and we already handle that by using endtime. We instead now use the buildername to add uniqueness rather than the request_times.

Comment 49

•

10 years ago

Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/54727c2994d9f937bd6297dadd000580bfcfc891 Bug 1093743 - Don't overwrite the value of request_id It turns out that the variable used for the index in a loop persists outside of the loop in Python, so previously we were overwriting the original value of request_id with that of the index.

Assignee

Updated

•

10 years ago

Depends on: 1163496