Closed Bug 1141603 Opened 9 years ago Closed 8 years ago

Inconsistencies with request_ids, request_times and requesttime in builds-4hr.js that break Treeherder ingestion

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: emorley, Unassigned)

References

Details

(Keywords: treeherder)

Excuse the long winded context, but I think it is necessary to understand what we're trying to do:
* Unlike TBPL, Treeherder ingestions all of builds-{pending,running,4hr}.js server-side, and not just builds-4hr, since it lets us do much more than if pending/running is only overlaid temporarily client-side.
* However this means we have to track a job as it moves from pending -> running -> completed, so we don't end up with duplicate jobs in treeherder (ie one new job for each state). This is made extra hard, since:
 - Coalescing can occur, so we have to track which jobs were combined into only one completed build, and update the pending jobs so they display as coalesced and not forever-pending.
 - There isn't a consistent property across all of builds-{pending,running,4hr} that we can use to track a job, so we have to make our own.
 - If a job incurs a buildbot retry (result code 5), then the subsequent job has an identical request id/request time, so we have to try and fake a unique job id for these.

To generate a unique job ID in treeherder (from now on referred to as job guid), we use the following:
* the request ID
* the request time
* [and for buildbot retried jobs the end time (IMO we should use the start time, but that's for another bug)]

Now for builds-pending.js, these correspond to:
* request ID: data['pending'][repo][rev]['id']
* request time: data['pending'][repo][rev]['submitted_at']

However for builds-{running,4hr} we don't have specific properties for these, and instead have:

running...
* data['running'][repo][rev]['request_ids']
* data['running'][repo][rev]['submitted_at']

builds-4hr...
A mixture of (not all present at all times)
* data['builds']['properties']['request_ids']
* data['builds']['properties']['request_times']
* data['builds']['request_ids']
* data['builds']['requesttime']

Treeherder's current implementation for extracting the request ID and request time from the running/completed is not correct - and in bug 1093743 we've been trying to fix it. We came to the conclusion that the last element in request_ids is _the_ request ID that actually got picked, when it came to determining the properties for the job. ie for 3 jobs coalesced into one, the last element of request IDs was the request that set the revision on which to run the job.

However after landing https://github.com/mozilla/treeherder-service/commit/d400a8cb56add5f51b973162b78d06ab251c102c we experienced many zombie jobs on Treeherder. 

Analysis of builds-4hr showed many inconsistencies in the properties within, which means that we are currently unable to rely on the request IDs we extract from it.

For example:
* Some jobs have the request_ids key at the top level, some have it under the "properties" key, and some have it on both.
* For those that have it on both - it sometimes has a different value in each place (sometimes just a different order, sometimes a different number of ids in the list!).
* Some jobs are missing properties['request_times']
* Some jobs have a top level 'requesttime' (note: singular integer) set, others don't
* request_times is a dict of request IDs to request time - but sometimes the IDs listed in request_ids don't match the keys of request_times.
* For jobs where the top level integer 'requesttime' _is_ set (and using the request_times dict as mapping between IDs and times) - sometimes the last ID in request_ids (ie the one Cam and I were pretty sure corresponded to the actual request ID) does not match the request time in requesttime.
We really need:
1) For each job in builds-{running,4hr} a property containing _the_ request ID for that job, not a list of request IDs from which we have to select an ID and hope for the best.
2) Same as #1, but for request time (and not like the current top level 'requesttime' property which is broken)
3) Optionally: Clean up the properties available, such that we either stick everything under 'properties' or on the top level object, and stop mixing and matching.

I noticed bug 1140612 landed recently (my analysis was from before then) - I wonder if it would help at all. That said, having a standalone property from which we can fetch the canonical request id and time for a job would help make this much less fragile :-)
Flags: needinfo?(catlee)
(In reply to Ed Morley [:edmorley] from comment #1)
> I noticed bug 1140612 landed recently (my analysis was from before then) - I
> wonder if it would help at all. 

It has appeared to have helped slightly (either that or the current builds-4hr file is just more accurate than the last snapshot I took), but I'm still seeing this (in ~900 jobs out of 19000):

(In reply to Ed Morley [:edmorley] from comment #0)
> * For jobs where the top level integer 'requesttime' _is_ set (and using the
> request_times dict as mapping between IDs and times) - sometimes the last ID
> in request_ids (ie the one Cam and I were pretty sure corresponded to the
> actual request ID) does not match the request time in requesttime.

For example in this job:

{
...
  "id": 60857868, 
...
  "properties": {
    "buildername": "Windows 7 32-bit mozilla-inbound debug test mochitest-gl", 
...
    "request_ids": [
      63941077, 
      63941098, 
      63941555
    ], 
    "request_times": {
      "63941077": 1425990592, 
      "63941098": 1425990592, 
      "63941555": 1425990833
    }, 
...
  }, 
...
  "request_ids": [
    63941077, 
    63941098, 
    63941555
  ], 
  "requesttime": 1425990592, 
...
}

One would expect the last element in request IDs (63941555; which _should_ be the canonical ID for the job) to map to requesttime (1425990592), when it fact request_times['63941555'] is 1425990833.

I'm presuming this is due to the min() in bug 1140612 comment 8?
Are request ids ever re-used? Could we just stop using the combination of (request id + request time) and just use the request id? (For buildbot retries even the combination of the two isn't unique, so it doesn't add any value - unless we can somehow make buildbot retries use a different request time?)
I think the thing you're missing here is that the tests don't pick the earliest or latest requestid to run, they choose the earliest buildid to run. This was to fix bug 555664. It's normally the case that higher buildids have higher requestids for the tests, but not always.

Do you see this problem only with tests?
request ids are re-used for automated retries of a job.
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #5)
> request ids are re-used for automated retries of a job.

Please could you re-read comment 3? I know automated reties re-use them, but they also re-use the request time at present, so additionally using the request time doesn't seem to add any value - unless there's another case I'm missing? :-)
(In reply to Chris AtLee [:catlee] from comment #4)
> I think the thing you're missing here is that the tests don't pick the
> earliest or latest requestid to run, they choose the earliest buildid to
> run. This was to fix bug 555664. It's normally the case that higher buildids
> have higher requestids for the tests, but not always.
> 
> Do you see this problem only with tests?

No, builds too. I think we're talking about two different things.

To hopefully clarify:

1) What is the 'requesttime' property in builds-4hr supposed to represent? I thought it was "request time for the request whose properties [eg revision] were used to generate this job", whereas the use of min() in bug 1140612 implies it's actually "the request time of the earliest request that got bundled into this build after coalescing". If we can stop using request time (per comment 3 / comment 6) then perhaps we can just ignore this.

2) Why are there request_ids on both the job object and the properties key beneath it? Which should we use? Could we get rid of the other to avoid confusion?

3) Could we have a "request_id" property added, which is the request id out of request_ids that we're looking for?
Flags: needinfo?(catlee)
(In reply to Ed Morley [:edmorley] from comment #7)
> (In reply to Chris AtLee [:catlee] from comment #4)
> > I think the thing you're missing here is that the tests don't pick the
> > earliest or latest requestid to run, they choose the earliest buildid to
> > run. This was to fix bug 555664. It's normally the case that higher buildids
> > have higher requestids for the tests, but not always.
> > 
> > Do you see this problem only with tests?
> 
> No, builds too. I think we're talking about two different things.
> 
> To hopefully clarify:
> 
> 1) What is the 'requesttime' property in builds-4hr supposed to represent? I
> thought it was "request time for the request whose properties [eg revision]
> were used to generate this job", whereas the use of min() in bug 1140612
> implies it's actually "the request time of the earliest request that got
> bundled into this build after coalescing". If we can stop using request time
> (per comment 3 / comment 6) then perhaps we can just ignore this.

'requesttime' at the job level is the time of the earliest job that got coalesced. I don't think you should need to rely on them for anything.

> 2) Why are there request_ids on both the job object and the properties key
> beneath it? Which should we use? Could we get rid of the other to avoid
> confusion?

They weren't always available as properties. Since that's universal now (I think?), we could get rid of the job-level ones, but that would probably break some consumers.

> 3) Could we have a "request_id" property added, which is the request id out
> of request_ids that we're looking for?

That's bug 782874 I think?
Flags: needinfo?(catlee)
For bug 1093743 I'm going to rely on the following assumptions:
1) Apart from buildbot status 5 retries, the ids in request_ids are unique across all branches and are never re-used (we already have handling for the retries).
2) The element of the list in request_ids is _the_ request id that corresponds to what actually ran.
3) The top level request_ids is the canonical field, since the properties instance of it is missing for some jobs, eg:

* b2g_emulator_vm mozilla-central opt test reftest-1
Missing properties['request_ids']

* Ubuntu VM 12.04 mozilla-release pgo test cppunit
Missing properties['request_ids']

* b2g_mozilla-inbound_linux64-b2g-haz_dep
Missing properties['request_ids']

* b2g_emulator_vm try debug test mochitest-debug-19
Missing properties['request_ids']

* Ubuntu VM 12.04 x64 fx-team debug test mochitest-e10s-3
Missing properties['request_ids']

* Ubuntu VM 12.04 mozilla-inbound opt test web-platform-tests-2
Missing properties['request_ids']

* Windows XP 32-bit fx-team opt test mochitest-browser-chrome-3
Missing properties['request_ids']

* Ubuntu VM 12.04 mozilla-inbound opt test mochitest-devtools-chrome
Missing properties['request_ids']

* Android 4.2 x86 mozilla-inbound build
Missing properties['request_ids']
missing request_ids in properties could be bug 942616, no?
Bug 942616 definitely improved the situation here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.