Closed Bug 1096605 Opened 10 years ago Closed 9 years ago

Calculate UI job ETAs using only the average runtime, not average pending+running time

Categories

(Tree Management :: Treeherder, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jfkthame, Assigned: KWierso)

References

Details

(Keywords: regression)

Attachments

(1 file)

E.g. I currently (at 10:13 PM) have a job that's showing

Build: x86_64 linux64 linux
Job name: Build
Request Time: 11/10/14 9:25 PM
Start Time: 11/10/14 9:39 PM
Duration: 34 minute(s)

and the tooltip says

13 minutes overdue, typically takes 35 mins

I'm assuming that "typical" duration of 35 minutes is the (average) actual job running time, not including any wait for the job to start; but the job is being considered "overdue" because it's 48 minutes since the request was submitted, not since the job actually started.

The OS X Opt build on the same push has currently only been running for 13 minutes (it waited longer than Linux to start), but shows an ETA of 2 mins because likewise, it's basing the ETA on the request time rather than the job start time.
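To make the arithmetic in the report concrete, here is a minimal sketch using the times quoted above. The function name `minutes_remaining` is hypothetical and is not taken from Treeherder; it only shows how the same 35-minute typical duration gives "13 minutes overdue" when measured from the request time, but about 1 minute remaining when measured from the start time.

```python
from datetime import datetime


def minutes_remaining(reference: datetime, now: datetime, typical_minutes: int) -> int:
    """Typical duration minus whole minutes elapsed since `reference`.

    A negative result means the job is overdue by that many minutes.
    (Hypothetical helper for illustration, not Treeherder code.)
    """
    elapsed = int((now - reference).total_seconds() // 60)
    return typical_minutes - elapsed


# Times taken from the report above (11/10/14, all PM).
request_time = datetime(2014, 11, 10, 21, 25)
start_time = datetime(2014, 11, 10, 21, 39)
now = datetime(2014, 11, 10, 22, 13)

# Measured from the request: 48 minutes elapsed, so 13 minutes overdue.
from_request = minutes_remaining(request_time, now, 35)

# Measured from the actual start: 34 minutes elapsed, so 1 minute left.
from_start = minutes_remaining(start_time, now, 35)

print(from_request, from_start)
```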
Priority: -- → P2
Combining a few issues, since they are all related:

(In reply to Ed Morley [:edmorley] from bug 1087915 comment #0)
> 20:55 NeilAway is it known that pending jobs can show as overdue?
> 20:56 camd NeilAway: do you have an example of a job I could look at?
> 21:01 NeilAway camd: 8e21737b6ff7
> 21:01 camd NeilAway: is that mozilla-inbound?
> 21:03 NeilAway camd: try
> 21:06 camd NeilAway: yeah, I think that can happen when the machines are quite busy. We do our best to predict how soon a job will complete based on past jobs of the same type. But sometimes it can take a long time for a job to start running if the machine load is high.
> 21:07 camd that being said, if this seems like it's off-base, then please do file a bug and we'll take a look at it. :)
> 21:09 NeilAway camd: well, surely you should only start predicting the completion time once the job has actually started?
> 21:13 camd NeilAway: I *think* the dev on that had tried to predict based on the time it took a pending job of that type to get all the way through to completion. I'd have to double-check.
> 21:19 camd NeilAway: yeah, it looks like he calculates an average based on the time, which will include the time to start. But that time can vary a fair bit depending on infrastructure load.

(In reply to Geoff Brown [:gbrown] from bug 1123419 comment #0)
> The Android x86 "S4" test job seems to run in about 40 minutes and "ETA to completed" estimates on mozilla-inbound reflect that. However, in a try push - https://treeherder.mozilla.org/#/jobs?repo=try&revision=aadc3ecc0aa9 - I see
>
> ETA to completed: 132 minutes
> Duration: 31 minutes
>
> I think I have seen this sort of gross over-estimate before, but can't recall specifics. If this isn't a known problem, I'll keep an eye out for this condition and report more examples here.
Keywords: regression
Summary: treeherder's "ETA to completed" seems to be based on the typical run time of the job, but is measured from job request time rather than when job actually started → Job ETAs are often wrong
We should:
1) Not show an ETA for pending jobs - or at least show something more like "Estimated job runtime, once it starts: N mins" - since any prediction about how long a job will stay pending is not going to be accurate.
2) Make sure we're storing historic runtimes based on job start time, not request time.
3) Make sure infrequently used repos don't show wacky ETAs (comment 3, though this might have been caused by #2).
4) Check we're happy with how often we refresh the historic data & what time range it covers (to account for time of day and weekday vs weekend variations).
Also:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=a6bbabebed2f&filter-searchStr=Android armv7 API 10%2B mozilla-central nightly

Hovering over the "N" gives a tooltip of:
"Nightlyusercancel - -23700573 mins"
(In reply to Ed Morley [:edmorley] from comment #4)
> We should:
...
5) Check that we're not including coalesced jobs in the ETA calculations.
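Points 2 and 5 of the list can be sketched roughly as follows. This is only an illustration under stated assumptions: the job records are plain dicts with hypothetical `start`, `end` (epoch seconds) and `result` keys, which is not Treeherder's actual schema, and `average_runtime_minutes` is an invented name.

```python
from statistics import mean


def average_runtime_minutes(jobs):
    """Average running time in minutes, measured from start (not request) time.

    Coalesced jobs and jobs missing a start or end timestamp are skipped.
    Returns None when no usable job remains. (Hypothetical sketch; the dict
    keys are assumptions, not Treeherder's data model.)
    """
    durations = [
        (job["end"] - job["start"]) / 60
        for job in jobs
        if job["result"] != "coalesced"
        and job["start"] is not None
        and job["end"] is not None
    ]
    return round(mean(durations)) if durations else None


history = [
    # Waited 10 minutes before starting, then ran for 33 minutes.
    {"submit": 0, "start": 600, "end": 2580, "result": "success"},
    # Waited 50 minutes before starting, then ran for 37 minutes.
    {"submit": 0, "start": 3000, "end": 5220, "result": "success"},
    # Coalesced job: never really ran, so it must not drag the average down.
    {"submit": 0, "start": 100, "end": 100, "result": "coalesced"},
]

typical = average_runtime_minutes(history)
print(typical)
```

With the pending time left out, the very different waits of the two real jobs have no effect on the estimate.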
Note that this problem seems to hit every Windows 8 x64 opt Nightly build I saw in the last couple days. They are shown to have an expected run-time of ~90 minutes, but instead it's more like ~250+ minutes.

I'm not familiar with treeherder, but does it use the same data as TBPL for that estimation? Because TBPL correctly estimates the time left.

Example:
Treeherder: "69 mins overdue, typically takes ~ 90 mins "
TBPL (on hover): "Nightly opt is still running, ETA ~102mins"
Their approaches are completed different: TBPL averages the runtime for all green jobs of that type shown in the current view (hence why on a Try run &rev=SHA type page, there is no estimation, since there is not another green job of the same page visible). Treeherder on the other hand, runs a task every day (or some other timeframe) that calculates average run times. This approach isn't bad itself, we're just doing it wrong (comment 4 and others).
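The TBPL-style estimate described here could look roughly like the sketch below. The function name and dict keys are hypothetical; the point is only that the average comes from green jobs of the same type that happen to be in the current view, so a view with no such job yields no estimate.

```python
def eta_from_visible_jobs(visible_jobs, job_type):
    """Average runtime (minutes) of green jobs of `job_type` in the current view.

    Returns None when the view has no green job of that type, which matches
    the "no estimation" case for a single-push Try page.
    (Hypothetical sketch, not TBPL's actual code.)
    """
    runtimes = [
        job["runtime_minutes"]
        for job in visible_jobs
        if job["type"] == job_type and job["result"] == "success"
    ]
    if not runtimes:
        return None
    return sum(runtimes) / len(runtimes)


view = [
    {"type": "Nightly opt", "result": "success", "runtime_minutes": 240},
    {"type": "Nightly opt", "result": "success", "runtime_minutes": 260},
    {"type": "Nightly opt", "result": "busted", "runtime_minutes": 12},
]

print(eta_from_visible_jobs(view, "Nightly opt"))
print(eta_from_visible_jobs(view, "Build"))
```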
completely, even
Ok, thank you for clearing that up. Then that's just a case of gross under-estimation / miscalculation.
And it looks like this case may already be fixed, because the ETA of that build is correct on treeherder.allizom.org. Shame on me for only checking .mozilla.org m(
Priority: P2 → P3
Attached file PR 676
This:
1) Stops using pending ETAs in the calculations for how much time is left for the current job.
2) Makes pending jobs always display the typical ETA, not how much time is left for the current job, since there's no way to estimate how long pending->running will take.

This does not:
1) Stop the pending_eta calculations from happening every day.
2) Remove the pending_eta fields from any databases.
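The behaviour described above can be sketched as follows. This is not the code from the PR; `eta_tooltip` and its parameters are hypothetical, and the strings are only modelled on the tooltips quoted earlier in this bug.

```python
def eta_tooltip(state, typical_minutes, minutes_running=0):
    """Build tooltip text from the average *running* time only.

    Pending jobs always show the typical runtime, because the wait before a
    job starts cannot be predicted. Running jobs count down from the moment
    they started. (Hypothetical sketch, not the actual Treeherder UI code.)
    """
    if state == "pending":
        return f"typically takes ~{typical_minutes} mins"

    remaining = typical_minutes - minutes_running
    if remaining >= 0:
        return f"ETA ~{remaining} mins, typically takes ~{typical_minutes} mins"
    return f"{-remaining} mins overdue, typically takes ~{typical_minutes} mins"


# The Linux build from the original report: running for 34 of ~35 minutes.
print(eta_tooltip("running", 35, minutes_running=34))

# A pending job never shows as overdue, however long it has been waiting.
print(eta_tooltip("pending", 35))
```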
Attachment #8626441 - Flags: review?(emorley)
Assignee: nobody → wkocher
Comment on attachment 8626441
PR 676

Looks good from my point of view - it should fix most of the inaccuracy issues - we can handle the rest (and the DB cleanup) in another bug.

Asking Cameron to have a glance too, since he's more familiar with the ETA implementation.
Attachment #8626441 - Flags: review?(emorley)
Attachment #8626441 - Flags: review?(cdawson)
Attachment #8626441 - Flags: feedback+
Comment on attachment 8626441
PR 676

I didn't really have much of a hand in implementing this originally. But this all looks great to me. Makes sense. I think our original intent (from my recollection of conversations) was that the ETA on pending could give an indication of a releng provisioning problem. Though it sounds like it just wasn't useful.
Attachment #8626441 - Flags: review?(cdawson) → review+
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/fe9c55d233068f1e57d1893a6f1b529a71c0293a
Bug 1096605 - Don't include time spend pending in time estimates, and change pending jobs to always display typical ETAs

https://github.com/mozilla/treeherder/commit/b435ae55c48cbee74c44f37919e3622741724999
Merge pull request #676 from KWierso/1096605

Bug 1096605 - Don't include time spent pending in time estimates, and change pending jobs to always display typical ETAs r=camd
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Summary: Job ETAs are often wrong → Calculate UI job ETAs using only the average runtime, not average pending+running time
Blocks: 1181572