Closed
Bug 1096605
Opened 10 years ago
Closed 9 years ago
Calculate UI job ETAs using only the average runtime, not average pending+running time
Categories
(Tree Management :: Treeherder, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jfkthame, Assigned: KWierso)
References
Details
(Keywords: regression)
Attachments
(1 file)
E.g. I currently (at 10:13 PM) have a job that's showing
Build: x86_64 linux64 linux
Job name: Build
Request Time: 11/10/14 9:25 PM
Start Time: 11/10/14 9:39 PM
Duration: 34 minute(s)
and the tooltip says
13 minutes overdue, typically takes 35 mins
I'm assuming that "typical" duration of 35 minutes is the (average) actual job running time, not including any wait for the job to start; but the job is being considered "overdue" because it's 48 minutes since the request was submitted, not since the job actually started.
The OS X Opt build on the same push has currently been running for only 13 minutes (it waited longer than Linux to start), but shows an ETA of 2 mins because, likewise, the ETA is being measured from the request time rather than the job start time.
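To make the discrepancy concrete, here is a minimal Python sketch using the timestamps from the Linux job above. The function and variable names are hypothetical and this is not Treeherder's actual code; it only contrasts measuring elapsed time from the request time with measuring it from the start time.

```python
from datetime import datetime


def minutes_remaining(reference_time, now, typical_minutes):
    """Typical duration minus the minutes elapsed since reference_time.

    A negative result means the job would be reported as overdue.
    """
    elapsed = (now - reference_time).total_seconds() / 60
    return typical_minutes - elapsed


# Values taken from the job described above.
request_time = datetime(2014, 11, 10, 21, 25)
start_time = datetime(2014, 11, 10, 21, 39)
now = datetime(2014, 11, 10, 22, 13)
typical = 35  # "typically takes 35 mins"

# Measured from the request time: 48 minutes elapsed, so 13 minutes overdue.
from_request = minutes_remaining(request_time, now, typical)

# Measured from the start time: 34 minutes elapsed, so about 1 minute left.
from_start = minutes_remaining(start_time, now, typical)
```

With these inputs the request-based figure reproduces the "13 minutes overdue" tooltip, while the start-based figure says the job is still just within its typical duration.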
Updated•10 years ago
Priority: -- → P2
Comment 3•10 years ago
Combining a few issues, since they are all related:
(In reply to Ed Morley [:edmorley] from bug 1087915 comment #0)
> 20:55 NeilAway is it known that pending jobs can show as overdue?
> 20:56 camd NeilAway: do you have an example of a job I could look at?
> 21:01 NeilAway camd: 8e21737b6ff7
> 21:01 camd NeilAway: is that mozilla-inbound?
> 21:03 NeilAway camd: try
> 21:06 camd NeilAway: yeah, I think that can happen when the machines are
> quite busy. We do our best to predict how soon a job will complete based on
> past jobs of the same type. But sometimes it can take a long time for a job
> to start running if the machine load is high.
> 21:07 camd that being said, if this seems like it's off-base, then please do
> file a bug and we'll take a look at it. :)
> 21:09 NeilAway camd: well, surely you should only start predicting the
> completion time once the job has actually started?
> 21:13 camd NeilAway: I *think* the dev on that had tried to predict based on
> the time it took a pending job of that type to get all the way through to
> completion. I'd have to double-check.
> 21:19 camd NeilAway: yeah, it looks like he calculates an average based on
> the time, which will include the time to start. But that time can vary a
> fair bit depending on infrastructure load.
(In reply to Geoff Brown [:gbrown] from bug 1123419 comment #0)
> The Android x86 "S4" test job seems to run in about 40 minutes and "ETA to
> completed" estimates on mozilla-inbound reflect that. However, in a try push
> - https://treeherder.mozilla.org/#/jobs?repo=try&revision=aadc3ecc0aa9 - I
> see
>
> ETA to completed: 132 minutes
> Duration: 31 minutes
>
> I think I have seen this sort of gross over-estimate before, but can't
> recall specifics. If this isn't a known problem, I'll keep an eye out for
> this condition and report more examples here.
Keywords: regression
Summary: treeherder's "ETA to completed" seems to be based on the typical run time of the job, but is measured from job request time rather than when job actually started → Job ETAs are often wrong
Comment 4•10 years ago
We should:
1) Not show an ETA for pending jobs - or at least show something more like "Estimated job runtime, once it starts: N mins" - since any prediction about how long a job will stay pending is not going to be accurate.
2) Make sure we're storing historic runtimes based on job start time, not request time.
3) Make sure infrequently used repos don't show wacky ETAs (comment 3, though this might have been caused by #2).
4) Check we're happy with how often we refresh the historic data & what time range it covers (to account for time of day and weekday vs weekend variations).
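Points 2 and 3 could be sketched roughly as follows. This is an illustration under assumptions, not the real Treeherder task: the field names `submit`, `start` and `end` (epoch seconds) and the `min_samples` threshold are hypothetical.

```python
from statistics import mean


def average_runtime_minutes(jobs, min_samples=5):
    """Average runtime in minutes, measured from start time to end time.

    The hypothetical 'submit' field is deliberately ignored, so time spent
    pending never leaks into the average. Returns None when there are too
    few samples to trust, e.g. for an infrequently used repo.
    """
    runtimes = [
        (job["end"] - job["start"]) / 60
        for job in jobs
        if job["end"] > job["start"] > 0  # skip missing or bogus timestamps
    ]
    if len(runtimes) < min_samples:
        return None
    return mean(runtimes)


history = [
    {"submit": 1000, "start": 1840, "end": 1840 + 35 * 60},
    {"submit": 5000, "start": 5060, "end": 5060 + 33 * 60},
    {"submit": 9000, "start": 12000, "end": 12000 + 37 * 60},
]
typical = average_runtime_minutes(history, min_samples=3)
```

Returning None rather than a number lets the UI omit the ETA entirely when the history is too thin, instead of showing a misleading estimate.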
Comment 5•10 years ago
Also:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=a6bbabebed2f&filter-searchStr=Android armv7 API 10%2B mozilla-central nightly
Hovering over the "N" gives a tooltip of:
"Nightlyusercancel - -23700573 mins"
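One plausible reading of that figure, not confirmed anywhere in this bug, is that a zero or missing timestamp was subtracted from the current time. The arithmetic below only shows that 23700573 minutes, treated as a Unix timestamp, lands in January 2015, around when this comment was written.

```python
from datetime import datetime, timezone

overdue_minutes = 23700573  # magnitude of the value shown in the tooltip

# Interpret the magnitude as minutes since the Unix epoch.
implied = datetime.fromtimestamp(overdue_minutes * 60, tz=timezone.utc)
implied_date = implied.date().isoformat()
```

If that reading is right, the tooltip was effectively computing (0 - now) / 60, so treating a zero timestamp as "no estimate available" would avoid the nonsense value.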
Comment 6•10 years ago
(In reply to Ed Morley [:edmorley] from comment #4)
> We should:
...
5) Check that we're not including coalesced jobs in the ETA calculations.
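A minimal sketch of point 5, assuming each job record carries a hypothetical `coalesced` flag (the real schema may represent this differently). Coalesced jobs presumably never ran, so a near-zero recorded duration would drag the average down.

```python
def runtime_samples(jobs):
    """Yield runtimes in minutes, skipping jobs flagged as coalesced."""
    for job in jobs:
        if job.get("coalesced"):
            continue  # never actually ran, so it has no meaningful runtime
        yield (job["end"] - job["start"]) / 60


jobs = [
    {"start": 0, "end": 2400, "coalesced": False},
    {"start": 0, "end": 0, "coalesced": True},
    {"start": 600, "end": 2760, "coalesced": False},
]
samples = list(runtime_samples(jobs))
```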
Comment 7•10 years ago
Note that this problem seems to hit every Windows 8 x64 opt Nightly build I have seen in the last couple of days. They are shown with an expected run time of ~90 minutes, but the actual run time is more like 250+ minutes.
I'm not familiar with treeherder, but does it use the same data as TBPL for that estimation? Because TBPL correctly estimates the time left; Example:
Treeherder: "69 mins overdue, typically takes ~ 90 mins "
TBPL (on hover): "Nightly opt is still running, ETA ~102mins"
Comment 8•10 years ago
Their approaches are completed different:
TBPL averages the runtime for all green jobs of that type shown in the current view (which is why on a Try run &rev=SHA type page there is no estimate, since no other green job of the same type is visible).
Treeherder, on the other hand, runs a task every day (or on some other schedule) that calculates average run times. This approach isn't bad in itself; we're just doing it wrong (comment 4 and others).
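The TBPL-style approach described above could be sketched like this. TBPL itself is not written in Python, and the field names `type`, `result` and `runtime_minutes` are hypothetical; the sketch only illustrates why a single-revision view yields no estimate.

```python
def view_based_estimate(visible_jobs, job_type):
    """Average runtime of green jobs of job_type in the current view.

    Returns None when no green job of that type is visible, which is why a
    single-revision Try page would show no estimate.
    """
    runtimes = [
        job["runtime_minutes"]
        for job in visible_jobs
        if job["type"] == job_type and job["result"] == "success"
    ]
    if not runtimes:
        return None
    return sum(runtimes) / len(runtimes)


visible = [
    {"type": "Build", "result": "success", "runtime_minutes": 34},
    {"type": "Build", "result": "busted", "runtime_minutes": 5},
    {"type": "Build", "result": "success", "runtime_minutes": 36},
]
build_estimate = view_based_estimate(visible, "Build")
```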
Comment 9•10 years ago
completely, even
Comment 10•10 years ago
Ok, thank you for clearing that up. Then that's just a case of gross under-estimation / miscalculation.
Comment 11•10 years ago
And it looks like this case may already be fixed, because the ETA of that build is correct on treeherder.allizom.org. Shame on me for only checking .mozilla.org m(
Updated•10 years ago
Priority: P2 → P3
Assignee
Comment 12•9 years ago
This:
1) Stops using pending ETAs in the calculations for how much time is left for the current job.
2) Makes pending jobs always display the typical ETA, not how much time is left for the current job, since there's no way to estimate how long pending->running will take.
This does not:
1) Stop the pending_eta calculations from happening every day.
2) Remove the pending_eta fields from any databases.
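The behaviour described in this comment could be sketched as below. This is not the code from the attached PR; the function name and the exact tooltip strings are illustrative assumptions.

```python
def eta_text(state, typical_minutes, minutes_running=0):
    """Build a tooltip string from the average runtime only.

    Pending jobs always show the typical duration, since the wait before a
    job starts cannot be predicted. Running jobs show the time left (or
    overdue), counted from the moment the job started.
    """
    if state == "pending":
        return f"typically takes ~{typical_minutes} mins once started"
    remaining = typical_minutes - minutes_running
    if remaining >= 0:
        return f"ETA ~{remaining} mins, typically takes ~{typical_minutes} mins"
    return f"{-remaining} mins overdue, typically takes ~{typical_minutes} mins"


pending_text = eta_text("pending", 35)
running_text = eta_text("running", 35, minutes_running=34)
overdue_text = eta_text("running", 90, minutes_running=159)
```

Because `minutes_running` is counted from the job's start, time spent pending can no longer make a freshly started job look overdue.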
Attachment #8626441 - Flags: review?(emorley)
Assignee
Updated•9 years ago
Assignee: nobody → wkocher
Comment 13•9 years ago
Comment on attachment 8626441
PR 676
Looks good from my point of view - it should fix most of the inaccuracy issues - we can handle the rest (and the DB cleanup) in another bug.
Asking Cameron to have a glance too, since he's more familiar with the ETA implementation.
Attachment #8626441 - Flags: review?(emorley)
Attachment #8626441 - Flags: review?(cdawson)
Attachment #8626441 - Flags: feedback+
Comment 14•9 years ago
Comment on attachment 8626441
PR 676
I didn't really have much of a hand in implementing this originally. But this all looks great to me. Makes sense.
I think our original intent (from my recollection of conversations) was that the ETA on pending could give an indication of a releng provisioning problem. Though it sounds like it just wasn't useful.
Attachment #8626441 - Flags: review?(cdawson) → review+
Comment 15•9 years ago
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/fe9c55d233068f1e57d1893a6f1b529a71c0293a
Bug 1096605 - Don't include time spend pending in time estimates, and change pending jobs to always display typical ETAs
https://github.com/mozilla/treeherder/commit/b435ae55c48cbee74c44f37919e3622741724999
Merge pull request #676 from KWierso/1096605
Bug 1096605 - Don't include time spent pending in time estimates, and change pending jobs to always display typical ETAs r=camd
Assignee
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
Summary: Job ETAs are often wrong → Calculate UI job ETAs using only the average runtime, not average pending+running time