Closed Bug 1420078 Opened 2 years ago Closed 2 years ago

Intermittent underreported Taskcluster OS X and Windows Aborting task - max run time exceeded!

Categories

(Taskcluster :: General, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED
mozilla60

People

(Reporter: philor, Assigned: rwood)

Details

(Whiteboard: [stockwell infra])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1374170 +++

Because of bug 1333957, you will never know how often it happens.

https://treeherder.mozilla.org/logviewer.html#?job_id=146497824&repo=autoland
If you take that fellow right there, the one named t-yosemite-r7-239, out and shoot him, the frequency of this failure will drop by one third.
There have been 31 failures in the last week.
Almost all the failures occur on OS X 10.10 / opt. There are some other failures on OS X 10.10 / debug and one on Windows 7 / debug.



Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=157527324

This is blocked by bug 1333957.
Whiteboard: [stockwell needswork]
Whiteboard: [stockwell needswork] → [stockwell infra]
There are 121 failures in the past 7 days, most of them on OS X 10.10 opt/debug and windows10-64-ccov debug, with a few occurrences on windows10-64 debug, Windows 2012 x64 debug and windows2012-32 debug, and one on Windows 7 debug.

Recent log event: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=158998234
 
Waiting for bug 1333957 to be fixed (no new updates in the past 2 months)
There are 69 failures in the past 7 days.
Most of them are on OS X 10.10 opt/debug and Windows 7 opt/debug, with some occurrences on windows10-64 debug, windows10-64-ccov and macosx64-nightly opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=160394706
Many of these failures are talos h2 and tp6. h2 is in the process of being disabled, so the rate should go down.
This bug has failed 44 times in the last 7 days, mainly on OS X 10.10, with a few occurrences on Windows 10, affecting opt, pgo and debug build types.

Failing tests: opt-talos-tp6-e10, debug-mochitest-e10, opt-talos-tp6-stylo-threads-e10, opt-talos-speedometer-e10.

Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=162322109

Waiting for a resolution on bug 1333957, which apparently will be fixed very soon.
:rwood, we might want to look at the talos issues here to see if there is a delay we can fix, or whether we should extend the expected runtime.
Flags: needinfo?(rwood)
Update: 

There are 51 failures in the last 7 days.

Failures per platform and build type:
- OS X 10.10: 39
- windows10-64: 8
- Windows 7: 3
- windows10-64-ccov: 1

- debug: 19
- opt: 25
- pgo: 7

Recent log and snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=164114227

Waiting for a resolution on bug 1333957.
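
As a quick sanity check on the update above, the per-platform and per-build-type counts should each add up to the 51 reported failures. A minimal Python sketch, using only the numbers quoted in that update (the `share` helper is just an illustrative name):

```python
# Counts copied from the weekly update above.
by_platform = {
    "OS X 10.10": 39,
    "windows10-64": 8,
    "Windows 7": 3,
    "windows10-64-ccov": 1,
}
by_build_type = {"debug": 19, "opt": 25, "pgo": 7}
reported_total = 51


def share(counts):
    """Return each entry's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


platform_total = sum(by_platform.values())
build_total = sum(by_build_type.values())

# Both breakdowns should agree with the headline number.
print(platform_total == build_total == reported_total)
print(share(by_platform))
print(share(by_build_type))
```

Both breakdowns match the headline figure, and OS X 10.10 accounts for roughly three quarters of the failures that week.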
Comment on attachment 8955232 [details]
Bug 1420078 - Increase some talos job max runtimes for some tests that are borderline and intermittently timing out;

https://reviewboard.mozilla.org/r/224392/#review230350

I suspect these might be OS X-only changes. Possibly we can optimize our runtime, which would have the advantage of failing faster, but for now I don't think it is critical.
Attachment #8955232 - Flags: review?(jmaher) → review+
Thanks, some of them were really close and were running fine but just needed a bit longer. Yes, ideally they could be optimized better.

This patch (comment 29) will help with the talos jobs, but a lot of the other intermittent failures in this bug are mochitest and will need to be looked at / assigned to the mochitest triage owner, and should probably be filed in a separate bug.
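
To illustrate what "borderline" means here, a job can be treated as borderline when its observed duration lands within some margin of its max run time, or goes over it. The following is only a sketch of that idea and not the method used in the patch: the function name, the 10% margin, the job names and the durations are all hypothetical.

```python
def borderline_jobs(durations_s, max_run_time_s, margin=0.10):
    """Return names of jobs that ran within `margin` of the limit, or past it.

    durations_s: mapping of job name -> observed duration in seconds.
    max_run_time_s: the configured max run time in seconds.
    """
    threshold = max_run_time_s * (1 - margin)
    return sorted(name for name, d in durations_s.items() if d >= threshold)


# Hypothetical durations, for illustration only; not taken from any real log.
sample = {"job-a": 3500, "job-b": 2100, "job-c": 3605}
print(borderline_jobs(sample, max_run_time_s=3600))
```

Jobs flagged this way are candidates for either a longer max run time or a closer look at why they run so long.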
Flags: needinfo?(rwood)
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4c6f709a5e54
Increase some talos job max runtimes for some tests that are borderline and intermittently timing out; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/4c6f709a5e54
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
:rwood hi, I'm not sure this is fixed. Please take a look at the following logs, one from inbound and one from mozilla-central:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=165461473

and

https://treeherder.mozilla.org/logviewer.html#?job_id=165483051&repo=mozilla-central
Flags: needinfo?(rwood)
(In reply to Andreea Pavel [:apavel] from comment #35)
> :rwood hi, i'm not sure this is fixed. Please take a look at the following
> logs, one from inbound and one from mozilla-central:
> 
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=165461473
> 
> and
> 
> https://treeherder.mozilla.org/logviewer.html#?job_id=165483051&repo=mozilla-central

Hi Andreea, the fix I landed is for talos only. The tests in the logs you point to are mochitest and reftest. Separate bugs should be filed for each of those, and the triage owners for those areas (not sure who they are) should be asked to take a look. Thanks! :)
Flags: needinfo?(rwood)
Robert, thanks for clarifying. However, we classified failures here that were on Windows or Mac and had max run time exceeded: not just talos jobs but also mochitest, jreftest, etc.

Aryx: should we create clones for the rest?
Flags: needinfo?(aryx.bugmail)
Yes, please create new bugs in the Testing product, under the component for the related test suite (e.g. mochitest, reftest, xpcshell test), if you see this issue again. Also add them to the etherpad.
Flags: needinfo?(aryx.bugmail)
Assignee: nobody → rwood