Intermittent Raptor [taskcluster:error] Task aborted - max run time exceeded

Status: REOPENED
Type: defect
Priority: P5
Severity: normal
Keywords: intermittent-failure
Whiteboard: [stockwell unknown]
Firefox Tracking Flags: firefox65 fixed
Target Milestone: mozilla65
Reporter: intermittent-bug-filer
Assignee: rwood (NeedInfo)
Attachments: 1
Opened: 9 months ago
Last modified: 2 days ago

Filed by: dvarga [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=207703467&repo=mozilla-central

https://queue.taskcluster.net/v1/task/NeLAGCKfT6OGF_hjm6pG2w/runs/0/artifacts/public/logs/live_backing.log

22:42:10     INFO -  Application command: /home/cltbld/tasks/task_1540446089/build/application/firefox/firefox -profile /tmp/tmpi_0lwD.mozrunner
22:42:11     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:11] "GET /raptor-wasm-misc-firefox.json HTTP/1.1" 200 -
22:42:11     INFO -  raptor-control-server reading test settings from raptor-wasm-misc-firefox.json
22:42:11     INFO -  raptor-control-server sent test settings to web ext runner
22:42:26     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:26] "POST / HTTP/1.1" 200 -
22:42:26     INFO -  raptor-control-server received webext_status: * pausing 30 seconds to let browser settle... *
22:42:56     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:56] "POST / HTTP/1.1" 200 -
22:42:56     INFO -  raptor-control-server received webext_status: running 5 pagecycles of http://127.0.0.1:55673/wasm-misc/index.html?raptor
22:42:56     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:56] "POST / HTTP/1.1" 200 -
22:42:56     INFO -  raptor-control-server received webext_status: opened new empty tab 2
22:42:57     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:57] "POST / HTTP/1.1" 200 -
22:42:57     INFO -  raptor-control-server received webext_status: begin pagecycle 1
22:42:58     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:58] "POST / HTTP/1.1" 200 -
22:42:58     INFO -  raptor-control-server received webext_status: update tab 2
22:42:58     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:58] "POST / HTTP/1.1" 200 -
22:42:58     INFO -  raptor-control-server received webext_status: test tab updated 2
22:43:54     INFO -  127.0.0.1 - - [24/Oct/2018 22:43:54] "POST / HTTP/1.1" 200 -
22:43:54     INFO -  raptor-control-server received webext_status: begin pagecycle 2
22:43:55     INFO -  127.0.0.1 - - [24/Oct/2018 22:43:55] "POST / HTTP/1.1" 200 -
22:43:55     INFO -  raptor-control-server received webext_status: update tab 2
22:43:55     INFO -  127.0.0.1 - - [24/Oct/2018 22:43:55] "POST / HTTP/1.1" 200 -
22:43:55     INFO -  raptor-control-server received webext_status: test tab updated 2
[taskcluster:error] Aborting task...
[taskcluster 2018-10-25T06:11:30.432Z] === Task Finished ===
[taskcluster 2018-10-25T06:11:30.432Z] Task Duration: 29m59.360604652s
[taskcluster 2018-10-25T06:11:30.822Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:31.573Z] Uploading artifact public/logs/raptor_critical.log from file logs/raptor_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:31.956Z] Uploading artifact public/logs/raptor_error.log from file logs/raptor_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:32.436Z] Uploading artifact public/logs/raptor_fatal.log from file logs/raptor_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:32.826Z] Uploading artifact public/logs/raptor_info.log from file logs/raptor_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:33.274Z] Uploading artifact public/logs/raptor_raw.log from file logs/raptor_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:33.729Z] Uploading artifact public/logs/raptor_warning.log from file logs/raptor_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster:error] Task aborted - max run time exceeded
Blocks: 1510400
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab6d46128df2
Increase max time limit for Raptor jobs on qr and ux builds to prevent intermittent timeout; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/ab6d46128df2
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
(In reply to Cosmin Sabou [:CosminS] from comment #12)
> This looks like some timeouts still happen:
> https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=215205588&repo=mozilla-inbound&lineNumber=543
>
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=testfailed%2Cbusted%2Cexception&group_state=expanded&revision=c758d2a5b8cb04b1b5e5188a309d2994ad430d80&searchStr=linux%2Cx64%2Cquantumrender%2Copt%2Craptor%2Cperformance%2Ctests%2Con%2Cfirefox%2Cwith%2Ce10s%2Ctest-linux64-qr%2Fopt-raptor-motionmark-animometer-firefox-e10s%2Crap-e10s%28mm-a%29&selectedJob=215205588
> 
> Robert could you take a look into this? Thank you.

Thanks, yes, it looks like motionmark and maybe tp6-5 need more time. Let's leave it for a few days so we can get more data, then I'll update the max time again where required.
Flags: needinfo?(rwood)

In the last 7 days there have been 27 occurrences on android-hw, linux 64, windows 10, build types asan, opt, pgo.

I noticed that when raptor tp6m cold load tests fail, it's often because they just ran out of time (30 minutes), e.g.:
https://taskcluster-artifacts.net/Z59qzIqqQUGWJ_9OAzcTSQ/0/public/logs/live_backing.log

Many of the tp6m cold load tests on fenix seem to have this problem.

Can we consider increasing the max-run-time for android, or android cold load tests?
https://searchfox.org/mozilla-central/rev/0da35261b6789eec65476dbdd4913df6e235af6d/taskcluster/ci/test/raptor.yml#9
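
For illustration, a rough sketch of the kind of tweak that could go into taskcluster/ci/test/raptor.yml is below; the platform regexes and values are assumptions made up for this example, not a proposed patch, and should be checked against the file linked above:

    job-defaults:
        max-run-time:
            by-test-platform:
                android-hw.*: 3600    # assumed: give android (cold-load) jobs 60 minutes
                .*-qr/.*: 2400        # assumed: existing bump for qr/ux builds
                default: 1800         # assumed: keep the 30-minute default elsewhere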

Another option is to use a single conditioned profile.

In that linked log, 15 of the 30 minutes are spent waiting for the browser to settle (30 seconds * 15 cycles * 2 tests).

With profile conditioning we would let the browser settle once for 2 minutes, copy that profile to the host, and re-use it for every cycle.
We could then reduce the 30-second per-cycle pause to something much shorter, like 5 seconds, which would cut the settle overhead in that log from 15 minutes to about 2.5 minutes (5 seconds * 15 cycles * 2 tests), plus the one-time 2-minute conditioning.
This has worked well in browsertime testing for many developers.

Separate bug, but I'd be happy to help with this.

The up-front conditioning is being tracked in bug 1537944. It's blocked by bug 1547747 which is being worked on. One issue with supporting this on desktop is that we need to be able to compare with Chrome, and we don't currently support user profiles on Chrome. We could most likely add mobile support though as we don't currently run page load tests against Chrome on Android due to the lack of web extension support.

:rwood if we can't reduce the number of sites we're testing (because we're already down to one per job), can we increase the max timeout?

Flags: needinfo?(rwood)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #25)

> The up-front conditioning is being tracked in bug 1537944. It's blocked by bug 1547747 which is being worked on. One issue with supporting this on desktop is that we need to be able to compare with Chrome, and we don't currently support user profiles on Chrome. We could most likely add mobile support though as we don't currently run page load tests against Chrome on Android due to the lack of web extension support.
>
> :rwood if we can't reduce the number of sites we're testing (because we're already down to one per job), can we increase the max timeout?

Actually we're running 2 sites per cold page-load suite. Maybe we should break them out to one site per suite? One benefit is faster bisecting/retriggers if you only want to test one site.

Flags: needinfo?(rwood) → needinfo?(dave.hunt)

(In reply to Robert Wood [:rwood] from comment #26)

> (In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #25)
> > The up-front conditioning is being tracked in bug 1537944. It's blocked by bug 1547747 which is being worked on. One issue with supporting this on desktop is that we need to be able to compare with Chrome, and we don't currently support user profiles on Chrome. We could most likely add mobile support though as we don't currently run page load tests against Chrome on Android due to the lack of web extension support.
> >
> > :rwood if we can't reduce the number of sites we're testing (because we're already down to one per job), can we increase the max timeout?
>
> Actually we're running 2 sites per cold page-load suite. Maybe we should break them out to one site per suite? One benefit is faster bisecting/retriggers if you only want to test one site.

Yes, let's go for one site per job.

Flags: needinfo?(dave.hunt)
Depends on: 1558621

This will be significantly reduced now that Bug 1558621 has landed.

Hi Robert, it seems like bug 1558621 has landed but the failure rate is still high: 102 failures in the last 7 days.

Could you take a look or assign someone?

Flags: needinfo?(rwood)

Failure rate seems to be going down.

Joel, is there something more we can do here?

Flags: needinfo?(jmaher)

(In reply to Intermittent Failures Robot from comment #37)

> ** This test has failed more than 150 times in the last 21 days. It should be disabled until it can be fixed. **

There's no individual test here to disable. Rob, is there anything else you can think of to eliminate these timeouts? I saw a bug recently about catching when the web extension is not installed and failing fast; is that likely to help here? Do we need to further increase the timeout?

The timeout won't help here; looking at the last 10 failures I see:

  1. desktop tests are usually hung doing something (I don't know the raptor specifics) for up to 30 minutes and then terminated
  2. some android tests have a traceback and a long timeout followed by termination
  3. some android tests appear to need a longer timeout; I suspect there is a transaction earlier in the job that eats up time, maybe a download or a retry on a test

Maybe :rwood has some thoughts on one or more of these failure modes.

Flags: needinfo?(jmaher)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]