Closed Bug 1502032 Opened 6 years ago Closed 4 years ago

Intermittent Raptor [taskcluster:error] Task aborted - max run time exceeded

Categories

(Testing :: Raptor, defect, P5)

Version 3
defect

Tracking

(firefox79 fixed)

RESOLVED FIXED
mozilla79
Tracking Status
firefox79 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: sparky)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(2 files)

Filed by: dvarga [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=207703467&repo=mozilla-central

https://queue.taskcluster.net/v1/task/NeLAGCKfT6OGF_hjm6pG2w/runs/0/artifacts/public/logs/live_backing.log

22:42:10     INFO -  Application command: /home/cltbld/tasks/task_1540446089/build/application/firefox/firefox -profile /tmp/tmpi_0lwD.mozrunner
22:42:11     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:11] "GET /raptor-wasm-misc-firefox.json HTTP/1.1" 200 -
22:42:11     INFO -  raptor-control-server reading test settings from raptor-wasm-misc-firefox.json
22:42:11     INFO -  raptor-control-server sent test settings to web ext runner
22:42:26     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:26] "POST / HTTP/1.1" 200 -
22:42:26     INFO -  raptor-control-server received webext_status: * pausing 30 seconds to let browser settle... *
22:42:56     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:56] "POST / HTTP/1.1" 200 -
22:42:56     INFO -  raptor-control-server received webext_status: running 5 pagecycles of http://127.0.0.1:55673/wasm-misc/index.html?raptor
22:42:56     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:56] "POST / HTTP/1.1" 200 -
22:42:56     INFO -  raptor-control-server received webext_status: opened new empty tab 2
22:42:57     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:57] "POST / HTTP/1.1" 200 -
22:42:57     INFO -  raptor-control-server received webext_status: begin pagecycle 1
22:42:58     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:58] "POST / HTTP/1.1" 200 -
22:42:58     INFO -  raptor-control-server received webext_status: update tab 2
22:42:58     INFO -  127.0.0.1 - - [24/Oct/2018 22:42:58] "POST / HTTP/1.1" 200 -
22:42:58     INFO -  raptor-control-server received webext_status: test tab updated 2
22:43:54     INFO -  127.0.0.1 - - [24/Oct/2018 22:43:54] "POST / HTTP/1.1" 200 -
22:43:54     INFO -  raptor-control-server received webext_status: begin pagecycle 2
22:43:55     INFO -  127.0.0.1 - - [24/Oct/2018 22:43:55] "POST / HTTP/1.1" 200 -
22:43:55     INFO -  raptor-control-server received webext_status: update tab 2
22:43:55     INFO -  127.0.0.1 - - [24/Oct/2018 22:43:55] "POST / HTTP/1.1" 200 -
22:43:55     INFO -  raptor-control-server received webext_status: test tab updated 2
[taskcluster:error] Aborting task...
[taskcluster 2018-10-25T06:11:30.432Z] === Task Finished ===
[taskcluster 2018-10-25T06:11:30.432Z] Task Duration: 29m59.360604652s
[taskcluster 2018-10-25T06:11:30.822Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:31.573Z] Uploading artifact public/logs/raptor_critical.log from file logs/raptor_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:31.956Z] Uploading artifact public/logs/raptor_error.log from file logs/raptor_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:32.436Z] Uploading artifact public/logs/raptor_fatal.log from file logs/raptor_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:32.826Z] Uploading artifact public/logs/raptor_info.log from file logs/raptor_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:33.274Z] Uploading artifact public/logs/raptor_raw.log from file logs/raptor_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster 2018-10-25T06:11:33.729Z] Uploading artifact public/logs/raptor_warning.log from file logs/raptor_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-10-25T04:51:43.793Z
[taskcluster:error] Task aborted - max run time exceeded
Blocks: 1510400
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab6d46128df2
Increase max time limit for Raptor jobs on qr and ux builds to prevent intermittent timeout; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/ab6d46128df2
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
(In reply to Cosmin Sabou [:CosminS] from comment #12)
> This looks like some timeouts still happen:
> https://treeherder.mozilla.org/logviewer.html#/
> jobs?job_id=215205588&repo=mozilla-inbound&lineNumber=543
> 
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&resultStatus=testfailed%2Cbusted%2Cexception&group_state=expanded&rev
> ision=c758d2a5b8cb04b1b5e5188a309d2994ad430d80&searchStr=linux%2Cx64%2Cquantu
> mrender%2Copt%2Craptor%2Cperformance%2Ctests%2Con%2Cfirefox%2Cwith%2Ce10s%2Ct
> est-linux64-qr%2Fopt-raptor-motionmark-animometer-firefox-e10s%2Crap-
> e10s%28mm-a%29&selectedJob=215205588
> 
> Robert could you take a look into this? Thank you.

Thanks, yes looks like motionmark and maybe tp6-5 need more time. Let's leave it a few days so we can get more data then I'll update the max time again where required.
Flags: needinfo?(rwood)
Flags: needinfo?(rwood)

In the last 7 days there have been 27 occurrences on android-hw, linux 64, windows 10, build types asan, opt, pgo.

I noticed that when I see raptor tp6m cold load tests fail it's often because they just ran out of time (30 minutes)
e.g.
https://taskcluster-artifacts.net/Z59qzIqqQUGWJ_9OAzcTSQ/0/public/logs/live_backing.log

The many of tp6m cold load tests on fenix seem to have this problem.

Can we consider increasing the max-run-time for android, or android cold load tests?
https://searchfox.org/mozilla-central/rev/0da35261b6789eec65476dbdd4913df6e235af6d/taskcluster/ci/test/raptor.yml#9

Another option is to use a single conditioned profile.

In that linked log the browser is spending 15 minutes of the 30 waiting for the browser to settle (30 seconds * 15 cycles * 2 tests).

With profile conditioning we would let the browser settle for 2 minutes, copy this profile to the host, and re-use it for every cycle.
So we could then reduce the 30 second pause to something much shorter like 5.
This has worked well in browsertime testing for many developers.

Separate bug, but I'd be happy to help with this.

The up-front conditioning is being tracked in bug 1537944. It's blocked by bug 1547747 which is being worked on. One issue with supporting this on desktop is that we need to be able to compare with Chrome, and we don't currently support user profiles on Chrome. We could most likely add mobile support though as we don't currently run page load tests against Chrome on Android due to the lack of web extension support.

:rwood if we can't reduce the number of sites we're testing (because we're already down to one per job), can we increase the max timeout?

Flags: needinfo?(rwood)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #25)

The up-front conditioning is being tracked in bug 1537944. It's blocked by bug 1547747 which is being worked on. One issue with supporting this on desktop is that we need to be able to compare with Chrome, and we don't currently support user profiles on Chrome. We could most likely add mobile support though as we don't currently run page load tests against Chrome on Android due to the lack of web extension support.

:rwood if we can't reduce the number of sites we're testing (because we're already down to one per job), can we increase the max timeout?

Actually we're running 2 sites per cold page-load suite. Maybe we should break them out to one site per suite? One benefit is faster bisecting/retriggers if you only want to test one site.

Flags: needinfo?(rwood) → needinfo?(dave.hunt)

(In reply to Robert Wood [:rwood] from comment #26)

(In reply to Dave Hunt [:davehunt] [he/him] ⌚️UTC from comment #25)

The up-front conditioning is being tracked in bug 1537944. It's blocked by bug 1547747 which is being worked on. One issue with supporting this on desktop is that we need to be able to compare with Chrome, and we don't currently support user profiles on Chrome. We could most likely add mobile support though as we don't currently run page load tests against Chrome on Android due to the lack of web extension support.

:rwood if we can't reduce the number of sites we're testing (because we're already down to one per job), can we increase the max timeout?

Actually we're running 2 sites per cold page-load suite. Maybe we should break them out to one site per suite? One benefit is faster bisecting/retriggers if you only want to test one site.

Yes, let's go for one site per job.

Flags: needinfo?(dave.hunt)
Depends on: 1558621

This will be significantly reduced now that Bug 1558621 has landed.

Hi Robert, seems like bug 1558621 has landed but the failure rate is still high, 102 failures in the last 7 days.

Could you take a look or assign someone?

Flags: needinfo?(rwood)

Failure rate seems to be going down.

Joel is there something more we can do here?

Flags: needinfo?(jmaher)

(In reply to Intermittent Failures Robot from comment #37)

** This test has failed more than 150 times in the last 21 days. It should be disabled until it can be fixed. **

There's no individual test here to disable. Rob, is there anything else you can think of to eliminate these timeouts? I saw a bug recently about catching when the web extension is not installed and failing fast, is that likely to help here? Do we need to further increase the timeout?

the timeout won't help here, looking at the last 10 failures I see:

  1. desktop tests are usually hung doing something (I don't know raptor specifics) for up to 30 minutes and then terminated
  2. some android tests have a traceback and a long timeout followed by termination
  3. some android tests appear to need a longer timeout- I suspect there is a transaction that takes place earlier in the job that eats up time, maybe a download, or a retry on a test.

Maybe :rwood has some thoughts on 1 or more of these failure modes

Flags: needinfo?(jmaher)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]

There are 22 total failures in the last 7 days on android-hw-g5-7-0-arm7-api-16 pgo, windows10-aarch64 opt, windows7-32 opt, windows7-32-shippable opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=258583406&repo=mozilla-central&lineNumber=3684

[task 2019-07-26T23:55:36.739Z] 23:55:36 INFO - raptor-main Info: starting firefox
[task 2019-07-26T23:55:36.739Z] 23:55:36 INFO - Application command: C:\Users\task_1564176079\build\application\firefox\firefox.exe --wait-for-browser -profile c:\users\task_1564176079\appdata\local\temp\tmpnaf5uf.mozrunner
[task 2019-07-26T23:55:37.625Z] 23:55:37 INFO - PID 2292 | console.log: "[raptor-runnerjs] Attaching event listener successful!"
[task 2019-07-26T23:55:37.625Z] 23:55:37 INFO - PID 2292 | console.log: "[raptor-runnerjs] starting raptorRunner"
[task 2019-07-26T23:55:37.625Z] 23:55:37 INFO - PID 2292 | console.log: "[raptor-runnerjs] test name is: raptor-tp6-yahoo-mail-firefox"
[task 2019-07-26T23:55:37.625Z] 23:55:37 INFO - PID 2292 | console.log: "[raptor-runnerjs] test settings url is: http://127.0.0.1:49782/json/raptor-tp6-yahoo-mail-firefox.json"

[task 2019-07-26T23:56:09.731Z] 23:56:09 INFO - raptor-control-server Info: received webext_status: update tab: 3
[task 2019-07-26T23:56:09.741Z] 23:56:09 INFO - raptor-control-server Info: received webext_status: test tab updated: 3
[task 2019-07-26T23:56:09.741Z] 23:56:09 INFO - PID 2292 | console.log: "[raptor-runnerjs] post success"
[task 2019-07-26T23:56:09.747Z] 23:56:09 INFO - PID 2292 | console.log: "[raptor-runnerjs] post success"
[taskcluster:error] Aborting task...
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 4004 (child process of PID 4548) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 7004 (child process of PID 4548) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 3704 (child process of PID 4548) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 6728 (child process of PID 4548) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 6164 (child process of PID 4548) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 1004 (child process of PID 2744) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 4548 (child process of PID 2292) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 2744 (child process of PID 5876) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 2292 (child process of PID 5876) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 6464 (child process of PID 4036) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 5876 (child process of PID 4036) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 4036 (child process of PID 4948) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 2700 (child process of PID 5048) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 4948 (child process of PID 5048) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z] SUCCESS: The process with PID 5048 (child process of PID 5644) has been terminated.
[taskcluster 2019-07-27T00:21:00.686Z]
[taskcluster 2019-07-27T00:21:00.687Z] === Task Finished ===
[taskcluster 2019-07-27T00:21:00.687Z] Task Duration: 30m0.3646392s
[taskcluster 2019-07-27T00:21:01.024Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2020-07-25T21:39:32.706Z
[taskcluster 2019-07-27T00:21:01.722Z] Uploading artifact public/test_info/mitmproxy.log from file build\blobber_upload_dir\mitmproxy.log with content encoding "gzip", mime type "text/plain" and expiry 2020-07-25T21:39:32.706Z
[taskcluster 2019-07-27T00:21:03.182Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/QTBx05ofRO6hhj0nbzRj2Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-07-25T21:39:32.706Z
[taskcluster:error] Task aborted - max run time exceeded

Failure rate seems to have decreased here.

Recent failures all appear to be related to running the Yahoo Mail page load test on Windows 7. Bebe could someone investigate?

Flags: needinfo?(fstrugariu)
Flags: needinfo?(rwood)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]

There are 109 failures associated to this bug in the last 7 days. These are occurring on linux64-shippable, macosx1014-64-shippable, windows10-64-shippable, windows7-32 and windows7-32-shippable for opt builds.

Flags: needinfo?(rwood)

This is tricky as the error "Task aborted - max run time exceeded" is happening because of different failures. For example, in the logs in comment 56 there are different tests causing it:

opt-raptor-tp6m-10-geckoview-cold-e10s
raptor-motionmark-animometer-chromium
raptor-tp6-youtube-firefox

It's not something caused by a single issue / requiring a single fix. We can just go through the most frequent tests that are causing that error and try to knock them off one at a time.

:marauder, can you please have a look and see if you can figure out and solve some of the most frequent failures here? If you can track down specific issues that cause some of these max run time errors feel free to file separate dependent bugs for any fixes. Thanks!

Flags: needinfo?(rwood) → needinfo?(marian.raiciof)

Hi Marian, are there updates here?

There are 22 failures in the last 7 days and 224 total failures. There seems to have been a higher spike between August 11 and 18: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-07-30&endday=2019-08-29&tree=trunk&bug=1502032

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=263941867&repo=autoland&lineNumber=1450

[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-main Info: starting test: raptor-speedometer-geckoview
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-main Info: starting raptor test: raptor-speedometer-geckoview
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-main Info: test settings: {'expected_browser_cycles': 1, 'alert_threshold': '2.0', 'subtest_unit': 'ms', 'here': '/builds/task_1567026790/workspace/build/tests/raptor/raptor/tests/benchmarks', 'path': '/builds/task_1567026790/workspace/build/tests/raptor/raptor/tests/benchmarks/raptor-speedometer-geckoview', 'cold': False, 'unit': 'score', 'ancestor-manifest': '/builds/task_1567026790/workspace/build/tests/raptor/raptor/raptor.ini', 'name': 'raptor-speedometer-geckoview', 'type': 'benchmark', 'page_cycles': '1', 'subtest_lower_is_better': 'true', 'apps': 'geckoview', 'test_url': 'http://<host>:<port>/Speedometer/index.html?raptor', 'manifest': '/builds/task_1567026790/workspace/build/tests/raptor/raptor/tests/benchmarks/raptor-speedometer.ini', 'page_timeout': '900000', 'browser_cycle': 1, 'expected': 'pass', 'lower_is_better': 'false', 'relpath': 'tests/benchmarks/raptor-speedometer-geckoview'}
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-main Info: raptor config: {'binary': 'org.mozilla.geckoview_example', 'local_profile_dir': '/tmp/tmp65dAXh.mozrunner', 'symbols_path': 'https://queue.taskcluster.net/v1/task/fLvlhjLkQSui3BxKXsbqQQ/artifacts/public/build/target.crashreporter-symbols.zip', 'memory_test': False, 'cpu_test': False, 'enable_control_server_wait': False, 'e10s': True, 'app': 'geckoview', 'gecko_profile_entries': None, 'power_test': False, 'run_local': False, 'platform': 'linux', 'host': '127.0.0.1', 'is_release_build': False, 'intent': 'android.intent.action.MAIN', 'enable_webrender': False, 'activity': 'org.mozilla.geckoview_example.GeckoViewActivity', 'gecko_profile_interval': None, 'processor': 'x86_64', 'gecko_profile': False, 'obj_path': None}
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-benchmark Info: bench_dir contains:
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-benchmark Info: ['ARES-6', 'Speedometer', 'MotionMark', 'webaudio', 'StyleBench', 'wasm-godot', 'SunSpider', 'six-speed', 'resources']
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-benchmark Info: writing wptserve headers file
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-benchmark Info: wrote wpt headers file: /builds/task_1567026790/workspace/build/tests/raptor/raptor/tests/webkit/PerformanceTests/dir.headers
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-benchmark Info: starting webserver on '127.0.0.1:53901'
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-benchmark Info: serving benchmarks from here: /builds/task_1567026790/workspace/build/tests/raptor/raptor/tests/webkit/PerformanceTests
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-gen-test-config Info: writing test settings into background js, so webext can get it
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-gen-test-config Info: finished writing test config to /builds/task_1567026790/workspace/build/tests/raptor/webext/raptor/auto_gen_test_config.js
[task 2019-08-28T21:15:43.734Z] 21:15:43 INFO - raptor-main Info: installing webext /builds/task_1567026790/workspace/build/tests/raptor/raptor/../webext/raptor
[task 2019-08-28T21:15:48.443Z] 21:15:43 INFO - raptor-main Info: making the raptor control server port available to device
[task 2019-08-28T21:15:48.443Z] 21:15:43 INFO - adb command_output: adb -s ZY322LDKCW wait-for-device reverse tcp:58657 tcp:58657, timeout: None, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.443Z] 21:15:43 INFO - raptor-main Info: making the raptor playback server port available to device
[task 2019-08-28T21:15:48.443Z] 21:15:43 INFO - adb command_output: adb -s ZY322LDKCW wait-for-device reverse tcp:8080 tcp:8080, timeout: None, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.443Z] 21:15:43 INFO - raptor-main Info: making the raptor benchmarks server port available to device
[task 2019-08-28T21:15:48.443Z] 21:15:44 INFO - adb command_output: adb -s ZY322LDKCW wait-for-device reverse tcp:53901 tcp:53901, timeout: None, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.443Z] 21:15:44 INFO - raptor-main Info: clearing org.mozilla.geckoview_example app data
[task 2019-08-28T21:15:48.444Z] 21:15:45 INFO - raptor-main Info: setting debug-app flag for org.mozilla.geckoview_example
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell pm list package org.mozilla.geckoview_example, timeout: None, root: False, timedout: None, exitcode: 0, output: package:org.mozilla.geckoview_example
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - raptor-main Info: copying profile to device: /sdcard/raptor/profile
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb shell_bool: adb -s ZY322LDKCW wait-for-device shell test -d /sdcard/raptor/profile, timeout: None, root: False, timedout: None, exitcode: 1, output:
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb command_output: adb -s ZY322LDKCW wait-for-device push /tmp/tmpaM3g4A/profile /sdcard/raptor/profile, timeout: None, timedout: None, exitcode: 0, output: /tmp/tmpaM3g4A/profile/: 8 files pushed. 0.8 MB/s (47755 bytes in 0.059s)
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb chmod: path=/sdcard/raptor/profile, recursive=True, mask=777, root=True
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb Ignoring attempt to chmod external storage
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell cat sys/class/thermal/thermal_zone0/temp, timeout: None, root: False, timedout: None, exitcode: 0, output: 26500
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell cat sys/class/thermal/thermal_zone0/type, timeout: None, root: False, timedout: None, exitcode: 0, output: bms
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - raptor-main Info: (thermal_zone0) device temperature: 26.500 zone type: bms
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - raptor-main Info: creating android app config.yml
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - raptor-utils Info: writing {'args': ['--profile', '/sdcard/raptor/profile', 'use_multiprocess', True], 'env': {'MOZ_WEBRENDER': 0, 'R_LOG_LEVEL': 6, 'LOG_VERBOSE': 1}} to /tmp/tmpUH4q0g/org.mozilla.geckoview_example-geckoview-config.yaml
[task 2019-08-28T21:15:48.444Z] 21:15:47 INFO - raptor-main Info: copying /tmp/tmpUH4q0g/org.mozilla.geckoview_example-geckoview-config.yaml to device: /data/local/tmp/org.mozilla.geckoview_example-geckoview-config.yaml
[task 2019-08-28T21:15:48.444Z] 21:15:48 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.444Z] 21:15:48 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[task 2019-08-28T21:15:48.444Z] 21:15:48 INFO - adb command_output: adb -s ZY322LDKCW wait-for-device push /tmp/tmpUH4q0g/org.mozilla.geckoview_example-geckoview-config.yaml /data/local/tmp/org.mozilla.geckoview_example-geckoview-config.yaml, timeout: None, timedout: None, exitcode: 0, output: /tmp/tmpUH4q0g/org.mozilla.geckoview_example-geckoview-config.yaml: 1 file pushed. 0.1 MB/s (127 bytes in 0.002s)
[task 2019-08-28T21:15:48.444Z] 21:15:48 INFO - adb shell_output: adb -s ZY322LDKCW wait-for-device shell sync, timeout: None, root: False, timedout: None, exitcode: 0, output:
[taskcluster:error] Aborting task...
[taskcluster:warn 2019-08-28T21:28:32.686Z] no such process
[taskcluster 2019-08-28T21:28:32.686Z] === Task Finished ===
[taskcluster 2019-08-28T21:28:32.686Z] Task Duration: 15m0.00026537s
[taskcluster 2019-08-28T21:28:33.406Z] Uploading artifact public/logs/localconfig.json from file workspace/logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2020-08-27T19:38:34.628Z
[taskcluster 2019-08-28T21:28:33.875Z] Uploading artifact public/test_info/logcat-ZY322LDKCW.log from file workspace/build/blobber_upload_dir/logcat-ZY322LDKCW.log with content encoding "gzip", mime type "text/plain" and expiry 2020-08-27T19:38:34.628Z
[taskcluster 2019-08-28T21:28:34.640Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/K7tflP5fTV-ba8STgRpI3g/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-08-27T19:38:34.628Z
[taskcluster:error] Task aborted - max run time exceeded

Flags: needinfo?(marian.raiciof)
Flags: needinfo?(marian.raiciof)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]
Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

I didn't find a specific reason that caused these failures, but i created a local patch to test different scenarios:

  • increased page_timeout in the .ini files
  • increased max-run-time in the .yml files
  • reduced gecko_profile_entries in the .ini files
  • increased gRetryCounter in pageload.js for the measured metrics.

https://hg.mozilla.org/try/rev/a122e2fe37f016d62212f479f63fd3e1a65cf5ff

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6bcf2082ea9f4350a3395b143be84c920f50f7ba

But, even with these changes there are some failures related to "max run time exceeded" , others caused by different issues.

I'll test further to see how it goes.

Flags: needinfo?(marian.raiciof)

In the last 7 days there have been 32 occurrences, most on Windows 10 64, build type opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272300976&repo=mozilla-central&lineNumber=3806

Depends on: 1592115
Flags: needinfo?(fstrugariu)
Assignee: rwood → nobody
Assignee: nobody → aionescu
Depends on: 1611894
Blocks: 1612795
No longer blocks: 1612795
Depends on: 1612795
Blocks: 1612795
No longer depends on: 1612795

I assume that there might be a problem with the benchmark harness inside the browser and that it may raise an error, and as such the test might not even start? Maybe some more logging inside Raptor web extension for specifically benchmark tests might be good to have.

But the question is if we should spend the time on that, or better moving tests to browsertime which gives way more detailed logging.

Target Milestone: mozilla65 → ---

:alexandrui, are you still working on this issue? If so, can you set the priority to P1. Otherwise, remove yourself from the assignee.

Flags: needinfo?(aionescu)
Assignee: aionescu → nobody
Flags: needinfo?(aionescu)

With bug 1618390 more logging has been added when downloading a conditioned profile. Also the download timeout has been changed. But we still experience hangs:

On Windows we still do NOT show additional log output:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291169984&repo=mozilla-central&lineNumber=2948

On MacOS we have additional logging but we do NOT timeout for downloading the profile:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291179702&repo=autoland&lineNumber=2082

[task 2020-03-01T14:49:07.383Z] 14:49:07 INFO - condprof INFO | Getting https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/gecko.v2.mozilla-central.latest.firefox.condprof-macosx64/artifacts/public/condprof/profile-macosx64-settled-default.tgz
[task 2020-03-01T14:49:07.812Z] 14:49:07 INFO - condprof INFO | Already Downloaded.
[..]
[task 2020-03-01T14:49:18.815Z] 14:49:18 INFO - condprof INFO | Getting https://firefox-ci-tc.services.mozilla.com/api/index/v1/task/gecko.v2.mozilla-central.latest.firefox.condprof-macosx64/artifacts/public/condprof/profile-macosx64-settled-default.tgz
[taskcluster:error] Aborting task...
[taskcluster:warn 2020-03-01T15:14:59.480Z] no such process

Note that here the profile already got downloaded a couple of times, but in this case we seem to hang detecting that?

Tarek, we still need an additional fix here. Any idea what this could be? If yes a new dependency should be filed.

Depends on: 1613938, 1618390
Flags: needinfo?(tarek)
Priority: P5 → P2
Blocks: 1589796

I'll add more logs in check_exist

for windows I don't know why we don't see the logs. one thing we could maybe do is dump the traceback of the python process on shutdown to see where it really hangs when things time out. It's a bit complicated to implement but doable

Flags: needinfo?(tarek)
Depends on: 1620214
Depends on: 1621339
No longer blocks: 1612795

Only happens sporadically. Not a priority anymore.

Priority: P2 → P5
Whiteboard: [stockwell unknown] → [stockwell needswork:owner]

So :bogdan_tara, I think it was caused by that patch. The funny thing is that it should only add 30s to the total runtime which I guess was too much - I'll increase the task timeout.

the osx raptor-profile-webaudio job is failing often on m-c and it marked as this bug, but the linux one is marked as bug 1645720 which might be more appropriate.

both started friday june 12 with this commit range:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=80b6f21783a324fd2e84e363dc4028e1ccff161e

I see 2 bugs that landed in that range:
https://bugzilla.mozilla.org/show_bug.cgi?id=1645424 (raptor profiling changes)
https://bugzilla.mozilla.org/show_bug.cgi?id=1626918 (actual profiler changes - maybe more devtools)

it would be nice to confirm these are one and the same issue for rap-profiling-webaudio, and work on a fix.

for the webaudio task we see

[task 2020-06-22T05:41:38.034Z] 05:41:38     INFO -  PID 1405 | console.info: "[raptor-runnerjs] results pending..."
[task 2020-06-22T05:41:38.416Z] 05:41:38     INFO -  PID 1405 | console.info: "[raptor-runnerjs] results pending..."
[task 2020-06-22T05:41:38.811Z] 05:41:38     INFO -  PID 1405 | console.info: "[raptor-runnerjs] results pending..."
[task 2020-06-22T05:41:39.071Z] 05:41:39     INFO -  PID 1405 | console.error: "ERROR: [raptor-runnerjs] raptor-page-timeout on http://127.0.0.1:49977/webaudio/index.html?raptor"
[task 2020-06-22T05:41:39.073Z] 05:41:39     INFO -  raptor-control-server Info: received webext_raptor-page-timeout: [u'raptor-webaudio-firefox', u'http://127.0.0.1:49977/webaudio/index.html?raptor', 2]
[task 2020-06-22T05:41:39.073Z] 05:41:39     INFO -  PID 1405 | console.info: "[raptor-runnerjs] capturing screenshot"
[task 2020-06-22T05:41:39.081Z] 05:41:39     INFO -  PID 1405 | console.info: "[raptor-runnerjs] results pending..."
[task 2020-06-22T05:41:39.441Z] 05:41:39     INFO -  PID 1405 | console.info: "[raptor-runnerjs] results pending..."

Can we take a look over the raptor-control-server and raise a exception when webext_raptor-page-timeout is triggered

Flags: needinfo?(aionescu)

:bc we see a lot of adb command output for the android tests :

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=306950368&repo=mozilla-central&lineNumber=3063

do you have time to investigate the failure?

Flags: needinfo?(bob)

There doesn't appear to be anything out of the ordinary in the adb verbose output from the test. Looking at the logcat output, it appears that GeckoView and Raptor are running. It just doesn't shut down. I think the raptor folks are better prepared to diagnose this.

Flags: needinfo?(bob)

Testing with increased timeout for youtube tests. It looks like that's all that's needed based on the logcat logs: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a30870694b497768bdb2f406e78326dd737f47a

Flags: needinfo?(gmierz2)
Assignee: nobody → gmierz2
Flags: needinfo?(aionescu)
Pushed by fstrugariu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a2bd479965da
Increase raptor ytp task timeout. r=perftest-reviewers,Bebe
Status: REOPENED → RESOLVED
Closed: 6 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla79
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: