Closed Bug 1636514 Opened 5 years ago Closed 5 years ago

Intermittent Android 7.0 x86-64 opt [taskcluster:error] Task timeout after 2700/5400 seconds. Force killing container.

Categories

(Firefox Build System :: Task Configuration, defect, P5)

defect

Tracking

(firefox76 unaffected, firefox77 unaffected, firefox78 wontfix)

RESOLVED DUPLICATE of bug 1411358
Tracking Status
firefox76 --- unaffected
firefox77 --- unaffected
firefox78 --- wontfix

People

(Reporter: intermittent-bug-filer, Assigned: gbrown)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

Filed by: nerli [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=301420455&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/RWHImX9DT920WGk4eScQqg/runs/0/artifacts/public/logs/live_backing.log


[task 2020-05-08T15:51:22.812Z] 15:51:22 INFO - wait for org.mozilla.geckoview.test complete; top activity=com.android.launcher3
[task 2020-05-08T15:51:22.920Z] 15:51:22 INFO - remoteautomation.py | Application ran for: 0:00:27.934177
[task 2020-05-08T15:51:23.458Z] 15:51:23 INFO - Stopping web server
[task 2020-05-08T15:51:23.463Z] 15:51:23 INFO - Stopping web socket server
[task 2020-05-08T15:51:23.487Z] 15:51:23 INFO - Stopping ssltunnel
[task 2020-05-08T15:51:23.509Z] 15:51:23 INFO - leakcheck | refcount logging is off, so leaks can't be detected!
[task 2020-05-08T15:51:23.509Z] 15:51:23 INFO - runtests.py | Running tests: end.
[task 2020-05-08T15:51:24.799Z] 15:51:24 INFO - Buffered messages finished
[task 2020-05-08T15:51:24.799Z] 15:51:24 INFO - Running manifest: parser/htmlparser/tests/mochitest/mochitest.ini
[task 2020-05-08T15:51:25.373Z] 15:51:25 INFO - runtests.py | Failed to copy /builds/worker/workspace/build/tests/mochitest/hyphenation to profile
[task 2020-05-08T15:51:25.484Z] 15:51:25 INFO - pk12util: PKCS12 IMPORT SUCCESSFUL
[task 2020-05-08T15:51:25.594Z] 15:51:25 INFO - MochitestServer : launching [u'/builds/worker/workspace/build/hostutils/host-utils-77.0a1.en-US.linux-x86_64/xpcshell', '-g', '/builds/worker/workspace/build/hostutils/host-utils-77.0a1.en-US.linux-x86_64', '-f', '/builds/worker/workspace/build/hostutils/host-utils-77.0a1.en-US.linux-x86_64/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmp7eqAGE.mozrunner'; const _SERVER_PORT = '8854'; const _SERVER_ADDR = '10.0.2.2'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/builds/worker/workspace/build/tests/mochitest/server.js']
[task 2020-05-08T15:51:25.594Z] 15:51:25 INFO - runtests.py | Server pid: 9193
[task 2020-05-08T15:51:25.599Z] 15:51:25 INFO - runtests.py | Websocket server pid: 9196
[task 2020-05-08T15:51:25.606Z] 15:51:25 INFO - runtests.py | SSL tunnel pid: 9199
[task 2020-05-08T15:51:26.271Z] 15:51:26 INFO - runtests.py | Running with scheme: http
[task 2020-05-08T15:51:26.271Z] 15:51:26 INFO - runtests.py | Running with e10s: True
[task 2020-05-08T15:51:26.271Z] 15:51:26 INFO - runtests.py | Running with fission: False
[task 2020-05-08T15:51:26.271Z] 15:51:26 INFO - runtests.py | Running with serviceworker_e10s: True
[task 2020-05-08T15:51:26.271Z] 15:51:26 INFO - runtests.py | Running with socketprocess_e10s: False
[task 2020-05-08T15:51:26.271Z] 15:51:26 INFO - runtests.py | Running tests: start.
[task 2020-05-08T15:51:26.595Z] 15:51:26 INFO - adb Granting important runtime permissions to org.mozilla.geckoview.test
[task 2020-05-08T15:51:28.442Z] 15:51:28 INFO - adb launch_application: am start -W -n org.mozilla.geckoview.test/org.mozilla.geckoview.test.TestRunnerActivity -a android.intent.action.MAIN --es env9 MOZ_UPLOAD_DIR=/data/local/tmp/tests/mozlog --es env8 R_LOG_DESTINATION=stderr --es args "-no-remote -profile /data/local/tmp/tests/profile//" --es env3 R_LOG_VERBOSE=1 --es env2 XPCOM_DEBUG_BREAK=stack --es env1 MOZ_WEBRENDER=0 --es env0 MOZ_CRASHREPORTER=1 --es env7 MOZ_CRASHREPORTER_SHUTDOWN=1 --es env6 MOZ_IN_AUTOMATION=1 --es env5 MOZ_DISABLE_NONLOCAL_CONNECTIONS=1 --es env4 MOZ_HIDE_RESULTS_TABLE=1 --ez use_multiprocess True --es env11 R_LOG_LEVEL=6 --es env10 MOZ_CRASHREPORTER_NO_REPORT=1 -d "http://mochi.test:8888/tests?autorun=1&closeWhenDone=1&logFile=%2Fdata%2Flocal%2Ftmp%2Ftests%2Flogs%2Fmochitest.log&fileLevel=INFO&consoleLevel=INFO&hideResultsTable=1&manifestFile=tests.json&dumpOutputDirectory=%2Fdata%2Flocal%2Ftmp%2Ftests"
[task 2020-05-08T15:51:29.259Z] 15:51:29 INFO - remoteautomation.py | Application pid: 14236
[task 2020-05-08T15:51:31.333Z] 15:51:31 INFO - 3668 INFO SimpleTest START
[task 2020-05-08T15:51:31.334Z] 15:51:31 INFO - 3669 INFO TEST-START | parser/htmlparser/tests/mochitest/test_bug102699.html
[task 2020-05-08T15:51:31.335Z] 15:51:31 INFO - 3670 INFO TEST-OK | parser/htmlparser/tests/mochitest/test_bug102699.html | took 1113ms
[task 2020-05-08T15:51:31.335Z] 15:51:31 INFO - 3671 INFO TEST-START | parser/htmlparser/tests/mochitest/test_bug1104732.html
[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2020-05-08 15:51:38.817Z] === Task Finished ===
[taskcluster 2020-05-08 15:51:38.818Z] Unsuccessful task run with exit code: -1 completed in 5403.46 seconds

Flags: needinfo?(gbrown)

These task timeouts show very long android-sdk-linux downloads.

android-sdk-linux downloads changed from tooltool to toolchain in bug 1624649, on 2020-05-06.

There was also a change in hosts, from packet.net to aws bare metal, bug 1635877, on 2020-05-07/08.

wcosta is trying to determine which change is responsible for these timeouts.

Flags: needinfo?(gbrown)

Task timeouts are often reported in bug 1411358. Note another complicating factor in analyzing those failures: There were recent (2020-05-03/04) android-em failures there from another cause, addressed in bug 1634230.

See Also: → 1411358

Spot checking earlier android-em tasks on packet.net with toolchain fetches of android-sdk-linux I see good performance:

[fetches 2020-05-08T04:18:58.541Z] http://taskcluster/api/queue/v1/task/dttL8yxRRridMMMoTyfacQ/artifacts/project/gecko/android-sdk/android-sdk-linux.tar.xz resolved to 592960992 bytes with sha256 288057ca2a3a06edcb2daa05d8fa0f972295651d13d515d6f958dc7bee2f4958 in 31.204s

[fetches 2020-05-08T04:03:45.809Z] http://taskcluster/api/queue/v1/task/dttL8yxRRridMMMoTyfacQ/artifacts/project/gecko/android-sdk/android-sdk-linux.tar.xz resolved to 592960992 bytes with sha256 288057ca2a3a06edcb2daa05d8fa0f972295651d13d515d6f958dc7bee2f4958 in 34.987s

[fetches 2020-05-08T04:03:37.218Z] http://taskcluster/api/queue/v1/task/dttL8yxRRridMMMoTyfacQ/artifacts/project/gecko/android-sdk/android-sdk-linux.tar.xz resolved to 592960992 bytes with sha256 288057ca2a3a06edcb2daa05d8fa0f972295651d13d515d6f958dc7bee2f4958 in 26.148s

...but spot checking recent android-em tasks on aws with toolchain fetches also typically shows good performance, sometimes better:

[fetches 2020-05-08T16:26:37.773Z] http://taskcluster/api/queue/v1/task/dttL8yxRRridMMMoTyfacQ/artifacts/project/gecko/android-sdk/android-sdk-linux.tar.xz resolved to 592960992 bytes with sha256 288057ca2a3a06edcb2daa05d8fa0f972295651d13d515d6f958dc7bee2f4958 in 11.648s

As we discussed in the channel, packet.net runs 4 tasks per worker, while AWS we run 32 (r5.metal) or 15 (m5.metal) tasks in parallel. In a perfect storm where all tasks require the android-sdk-linux artifact, and they all start more or less the same time, it might be possible the instance suffers from a network bandwidth shortage, causing slow downloads of artifacts.

(In reply to Wander Lairson Costa from comment #4)

I made a couple of try pushes to packet.net to confirm if it only happens on AWS:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=828a47a8bda6a7217aeff9e1ae77c5fee8706bcd
https://treeherder.mozilla.org/#/jobs?repo=try&revision=97e1464b3431ca69044f5e3c635b0ff7769608fe

Thanks Wander. I don't see any task timeouts there.

Set release status flags based on info from the regressing bug 1624649

No recent failures. Let's fold this into the main timeout bug and revisit in that bug if this condition arises again.

Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.