android-hw jsreftest often exceeds max-run-time
Categories
(Testing :: General, defect, P3)
Tracking
(Not tracked)
People
(Reporter: gbrown, Assigned: ahal)
References
(Blocks 1 open bug)
Details
Attachments
(1 obsolete file)
In bug 1589796, many recent failures for exceeding the task max run time are for android-hw jsreftest.
Reporter
Comment 1•4 years ago
There was an abrupt transition in run time and failures, suggesting a regression somewhere in the merge.
Reporter
Comment 2•4 years ago
I think part of the increase in run time involves tooltool fetches, like host-utils.
In the earlier, successful, faster tasks, I typically see host-utils retrieval in a few seconds:
[task 2021-03-09T18:08:40.116Z] 18:08:35 INFO - Calling ['/usr/bin/python2.7', '-u', '/builds/task_161531149932043/workspace/mozharness/external_tools/tooltool.py', 'fetch', '-m', '/builds/task_161531149932043/workspace/build/hostutils/releng.manifest', '-o', '-c', '/builds/tooltool_cache'] with output_timeout 600
[task 2021-03-09T18:08:40.116Z] 18:08:35 INFO - INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz not present in local cache folder /builds/tooltool_cache
[task 2021-03-09T18:08:40.116Z] 18:08:35 INFO - INFO - Attempting to fetch from 'http://localhost:8099/tooltool.mozilla-releng.net/'...
[task 2021-03-09T18:08:40.116Z] 18:08:37 INFO - INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz fetched from http://localhost:8099/tooltool.mozilla-releng.net/ as /builds/task_161531149932043/workspace/build/hostutils/tmpliEnhc
[task 2021-03-09T18:08:40.116Z] 18:08:38 INFO - INFO - File integrity verified, renaming tmpliEnhc to host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-09T18:08:40.116Z] 18:08:38 INFO - INFO - Updating local cache /builds/tooltool_cache...
[task 2021-03-09T18:08:40.116Z] 18:08:38 INFO - INFO - Local cache /builds/tooltool_cache updated with host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-09T18:08:40.116Z] 18:08:38 INFO - INFO - untarring "host-utils-85.0a1.en-US.linux-x86_64.tar.gz"
[task 2021-03-09T18:08:40.116Z] 18:08:40 INFO - Return code: 0
In the later, slower, often-failing tasks, there are some tooltool 600-second timeouts on March 10. I don't see those in the March 11 tasks, but many tooltool retrievals are still noticeably slow:
[task 2021-03-11T11:42:08.049Z] 11:35:47 INFO - Copy/paste: /usr/bin/python2.7 -u /builds/task_161546222549044/workspace/mozharness/external_tools/tooltool.py fetch -m /builds/task_161546222549044/workspace/build/hostutils/releng.manifest -o -c /builds/tooltool_cache
[task 2021-03-11T11:42:08.049Z] 11:35:47 INFO - Calling ['/usr/bin/python2.7', '-u', '/builds/task_161546222549044/workspace/mozharness/external_tools/tooltool.py', 'fetch', '-m', '/builds/task_161546222549044/workspace/build/hostutils/releng.manifest', '-o', '-c', '/builds/tooltool_cache'] with output_timeout 600
[task 2021-03-11T11:42:08.049Z] 11:35:47 INFO - INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz not present in local cache folder /builds/tooltool_cache
[task 2021-03-11T11:42:08.049Z] 11:35:47 INFO - INFO - Attempting to fetch from 'http://localhost:8099/tooltool.mozilla-releng.net/'...
[task 2021-03-11T11:42:08.049Z] 11:42:05 INFO - INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz fetched from http://localhost:8099/tooltool.mozilla-releng.net/ as /builds/task_161546222549044/workspace/build/hostutils/tmpmqEGEx
[task 2021-03-11T11:42:08.049Z] 11:42:06 INFO - INFO - File integrity verified, renaming tmpmqEGEx to host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-11T11:42:08.049Z] 11:42:06 INFO - INFO - Updating local cache /builds/tooltool_cache...
[task 2021-03-11T11:42:08.049Z] 11:42:06 INFO - INFO - Local cache /builds/tooltool_cache updated with host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-11T11:42:08.049Z] 11:42:06 INFO - INFO - untarring "host-utils-85.0a1.en-US.linux-x86_64.tar.gz"
[task 2021-03-11T11:42:08.049Z] 11:42:07 INFO - Return code: 0
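For illustration, here is a minimal Python sketch (assuming the mozharness log format shown in these excerpts; fetch_duration is a hypothetical helper, not part of any existing tooling) that measures the elapsed time between tooltool's "Attempting to fetch" line and the matching "fetched from" line:

import re
from datetime import datetime

# Matches the mozharness body timestamp, e.g. "] 11:35:47 INFO - ..."
LINE_RE = re.compile(r"\]\s+(\d{2}:\d{2}:\d{2})\s+INFO - (.*)$")

def fetch_duration(log_lines):
    """Seconds between tooltool's 'Attempting to fetch' line and the matching
    'fetched from' line, or None if either is missing.
    (Assumes the fetch does not span midnight.)"""
    start = end = None
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts, msg = m.groups()
        when = datetime.strptime(ts, "%H:%M:%S")
        if start is None and "Attempting to fetch from" in msg:
            start = when
        elif start is not None and "fetched from" in msg:
            end = when
            break
    if start is not None and end is not None:
        return (end - start).total_seconds()
    return None

Applied to the excerpts above, the March 9 task fetches host-utils in about 2 seconds (18:08:35 to 18:08:37), while the March 11 task takes over six minutes (11:35:47 to 11:42:05).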
:aerickson -- Hi! Do you have any idea what might be causing this slow-down? Any changes to host-utils? tooltool caching? Bitbar devices?
Comment 3•4 years ago
There haven't been any recent changes.
The sheriffs noticed a lot of Bitbar jobs failing with network issues yesterday. Bitbar investigated and rebooted the docker hosts. Success rates seemed to improve slightly.
This morning there was an incident with Pulse cert rotation. Bug 1688892
Hopefully things stabilize here... will keep watching the hosts.
Comment 4•4 years ago
These will be moving to Apple aarch64 when our pool of machines is available, ideally by the end of the month.
Updated•4 years ago
Assignee
Comment 6•4 years ago
Updated•4 years ago
Assignee
Comment 7•4 years ago
I created this patch before seeing Geoff's investigation. Joel, if you think we should not land this and instead investigate the root issue, feel free to r-.
Here's a try push:
https://treeherder.mozilla.org/jobs?repo=try&revision=3cbc3e673389ad48f9bfa78d3c5341d4894865d4
Lots of orange (mostly instances of bug 1697345), but at least no timeouts due to max-run-time.
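For reference, here is a hypothetical sketch of the kind of change such a patch might make (I have not checked the actual, now-obsolete patch; the transform name, the "suite"/"test-platform"/"max-run-time" keys, and the two-hour value are assumptions), written as an in-tree taskgraph transform in Python:

# Hypothetical sketch only -- not the content of the real (obsolete) patch.
# Assumes the taskgraph TransformSequence API and that test task definitions
# carry "suite", "test-platform" and "max-run-time" keys.
from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

@transforms.add
def bump_android_hw_jsreftest_max_run_time(config, tasks):
    for task in tasks:
        if (task.get("suite") == "jsreftest"
                and task.get("test-platform", "").startswith("android-hw")):
            # Allow up to two hours so slow host-utils/tooltool fetches do
            # not push the task past its max-run-time.
            task["max-run-time"] = max(task.get("max-run-time", 0), 7200)
        yield task

A simpler alternative with the same effect would be to raise max-run-time directly in the test suite's YAML, or to split the suite into more chunks; either way the underlying fetch slow-down would remain.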
Comment hidden (Intermittent Failures Robot)
Updated•4 years ago
Comment 9•4 years ago
Joel, in the now-obsoleted Phabricator patch you mentioned:
I don't like 2 hour runtimes, especially on our very limited hardware. These jobs will be moving to apple silicon real soon (when the production pool is ready) and this problem will go away. can we consider turning them off in the meantime?
So I just wonder what this Android-specific issue has to do with the upcoming M1 machines in the CI pool. Maybe there was an oversight?
Comment 10•4 years ago
jsreftests run on Android for arm64 support, and we can get that on the new Apple platform, which is faster and cheaper.
Comment 11•4 years ago
I see. Thank you for the info.
Andrew, could those tests be turned off in the meantime (the question that Joel asked in that same comment)?
Assignee | ||
Comment 12•4 years ago
I'm not sure, I don't really know anything about jsreftest.
Comment 13•4 years ago
(In reply to Andrew Halberstadt [:ahal] from comment #12)
I'm not sure, I don't really know anything about jsreftest.
You and Joel are listed as peers, so maybe https://wiki.mozilla.org/Modules/Testing needs an update?
Timothy, could you maybe answer the question?
Comment 14•4 years ago
You're asking me if they can be turned off?
I'm listed as the owner of the reftest module, meaning the code in the layout/tools/reftest directory. That doesn't include knowing what test coverage the jsreftest suite gives. I suggest talking to someone on the JS team; they likely care about their test coverage.
Comment 15•4 years ago
In such a case we clearly need an update of https://wiki.mozilla.org/Modules/Testing, because it says "Reftest (+ jsreftest + crashtest)".
Comment 16•4 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #15)
In such a case we clearly need an update of https://wiki.mozilla.org/Modules/Testing, because it says "Reftest (+ jsreftest + crashtest)".
The reftest code runs those suites, but the people who care about the suite are different from the people who are responsible for the code used to run it.
Assignee
Comment 17•4 years ago
The Reftest module wiki was just updated very recently :)
I'm also the owner of the Mochitest harness, but I have no idea how important any given set of tests is or whether it's ok to turn them off. Developers of each domain own their own tests, and jsreftests are owned by the JS team (who likely know little about the actual harness that runs them).
Reporter
Comment 18•3 years ago
android-hw jsreftests no longer run on mozilla-central. Failures continue on mozilla-beta and mozilla-release.