Bug 1697835 (Open) - Opened 4 years ago, updated 2 years ago

android-hw jsreftest often exceeds max-run-time

Categories

(Testing :: General, defect, P3)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: gbrown, Assigned: ahal)

References

(Blocks 1 open bug)

Details

Attachments

(1 obsolete file)

In bug 1589796, many recent failures for exceeding the task max run time are for android-hw jsreftest.

I think part of the increase in run time involves tooltool fetches, like host-utils.

In the earlier, successful, faster tasks, I typically see host-utils retrieval in a few seconds:

[task 2021-03-09T18:08:40.116Z] 18:08:35     INFO - Calling ['/usr/bin/python2.7', '-u', '/builds/task_161531149932043/workspace/mozharness/external_tools/tooltool.py', 'fetch', '-m', '/builds/task_161531149932043/workspace/build/hostutils/releng.manifest', '-o', '-c', '/builds/tooltool_cache'] with output_timeout 600
[task 2021-03-09T18:08:40.116Z] 18:08:35     INFO -  INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz not present in local cache folder /builds/tooltool_cache
[task 2021-03-09T18:08:40.116Z] 18:08:35     INFO -  INFO - Attempting to fetch from 'http://localhost:8099/tooltool.mozilla-releng.net/'...
[task 2021-03-09T18:08:40.116Z] 18:08:37     INFO -  INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz fetched from http://localhost:8099/tooltool.mozilla-releng.net/ as /builds/task_161531149932043/workspace/build/hostutils/tmpliEnhc
[task 2021-03-09T18:08:40.116Z] 18:08:38     INFO -  INFO - File integrity verified, renaming tmpliEnhc to host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-09T18:08:40.116Z] 18:08:38     INFO -  INFO - Updating local cache /builds/tooltool_cache...
[task 2021-03-09T18:08:40.116Z] 18:08:38     INFO -  INFO - Local cache /builds/tooltool_cache updated with host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-09T18:08:40.116Z] 18:08:38     INFO -  INFO - untarring "host-utils-85.0a1.en-US.linux-x86_64.tar.gz"
[task 2021-03-09T18:08:40.116Z] 18:08:40     INFO - Return code: 0
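
A note on the mechanics: the output_timeout 600 in the Calling [...] line above is mozharness's no-output watchdog - if tooltool.py goes 600 seconds without printing anything, the harness kills it and the step fails, which is what the "tooltool 600-second timeouts" mentioned just below are. A minimal sketch of that kind of watchdog in Python, as a simplification rather than mozharness's actual implementation:

import queue
import subprocess
import threading

def run_with_output_timeout(cmd, output_timeout=600):
    """Run cmd, killing it if it produces no output for output_timeout seconds.

    Simplified stand-in for mozharness's output_timeout handling; the real
    harness does considerably more bookkeeping.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = queue.Queue()

    def reader():
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: the stream closed

    threading.Thread(target=reader, daemon=True).start()

    while True:
        try:
            line = lines.get(timeout=output_timeout)
        except queue.Empty:
            proc.kill()  # no output for output_timeout seconds
            raise RuntimeError("command exceeded output timeout")
        if line is None:
            break
        print(line, end="")
    return proc.wait()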

In the later, slower, often-failing tasks, there are some tooltool 600-second timeouts on March 10. I don't see those in the March 11 tasks, but there is still an apparent slow-down in many tooltool retrievals:

[task 2021-03-11T11:42:08.049Z] 11:35:47     INFO - Copy/paste: /usr/bin/python2.7 -u /builds/task_161546222549044/workspace/mozharness/external_tools/tooltool.py fetch -m /builds/task_161546222549044/workspace/build/hostutils/releng.manifest -o -c /builds/tooltool_cache
[task 2021-03-11T11:42:08.049Z] 11:35:47     INFO - Calling ['/usr/bin/python2.7', '-u', '/builds/task_161546222549044/workspace/mozharness/external_tools/tooltool.py', 'fetch', '-m', '/builds/task_161546222549044/workspace/build/hostutils/releng.manifest', '-o', '-c', '/builds/tooltool_cache'] with output_timeout 600
[task 2021-03-11T11:42:08.049Z] 11:35:47     INFO -  INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz not present in local cache folder /builds/tooltool_cache
[task 2021-03-11T11:42:08.049Z] 11:35:47     INFO -  INFO - Attempting to fetch from 'http://localhost:8099/tooltool.mozilla-releng.net/'...
[task 2021-03-11T11:42:08.049Z] 11:42:05     INFO -  INFO - File host-utils-85.0a1.en-US.linux-x86_64.tar.gz fetched from http://localhost:8099/tooltool.mozilla-releng.net/ as /builds/task_161546222549044/workspace/build/hostutils/tmpmqEGEx
[task 2021-03-11T11:42:08.049Z] 11:42:06     INFO -  INFO - File integrity verified, renaming tmpmqEGEx to host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-11T11:42:08.049Z] 11:42:06     INFO -  INFO - Updating local cache /builds/tooltool_cache...
[task 2021-03-11T11:42:08.049Z] 11:42:06     INFO -  INFO - Local cache /builds/tooltool_cache updated with host-utils-85.0a1.en-US.linux-x86_64.tar.gz
[task 2021-03-11T11:42:08.049Z] 11:42:06     INFO -  INFO - untarring "host-utils-85.0a1.en-US.linux-x86_64.tar.gz"
[task 2021-03-11T11:42:08.049Z] 11:42:07     INFO - Return code: 0
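
For reference, the slow-down can be read straight off the embedded HH:MM:SS timestamps: the fetch in the first excerpt takes about 2 seconds (18:08:35 to 18:08:37), while the one above takes about 6 minutes 18 seconds (11:35:47 to 11:42:05). A small helper along these lines can pull the number out of a log automatically; the regex is mine and assumes the exact layout shown in the excerpts:

import re
from datetime import datetime, timedelta

# Matches the inner mozharness timestamp plus the two fetch markers,
# e.g. "] 11:35:47     INFO -  INFO - Attempting to fetch from ..."
FETCH_RE = re.compile(
    r"\] (\d{2}:\d{2}:\d{2})\s+INFO -.*?(Attempting to fetch|fetched from)"
)

def tooltool_fetch_duration(log_lines):
    """Time between the 'Attempting to fetch' and 'fetched from' lines.

    Returns None if either marker is missing.
    """
    start = end = None
    for line in log_lines:
        m = FETCH_RE.search(line)
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        if m.group(2) == "Attempting to fetch":
            start = t
        else:
            end = t
    if start is None or end is None:
        return None
    if end < start:  # the fetch crossed midnight
        end += timedelta(days=1)
    return end - start

Fed the two excerpts above, it reports 0:00:02 and 0:06:18 respectively.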

:aerickson -- Hi! Do you have any idea what might be causing this slow-down? Any changes to host-utils? tooltool caching? bitbar devices?

Flags: needinfo?(aerickson)

There haven't been any recent changes.

The sheriffs noticed a lot of Bitbar jobs failing with network issues yesterday. Bitbar investigated and rebooted the docker hosts. Success rates seemed to improve slightly.

This morning there was an incident with the Pulse cert rotation (bug 1688892).

Hopefully things stabilize here... will keep watching the hosts.

Flags: needinfo?(aerickson)

These will be moving to Apple aarch64 when our pool of machines is available - ideally by the end of the month.

Priority: -- → P3
Severity: -- → S3
Assignee: nobody → ahal
Status: NEW → ASSIGNED

I created this patch before seeing Geoff's investigation. Joel, if you think we should not land this and instead investigate the root issue, feel free to r-.

Here's a try push:
https://treeherder.mozilla.org/jobs?repo=try&revision=3cbc3e673389ad48f9bfa78d3c5341d4894865d4

Lots of orange (mostly instances of bug 1697345), but at least no timeouts due to max-run-time.
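
For anyone who wants to sanity-check the jsreftest runtimes on that push without clicking through Treeherder, they can also be pulled from its REST API. A rough sketch, assuming the treeherder-client (thclient) package and its get_pushes/get_jobs helpers; the job field names used here are assumptions and may need adjusting against the current API:

# Assumption: thclient is installed (pip install treeherder-client) and the
# job dicts expose job_type_name, start_timestamp, end_timestamp and result.
from thclient import TreeherderClient

REVISION = "3cbc3e673389ad48f9bfa78d3c5341d4894865d4"

client = TreeherderClient()
push = client.get_pushes("try", revision=REVISION)[0]

for job in client.get_jobs("try", push_id=push["id"], count=2000):
    if "jsreftest" not in job.get("job_type_name", ""):
        continue
    minutes = (job["end_timestamp"] - job["start_timestamp"]) / 60.0
    print("%s: %.1f min (%s)" % (job["job_type_name"], minutes, job["result"]))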

Attachment #9210002 - Attachment is obsolete: true

Joel, in the now obsoleted phabricator patch you mentioned:

I don't like 2 hour runtimes, especially on our very limited hardware. These jobs will be moving to apple silicon real soon (when the production pool is ready) and this problem will go away. can we consider turning them off in the meantime?

So I just wonder what this Android-specific issue has to do with the upcoming M1 machines in the CI pool. Maybe there was an oversight?

Flags: needinfo?(jmaher)

jsreftests run on Android for arm64 support, and we can get that on the new Apple platform, which is faster and cheaper.

Flags: needinfo?(jmaher)

I see. Thank you for the info.

Andrew, could those tests be turned off in the meantime (the question that Joel asked in that same comment)?

Flags: needinfo?(ahal)

I'm not sure, I don't really know anything about jsreftest.

Flags: needinfo?(ahal)

(In reply to Andrew Halberstadt [:ahal] from comment #12)

I'm not sure, I don't really know anything about jsreftest.

You and Joel are listed as peers, so maybe https://wiki.mozilla.org/Modules/Testing needs an update?

Timothy, could you maybe answer the question?

Flags: needinfo?(tnikkel)

You're asking me if they can be turned off?

I'm listed as owner of the reftest module, meaning the code in the layout/tools/reftest directory. That doesn't include knowing what test coverage the jsreftest suite gives. I suggest talking to someone on the JS team; they likely care about their test coverage.

Flags: needinfo?(tnikkel)

In such a case we clearly need an update of https://wiki.mozilla.org/Modules/Testing because it says Reftest (+ jsreftest + crashtest).

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #15)

In such a case we clearly need an update of https://wiki.mozilla.org/Modules/Testing because it says Reftest (+ jsreftest + crashtest).

The reftest code runs those suites, but the people who care about the suite are different from the people who are responsible for the code used to run it.

The Reftest module wiki was just updated very recently :)

I'm also the owner of the Mochitest harness, but I have no idea how important any given set of tests is or whether it's ok to turn them off. Developers of each domain own their own tests, and jsreftests are owned by the JS team (who likely know little about the actual harness that runs them).

android-hw jsreftests no longer run on mozilla-central. Failures continue on mozilla-beta and mozilla-release.
