Open Bug 1622155 Opened 6 years ago Updated 4 years ago

gecko-t-bitbar-gw-unit-p2 Android 8.0 Pixel2 failures with | load failed: timed out waiting for reftest-wait to be removed | load failed: timed out after 300000 ms | application timed out after 370 seconds with no output

Categories: Testing :: General, defect, P3
Tracking: Not tracked
People: Reporter: intermittent-bug-filer, Unassigned
Details: Keywords: intermittent-failure, Whiteboard: [stockwell:infra]

Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=292966605&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Tc1icqN9RJi-Wp5IPs3qbg/runs/0/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Tc1icqN9RJi-Wp5IPs3qbg/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1


There are a lot of Android 8.0 Pixel2 jsreftests that fail with the following:
https://treeherder.mozilla.org/logviewer.html#?job_id=292967477&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=292938627&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=292951061&repo=mozilla-beta

First failed here on autoland: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&selectedJob=292956372&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&searchStr=reftest%2Candroid%2Cj&revision=651f6b4c7d7414fc73ea15080e8874bddc5c122b

central: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=eaac96bf8116f7e0d7c485c79b51153abc3df986&searchStr=reftest%2Candroid%2Cj&selectedJob=292938772

beta: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&group_state=expanded&resultStatus=testfailed%2Cbusted%2Cexception&revision=aa506197596f9e90a6cc1c6b6879c13df40bd90e&searchStr=reftest%2Candroid%2Cj&selectedJob=292951051

These run on https://firefox-ci-tc.services.mozilla.com/provisioners/proj-autophone/worker-types/gecko-t-bitbar-gw-unit-p2

I retriggered some jobs that were initially green, but those failed too: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&classifiedState=unclassified&searchStr=reftest%2Candroid%2Cj&revision=5f1e70cbdf730256a7c0c98271692ab3f9ead897&selectedJob=292966605

aerickson: It seems to be happening on trunk and beta regardless of the pushes, which makes me wonder about networking at Bitbar.

Flags: needinfo?(aerickson)
Summary: Intermittent [tier 2] Android 8.0 Pixel2 jsreftest failures with | load failed: timed out waiting for reftest-wait to be removed | oad failed: timed out after 300000 ms → Perma [tier 2] Android 8.0 Pixel2 jsreftest failures with | load failed: timed out waiting for reftest-wait to be removed | load failed: timed out after 300000 ms
Severity: normal → blocker
Priority: P5 → P1
Summary: Perma [tier 2] Android 8.0 Pixel2 jsreftest failures with | load failed: timed out waiting for reftest-wait to be removed | load failed: timed out after 300000 ms → gecko-t-bitbar-gw-unit-p2 Android 8.0 Pixel2 failures with | load failed: timed out waiting for reftest-wait to be removed | load failed: timed out after 300000 ms | application timed out after 370 seconds with no output
See Also: → 1622204

Andrew, I checked several of these, and at least some are due to the device losing wifi during the test. I can think of several things that we could do to make this better.

  1. Ask Bitbar to figure it out and maybe strengthen the signal for the devices by adding/moving access points.
  2. Figure out a way to detect lost wifi in a test. Perhaps just checking that the device has an IP address would be sufficient. Even if we detected lost wifi, we probably couldn't recover without a failure and an orange result, but we could terminate the test sooner instead of waiting for the full timeout.
  3. When we lose wifi, exit with a retry status quickly so we just try again.

#1 seems like a good first step, but it is not sufficient in the long run. I'll file bugs for #2 and #3 later this morning.
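As a rough illustration of idea #2, the IP address check could look something like the sketch below. It is only a sketch under stated assumptions: it assumes `adb` is on the host's PATH, that the device's wifi interface is named `wlan0`, and that the device's `ip` tool prints the usual `inet <address>/<prefix>` lines. The function names and the `RETRY_EXIT_STATUS` value are hypothetical and are not taken from the existing harness.

```python
import re
import subprocess

# Hypothetical exit status that a harness could map to "retry this job".
RETRY_EXIT_STATUS = 4

# Matches IPv4 lines such as "    inet 10.0.0.5/24 brd ..."; "inet6" lines do not match.
_INET_RE = re.compile(r"^\s*inet\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+", re.MULTILINE)


def ipv4_address(ip_addr_output):
    """Return the first IPv4 address found in `ip addr show` output, or None."""
    match = _INET_RE.search(ip_addr_output)
    return match.group(1) if match else None


def device_has_wifi(serial, interface="wlan0", timeout=10):
    """Ask the device over adb whether `interface` currently holds an IPv4 address.

    Returns False if adb cannot be run, times out, or reports no address.
    """
    cmd = ["adb", "-s", serial, "shell", "ip", "-f", "inet", "addr", "show", interface]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and ipv4_address(result.stdout) is not None
```

A harness could poll `device_has_wifi()` between tests and, when it returns False, stop early with `RETRY_EXIT_STATUS` rather than waiting for the full 300000 ms load timeout. Whether an address check alone is enough to detect the wifi loss seen here is still an open question.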

Flags: needinfo?(bob)
Flags: needinfo?(bob)
See Also: → 1622816
Blocks: 1500977

Sakari/Bitbar says that all devices have strong connections to the Wifi access point. They're open to trying different channels.

Flags: needinfo?(aerickson)

Let's revisit when we have bug 1622816 available.

We're getting more failures today. We haven't made any progress. Our planned remedies:

  • BC/Snorp: Near term: Working to add code to mark job as RETRY if Wifi fails (via inspecting logcat?).
  • aerickson: Long term: Going to investigate using USB Ethernet for devices.

That is because the app is crashing with a Java exception when the network connection is lost. This will help us diagnose the problem as an infrastructure issue rather than a framework issue. See, for example: Caused by: java.lang.RuntimeException: Network connection has been lost.

We will be working to get Treeherder to offer this failure for classification and will work to get the jobs retried since this is an intermittent infra error. Once we have good data we can work with bitbar to resolve the issue.
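One possible shape for the retry decision is sketched below, assuming the harness already has the device's logcat output as lines of text. The marker string is the exception message quoted above. Everything else, including the function names and the two exit status values, is hypothetical and only illustrates the idea.

```python
# Exception message quoted in this bug; the rest of this sketch is hypothetical.
NETWORK_LOST_MARKER = "java.lang.RuntimeException: Network connection has been lost"

# Hypothetical exit statuses; a real harness defines its own values.
EXIT_FAILURE = 2
EXIT_RETRY = 4


def lost_network(logcat_lines):
    """Return True if any logcat line contains the network-lost exception."""
    return any(NETWORK_LOST_MARKER in line for line in logcat_lines)


def exit_status_for_failure(logcat_lines):
    """Choose the exit status for a failed job.

    A failure whose logcat shows the network-lost exception is treated as an
    infrastructure problem and mapped to a retry; anything else stays a failure.
    """
    return EXIT_RETRY if lost_network(logcat_lines) else EXIT_FAILURE
```

For example, after a timeout the harness could call `exit_status_for_failure(open(logcat_path, errors="replace"))`, where `logcat_path` is a hypothetical path to the saved logcat. The scan is a plain substring match, so it would need revisiting if the exception message ever changes.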

Based on comment 12, removing the disable-recommended tag.

Whiteboard: [stockwell disable-recommended]
See Also: → 1624210
Whiteboard: [stockwell disable-recommended]

Bitbar had some issues with their access points yesterday. They have supposedly fixed them as of 2020-04-01 4:30 PM PDT.

Thank you Bob, removing the disable tag.

Whiteboard: [stockwell disable-recommended]
Component: Testing → General
Product: Firefox for Android → Testing
Crash Signature: [@ libc.so + 0x6a100]
Whiteboard: [stockwell disable-recommended] → [stockwell:infra]

Hi, this is a bug whose current Severity is blocker but needs to be updated for the new Severity values as of May 4 2020.

I am moving its severity to S1.

Please review this bug's Severity and let Release Management know if it still is a high Severity bug.

Severity: blocker → S1

This appears to be very infrequent now... Marking down the severity.

Severity: S1 → S3
Priority: P3 → --
Priority: -- → P3