Closed Bug 1512352 Opened 5 years ago Closed 5 years ago

Some android-em test tasks fail to retry on ADBTimeoutError

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

abandon multi-test test-verification once TBPL_RETRY is hit 5 years ago Geoff Brown [:gbrown] (pto Apr 11-18) 3.75 KB, patch		Details \| Diff \| Splinter Review
abandon multi-test test-verification once TBPL_RETRY is hit 5 years ago Geoff Brown [:gbrown] (pto Apr 11-18) 5.80 KB, patch	bc : review+	Details \| Diff \| Splinter Review

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Description

•

5 years ago

Noticed these examples in bug 1500266.

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=214822075&repo=mozilla-central&lineNumber=987
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=214583874&repo=autoland&lineNumber=1310
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=215592086&repo=mozilla-central&lineNumber=1003

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Updated

•

5 years ago

Priority: -- → P2

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 1

•

5 years ago

More examples:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216733447&repo=autoland&lineNumber=2411
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216678105&repo=autoland&lineNumber=1642
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216663234&repo=mozilla-inbound&lineNumber=2764
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216336407&repo=mozilla-central&lineNumber=982
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=216336405&repo=mozilla-central&lineNumber=972

Summary: Some android 7.0 x86 test tasks fail to retry on ADBTimeoutError → Some android-em test tasks fail to retry on ADBTimeoutError

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 2

•

5 years ago

The most common failures are in is_boot_completed and install_apk; in both cases, these would retry if the exception was handled.

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=bff993881a4c8dd9c67244db53d21936d62f3ad6

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 3

•

5 years ago

Additional cases:
 - per-test runs, where it appears the logger has matched for retry, but the exit code is over-ridden
 - task timeouts

Keywords: leave-open

Pulsebot

Comment 4

•

5 years ago

Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/52fa52aeb6ee
Handle more ADBTimeoutError edge cases in mozharness; r=me,a=test-only

Razvan Maries

Comment 5

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/52fa52aeb6ee

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 6

•

5 years ago

There are very few of these now. Here's a recent TV failure:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=221244928&repo=mozilla-inbound&lineNumber=2146

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 7

•

5 years ago

And another TV failure:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=220107430&repo=mozilla-inbound&lineNumber=2205

Bob Clary [:bc] (inactive)

Comment 8

•

5 years ago

I noticed that it suffers multiple ADBTimeoutErrors through repeated initializations of ADBDevice. I think creating an ADBDevice once, then passing it to XPCShellRemote rather than doing it each time would be better. It also appears that running the test after the ADBTimeoutError to get the 10 iterations is guaranteed to fail once the device enters the ADBTimeoutError state. Can you terminate the test once you get a time out?

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 9

•

5 years ago

Attached patch abandon multi-test test-verification once TBPL_RETRY is hit (obsolete) — Details — Splinter Review

Thanks :bc.

The way I see it, the issue is that the mozharness logic for test verification (and per-test coverage) keeps going even after a test harness has failed in such a way that TBPL_RETRY would be triggered: eg. android_emulator_unittest calls the xpcshell harness, it fails with ADBTimeoutError, and then android_emulator_unittest calls the xpcshell harness again, for the next file...and the next...

I think that behavior is okay - possibly even desirable - for many failures, but not for ADBTimeoutError, and probably not for anything that would cause TBPL_RETRY (which I equate with, "there's an unrecoverable infrastructure problem").

Here I force persistent ADBTimeoutError in the Android xpcshell harness:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=7d23c8ceaa602604d271be060b56c48aa8ffc887

and we get quick task retries on Android; desktop continues just fine too.

I didn't bother updating android_hardware_unittest.py since you don't run test-verify there, and don't support task retries (right?)...but that would be easy to change, if you prefer.

Attachment #9036020 - Flags: review?(bob)

Bob Clary [:bc] (inactive)

Comment 10

•

5 years ago

Comment on attachment 9036020 [details] [diff] [review]
abandon multi-test test-verification once TBPL_RETRY is hit

Review of attachment 9036020 [details] [diff] [review]:
-----------------------------------------------------------------

Let's do android_hardware_unittest.py as well since it does include the CodeCoverageMixin and therefore the SingleTestMixin. Might as well keep them in sync so if someone does try to run code coverage or test verification they won't be surprised.

::: testing/mozharness/scripts/android_emulator_unittest.py
@@ +433,5 @@
>                      self.record_status(tbpl_status, level=log_level)
>                      self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
> +                    if tbpl_status == TBPL_RETRY:
> +                        self.info("Per-test run abandoned due to RETRY status")
> +                        break

Do we want break here? Shouldn't we return False?

::: testing/mozharness/scripts/web_platform_tests.py
@@ +446,5 @@
>                  if len(per_test_args) > 0:
>                      self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
> +                    if tbpl_status == TBPL_RETRY:
> +                        self.info("Per-test run abandoned due to RETRY status")
> +                        return

Why is web_platform_tests.py inconsistent with the others in that it does not return a bool?

Attachment #9036020 - Flags: review?(bob)

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 11

•

5 years ago

Attached patch abandon multi-test test-verification once TBPL_RETRY is hit — Details — Splinter Review

OK, android_hardware_unittest.py is included now.

In all of these scripts, run_tests() is a mozharness "step" function, which usually doesn't return a value. I've changed all the run_tests break and return statements to simple "return" (including modifying an existing, unrelated "return False").

That makes the change to desktop_unittest appear inconsistent, but there we are returning from _run_category_suites() -- which should return False.

Attachment #9036020 - Attachment is obsolete: true

Attachment #9036066 - Flags: review?(bob)

Bob Clary [:bc] (inactive)

Comment 12

•

5 years ago

Comment on attachment 9036066 [details] [diff] [review]
abandon multi-test test-verification once TBPL_RETRY is hit

Review of attachment 9036066 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/scripts/desktop_unittest.py
@@ +944,5 @@
>                      if len(per_test_args) > 0:
>                          self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
> +                        if tbpl_status == TBPL_RETRY:
> +                            self.info("Per-test run abandoned due to RETRY status")
> +                            return False

Don't forget this one.

Attachment #9036066 - Flags: review?(bob) → review+

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 13

•

5 years ago

(In reply to Bob Clary [:bc:] from comment #12)

Comment on attachment 9036066 [details] [diff] [review]
abandon multi-test test-verification once TBPL_RETRY is hit

Review of attachment 9036066 [details] [diff] [review]:

::: testing/mozharness/scripts/desktop_unittest.py
@@ +944,5 @@
                 if len(per_test_args) > 0:
                     self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
                   if tbpl_status == TBPL_RETRY:
                       self.info("Per-test run abandoned due to RETRY status")
                       return False
Don't forget this one.

That one should return False: desktop_unittest.py has this code in a different function, _run_category_suites(), which needs to return a boolean.

Pulsebot

Comment 14

•

5 years ago

Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/500d91e30e9d
Improve task retry handling in test-verify; r=bc

Natalia Csoregi [:nataliaCs]

Comment 15

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/500d91e30e9d

Geoff Brown [:gbrown] (pto Apr 11-18)

Assignee

Comment 17

•

5 years ago

No new android-em failures in bug 1500266.

There is the remaining possibility of a task timeout interfering with the RETRY; I think that's best addressed by avoiding task timeouts / appropriate max-run-time settings.

Good enough for me!

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

BugBot [:suhaib / :marco/ :calixte]

Updated

•

5 years ago

Keywords: leave-open

BMO Automation

Updated

•

3 years ago

Product: Firefox for Android → Firefox for Android Graveyard

You need to log in before you can comment on or make changes to this bug.