Closed Bug 1512352 Opened 5 years ago Closed 5 years ago

Some android-em test tasks fail to retry on ADBTimeoutError

Categories

(Firefox for Android Graveyard :: Testing, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: gbrown)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

Priority: -- → P2
The most common failures are in is_boot_completed and install_apk; in both cases, these would retry if the exception was handled.

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=bff993881a4c8dd9c67244db53d21936d62f3ad6
Additional cases:
 - per-test runs, where it appears the logger has matched for retry, but the exit code is over-ridden
 - task timeouts
Keywords: leave-open
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/52fa52aeb6ee
Handle more ADBTimeoutError edge cases in mozharness; r=me,a=test-only

I noticed that it suffers multiple ADBTimeoutErrors through repeated initializations of ADBDevice. I think creating an ADBDevice once, then passing it to XPCShellRemote rather than doing it each time would be better. It also appears that running the test after the ADBTimeoutError to get the 10 iterations is guaranteed to fail once the device enters the ADBTimeoutError state. Can you terminate the test once you get a time out?

Thanks :bc.

The way I see it, the issue is that the mozharness logic for test verification (and per-test coverage) keeps going even after a test harness has failed in such a way that TBPL_RETRY would be triggered: eg. android_emulator_unittest calls the xpcshell harness, it fails with ADBTimeoutError, and then android_emulator_unittest calls the xpcshell harness again, for the next file...and the next...

I think that behavior is okay - possibly even desirable - for many failures, but not for ADBTimeoutError, and probably not for anything that would cause TBPL_RETRY (which I equate with, "there's an unrecoverable infrastructure problem").

Here I force persistent ADBTimeoutError in the Android xpcshell harness:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=7d23c8ceaa602604d271be060b56c48aa8ffc887

and we get quick task retries on Android; desktop continues just fine too.

I didn't bother updating android_hardware_unittest.py since you don't run test-verify there, and don't support task retries (right?)...but that would be easy to change, if you prefer.

Attachment #9036020 - Flags: review?(bob)
Comment on attachment 9036020 [details] [diff] [review]
abandon multi-test test-verification once TBPL_RETRY is hit

Review of attachment 9036020 [details] [diff] [review]:
-----------------------------------------------------------------

Let's do android_hardware_unittest.py as well since it does include the CodeCoverageMixin and therefore the SingleTestMixin. Might as well keep them in sync so if someone does try to run code coverage or test verification they won't be surprised.

::: testing/mozharness/scripts/android_emulator_unittest.py
@@ +433,5 @@
>                      self.record_status(tbpl_status, level=log_level)
>                      self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
> +                    if tbpl_status == TBPL_RETRY:
> +                        self.info("Per-test run abandoned due to RETRY status")
> +                        break

Do we want break here? Shouldn't we return False?

::: testing/mozharness/scripts/web_platform_tests.py
@@ +446,5 @@
>                  if len(per_test_args) > 0:
>                      self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
> +                    if tbpl_status == TBPL_RETRY:
> +                        self.info("Per-test run abandoned due to RETRY status")
> +                        return

Why is web_platform_tests.py inconsistent with the others in that it does not return a bool?
Attachment #9036020 - Flags: review?(bob)

OK, android_hardware_unittest.py is included now.

In all of these scripts, run_tests() is a mozharness "step" function, which usually doesn't return a value. I've changed all the run_tests break and return statements to simple "return" (including modifying an existing, unrelated "return False").

That makes the change to desktop_unittest appear inconsistent, but there we are returning from _run_category_suites() -- which should return False.

Attachment #9036020 - Attachment is obsolete: true
Attachment #9036066 - Flags: review?(bob)
Comment on attachment 9036066 [details] [diff] [review]
abandon multi-test test-verification once TBPL_RETRY is hit

Review of attachment 9036066 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/scripts/desktop_unittest.py
@@ +944,5 @@
>                      if len(per_test_args) > 0:
>                          self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
> +                        if tbpl_status == TBPL_RETRY:
> +                            self.info("Per-test run abandoned due to RETRY status")
> +                            return False

Don't forget this one.
Attachment #9036066 - Flags: review?(bob) → review+

(In reply to Bob Clary [:bc:] from comment #12)

Comment on attachment 9036066 [details] [diff] [review]
abandon multi-test test-verification once TBPL_RETRY is hit

Review of attachment 9036066 [details] [diff] [review]:

::: testing/mozharness/scripts/desktop_unittest.py
@@ +944,5 @@

                 if len(per_test_args) > 0:
                     self.log_per_test_status(per_test_args[-1], tbpl_status, log_level)
  •                    if tbpl_status == TBPL_RETRY:
    
  •                        self.info("Per-test run abandoned due to RETRY status")
    
  •                        return False
    

Don't forget this one.

That one should return False: desktop_unittest.py has this code in a different function, _run_category_suites(), which needs to return a boolean.

Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/500d91e30e9d
Improve task retry handling in test-verify; r=bc

No new android-em failures in bug 1500266.

There is the remaining possibility of a task timeout interfering with the RETRY; I think that's best addressed by avoiding task timeouts / appropriate max-run-time settings.

Good enough for me!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: