Closed Bug 1690570 Opened 3 months ago Closed 3 months ago

Testing on Android: Delegate test runner to run on Android

Categories

(Core :: JavaScript Engine, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
88 Branch
Tracking Status
firefox88 --- fixed

People

(Reporter: nbp, Assigned: nbp)

References

Details

Attachments

(2 files)

Lately, we landed some optimization to execute the test suite faster by sharing the XDR version of the self-hosted code. This optimization had little to no impact on Android test suite execution.

One hypothesis is that the ADB round-trip for spawning each test is too costly, and made the improvement of the self-hosted code neglectable compared to the ADB overhead.

While this bug has the potential for improving the test suite, it is also an interesting data-point on whether we should pursue on-disk caching on the self-hosted code as XDR or not.

Blocks: 1458339
Blocks: 1692096

First shot at this, suggest that we might be able to save ~24 minutes per android JIT job!
Which would be roughly 24h (= 24min * 10 * 6) in total per full-test.

Depends on: 1692253
Blocks: 1531175

Ok, after applying these changes, I see that random tests which are supposed to return immediately, even skipped tests are doing timeout.
I did not see any of these before, and I wonder if this could be some form of shutdown hang in the JS Shell.

I will instrument to dump the stack on SIGTERM.

FWIW, I've experienced adb just terminating apparently at random when running tests locally.

(In reply to Lars T Hansen [:lth] from comment #5)

FWIW, I've experienced adb just terminating apparently at random when running tests locally.

With the patch all the tests are run by a single shell script which is executed with a single adb command.
The JS shell is wrapped by a timeout command provided by busybox.

Honestly, I am tempted to land the changes, even if these timeout issue exist, as this would be a huge time saving even if we have to restart a few test failure caused by these timeout.

(In reply to Nicolas B. Pierron [:nbp] from comment #4)

I see that random tests which are supposed to return immediately, even skipped tests are doing timeout.

I managed to instrument a JS Shell in such a way that it prints the stack¹, one of the failure reported the following stack, unfortunately lacking any JS shell symbols:

#01: ???[/data/local/tmp/test_root/bin/js +0x5e904]
#02: ???[/system/lib/libc.so +0x18aa8]
#03: syscall[/system/lib/libc.so +0x18e08]
#04: pthread_join[/system/lib/libc.so +0x47cf0]

Which seems to indicate that one of the Helper thread did not join back after the completion of the JS Shell.

¹ I learnt that inside mozglue with has MozStack* functions which are using various backend implementation and are able to recover this information.

Doing extra printf debugging, suggests that this could be an issue within the code which is joining the Helper threads.
However, I do not understand:

  1. Why aren't the processes killed after receiving a SIGTERM, which is supposed to be forwarded.
  2. Why the hang seems to get resolved as soon as the signal is received.
  3. What would be the issue in the HelperThread waiting system.

This try link show different tests failing, where the stack is printed when a SIGTERM is received and where fprintf are added next to all pthread_join calls.

The output seems to suggest that the we start joining with the HelperThread, but suddently fail at joining more threads, and wait until the signal handler is called, to resume joining.

The following try link suggest that the failure is not restricted to HelperThread but can also happen with WorkerThread.

After some discussion with the team, it was decided that the best alternative, given the low likelyhood and the randomness of the failure would be to re-run the tests which are failing due to a timeout, and only report the last execution.

This would reduce other timeout but would most likely mute pthread_join hanging issues.

Pushed by npierron@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6ce356a95473
Fix python style in gdb/run-tests.py . r=tcampbell
https://hg.mozilla.org/integration/autoland/rev/e400022129be
Batch JS Shell tests on Android to run a single ADB command. r=tcampbell
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 88 Branch
You need to log in before you can comment on or make changes to this bug.