Closed Bug 1142229 Opened 6 years ago Closed 6 years ago

Debug emulator shutdown timeout caused by dom/ipc/test_NuwaProcessCreation.html and dom/ipc/test_NuwaProcessDeadlock.html

Categories

(Core :: IPC, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox38 --- wontfix
firefox39 --- wontfix
firefox40 --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.0M --- fixed
b2g-v2.1 --- fixed
b2g-v2.1S --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: ahal, Assigned: cyu)

References

Details

Attachments

(2 files, 2 obsolete files)

All debug emulator mochitest-11 jobs have a shutdown timeout in them (e.g search for TEST-UNEXPECTED-FAIL in [1]). These are green because of a special hack in mozharness just for this issue.

RyanVM and I recently figured out test_NuwaProcessCreation.html and test_NuwaProcessDeadlock.html are somehow causing it. Interestingly, disabling either of those tests fixes the issue. It's only when both are run together that the timeout occurs:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0019c0ba19fd
https://treeherder.mozilla.org/#/jobs?repo=try&revision=698c547cf90d


[1] http://ftp.mozilla.org/pub/mozilla.org/b2g/try-builds/ahalberstadt@mozilla.com-e035ba199aec/try-emulator-debug/try_ubuntu64_vm-b2g-emulator-debug_test-mochitest-debug-11-bm122-tests1-linux64-build72.txt.gz
Just so it's clear here, we're very likely to be disabling these tests in short order as they're blocking other patches from landing and are causing a legitimate permafail at the moment.
Just to be sure, are we closing both test cases or just one to make it green? We need at least one test case as a guard for breakage of Nuwa functionality.
Ideally we'd disable neither :-). That said, happy to defer to your preference if it comes to that.
I can look into it, but the problem was only seen on the try server as I remember. Debugging is really painful as I could only printf and push to try to see the output. If there is debuggable environment for such cases, we can save lots of time on similar bugs. Do you have any suggestion?
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #5)
> Loaner slave?
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_a_slave

That looks pretty delicious. I'll try that.
Blocks: 1124695
No longer blocks: 923381
What do you think about disabling test_NuwaProcessDeadlock.html so we can fix this and bug 1132368 in one shot?
Flags: needinfo?(cyu)
We still need the test case to guard against new stuff added to the Nuwa process that creates a deadlock as the original bug did. I agree that we disable the test case temporarily, rinse, and then reenable the test case.
Flags: needinfo?(cyu)
This can be reproduced locally so no need for loaner slave.
Hooking the Nuwa test cases under ProcessPriorityManager turns out to be a bad idea. Setting process priorities is an indirect signal that can subject to timing problems easily. I think it's better to redesign the test cases that use more direct signals from where the Nuwa process are created/shutdown.
Attached file Fix the test cases (WIP) (obsolete) —
This should not block and timeout: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce95abcb1760

The root cause of this timeout is:
1. We shut down the Nuwa process.
1.1 The run loop of the main thread exits.
1.2 The Nuwa process shuts down XPCOM
1.3 Some observer of xpcom-shutdown fires a Timer to the Timer thread, which is frozen.
1.4 The Nuwa process deadlocks.
2. The b2g process starts to shut down.
2.1 The compositor thread shuts down, which cannot complete because the compositor thread is ref-counted.
2.2 The b2g process deadlocks.

The patch contains rework of the test cases and adds logic to shut down the Nuwa process. Ideally we can thaw the threads to let them shut down correctly. But it's still not functioning. I use _exit() to force shutting down the process, but it leads to abort() crashes in the b2g process. I need to figure out how to fix it. At least, with this patch we should not see timeouts anymore.
Attached patch Part 1: refactor the test cases. (obsolete) — Splinter Review
This cleans up the test case and remove the use of flaky timeouts. Also we use a dedicated pref value 'dom.ipc.preallocatedProcessManager.testMode' only for Nuwa test cases.
Attachment #8584449 - Attachment is obsolete: true
Attachment #8586642 - Flags: review?(ryanvm)
Comment on attachment 8586642 [details] [diff] [review]
Part 1: refactor the test cases.

I'm not a peer. Wish I could help here, but way out of my jurisdiction :)
Flags: needinfo?(cyu)
Attachment #8586642 - Flags: review?(ryanvm)
Comment on attachment 8586642 [details] [diff] [review]
Part 1: refactor the test cases.

Request review from :mrbkap.

Blake, since Kyle is on vacation, would you review the test cases? Thanks.
Flags: needinfo?(cyu)
Attachment #8586642 - Flags: review?(mrbkap)
Attachment #8586644 - Flags: review?(tlee) → review+
Comment on attachment 8586642 [details] [diff] [review]
Part 1: refactor the test cases.

Review of attachment 8586642 [details] [diff] [review]:
-----------------------------------------------------------------

Clearing this request until my question about the second pushPrefEnv calls is resolved.

::: dom/ipc/tests/test_NuwaProcessCreation.html
@@ +39,5 @@
>      cpmm.removeMessageListener("TEST-ONLY:nuwa-add-new-process", msgHandler);
> +
> +
> +    SpecialPowers.pushPrefEnv({
> +      'clear': [

This shouldn't be necessary. The initial pushPrefEnv knows that it has to clear the prefs if they don't exist to begin with. This just creates an extra round-trip that doesn't do anything.

::: dom/ipc/tests/test_NuwaProcessDeadlock.html
@@ +39,5 @@
>      cpmm.removeMessageListener("TEST-ONLY:nuwa-add-new-process", msgHandler);
> +
> +
> +    SpecialPowers.pushPrefEnv({
> +      'clear': [

Same here.
Attachment #8586642 - Flags: review?(mrbkap)
Update to the previous revision: remove the unnecessary SpecialPowers.pushPrefEnv() call to clear the pref values on test case shutdown.
Attachment #8586642 - Attachment is obsolete: true
Attachment #8589569 - Flags: review?(mrbkap)
Comment on attachment 8589569 [details] [diff] [review]
Part 1: refactor the test cases (v2),

Review of attachment 8589569 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Thanks!
Attachment #8589569 - Flags: review?(mrbkap) → review+
https://hg.mozilla.org/mozilla-central/rev/a255a2f9d4f3
https://hg.mozilla.org/mozilla-central/rev/6b67bc79ec2d
Assignee: nobody → cyu
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Nice, I've confirmed that this actually fixes some permafails we recently began to encounter on the older release branches as well after the chunking boundaries changed. Please can we nominate these patches for b2g32/34/37 approval? I've already done the rebases and verified green on Try.
The alternative to approval being that I disable the Nuwa tests on the release branches, but that doesn't seem like an ideal outcome when we have a relatively small, simple patch that can fix them instead.
It's pretty safe change so we should uplift to release branches. Part 2 contains a change in IPC, but no one is ever going to Close() the Nuwa process except in test cases that changes the pref "dom.ipc.processPrelaunch.enabled" dynamically.
Flags: needinfo?(cyu)
Comment on attachment 8586644 [details] [diff] [review]
Part 2: Shut down the Nuwa process with QuickExit()

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 922465
User impact if declined: None. This is a test case fix.
Testing completed: Automatic tests on try and locally.
Risk to taking this patch (and alternatives if risky): Pretty low. This is a fix in test case. Also the change is only reachable in debug build, it's never seen in release build.
Attachment #8586644 - Flags: approval-mozilla-b2g37?
Attachment #8586644 - Flags: approval-mozilla-b2g34?
Attachment #8586644 - Flags: approval-mozilla-b2g32?
Comment on attachment 8589569 [details] [diff] [review]
Part 1: refactor the test cases (v2),

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 922465
User impact if declined: None. This is a test case fix.
Testing completed: Automatic tests on try and locally.
Risk to taking this patch (and alternatives if risky): Pretty low. This is a fix in test case.
Attachment #8589569 - Flags: approval-mozilla-b2g37?
Attachment #8589569 - Flags: approval-mozilla-b2g34?
Attachment #8589569 - Flags: approval-mozilla-b2g32?
Attachment #8586644 - Flags: approval-mozilla-b2g37?
Attachment #8586644 - Flags: approval-mozilla-b2g37+
Attachment #8586644 - Flags: approval-mozilla-b2g34?
Attachment #8586644 - Flags: approval-mozilla-b2g34+
Attachment #8586644 - Flags: approval-mozilla-b2g32?
Attachment #8586644 - Flags: approval-mozilla-b2g32+
Comment on attachment 8589569 [details] [diff] [review]
Part 1: refactor the test cases (v2),

Spoke offline with Ryan and he gave me context and we feel its better to uplift this to help with permafails else we would have to end up disabling nuwa tests. Please note, these are deemed low risk as well, hence the a+
Attachment #8589569 - Flags: approval-mozilla-b2g37?
Attachment #8589569 - Flags: approval-mozilla-b2g37+
Attachment #8589569 - Flags: approval-mozilla-b2g34?
Attachment #8589569 - Flags: approval-mozilla-b2g34+
Attachment #8589569 - Flags: approval-mozilla-b2g32?
Attachment #8589569 - Flags: approval-mozilla-b2g32+
Duplicate of this bug: 1120580
Ouch, I meant to leave comment 31 in bug 1099195. Wrong tab. Sorry about that.
Flags: needinfo?(khuey)
You need to log in before you can comment on or make changes to this bug.