Closed Bug 977359 Opened 6 years ago Closed 6 years ago

Deadlock risk if we open PCompositor/PImageBridge on child process startup with Nuwa

Categories

(Core :: Graphics: Layers, defect)

All
Gonk (Firefox OS)
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla31
blocking-b2g 1.3+
Tracking Status
firefox29 --- wontfix
firefox30 --- fixed
firefox31 --- fixed
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: bent.mozilla, Assigned: cyu)

References

Details

Attachments

(4 files, 2 obsolete files)

While working on bug 956218 I found that 'opens' protocols block the calling thread while waiting on the IPC thread to open a channel. The Nuwa process freeze mechanism blocks the IPC thread and waits on the main thread when it resumes. Thus we have a deadlock hazard if we call Open on the main thread at child process startup.

I worked around this for PBackground but I didn't do anything for PCompositor/PImageBridge since it looks like we open them much later during startup. If we ever change this then we will need to do something here.
It doesn't look like we will encounter the same problem as in PBackground. PCompositor and PImageBridge are opened quite early in the Nuwa process:

The Nuwa process starts to freeze threads when it recvs app info (ContentParent::InitInternal() calling PContentParent::SendAppInfo()). Before that, we already called PCompositor::Open() and PImageBridge::Open(). Given that these messages should be processed in the same order in the Nuwa process, PCompositor and PImageBridge should already be opened.

Please note that we only block the IO thread when the Nuwa process forks a new process (by request from the chrome process). This already happens when ContentChild sends NuwaReady to the parent. That's another reason we should not run into the deadlock problem.
mwu saw this happen today on a device, so we need to fix this.
blocking-b2g: --- → 1.3?
Summary: Deadlock risk if we ever try to open PCompositor/PImageBridge on child process startup with Nuwa → Deadlock risk if we open PCompositor/PImageBridge on child process startup with Nuwa
Attached file mwu's thread stacks
thread 7 is attempting to start ImageBridge, but Nuwa has intercepted the condvar and blocked the thread.
Assignee: nobody → cyu
Attachment #8405316 - Flags: review?(khuey)
Attached patch Test case (obsolete) — Splinter Review
Attachment #8405317 - Flags: review?(khuey)
Test on emulator, fix + test case. https://tbpl.mozilla.org/?tree=Try&rev=efab0f236fe8
Test on emulator, only test case, to verify that the test case really works: https://tbpl.mozilla.org/?tree=Try&rev=54f5d640a699
What is the user experience here? What will happen to the user experience if this bug is not fixed in 1.3?
(In reply to Preeti Raghunath(:Preeti) from comment #7)
> What is the user experience here? What will happen to the user experience if
> this bug is not fixed in 1.3?

From the user POV, the homescreen does not show after unlocking the screen and no apps can be started. The only fix is to reboot the phone.
Nice test case.
Attached patch Test case (obsolete) — Splinter Review
Fix build break of the test case
Attachment #8405317 - Attachment is obsolete: true
Attachment #8405317 - Flags: review?(khuey)
Attachment #8405985 - Flags: review?(khuey)
Comment on attachment 8405985 [details] [diff] [review]
Test case

Review of attachment 8405985 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageLink.cpp
@@ +127,5 @@
>                  FROM_HERE,
>                  NewRunnableMethod(this, &ProcessLink::OnTakeConnectedChannel));
>          }
>  
> +#if defined(DEBUG) && defined(MOZ_NUWA_PROCESS)

Remove the DEBUG here. This is so cheap I think we can afford this and the test might otherwise not really test anything on OPT builds (are we even testing DEBUG builds?).
Attachment #8405985 - Flags: review?(khuey) → review+
Comment on attachment 8405316 [details] [diff] [review]
Fix Nuwa deadlock by delaying freezing the ImageBrigeChild thread.

Review of attachment 8405316 [details] [diff] [review]:
-----------------------------------------------------------------

I think khuey is on vacation this week so I will steal this but leave a feedback? up for him.
Attachment #8405316 - Flags: review?(khuey) → review+
Attachment #8405316 - Flags: feedback?(khuey)
(In reply to Andreas Gal :gal from comment #11)
> Comment on attachment 8405985 [details] [diff] [review]
> Test case
> 
> Review of attachment 8405985 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: ipc/glue/MessageLink.cpp
> @@ +127,5 @@
> >                  FROM_HERE,
> >                  NewRunnableMethod(this, &ProcessLink::OnTakeConnectedChannel));
> >          }
> >  
> > +#if defined(DEBUG) && defined(MOZ_NUWA_PROCESS)
> 
> Remove the DEBUG here. This is so cheap I think we can afford this and the
> test might otherwise not really test anything on OPT builds (are we even
> testing DEBUG builds?).

I think we have test against DEBUG build on emulators. I'll run it locally to verify that.
Repush to test on emulator:
Test case w/o the fix:  https://tbpl.mozilla.org/?tree=Try&rev=c6e47841d46d
Test case w/ the fix: https://tbpl.mozilla.org/?tree=Try&rev=44295f5887f2

I need to check if the sleep() trick is really effective and really checks for the deadlock (it is on my unagi).
Could you please add an explanation about what this sleep(5) is needed for?

I would also recommend checking the ImageBridge changes with :nical if you haven't already.
Attachment #8405316 - Flags: review?(nical.bugzilla)
Attachment #8405316 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 8405316 [details] [diff] [review]
Fix Nuwa deadlock by delaying freezing the ImageBrigeChild thread.

Review of attachment 8405316 [details] [diff] [review]:
-----------------------------------------------------------------

I'm around until the end of this week.

Do we need to do something similar for PCompositor?
Attachment #8405316 - Flags: feedback?(khuey) → feedback+
I'm impressed at how simple the test is.
Attached patch Test caseSplinter Review
Update the test case:
1. Increase the timeout to 90 seconds, based on local tests on the emulator.
2. Comment why we need this sleep(5) in MessageLink.cpp.
3. Change the trigger point of this sleep for testing: remove DEBUG-only build flag and add IsNuwaProcess() check since the deadlock test is meaningful only in the Nuwa process.
Attachment #8405985 - Attachment is obsolete: true
Attachment #8406832 - Flags: review+
I think we can remove the runtime check in the debug build for the deadlock because:

* We now have test cases for the Nuwa process (bug 922465 and this one).
* The 2 second timeout isn't good for the test case.
Attachment #8406836 - Flags: review?(khuey)
Retest on emulator since last test fails.

Test w/ the fix: https://tbpl.mozilla.org/?tree=Try&rev=d56cfa484ae0
Test w/o the fix: https://tbpl.mozilla.org/?tree=Try&rev=c9c20169d22e
blocking-b2g: 1.3? → 1.3+
Tests on emulator looks good. Test for all platforms:  https://tbpl.mozilla.org/?tree=Try&rev=363dc46c3c03
Another push to build B2G desktop OSX opt: https://tbpl.mozilla.org/?tree=Try&rev=13ea6e16da6e
Kyle, would you have a quick review of the new part: https://bugzilla.mozilla.org/attachment.cgi?id=8406836 which contains only removal of runtime timeout checks when we fork from the Nuwa process? Thanks.
Flags: needinfo?(khuey)
Faramarz wants all non-certification blockers to be removed from the blocking list, so this is no longer going to block 1.3.
blocking-b2g: 1.3+ → backlog
(In reply to Jason Smith [:jsmith] from comment #24)
> Faramarz wants all non-certification blockers to be removed from the
> blocking list, so this is no longer going to block 1.3.

If we don't take this patch we're going to ship a known deadlock at startup as a regression on 1.3.
Flags: needinfo?(khuey)
obviously there could be exceptions if the case is severe enough.  looks like this already has a review+ as well so it would make sense to include it in.  thanks.
I'll move this back to blocking-1.3+ then.
blocking-b2g: backlog → 1.3+
https://hg.mozilla.org/mozilla-central/rev/a7419fd65792
https://hg.mozilla.org/mozilla-central/rev/3b9fd735fea4
https://hg.mozilla.org/mozilla-central/rev/1c436dd6020f
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Comment on attachment 8405316 [details] [diff] [review]
Fix Nuwa deadlock by delaying freezing the ImageBrigeChild thread.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 771765
User impact if declined: The device fails to launch enter the homescreen or launch any app.
Testing completed: Manual and mochitest.
Risk to taking this patch (and alternatives if risky): Low. Only change in timing of some action.
Comment on attachment 8406832 [details] [diff] [review]
Test case

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 771765
Testing completed: Manual and mochitest.
Risk to taking this patch (and alternatives if risky): None. This part is the mochitest test case.
Attachment #8406832 - Flags: approval-mozilla-b2g28?
Attachment #8405316 - Flags: approval-mozilla-b2g28?
Comment on attachment 8406836 [details] [diff] [review]
Remove runtime check for the deadlock in the debug build.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 771765
User impact if declined: device lockup.
Testing completed: Manual and mochitest.
Risk to taking this patch (and alternatives if risky): Low. This part contains code removal only and is necessary for the test case.
Attachment #8406836 - Flags: approval-mozilla-b2g28?
Attachment #8405316 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
Attachment #8406832 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
Attachment #8406836 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
I had to back this out from b2g28 due to frequent (~50% of the time) B2G mochitest-8 hangs on shutdown. Eyeballing the logs, some sort of deadlock?
https://hg.mozilla.org/releases/mozilla-b2g28_v1_3/rev/b6503c985761

GOOD RUN: https://tbpl.mozilla.org/php/getParsedLog.php?id=38202868&tree=Mozilla-B2g28-v1.3
BAD RUN:  https://tbpl.mozilla.org/php/getParsedLog.php?id=38205873&tree=Mozilla-B2g28-v1.3

Logcat from the end of the run:

GOOD:
14:06:02     INFO -  3170 INFO TEST-START | Shutdown
14:06:02     INFO -  3171 INFO Passed:  1873
14:06:06     INFO -  3172 INFO Failed:  0
14:06:06     INFO -  3173 INFO Todo:    0
14:06:06     INFO -  3174 INFO Slowest: 27950ms - /tests/layout/base/tests/test_bug399284.html
14:06:06     INFO -  3175 INFO SimpleTest FINISHED
14:06:06     INFO -  3176 INFO TEST-INFO | Ran 1 Loops
14:06:06     INFO -  3177 INFO SimpleTest FINISHED
14:06:06     INFO -  NOTE: child process received `Goodbye', closing down
14:06:06     INFO -  System JS : ERROR resource://gre/modules/ObjectWrapper.jsm:75 - Error: Expected an object as the target scope
14:06:06     INFO -  System JS : ERROR resource://gre/modules/ObjectWrapper.jsm:75 - Error: Expected an object as the target scope
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  NOTE: child process received `Goodbye', closing down
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:06:06     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:07:14     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/baMochitest INFO | runtestsb2g.py | Running tests: end.
14:07:19     INFO -  se/process_util_posix.cc, line 254
14:07:19     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:07:19     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:07:19     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:07:19     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:07:19     INFO -  [Parent 673] WARNING: waitpid failed pid:718 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
14:07:19     INFO - Return code: 0
14:07:19     INFO - Copying /builds/slave/test/build/emulator-5554.log to /builds/slave/test/build/blobber_upload_dir/emulator-5554.log
14:07:19     INFO - TinderboxPrint: mochitest<br/>1873/0/0
14:07:19     INFO - # TBPL SUCCESS #

BAD:
15:09:50     INFO -  04-21 22:08:30.441   720   720 I GeckoDump: 3170 INFO TEST-START | Shutdown
15:09:50     INFO -  04-21 22:08:30.451   720   720 I GeckoDump: 3171 INFO Passed:  1873
15:09:50     INFO -  04-21 22:08:30.451   720   720 I GeckoDump: 3172 INFO Failed:  0
15:09:50     INFO -  04-21 22:08:30.471   720   720 I GeckoDump: 3173 INFO Todo:    0
15:09:50     INFO -  04-21 22:08:30.481   720   720 I GeckoDump: 3174 INFO Slowest: 30477ms - /tests/layout/base/tests/test_bug399284.html
15:09:50     INFO -  04-21 22:08:30.481   720   720 I GeckoDump: 3175 INFO SimpleTest FINISHED
15:09:50     INFO -  04-21 22:08:30.512   720   720 I GeckoDump: 3176 INFO TEST-INFO | Ran 1 Loops
15:09:50     INFO -  04-21 22:08:30.531   720   720 I GeckoDump: 3177 INFO SimpleTest FINISHED
15:09:50  WARNING -  04-21 22:08:32.361   674   674 E GeckoConsole: [JavaScript Error: "Error: Expected an object as the target scope" {file: "resource://gre/modules/ObjectWrapper.jsm" line: 75}]
15:09:50  WARNING -  04-21 22:08:32.371   674   674 E GeckoConsole: [JavaScript Error: "Error: Expected an object as the target scope" {file: "resource://gre/modules/ObjectWrapper.jsm" line: 75}]
15:09:50     INFO -  04-21 22:08:32.391   674   680 I Gecko   : [Parent 674] WARNING: waitpid failed pid:720 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
15:09:50     INFO -  04-21 22:08:33.431   674   693 E Profiler: BPUnw: [1 total] thread_unregister_for_profiling(me=0x22a90)  (NOT REGISTERED)
15:09:50     INFO -  04-21 22:08:33.501   674   789 E Profiler: BPUnw: [1 total] thread_unregister_for_profiling(me=0x22f60)  (NOT REGISTERED)
15:09:50     INFO -  04-21 22:08:33.821   674   680 I Gecko   : [Parent 674] WARNING: waitpid failed pid:720 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
15:09:50     INFO -  04-21 22:08:33.821   674   680 I Gecko   : [Parent 674] WARNING: waitpid failed pid:720 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
15:09:50     INFO -  04-21 22:08:33.831   674   680 I Gecko   : [Parent 674] WARNING: waitpid failed pid:720 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
15:09:50     INFO -  04-21 22:08:35.361   674   680 I Gecko   : [Parent 674] WARNING: waitpid failed pid:720 errno:10: file ../../../gecko/ipc/chromium/src/base/process_util_posix.cc, line 254
15:09:50     INFO -  04-21 22:08:35.361   674   680 I Gecko   : [Parent 674] WARNING: Failed to deliver SIGKILL to 720!(3).: file ../../../gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc, line 118
15:09:50     INFO -  04-21 22:08:35.492    33    33 I ServiceManager: service 'media.resource_manager' died
15:09:50     INFO -  04-21 22:08:35.501    33    33 I ServiceManager: service 'permission' died
15:09:50     INFO - Return code: 0
15:09:50     INFO - Copying /builds/slave/test/build/emulator-5554.log to /builds/slave/test/build/blobber_upload_dir/emulator-5554.log
15:09:50     INFO - TinderboxPrint: mochitest<br/><em class="testfail">T-FAIL</em>
15:09:50  WARNING - # TBPL WARNING #
Doesn't look like a deadlock in mochitest-8. The difference is that we failed to kill the child process because it's already dead:
Failed to deliver SIGKILL to 720!(3) shows that we kill the process with SIGKILL and the errno is ESRCH, no such process.

It looks like the warnings like
15:09:50  WARNING -  04-21 22:08:32.361   674   674 E GeckoConsole: [JavaScript Error: "Error: Expected an object as the target scope" {file: "resource://gre/modules/ObjectWrapper.jsm" line: 75}]
makes it T-FAIL, but I don't think the fix itself should cause such damage to the test.

Retest with only the fix: https://tbpl.mozilla.org/?tree=Try&rev=23aa0eecbc09
(In reply to Cervantes Yu from comment #36)
> Retest with only the fix: https://tbpl.mozilla.org/?tree=Try&rev=23aa0eecbc09

Looks good :)
Once the test case comes in, mochitest-8 starts to see intermittent failures:

https://tbpl.mozilla.org/?tree=Try&rev=e1db3226c01c

while the test case is added to mochitest-9
With part 1 and 3, the test also looks good:

https://tbpl.mozilla.org/?tree=Try&rev=2844357baa2b

It's strange that the test case is run in mochitest-9 but once checked in mochitest-8 is failing.

Can we uplift the fix only?
It's probably changing the chunking around (so a test that previously ran in one chunk is now running in a different one).
Bhavana, what's your suggestion to uplifting? Uplifting to 1.3 and 1.3t of this bug is now blocked by the added test, not the fix itself. Can we uplift the fix and then track the why the new test case if offending to other tests? If you agree, I will open a bug for tracking the uplifting of the test case.
Flags: needinfo?(bbajaj)
(In reply to Cervantes Yu from comment #41)
> Bhavana, what's your suggestion to uplifting? Uplifting to 1.3 and 1.3t of
> this bug is now blocked by the added test, not the fix itself. Can we uplift
> the fix and then track the why the new test case if offending to other
> tests? If you agree, I will open a bug for tracking the uplifting of the
> test case.

:cyu, are you able to repro the mochitest failure locally as well ? If you are confident the failure is isolated and nothing to the code change, I would recommend checking-in the offending test later once its fully investigated and fixed by opening a new bug as the fix has stuck on 1.4 and 2.0.
Flags: needinfo?(bbajaj)
Flags: needinfo?(cyu)
I ran mochitest-8 locally many times and each passed. I will check in the fix and handle the test failure in a followup bug.
Flags: needinfo?(cyu)
See Also: → 1008905
You need to log in before you can comment on or make changes to this bug.