Closed Bug 1048011 Opened 10 years ago Closed 10 years ago

Fix a race condition in ProcLoaderParent

Categories

(Core :: IPC, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35
blocking-b2g 2.1+
Tracking Status
firefox33 --- wontfix
firefox34 --- fixed
firefox35 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: khuey, Assigned: cyu)

References

Details

Attachments

(1 file, 1 obsolete file)

My latest push to try with the patches from bug 1038943 shows that we're leaking PProcLoaderParent.

https://tbpl.mozilla.org/php/getParsedLog.php?id=45135816&tree=Try&full=1#error0

That seems to imply that _ProcLoaderParentDestroy never gets called.
I'll take a look.
Assignee: nobody → cyu
Whiteboard: [MemShrink] → [MemShrink:P2]
Some debug instrumentation that explains the problem in more detail:

[time:1407746798360957][23139<-23128][PProcLoaderChild] Received Msg_Load([TODO])
[time:1407746798361353][23139->23128][PProcLoaderChild] Sending Msg_LoadComplete([TODO])
[23139]MessageChannel::Close()
NOTE: parent process received `Goodbye', closing down
[time:1407746798374903][23139->23128][PContentChild] Sending Msg_PCrashReporterConstructor([TODO])

When we send out LoadComplete() from child and Close() the channel right away, the LoadComplete() message will not be processed by the parent and the leak results. Actually it will be worse if the parent has the chance to receive LoadComplete() because the channel will be double closed and the parent process will abort in MessageChannel::Close().
See Also: → 1050122
Once the child process close the channel, the parent process should detect a channel error and delete actors.  It is weird that the parent seems don't detect the channel error.  I think there are something as the root cause of this bug.

For double closed issue, it is a weird design of APIs.  For my understanding, most APIs still require the user to close/release the resource even there is an error.  Our design is uncommon.  I suggest to file another bug to fix it.
Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak

This doesn't appear to fix the leak of PProcLoaderParent on my emulator.  I put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.
Attachment #8470693 - Flags: review?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> Comment on attachment 8470693 [details] [diff] [review]
> Fix the race condition that cuases the leak
> 
> This doesn't appear to fix the leak of PProcLoaderParent on my emulator.  I
> put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.

Hmm, that's strange. I put printf()'s in ProcLoaderParent's ctor and dtor and saw them called. This is on a real device. Maybe emulator that runs much slower could make a difference. I will check this on emulator later.

130|root@android:/ # b2g.sh                                                    
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
========== ProcLoaderParent ctor 0x43d86840
*****************************************************************
***
*** This is stdout. Most of the useful output will be in logcat.
***
*****************************************************************
1408452134133   Marionette      INFO    marionette enabled via build flag and pref
1408452134741   Marionette      INFO    marionette-server.js loaded
1408452135082   Marionette      INFO    B2G emulator: no
1408452135085   Marionette      INFO    Device detected is unagi
1408452135089   Marionette      INFO    Listening on port 2828

1408452135091   Marionette      INFO    Marionette server ready
System JS : WARNING resource://gre/modules/Preferences.jsm:378 - mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the corre$t initial [[Prototype]] value using Object.create
========== ProcLoaderParent dtor 0x43d86840
See Also: → 1056924
If you can't reproduce this on the emulator let me know.  I see this all the time so I could debug it if needed.
I still see the dtor called on emulator, but the system app fails to load and mochitest fails run run. I am debugging why it fails to run.
Kyle are you using debug build? If yes then I think you run into bug 1050122.

On debug build, "dom.ipc.processPrelaunch.enabled" is false, and the Nuwa process will not be launched [1]. That is, no one will call base::LaunchAppProcLoader [2] and move ProcLoaderParent/Child forward. ProcLoaderParent is constructed in ProcLoaderClientGeckoInit() [3], which is called unconditionally from XRE_main() [4]. That means we always leak ProcLoaderParent and the forked process on debug build.

Since the issue is being handled in bug 1050122, I will focus this patch on the race condition between ProcLoaderParent and ProcLoaderChild. Could you review the patch for the race condition and leave the above issue in bug 1050122? Thanks.

[1] http://dxr.mozilla.org/mozilla-central/source/dom/ipc/PreallocatedProcessManager.cpp#180
[2] http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_linux.cc#205
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/ProcessUtils_linux.cpp#190
[4] http://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#3817
Flags: needinfo?(khuey)
That's a good point.  I'll review this tomorrow.
Flags: needinfo?(khuey)
Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak

Review of attachment 8470693 [details] [diff] [review]:
-----------------------------------------------------------------

r=me
Attachment #8470693 - Flags: review+
No longer blocks: 1038943
Keywords: mlk
Summary: PProcLoaderParent is never freed → Fix a race condition in ProcLoaderParent
Whiteboard: [MemShrink:P2]
https://hg.mozilla.org/mozilla-central/rev/0a76964b8d8d
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Blocks: 1062255
We strongly suspect this for causing bug 1062255 crash and other Settings issues on Flame device. See bug 1062255 for regression range finding, crash dumps, etc

RyanVM, is it possible to backout?
Flags: needinfo?(ryanvm)
Had the wrong sheriff :)
Flags: needinfo?(ryanvm) → needinfo?(cbook)
(In reply to Zac C (:zac) from comment #15)
> We strongly suspect this for causing bug 1062255 crash and other Settings
> issues on Flame device. See bug 1062255 for regression range finding, crash
> dumps, etc
> 
> RyanVM, is it possible to backout?

got backed out in https://hg.mozilla.org/mozilla-central/rev/31ce2eb10676
Status: RESOLVED → REOPENED
Flags: needinfo?(cbook)
Resolution: FIXED → ---
We still need to land this, so you need to work with Cervantes to find a way for him to reproduce this.  This will fix things like bug 1056924.
blocking-b2g: --- → 2.1?
Flags: needinfo?(zcampbell)
See comment 18.
blocking-b2g: 2.1? → 2.1+
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #18)
> We still need to land this, so you need to work with Cervantes to find a way
> for him to reproduce this.  This will fix things like bug 1056924.

qDot has left such info in bug 1062255 (to replicate manually) but if he'd prefer to try by running the tests too then just ping me.
Flags: needinfo?(zcampbell)
Can you link to a specific comment because I don't see any STR in there.
Target Milestone: mozilla35 → ---
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #21)
> Can you link to a specific comment because I don't see any STR in there.

Pardon, I recalled qDot's comment was more clear.

To run the actual automation:
1. Follow these instructions to set up the environment:
https://developer.mozilla.org/en-US/Firefox_OS/Platform/Automated_testing/gaia-ui-tests/Gaia_UI_Tests_Run_Tests#Testing_on_Firefox_OS_Devices

2. When it comes to run the test, use this command line:
gaiatest --testvars gaiatest/testvars_flame_2imei.json --html-output=results/index.html  --address localhost:2828 --restart gaiatest/tests/unit/test_cleanup_gaia.py --repeat 49

However I am sure there are easier ways to cause it to fail. A shell script that does stop b2g, start b2g, wait 60 seconds, [some kind of life check like check marionette port open], in a loop might be sufficient.
(In reply to Zac C (:zac) from comment #22)
> However I am sure there are easier ways to cause it to fail. A shell script
> that does stop b2g, start b2g, wait 60 seconds, [some kind of life check
> like check marionette port open], in a loop might be sufficient.

Sorry for causing such trouble. I didn't expect this patch to be so catastrophic.

I can reproduce this by restarting b2g on the device, but it's not easy to see the failure on my device. Does gaia UI test make it easier to fail?
OK, I can reproduce it more reliably. The key is to start b2g right after previous shutdown. That's why I didn't capture it in gdb.
No worries Cervantes I know you didn't mean it!

Although during the test we wait for all processes containing 'b2g' to be gone so we thought it should be sufficiently torn down. Perhaps we need to wait further.
Fix the cross-process race condition from the other direction (i.e. parent actively closes the actor) seems to work from by tests so far, but I still don't know why it makes the difference.
Tracing the root cause why this patch will break the seemingly irrelevant indexedDB file lock brings me to incorrect file usage:

(gdb) fr 2
#2  0xb6a512f8 in unixLock (eFileLock=<optimized out>, id=0xaf209914) at /mnt/SSD/data/hg/mozilla-central/db/sqlite3/src/sqlite3.c:25865
25865         rc = sqliteErrorFromPosixError(tErrno, SQLITE_IOERR_LOCK);
(gdb) p *(unixFile*)id
$8 = {
... (omitted)
  h = 28, 
... (omitted)
  zPath = 0xaf209a28 "/data/local/storage/persistent/chrome/idb/2588645841ssegtnti.sqlite", 
... (omitted)
}

and file descriptor 28 is actually a socket:
lrwx------ root     root              2014-09-24 19:52 28 -> socket:[2715481]

I am still tracing the file descriptor is taken by unix domain socket.
It appears that ProcLoaderParent double closes the channel file descriptor, one in closing the IPC channel and the other in FileDescriptor dtor. Because indexedDB operations happen on another thread, we may close the wrong file descriptor and make indexedDB operations fail.
This patch also fixes the double-close of FD used in PProaLoader: once in shutting down the IPC channel and the other in mozilla::ipc::FileDescriptor dtor. The double-close problem is inherent in the original B2G Loader code, but revealed by the previous version of patch.

Try push log: https://tbpl.mozilla.org/?tree=Try&rev=bad4b48fd16a

I'll reland after the tests are green.
Attachment #8470693 - Attachment is obsolete: true
Attachment #8498100 - Flags: review+
Flags: needinfo?(echou)
wrong ni
Flags: needinfo?(echou)
https://hg.mozilla.org/mozilla-central/rev/a39933b3becc
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Comment on attachment 8498100 [details] [diff] [review]
Fix the race condition that cuases the leak

Approval Request Comment
[Feature/regressing bug #]: 977026
[User impact if declined]: minor memory leak or probably a crash.
[Describe test coverage new/current, TBPL]: manual test with debug instrumentation. Regression on try.
[Risks and why]: Low. This patch fixes a cross-process race condition that both sides could possibly close the channel. We make fix the race by always closing the channel from one side. This should be safe because we remove the race and nondeterminism.
Attachment #8498100 - Flags: approval-mozilla-aurora?
Attachment #8498100 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
NI Inder to see if the want to pull this in for the next stability test run.
Flags: needinfo?(ikumar)
Flags: needinfo?(ikumar)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: