Closed
Bug 1048011
Opened 10 years ago
Closed 10 years ago
Fix a race condition in ProcLoaderParent
Categories
(Core :: IPC, defect)
Core
IPC
Tracking
()
People
(Reporter: khuey, Assigned: cyu)
References
Details
Attachments
(1 file, 1 obsolete file)
4.18 KB,
patch
|
cyu
:
review+
bajaj
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
My latest push to try with the patches from bug 1038943 shows that we're leaking PProcLoaderParent.
https://tbpl.mozilla.org/php/getParsedLog.php?id=45135816&tree=Try&full=1#error0
That seems to imply that _ProcLoaderParentDestroy never gets called.
Updated•10 years ago
|
Whiteboard: [MemShrink] → [MemShrink:P2]
Assignee | ||
Comment 2•10 years ago
|
||
Attachment #8470693 -
Flags: review?(khuey)
Assignee | ||
Comment 3•10 years ago
|
||
Some debug instrumentation that explains the problem in more detail:
[time:1407746798360957][23139<-23128][PProcLoaderChild] Received Msg_Load([TODO])
[time:1407746798361353][23139->23128][PProcLoaderChild] Sending Msg_LoadComplete([TODO])
[23139]MessageChannel::Close()
NOTE: parent process received `Goodbye', closing down
[time:1407746798374903][23139->23128][PContentChild] Sending Msg_PCrashReporterConstructor([TODO])
When we send out LoadComplete() from child and Close() the channel right away, the LoadComplete() message will not be processed by the parent and the leak results. Actually it will be worse if the parent has the chance to receive LoadComplete() because the channel will be double closed and the parent process will abort in MessageChannel::Close().
Comment 4•10 years ago
|
||
Once the child process close the channel, the parent process should detect a channel error and delete actors. It is weird that the parent seems don't detect the channel error. I think there are something as the root cause of this bug.
For double closed issue, it is a weird design of APIs. For my understanding, most APIs still require the user to close/release the resource even there is an error. Our design is uncommon. I suggest to file another bug to fix it.
Reporter | ||
Comment 5•10 years ago
|
||
Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak
This doesn't appear to fix the leak of PProcLoaderParent on my emulator. I put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.
Attachment #8470693 -
Flags: review?(khuey)
Assignee | ||
Comment 6•10 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> Comment on attachment 8470693 [details] [diff] [review]
> Fix the race condition that cuases the leak
>
> This doesn't appear to fix the leak of PProcLoaderParent on my emulator. I
> put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.
Hmm, that's strange. I put printf()'s in ProcLoaderParent's ctor and dtor and saw them called. This is on a real device. Maybe emulator that runs much slower could make a difference. I will check this on emulator later.
130|root@android:/ # b2g.sh
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
========== ProcLoaderParent ctor 0x43d86840
*****************************************************************
***
*** This is stdout. Most of the useful output will be in logcat.
***
*****************************************************************
1408452134133 Marionette INFO marionette enabled via build flag and pref
1408452134741 Marionette INFO marionette-server.js loaded
1408452135082 Marionette INFO B2G emulator: no
1408452135085 Marionette INFO Device detected is unagi
1408452135089 Marionette INFO Listening on port 2828
1408452135091 Marionette INFO Marionette server ready
System JS : WARNING resource://gre/modules/Preferences.jsm:378 - mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the corre$t initial [[Prototype]] value using Object.create
========== ProcLoaderParent dtor 0x43d86840
Reporter | ||
Comment 7•10 years ago
|
||
If you can't reproduce this on the emulator let me know. I see this all the time so I could debug it if needed.
Assignee | ||
Comment 8•10 years ago
|
||
I still see the dtor called on emulator, but the system app fails to load and mochitest fails run run. I am debugging why it fails to run.
Assignee | ||
Comment 9•10 years ago
|
||
Kyle are you using debug build? If yes then I think you run into bug 1050122.
On debug build, "dom.ipc.processPrelaunch.enabled" is false, and the Nuwa process will not be launched [1]. That is, no one will call base::LaunchAppProcLoader [2] and move ProcLoaderParent/Child forward. ProcLoaderParent is constructed in ProcLoaderClientGeckoInit() [3], which is called unconditionally from XRE_main() [4]. That means we always leak ProcLoaderParent and the forked process on debug build.
Since the issue is being handled in bug 1050122, I will focus this patch on the race condition between ProcLoaderParent and ProcLoaderChild. Could you review the patch for the race condition and leave the above issue in bug 1050122? Thanks.
[1] http://dxr.mozilla.org/mozilla-central/source/dom/ipc/PreallocatedProcessManager.cpp#180
[2] http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_linux.cc#205
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/ProcessUtils_linux.cpp#190
[4] http://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#3817
Flags: needinfo?(khuey)
Reporter | ||
Comment 10•10 years ago
|
||
That's a good point. I'll review this tomorrow.
Flags: needinfo?(khuey)
Reporter | ||
Comment 11•10 years ago
|
||
Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak
Review of attachment 8470693 [details] [diff] [review]:
-----------------------------------------------------------------
r=me
Attachment #8470693 -
Flags: review+
Reporter | ||
Updated•10 years ago
|
Assignee | ||
Comment 12•10 years ago
|
||
Try push looks good: https://tbpl.mozilla.org/?tree=Try&rev=ee136047f739
Assignee | ||
Comment 13•10 years ago
|
||
Comment 14•10 years ago
|
||
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Comment 15•10 years ago
|
||
We strongly suspect this for causing bug 1062255 crash and other Settings issues on Flame device. See bug 1062255 for regression range finding, crash dumps, etc
RyanVM, is it possible to backout?
Flags: needinfo?(ryanvm)
Comment 17•10 years ago
|
||
(In reply to Zac C (:zac) from comment #15)
> We strongly suspect this for causing bug 1062255 crash and other Settings
> issues on Flame device. See bug 1062255 for regression range finding, crash
> dumps, etc
>
> RyanVM, is it possible to backout?
got backed out in https://hg.mozilla.org/mozilla-central/rev/31ce2eb10676
Status: RESOLVED → REOPENED
Flags: needinfo?(cbook)
Resolution: FIXED → ---
Reporter | ||
Comment 18•10 years ago
|
||
We still need to land this, so you need to work with Cervantes to find a way for him to reproduce this. This will fix things like bug 1056924.
blocking-b2g: --- → 2.1?
Flags: needinfo?(zcampbell)
Comment 20•10 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #18)
> We still need to land this, so you need to work with Cervantes to find a way
> for him to reproduce this. This will fix things like bug 1056924.
qDot has left such info in bug 1062255 (to replicate manually) but if he'd prefer to try by running the tests too then just ping me.
Flags: needinfo?(zcampbell)
Reporter | ||
Comment 21•10 years ago
|
||
Can you link to a specific comment because I don't see any STR in there.
Updated•10 years ago
|
Comment 22•10 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #21)
> Can you link to a specific comment because I don't see any STR in there.
Pardon, I recalled qDot's comment was more clear.
To run the actual automation:
1. Follow these instructions to set up the environment:
https://developer.mozilla.org/en-US/Firefox_OS/Platform/Automated_testing/gaia-ui-tests/Gaia_UI_Tests_Run_Tests#Testing_on_Firefox_OS_Devices
2. When it comes to run the test, use this command line:
gaiatest --testvars gaiatest/testvars_flame_2imei.json --html-output=results/index.html --address localhost:2828 --restart gaiatest/tests/unit/test_cleanup_gaia.py --repeat 49
However I am sure there are easier ways to cause it to fail. A shell script that does stop b2g, start b2g, wait 60 seconds, [some kind of life check like check marionette port open], in a loop might be sufficient.
Assignee | ||
Comment 23•10 years ago
|
||
(In reply to Zac C (:zac) from comment #22)
> However I am sure there are easier ways to cause it to fail. A shell script
> that does stop b2g, start b2g, wait 60 seconds, [some kind of life check
> like check marionette port open], in a loop might be sufficient.
Sorry for causing such trouble. I didn't expect this patch to be so catastrophic.
I can reproduce this by restarting b2g on the device, but it's not easy to see the failure on my device. Does gaia UI test make it easier to fail?
Assignee | ||
Comment 24•10 years ago
|
||
OK, I can reproduce it more reliably. The key is to start b2g right after previous shutdown. That's why I didn't capture it in gdb.
Comment 25•10 years ago
|
||
No worries Cervantes I know you didn't mean it!
Although during the test we wait for all processes containing 'b2g' to be gone so we thought it should be sufficiently torn down. Perhaps we need to wait further.
Assignee | ||
Comment 26•10 years ago
|
||
Fix the cross-process race condition from the other direction (i.e. parent actively closes the actor) seems to work from by tests so far, but I still don't know why it makes the difference.
Assignee | ||
Comment 27•10 years ago
|
||
Tracing the root cause why this patch will break the seemingly irrelevant indexedDB file lock brings me to incorrect file usage:
(gdb) fr 2
#2 0xb6a512f8 in unixLock (eFileLock=<optimized out>, id=0xaf209914) at /mnt/SSD/data/hg/mozilla-central/db/sqlite3/src/sqlite3.c:25865
25865 rc = sqliteErrorFromPosixError(tErrno, SQLITE_IOERR_LOCK);
(gdb) p *(unixFile*)id
$8 = {
... (omitted)
h = 28,
... (omitted)
zPath = 0xaf209a28 "/data/local/storage/persistent/chrome/idb/2588645841ssegtnti.sqlite",
... (omitted)
}
and file descriptor 28 is actually a socket:
lrwx------ root root 2014-09-24 19:52 28 -> socket:[2715481]
I am still tracing the file descriptor is taken by unix domain socket.
Assignee | ||
Comment 28•10 years ago
|
||
It appears that ProcLoaderParent double closes the channel file descriptor, one in closing the IPC channel and the other in FileDescriptor dtor. Because indexedDB operations happen on another thread, we may close the wrong file descriptor and make indexedDB operations fail.
Assignee | ||
Comment 29•10 years ago
|
||
This patch also fixes the double-close of FD used in PProaLoader: once in shutting down the IPC channel and the other in mozilla::ipc::FileDescriptor dtor. The double-close problem is inherent in the original B2G Loader code, but revealed by the previous version of patch.
Try push log: https://tbpl.mozilla.org/?tree=Try&rev=bad4b48fd16a
I'll reland after the tests are green.
Attachment #8470693 -
Attachment is obsolete: true
Attachment #8498100 -
Flags: review+
Updated•10 years ago
|
Flags: needinfo?(echou)
Assignee | ||
Comment 31•10 years ago
|
||
Comment 32•10 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Assignee | ||
Comment 33•10 years ago
|
||
Comment on attachment 8498100 [details] [diff] [review]
Fix the race condition that cuases the leak
Approval Request Comment
[Feature/regressing bug #]: 977026
[User impact if declined]: minor memory leak or probably a crash.
[Describe test coverage new/current, TBPL]: manual test with debug instrumentation. Regression on try.
[Risks and why]: Low. This patch fixes a cross-process race condition that both sides could possibly close the channel. We make fix the race by always closing the channel from one side. This should be safe because we remove the race and nondeterminism.
Attachment #8498100 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
Attachment #8498100 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 34•10 years ago
|
||
NI Inder to see if the want to pull this in for the next stability test run.
Flags: needinfo?(ikumar)
Comment 35•10 years ago
|
||
You need to log in
before you can comment on or make changes to this bug.
Description
•