bugzilla.mozilla.org has resumed normal operation. Attachments prior to 2014 will be unavailable for a few days. This is tracked in Bug 1475801.
Please report any other irregularities here.

Fix a race condition in ProcLoaderParent

RESOLVED FIXED in Firefox 34, Firefox OS v2.1

Status

()

Core
IPC
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: khuey, Assigned: cyu)

Tracking

unspecified
mozilla35
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:2.1+, firefox33 wontfix, firefox34 fixed, firefox35 fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

Details

Attachments

(1 attachment, 1 obsolete attachment)

My latest push to try with the patches from bug 1038943 shows that we're leaking PProcLoaderParent.

https://tbpl.mozilla.org/php/getParsedLog.php?id=45135816&tree=Try&full=1#error0

That seems to imply that _ProcLoaderParentDestroy never gets called.
(Assignee)

Comment 1

4 years ago
I'll take a look.
Assignee: nobody → cyu
Whiteboard: [MemShrink] → [MemShrink:P2]
(Assignee)

Comment 2

4 years ago
Created attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak

Try push: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=8956b8a3e248
Attachment #8470693 - Flags: review?(khuey)
(Assignee)

Comment 3

4 years ago
Some debug instrumentation that explains the problem in more detail:

[time:1407746798360957][23139<-23128][PProcLoaderChild] Received Msg_Load([TODO])
[time:1407746798361353][23139->23128][PProcLoaderChild] Sending Msg_LoadComplete([TODO])
[23139]MessageChannel::Close()
NOTE: parent process received `Goodbye', closing down
[time:1407746798374903][23139->23128][PContentChild] Sending Msg_PCrashReporterConstructor([TODO])

When we send out LoadComplete() from child and Close() the channel right away, the LoadComplete() message will not be processed by the parent and the leak results. Actually it will be worse if the parent has the chance to receive LoadComplete() because the channel will be double closed and the parent process will abort in MessageChannel::Close().
See Also: → bug 1050122

Comment 4

4 years ago
Once the child process close the channel, the parent process should detect a channel error and delete actors.  It is weird that the parent seems don't detect the channel error.  I think there are something as the root cause of this bug.

For double closed issue, it is a weird design of APIs.  For my understanding, most APIs still require the user to close/release the resource even there is an error.  Our design is uncommon.  I suggest to file another bug to fix it.
Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak

This doesn't appear to fix the leak of PProcLoaderParent on my emulator.  I put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.
Attachment #8470693 - Flags: review?(khuey)
(Assignee)

Comment 6

4 years ago
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> Comment on attachment 8470693 [details] [diff] [review]
> Fix the race condition that cuases the leak
> 
> This doesn't appear to fix the leak of PProcLoaderParent on my emulator.  I
> put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.

Hmm, that's strange. I put printf()'s in ProcLoaderParent's ctor and dtor and saw them called. This is on a real device. Maybe emulator that runs much slower could make a difference. I will check this on emulator later.

130|root@android:/ # b2g.sh                                                    
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
========== ProcLoaderParent ctor 0x43d86840
*****************************************************************
***
*** This is stdout. Most of the useful output will be in logcat.
***
*****************************************************************
1408452134133   Marionette      INFO    marionette enabled via build flag and pref
1408452134741   Marionette      INFO    marionette-server.js loaded
1408452135082   Marionette      INFO    B2G emulator: no
1408452135085   Marionette      INFO    Device detected is unagi
1408452135089   Marionette      INFO    Listening on port 2828

1408452135091   Marionette      INFO    Marionette server ready
System JS : WARNING resource://gre/modules/Preferences.jsm:378 - mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the corre$t initial [[Prototype]] value using Object.create
========== ProcLoaderParent dtor 0x43d86840
(Assignee)

Updated

4 years ago
See Also: → bug 1056924
If you can't reproduce this on the emulator let me know.  I see this all the time so I could debug it if needed.
(Assignee)

Comment 8

4 years ago
I still see the dtor called on emulator, but the system app fails to load and mochitest fails run run. I am debugging why it fails to run.
(Assignee)

Comment 9

4 years ago
Kyle are you using debug build? If yes then I think you run into bug 1050122.

On debug build, "dom.ipc.processPrelaunch.enabled" is false, and the Nuwa process will not be launched [1]. That is, no one will call base::LaunchAppProcLoader [2] and move ProcLoaderParent/Child forward. ProcLoaderParent is constructed in ProcLoaderClientGeckoInit() [3], which is called unconditionally from XRE_main() [4]. That means we always leak ProcLoaderParent and the forked process on debug build.

Since the issue is being handled in bug 1050122, I will focus this patch on the race condition between ProcLoaderParent and ProcLoaderChild. Could you review the patch for the race condition and leave the above issue in bug 1050122? Thanks.

[1] http://dxr.mozilla.org/mozilla-central/source/dom/ipc/PreallocatedProcessManager.cpp#180
[2] http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_linux.cc#205
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/ProcessUtils_linux.cpp#190
[4] http://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#3817
Flags: needinfo?(khuey)
That's a good point.  I'll review this tomorrow.
Flags: needinfo?(khuey)
Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak

Review of attachment 8470693 [details] [diff] [review]:
-----------------------------------------------------------------

r=me
Attachment #8470693 - Flags: review+
No longer blocks: 1038943
Keywords: mlk
Summary: PProcLoaderParent is never freed → Fix a race condition in ProcLoaderParent
Whiteboard: [MemShrink:P2]
https://hg.mozilla.org/mozilla-central/rev/0a76964b8d8d
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35

Updated

4 years ago
Blocks: 1062255

Comment 15

4 years ago
We strongly suspect this for causing bug 1062255 crash and other Settings issues on Flame device. See bug 1062255 for regression range finding, crash dumps, etc

RyanVM, is it possible to backout?
Flags: needinfo?(ryanvm)

Comment 16

4 years ago
Had the wrong sheriff :)
Flags: needinfo?(ryanvm) → needinfo?(cbook)
(In reply to Zac C (:zac) from comment #15)
> We strongly suspect this for causing bug 1062255 crash and other Settings
> issues on Flame device. See bug 1062255 for regression range finding, crash
> dumps, etc
> 
> RyanVM, is it possible to backout?

got backed out in https://hg.mozilla.org/mozilla-central/rev/31ce2eb10676
Status: RESOLVED → REOPENED
Flags: needinfo?(cbook)
Resolution: FIXED → ---
We still need to land this, so you need to work with Cervantes to find a way for him to reproduce this.  This will fix things like bug 1056924.
blocking-b2g: --- → 2.1?
Flags: needinfo?(zcampbell)
See comment 18.
blocking-b2g: 2.1? → 2.1+

Comment 20

4 years ago
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #18)
> We still need to land this, so you need to work with Cervantes to find a way
> for him to reproduce this.  This will fix things like bug 1056924.

qDot has left such info in bug 1062255 (to replicate manually) but if he'd prefer to try by running the tests too then just ping me.
Flags: needinfo?(zcampbell)
Can you link to a specific comment because I don't see any STR in there.
status-b2g-v2.1: --- → affected
status-b2g-v2.2: --- → affected
Target Milestone: mozilla35 → ---

Comment 22

4 years ago
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #21)
> Can you link to a specific comment because I don't see any STR in there.

Pardon, I recalled qDot's comment was more clear.

To run the actual automation:
1. Follow these instructions to set up the environment:
https://developer.mozilla.org/en-US/Firefox_OS/Platform/Automated_testing/gaia-ui-tests/Gaia_UI_Tests_Run_Tests#Testing_on_Firefox_OS_Devices

2. When it comes to run the test, use this command line:
gaiatest --testvars gaiatest/testvars_flame_2imei.json --html-output=results/index.html  --address localhost:2828 --restart gaiatest/tests/unit/test_cleanup_gaia.py --repeat 49

However I am sure there are easier ways to cause it to fail. A shell script that does stop b2g, start b2g, wait 60 seconds, [some kind of life check like check marionette port open], in a loop might be sufficient.
(Assignee)

Comment 23

4 years ago
(In reply to Zac C (:zac) from comment #22)
> However I am sure there are easier ways to cause it to fail. A shell script
> that does stop b2g, start b2g, wait 60 seconds, [some kind of life check
> like check marionette port open], in a loop might be sufficient.

Sorry for causing such trouble. I didn't expect this patch to be so catastrophic.

I can reproduce this by restarting b2g on the device, but it's not easy to see the failure on my device. Does gaia UI test make it easier to fail?
(Assignee)

Comment 24

4 years ago
OK, I can reproduce it more reliably. The key is to start b2g right after previous shutdown. That's why I didn't capture it in gdb.

Comment 25

4 years ago
No worries Cervantes I know you didn't mean it!

Although during the test we wait for all processes containing 'b2g' to be gone so we thought it should be sufficiently torn down. Perhaps we need to wait further.
(Assignee)

Comment 26

4 years ago
Fix the cross-process race condition from the other direction (i.e. parent actively closes the actor) seems to work from by tests so far, but I still don't know why it makes the difference.
(Assignee)

Comment 27

4 years ago
Tracing the root cause why this patch will break the seemingly irrelevant indexedDB file lock brings me to incorrect file usage:

(gdb) fr 2
#2  0xb6a512f8 in unixLock (eFileLock=<optimized out>, id=0xaf209914) at /mnt/SSD/data/hg/mozilla-central/db/sqlite3/src/sqlite3.c:25865
25865         rc = sqliteErrorFromPosixError(tErrno, SQLITE_IOERR_LOCK);
(gdb) p *(unixFile*)id
$8 = {
... (omitted)
  h = 28, 
... (omitted)
  zPath = 0xaf209a28 "/data/local/storage/persistent/chrome/idb/2588645841ssegtnti.sqlite", 
... (omitted)
}

and file descriptor 28 is actually a socket:
lrwx------ root     root              2014-09-24 19:52 28 -> socket:[2715481]

I am still tracing the file descriptor is taken by unix domain socket.
(Assignee)

Comment 28

4 years ago
It appears that ProcLoaderParent double closes the channel file descriptor, one in closing the IPC channel and the other in FileDescriptor dtor. Because indexedDB operations happen on another thread, we may close the wrong file descriptor and make indexedDB operations fail.
(Assignee)

Comment 29

4 years ago
Created attachment 8498100 [details] [diff] [review]
Fix the race condition that cuases the leak

This patch also fixes the double-close of FD used in PProaLoader: once in shutting down the IPC channel and the other in mozilla::ipc::FileDescriptor dtor. The double-close problem is inherent in the original B2G Loader code, but revealed by the previous version of patch.

Try push log: https://tbpl.mozilla.org/?tree=Try&rev=bad4b48fd16a

I'll reland after the tests are green.
Attachment #8470693 - Attachment is obsolete: true
Attachment #8498100 - Flags: review+
Flags: needinfo?(echou)
wrong ni
Flags: needinfo?(echou)
https://hg.mozilla.org/mozilla-central/rev/a39933b3becc
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
(Assignee)

Comment 33

4 years ago
Comment on attachment 8498100 [details] [diff] [review]
Fix the race condition that cuases the leak

Approval Request Comment
[Feature/regressing bug #]: 977026
[User impact if declined]: minor memory leak or probably a crash.
[Describe test coverage new/current, TBPL]: manual test with debug instrumentation. Regression on try.
[Risks and why]: Low. This patch fixes a cross-process race condition that both sides could possibly close the channel. We make fix the race by always closing the channel from one side. This should be safe because we remove the race and nondeterminism.
Attachment #8498100 - Flags: approval-mozilla-aurora?

Updated

4 years ago
Attachment #8498100 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
NI Inder to see if the want to pull this in for the next stability test run.
Flags: needinfo?(ikumar)
https://hg.mozilla.org/releases/mozilla-aurora/rev/81044b79b857
status-b2g-v2.1: affected → fixed
status-b2g-v2.2: affected → fixed
status-firefox33: --- → wontfix
status-firefox34: --- → fixed
status-firefox35: --- → fixed
Duplicate of this bug: 1056924

Updated

3 years ago
Flags: needinfo?(ikumar)
You need to log in before you can comment on or make changes to this bug.