1048011 - Fix a race condition in ProcLoaderParent

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Reporter

Description

•

10 years ago

My latest push to try with the patches from bug 1038943 shows that we're leaking PProcLoaderParent.

https://tbpl.mozilla.org/php/getParsedLog.php?id=45135816&tree=Try&full=1#error0

That seems to imply that _ProcLoaderParentDestroy never gets called.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 1

•

10 years ago

I'll take a look.

Assignee: nobody → cyu

Nicholas Nethercote [inactive]

Updated

•

10 years ago

Whiteboard: [MemShrink] → [MemShrink:P2]

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 2

•

10 years ago

Attached patch Fix the race condition that cuases the leak (obsolete) — Details — Splinter Review

Try push: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=8956b8a3e248

Attachment #8470693 - Flags: review?(khuey)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 3

•

10 years ago

Some debug instrumentation that explains the problem in more detail:

[time:1407746798360957][23139<-23128][PProcLoaderChild] Received Msg_Load([TODO])
[time:1407746798361353][23139->23128][PProcLoaderChild] Sending Msg_LoadComplete([TODO])
[23139]MessageChannel::Close()
NOTE: parent process received `Goodbye', closing down
[time:1407746798374903][23139->23128][PContentChild] Sending Msg_PCrashReporterConstructor([TODO])

When we send out LoadComplete() from child and Close() the channel right away, the LoadComplete() message will not be processed by the parent and the leak results. Actually it will be worse if the parent has the chance to receive LoadComplete() because the channel will be double closed and the parent process will abort in MessageChannel::Close().

Ting-Yu Chou [:ting] (away)

Updated

•

10 years ago

Comment 4

•

10 years ago

Once the child process close the channel, the parent process should detect a channel error and delete actors.  It is weird that the parent seems don't detect the channel error.  I think there are something as the root cause of this bug.

For double closed issue, it is a weird design of APIs.  For my understanding, most APIs still require the user to close/release the resource even there is an error.  Our design is uncommon.  I suggest to file another bug to fix it.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Reporter

Comment 5

•

10 years ago

Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak

This doesn't appear to fix the leak of PProcLoaderParent on my emulator.  I put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.

Attachment #8470693 - Flags: review?(khuey)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 6

•

10 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> Comment on attachment 8470693 [details] [diff] [review]
> Fix the race condition that cuases the leak
> 
> This doesn't appear to fix the leak of PProcLoaderParent on my emulator.  I
> put a MOZ_ASSERT(false) in _ProcLoaderParentDestroy too and it does not fire.

Hmm, that's strange. I put printf()'s in ProcLoaderParent's ctor and dtor and saw them called. This is on a real device. Maybe emulator that runs much slower could make a difference. I will check this on emulator later.

130|root@android:/ # b2g.sh                                                    
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
[11509] WARNING: Tried to RegisterCallback without an AtExitManager: file /mnt/SSD/data/hg/mozilla-central/ipc/chromium/src/base/at_exit.cc, line 40
========== ProcLoaderParent ctor 0x43d86840
*****************************************************************
***
*** This is stdout. Most of the useful output will be in logcat.
***
*****************************************************************
1408452134133   Marionette      INFO    marionette enabled via build flag and pref
1408452134741   Marionette      INFO    marionette-server.js loaded
1408452135082   Marionette      INFO    B2G emulator: no
1408452135085   Marionette      INFO    Device detected is unagi
1408452135089   Marionette      INFO    Listening on port 2828

1408452135091   Marionette      INFO    Marionette server ready
System JS : WARNING resource://gre/modules/Preferences.jsm:378 - mutating the [[Prototype]] of an object will cause your code to run very slowly; instead create the object with the corre$t initial [[Prototype]] value using Object.create
========== ProcLoaderParent dtor 0x43d86840

Cervantes Yu [:cyu] [:cervantes]

Assignee

Updated

•

10 years ago

Comment 7

•

10 years ago

If you can't reproduce this on the emulator let me know.  I see this all the time so I could debug it if needed.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 8

•

10 years ago

I still see the dtor called on emulator, but the system app fails to load and mochitest fails run run. I am debugging why it fails to run.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 9

•

10 years ago

Kyle are you using debug build? If yes then I think you run into bug 1050122.

On debug build, "dom.ipc.processPrelaunch.enabled" is false, and the Nuwa process will not be launched [1]. That is, no one will call base::LaunchAppProcLoader [2] and move ProcLoaderParent/Child forward. ProcLoaderParent is constructed in ProcLoaderClientGeckoInit() [3], which is called unconditionally from XRE_main() [4]. That means we always leak ProcLoaderParent and the forked process on debug build.

Since the issue is being handled in bug 1050122, I will focus this patch on the race condition between ProcLoaderParent and ProcLoaderChild. Could you review the patch for the race condition and leave the above issue in bug 1050122? Thanks.

[1] http://dxr.mozilla.org/mozilla-central/source/dom/ipc/PreallocatedProcessManager.cpp#180
[2] http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/process_util_linux.cc#205
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/glue/ProcessUtils_linux.cpp#190
[4] http://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#3817

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Reporter

Comment 10

•

10 years ago

That's a good point.  I'll review this tomorrow.

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Reporter

Comment 11

•

10 years ago

Comment on attachment 8470693 [details] [diff] [review]
Fix the race condition that cuases the leak

Review of attachment 8470693 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

Attachment #8470693 - Flags: review+

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Reporter

Updated

•

10 years ago

No longer blocks: 1038943

Keywords: mlk

Summary: PProcLoaderParent is never freed → Fix a race condition in ProcLoaderParent

Whiteboard: [MemShrink:P2]

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 12

•

10 years ago

Try push looks good: https://tbpl.mozilla.org/?tree=Try&rev=ee136047f739

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 13

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/0a76964b8d8d

Ryan VanderMeulen [:RyanVM]

Comment 14

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/0a76964b8d8d

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla35

Zac C (:zac)

Updated

•

10 years ago

Blocks: 1062255

Zac C (:zac)

Comment 15

•

10 years ago

We strongly suspect this for causing bug 1062255 crash and other Settings issues on Flame device. See bug 1062255 for regression range finding, crash dumps, etc

RyanVM, is it possible to backout?

Flags: needinfo?(ryanvm)

Zac C (:zac)

Comment 16

•

10 years ago

Had the wrong sheriff :)

Flags: needinfo?(ryanvm) → needinfo?(cbook)

Carsten Book [:Tomcat]

Comment 17

•

10 years ago

(In reply to Zac C (:zac) from comment #15)
> We strongly suspect this for causing bug 1062255 crash and other Settings
> issues on Flame device. See bug 1062255 for regression range finding, crash
> dumps, etc
> 
> RyanVM, is it possible to backout?

got backed out in https://hg.mozilla.org/mozilla-central/rev/31ce2eb10676

Status: RESOLVED → REOPENED

Flags: needinfo?(cbook)

Resolution: FIXED → ---

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Reporter

Comment 18

•

10 years ago

We still need to land this, so you need to work with Cervantes to find a way for him to reproduce this.  This will fix things like bug 1056924.

blocking-b2g: --- → 2.1?

Flags: needinfo?(zcampbell)

Andrew Overholt [:overholt]

Comment 19

•

10 years ago

See comment 18.

blocking-b2g: 2.1? → 2.1+

Zac C (:zac)

Comment 20

•

10 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #18)
> We still need to land this, so you need to work with Cervantes to find a way
> for him to reproduce this.  This will fix things like bug 1056924.

qDot has left such info in bug 1062255 (to replicate manually) but if he'd prefer to try by running the tests too then just ping me.

Flags: needinfo?(zcampbell)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Reporter

Comment 21

•

10 years ago

Can you link to a specific comment because I don't see any STR in there.

Ryan VanderMeulen [:RyanVM]

Updated

•

10 years ago

status-b2g-v2.1: --- → affected

status-b2g-v2.2: --- → affected

Target Milestone: mozilla35 → ---

Zac C (:zac)

Comment 22

•

10 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #21)
> Can you link to a specific comment because I don't see any STR in there.

Pardon, I recalled qDot's comment was more clear.

To run the actual automation:
1. Follow these instructions to set up the environment:
https://developer.mozilla.org/en-US/Firefox_OS/Platform/Automated_testing/gaia-ui-tests/Gaia_UI_Tests_Run_Tests#Testing_on_Firefox_OS_Devices

2. When it comes to run the test, use this command line:
gaiatest --testvars gaiatest/testvars_flame_2imei.json --html-output=results/index.html  --address localhost:2828 --restart gaiatest/tests/unit/test_cleanup_gaia.py --repeat 49

However I am sure there are easier ways to cause it to fail. A shell script that does stop b2g, start b2g, wait 60 seconds, [some kind of life check like check marionette port open], in a loop might be sufficient.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 23

•

10 years ago

(In reply to Zac C (:zac) from comment #22)
> However I am sure there are easier ways to cause it to fail. A shell script
> that does stop b2g, start b2g, wait 60 seconds, [some kind of life check
> like check marionette port open], in a loop might be sufficient.

Sorry for causing such trouble. I didn't expect this patch to be so catastrophic.

I can reproduce this by restarting b2g on the device, but it's not easy to see the failure on my device. Does gaia UI test make it easier to fail?

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 24

•

10 years ago

OK, I can reproduce it more reliably. The key is to start b2g right after previous shutdown. That's why I didn't capture it in gdb.

Zac C (:zac)

Comment 25

•

10 years ago

No worries Cervantes I know you didn't mean it!

Although during the test we wait for all processes containing 'b2g' to be gone so we thought it should be sufficiently torn down. Perhaps we need to wait further.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 26

•

10 years ago

Fix the cross-process race condition from the other direction (i.e. parent actively closes the actor) seems to work from by tests so far, but I still don't know why it makes the difference.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 27

•

10 years ago

Tracing the root cause why this patch will break the seemingly irrelevant indexedDB file lock brings me to incorrect file usage:

(gdb) fr 2
#2  0xb6a512f8 in unixLock (eFileLock=<optimized out>, id=0xaf209914) at /mnt/SSD/data/hg/mozilla-central/db/sqlite3/src/sqlite3.c:25865
25865         rc = sqliteErrorFromPosixError(tErrno, SQLITE_IOERR_LOCK);
(gdb) p *(unixFile*)id
$8 = {
... (omitted)
  h = 28, 
... (omitted)
  zPath = 0xaf209a28 "/data/local/storage/persistent/chrome/idb/2588645841ssegtnti.sqlite", 
... (omitted)
}

and file descriptor 28 is actually a socket:
lrwx------ root     root              2014-09-24 19:52 28 -> socket:[2715481]

I am still tracing the file descriptor is taken by unix domain socket.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 28

•

10 years ago

It appears that ProcLoaderParent double closes the channel file descriptor, one in closing the IPC channel and the other in FileDescriptor dtor. Because indexedDB operations happen on another thread, we may close the wrong file descriptor and make indexedDB operations fail.

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 29

•

10 years ago

Attached patch Fix the race condition that cuases the leak — Details — Splinter Review

This patch also fixes the double-close of FD used in PProaLoader: once in shutting down the IPC channel and the other in mozilla::ipc::FileDescriptor dtor. The double-close problem is inherent in the original B2G Loader code, but revealed by the previous version of patch.

Try push log: https://tbpl.mozilla.org/?tree=Try&rev=bad4b48fd16a

I'll reland after the tests are green.

Attachment #8470693 - Attachment is obsolete: true

Attachment #8498100 - Flags: review+

Wesley Huang [:wesley_huang] (EPM) (NI me)

Updated

•

10 years ago

Flags: needinfo?(echou)

Eric Chou [:ericchou] [:echou]

Comment 30

•

10 years ago

wrong ni

Flags: needinfo?(echou)

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 31

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/a39933b3becc

Ryan VanderMeulen [:RyanVM]

Comment 32

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/a39933b3becc

Status: REOPENED → RESOLVED

Closed: 10 years ago → 10 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla35

Cervantes Yu [:cyu] [:cervantes]

Assignee

Comment 33

•

10 years ago

Comment on attachment 8498100 [details] [diff] [review]
Fix the race condition that cuases the leak

Approval Request Comment
[Feature/regressing bug #]: 977026
[User impact if declined]: minor memory leak or probably a crash.
[Describe test coverage new/current, TBPL]: manual test with debug instrumentation. Regression on try.
[Risks and why]: Low. This patch fixes a cross-process race condition that both sides could possibly close the channel. We make fix the race by always closing the channel from one side. This should be safe because we remove the race and nondeterminism.

Attachment #8498100 - Flags: approval-mozilla-aurora?

bhavana bajaj [:bajaj]

Updated

•

10 years ago

Attachment #8498100 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

bhavana bajaj [:bajaj]

Comment 34

•

10 years ago

NI Inder to see if the want to pull this in for the next stability test run.

Flags: needinfo?(ikumar)

Ryan VanderMeulen [:RyanVM]

Comment 35

•

10 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/81044b79b857

status-b2g-v2.1: affected → fixed

status-b2g-v2.2: affected → fixed

status-firefox33: --- → wontfix

status-firefox34: --- → fixed

status-firefox35: --- → fixed

Inder

Updated

•

9 years ago

Flags: needinfo?(ikumar)

Fix the race condition that cuases the leak 10 years ago Cervantes Yu [:cyu] [:cervantes] 2.06 KB, patch	khuey : review+	Details \| Diff \| Splinter Review
Fix the race condition that cuases the leak 10 years ago Cervantes Yu [:cyu] [:cervantes] 4.18 KB, patch	cyu : review+ bajaj : approval-mozilla-aurora+	Details \| Diff \| Splinter Review