Closed Bug 1349699 Opened 7 years ago Closed 7 years ago

Add some IPC lifetime assertions

Categories

(Core :: IPC, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: billm, Assigned: billm)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files)

For a long time I've wanted to add some assertions to ensure that IPC channels are closed properly. These patches do that.
Already talked this one over with dvander and got an r+.
Attachment #8850163 - Flags: review+
This patch tries to make sure that we have closed the Chromium IPC channel before we destroy the mLink object. If we destroy mLink while the channel is still open, then any incoming data will cause us to call virtual methods on mLink, and that will crash.
Attachment #8850164 - Flags: review?(dvander)
Attachment #8850164 - Flags: review?(dvander) → review+
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/16e000b3fe0c
Assert when destroying a MessageLoop that a live MessageChannel is attached to (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/3b9eadd61e26
Assert that the Chromium channel is closed when MessageLoop is destroyed (r=dvander)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/13f1d3918fe5
Assert when destroying a MessageLoop that a live MessageChannel is attached to (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/5cc766c0864e
Assert that the Chromium channel is closed when MessageLoop is destroyed (r=dvander)
Thanks for the backout. I'll try disabling on Android for now. It looks like it's broken. I filed bug 1353262 about this.
Can you take a look at the crashes in comment 8, David? The ProtocolName is showing up as PGPUChild. It looks like we're asserting because the compositor thread is shutting down before the PGPUChild channel has been closed. I'm not sure why though.
Flags: needinfo?(dvander)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d28d4ab5091b
Assert that the Chromium channel is closed when MessageLoop is destroyed (r=dvander)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/48ad3b64dea6
Revert "Bug 1349699 - Assert that the Chromium channel is closed when MessageLoop is destroyed (r=dvander)"
David suggested calling Close() unconditionally here:
http://searchfox.org/mozilla-central/source/gfx/ipc/GPUProcessHost.cpp#155-164
Flags: needinfo?(dvander)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5f26e6ae5421
Assert when destroying a MessageLoop that a live MessageChannel is attached to (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b5b065f9a16
Assert that the Chromium channel is closed when MessageLoop is destroyed (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/7c11fae5ee51
Close GPU channel unconditionally (r=dvander)
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb79e576b617
Revert "Bug 1349699 - Close GPU channel unconditionally (r=dvander)"
https://hg.mozilla.org/integration/mozilla-inbound/rev/356d65d040d4
Revert "Bug 1349699 - Assert that the Chromium channel is closed when MessageLoop is destroyed (r=dvander)"
https://hg.mozilla.org/integration/mozilla-inbound/rev/b5d165f89e76
Revert "Bug 1349699 - Assert when destroying a MessageLoop that a live MessageChannel is attached to (r=dvander)"
The "Close GPU channel unconditionally" patch seems to regress Android:
https://treeherder.mozilla.org/logviewer.html#?job_id=89705278&repo=mozilla-inbound

Randall, do your patches fix that problem too?
Flags: needinfo?(rbarker)
I rebased my patch set and applied the patches in comment 13. I removed the android guards and ran the image/test/mochitest/test_discardAnimatedImage.html several times in the 4.3 emulator with out seeing the issues in comment 15. I also ran all the image/test/mochitest test in the emulator and all passed.
Flags: needinfo?(rbarker)
It looks like bug 1354499 fixed the real problem behind the test_discardAnimatedImage.html failure. I'll try to land again.
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b52ae1c94254
Close GPU channel unconditionally (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/0e4de1c13808
Assert when destroying a MessageLoop that a live MessageChannel is attached to (r=dvander)
https://hg.mozilla.org/integration/mozilla-inbound/rev/8123d3e11f7d
Assert that the Chromium channel is closed when MessageLoop is destroyed (r=dvander)
Do you think we can uplift these patches to beta?
Flags: needinfo?(wmccloskey)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #20)
> Do you think we can uplift these patches to beta?

Seems a bit risky. We'll at least need to fix the crashes it causes (bug 1356365). Then we'd have to backport those fixes to Aurora.
Flags: needinfo?(wmccloskey)
Depends on: 1356516
Depends on: 1363601
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Depends on: 1372850
A friend of mine bisected a regression on his website to

https://hg.mozilla.org/integration/mozilla-inbound/rev/48ad3b64dea6

from this bug. Which seems strange, since that commit just removes some code. On the other hand it's in ipc code, and the bug is somehow related to his antivirus, which perhaps is relevant to ipc?

Maybe someone will have an idea. Some more details:

1. This is on https://www.kitely.com/
2. When the bug happens, loading the page on Firefox stalls and takes tens of seconds before the page is rendered and loaded, while on other browsers its very snappy.
3. This happens on my friend's machine but not mine. It appears to show up in his google analytics reports, where Firefox load time is 3.6x slower than other browsers, so it's not just a problem on his machine, and he's worried about losing users that are on Firefox.
4. Disabling his antivirus (Avast) gets rid of the bug. (He doesn't have any Firefox plugin or addon from the antivirus, or other suspicious addon or plugin.)
5. The bug is a regression, it used to be fine, and the regression seems to be on the firefox side as older firefox is still ok while newer is broken (so it's not a regression in the antivirus or something else).
6. Nothing odd shows up in the network tab in devtools. Using the profiler, the stall time is reported as spent in XRE_InitChildProcess.
Can you file a separate bug for that issue, Alon? It seems like it should go under "External software affecting Firefox" in Bugzilla. Maybe someone who watches that component will know how to reach out to the AV vendor.
I'll do that, thanks. I just first wanted to see if there was something obvious I was missing here, since bisection pointed to this bug.
Depends on: 1373365
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Depends on: 1423261
No longer depends on: 1423261, 1363601
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: