Closed
Bug 1363378
Opened 8 years ago
Closed 7 years ago
[e10s] Tab processes are not killed after crashing Firefox
Categories
(Core :: Security: Process Sandboxing, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox53 | --- | unaffected |
firefox54 | --- | wontfix |
firefox55 | + | wontfix |
firefox56 | + | wontfix |
firefox57 | + | verified |
firefox58 | --- | verified |
People
(Reporter: emilghitta, Assigned: jld)
References
Details
(Keywords: regression, Whiteboard: sb+)
Attachments
(2 files)
[Affected versions]:
Firefox 55.0a1 (Build Id:20170508100218)
Firefox 54.0b6 (Build Id:20170508210007)
[Unaffected versions]
Firefox 53.0.2 (Build Id:20170504175505)
Firefox 52.1.1 esr (Build Id:20170504112025)
[Affected platforms]:
Ubuntu 16.04 LTS 64bit
[Steps to reproduce]:
1.Launch Firefox with a clean profile.
2.Enable e10s (if it is not already enabled).
3.Got to https://www.youtube.com webpage.
4.Open 2 random videos.
5.Install Crash Me Now add-on.
6.Crash the browser.
[Expected result]:
After crashing Firefox, the audio stops playback and the opened processes are killed.
[Actual result]:
After crashing Firefox, the audio continues to playback and the opened processes are not killed. The audio can be heard even if starting Firefox again after submitting the crash report.
[Regression range]:
I think that this issue requires finding the regression manually.
I will get back to this as soon as time permits.
[Additional notes]
Please note that this issue occurs only when e10s is enabled.
Please note that the Web Content process is not killed after Firefox crashed.
For more information regarding this issue please observe the following video: https://drive.google.com/open?id=0B94iuIVMr-TER21BbTlncURPRGM
Reporter | ||
Comment 1•7 years ago
|
||
This issue is reproducible with 56.0a1 (Build Id:20170712100301) using Ubuntu 16.04 64 bit by following the up mentioned steps.
Severity: normal → major
status-firefox56:
--- → affected
Comment 2•7 years ago
|
||
Mike, can you take look at these crashes or help find someone to investigate?
Flags: needinfo?(mconley)
Comment 3•7 years ago
|
||
Sounds like media stuff to me... perhaps jwwang knows what's going on?
Flags: needinfo?(mconley) → needinfo?(jwwang)
Comment 4•7 years ago
|
||
If it's easy to reproduce maybe a quick regression test would help?
Keywords: regression
Comment 5•7 years ago
|
||
As noted in comment 0, the Web Content process is not killed after Firefox crashed. The audio decoder still continues to decode and output sound. If audio is expected to stop immediately, the system should kill the content process also.
Flags: needinfo?(jwwang)
Comment 6•7 years ago
|
||
I did a manual regression and got to the following result:
Last-good buildid: 2016-10-06 (20161006030208)
First-bad buildid: 2016-10-07 (20161007030207)
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b9944879c9a60a9aba4a744a7401bc38e0f39c4
Possible culprit:
43bc30457ad4 JW Wang — Bug 1308066 - remove mLowDataThresholdUsecs. r=kikuo
47f0358bfff2 JW Wang — Bug 1308065 - move mBufferingWait into BufferingState. r=kikuo
Keywords: regressionwindow-wanted
Comment 7•7 years ago
|
||
Requesting needinfo from JW Wang since he is the one that fixed the possible identified culprit.
Flags: needinfo?(jwwang)
Comment 8•7 years ago
|
||
(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> 44879c9a60a9aba4a744a7401bc38e0f39c4
That list is pretty long. Please use mozregression to further nail down the range. Usually it gives you the exact changeset at the end, by also caring about different branches (eg. merges from autoland or inbound)
Flags: needinfo?(bogdan.maris)
Comment 9•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #8)
> (In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> > https://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> > 44879c9a60a9aba4a744a7401bc38e0f39c4
>
> That list is pretty long. Please use mozregression to further nail down the
> range. Usually it gives you the exact changeset at the end, by also caring
> about different branches (eg. merges from autoland or inbound)
Due to the nature of the bug (it involves crash) I am unable to use mozregression to determine a more accurate range (this is why Emil commented in #0 that it would require manual regression). Also I tried to use tinderbox builds but I never reproduced on them, not even in more recent builds where the issue should have reproduced with no problems.
Flags: needinfo?(bogdan.maris)
Comment 10•7 years ago
|
||
JW, from comment 5, are you intending to work on this? Or do you think it should go back to the DOM/content folks, or to someone else on the Audio/Video team? Are you able to reproduce it in Nightly?
tracking-firefox55:
--- → +
Comment 11•7 years ago
|
||
Yes I can repro this issue in Nightly. However, I don't think this is a media bug as stated in comment 5.
Flags: needinfo?(jwwang)
Comment 12•7 years ago
|
||
(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> I did a manual regression and got to the following result:
>
> Last-good buildid: 2016-10-06 (20161006030208)
> First-bad buildid: 2016-10-07 (20161007030207)
>
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> 44879c9a60a9aba4a744a7401bc38e0f39c4
>
> Possible culprit:
> 43bc30457ad4 JW Wang — Bug 1308066 - remove mLowDataThresholdUsecs. r=kikuo
> 47f0358bfff2 JW Wang — Bug 1308065 - move mBufferingWait into
> BufferingState. r=kikuo
There is no way for these bugs to be the culprit.
Bug 1308066 changes the type of some members. No call flow is changed.
Bug 1308065 moves a member from class A to class B. No call flow is changed.
Comment 13•7 years ago
|
||
It's been a while since the report. Can you please re-check if this is still a problem?
Flags: needinfo?(emil.ghitta)
Reporter | ||
Comment 14•7 years ago
|
||
Hi Henrik!
Indeed, this issue is still reproducible.
I managed to reproduce it on Firefox 57.0a1(Build Id:20170809100326)and it seems that this issue "rides the trains" since it is reproducible with Firefox 56.0b1(BuildId:20170808170225) now.
Thanks!
status-firefox57:
--- → affected
Flags: needinfo?(emil.ghitta)
Comment 15•7 years ago
|
||
And this happens for Windows only? Have you tried those steps on other platforms yet?
Reporter | ||
Comment 16•7 years ago
|
||
I managed to reproduce this issue only on Linux (as stated in comment 0) having e10 enabled.
This issue is not reproducible on Windows and macOs.
Comment 17•7 years ago
|
||
Emil, it might be good to get a process sample (dump) from one or multiple of those still runnning webcontent processes.
Flags: needinfo?(emil.ghitta)
Reporter | ||
Comment 18•7 years ago
|
||
Hi Henrik!
I generated 2 process dumps (from 2 running webcontent processes):
https://drive.google.com/open?id=0BzOkHBmdnAoaVkxTOGNDNmRkeUE
Hope that this helps.
Flags: needinfo?(emil.ghitta)
Comment 20•7 years ago
|
||
(In reply to Emil Ghitta, QA [:emilghitta] from comment #18)
> I generated 2 process dumps (from 2 running webcontent processes):
> https://drive.google.com/open?id=0BzOkHBmdnAoaVkxTOGNDNmRkeUE
Each of those dumps are 1.4GB in size! Not sure if that is really helpful to get it analyzed. Karl, what would be a good way to get a process sample of content processes, which continue to run after Firefox has been closed on Linux?
Flags: needinfo?(karlt)
Comment 21•7 years ago
|
||
I guess kill -ABRT <content-process-id> fails to produce a crash report because there is no browser process to manage the submission?
If you can find a build with symbols, then gdb can produce a backtrace.
Perhaps ASAN builds have symbols? I don't know.
But a backtrace may not be helpful. As the process is still running apparently, it has moved on from when it should have responded.
To diagnose this, I'd find the code that should detect the pipe close, and enable logging to find out what happened after that.
Finding a regression range would be the simplest way to track down a likely cause.
I'd start with the builds under https://tools.taskcluster.net/index/gecko.v2.mozilla-central.pushdate.2016.10.06
Flags: needinfo?(karlt)
Comment 22•7 years ago
|
||
(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> I did a manual regression and got to the following result:
>
> Last-good buildid: 2016-10-06 (20161006030208)
> First-bad buildid: 2016-10-07 (20161007030207)
Ok, so I setup up my old Linux machine again, and installed both builds. I can verify this regression range at least now.
What I observed is that when I close the next instance of Firefox as started by the crash manager all content processes from the first instance are also gone. It also helps just to wait about 30s. Then at least any played audio is also gone but the content processes remain running.
Before crashing the browser:
$ ps -ef | grep 6531
henrik 6531 1 35 11:20 ? 00:00:09 /home/henrik/code/gecko/firefox/firefox
henrik 6648 6531 66 11:20 ? 00:00:12 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab
After crashing with crash reporter open:
henrik 6648 1 39 11:20 ? 00:00:28 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab
henrik 6737 1 1 11:21 ? 00:00:00 /home/henrik/code/gecko/firefox/crashreporter /home/henrik/.mozilla/firefox/mwad0hks.default/minidumps/2bf8a428-f4b0-1fe5-733decef-578f4de8.dmp
After the restart via crash manager:
henrik 6648 1 27 11:20 ? 00:00:31 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab
henrik 6744 1 25 11:22 ? 00:00:02 /home/henrik/code/gecko/firefox/firefox
> I guess kill -ABRT <content-process-id> fails to produce a crash report because there is no browser process to manage the submission?
Indeed, I tried that but without any success.
I narrowed down the range to the following merge changeset from autoland:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8&tochange=d8e1f5cf0a70a53e8a5532809096a0a5bf729196
I will continue with builds from autoland to get the exact causing bug.
Comment 23•7 years ago
|
||
Here the final regression range:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=3e9a2031152f&tochange=c838d2546cad
So it's related to the sandbox changes as made on bug 1289718. Gian-Carlo can you please have a look?
Updated•7 years ago
|
Flags: needinfo?(gpascutto)
Whiteboard: sb?
Updated•7 years ago
|
Assignee: nobody → gpascutto
Component: DOM: Content Processes → Security: Process Sandboxing
Priority: -- → P1
Whiteboard: sb? → sb+
Assignee | ||
Comment 25•7 years ago
|
||
Can you reproduce this with MOZ_SANDBOX_LOGGING=1 set in the environement? (For old builds before bug 1317735, which landed in 54.0a1, use MOZ_SANDBOX_VERBOSE=1 instead.)
My guess is that something is reacting badly to the error code we return when the broker hangs up the connection. Specifically, I think we just pass along EPIPE from the socket I/O, but EIO would be more appropriate, and getting the caller to actually give up might need something else entirely (or maybe just unilaterally killing the process).
The logs should at least let us see what file is affected, which could help pin down what's getting stuck.
Flags: needinfo?(hskupin)
Comment 26•7 years ago
|
||
Jed, as attached you can find the output from sandbox logging. Around lines 2508 the crash occurred and I let the crash reporter restart Firefox. At the end I closed Firefox normally. I hope there is some useful data in here for you to investigate.
Flags: needinfo?(hskupin)
Comment 27•7 years ago
|
||
I can also prevent this behavior when changing the sandbox level back to 1.
Comment 28•7 years ago
|
||
So it looks as if it's the BrokerClient (in a crash reporting child?) that fails to connect to the Broker (in the parent), gets an EIO and ends up confused?
Assignee | ||
Comment 29•7 years ago
|
||
So this is weird:
Sandbox: rewriting /proc/self/exe -> /proc/4525/exe
Sandbox: Recording mapping /home/henrik/code/gecko/firefox/firefox -> /proc/4525/exe
Sandbox: SandboxBroker: denied op=open rflags=0 perms=0 path= for pid=4525
Sandbox: Failed errno -13 op 0 flags 00 path
Sandbox: SandboxBroker: denied op=open rflags=0 perms=0 path= for pid=4525
Sandbox: Failed errno -13 op 0 flags 00 path
It's trying to pass open() an empty string. Normally that's ENOENT, but we're returning EACCES. If something is treating ENOENT specially that might be it. We could try to find out what's making that call by crashing in the broker client if the path is empty.
Also, there aren't too many uses of /proc/self/exe in our code — the profiler is one, and a copy of libbacktrace used as part of the Rust crate backtrace-sys (media has been using Rust for a while) is another. Or it could be a system library.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)
> So it looks as if it's the BrokerClient (in a crash reporting child?) that
> fails to connect to the Broker (in the parent), gets an EIO and ends up
> confused?
We're not returning EIO, I don't think; we just return the error from sendmsg, which is probably EPIPE if the parent has exited. I was suggesting that EIO is probably what a "real" filesystem would do if it couldn't access its underlying storage, and EPIPE is really not an expected response to open()ing a regular file; it probably doesn't make a difference, but maybe it does.
Comment 30•7 years ago
|
||
Mark 55- and 56- as we are only 1 week away from RC.
Assignee | ||
Comment 31•7 years ago
|
||
I can reproduce this, and I'm reasonably sure I know what's happening: the file broker socket isn't close-on-exec, so the crashreporter process inherits a copy of it, and the child process blocks in the broker client waiting for a response it will never get. In particular, it blocks in the Gecko IPC I/O thread (trying to open /tmp/GeckoChildCrash%d.extra), and that's probably the thread that would normally notice that the parent process went away.
Assignee: gpascutto → jld
Assignee | ||
Comment 32•7 years ago
|
||
Reproducing this on a debug build requires some environment variables: MOZ_CRASHREPORTER=t (debug builds disable the normal crash reporter by default), as well as the misleadingly named XRE_NO_WINDOWS_CRASH_DIALOG=t — without the latter, even with this bug fixed, when the I/O thread returns from the broker call and crashes due to the IPC error caused by the parent process's nonexistence, it will do the usual debug-build “sleeping for 300 seconds” “Type … to attach your debugger” thing (probably because Breakpad tries to request a minidump from the nonexistent parent process and fails, in which case it delegates to the previous signal handler instead of immediately exiting), and the phantom audio will continue for a while.
Comment hidden (mozreview-request) |
Comment 34•7 years ago
|
||
mozreview-review |
Comment on attachment 8907882 [details]
Bug 1363378 - Set close-on-exec in sandbox-related sockets held by parent process.
https://reviewboard.mozilla.org/r/179568/#review184848
Attachment #8907882 -
Flags: review?(gpascutto) → review+
Comment 35•7 years ago
|
||
Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b549b2ed2efc
Set close-on-exec in sandbox-related sockets held by parent process. r=gcp
Comment 36•7 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Updated•7 years ago
|
Reporter | ||
Comment 38•7 years ago
|
||
I have verified that this issue is no longer reproducible using the "Crash Me Now (Simple)"add-on, on Firefox Nightly 58.0a1 (BuildId:20171002220204) with Ubuntu 16.04 64bit.
This issues is also verified fixed on Firefox 57.0b4 (BuildId:20170928180207) with Ubuntu 16.04 64bit using the following snippet:
Cu.import("resource://gre/modules/ctypes.jsm");
let zero = new ctypes.intptr_t(8);
let badptr = ctypes.cast(zero, ctypes.PointerType(ctypes.int32_t));
badptr.contents;
You need to log in
before you can comment on or make changes to this bug.
Description
•