Closed Bug 1395504 Opened 7 years ago Closed 6 years ago

Infinite hang of web content process when parent process crashes [@ CrashStatsLogForwarder::Log() | mozilla::dom::PContentChild::SendGraphicsError()]

Categories

(Core :: Graphics, defect, P1)

57 Branch
defect

Tracking


RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox55 --- wontfix
firefox56 + wontfix
firefox57 - wontfix
firefox58 + wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- fixed

People

(Reporter: whimboo, Assigned: haik)

References

(Blocks 1 open bug)

Details

(Keywords: hang, regression, Whiteboard: gfx-noted)

Attachments

(5 files, 2 obsolete files)

[Tracking Requested - why for this release]:

One of our Marionette unit tests forces a crash of Firefox by executing the following code in chrome scope. Afterward the Firefox process never exits because a web content process keeps running; it doesn't even get killed by the background monitor thread.

Code to crash Firefox:


              Components.utils.import("resource://gre/modules/ctypes.jsm");

              // ctypes checks for NULL pointer derefs, so just go near-NULL.
              var zero = new ctypes.intptr_t(8);
              var badptr = ctypes.cast(zero, ctypes.PointerType(ctypes.int32_t));
              var crash = badptr.contents;

The test:

https://dxr.mozilla.org/mozilla-central/rev/db7f19e26e571ae1dd309f5d2f387b06ba670c30/testing/marionette/harness/marionette_harness/tests/unit/test_crash.py#164

Here is an excerpt of the output from the Activity Monitor sample:

Sampling process 866 for 3 seconds with 1 millisecond of run time between samples
Sampling completed, processing symbols...
2017-08-31 02:28:36.749 sample[881:63819] Metadata.framework [Error]: couldn't get the client port
Analysis of sampling plugin-container (pid 866) every 1 millisecond
Process:         plugin-container [866]
Path:            /Users/cltbld/build/Nightly.app/Contents/MacOS/plugin-container.app/Contents/MacOS/plugin-container

Call graph:
    2610 Thread_62643   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2610 start  (in plugin-container) + 52  [0x10657ee84]
    +   2610 main  (in plugin-container) + 89  [0x10657eee9]
    +     2610 XRE_InitChildProcess(int, char**, XREChildData const*)  (in XUL) + 1778  [0x109d18f92]
    +       2610 MessageLoop::Run()  (in XUL) + 73  [0x106d8e2c9]
    +         2610 XRE_RunAppShell()  (in XUL) + 270  [0x109d1924e]
    +           2610 nsAppShell::Run()  (in XUL) + 259  [0x10890fe73]
    +             2610 nsBaseAppShell::Run()  (in XUL) + 46  [0x1088b8c5e]
    +               2610 MessageLoop::Run()  (in XUL) + 73  [0x106d8e2c9]
    +                 2610 mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*)  (in XUL) + 291  [0x106dc50a3]
    +                   2610 NS_ProcessNextEvent(nsIThread*, bool)  (in XUL) + 39  [0x1069632e7]
    +                     2610 nsThread::ProcessNextEvent(bool, bool*)  (in XUL) + 1685  [0x106961a05]
    +                       2610 mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel*, void (mozilla::ipc::MessageChannel::*)(), false, (mozilla::RunnableKind)1>::Run()  (in XUL) + 44  [0x106dca7cc]
    +                         2610 mozilla::layers::PCompositorManagerChild::OnChannelError()  (in XUL) + 19  [0x107058673]
    +                           2610 mozilla::layers::PCompositorManagerChild::DestroySubtree(mozilla::ipc::IProtocol::ActorDestroyReason)  (in XUL) + 154  [0x10705848a]
    +                             2610 mozilla::layers::PCompositorBridgeChild::DestroySubtree(mozilla::ipc::IProtocol::ActorDestroyReason)  (in XUL) + 1663  [0x10704b39f]
    +                               2610 mozilla::layers::CompositorBridgeChild::ActorDestroy(mozilla::ipc::IProtocol::ActorDestroyReason)  (in XUL) + 79  [0x1074ea51f]
    +                                 2610 mozilla::gfx::Log<1, mozilla::gfx::CriticalLogger>::Flush()  (in XUL) + 168  [0x1072d0508]
    +                                   2610 mozilla::gfx::CriticalLogger::OutputMessage(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool)  (in XUL) + 48  [0x1072d5b40]
    +                                     2610 CrashStatsLogForwarder::Log(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)  (in XUL) + 164  [0x1075215b4]
    +                                       2610 mozilla::dom::PContentChild::SendGraphicsError(nsCString const&)  (in XUL) + 30  [0x10707881e]
    +                                         2610 mozilla::dom::PContent::Msg_GraphicsError(int)  (in XUL) + 24  [0x10705d138]


It looks like the process is trying to send the crash stats log, specifically a graphics error, but doesn't succeed. It would be interesting to know why this process cannot be killed and why Firefox hangs forever.
Blocks: 1376795
I can see this hang for several of our crash tests, and it only happens for the tests which actually crash the chrome (parent) process. It does not happen for the content crash tests. For a list of affected tests please have a look at the blocking list of bug 1376795.

I will try to narrow down the regression range a bit so we can figure out what caused this problem.

In the meantime, do you have an idea, Cervantes? Or is that more a question for the GFX people?
Component: IPC → Graphics
Flags: needinfo?(cyu)
Summary: Infinite hang of web content process when Firefox crashes [@ CrashStatsLogForwarder::Log() | mozilla::dom::PContentChild::SendGraphicsError()] → Infinite hang of web content process when parent process crashes [@ CrashStatsLogForwarder::Log() | mozilla::dom::PContentChild::SendGraphicsError()]
Looks like Milan is also back from PTO... so setting ni? for him too.
Flags: needinfo?(milan)
Looking more closely at the code, it appears that George added all of this a while ago in bug 1297828.
Flags: needinfo?(cyu) → needinfo?(gw)
Flags: needinfo?(milan)
Flags: needinfo?(gw)
Flags: needinfo?(dvander)
For reference, this crash happens quite often in our Marionette unit test test_crash.py. There are around 30 failures per week for each of bug 1377030, bug 1376773, and bug 1326124, so the impact on us is high.

Getting someone to actually look at this would be great.
This is also reproducible with the crash tests and Firefox 56.0a1. I'm currently checking whether I can narrow down the range. Interestingly, I only see the crashes for opt builds made in CI, not for Nightly builds.
Attached file minimized testcase
Testcase to be used to reproduce the problem. Pick the correct CI build and run the following command:

mach marionette test --binary %path% -vv --run-until-failure testcase.py

This will run the test in a loop of 30 iterations by default. Usually the hang should be visible within that range. If not, just run it again.
Blocks: 1396803
Attachment #8903086 - Attachment description: activity monitor sample → activity monitor sample (mozilla::layers::PCompositorManagerChild::OnChannelError())
When I use an older build, such as the last CI build from June 17th (316014448e427486cce211217e2aee2f18ee3d69), I get a different call stack for the hanging web content process.
Roughly, this infinite hang seems to have started between 2017-04-01 and 2017-05-01. Here is the changeset list, which is still huge:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7a3f514cf8490d271ee373a1d2999e4ea4dee2d7&tochange=2e7c10a9b86e30691f67855f6c8f98d984508d7c

I'm currently working on narrowing down the regression range. In the meantime I had a look at the changesets and noticed bug 1356365. Maybe that is a related issue which simply masked the crash? Is there a way to disable the GPU process?
(In reply to Henrik Skupin (:whimboo) from comment #8)
> Roughly this infinite hang seem to have started between 2017-04-01 and
> 2017-05-01. Here the changeset list which is still huge:
> 
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=7a3f514cf8490d271ee373a1d2999e4ea4dee2d7&tochange=2e7c10a9b86e30691f67855f6c8f98d984508d7c
> 
> I'm currently working on narrowing the regression range down. In the
> meantime I had a look at the changesets and noticed bug 1356365. Maybe this
> is a related issue, which simply masked the crash? Is there a way to disable
> the GPU process?

On OSX, there is no GPU process. I will try to set up the environment to run the test in comment 6.
Please note that I had to request a loaner from RelEng to reproduce it; I wasn't able to on my local machine. But maybe your setup is different.

I continued to further nail down the regression range. Here is the range I have so far:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ce69b6e1773e9e0d0a190ce899f34b1658e66ca4&tochange=a374c35469935a874fefe64d3e07003fc5bc8884

Given that it happens really rarely, it takes a while to prove that a build is not affected. Maybe there is something in the list which looks suspicious to you.
(In reply to Henrik Skupin (:whimboo) from comment #10)
> Please note that I had to request a loaner from RelEng to get it reproduced.
> I wasn't able to with my local machine. But maybe your setup is different.
> 
> I continued to further nail down the regression range. Here is the range I
> have so far:
> 
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ce69b6e1773e9e0d0a190ce899f34b1658e66ca4&tochange=a374c35469935a874fefe64d3e07003fc5bc8884
> 
> Given that it happens really rarely it takes a while to prove that a build
> is not affected. Maybe there is something in the list which looks suspicious
> to you.

From the push log, the only bug I saw that is related to gfx IPC is the following one. Could you try to reproduce with and without the following patch?

1574c02ea13f	vincentliu — Bug 1350330 - Make sure IPC channel still Open before sending IPC message. r=dvander
Sorry, the last mentioned pushlog was not correct. I was now able to reproduce with a build from April 15th, which I had originally used. So please give me some more time to run those tests.
I can track this for 56 and 57; it sounds complex enough that a fix for 56 may be unlikely. It sounds like 55 was affected as well, if you can reproduce with a Nightly build from April 15.
So it is actually a merge commit from autoland which started the regression on mozilla-central. Sadly it itself contains a merge from m-c to autoland, so it clearly needs further work.

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8dd662ed33876fc7d86882fd19bc9f56d96ffa41&tochange=3243c8fc3ce7831dda843b60d6bb2d7e4acf1fd4

I will continue.
I ended up with this final regression range, which I cannot bisect further because the builds of all three commits have the same source revision, 8fbfe8c77eaa.

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=3f13d894a371f3bcc84dfdbc732f00d428cb8f97&tochange=8fbfe8c77eaa41c79aaf48a07666986d300dd0cf

Surprisingly, a breakpad client patch landed at this time. Given that this problem only appears for forced crashes, I could really see a relationship to bug 1350917.

Ted, Mike, and Georg, could at least one of you have a look at this broken behavior?
Flags: needinfo?(ted)
Flags: needinfo?(mconley)
Flags: needinfo?(gfritzsche)
I don't *think* bug 1350917 would be likely to cause this issue, but it's certainly the most suspicious out of the changes in that regression range. It looks to me (per your stack in comment 0) like this is simply the graphics logger code not realizing that it shouldn't try to flush its log when the connection has already terminated.
Flags: needinfo?(ted)
Not sure either why bug 1350917 might cause this. On the other hand, perhaps the graphics changes in that commit (bug 1351426) are causing the logging to be attempted.

I agree that we should be avoiding the log flush after the IPC teardown.
Flags: needinfo?(mconley)
I ran the tests multiple times for the builds in the last pushlog, and it clearly confirms this range. With bug 1350917 it fails. But maybe, as my other attachment shows, there is some other cause which results in the same or similar behavior? I could run it again and take another trace.
I will now try with https://tools.taskcluster.net/index/artifacts/gecko.v2.autoland.pushdate.2017.04.13.20170413014424.firefox/macosx64-opt, which seems to be the build for the graphics change I didn't find before.
Whiteboard: gfx-noted
The hang is clearly not reproducible for the build with changeset 5f4e3982c169. I ran the crash test 1200 times without a single hang during shutdown. But once I use the build for 8fbfe8c77eaa, it hung after 197 iterations, and with exactly the same output as attachment 8903086 [details] for GFX/layers.

I backed out 8fbfe8c77eaa locally from recent mozilla-central, and pushed a try build:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=98b7c5ece7941651d482ab8c031dc0c512310145

I will check if the hang is gone and report back soon.
I accidentally pushed an artifact build, so here is the new try build with --no-artifact:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=85e6b908fa3d6cc10e8338509bf677f43ddafec1
So even without the breakpad change the hang is visible. That means I'm out of ideas here, and maybe we should just fix the GFX part. I'm not sure whether other threads could cause similar behavior.

Let me know if there is something else you would like to know. Otherwise I will hand the loaner back to RelEng tomorrow.
Flags: needinfo?(gfritzsche)
I'm going to hand back the one-click loaner given the low capacity of MacOS machines. If more testing is needed we might have to request another one.
Henrik, what's the actual patch that "fixes" this that you mention above?
Flags: needinfo?(dvander) → needinfo?(hskupin)
Priority: -- → P3
(In reply to Milan Sreckovic [:milan] from comment #24)
> Henrik, what's the actual patch that "fixes" this that you mention above?

There is no patch available. I'm also not sure which comment you are referring to. Sorry if something was unclear, but this bug clearly needs a developer to get it fixed.
Flags: needinfo?(hskupin) → needinfo?(milan)
I was referencing comment 20 "The hang is clearly not reproducible for the build with the changeset 5f4e3982c169".
Flags: needinfo?(milan)
We are about to build the 56 RC, so this is likely too late for 56. Marking 56 as wontfix.
(In reply to Milan Sreckovic [:milan] from comment #26)
> I was referencing comment 20 "The hang is clearly not reproducible for the
> build with the changeset 5f4e3982c169".

That was the last working commit, which didn't produce a test failure when running the test hundreds of times. As mentioned in comment 21, backing it out didn't change anything for recent builds, which still fail. That means there is no patch/fix available, and I'm out of ideas here. We really need someone from the GFX team.
This doesn't really seem like a gfx bug; it is lower level, and GFX is just the first thing to trigger it. Some questions:

Does commenting out the message at [1] fix the problem, or do we just hang somewhere else instead?

Is the stack in comment #0 hanging? Is the parent process fully destroyed at that point? I wonder if the OS is not marking the pipe as closed, or if the parent process is somehow still alive in some kind of zombie state.

[1] http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/gfx/layers/ipc/CompositorBridgeChild.cpp#557
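For illustration, here is a minimal, self-contained C++ sketch (hypothetical types, not the actual Gecko classes) of the guard being discussed: a log forwarder that turns into a no-op once the channel to the parent has errored out, so it never attempts a send that can no longer complete.

  // Hypothetical sketch only; ChannelSketch/LogForwarderSketch are made-up
  // names that stand in for the IPC channel and CrashStatsLogForwarder roles.
  #include <iostream>
  #include <string>

  struct ChannelSketch {
    bool open = true;
    void Send(const std::string& msg) { std::cout << "sent: " << msg << "\n"; }
  };

  class LogForwarderSketch {
   public:
    explicit LogForwarderSketch(ChannelSketch* aChannel) : mChannel(aChannel) {}

    // Drop the message instead of sending once the channel is gone; after the
    // parent has crashed there is nobody left to answer.
    void Log(const std::string& aMsg) {
      if (!mChannel || !mChannel->open) {
        return;
      }
      mChannel->Send(aMsg);
    }

    void OnChannelError() { mChannel->open = false; }

   private:
    ChannelSketch* mChannel;
  };

  int main() {
    ChannelSketch channel;
    LogForwarderSketch forwarder(&channel);
    forwarder.Log("gfx error before teardown");  // forwarded
    forwarder.OnChannelError();
    forwarder.Log("gfx error after teardown");   // dropped, no hang
  }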
Sorry, I cannot answer this because I don't have the knowledge in that area. If someone is going to work on it, it might be good to request a macOS loaner again, which makes it easier to reproduce.
This is one of the most frequent crashes for Marionette tests at the moment. David, what can be done to make progress on this bug?
Flags: needinfo?(dvander)
Hi Milan, I am reviewing 57+ tracked bugs and wondering if this one will be fixed in 57 or not. Should I just wontfix this for 57 now? If a fix is ready in time and low risk, we can consider uplifting to beta57.
Flags: needinfo?(milan)
I wouldn't track this for 57.
Flags: needinfo?(milan)
Not tracking for 57 per comment 33.  This still needs more investigation and might not make 57.
Milan, this hang is most likely the cause of all the crash issues we get for Marionette unit tests on MacOS. If we cannot fix it, those useful tests will be disabled soon. That would be sad to see, because they have surfaced a couple of other Firefox misbehaviors in the past. I don't know how difficult and complex a fix would be, but maybe you can give some feedback on the latest findings?

Some hints on how to get this fixed would also be welcome, so that someone else could contribute.
Flags: needinfo?(milan)
Flags: needinfo?(dvander)
David, who do you think would be a good person to answer the questions about "low level" issues mentioned in comment 29, so that we can keep this going?
Flags: needinfo?(milan) → needinfo?(dvander)
This has 251 failures associated with it in the last week (assuming that all of the failures in the 4 bugs that depend on this share the same root cause).

:milan, :dvander doesn't seem to be responding; can you help find someone else to look into this?
Flags: needinfo?(milan)
Yeah, we're stuck with some higher priority 57 issues; this one will have to wait for some bandwidth. I understand it's tracked for 57, but based on the testing it is a theoretical crash. In the meantime, maybe Bill can answer some of the questions on how to keep this going.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(milan)
Flags: needinfo?(dvander)
I don't really have time to look at this, but it would help if someone could answer David's question in comment 29:
"Does commenting out the message at [1] fix the problem, or do we just hang somewhere else instead?"

If commenting out that line makes the hangs go away, then we know what to do. Otherwise, this will take a lot of time to track down and understand.

[1] http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/gfx/layers/ipc/CompositorBridgeChild.cpp#557
Flags: needinfo?(wmccloskey)
I pushed to try for marionette on osx and got some of the other failures that are "dependent" upon this bug:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8878da4a67da740afa89ce8998ebfc351df57f26

I am not sure where this crash stands; looking back over the last few days on trunk branches I see assertions in the failures, not crashes. Maybe I need to look deeper into the logs.

In summary, I don't know what the real problem is here or how to see it, so I am not sure if my try push using the hack suggested in comment 40 fixed it or not.
Whether the attached patch is applied or not doesn't make a difference for the hangs during shutdown that I see locally. In my case the parent process is gone, but the content process is still around and doesn't close. It looks like these days the underlying issue for the Marionette unit test is more related to bug 1404190.

Given that we have seen a hang like this outside of Marionette as well, the bug should still be valid.
No longer blocks: 1376773
I actually just seem to have hit this bug with the crash test. Sampling the process showed me:

>    +                                                     2511 mozilla::gfx::Log<1, mozilla::gfx::CriticalLogger>::Flush()  (in XUL) + 168  [0x108bdd9e8]
>    +                                                       2511 mozilla::gfx::CriticalLogger::OutputMessage(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool)  (in XUL) + 48  [0x108be2ea0]
>    +                                                         2511 CrashStatsLogForwarder::Log(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)  (in XUL) + 164  [0x108e46b04]
>    +                                                           2511 mozilla::dom::PContentChild::SendGraphicsError(nsTString<char> const&)  (in XUL) + 107  [0x1088f5c8b]

It could just be a single case that happens once in a while, alongside the problem in bug 1404190.

Or well, the content process here is trying to send a graphics error to the parent, right? But what happens when the parent is no longer around, as happened in this case? Does the content process wait forever for a reply or confirmation?
Blocks: 1376773
Flags: needinfo?(wmccloskey)
Flags: needinfo?(dvander)
As Andrew McCreight informed me, bug 1404190 only covers debug and ASan builds, not opt builds. So this would clearly be a different issue.

That said, I continued to run tests and took a lot of samples. In all cases so far the same stack is visible for the content process. It can be seen in the new attachment. Here is an excerpt:

    +                                           8045 mozilla::detail::RunnableMethodImpl<mozilla::ipc::MessageChannel*, void (mozilla::ipc::MessageChannel::*)(), false, (mozilla::RunnableKind)1>::Run()  (in XUL) + 44  [0x10588dccc]
    +                                             8045 mozilla::ipc::PBackgroundChild::OnChannelError()  (in XUL) + 27  [0x1059de48b]
    +                                               8045 mozilla::ipc::PBackgroundChild::DeallocSubtree()  (in XUL) + 1892  [0x1059dd774]
    +                                                 8045 mozilla::dom::DeallocClientManagerChild(mozilla::dom::PClientManagerChild*)  (in XUL) + 18  [0x106bec752]
    +                                                   8045 arena_dalloc(void*, unsigned long, arena_t*)  (in libmozglue.dylib) + 117  [0x102243045]
    +                                                     8045 arena_t::DallocSmall(arena_chunk_t*, void*, arena_chunk_map_t*)  (in 

Maybe Jim also has an idea of what could be going on here.
Attachment #8905402 - Attachment is obsolete: true
Flags: needinfo?(jmathies)
By the way, the tests in the last comment have all been done with a build that has the following line commented out:

http://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/gfx/layers/ipc/CompositorBridgeChild.cpp#557
Flags: needinfo?(bill.mccloskey)
(In reply to Henrik Skupin (:whimboo) from comment #12)
> Sorry, the last mentioned pushlog was not correct. I was able now to
> reproduce with a build from April 15th, which I originally used. So please
> give me some more time to run those tests.

This comment may have been wrong given that this might actually be a crash...

(In reply to Peter Chang[:pchang] from comment #11)
> > https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=ce69b6e1773e9e0d0a190ce899f34b1658e66ca4&tochange=a374c35469935a874fefe64d3e07003fc5bc8884
> 
> From the push log, I only saw the following bug was related to gfx IPC.
> Could you try to reproduce with/without the follwoing patch? 
> 
> 1574c02ea13f	vincentliu — Bug 1350330 - Make sure IPC channel still Open
> before sending IPC message. r=dvander

... which this patch fixed. At least the description of that bug matches the situation we have here: the parent process crashes, and the content process wants to send an IPC message to the parent, which no longer exists. I'm not sure that backing out this patch would be that helpful, though.

It would be great to get some feedback on how to proceed.
Asking Haik to take a look when he has free cycles.
Flags: needinfo?(jmathies)
Flags: needinfo?(haftandilian)
Flags: needinfo?(dvander)
Since yesterday, all the crashes we have seen in test_crash.py on MacOS for debug builds are gone. Did something important land for IPC since March 7th that could have fixed this?
I tried my testcase with the most recent Nightly build of Firefox and I can still reproduce it. Looking at the Activity Monitor sample, I feel this hang is actually relevant to bug 1435343 (RuntimeService::CrashIfHanging), but we never even crash here.

Andrea, would you mind having a look at the attached Activity Monitor sample? I would appreciate it. Thanks.
Flags: needinfo?(amarchesini)
Moving the NI to jimm again.
Flags: needinfo?(amarchesini) → needinfo?(jmathies)
Just in case something has changed in the last 2 months, here is an updated sample from Activity Monitor.
(In reply to Andrea Marchesini [:baku] from comment #50)
> Moving the NI to jimm again.

What's this ni about please?
Flags: needinfo?(jmathies)
Flags: needinfo?(haftandilian) → needinfo?(jmathies)
This is on my radar. I'll assign the bug to myself for now. So far I've done some testing and was able to reproduce a hang that looked like this bug, but for me the test automatically cleaned it up. I tried getting this running in a loop on my secondary machine, but ran into failures I haven't resolved yet. I will try to make some progress on this next week. Sorry for the delay.
Assignee: nobody → haftandilian
Flags: needinfo?(jmathies)
Haik, it is good to hear that you were able to reproduce the hang. And yes, the Marionette harness kills the process after a timeout of 120s by default. It's best if you modify the following line and replace the wait argument with None:

https://dxr.mozilla.org/mozilla-central/rev/55d91695f4bb951c224005155baef054a786c1f7/testing/marionette/client/marionette_driver/marionette.py#807

Then Marionette will wait forever, and you should have enough time to investigate.
Haik, another 20 days have passed. Could you please have a look at this bug? Please check my last comment 54 for how to prevent Marionette from shutting down the browser.
Flags: needinfo?(haftandilian)
I'm using your tip to disable the timeout. I've been trying to reproduce this today. I had to change the testcase to expect a different error message:

  > with self.assertRaisesRegexp(IOError, "unexpectedly closed"):
  < with self.assertRaisesRegexp(IOError, "crashed"):

I've been running it for a few hours so far and haven't hit the problem. I've set this to run on my spare system so I can keep the test running.
Flags: needinfo?(haftandilian)
(In reply to Haik Aftandilian [:haik] from comment #56)
> I'm using your tip to disable the timeout. I've been trying to reproduce
> this today. I had to change the testcase to expect a different error message:
> 
>   > with self.assertRaisesRegexp(IOError, "unexpectedly closed"):
>   < with self.assertRaisesRegexp(IOError, "crashed"):

No, you should not make this change. Seeing a different message means you are hitting abnormal behavior and that no minidump files have been created. Please revert that.

> I've been running it for a few hours so far and haven't hit the problem.
> I've set this to run on my spare system so I can keep the test running.

Interesting. I can clearly reproduce with a build from 03-09, but like you I don't see it with a build from 03-29. I will try myself later today. Maybe it got fixed by something else. I will let you know.
The test failure here appears to be a deadlock caused by a bug in the Mac breakpad exception handling. The affected content process never exits because the breakpad exception handler running in the process is waiting for a lock held by a thread that is suspended.

In the test, when a content process is being shut down, the breakpad ExceptionHandler runs and attempts to uninstall itself. Here's the stack showing ExceptionHandler::UninstallHandler() stuck trying to acquire a memory allocator lock due to an object deletion. It is a bug for ExceptionHandler::UninstallHandler() to free memory when its argument |bool in_exception| is true, and in this case it is true. The call old_handler_.reset(), where old_handler_ is a scoped pointer, causes the free.

  thread #5 (breakpad thread)
    frame #0 libsystem_kernel.dylib`syscall_thread_switch() 
    frame #1 libsystem_platform.dylib`_OSSpinLockLockYield() 
    frame #2 libmozglue.dylib`Mutex::Lock() at Mutex.h:66
    frame #3 libmozglue.dylib`AutoLock<Mutex>::AutoLock() at Mutex.h:129
    frame #4 libmozglue.dylib`AutoLock<Mutex>::AutoLock() at Mutex.h:127
    frame #5 libmozglue.dylib`arena_dalloc() at mozjemalloc.cpp:3498
    frame #6 libmozglue.dylib`BaseAllocator::free() at mozjemalloc.cpp:4238
    frame #7 libmozglue.dylib`Allocator<MozJemallocBase>::free() at malloc_decls.h:40
    frame #8 libmozglue.dylib`Allocator<ReplaceMallocBase>::free() at malloc_decls.h:40
    frame #9 libmozglue.dylib`::free(void *)() at malloc_decls.h:40
    frame #10 libmozglue.dylib`zone_free_definite_size() at zone.c:200
    frame #11 XUL`google_breakpad::scoped_ptr<sigaction>::reset() at scoped_ptr.h:107
    frame #12 XUL`google_breakpad::ExceptionHandler::UninstallHandler() at exception_handler.cc:697
    frame #13 XUL`google_breakpad::ExceptionHandler::WaitForMessage() at exception_handler.cc:585
    frame #14 libsystem_pthread.dylib`_pthread_body() 
    frame #15 libsystem_pthread.dylib`_pthread_start() 
    frame #16 libsystem_pthread.dylib`thread_start()
    
  thread #1 (main thread, holding the lock, suspended)
    frame #0 libmozglue.dylib`arena_run_reg_dalloc() at mozjemalloc.cpp:2298
    frame #1 libmozglue.dylib`arena_t::DallocSmall() at mozjemalloc.cpp:3418
    frame #2 libmozglue.dylib`arena_dalloc() at mozjemalloc.cpp:3504
    frame #3 libmozglue.dylib`BaseAllocator::free() at mozjemalloc.cpp:4238
    frame #4 libmozglue.dylib`Allocator<MozJemallocBase>::free() at malloc_decls.h:40
    frame #5 libmozglue.dylib`Allocator<ReplaceMallocBase>::free() at malloc_decls.h:40
    frame #6 libmozglue.dylib`::free(void *)() at malloc_decls.h:40
    frame #7 XUL`nsTArrayInfallibleAllocator::Free() at nsTArray.h:215
    frame #8 XUL`nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::~nsTArray_base() at nsTArray-inl.h:22
    frame #9 XUL`nsTArray_Impl<mozilla::layout::PVsyncChild*, nsTArrayInfallibleAllocator>::~nsTArray_Impl() at nsTArray.h:868
    frame #10 XUL`nsTArray<mozilla::layout::PVsyncChild*>::~nsTArray() at nsTArray.h:2322
    frame #11 XUL`nsTArray<mozilla::layout::PVsyncChild*>::~nsTArray() at nsTArray.h:2322
    frame #12 XUL`mozilla::ipc::PBackgroundChild::DestroySubtree() at PBackgroundChild.cpp:2859
    frame #13 XUL`mozilla::ipc::PBackgroundChild::OnChannelError() at PBackgroundChild.cpp:2497
    frame #14 XUL`mozilla::ipc::MessageChannel::NotifyMaybeChannelError() at MessageChannel.cpp:2597
    frame #15 XUL`mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError() at MessageChannel.cpp:2630
    
The hang happens because another thread is holding the lock needed by the ExceptionHandler thread. The thread holding the lock (like all other threads except the ExceptionHandler thread) is suspended, so it never makes progress and never releases the lock.

Most of the time this is hit, the main thread is holding a mozjemalloc arena lock and is executing PBackgroundChild::DestroySubtree(), but this could be caused by any thread holding a lock that is needed by memory freeing code.

All of the activity monitor samples Henrik posted are instances of this bug, but I never noticed the ExceptionHandler stack.

Without the fix, I can normally hit the hang after ~10 iterations of the test case on a MacBook Pro (Retina, 15-inch, Mid 2015). With the fix installed, I've run the test for 536 iterations without a hang. I'm running more tests on the other machine I have available.

I'll post the fix for the breakpad code and see about upstreaming it. The breakpad issue was introduced back in 2012 with https://github.com/google/breakpad/commit/b1f858f26b136044922eeae0d86488b17b54efff
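As a rough illustration of the fix direction described above (a sketch under stated assumptions, not the actual breakpad patch), the idea is that the crash-time uninstall path must not free memory while every other thread is suspended, so the old handler struct gets leaked on purpose:

  // Hypothetical HandlerSketch class; it mirrors the shape of the problem only.
  #include <signal.h>
  #include <cstdio>

  class HandlerSketch {
   public:
    void Install() { old_handler_ = new struct sigaction(); }

    void UninstallHandler(bool in_exception) {
      if (old_handler_) {
        if (in_exception) {
          // Crash path: all other threads are suspended and one of them may
          // hold the allocator lock, so delete/free could deadlock.
          // Intentionally leak the small struct instead.
          old_handler_ = nullptr;
        } else {
          // Normal shutdown path: safe to free.
          delete old_handler_;
          old_handler_ = nullptr;
        }
      }
    }

   private:
    struct sigaction* old_handler_ = nullptr;
  };

  int main() {
    HandlerSketch handler;
    handler.Install();
    handler.UninstallHandler(/* in_exception = */ true);  // leaks, never frees
    std::puts("uninstalled without touching the heap");
  }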
Priority: P3 → P1
Attachment #8967193 - Attachment description: Breakpad patch to avoid triggered by ExceptionHandler::UninstallHandler() freeing memory → Breakpad patch to avoid deadlock triggered by ExceptionHandler::UninstallHandler() freeing memory
Ted, assuming this fix is correct, would we upstream it or land it in central?
Flags: needinfo?(ted)
The code under breakpad-client has been forked, so you can land there directly.
Flags: needinfo?(ted)
The posted fix intentionally leaks a sigaction struct, which is 16 bytes, but this only occurs when a breakpad exception is being handled and a minidump is being written before shutting down, and ExceptionHandler::UninstallHandler() already leaks some memory in this situation. All other threads have been suspended, so it is not safe to run code that could try to acquire a lock. See comment 58 for details.

More explanation can be found here <https://chromium.googlesource.com/breakpad/breakpad/+/master/docs/client_design.md>:
"Library code, including runtime library code, must be avoided unless it provably meets the above guidelines. For example, this means that the STL string class may not be used, because it performs operations that attempt to allocate and use heap memory. It also means that many C runtime functions must be avoided, particularly on Windows, because of heap operations that they may perform."
Attachment #8967193 - Attachment is obsolete: true
Depends on: 1041474
Ted, if you can find the time for the review I would appreciate it, given that we still see a lot of crashes for Marionette tests due to this problem. Thanks.
Flags: needinfo?(ted)
I can also review this if Ted's busy, just redirect the request to me.
(In reply to Gabriele Svelto [:gsvelto] from comment #65)
> I can also review this if Ted's busy, just redirect the request to me.

Thanks. I redirected this to Gabriele since Ted has a full review queue.
Apologies. I've been doing better at keeping my review queue down but I was at a work week last week so I didn't keep on top of it (and a fairly large patch series snuck in there as well).
Flags: needinfo?(ted)
Comment on attachment 8967597 [details]
Bug 1395504 - Infinite hang of web content process when parent process crashes

https://reviewboard.mozilla.org/r/236294/#review244780

Really glad you fixed this; it's been plaguing us for months. Did we open an upstream bug too? While this is part of breakpad that we forked, it's not a bad idea to play nice with upstream.
Attachment #8967597 - Flags: review?(gsvelto) → review+
Comment on attachment 8967597 [details]
Bug 1395504 - Infinite hang of web content process when parent process crashes

https://reviewboard.mozilla.org/r/236294/#review244780

Thanks. I'll try and get it landed upstream too. I saw some comments indicating it wasn't maintained anymore but I'll try.
IIRC the Windows and Mac parts of Breakpad aren't actively maintained, but they'll accept patches.
Pushed by haftandilian@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/933eb0b6a922
Infinite hang of web content process when parent process crashes r=gsvelto
https://hg.mozilla.org/mozilla-central/rev/933eb0b6a922
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
(In reply to Haik Aftandilian [:haik] from comment #70)
> Thanks. I'll try and get it landed upstream too. I saw some comments
> indicating it wasn't maintained anymore but I'll try.

I wouldn't bother, TBH. Chrome is using crashpad for exception handling on macOS.
See Also: → 1457545
No longer blocks: 1504482
Blocks: 1377359