Closed Bug 1122203 Opened 9 years ago Closed 3 years ago

Content process keeps crashing under PLayerTransactionChild::SendPTextureConstructor in e10s mode

Categories

(Core :: Graphics, defect, P3)

x86
macOS
defect

Tracking

RESOLVED WORKSFORME
Tracking Status
e10s + ---

People

(Reporter: bzbarsky, Unassigned)

Details

(Whiteboard: gfx-noted)

When I just updated Firefox it put itself back into e10s mode, then the content process proceeded to crash.  I went through manually reloading tabs, and as I was doing that it crashed again.  The crashes are in the same place, so it looks like I can reproduce this so far, in case that's useful.

The crash reports are:

  https://crash-stats.mozilla.com/report/index/a4a4360e-843e-4bf7-83fe-bf7b82150115
  https://crash-stats.mozilla.com/report/index/f9d37e4e-513f-447a-8789-397602150115

Both show us crashing on line 207 of the generated PLayerTransactionChild.cpp, in PLayerTransactionChild::SendPTextureConstructor.  The relevant code is:

    bool __sendok = (mChannel)->Send(__msg);
    if ((!(__sendok))) {
        NS_RUNTIMEABORT("constructor for actor failed");

where line 207 is the NS_RUNTIMEABORT.
Blocks: 1111396
Flags: needinfo?(davidp99)
bz,

Are you saying that you can reproduce this locally but don't have an STR?  This looks like a failure of the TextureHost, which often comes with some reason, printed to the console.  Any chance you can get the console output from before the crash?

Also, are you using the OGL compositor or the basic compositor?  (Probably OGL but it can't hurt to check.)
Flags: needinfo?(bzbarsky)
I can reproduce locally in my normal browsing profile.  I haven't managed to in my debug build, because it basically refuses to load the session for reasons that escape me....  Presumably the STR involves my session.  ;)

> This looks like a failure of the TextureHost, which often comes with some reason, printed
> to the console

Which console?  The Firefox browser console, or some other one?

> Also, are you using the OGL compositor or the basic compositor?

How do I tell?
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] from comment #2)
> Which console?  The Firefox browser console, or some other one?

I meant the debug log, which isn't so useful since you are getting different behavior in the debug build.  The debug log (...meaning the stuff printed to the Visual Studio or cygwin console during development) could still be useful since whatever is causing debug to behave differently is probably directly relevant, but this is starting to sound like a case of profile corruption.

> > Also, are you using the OGL compositor or the basic compositor?
> 
> How do I tell?

...that's a good question since you can't even launch (booting Firefox in safe mode won't give the right result).  Usually, you'd look at about:support and check about:config for the layers.offmainthreadcomposition.force-basic value, but that will be quite hard since you crash in the first 11ms.

----

So it sounds like this profile (your main profile) crashes at launch... but you have other profiles that still run fine?  I can't think of anything that can be done to diagnose this that isn't very involved.  All that's left is to reset your profile from safe mode and, assuming you can launch after that, start enabling any addons you recall using (and restore any settings you may recall changing) to see if the issue returns.  That's what I'd recommend you do anyway, to get your busted profile back, but it's unlikely to reproduce the bug.

The instructions for resetting / safe mode:
https://support.mozilla.org/en-US/kb/reset-firefox-easily-fix-most-problems

If you can't boot to safe mode, or if it doesn't repair the problem, then this will get interesting.

Side question: The crash logs show that the firefox build is amd64... are you running Hackintosh?  Or is the crash report just wrong?
Flags: needinfo?(bzbarsky)
> I meant the debug log,

So just stdout/stderr, in a debug build?  I can certainly try harder to reproduce this in a debug build.

> meaning the stuff printed to the Visual Studio or cygwin console

Uh.... I'm on Mac, yes?  ;)  I assume you mean stdout/stderr.

> since you can't even launch

Sure I can.  The content process crashes.  The chrome process is running fine.  If I try to actually load the tabs again in the content process, the content process crashes again once enough of them are loaded.

> Usually, you'd look at about:support

For which exact information?

> and check about:config for the layers.offmainthreadcomposition.force-basic 

Set to the default value: false.

> So it sounds like this profile (your main profile) crashes at launch...

No, it crashes partway through restoring my session.  And, again, what crashes is only the content process.

If I turn off e10s everything works fine, but I assume we're not going through the ipdl for LayerTransaction at all at that point.

> I can't think of anything that can be done to diagnose this that isn't very involved. 

I'm fine with involved, as long as you give me an idea of what I'm looking for.  First step will be to try reproducing in the debug build again.

> All that's left is to reset your profile from safe mode

Since this has to do with the exact set of web pages I have in my session, I doubt that will be very helpful.

> start enabling any addons you recall using

I know exactly which addons I have installed.  I can verify that they're not implicated later tonight, but I'm 99.9% sure they're not.  They're DOM Inspector, tab stats, and JIT Inspector.

> The crash logs show that the firefox build is amd64

That's the hardware architecture.  Also known as x86-64.  All our Mac builds are on this hardware architecture, and have been for years.

> are you running Hackintosh?

Very much no.
Flags: needinfo?(bzbarsky)
OK, so I tried copying the whole profile, not just the sessionstore file, and now the session is restored correctly in a debug build.  Partway through that it crashes, but unfortunately not with the crash in this bug.  Instead:

[Child 56226] ###!!! ABORT: constructor for actor failed: file ./PNeckoChild.cpp, line 313

followed by the _chrome_ process aborting.

I'm going to try a local opt build next.  Where does the debug logging you want output from live, so I can try turning it on in that build?
I guess the other relevant bit in the debug build is this part somewhat before the abort:

  [Child 56226] WARNING: pipe error: Broken pipe: file ../../../mozilla/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 728

then a bunch of:

  ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv

  [Child 56226] WARNING: MsgDropped in ContentChild: file ../../../mozilla/dom/ipc/ContentChild.cpp, line 1805

and after a bit of that the PNeckoChild abort.
OK, I got that completely mixed up.  Let me try this again.

That is what I was looking for -- the stdout spew.  Too bad the location of the bug moved -- I was hoping for a clearer message from TextureHost.  The PNeckoChild error is probably late in the failure.

The debug results suggest that the issue isn't ever with the LayerTransaction.  It's probably an IPC error -- if the PNeckoChild lost the connection to the parent process then the PLayerTransactionChild would too.  ...And it's not hard to imagine the debug run showing a different asynchronous failure from the release build.
Flags: needinfo?(davidp99)
Fwiw, my opt build finished, and it too crashes due to aborts in PNeckoChild, as far as I can tell.  This might be a slightly different tab set from the one I had open yesterday, but at this point I have a snapshot of it, at least.

So yes, I agree that this is likely to be a more generic IPC failure.
Component: Graphics → IPC
Oh, and this time I built the opt build from rev cac6192956ab, which matches the nightly I was seeing the problem with initially, I think.
Boris, try breaking in the parent process here: http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#3196

Maybe the parent is forcefully killing the child.
Breakpoint on that line is not being hit before the parent process aborts.

That said, I tried this a few more times, and sometimes in the opt build the parent process stays up even after the NeckoChild abort and then I get:

[Child 47776] WARNING: file ../../../mozilla/ipc/chromium/src/base/shared_memory_posix.cc, line 225

[Parent 47772] WARNING: pipe error (33): Message too long: file ../../../mozilla/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 456

###!!! [Child][MessageChannel] Error: Channel error: cannot send/recv

[Child 47776] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file ../../../mozilla/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 20

[Child 47776] ###!!! ABORT: constructor for actor failed: file ./PLayerTransactionChild.cpp, line 207

and then a bunch of 

###!!! [Parent][MessageChannel] Error: Channel error: cannot send/recv
And when I then tried to actually load the tabs, it went along for a while, and then in the parent process:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
[Switching to process 47772 thread 0x3713]
0x00000001019f91b2 in poll (filedes=<value temporarily unavailable, due to optimizations>, nfds=<value temporarily unavailable, due to optimizations>, timeout=<value temporarily unavailable, due to optimizations>) at ../../../../../../mozilla/nsprpub/pr/src/md/unix/unix.c:3786

This is from the socket transport service polling.

This was followed by another:

[Child 47781] ###!!! ABORT: constructor for actor failed: file ./PNeckoChild.cpp, line 313

in the console.
You might try a syscall tracing tool. On Macs there's something called dtruss, although I haven't used it.
Boris, would you be willing to send me the session? I could try to reproduce. Otherwise, it may be up to you to debug.
Flags: needinfo?(bzbarsky)
Bill, I sent you the sessionstore file, I hope.
Flags: needinfo?(bzbarsky)
Flags: needinfo?(wmccloskey)
Looking at this bug a little more closely, it seems like file descriptors are somehow involved. One possible idea is that it could be related to bug 1036682. You could try running with RLIMIT_NOFILE=4864 as suggested in the bug. If the problem goes away, then we're probably running out of file descriptors.
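
To see what descriptor limit a process is actually running with, something like the following minimal Python sketch can help. It is not Firefox code: `fd_limits` and `raise_soft_limit` are hypothetical helper names, and the sketch only assumes the standard POSIX RLIMIT_NOFILE resource limit as exposed by Python's `resource` module.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft fd limit to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. The OS may refuse the
    request, in which case the existing limit is kept.
    """
    soft, hard = fd_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft  # nothing to raise
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass  # request rejected; keep the existing limit
    return fd_limits()[0]
```

Calling `raise_soft_limit(4864)` and comparing the result with the requested value shows whether the OS accepted the request.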

Another possibility is that the buffer here is too small:
http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.h#121
Perhaps you could try doubling the size and see if the problem goes away?
Flags: needinfo?(wmccloskey)
I assume you meant RLIMIT_FILENO, but in any case I set both env vars.  When I set them to 4864, the chrome process crashes.  When I set them to 16384, the chrome process crashes.  When I set them to 65535 (is that even a valid value?) the content process crashes.

Going to try the buffer increase bit.
Raising the buffer size to 128 and using 16384 for the RLIMIT env vars... the chrome process crashes.
OK, Bill and I debugged this for a bit today.

We're totally running out of file descriptors.  Once we get to 4000-some, the dup() call in SharedMemory::CreateOrOpen fails, the DCHECK following it does nothing in an opt build, and then mapped_file_ is bogus and the failures start cascading and everything falls down.
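
The failure mode can be sketched outside the browser. The Python snippet below is an illustration only, not the Chromium IPC code: it lowers the soft descriptor limit to a small value (64 is an arbitrary assumption), duplicates one descriptor until the OS refuses, and reports the error. `dup_until_failure` is a hypothetical name. In Python the failure surfaces as an exception, so it cannot be silently ignored the way an unchecked return value can.

```python
import errno
import os
import resource


def dup_until_failure(soft_limit=64):
    """Lower the soft fd limit, then dup() one descriptor until it fails.

    Returns (number of successful dups, errno of the failure).
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    read_end, write_end = os.pipe()  # a descriptor that is safe to duplicate
    dups = []
    failure = None
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
        while failure is None:
            try:
                dups.append(os.dup(read_end))
            except OSError as exc:
                # The caller has to notice this and stop; carrying on with
                # an invalid descriptor is what makes later calls fail.
                failure = exc.errno
    finally:
        # Close everything and restore the original limit.
        for fd in dups + [read_end, write_end]:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return len(dups), failure
```

On a POSIX system the reported error should be `errno.EMFILE` ("too many open files").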

I tried checking what calls SharedMemory::CreateOrOpen here and I'm getting thousands of calls with stacks coming from mozilla::layers::ClientTiledPaintedLayer::RenderLayer (stack below for those who care about the details).

There _are_ some calls to SharedMemory::Close, coming off processing SHMEM_DESTROYED_MESSAGE_TYPE in PCompositorChild::OnMessageReceived.  But as far as I can tell we're not destroying nearly as fast as we create.  In particular, if I do some printf logging of the creations and destructions, I see about 2555 shmem creations and about 150 destructions before things start to go south.  

Bill and I tried setting the "layers.enable-tiles" preference to false, and the problem went away completely; I can start the browser just fine.  So this really does seem to be file descriptor exhaustion due to the thousands of shared memory areas the tiling code is creating.

George, needinfoing you because Bill said you'd been looking at something similar.  David, needinfoing you because you were looking at this before on the graphics side, and we're back to this being primarily triggered by graphics.

Stack to the shmem CreateOrOpen calls:

#0  base::SharedMemory::CreateOrOpen (this=0x125430df0, name=@0x7fff5fbfb720, posix_flags=514, size=8192) at ../../../mozilla/ipc/chromium/src/base/shared_memory_posix.cc:163
#1  0x0000000100734626 in base::SharedMemory::Create (this=0x125430df0, cname=<value temporarily unavailable, due to optimizations>, read_only=<value temporarily unavailable, due to optimizations>, open_existing=<value temporarily unavailable, due to optimizations>, size=8192) at ../../../mozilla/ipc/chromium/src/base/shared_memory_posix.cc:79
#2  0x000000010077cc2d in mozilla::ipc::SharedMemoryBasic::Create (this=0x125430dd0, aNbytes=8192) at SharedMemoryBasic_chromium.h:40
#3  0x00000001007794b0 in mozilla::ipc::CreateSegment (aNBytes=8192, aHandle={fd = -1, auto_close = false}) at Shmem.cpp:145
#4  0x0000000100779357 in already_AddRefed<mozilla::ipc::SharedMemory>::take () at /Users/bzbarsky/mozilla/inbound/obj-firefox-opt/dist/include/mozilla/AlreadyAddRefed.h:509
#5  0x0000000100779357 in mozilla::ipc::Shmem::Alloc () at Shmem.cpp:163
#6  0x000000010085ed26 in nsRefPtr<mozilla::ipc::SharedMemory>::nsRefPtr<mozilla::ipc::SharedMemory> () at PCompositorChild.cpp:649
#7  0x000000010085ed26 in already_AddRefed<mozilla::ipc::SharedMemory>::take () at PCompositorChild.cpp:109
#8  0x000000010085ed26 in nsRefPtr<mozilla::ipc::SharedMemory>::nsRefPtr<mozilla::ipc::SharedMemory> () at /Users/bzbarsky/mozilla/inbound/obj-firefox-opt/dist/include/nsRefPtr.h:106
#9  0x000000010085ed26 in mozilla::layers::PCompositorChild::CreateSharedMemory (this=0x1000, aSize=140734799787808, aType=4294967295, aUnsafe=<value temporarily unavailable, due to optimizations>, aId=0x1000) at PCompositorChild.cpp:163
#10 0x00000001009a3a44 in mozilla::layers::PLayerTransactionChild::AllocUnsafeShmem (this=0x125430df0, aSize=140734799787808, aType=4294967295, aOutMem=0x7fff5fbfb8f0) at PLayerTransactionChild.cpp:958
#11 0x0000000100e90599 in mozilla::layers::ShadowLayerForwarder::AllocUnsafeShmem (this=<value temporarily unavailable, due to optimizations>, aSize=<value temporarily unavailable, due to optimizations>, aType=<value temporarily unavailable, due to optimizations>, aShmem=<value temporarily unavailable, due to optimizations>) at ShadowLayers.cpp:727
#12 0x0000000100e7e0c6 in mozilla::layers::ISurfaceAllocator::AllocShmemSection (this=0x11da24d50, aSize=4, aShmemSection=0x125430d98) at ISurfaceAllocator.cpp:228
#13 0x0000000100e4ddf2 in operator new () at /Users/bzbarsky/mozilla/inbound/obj-firefox-opt/dist/include/mozilla/mozalloc.h:388
#14 0x0000000100e4ddf2 in mozilla::layers::TileClient::GetBackBuffer (this=0x7fff5fbfbd48, aDirtyRegion=@0x7fff5fbfbb98, aContent=<value temporarily unavailable, due to optimizations>, aMode=<value temporarily unavailable, due to optimizations>, aCreatedTextureClient=0x7fff5fbfbbb7, aAddPaintedRegion=@0x7fff5fbfbb78) at TiledContentClient.cpp:163
#15 0x0000000100e4f1bd in mozilla::layers::ClientTiledLayerBuffer::ValidateTile (this=0x124ed6718, aTile=@0x7fff5fbfbd48, aTileOrigin=@0x7fff5fbfbd40, aDirtyRegion=@0x7fff5fbfbf28) at TiledContentClient.cpp:1110
#16 0x0000000100e647af in mozilla::layers::TiledLayerBuffer<mozilla::layers::ClientTiledLayerBuffer, mozilla::layers::TileClient>::Update (this=0x124ed6718, aNewValidRegion=@0x1233a9df0, aPaintRegion=@0x7fff5fbfc590) at TiledLayerBuffer.h:535
#17 0x0000000100e4e843 in mozilla::layers::ClientTiledLayerBuffer::PaintThebes (this=0x124ed6718, aNewValidRegion=@0x1233a9df0, aPaintRegion=@0x7fff5fbfc590, aCallback=<value temporarily unavailable, due to optimizations>, aCallbackData=<value temporarily unavailable, due to optimizations>) at TiledContentClient.cpp:941
#18 0x0000000100e3f4c6 in mozilla::layers::ClientTiledPaintedLayer::ClientManager () at /Users/bzbarsky/mozilla/inbound/mozilla/gfx/layers/client/ClientTiledPaintedLayer.h:425
#19 0x0000000100e3f4c6 in mozilla::layers::ClientTiledPaintedLayer::RenderLayer (this=0x1233a9c00) at ClientTiledPaintedLayer.cpp:163
Severity: normal → critical
Component: IPC → Graphics
Flags: needinfo?(gwright)
Flags: needinfo?(davidp99)
Jeff and I have been discussing this today. We think we're going to increase the soft limit for fds to start with, and look into a long-term solution involving the usage of fewer tiles. Specifically, stop using tiles for things like non-scrollable content.
Flags: needinfo?(gwright)
There is an additional (long term) solution: having a kind of "shmem allocator" and group tiles/video frames in big shmems rather than having a shmem per tile. Not a low hanging fruit but I definitely think we should do it eventually.
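
A rough sketch of that idea in Python, under loud assumptions: `TilePool` is a hypothetical class, fixed-size tiles are assumed, and anonymous `mmap` regions stand in for real cross-process shared memory segments (each real segment would cost one fd). The point is only that the number of segments, not the number of tiles, determines the descriptor count.

```python
import mmap


class TilePool:
    """Hands out fixed-size tile slots carved from a few large mappings."""

    def __init__(self, tile_bytes, tiles_per_segment):
        self.tile_bytes = tile_bytes
        self.tiles_per_segment = tiles_per_segment
        self.segments = []  # one large mapping per entry
        self.free = []      # (segment_index, slot_index) handles

    def alloc(self):
        """Return a handle to a free slot, growing by one segment if needed."""
        if not self.free:
            size = self.tile_bytes * self.tiles_per_segment
            self.segments.append(mmap.mmap(-1, size))
            index = len(self.segments) - 1
            # Reversed so that pop() hands out slot 0 first.
            self.free.extend(
                (index, slot) for slot in reversed(range(self.tiles_per_segment))
            )
        return self.free.pop()

    def release(self, handle):
        """Return a slot to the pool so it can be reused."""
        self.free.append(handle)

    def view(self, handle):
        """Return a writable view of the bytes backing one tile slot."""
        index, slot = handle
        start = slot * self.tile_bytes
        return memoryview(self.segments[index])[start:start + self.tile_bytes]
```

With 64 slots per segment, 150 tiles fit in 3 segments instead of needing 150 separate shared memory areas.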
> We think we're going to increase the soft limit for fds to start with

What do you propose to increase the soft limit to?

For estimation purposes, if I were running my browser full-screened instead of the size I run it at, each window would be about 1920x1200, on a high-DPI display.  So presumably either 15x10 = 150 tiles per window or 12x8 == 96 tiles per window, depending on what exactly the coordinate space is for the tiles in my case (running at scaled resolution on an mbp).  We seem to be using 2 fds per tile, by the above counts, so 200-300 fds per window.  At the 4000 limit, that's 10-20 windows.  We'd have to bump the limit above 100,000 to get out of the range of session sizes I've seen people mention casually in conversation.
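
The estimate above can be written out as a small Python calculation. The tile edge of 256 pixels and the two device-pixel window sizes are assumptions chosen here because they reproduce the 15x10 and 12x8 grids; the thread does not state them. `tiles_per_window` and `windows_before_limit` are hypothetical names.

```python
import math


def tiles_per_window(width_px, height_px, tile_px=256):
    """Number of square tiles needed to cover a window of the given size."""
    return math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)


def windows_before_limit(fd_limit, tiles, fds_per_tile=2):
    """How many such windows fit under an fd limit, ignoring all other fds."""
    return fd_limit // (tiles * fds_per_tile)


# Assumed device-pixel sizes: 2x scaling of 1920x1200, and 2880x1800.
for width, height in [(3840, 2400), (2880, 1800)]:
    tiles = tiles_per_window(width, height)
    print(width, height, tiles, tiles * 2, windows_before_limit(4000, tiles))
# prints:
# 3840 2400 150 300 13
# 2880 1800 96 192 20
```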

(Also, for scale, everything else in Firefox together is using ~150 fds as far as I can tell...)

The solution of comment 21, one shmem per window, say, seems like a decent idea...

It might also be worth checking whether the use of 2 fds per shmem is in fact expected.
So just in case this wasn't clear, this is totally blocking me being able to dogfood e10s.  ;)
Can you please check if bug 1036682 had any effect?

Also, you can always turn off tiling if you want to dogfood e10s :-).
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] from comment #22)
> It might also be worth checking whether the use of 2 fds per shmem is in
> fact expected.

Tiles are (lazily) double-buffered, so if something needs to be redrawn (which happens a lot), you get two textures for the same tile (-> 2 shmems -> 2 fds). If the layer happens to be a component-alpha layer, you get up to 4 textures per tile instead of 2 (so 4 fds). That adds up to a lot of fds. It's expected but not particularly wanted.
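
As a small model of that multiplication (a sketch only): `textures_per_tile` is a hypothetical function, and it assumes that component alpha doubles the textures of each buffer, which is one reading of "up to 4" rather than something stated here.

```python
def textures_per_tile(double_buffered, component_alpha):
    """Upper bound on textures (one shmem and one fd each) held by a tile."""
    buffers = 2 if double_buffered else 1    # front buffer plus lazy back buffer
    surfaces = 2 if component_alpha else 1   # assumed: two textures per buffer
    return buffers * surfaces


# Worst case for a window covered by 150 tiles:
print(150 * textures_per_tile(double_buffered=True, component_alpha=True))
# prints: 600
```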

> The solution of comment 21, one shmem per window, say, seems like a decent
> idea...
> 

I filed bug 1128503 to track grouping textures in bigger shmems (I filed it because I am convinced we'll need to do it eventually but it hasn't climbed the priority chain yet).
> Can you please check if bug 1036682 had any effect?

Doesn't look like it did.

I guess I can turn off tiling if that's a useful mode to test, good point.
Flags: needinfo?(bzbarsky)
This looks like a dupe of bug 1036682. However, there has been a very light scattering of reports (8) of this crash since that bug was fixed a month ago.
Flags: needinfo?(davidp99)
As per bz's comment in bug 1036682 (https://bugzilla.mozilla.org/show_bug.cgi?id=1036682#c85), I'm closing this as a dupe.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
No, this is not a duplicate.  By "working correctly now" in bug 1036682, I meant that the FD limit is now correctly increased to the max the OS will allow.  That higher limit is still not large enough to actually allow me to start the browser.  See the calculations in comment 22 for why: the current gfx code is using so many fds that with the sizes of sessions people use we'll blow out any fd limit the OS will let us set.

> However, there has been a very light scattering of reports (8) of this crash

Yes, BECAUSE THIS BUG IS NOT FIXED.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
So, one way to fix this is to reduce the number of tiles, but tha
(In reply to Not doing reviews right now from comment #26)
> > Can you please check if bug 1036682 had any effect?
> 
> Doesn't look like it did.
> 
> I guess I can turn off tiling if that's a useful mode to test, good point.

While we're on workarounds, increasing the tile size would probably be a better option - tiles don't have to be square, and can be any (reasonable) size, so setting layers.tile-height and layers.tile-width to non-default values can drastically reduce the number of tiles and thus the number of fds used.

If that helps, we can reopen the "change the default size" discussion.  Actually, it should happen, and we should re-open that discussion anyway.
I resolved this as a dupe because I think that fixing the bug properly is already adequately covered in bug 1130545. Milan, do you want to dupe this on bug 1130545 or keep this open separately?
Flags: needinfo?(milan)
> I think that fixing the bug properly is already adequately covered in bug 1130545

I suggest you run through the numbers for what happens for a user with 1000 tabs at a typical window size.  I don't think just increasing the size of the tiles is likely to do anything other than mask the bug, unless you make the size of the tiles comparable to the size of the screen (e.g. in the 1000 tab scenario, you can't really go below tiles that are 1/2 screen in width/height as far as I can see).
George, I thought you were going to dupe this to a bug on using one texture for all the tiles. That's the only way we're really going to fix this.
Sounds like we want this bug to be "don't be limited by the system fd limits".  That would certainly not be fixed by increasing the fd limits or using fewer file descriptors; that would just postpone the inevitable.  So, agreed, separate bug.

However, I honestly don't know how important it is for us to support this workflow.  I'm seeing 31 crashes with this signature in the past month; perhaps there are others that show up with a different signature, and if that's the case we should add them up, but I can't see being able to argue a major undertaking or even getting people to do a more detailed design on the alternative until I can back it up with some data.
Flags: needinfo?(milan)
> or using fewer file descriptors

I think it could in fact be fixed by using fewer file descriptors, where "fewer" is "two orders of magnitude".  Right now we're using several hundred fds per window for gfx (and, again, everything else in Firefox uses about 150 fds, so at, say, 4 fds per window you could easily go to 1000 windows without running into issues).  If we were using single-digit numbers of fds per window, we would not be having this conversation.  ;)
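
The same budget argument as a short Python calculation. The limit of 4864 is borrowed from the value suggested earlier in this bug and is an assumption about the effective limit; `max_windows` is a hypothetical name.

```python
def max_windows(fd_limit, fds_per_window, baseline_fds=150):
    """Windows that fit once the rest of the browser has taken its fds."""
    return max(0, (fd_limit - baseline_fds) // fds_per_window)


print(max_windows(4864, 300))  # several hundred fds per window -> prints: 15
print(max_windows(4864, 4))    # single-digit fds per window -> prints: 1178
```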

> However, I honestly don't know how important it is for us to support this workflow.

How important is it that I be able to use Firefox? ;)

> I'm seeing 31 crashes with this signature in the past month

Right, because this bug keeps the browser from starting up in a useful way at all.  I assume all the nightly users running into it disabled e10s just like I did, or stopped using nightlies!  In case it wasn't clear, I can't use the browser at _all_ in e10s mode due to this bug, so I never hit this crash in practice because I simply don't try to use the browser in a way that would trigger this crash.
Blocks: e10s-gfx
No longer blocks: 1111396
Priority: -- → P2
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3

Hey Boris,
Can you still reproduce this issue or should we close it?

Flags: needinfo?(bzbarsky)

Clearly the main issue (crashes due to running out of fds) was fixed at some point, I think. Whether Firefox is still using a ton of fds, I don't know and don't have a good way to test.

Flags: needinfo?(bzbarsky)

There's been ongoing work to reduce the number of fds, especially on Linux, so I don't think this bug serves a purpose.

Status: REOPENED → RESOLVED
Closed: 9 years ago → 3 years ago
Resolution: --- → WORKSFORME