Closed Bug 1036682 Opened 10 years ago Closed 9 years ago

Zooming map crashes in mozalloc_abort(char const*) | Abort | NS_DebugBreak | mozilla::layers::PLayerTransactionChild::SendPTextureConstructor(mozilla::layers::PTextureChild*, mozilla::layers::SurfaceDescriptor const&, mozilla::layers::TextureFlags const&)

Categories: Core :: Graphics: Layers (defect)
Hardware: All
OS: macOS
Priority: Not set
Severity: critical

Status: RESOLVED FIXED

Tracking Status
e10s m5+ ---
firefox33 --- unaffected
firefox34 --- unaffected
firefox35 ? affected

People: Reporter: cpeterson, Assigned: gw280

Details: Keywords: crash, Whiteboard: [leave-open]

Attachments: 4 files, 4 obsolete files

STR:
1. In an e10s window, load Mozilla Location Service map: https://location.services.mozilla.com/map
2. Zooooooooooom map to view street-level detail.

RESULT:
100% reproducible CRASH!


This bug was filed from the Socorro interface and is report bp-760da5b0-8138-42e2-81de-a30e52140709.
=============================================================

0 	libmozalloc.dylib 	mozalloc_abort(char const*) 	memory/mozalloc/mozalloc_abort.cpp
1 	XUL 	Abort 	xpcom/base/nsDebugImpl.cpp
2 	XUL 	NS_DebugBreak 	xpcom/base/nsDebugImpl.cpp
3 	XUL 	mozilla::layers::PLayerTransactionChild::SendPTextureConstructor(mozilla::layers::PTextureChild*, mozilla::layers::SurfaceDescriptor const&, mozilla::layers::TextureFlags const&) 	obj-firefox/x86_64/ipc/ipdl/PLayerTransactionChild.cpp
4 	XUL 	mozilla::layers::ShadowLayerForwarder::CreateTexture(mozilla::layers::SurfaceDescriptor const&, mozilla::layers::TextureFlags) 	gfx/layers/ipc/ShadowLayers.cpp
5 	XUL 	mozilla::layers::TextureClient::InitIPDLActor(mozilla::layers::CompositableForwarder*) 	gfx/layers/client/TextureClient.cpp
6 	XUL 	mozilla::layers::ImageClientSingle::AddTextureClient(mozilla::layers::TextureClient*) 	gfx/layers/client/CompositableClient.cpp
7 	XUL 	mozilla::layers::ImageClientSingle::UpdateImageInternal(mozilla::layers::ImageContainer*, unsigned int, bool*) 	gfx/layers/client/ImageClient.cpp
8 	XUL 	mozilla::layers::ImageClientSingle::UpdateImage(mozilla::layers::ImageContainer*, unsigned int) 	gfx/layers/client/ImageClient.cpp
9 	XUL 	mozilla::layers::ClientImageLayer::RenderLayer() 	gfx/layers/client/ClientImageLayer.cpp
10 	XUL 	mozilla::layers::ClientContainerLayer::RenderLayer() 	gfx/layers/client/ClientContainerLayer.h
11 	XUL 	mozilla::layers::ClientContainerLayer::RenderLayer() 	gfx/layers/client/ClientContainerLayer.h
12 	XUL 	mozilla::layers::ClientContainerLayer::RenderLayer() 	gfx/layers/client/ClientContainerLayer.h
13 	XUL 	mozilla::layers::ClientContainerLayer::RenderLayer() 	gfx/layers/client/ClientContainerLayer.h
14 	XUL 	mozilla::layers::ClientLayerManager::EndTransactionInternal(void (*)(mozilla::layers::ThebesLayer*, gfxContext*, nsIntRegion const&, mozilla::layers::DrawRegionClip, nsIntRegion const&, void*), void*, mozilla::layers::LayerManager::EndTransactionFlags) 	gfx/layers/client/ClientLayerManager.cpp
15 	XUL 	mozilla::layers::ClientLayerManager::EndTransaction(void (*)(mozilla::layers::ThebesLayer*, gfxContext*, nsIntRegion const&, mozilla::layers::DrawRegionClip, nsIntRegion const&, void*), void*, mozilla::layers::LayerManager::EndTransactionFlags) 	gfx/layers/client/ClientLayerManager.cpp
16 	XUL 	nsDisplayList::PaintForFrame(nsDisplayListBuilder*, nsRenderingContext*, nsIFrame*, unsigned int) const 	layout/base/nsDisplayList.cpp
17 	XUL 	nsLayoutUtils::PaintFrame(nsRenderingContext*, nsIFrame*, nsRegion const&, unsigned int, unsigned int) 	layout/base/nsDisplayList.cpp
18 	XUL 	PresShell::Paint(nsView*, nsRegion const&, unsigned int) 	layout/base/nsPresShell.cpp
19 	XUL 	nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) 	view/src/nsViewManager.cpp
20 	XUL 	nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) 	view/src/nsViewManager.cpp
21 	XUL 	nsRefreshDriver::Tick(long long, mozilla::TimeStamp) 	layout/base/nsRefreshDriver.cpp
22 	XUL 	nsRefreshDriver::FinishedWaitingForTransaction() 	layout/base/nsRefreshDriver.cpp
23 	XUL 	mozilla::layers::CompositorChild::RecvDidComposite(unsigned long long const&, unsigned long long const&) 	gfx/layers/ipc/CompositorChild.cpp
24 	XUL 	mozilla::layers::PCompositorChild::OnMessageReceived(IPC::Message const&) 	obj-firefox/x86_64/ipc/ipdl/PCompositorChild.cpp
25 	XUL 	mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) 	ipc/glue/MessageChannel.cpp
26 	XUL 	mozilla::ipc::MessageChannel::OnMaybeDequeueOne() 	ipc/glue/MessageChannel.cpp
27 	XUL 	MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) 	ipc/chromium/src/base/message_loop.cc
28 	XUL 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc
29 	XUL 	mozilla::ipc::DoWorkRunnable::Run() 	ipc/glue/MessagePump.cpp
30 	XUL 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
31 	XUL 	NS_ProcessPendingEvents(nsIThread*, unsigned int) 	xpcom/glue/nsThreadUtils.cpp
32 	XUL 	nsBaseAppShell::NativeEventCallback() 	widget/xpwidgets/nsBaseAppShell.cpp
33 	XUL 	nsAppShell::ProcessGeckoEvents(void*) 	widget/cocoa/nsAppShell.mm
34 	CoreFoundation 	CoreFoundation@0x7f661 	
35 	CoreFoundation 	CoreFoundation@0x70d12 	
36 	CoreFoundation 	CoreFoundation@0x7049f
No longer blocks: core-e10s
Is this still reproducible? I can't repro on my linux laptop.
I can no longer repro (this particular) crash when zooming the Mozilla Location Service map.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Blocks: old-e10s-m2
No longer depends on: old-e10s-m2
I can repro this e10s crash again. Same STR on OS X: zoom map in and out for about 10 seconds.

bp-d6379d74-5c53-4e99-bc67-b9f4c2140922
bp-50b7d16f-9f7a-4c6b-85cc-068f12140923
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
I just got this crash as well in e10s, while watching a YouTube video: https://crash-stats.mozilla.com/report/index/d437ea25-48d8-4fb1-a6e1-b21912141023
Assignee: nobody → nical.bugzilla
Looks like there are two ways MessageChannel::Send can return false:

http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#503

If it's an actor id mismatch it's going to be a bit tough since I don't know how that works, and also we are creating an actor so I don't know what the mismatch would be about.
If it's because the connection closed somehow, fixing this crash shouldn't be hard, but it'll probably move the problem to why the connection closed (OOM, maybe?). Creating texture actors happens a lot, so it's not surprising if it happens to be the one message that races with losing the IPDL connection. Just speculation at this point.
This check can't hurt, since there is no way SendPTextureConstructor can do anything but crash when the IPDL connection is closed. If it fixes the crash (or moves it to another stack), it'll tell us that we are losing the connection, which is a problem of its own; if it doesn't, then we know that it's an actor mismatch.
Attachment #8522212 - Flags: review?(bjacob)
Comment on attachment 8522212 [details] [diff] [review]
don't try to create the actor if the message channel is closed

Review of attachment 8522212 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/ShadowLayers.cpp
@@ +812,5 @@
>                                      TextureFlags aFlags)
>  {
>    if (!HasShadowManager() ||
> +      !mShadowManager->IPCOpen()
> +      !mShadowManager->GetIPCChannel()->Connected()) {

Missing ||   so this doesn't compile.

If this fixes the bug, then doesn't this imply that this new GetIPCChannel()->Connected() condition should be tested by IPCOpen()?
Attachment #8522212 - Flags: review?(bjacob) → review-
Attached patch: updated patch
Woops, forgot to qrefresh before submitting. I agree that IPCOpen would be a better place, although I was hesitant to go that way since it is (IIRC) specific to the shutdown sequence and I wasn't sure I should touch that part. That said I think it's pretty safe to put it there.
This patch has the check in IPCOpen
Attachment #8522212 - Attachment is obsolete: true
Attachment #8522273 - Flags: review?(bjacob)
Attachment #8522273 - Flags: review?(bjacob) → review+
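The updated patch itself is not shown in this thread. As a rough illustration of the idea discussed in the review (folding the "is the channel still connected" test into IPCOpen() so callers only need one guard), here is a minimal sketch. FakeChannel, FakeLayerTransactionChild and CanCreateTexture are hypothetical stand-ins invented for this example; they are not the real Gecko classes and do not reproduce the landed change.

```cpp
#include <cassert>

// Hypothetical stand-in for an IPC channel; NOT the real MessageChannel.
struct FakeChannel {
  bool connected = true;
  bool Connected() const { return connected; }
};

// Hypothetical stand-in for the layer transaction actor.
class FakeLayerTransactionChild {
 public:
  explicit FakeLayerTransactionChild(FakeChannel* aChannel)
      : mChannel(aChannel) {}

  // Models the actor being torn down during shutdown.
  void MarkDestroyed() { mIPCOpen = false; }

  // The actor is usable only if it has not been destroyed AND the
  // underlying channel is still connected.
  bool IPCOpen() const {
    return mIPCOpen && mChannel && mChannel->Connected();
  }

 private:
  FakeChannel* mChannel;
  bool mIPCOpen = true;
};

// Caller-side guard: skip actor construction instead of sending a
// constructor message over a dead channel.
bool CanCreateTexture(const FakeLayerTransactionChild* aManager) {
  return aManager && aManager->IPCOpen();
}
```

With a guard of this shape, a caller such as CreateTexture() would return early when the connection is gone rather than aborting inside the generated Send code. Whether that hides a deeper problem (why the channel closed) is the open question raised above.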
Sorry, had to back this out for test failures like https://treeherder.mozilla.org/ui/logviewer.html#?job_id=3899338&repo=mozilla-inbound
Flags: needinfo?(nical.bugzilla)
I'm getting this crash on e10s just by using and tab-switching between IRCCloud and Bugzilla.
I know there's already a patch in here, but I also read a comment suggesting that we might not know the underlying cause of the crash.

I was able to get a stack:

* thread #1: tid = 0x70ea01, 0x000000010007431d libmozalloc.dylib`mozalloc_abort(msg=0x00007fff5fbf8088) + 93 at mozalloc_abort.cpp:37, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000010007431d libmozalloc.dylib`mozalloc_abort(msg=0x00007fff5fbf8088) + 93 at mozalloc_abort.cpp:37
    frame #1: 0x00000001006884f5 XUL`Abort(aMsg=0x00007fff5fbf8088) + 21 at nsDebugImpl.cpp:469
    frame #2: 0x0000000100687f50 XUL`NS_DebugBreak(aSeverity=3, aStr=0x0000000000000000, aExpr=0x0000000000000000, aFile=0x00000001072dd314, aLine=221) + 1232 at nsDebugImpl.cpp:426
    frame #3: 0x0000000100d8a97a XUL`mozilla::Logger::~Logger(this=0x00007fff5fbf8640) + 330 at logging.cc:47
    frame #4: 0x0000000100d8a825 XUL`mozilla::Logger::~Logger(this=0x00007fff5fbf8640) + 21 at logging.cc:14
    frame #5: 0x00000001006c8a15 XUL`mozilla::LogWrapper::~LogWrapper(this=0x00007fff5fbf8640) + 21 at logging.h:59
    frame #6: 0x00000001006c89f5 XUL`mozilla::LogWrapper::~LogWrapper(this=0x00007fff5fbf8640) + 21 at logging.h:59
    frame #7: 0x0000000100d660b3 XUL`base::SharedMemory::CreateOrOpen(this=0x000000012ddaf8f8, name=0x00007fff5fbf87f0, posix_flags=514, size=274432) + 963 at shared_memory_posix.cc:221
    frame #8: 0x0000000100d65ca8 XUL`base::SharedMemory::Create(this=0x000000012ddaf8f8, cname=0x00007fff5fbf8850, read_only=false, open_existing=false, size=274432) + 184 at shared_memory_posix.cc:79
    frame #9: 0x0000000100e12bee XUL`mozilla::ipc::SharedMemoryBasic::Create(this=0x000000012ddaf8d0, aNbytes=274432) + 94 at SharedMemoryBasic_chromium.h:40
    frame #10: 0x0000000100dfa373 XUL`mozilla::ipc::CreateSegment(aNBytes=274432, aHandle=mozilla::ipc::SharedMemoryBasic::Handle at 0x00007fff5fbf8900) + 227 at Shmem.cpp:145
    frame #11: 0x0000000100dfa0d7 XUL`mozilla::ipc::Shmem::Alloc(=IHadBetterBeIPDLCodeCallingThis_OtherwiseIAmADoodyhead at 0x00007fff5fbf89d8, aNBytes=262160, aType=TYPE_BASIC, aUnsafe=true, aProtect=false) + 295 at Shmem.cpp:365
    frame #12: 0x0000000100f8db5e XUL`mozilla::layers::PCompositorChild::CreateSharedMemory(this=0x00000001167df400, aSize=262160, aType=TYPE_BASIC, aUnsafe=true, aId=0x00007fff5fbf8ba4) + 110 at PCompositorChild.cpp:649
    frame #13: 0x0000000100f8dd4c XUL`_ZThn16_N7mozilla6layers16PCompositorChild18CreateSharedMemoryEmNS_3ipc12SharedMemory16SharedMemoryTypeEbPi(this=0x00000001167df410, aSize=262160, aType=TYPE_BASIC, aUnsafe=true, aId=0x00007fff5fbf8ba4) + 76 at UnifiedProtocols4.cpp:667
    frame #14: 0x00000001011a5df4 XUL`mozilla::layers::PLayerTransactionChild::CreateSharedMemory(this=0x0000000111a51f50, aSize=262160, aType=TYPE_BASIC, aUnsafe=true, aId=0x00007fff5fbf8ba4) + 84 at PLayerTransactionChild.cpp:706
    frame #15: 0x00000001011a7793 XUL`mozilla::layers::PLayerTransactionChild::AllocUnsafeShmem(this=0x0000000111a51f50, aSize=262160, aType=TYPE_BASIC, aOutMem=0x000000012b851358) + 83 at PLayerTransactionChild.cpp:958
    frame #16: 0x0000000101bc4580 XUL`mozilla::layers::ShadowLayerForwarder::AllocUnsafeShmem(this=0x0000000117b38680, aSize=262160, aType=TYPE_BASIC, aShmem=0x000000012b851358) + 192 at ShadowLayers.cpp:725
    frame #17: 0x0000000101b4f3c1 XUL`mozilla::layers::ShmemTextureClient::Allocate(this=0x000000012b8512e0, aSize=262160) + 177 at TextureClient.cpp:585
    frame #18: 0x0000000101b4fef2 XUL`mozilla::layers::BufferTextureClient::AllocateForSurface(this=0x000000012b8512e0, aSize=mozilla::gfx::IntSize at 0x00007fff5fbf8cb8, aFlags=ALLOC_DEFAULT) + 354 at TextureClient.cpp:718
    frame #19: 0x0000000101b47d37 XUL`mozilla::layers::TextureClient::CreateForDrawing(aAllocator=0x0000000117b38680, aFormat=B8G8R8X8, aSize=mozilla::gfx::IntSize at 0x00007fff5fbf8d68, aMoz2DBackend=COREGRAPHICS, aTextureFlags=IMMEDIATE_UPLOAD, aAllocFlags=ALLOC_DEFAULT) + 535 at TextureClient.cpp:395
    frame #20: 0x0000000101b6511c XUL`mozilla::layers::TextureClientPool::GetTextureClient(this=0x000000011fd69ac0) + 412 at TextureClientPool.cpp:65
    frame #21: 0x0000000101b69821 XUL`mozilla::layers::TileClient::GetBackBuffer(this=0x00007fff5fbf99f8, aDirtyRegion=0x00007fff5fbf9708, aContent=COLOR, aMode=SURFACE_COMPONENT_ALPHA, aCreatedTextureClient=0x00007fff5fbf9727, aAddPaintedRegion=0x00007fff5fbf96e0, aCanRerasterizeValidRegion=false, aBackBufferOnWhite=0x00007fff5fbf96d8) + 753 at TiledContentClient.cpp:752
    frame #22: 0x0000000101b6b13a XUL`mozilla::layers::ClientTiledLayerBuffer::ValidateTile(this=0x000000012b676c78, aTile=0x00007fff5fbf99f8, aTileOrigin=0x00007fff5fbf99f0, aDirtyRegion=0x00007fff5fbf9d68) + 522 at TiledContentClient.cpp:1105
    frame #23: 0x0000000101b88d37 XUL`mozilla::layers::TiledLayerBuffer<mozilla::layers::ClientTiledLayerBuffer, mozilla::layers::TileClient>::Update(this=0x000000012b676c78, aNewValidRegion=0x0000000126d8de00, aPaintRegion=0x00007fff5fbfa660) + 3655 at TiledLayerBuffer.h:501
    frame #24: 0x0000000101b6a68b XUL`mozilla::layers::ClientTiledLayerBuffer::PaintThebes(this=0x000000012b676c78, aNewValidRegion=0x0000000126d8de00, aPaintRegion=0x00007fff5fbfa660, aCallback=0x0000000103e0fd30, aCallbackData=0x00007fff5fbfb300)(mozilla::layers::PaintedLayer*, gfxContext*, nsIntRegion const&, mozilla::layers::DrawRegionClip, nsIntRegion const&, void*), void*) + 1227 at TiledContentClient.cpp:959
    frame #25: 0x0000000101b46e34 XUL`mozilla::layers::ClientTiledPaintedLayer::RenderLayer(this=0x0000000126d8dc00) + 1204 at ClientTiledPaintedLayer.cpp:380
    frame #26: 0x0000000101b4734c XUL`_ZThn544_N7mozilla6layers23ClientTiledPaintedLayer11RenderLayerEv(this=0x0000000126d8de20) + 28 at Unified_cpp_gfx_layers2.cpp:471
    frame #27: 0x0000000101b577b5 XUL`mozilla::layers::ClientLayer::RenderLayerWithReadback(this=0x0000000126d8de20, aReadback=0x00007fff5fbfa798) + 37 at ClientLayerManager.h:370
    frame #28: 0x0000000101b62cb3 XUL`mozilla::layers::ClientContainerLayer::RenderLayer(this=0x000000011df23000) + 355 at ClientContainerLayer.h:69
    frame #29: 0x0000000101b62e1c XUL`_ZThn536_N7mozilla6layers20ClientContainerLayer11RenderLayerEv(this=0x000000011df23218) + 28 at Unified_cpp_gfx_layers2.cpp:76
    frame #30: 0x0000000101b42cba XUL`mozilla::layers::ClientLayerManager::EndTransactionInternal(this=0x0000000112b29400, aCallback=0x0000000103e0fd30, aCallbackData=0x00007fff5fbfb300, =END_DEFAULT)(mozilla::layers::PaintedLayer*, gfxContext*, nsIntRegion const&, mozilla::layers::DrawRegionClip, nsIntRegion const&, void*), void*, mozilla::layers::LayerManager::EndTransactionFlags) + 378 at ClientLayerManager.cpp:268
    frame #31: 0x0000000101b42e9b XUL`mozilla::layers::ClientLayerManager::EndTransaction(this=0x0000000112b29400, aCallback=0x0000000103e0fd30, aCallbackData=0x00007fff5fbfb300, aFlags=END_DEFAULT)(mozilla::layers::PaintedLayer*, gfxContext*, nsIntRegion const&, mozilla::layers::DrawRegionClip, nsIntRegion const&, void*), void*, mozilla::layers::LayerManager::EndTransactionFlags) + 91 at ClientLayerManager.cpp:298
    frame #32: 0x0000000103e96047 XUL`nsDisplayList::PaintRoot(this=0x00007fff5fbfb258, aBuilder=0x00007fff5fbfb300, aCtx=0x0000000000000000, aFlags=13) + 2839 at nsDisplayList.cpp:1444
    frame #33: 0x0000000103ec8d30 XUL`nsLayoutUtils::PaintFrame(aRenderingContext=0x0000000000000000, aFrame=0x000000011fc14458, aDirtyRegion=0x00007fff5fbfbc30, aBackstop=0, aFlags=772) + 4272 at nsLayoutUtils.cpp:3152
    frame #34: 0x0000000103f2328f XUL`PresShell::Paint(this=0x000000011dcad000, aViewToPaint=0x000000011675b970, aDirtyRegion=0x00007fff5fbfbc30, aFlags=1) + 2143 at nsPresShell.cpp:6337
    frame #35: 0x00000001039ef2ce XUL`nsViewManager::ProcessPendingUpdatesPaint(this=0x000000011dca0880, aWidget=0x0000000112b28f80) + 526 at nsViewManager.cpp:443
    frame #36: 0x00000001039eef76 XUL`nsViewManager::ProcessPendingUpdatesForView(this=0x000000011dca0880, aView=0x000000011675b970, aFlushDirtyRegion=true) + 502 at nsViewManager.cpp:384
    frame #37: 0x00000001039eff14 XUL`nsViewManager::ProcessPendingUpdates(this=0x000000011dca0880) + 132 at nsViewManager.cpp:1075
    frame #38: 0x0000000103df6245 XUL`nsRefreshDriver::Tick(this=0x000000011dc4dc00, aNowEpoch=1416343232301476, aNowTime=TimeStamp at 0x00007fff5fbfc180) + 4645 at nsRefreshDriver.cpp:1354
    frame #39: 0x0000000103dfd08c XUL`mozilla::RefreshDriverTimer::TickDriver(driver=0x000000011dc4dc00, jsnow=1416343232301476, now=TimeStamp at 0x00007fff5fbfc1b8) + 92 at nsRefreshDriver.cpp:173
    frame #40: 0x0000000103dfcf62 XUL`mozilla::RefreshDriverTimer::Tick(this=0x0000000117bf1780) + 322 at nsRefreshDriver.cpp:164
    frame #41: 0x0000000103dfce11 XUL`mozilla::RefreshDriverTimer::TimerTick(aTimer=0x000000011dc99860, aClosure=0x0000000117bf1780) + 33 at nsRefreshDriver.cpp:190
    frame #42: 0x000000010077534a XUL`nsTimerImpl::Fire(this=0x000000011dc99860) + 986 at nsTimerImpl.cpp:621
    frame #43: 0x0000000100775761 XUL`nsTimerEvent::Run(this=0x0000000117bfd980) + 209 at nsTimerImpl.cpp:714
    frame #44: 0x0000000100770246 XUL`nsThread::ProcessNextEvent(this=0x0000000111a7e040, aMayWait=false, aResult=0x00007fff5fbfc5c3) + 2086 at nsThread.cpp:830
    frame #45: 0x00000001007c700b XUL`NS_ProcessPendingEvents(aThread=0x0000000111a7e040, aTimeout=20) + 171 at nsThreadUtils.cpp:207
    frame #46: 0x0000000103a0a409 XUL`nsBaseAppShell::NativeEventCallback(this=0x00000001166fdac0) + 201 at nsBaseAppShell.cpp:98
    frame #47: 0x0000000103a84901 XUL`nsAppShell::ProcessGeckoEvents(aInfo=0x00000001166fdac0) + 433 at nsAppShell.mm:377
    frame #48: 0x00007fff8d312b31 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
    frame #49: 0x00007fff8d312455 CoreFoundation`__CFRunLoopDoSources0 + 245
    frame #50: 0x00007fff8d3357f5 CoreFoundation`__CFRunLoopRun + 789
    frame #51: 0x00007fff8d3350e2 CoreFoundation`CFRunLoopRunSpecific + 290
    frame #52: 0x00007fff93349eb4 HIToolbox`RunCurrentEventLoopInMode + 209
    frame #53: 0x00007fff93349c52 HIToolbox`ReceiveNextEventCommon + 356
    frame #54: 0x00007fff93349ae3 HIToolbox`BlockUntilNextEventMatchingListInMode + 62
    frame #55: 0x00007fff8f1df533 AppKit`_DPSNextEvent + 685
    frame #56: 0x00007fff8f1dedf2 AppKit`-[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] + 128
    frame #57: 0x0000000103a83557 XUL`-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x00000001167b18f0, _cmd=0x00007fff8fa0d404, mask=18446744073709551615, expiration=0x422d63c37f00000d, mode=0x00007fff7ba161c0, flag='\x01') + 119 at nsAppShell.mm:118
    frame #58: 0x00007fff8f1d61a3 AppKit`-[NSApplication run] + 517
    frame #59: 0x0000000103a852a7 XUL`nsAppShell::Run(this=0x00000001166fdac0) + 167 at nsAppShell.mm:651
    frame #60: 0x00000001049e49b9 XUL`XRE_RunAppShell + 345 at nsEmbedFunctions.cpp:731
    frame #61: 0x0000000100df7754 XUL`mozilla::ipc::MessagePumpForChildProcess::Run(this=0x0000000111a21240, aDelegate=0x00007fff5fbfe048) + 196 at MessagePump.cpp:272
    frame #62: 0x0000000100d8ba55 XUL`MessageLoop::RunInternal(this=0x00007fff5fbfe048) + 117 at message_loop.cc:233
    frame #63: 0x0000000100d8b965 XUL`MessageLoop::RunHandler(this=0x00007fff5fbfe048) + 21 at message_loop.cc:226
    frame #64: 0x0000000100d8b90d XUL`MessageLoop::Run(this=0x00007fff5fbfe048) + 45 at message_loop.cc:200
    frame #65: 0x00000001049e4203 XUL`XRE_InitChildProcess(aArgc=3, aArgv=0x00007fff5fbff638, aGMPLoader=0x0000000000000000) + 2691 at nsEmbedFunctions.cpp:568
    frame #66: 0x00000001000018db plugin-container`content_process_main(argc=6, argv=0x00007fff5fbff638) + 299 at plugin-container.cpp:190
    frame #67: 0x00000001000019d2 plugin-container`main(argc=7, argv=0x00007fff5fbff638) + 34 at MozillaRuntimeMain.cpp:11
    frame #68: 0x0000000100001794 plugin-container`start + 52
Opening the following URL in an e10s build crashes quite often for me:

http://12illustrations.com/post/32805685485/tmnt-by-chris-uminga

If it doesn't crash at first just force-reload a few times until it does. Currently disabling a few things to see if some add-on causes it.
This would be a better fix if the issue is that we are using the channel after it's shut down, and is good to take even if not.
Flags: needinfo?(nical.bugzilla)
Attachment #8528689 - Flags: review?(jmuizelaar)
Attachment #8528689 - Flags: review?(jmuizelaar) → review+
I haven't been able to reproduce this crash either with or without my (speculative) patch. Is anyone able to reproduce this on today's nightly?
Yes - I can still reproduce it using the steps in duped bug 1100902:

STR:

1.) Enable e10s and visit https://www.mozilla.org/en-US/firefox/os/

2.) Scroll up/down the page
I can also reproduce using the steps in comment 19.
I can reliably crash my Nightlies visiting https://www.mozilla.org/en-US/firefox/os/ but I can't get a custom build to fail, not even with the default profile I use for Nightly. So far I have failed to produce a custom build with the Nightly build config, but maybe that would help us reproduce locally?
Running:

/Applications/FirefoxNightly.app/Contents/MacOS/firefox -P default

I can't make the content process crash. If, however, I click the "Nightly" icon that's pinned to the dock and then navigate to the fxos page, I can crash it reliably. That's the same executable with the same profile...
Pinned my custom build to the dock and can reliably crash the content process. Nicolas, can you reproduce this with the new information provided?
Flags: needinfo?(nical.bugzilla)
It also seems to crash more reliably with more tabs open. So to reproduce it's best to open 50 GitHub (or something) tabs in a window and open the FxOS page as the last, and then scroll up and down.
It really does only happen for me when Firefox is pinned to the dock. In that case I even see artifacts when scrolling, I don't see those when executing "mach run".
TextureClientPool::GetTextureClient() sets the local variable |textureClient| but both of the functions can return a nullptr. With GFX_DEBUG_TRACK_CLIENTS_IN_POOL=1 we should check whether we have a textureClient at all, no?

http://mxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClientPool.cpp#107
(In reply to Tim Taubert [:ttaubert] from comment #26)
> TextureClientPool::GetTextureClient() sets the local variable
> |textureClient| but both of the functions can return a nullptr. With
> GFX_DEBUG_TRACK_CLIENTS_IN_POOL=1 we should check whether we have a
> textureClient at all, no?
> 
> http://mxr.mozilla.org/mozilla-central/source/gfx/layers/client/
> TextureClientPool.cpp#107

Looks like that's just another thing we might hit in a debug build but isn't the actual problem here. It's the backtrace in comment #13 we're hitting mostly.
When starting Firefox from the dock, RLIMIT_NOFILE=800, which seems to be a little too small with dozens of tabs open. When starting Firefox with |mach run|, RLIMIT_NOFILE=4864. That's why we don't run into this issue in the latter case. I wonder which of those is the right rlimit? And shouldn't we handle these types of failures better? Are we creating too many file handles?
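For anyone who wants to compare the limit across launch methods, a small sketch of reading it with the POSIX getrlimit() call follows. FdLimits, GetFdLimits and PrintFdLimits are hypothetical helper names made up for this example; only getrlimit() and RLIMIT_NOFILE are real.

```cpp
#include <sys/resource.h>  // getrlimit, struct rlimit, RLIMIT_NOFILE
#include <cassert>
#include <cstdio>

// Hypothetical helper type (not Gecko code): soft and hard limits on the
// number of open file descriptors for the current process.
struct FdLimits {
  rlim_t soft;
  rlim_t hard;
};

// Returns false if the limits could not be read (errno says why).
bool GetFdLimits(FdLimits* aOut) {
  assert(aOut);
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    return false;
  }
  aOut->soft = rl.rlim_cur;
  aOut->hard = rl.rlim_max;
  return true;
}

// Prints both limits; a value equal to RLIM_INFINITY means "no limit".
void PrintFdLimits() {
  FdLimits limits;
  if (!GetFdLimits(&limits)) {
    std::perror("getrlimit");
    return;
  }
  std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
              static_cast<unsigned long long>(limits.soft),
              static_cast<unsigned long long>(limits.hard));
}
```

Calling PrintFdLimits() early in a process started from the dock and in one started from a terminal should make a difference like the one described above visible, assuming the limit is inherited from the launching environment.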
Good catch. I couldn't reproduce this on the 10.10 Mac I borrowed; which version are you using? 800 isn't a lot of file descriptors. We typically have one fd per texture we share with the compositor, so at least 2 per layer, more with tiling (1 or 2 per tile), and that's only gfx.
Flags: needinfo?(nical.bugzilla)
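To get a feel for how quickly "1 or 2 fds per tile" adds up, here is a back-of-the-envelope sketch. Everything numeric in it is an assumption for illustration: the 256-pixel tile size is only inferred from the 262160-byte allocations in the lldb backtrace earlier in this bug (256 x 256 x 4 bytes = 262144, plus 16), and EstimateTileFds is a hypothetical helper, not Gecko code.

```cpp
#include <cassert>

// Illustrative estimate only; the inputs are assumptions, not measurements.
// Returns the number of tiles needed to cover an aWidth x aHeight area,
// multiplied by the number of shared buffers (and thus fds) held per tile.
long EstimateTileFds(long aWidth, long aHeight, long aTileSize,
                     long aBuffersPerTile) {
  assert(aWidth > 0 && aHeight > 0 && aTileSize > 0 && aBuffersPerTile > 0);
  long cols = (aWidth + aTileSize - 1) / aTileSize;   // ceiling division
  long rows = (aHeight + aTileSize - 1) / aTileSize;  // ceiling division
  return cols * rows * aBuffersPerTile;
}
```

For a hypothetical 2880x1800 layer with 256-pixel tiles and two buffers per tile, EstimateTileFds(2880, 1800, 256, 2) works out to 12 * 8 * 2 = 192 descriptors, which would already be about a quarter of an 800-descriptor budget for a single layer.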
I used Nightly for testing, and fx-team tip for custom builds.
Oh and OS X 10.9
This seems to be beyond graphics; let me find a better person to assign to.
Assignee: nical.bugzilla → milan
CatLee, can we explain comment 29 with the RLIMIT_NOFILE being different by a factor of 5 between the different builds?
Flags: needinfo?(catlee)
As catlee pointed out, comment 23 explains this limit is not a part of the build.
Flags: needinfo?(catlee)
(In reply to Tim Taubert [:ttaubert] from comment #27)
> (In reply to Tim Taubert [:ttaubert] from comment #26)
> > TextureClientPool::GetTextureClient() sets the local variable
> > |textureClient| but both of the functions can return a nullptr. With
> > GFX_DEBUG_TRACK_CLIENTS_IN_POOL=1 we should check whether we have a
> > textureClient at all, no?
> > 
> > http://mxr.mozilla.org/mozilla-central/source/gfx/layers/client/
> > TextureClientPool.cpp#107
> 
> Looks like that's just another thing we might hit in a debug build but isn't
> the actual problem here...

I created bug 1109828 to take care of this.
When running |mach run| from iTerm I see RLIMIT_NOFILE=4864. That seems to come from iTerm somehow; Terminal.app doesn't behave the same way.

RLIMIT_NOFILE=800 is explained easily, as 256 is the default value on OS X. After nsSocketTransportService::DiscoverMaxCount() has been called once, we bump it up to 800 (bug 607741 + 250) and it stays there forever.

If 800 is a low limit, should we consider increasing it? If the dup() call in our IPC code is the only path hitting it should we try and raise it lazily when receiving EMFILE?

Adding Patrick who seems to know about socket/file limits.
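The "raise it lazily when receiving EMFILE" idea could look roughly like the sketch below. RaiseFdSoftLimit and DupWithLazyRaise are hypothetical names invented for this example, and the target value is left to the caller; this is not the code in our IPC layer, only the POSIX calls (getrlimit, setrlimit, dup) are real.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_NOFILE
#include <unistd.h>        // dup
#include <cassert>
#include <cerrno>

// Hypothetical helper: try to raise the soft fd limit to aTarget, clamped to
// the hard limit. Returns true only if the soft limit actually went up.
// Note: some systems reject large values even below the hard limit, in
// which case setrlimit() fails and this returns false.
bool RaiseFdSoftLimit(rlim_t aTarget) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    return false;
  }
  rlim_t wanted = aTarget;
  if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max) {
    wanted = rl.rlim_max;
  }
  if (wanted <= rl.rlim_cur) {
    return false;  // nothing to gain
  }
  rl.rlim_cur = wanted;
  return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}

// Hypothetical wrapper: dup() the descriptor and, if the process is out of
// fds (EMFILE), raise the soft limit once and retry.
int DupWithLazyRaise(int aFd, rlim_t aTarget) {
  assert(aFd >= 0);
  int fd = dup(aFd);
  if (fd < 0 && errno == EMFILE && RaiseFdSoftLimit(aTarget)) {
    fd = dup(aFd);
  }
  return fd;  // still -1 if the retry failed or no raise was possible
}
```

A wrapper like this only postpones the failure: once the soft limit reaches the hard limit the caller still has to handle -1 without aborting.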
I'm a little worried about just increasing the limit. On my Linux system, the hard limit (ulimit -Hn) is 4096. I don't think we're allowed to use more than that without user intervention.

If a normal website requires more than 800 file descriptors, and if we can never exceed 4096, then it would be very easy for a badly behaving site to cause lots of problems. Is there a way that the graphics code could use fewer (but larger) textures?
Here's the weird thing about socket numbers: the select() API will basically crash if you pass it an fd with a value >= FD_SETSIZE, because the fd value is an offset into a bit array of FD_SETSIZE bits. On Winsock that's 1024, which is controlled deep in NSPR by setting a define before including some Windows headers.

Now, Gecko is smart enough not to use select(), but history has told us our fds end up in places other than Gecko code: add-ons and especially LSPs (third-party network stack hooks on Windows). That code does indeed call select() on the fd. Boom.

And fds are a process-global resource, of course, so if graphics starts using a large number of them (that never see an LSP, thankfully), that can still result in networking later allocating one with a high value, which finds its way into these processes.

so that's the backstory.

Easy thing 1 - bump up the 800 number to whatever we can probe the OS for at max on non-windows. I'm assuming this is a windows bug though too, right? This won't cost RAM unless we actually use them.

Other idea - bump up the NSPR limit. I do worry about the compatibility of this - FD_SETSIZE needs to be #defined universally before using the select() calls and we have no way to control that inside a third-party piece of code. Our value is technically customized now, but it's a pretty common value. Maybe I worry too much - select() has been out of style for many years now.
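To make the FD_SETSIZE hazard concrete, this is the kind of guard that code calling select() would need, shown for the POSIX form of fd_set only (the Winsock layout differs). SafeFdSet is a hypothetical helper written for this example; third-party code generally has no such check, which is the worry described above.

```cpp
#include <sys/select.h>  // fd_set, FD_SET, FD_ISSET, FD_ZERO, FD_SETSIZE
#include <cassert>

// Hypothetical guard: FD_SET on a descriptor outside [0, FD_SETSIZE) has
// undefined behavior on POSIX systems, so refuse instead of writing past
// the end of the set.
bool SafeFdSet(int aFd, fd_set* aSet) {
  assert(aSet);
  if (aFd < 0 || aFd >= FD_SETSIZE) {
    return false;  // caller must fall back to poll() or similar
  }
  FD_SET(aFd, aSet);
  return true;
}
```

Raising the process fd limit makes descriptor values at or above FD_SETSIZE more likely, so any select()-based code in the process that lacks a check like this becomes more exposed.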
(In reply to Patrick McManus [:mcmanus] from comment #40)
> Easy thing 1 - bump up the 800 number to whatever we can probe the OS for at
> max on non-windows. I'm assuming this is a windows bug though too, right?

ttaubert and I can reproduce this crash on OS X. I don't know if it happens on Windows at all.
Is it curious that this shows up under e10s, where we should have more processes and thus more file descriptors overall? Or is it because IPC is now cross-process that we use more of them?

I'm pretty sure we ran into something similar on B2G and had to up the max number to 1024, so that may be the first thing to try: go from 800 to 1024 and see if it helps.

Also, I wonder if a higher FD limit is just masking a problem. Perhaps increasing that number reduces the chance of us reusing the same file descriptor, and saves us from operating on an FD we shouldn't be, or something like that.
Any chance of progress on this one? I typically hit this once or twice a day. "A tab crashed" but affects all tabs...

Latest crash:
https://crash-stats.mozilla.com/report/index/bp-6e7c4751-4449-4ed8-8b0e-29b462141218

Happened while browsing on zillow.com, zooming out around Petaluma, CA. That'll teach me to think of any place other than Petaluma...

OSX 10.8.5, Retina MacBook Pro
Other tabs included moco gmail, bugzilla
Re-nomming because this makes browsing some web properties pretty much impossible with e10s enabled.
[Tracking Requested - why for this release]: regression in 35. Makes e10s terrible for some use cases
Assignee: milan → gwright
Flags: needinfo?(gwright)
I'm having trouble reproducing this, having tried both comment 1 and comment 18's steps. This is on OS X with a release build. I've also tried a fresh e10s profile and tested with lots of tabs open.

Any other suggestions? I'll keep trying different approaches for now, but if anyone can come up with a semi-reliable method for reproducing, that'd be great. I'm on a 13" rMBP with Intel graphics running 10.9.5, for what it's worth.
So I also just tried reproducing this with no success... It's true I haven't seen this one for a while, so perhaps something else that landed in platform fixed it.
I haven't seen this bug in a long time. When I filed the bug, I could easily reproduce the crash within 5 seconds, but I can no longer reproduce it. I don't know if this is relevant, but I was running OS X 10.9 when I could reproduce the bug and now I'm running OS X 10.10.

We can probably close this bug as WFM.
Status: REOPENED → NEW
Keywords: reproducible
George, I'll let you close it as WFM as you're assigned to it.
WFM!
Status: NEW → RESOLVED
Closed: 10 years ago → 9 years ago
Flags: needinfo?(gwright)
Resolution: --- → WORKSFORME
Just loading Google maps still triggers this for me. I don't see it in a new profile, though. I will try to figure out what in my profile might be triggering it.
I also hit this a ton today again. :(

George, if you're in the office tomorrow, come find me and I'll reproduce for you.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #52)
> Just loading Google maps still triggers this for me. I don't see it in a new
> profile, though. I will try to figure out what in my profile might be
> triggering it.

Guess the number of tabs in your old profile is a lot higher? See comment #39 and before.

This problem won't just go away because we're not actually crashing, we just run out of fds...
(In reply to Tim Taubert [:ttaubert] from comment #54)
> (In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #52)
> > Just loading Google maps still triggers this for me. I don't see it in a new
> > profile, though. I will try to figure out what in my profile might be
> > triggering it.
> 
> Guess the number of tabs in your old profile is a lot higher? See comment
> #39 and before.
> 
> This problem won't just go away because we're not actually crashing, we just
> run out of fds...

So the cause behind this crash is known / understood then?

I have hit this again, following gw280's request for me to try to reproduce with just my integrated graphics card forced on (via gfxCardStatus). I can reproduce if I have a high number of tabs loaded, and scroll quickly on https://www.mozilla.org/en-US/firefox/os/.

Is this something that just comes with the territory when using a single content process for so many tabs?
Flags: needinfo?(ttaubert)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #55)
> (In reply to Tim Taubert [:ttaubert] from comment #54)
> > (In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #52)
> > This problem won't just go away because we're not actually crashing, we just
> > run out of fds...
> 
> So the cause behind this crash is known / understood then?

Yeah, we run out of file descriptors: RLIMIT_NOFILE=800 for Firefox on Mac. This number is much higher if you start with ./mach from iTerm, for example, so it was hard to reproduce at first.

> Is this something that just comes with the territory when using a single
> content process for so many tabs?

It seems that all the graphics layer IPC can cause us to try to use >800 fds and we don't (can't easily?) handle that without crashing. I'm not sure but we might have 800 fds per child process - so the chance of running into it then might be lower. For power users with a ton of tabs/groups this still might be easy to hit.
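
For anyone who wants to compare the descriptor limit between a normal launch and a terminal launch, here is a minimal sketch of reading the soft and hard limits with the POSIX getrlimit() call. ReadFdLimits and PrintFdLimits are hypothetical helper names for illustration only; nothing like them is claimed to exist in the tree.

```cpp
#include <sys/resource.h>  // getrlimit, struct rlimit, RLIMIT_NOFILE
#include <cstdio>          // std::printf, std::perror

// Hypothetical helper (not Gecko code): reads the soft and hard limits on
// open file descriptors for the current process. Returns false on failure.
bool ReadFdLimits(rlim_t* aSoft, rlim_t* aHard) {
  struct rlimit limits;
  if (getrlimit(RLIMIT_NOFILE, &limits) != 0) {
    return false;
  }
  *aSoft = limits.rlim_cur;  // soft limit: the one the process actually hits
  *aHard = limits.rlim_max;  // hard limit: ceiling for raising the soft limit
  return true;
}

// Hypothetical helper: prints both limits so two launch methods can be compared.
void PrintFdLimits() {
  rlim_t soft = 0;
  rlim_t hard = 0;
  if (ReadFdLimits(&soft, &hard)) {
    std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
                (unsigned long long)soft, (unsigned long long)hard);
  } else {
    std::perror("getrlimit");
  }
}
```

The soft limit is the value that matters for this crash; the hard limit only bounds how far an unprivileged process can raise it.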
Flags: needinfo?(ttaubert)
That sounds bad.

mconley: thanks for checking that for me. Can you give an idea of what "high number of tabs" actually means? I tried locally with around 15 and couldn't get it to crash, which led to my suspecting the way the code interacts with the NVIDIA gpu (which I don't have).
> Is this something that just comes with the territory when using a single
> content process for so many tabs?

Both the parent and the child need an fd for the shared memory region, so the chrome process would presumably still run out of file descriptors even with process-per-tab.

Why do we have so many shared memory regions? My understanding is that we throw away shared memory graphics buffers for tabs that aren't visible. And it seems pretty crazy that we need 800 shared memory regions for one web page. It really sounds like there's some sort of leak or inefficiency here that we need to fix. If that's the case, it might be more related to how long the browser has been running rather than the number of tabs open at the time.
Is this any harder to reproduce if you set preferences layers.tile-height and layers.tile-width to 512 instead of 256?
(In reply to Milan Sreckovic [:milan] from comment #59)
> Is this any harder to reproduce if you set preferences layers.tile-height
> and layers.tile-width to 512 instead of 256?

From a quick bit of testing that does seem to make this crash go away for me. I'll leave those flipped to confirm.

(I certainly have a lot of tabs in this profile, and often leave it running for >24 hours.)
There is not much of a downside to increasing the tile size on OS X. The 256x256 size was really meant for B2G, and we never followed up on the conversations to make this change. I agree it isn't the actual solution, but if it can make the problem go away, it's an easy start. There are also conversations about avoiding FD usage for shared textures (there is an alternative on OS X), and about special allocators so that, instead of #FDs == #shared textures + other stuff, we have more than one shared texture per FD, but that's a longer conversation. We could also slightly change how videos deal with shared textures.
George is following up on this.
Let's increase the tile-size for now, as I think we should do this anyway as per comment 61, to minimise the impact of this crash.

We should still work on a solution to gracefully fail when we run out of FDs though.
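
As a rough illustration of what "gracefully fail" could look like at the lowest level, the sketch below separates descriptor exhaustion (EMFILE/ENFILE) from other errors so the caller can back off instead of aborting. It uses plain open() purely for illustration; the real failure in this bug is in shared memory allocation behind the layers IPC code, which is not modelled here. FdResult and TryOpenFd are hypothetical names.

```cpp
#include <cerrno>    // errno, EMFILE, ENFILE
#include <fcntl.h>   // open, O_RDONLY
#include <unistd.h>  // close

// Hypothetical result type: lets the caller tell "out of fds" apart from
// every other failure.
enum class FdResult { Ok, OutOfFds, OtherError };

// Hypothetical helper (not Gecko code): tries to open aPath read-only.
// On success stores the descriptor in *aFd; on failure stores -1 and
// reports whether the process or system ran out of descriptors.
FdResult TryOpenFd(const char* aPath, int* aFd) {
  int fd = open(aPath, O_RDONLY);
  if (fd >= 0) {
    *aFd = fd;
    return FdResult::Ok;
  }
  *aFd = -1;
  // EMFILE: per-process limit reached. ENFILE: system-wide limit reached.
  if (errno == EMFILE || errno == ENFILE) {
    return FdResult::OutOfFds;
  }
  return FdResult::OtherError;
}
```

A caller that sees OutOfFds could, for example, skip the allocation and retry later rather than treating it as fatal; whether that is workable for texture allocation is an open question in this bug.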
Attachment #8555973 - Flags: review?(jmuizelaar)
Comment on attachment 8555973 [details] [diff] [review]
0001-Bug-1036682-Increase-the-tile-size-to-512x512-on-non.patch

Review of attachment 8555973 [details] [diff] [review]:
-----------------------------------------------------------------

It would be good to have some justification for this outside of just minimizing memory usage.

::: modules/libpref/init/all.js
@@ +3922,5 @@
>  pref("layers.tile-height", 256);
> +#else
> +pref("layers.tile-height", 512);
> +#endif
> +

This only changes the height.
Attachment #8555973 - Flags: review?(jmuizelaar) → review-
Attachment #8555973 - Attachment is obsolete: true
Attachment #8558085 - Flags: review?(jmuizelaar)
Comment on attachment 8558085 [details] [diff] [review]
0001-Bug-1036682-Bump-RLIMIT_NOFILE-to-the-hard-ceiling-o.patch

Review of attachment 8558085 [details] [diff] [review]:
-----------------------------------------------------------------

Seems reasonable.
Attachment #8558085 - Flags: review?(jmuizelaar) → review+
So on my machine at least, RLIMIT_NOFILE was 2560 before, and that explains why I was unable to reproduce this bug. The ceiling on my machine is huge (9223372036854775807) so I'm not sure if we actually want to set it to the hard ceiling. Maybe something like min(16384, nofile_max)?
Take 2, with a new cap of 16k imposed.
Attachment #8558085 - Attachment is obsolete: true
Attachment #8558096 - Flags: review?(jmuizelaar)
Comment on attachment 8558096 [details] [diff] [review]
0001-First-attempt-at-a-spinner-for-excessively-long-page.patch

Review of attachment 8558096 [details] [diff] [review]:
-----------------------------------------------------------------

Wrong patch.
Attachment #8558096 - Flags: review?(jmuizelaar) → review-
Nice job, me.
Attachment #8558096 - Attachment is obsolete: true
Attachment #8558098 - Flags: review?(jmuizelaar)
Comment on attachment 8558098 [details] [diff] [review]
0001-Bug-1036682-Bump-RLIMIT_NOFILE-to-the-hard-ceiling-o.patch

Review of attachment 8558098 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxPlatformMac.cpp
@@ +87,5 @@
> +    // buffered tiles in e10s, so let's bump the soft limit to the hard limit for the OS
> +    // up to a new cap of 16k
> +    struct rlimit limits;
> +    if (getrlimit(RLIMIT_NOFILE, &limits) == 0) {
> +        limits.rlim_cur = std::min((rlim_t)16384, limits.rlim_max);

How about rlim_t(16384) instead?
Attachment #8558098 - Flags: review?(jmuizelaar) → review+
I'm going to go ahead and resolve this now, and if we want to reduce the number of fds used by tiles we should create another bug.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
So when I run this in my local build, I get:

WARNING: Unable to bump RLIMIT_NOFILE to the maximum number on this OS: file ../../../mozilla/gfx/thebes/gfxPlatformMac.cpp, line 93

which presumably means the patch is not having any actual effect.

Do we have any actual indication that this patch ever successfully raises the RLIMIT_NOFILE value?
Flags: needinfo?(gwright)
And in particular, it's failing with EINVAL.

Looking around a bit, on recent OS X releases you can't set RLIMIT_NOFILE to anything larger than OPEN_MAX (e.g. see the "compatibility" bit in "man setrlimit" or <https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/setrlimit.2.html>).

Over here, OPEN_MAX seems to be 10240.
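
A minimal sketch of a bump that respects that constraint, assuming the approach suggested in the man page (soft limit = min(OPEN_MAX, rlim_max)). RaiseFdSoftLimit is a hypothetical name and this is not the patch attached to this bug; OPEN_MAX is only used where the platform headers define it.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, struct rlimit
#include <limits.h>        // OPEN_MAX, where the platform defines it
#include <algorithm>       // std::min

// Hypothetical helper (not the landed patch): raises the soft
// RLIMIT_NOFILE as far as the hard limit and OPEN_MAX allow.
// Returns true if the soft limit is already at or above that target,
// or if setrlimit() succeeded.
bool RaiseFdSoftLimit() {
  struct rlimit limits;
  if (getrlimit(RLIMIT_NOFILE, &limits) != 0) {
    return false;
  }

  rlim_t target = limits.rlim_max;
#ifdef OPEN_MAX
  // Asking for more than OPEN_MAX is what fails with EINVAL on OS X.
  target = std::min(target, rlim_t(OPEN_MAX));
#endif

  // Never lower a soft limit that is already high enough.
  if (target <= limits.rlim_cur) {
    return true;
  }

  limits.rlim_cur = target;
  return setrlimit(RLIMIT_NOFILE, &limits) == 0;
}
```

Calling getrlimit() again afterwards is a cheap way to confirm the new soft limit actually took effect, which is how the behaviour difference between machines was noticed here.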
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Also, did the bug on increasing tile size ever get filed?
That's odd, because it definitely increased it on my machine (10.9); I called getrlimit afterwards and checked the value had gone up, and it most certainly did.

I'll prepare a patch that accounts for OPEN_MAX.
Flags: needinfo?(gwright)
Flagging snorp as the usual suspects are on a plane or something
Attachment #8560618 - Flags: review?(snorp)
Comment on attachment 8560618 [details] [diff] [review]
OPEN_MAX-setrlimit.patch

Review of attachment 8560618 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nit

::: gfx/thebes/gfxPlatformMac.cpp
@@ +88,1 @@
>      // up to a new cap of 16k

We're not setting it to 16k anymore
Attachment #8560618 - Flags: review?(snorp) → review+
(In reply to Boris Zbarsky [:bz] from comment #77)
> Also, did the bug on increasing tile size ever get filed?

It did not. I have filed bug 1130545 now.
Boris, are you still seeing the issue where the number of fds isn't increased?
Flags: needinfo?(bzbarsky)
It's working correctly now.
Flags: needinfo?(bzbarsky)
OK, resolving so we can move the discussion for reduction of fds to bug 1130545.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED