Closed Bug 1416726 Opened 2 years ago Closed 2 years ago

mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor

Categories

(Core :: Graphics, defect, P2)

58 Branch
Unspecified
Linux
defect

Tracking


RESOLVED FIXED
mozilla61
Tracking Status
firefox-esr52 --- unaffected
firefox57 --- unaffected
firefox58 --- fixed
firefox59 --- fixed
firefox60 + fixed
firefox61 --- fixed

People

(Reporter: cg+zbmvyynohtmvyyn, Assigned: rhunt)

References

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0
Build ID: 20171112220346

Steps to reproduce:

Use the latest Nightly 58.0a1 on 64-bit Linux
Go to http://wsprnet.org
Select map
Select band "all" from dropdown
Push update button


Actual results:

Page crashes (for example https://crash-stats.mozilla.com/report/index/c8baf58f-3373-4f2b-ba8f-310150171113)



Expected results:

No crash.
Reproduced on two Linux machines with the latest Nightly.

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozilla%3A%3Aipc%3A%3AFatalError%20%7C%20mozilla%3A%3Adom%3A%3AContentChild%3A%3AFatalErrorIfNotUsingGPUProcess%20%7C%20mozilla%3A%3Alayers%3A%3APCompositorBridgeChild%3A%3ASendPTextureConstructor

Not crashing on Windows.

Using mozregression gives this result:
2017-11-12T22:21:23: DEBUG : Using url: https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?changeset=713c3225f14883b71e0d6febda505e56889f8234&full=1
2017-11-12T22:21:24: DEBUG : Found commit message: Don't do PaintThebes after PaintOffMainThread. (bug 1399692 part 10, r=dvander) MozReview-Commit-ID: J0IOzqIGRtz
2017-11-12T22:21:24: INFO : The bisection is done.
Crash Signature: [@ mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor ]
Keywords: crash, regression
OS: Unspecified → Linux
Component: Untriaged → DOM
Product: Firefox → Core
Sometimes the same procedure results in different crash signature:
[@ mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPLayerTransactionConstructor ]

For example
bp-b316a2ce-89a3-4377-ba06-cd07d0171113

Still happens on 59.0a1 nightly
Crash Signature: [@ mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor ] → [@ mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor ] [@ mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozil…
Blocks: 1399692
Component: DOM → Graphics
Flags: needinfo?(rhunt)
Priority: -- → P2
Whiteboard: [gfx-noted]
Flags: needinfo?(hshih)
This should be fixed by reverting some content client changes in bug 1416921.
Assignee: nobody → rhunt
Depends on: 1416921
Flags: needinfo?(rhunt)
Ryan, how does it look with bug 1416921 landed?
Flags: needinfo?(rhunt)
It looks like this has been fixed now.
Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(rhunt)
Flags: needinfo?(hshih)
Resolution: --- → FIXED
Do we need to land something on Beta for 58 still?
Flags: needinfo?(rhunt)
Target Milestone: --- → mozilla59
Yes, this bug and bug 1416921 are still reproducible on Beta for 58. I'll request uplift.
Flags: needinfo?(rhunt)
(In reply to Ryan Hunt [:rhunt] from comment #6)
> Yes, this bug and bug 1416921 are still reproducible on Beta for 58. I'll
> request uplift.

I was looking through the patch that landed on inbound to make sure I didn't add any fixups that aren't in the patch on Bugzilla, and it matches. But I also found a line of code that was dropped by that bug that shouldn't have been, and filed bug 1421871 about it. That should be landed and uplifted with bug 1416921.
Bugs 1416921 and 1421871 are in 58, so I'll go ahead and mark this fixed.
It seems the root cause of the crash was not fixed; the problem that was fixed had just made this one easier to trigger.

The crash can still be triggered by going to the wsprnet.org page, selecting the map, selecting "all" from the dropdown, clicking update, and after that zooming in and out repeatedly with Ctrl+mouse wheel.

New trace here: bp-13a6f565-3d13-4d09-8083-c6b670171226

I did a new bisection, this time the regression is all the way back to:

2017-12-26T19:41:39: DEBUG : Using url: https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?changeset=a19ac5da60828cdcf374e6798dc7e07b8b4c1b7d&full=1
2017-12-26T19:41:41: DEBUG : Found commit message:
Bug 1325227 - Part 10: Use blocking read locks instead of forcing a synchronous transaction when using ContentClientRemoteBuffer. r=nical

Should this bug be reopened, or entered into a new one?
Flags: needinfo?(rhunt)
Crash Signature: mozilla::layers::PCompositorBridgeChild::SendPLayerTransactionConstructor ] → mozilla::layers::PCompositorBridgeChild::SendPLayerTransactionConstructor ] [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPTexture…
I'd say this bug can be reopened. This confirms my suspicions that this wasn't exactly related to OMTP, and is a preexisting issue made worse with timing changes. 

The page linked creates many layers (I forget how many, but it may have been in the thousands), and this can cause resource failures we don't handle well.

I think it would be worth figuring out why we create so many layers, and trying to fix that.
No longer blocks: 1399692
Status: RESOLVED → REOPENED
Ever confirmed: true
Flags: needinfo?(rhunt)
Resolution: FIXED → ---
Crash Signature: mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor ] → mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor ] [@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPLayerTransaction…
Another STR that always crashes for me (on Nightly 20180129100406 and Release 58):

Go to:
http://www.erlangen.mein-abfallkalender.de/containerstandorte
Click with mouse on + icon (zoom in): crash

Clicking on - icon (zoom out) does not crash
Zooming with Ctrl+Mousewheel does not crash

bp-ee86ad44-3804-4446-b6f2-001fb0180129
I have this crash on GitHub, meetup.com, etc. (10 times in 2 days)

Happy to test anything.

Ryan, is there anything I could do to help?
Flags: needinfo?(rhunt)
This is approximately 20% of all OSX Nightly crashes for the Jan 31 Nightly, i.e. a very common crash. Though it isn't coming from all that many installations.
With my STR from above, I got this regression range:
INFO: Last good revision: bcda57aa0c3b291dcd5de5d7fd048af69dbdff4d
INFO: First bad revision: a19ac5da60828cdcf374e6798dc7e07b8b4c1b7d
INFO: Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=bcda57aa0c3b291dcd5de5d7fd048af69dbdff4d&tochange=a19ac5da60828cdcf374e6798dc7e07b8b4c1b7d
Could you provide a link for the OSX crashes? I'm having difficulty finding them on crash-stats; I only see Linux crashes.
Flags: needinfo?(continuation)
Would you happen to have a link to one of the crashes you had?

What platform was it on? OSX and Linux/Windows use different code paths, and it'd help to know whether this is related to parallel painting with tiling, which was just enabled on OSX.
Flags: needinfo?(rhunt)
(In reply to Ryan Hunt [:rhunt] from comment #15)
> Could you provide a link for the OSX crashes? I'm having difficulty
> finding them on crash-stats; I only see Linux crashes.

Hmm, yeah, that's strange. I don't see any OSX crashes, either. Maybe they got removed, or I somehow was mistaken.
Flags: needinfo?(continuation)
Some examples:
bp-f68aa17a-f9f2-4274-b420-9eed80180206
bp-0f209b46-946e-4dc2-9fbd-7f51d0180201
bp-c1edbc14-66c8-4f7d-aa8e-52e4b0180201

Linux only; this is causing my Nightly Firefox to be unusable.
Flags: needinfo?(rhunt)
Hmm, those crash signatures are:

[@ mozalloc_abort | NS_DebugBreak | mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPLayerTransactionConstructor ]

Which has a different cause from:

mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor 

Both come from IPC being down, but the SendPTextureConstructor one is from running out of shmems, and I'm not sure what's going on with the SendPLayerTransactionConstructor one. It's happening during the startup of a tab, and it looks like IPC is already down. I can't see any more information about why IPC is failing.

Has this just started happening, and if so, would it be possible to use mozregression to find when the instability started? Are you able to open any tabs successfully? If possible, log output would also help. about:support might have some clues as well.
Flags: needinfo?(rhunt)
I've tried to narrow down the SendPLayerTransactionConstructor crashes. They are more easily reproduced on recent builds, but so far I've been able to reproduce them as far back as the 2017-05-04 build.
Builds much earlier than that are too crashy to easily differentiate between the crashes (SendPLayerTransactionConstructor or SendPTextureConstructor), because the whole browser tends to crash upon entering the page.

I suspect the root cause is the same for both, bisecting including the browser crashes gets this result:

2018-02-07T11:57:31: DEBUG : Using url: https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?changeset=a19ac5da60828cdcf374e6798dc7e07b8b4c1b7d&full=1
2018-02-07T11:57:32: DEBUG : Found commit message:
Bug 1325227 - Part 10: Use blocking read locks instead of forcing a synchronous transaction when using ContentClientRemoteBuffer. r=nical

I've noticed that the SendPLayerTransactionConstructor crash is a bit different. It can be reproduced the same way, using scroll-wheel zoom on the wsprnet page. Sometimes, instead of getting the "tab crashed" page, I only get a white page that is not responsive. Switching to another tab and back, I get the spinner in the middle of the white page.
So I suspect it's the "tab crashed" page crashing instead in this case.
Maybe the previous crash on the wsprnet page took down IPC, and for some reason it's not restored when the "tab crashed" page takes over.
I find that I usually have updates pending when this crash happens to me; is it possible that we fail to launch the GPU process (because it's incompatible with the currently running Firefox or something) and then this is the result?
(In reply to Nathan Froyd [:froydnj] from comment #21)
> I find that I usually have updates pending when this crash happens to me; is
> it possible that we fail to launch the GPU process (because it's
> incompatible with the currently running Firefox or something) and then this
> is the result?

According to bug 1366808, it could theoretically happen, as the GPU process has the same build ID check as the content processes.
I just received https://crash-stats.mozilla.com/report/index/f42156f2-8786-4b3b-a08a-d53540180207 while an update was not pending (in fact, had recently restarted from an update), so maybe that's not a super-great theory.

I have a smart bookmark pointing at http://www.imcpl.org/cgi-bin/searchdirect_new.pl?restrict=catalog&q=%s with a keyword set, and entering "$KEYWORD search terms" in the URL bar (+ hitting enter, of course) is particularly likely to trigger this crash.
I keep hitting this, so some new data points:

https://crash-stats.mozilla.com/report/index/06ea525a-8d31-41fb-847c-f59010180210
https://crash-stats.mozilla.com/report/index/1e8d511e-a269-46e7-b4b0-817b60180210
https://crash-stats.mozilla.com/report/index/8bea56d4-4d47-4b34-9cc2-63ae70180210

Crashes within a minute of starting the browser, and:

https://crash-stats.mozilla.com/report/index/6b8566fd-e445-4190-93fc-4c6d50180210
https://crash-stats.mozilla.com/report/index/8c6df4f1-6a93-4cc5-9680-7e0ec0180210
https://crash-stats.mozilla.com/report/index/04913626-0637-4558-9698-c431a0180210

are all crashes within five seconds of starting the browser.  Are we sure there's not some sort of race in starting the GPU process, where one part of the code thinks we are using the GPU process, but another part of the code disagrees?
There have been race conditions in this part of the code for sure. But on Linux the GPU process should be disabled (and I see that in the crash reports), which means that we shouldn't be ignoring this IPC error in this case.

For some reason the MessageChannel to the compositor is down right away after startup in these crashes. Either we never connected, or the connection was broken for some reason. I don't think it's likely we never connected, because in that case we'd crash much sooner than paint, most likely in TabChild::InitRendering.

For the crashes with an STR earlier in this bug, the connection went down because of a huge number of PaintedLayers being created, but that shouldn't be the case with these latest crashes, as they happen right at startup.

Andrew, do you have any idea of what could be going on here or a diagnostic we could add to get more information?
Flags: needinfo?(aosmond)
From my own testing using the STR from comment 11, the parent and the content process split the file handles between them. CrossProcessSemaphore appears not to release its file handles after being shared. In short, we create so many of them that we run out of file handles. (CrossProcessMutex has a similar flaw.) This matches the push log.
Flags: needinfo?(aosmond)
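Andrew's finding can be illustrated outside of Gecko. This is a minimal sketch, not Gecko code: it simulates "one open file descriptor per layer read lock" with pipes (each pipe costs two fds), lowers the soft RLIMIT_NOFILE so exhaustion happens quickly, and counts how many "locks" succeed before the OS refuses with EMFILE. It assumes a POSIX system, matching the Linux-only reports in this bug.

```cpp
// Sketch (assumption: POSIX/Linux): per-layer fd allocation vs. RLIMIT_NOFILE.
#include <sys/resource.h>
#include <unistd.h>
#include <vector>

// Temporarily lower the soft fd limit, then allocate one "read lock"
// (simulated by a pipe, i.e. two fds) per layer until allocation fails.
// Returns how many locks were created before exhaustion.
int count_locks_before_exhaustion(rlim_t soft_limit) {
  struct rlimit old_lim;
  getrlimit(RLIMIT_NOFILE, &old_lim);

  struct rlimit lim = old_lim;
  lim.rlim_cur = soft_limit;
  setrlimit(RLIMIT_NOFILE, &lim);

  std::vector<int> fds;
  int created = 0;
  for (;;) {
    int pipefd[2];
    if (pipe(pipefd) != 0) {
      break;  // EMFILE: the per-process fd limit is exhausted
    }
    fds.push_back(pipefd[0]);
    fds.push_back(pipefd[1]);
    ++created;
  }

  for (int fd : fds) close(fd);
  setrlimit(RLIMIT_NOFILE, &old_lim);  // restore the original limit
  return created;
}
```

With a soft limit of 64, only a few dozen such locks fit before allocation fails, which is why a page creating thousands of layers hits the wall so reliably.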
I'm just gonna summarize the findings we have here so I don't forget over the long weekend.

The problem is that we are running out of file handles from creating so many PaintedLayers, which use blocking read locks, which in turn use CrossProcessSemaphore. The reason we create so many PaintedLayers on these pages with maps is that the maps create absolutely positioned pins that set the top and left properties often enough, and quickly enough, that it triggers layerization for each one. They are pretty small, so we are able to layerize a lot before we run out of AGR budget. [1]

Possible solutions:

1. We could introduce a count limit to the AGR budget, similar to the layer count limit [2]. But if a page is truly animating a lot of layers, there's a chance that could cause a painting regression.

2. We could change how we allocate CrossProcessSemaphores so that they share shmems from a pool, thereby lowering the number of file handles we need to open.

3. We could rework the layers IPC to send the shmem file handle to the compositor only once, then close it, and send some other ID in each layer transaction. This would probably get around the file handle limits completely.

[1] https://searchfox.org/mozilla-central/rev/cac28623a15ace458a8f4526e107a71db1519daf/layout/painting/nsDisplayList.cpp#1786
[2] https://searchfox.org/mozilla-central/rev/cac28623a15ace458a8f4526e107a71db1519daf/layout/painting/FrameLayerBuilder.cpp#4039
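Solution 3 could be sketched roughly like this; the names below are hypothetical, not the actual Gecko API. The receiving side registers the read-lock handle once under a small ID when the texture is created (after which the sender can close its copy of the fd), and every later layer transaction refers to the ID, so no further file descriptors cross the process boundary.

```cpp
// Sketch of solution 3 (hypothetical names, not the actual Gecko API).
#include <cstdint>
#include <unordered_map>

using ReadLockId = uint64_t;

class ReadLockRegistry {
 public:
  // Called once, at texture creation: take ownership of the handle and
  // hand back a cheap ID. The sender can now close its copy of the fd.
  ReadLockId Register(int handle) {
    ReadLockId id = mNextId++;
    mHandles[id] = handle;
    return id;
  }

  // Called on every layer transaction: resolve the ID; no fd traffic needed.
  bool Lookup(ReadLockId id, int* outHandle) const {
    auto it = mHandles.find(id);
    if (it == mHandles.end()) {
      return false;
    }
    *outHandle = it->second;
    return true;
  }

 private:
  ReadLockId mNextId = 1;
  std::unordered_map<ReadLockId, int> mHandles;
};
```

The open-fd count then scales with the number of live textures on the receiving side rather than with the number of layer transactions, which is the property the eventual fix ("drop file handles immediately after" sending) relies on.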
Some layerization tweaks, i.e. solution 1, certainly might help. But in my experience, fixing these bugs is like playing whack-a-mole, and it'll be pretty much impossible to ensure we never have too many layers on every site.

So I think for this bug we have to do option 2 or 3.

At the same time I will look in to sites creating too many layers and see what can be done.
This will probably regress in higher numbers once 60.0 is released to the beta audience.
Did anything change in 60 that would make this more common?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=992fa5bf9ea6a99b0a33df6f71470b44ed486423

There are two IPC-related failures in this try run that are a little suspicious, but they are in unaffected call stacks, so I think it's okay.
Comment on attachment 8958113 [details]
Allocate TextureReadLock at TextureClient creation and drop file handles immediately after. (bug 1416726, )

https://reviewboard.mozilla.org/r/227070/#review232830

::: commit-message-fdd1a:14
(Diff revision 1)
> +1. Create the TextureReadLock in the TextureClient constructor so it's available before IPC creation
> +    a. This is superficial as EnableReadLock was always called before IPC creation
> +1. Send the ReadLockDescriptor in the PTextureConstructor message and close the file handle
> +1. Receive the ReadLockDescriptor in TextureHost and close the file handle
> +1. Send a boolean flag in layer transactions if the texture is read locked instead of a descriptor
> +1. Use a boolean flag in TextureHost to determine if the ReadLock must be unlocked instead of a nullptr

nit: 1, 2, 3 ... instead of 1, 1, 1 :).

::: gfx/layers/client/ContentClient.cpp:765
(Diff revision 1)
>  
>    RefPtr<TextureClient> textureClientOnWhite;
>    if (aFlags & TextureFlags::COMPONENT_ALPHA) {
>      textureClientOnWhite = textureClient->CreateSimilar(
>        mForwarder->GetCompositorBackendType(),
> -      aFlags | ExtraTextureFlags(),
> +      aFlags | ExtraTextureFlags() | TextureFlags::BLOCKING_READ_LOCK,

textureClientOnWhite did not originally have a blocking read lock. Was it missing?

::: gfx/layers/client/TextureClient.cpp:1364
(Diff revision 1)
>  {
>    mData->FillInfo(mInfo);
>    mFlags |= mData->GetTextureFlags();
> +
> +  if (mFlags & TextureFlags::NON_BLOCKING_READ_LOCK) {
> +    EnableReadLock();

Might want to assert here that BLOCKING_READ_LOCK isn't set.
Attachment #8958113 - Flags: review?(aosmond) → review+
Comment on attachment 8958113 [details]
Allocate TextureReadLock at TextureClient creation and drop file handles immediately after. (bug 1416726, )

https://reviewboard.mozilla.org/r/227070/#review232830

> textureClientOnWhite did not originally have a blocking read lock. Was it missing?

No, it wasn't; it was added erroneously.

> Might want to assert here that BLOCKING_READ_LOCK isn't set.

Yes, good idea.
Pushed by rhunt@eqrion.net:
https://hg.mozilla.org/integration/mozilla-inbound/rev/45b5743ae3ad
Allocate TextureReadLock at TextureClient creation and drop file handles immediately after. (bug 1416726, r=aosmond)
Blocks: 1445008
https://hg.mozilla.org/mozilla-central/rev/45b5743ae3ad
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: mozilla59 → mozilla61
I can't reproduce the crash anymore on latest nightly. Thanks for the fix!
Thanks for the fix, Ryan! If there are specific sites in this bug causing bad layerization, even if they no longer cause this crash, let's file specific bugs for them and I'll look into them.
(In reply to Jamie Nicol [:jnicol] from comment #38)
> Thanks for the fix Ryan!

Yes, thank you!

> If there are specific sites in this bug causing bad
> layerization, even if they no longer cause this crash, let's file
> specific bugs for them and I'll look into them

Are there easy ways for users to tell whether sites are causing poor layerization, besides the browser crashing?
Flags: needinfo?(jnicol)
Short answer: poor performance and/or crashing often indicate poor layerization, but the user might not easily be able to attribute it to that.

Longer answer:

Poor layerization should never cause crashes in theory. It does, of course, due to bugs such as this one and, as always, OOMs.

The easiest way to see the layerization is to set layers.draw-borders=true. Sometimes this will make it immediately obvious that we've done a bad job, though it isn't always easy to see, nor 100% objective. Using the Gecko Profiler can also help - lots of time spent allocating layers is usually a big indication, for example. Also, enabling layers.dump or layout.display-list.dump will log the layer tree, which can be inspected.

So if we enable layer borders and take your smart bookmark example above, you'll see the loading spinners each create a fair number of tiny layers while they spin. I'm not sure why they do so, nor whether this is quite bad enough to be the only problem on this page. But it's certainly worth investigating.

Likewise, on http://wsprnet.org/drupal/wsprnet/map, each of the markers on the map gets its own layer, and there are a lot of them. At a glance I don't think this is necessarily bad, however - how else would they move correctly with the map while zooming? So I'm glad this bug got fixed in a way other than simply limiting the number of layers we create.
Flags: needinfo?(jnicol)
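For reference, the debugging prefs Jamie mentions can be flipped in about:config, or collected in a user.js fragment like the one below. The pref names are taken directly from the comment above; the assumption is that all three default to false in a normal profile.

```
// user.js fragment: layerization debugging prefs mentioned above.
user_pref("layers.draw-borders", true);       // outline each layer on screen
user_pref("layers.dump", true);               // log the layer tree
user_pref("layout.display-list.dump", true);  // log the display list as well
```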
Depends on: 1445067
No longer depends on: 1445067
Is this something that should be considered for uplift?
Flags: needinfo?(rhunt)
Yes, I think it should be. I'll request uplift.
Flags: needinfo?(rhunt)
Comment on attachment 8958113 [details]
Allocate TextureReadLock at TextureClient creation and drop file handles immediately after. (bug 1416726, )

Approval Request Comment
[Feature/Bug causing the regression]: Graphics texture IPC used a cross-process synchronization mechanism (read locks) that required a file descriptor to be open for each layer on the screen. In some edge cases we end up generating thousands of layers, and that can cause us to crash from file descriptor exhaustion.

[User impact if declined]: Some web pages may crash, such as the Google Maps embedding pages listed in this bug.

[Is this code covered by automated tests?]: No

[Has the fix been verified in Nightly?]: Yes

[Needs manual test from QE? If yes, steps to reproduce]: No

[List of other uplifts needed for the feature/fix]: Just this bug, the uploaded mozreview request has a few nits that were fixed locally before being pushed to central.

[Is the change risky?]: Moderately

[Why is the change risky/not risky?]: The way read locks are sent to the compositor was changed, and while this new method is more straightforward it was not a single line fix.

[String changes made/needed]: None
Attachment #8958113 - Flags: approval-mozilla-beta?
Comment on attachment 8958113 [details]
Allocate TextureReadLock at TextureClient creation and drop file handles immediately after. (bug 1416726, )

This is slightly scary, but it seems to be doing OK in Nightly and was verified to fix the issue with running out of file descriptors by at least one affected user, so let's get this on beta 60.
Attachment #8958113 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Ryan Hunt [:rhunt] from comment #43)

> [Needs manual test from QE? If yes, steps to reproduce]: No

Marking this as qe- per Ryan's assessment on manual testing needs.
Flags: qe-verify-
There are no more crashes with signature "mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPLayerTransactionConstructor" in Nightly 61 and Beta 60, but there still are crashes with signature "mozilla::ipc::FatalError | mozilla::dom::ContentChild::FatalErrorIfNotUsingGPUProcess | mozilla::layers::PCompositorBridgeChild::SendPTextureConstructor" (though the crash volume dropped after the patch landed).
:rhunt, should I file a new bug?
Flags: needinfo?(rhunt)
Yes, that would be good. That signature can have multiple causes, so it'd be good to handle what remains in a separate bug.
Flags: needinfo?(rhunt)
See Also: → 1484288