Closed Bug 1088300 Opened 10 years ago Closed 8 years ago

Firefox 33 crash in mozilla::layers::TextureClient::InitIPDLActor(mozilla::layers::CompositableForwarder*)

Categories

(Core :: Graphics: CanvasWebGL, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- affected
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: bjacob, Assigned: nical)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(6 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-b2b10163-6c55-4ca9-9a35-8af2a2141023.
=============================================================

Here's how I got this crash:

browse to maps.google.com
switch to 'earth' mode (the satellite view icon on bottom left)
zoom some, pan some

Note it's happening in Firefox 33, not in Nightly.
Crash Signature: [@ mozilla::layers::TextureClient::InitIPDLActor(mozilla::layers::CompositableForwarder*)] → [@ mozilla::layers::TextureClient::InitIPDLActor(mozilla::layers::CompositableForwarder*)] [@ mozilla::layers::TextureClient::InitIPDLActor]
5 crashes with this signature, all different install times, on the 6-29 Nightly, making it the #2 OSX crash. Example: https://crash-stats.mozilla.com/report/index/7c46e33d-37fa-4985-8d18-6293d2160630
FWIW, I hit this right after printing a page in bp-e4a2e7ed-61ac-4213-9215-74b9b2160804, but later did the same thing with no crash.
...I spoke too soon. Same crash, after closing the tab I printed. (Unfortunately the tab was for renewing a license and registration on http://www.dmv.ca.gov/, so I can't really try a 3rd time.)
bp-5da15ec5-54fe-436e-9b08-ba9d32160804, and I'm on a current OS X Nightly.
Looks like TiledContentClient with a null mCompositableClient, which would suggest ClientMultiTiledLayerBuffer with a null mCompositableClient.  :nical, we have some asserts, but nothing stronger - anything that stands out?
Flags: needinfo?(nical.bugzilla)
Whiteboard: [gfx-noted]
It would mean ValidateTile was not called on that particular tile and/or the tile is a placeholder tile.
We don't need to store mCompositableClient in the tile. we have the information a few levels up the stack and we only need here, so I'll put up a quick patch that removes the member and pass the compositable by parameter instead, to avoid messing this up again.
That will take care of the crash and prevent future similar issues, but I don't know whether the current problem is the symptom of a bigger bug.
Flags: needinfo?(nical.bugzilla)
Assignee: nobody → nical.bugzilla
This is a first step. I don't actually think that the compositable was null because I didn't run into any code path that would have led to that, but with this patch we have stronger guarantee that it cannot happen.
There are other things that we can do to remove unnecessary state from TileClient, I'm working on followup patches.
Attachment #8783942 - Flags: review?(milan)
TileClient::mAllocator is interestingly trickier. I tried to move it out of the tile and store it at the TiledLayerBufferLevel, but it might cause issues when switching between non-component-alpha and component-alpha. It's also less clear that mAllocator is not null when we need it.
For now it is a lot simpler to just null-check mAllocator.

It would be interesting to move it out of the tile and get to a point where we have the guarantee that different allocators are not getting mixed in the same tile buffer (if the render mode changes, remove all tiles, get a new allocator, etc). But that means spending some non-trivial amount of time shaving off technical debt for something that is not near the top of my priority list.
Attachment #8783962 - Flags: review?(milan)
Attachment #8783962 - Flags: review?(milan) → review+
Comment on attachment 8783942 [details] [diff] [review]
Remove mCompositableClient from TileClient, pass the compositable and layer by reference to remove the possibility of unexpectedly storing null pointers

Review of attachment 8783942 [details] [diff] [review]:
-----------------------------------------------------------------

I understand not storing the pointer, but going up the stack to pick it up - I don't understand where we're guarding against null pointers with this patch.?
Flags: needinfo?(nical.bugzilla)
(In reply to Milan Sreckovic [:milan] from comment #10)
> I understand not storing the pointer, but going up the stack to pick it up -
> I don't understand where we're guarding against null pointers with this
> patch.?

We are passing temporary references around (once you have your reference you can assume it is not null and it cannot be nulled, so in case of an error you can directly blame the creator of the reference up the stack) which are all created by passing "*this" in the various caller methods that create the references.
The only way for the references to be null would be for the callers to be method calls on a null object. It's debatable to assume that "this" is never null, but whenever it happens it usually crashes quickly and loudly. So worse case we are not dealing with the unexpected state of a TileClient in the middle of the tile buffer, but dealing instead with calling a method on a null layer up the stack which is simpler to identify and reason about.
Flags: needinfo?(nical.bugzilla)
Thanks - I didn't chase all the calls to see that all the references were coming from *this.  I'm comfortable assuming this is not null.
Comment on attachment 8783942 [details] [diff] [review]
Remove mCompositableClient from TileClient, pass the compositable and layer by reference to remove the possibility of unexpectedly storing null pointers

Review of attachment 8783942 [details] [diff] [review]:
-----------------------------------------------------------------

From what I can tell, this wouldn't stop any crashes, but it would move the ones caused by a null "this" pointer to a more explicit place, so we would get better information.  Cool.
Attachment #8783942 - Flags: review?(milan) → review+
(In reply to Milan Sreckovic [:milan] from comment #14)
> Created attachment 8784427 [details]
> Bug 1088300: Diagnosting patch - MOZ_CRASH if we can't get an actor, instead
> of a nullptr crash.
> 
> Review commit: https://reviewboard.mozilla.org/r/73898/diff/#index_header
> See other reviews: https://reviewboard.mozilla.org/r/73898/

I don't know if this is likely, but it won't hurt - and we do get crashes on nightly, so we should know soon enough.
Comment on attachment 8784427 [details]
Bug 1088300: Diagnosting patch - MOZ_CRASH if we can't get an actor, instead of a nullptr crash.

https://reviewboard.mozilla.org/r/73898/#review71960
Attachment #8784427 - Flags: review?(nical.bugzilla) → review+
Crash volume for signature 'mozilla::layers::TextureClient::InitIPDLActor':
 - nightly (version 51): 102 crashes from 2016-08-01.
 - aurora  (version 50): 321 crashes from 2016-08-01.
 - beta    (version 49): 7 crashes from 2016-08-02.
 - release (version 48): 15 crashes from 2016-07-25.
 - esr     (version 45): 1 crash from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly      27      19      28
 - aurora      122      99      15
 - beta          1       3       2
 - release       5       4       3
 - esr           0       0       0

Affected platforms: Windows, Mac OS X

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly           #17
 - aurora  #426      #8
 - beta              #1255
 - release #3174     #331
 - esr
Pushed by msreckovic@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1f2fa3500013
Diagnosting patch - MOZ_CRASH if we can't get an actor, instead of a nullptr crash. r=nical
I just hit this crash after printing a GMail email and switching tabs.

https://crash-stats.mozilla.com/report/index/deab0fd7-0a07-4921-9717-23f472160829
Crash with extra info: https://crash-stats.mozilla.com/report/index/81089099-142f-47b1-a81b-e95532160828

We crash with the null mActor set here: https://dxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#904 in this call:

aForwarder->CreateTexture(desc,
                          aForwarder->GetCompositorBackendType(),
                          GetFlags(),
                          mSerial));

desc is a SurfaceDescriptor of type TSurfaceDescriptorBuffer
aForwarder's compositor backend type is LAYERS_NONE
GetFlags returns DISALLOW_BIG_IMAGE + IMMEDIATE_UPLOAD
mSerial is 15020

The only thing I can find that would return nullptr without crashing earlier, seems to be when aForwarder is a ShadowLayerForwarder, and ShadowLayerForwarder::CreateTexture takes an early exit here:

  if (!HasShadowManager() ||
      !mShadowManager->IPCOpen() ||
      !mShadowManager->Manager()) {
    return nullptr;
  }

in https://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/ShadowLayers.cpp#925, but I don't know this code and only did a quick look.
(In reply to Markus Stange [:mstange] from comment #20)
> I just hit this crash after printing a GMail email and switching tabs.
> 
> https://crash-stats.mozilla.com/report/index/deab0fd7-0a07-4921-9717-
> 23f472160829

I'd make the same conclusion on this one as on the one in comment 22.
Flags: needinfo?(nical.bugzilla)
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8ff54e28b15
Remove mCompositableClient from TileClient, pass the compositable and layer by reference to remove the possibility of unexpectedly storing null pointers. r=milan
https://hg.mozilla.org/integration/mozilla-inbound/rev/1ec40ecc63f8
Null-check TileClient::mAllocator. r=milan
Backed out for unused variable error in SingleTiledContentClient.h:

https://hg.mozilla.org/integration/mozilla-inbound/rev/bfa7a5faf3b3fa940287745f65cedbb3a551c358
https://hg.mozilla.org/integration/mozilla-inbound/rev/b6211770c5dda33f7b503336b00815a676822a01

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1ec40ecc63f835c2f1347daeb3759fd95e5f8592
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=34951338&repo=mozilla-inbound
> /builds/slave/m-in-m64-000000000000000000000/build/src/obj-firefox/x86_64/dist/include/mozilla/layers/SingleTiledContentClient.h:86:23: error: private field 'mManager' is not used [-Werror,-Wunused-private-field]
This also asserts:
remoteautomation.py | application crashed [@ mozilla::layers::DiscardTexture]
Assertion failure: aAllocator, at /home/worker/workspace/build/src/gfx/layers/client/TiledContentClient.cpp:566 
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=1ec40ecc63f835c2f1347daeb3759fd95e5f8592
Comment on attachment 8787317 [details] [diff] [review]
More diagnostics for a possible failure case. r=mchang

Review of attachment 8787317 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/ShadowLayers.cpp
@@ +929,5 @@
>  {
>    if (!HasShadowManager() ||
>        !mShadowManager->IPCOpen() ||
>        !mShadowManager->Manager()) {
> +    gfxCriticalNote << "ShadowLayerForwarder::CreateTexture fails with " << HasShadowManager() << ", " << mShadowManager->IPCOpen() << ", " << ", " << mShadowManager->IsDestroyed() << ", " << !!mShadowManager->Manager();

nit: Add more context to each param e.g. "Destroyed: " << mShadowManager->IsDestroyed().
Attachment #8787317 - Flags: review?(mchang) → review+
Pushed by msreckovic@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2d673f97bf4d
More diagnostic to see what went wrong. r=mchang
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/565bac97a123
Remove mCompositableClient from TileClient, pass the compositable and layer by reference to remove the possibility of unexpectedly storing null pointers. r=milan
https://hg.mozilla.org/integration/mozilla-inbound/rev/41632799ddc4
Null-check TileClient::mAllocator. r=milan
Crash Signature: [@ mozilla::layers::TextureClient::InitIPDLActor(mozilla::layers::CompositableForwarder*)] [@ mozilla::layers::TextureClient::InitIPDLActor] → [@ mozilla::layers::TextureClient::InitIPDLActor(mozilla::layers::CompositableForwarder*)] [@ mozilla::layers::TextureClient::InitIPDLActor] [@ mozilla::layers::ShadowLayerForwarder::CreateTexture]
It would be interesting to see if we are in a situation where the forwarder is already unusable before we begin rendering the layer tree.
Flags: needinfo?(nical.bugzilla)
Attachment #8790733 - Flags: review?(milan)
Comment on attachment 8790733 [details] [diff] [review]
Don't render a ClientManager if IPC is already gone

Review of attachment 8790733 [details] [diff] [review]:
-----------------------------------------------------------------

Is it worth recording whether it's the mForwarder being null or IPCOPen failing that's the cause?
Attachment #8790733 - Flags: review?(milan) → review+
I've just crashed during printing again, this time I saved an egencia itinerary as a PDF.
https://crash-stats.mozilla.com/report/index/d366d0e8-9b36-44c3-bafc-b48872160914
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7ac9449aa8d2
Don't attempt to paint a ClientLayerManager if IPC is down. r=milan
(In reply to Milan Sreckovic [:milan] from comment #34)
> Is it worth recording whether it's the mForwarder being null or IPCOPen
> failing that's the cause?

I checked and in fact null-checking the forwarder isn't even necessary since it's created in the constructor and never nulled out. I should have removed the null check but it won't hurt anyway.

I am still puzzled about the interaction between printing and this stack trace. Need to dig some more into the printing code.
We have a number of crashes where this message got triggered, before we removed it in bug 1297568, for example: https://crash-stats.mozilla.com/report/index/114d021c-0d11-46e2-ab7c-bdece2161007

Returning false doesn't seem to stop us from crashing - we just crash in GetTextureForwarder()->IsSameProcess() call, because GetTextureForwarder() returns null.
Flags: needinfo?(nical.bugzilla)
The return doesn't help because it's not propagated out of EndTransaction. We'll need to fix that for the GPU process too. But just doing the same check in ShadowLayerForwarder should work.

If we re-add the gfxCriticalError, can we re-add it as a gfxCriticalNote?
I'd vote for it to be downgraded to gfxCriticalNote (see https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1297568&attachment=8795584), but we can also make use of MOZ_ASSERT_IF and only assert when not using the GPU process.

Would it be worth adding a new gfxCriticalError method that only triggers in non-GPU process worlds?
I don't think so. We should just be rigorous - if it's a diagnostic patch, make that explicit and we can work around it for the GPU process. If it's supposed to fix a crash, we have to fully handle the error. If it's both, use gfxCriticalNote. We mistakenly removed diagnostics but in fairness there are so many of these checks it's hard to tell.
All critical errors and notes are diagnostic.  No need to be more explicit than that.
Diagnostics by definition should be temporary, though. We have many, many browser-crashing asserts as "diagnostics" that should have been handled as normal failure cases. Sometimes they're *both*, which doesn't make sense.

If the information they provide is "does this happen," there are ways to do that without crashing the browser. Telemetry, crash annotations, notes, etc. If the information is a stack trace (which looks not the case in this particular bug), then we'd want to make sure that after getting data, the crash is fixed and the assert removed. If we get no data, then we'd want to just remove the diagnostic entirely (or downgrade it).
Round two. This patch checks in EndTransaction and EndEmptyTransaction rather than EndTransactionInternal. The patch also makes sure we revoke the latest refresh driver transaction id if we are not going to go through with it. If I am very lucky, this will take care of the hypothetical "transaction id leakage" I have been after for a little while.
Flags: needinfo?(nical.bugzilla)
Attachment #8799454 - Flags: review?(dvander)
See Also: → 1260517
(In reply to David Anderson [:dvander] from comment #44)
> Diagnostics by definition should be temporary, though. We have many, many
> browser-crashing asserts as "diagnostics" that should have been handled as
> normal failure cases. Sometimes they're *both*, which doesn't make sense.

I don't really care - whatever we need to actually fix the bug.  If :nical can fix this without the extra information, then we don't need the extra information.
Comment on attachment 8799454 [details] [diff] [review]
Don't attempt to paint a ClientLayerManager if IPC is down (v2)

Review of attachment 8799454 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! It does look like the old code would skip revoking the transaction id.
Attachment #8799454 - Flags: review?(dvander) → review+
Pushed by nsilva@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/02eb2bbe96fb
Don't attempt to paint a ClientLayerManager if IPC is down (take two). r=dvander
Comment on attachment 8799454 [details] [diff] [review]
Don't attempt to paint a ClientLayerManager if IPC is down (v2)

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: Crashes.
[Describe test coverage new/current, TreeHerder]: None.
[Risks and why]: low, simple patch.
[String/UUID change made/needed]: None.
Attachment #8799454 - Flags: approval-mozilla-beta?
Attachment #8799454 - Flags: approval-mozilla-aurora?
Comment on attachment 8799454 [details] [diff] [review]
Don't attempt to paint a ClientLayerManager if IPC is down (v2)

Crash fix, this is a pretty low volume crash on Beta50, still makes sense to uplift it, Aurora51+, Beta50+
Attachment #8799454 - Flags: approval-mozilla-beta?
Attachment #8799454 - Flags: approval-mozilla-beta+
Attachment #8799454 - Flags: approval-mozilla-aurora?
Attachment #8799454 - Flags: approval-mozilla-aurora+
has problems uplifting to aurora

grafting 369346:02eb2bbe96fb "Bug 1088300 - Don't attempt to paint a ClientLayerManager if IPC is down (take two). r=dvander"
merging gfx/layers/client/ClientLayerManager.cpp
warning: conflicts while merging gfx/layers/client/ClientLayerManager.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
(use 'hg resolve' and 'hg graft --continue')
Flags: needinfo?(nical.bugzilla)
Aurora uplift:
https://hg.mozilla.org/releases/mozilla-aurora/rev/22963ffebf2833b64e9cd692e02b268ffb32c6c9

(Beta uplift still building locally)
Flags: needinfo?(nical.bugzilla)
Status: NEW → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: