Closed Bug 966543 Opened 7 years ago Closed 7 years ago

Crashes at MacIOSurface::GetDevicePixelWidth() with OMTC

Categories

(Core :: Graphics, defect)

x86
macOS
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla30
Tracking Status
firefox27 + wontfix
firefox28 + verified
firefox29 + verified
firefox30 - verified
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed

People

(Reporter: smichaud, Assigned: mattwoodrow)

Details

(Keywords: topcrash-mac)

Crash Data

Attachments

(3 files)

These are all crashes on a secondary compositor thread.

The oldest ones I can find date from 2013-11:

bp-4c458196-4734-45f9-8ffc-777392131114
bp-0d3aa9a6-8677-4d7a-af77-1d3202131121

These are the #2 topcrasher on the 27 branch, and the #9 topcrasher on the 29 branch.

These crashes happen here:
http://hg.mozilla.org/mozilla-central/annotate/540d85f60c57/gfx/2d/MacIOSurface.cpp#l299
and here:
http://hg.mozilla.org/mozilla-central/annotate/540d85f60c57/gfx/layers/opengl/MacIOSurfaceTextureHostOGL.cpp#l97

I'd guess that MacIOSurfaceTextureHostOGL.mSurface is null.  MacIOSurface::LookupSurface(), called in that class's constructor, can return null.

Matt, I'm CCing you because these crashes happen in code you wrote.

Here are a few recent examples:
bp-1256c709-a51a-47f3-b064-fbc0d2140103
bp-6f4bc490-d0a3-48e2-ada4-16d5c2140105
bp-f6bd0adc-03b4-4421-a2ba-1bdd02140104
Crash Signature: [@ MacIOSurface::GetDevicePixelWidth() ]
Attached patch macio-crashSplinter Review
I think this is most likely the issue. The ID we pass in the SurfaceDescriptor doesn't hold a reference to the surface, so it seems possible that the TextureClient was being destroyed (and the surface with it) before the TextureHost was created.
Attachment #8369261 - Flags: review?(nical.bugzilla)
Comment on attachment 8369261 [details] [diff] [review]
macio-crash

Review of attachment 8369261 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/opengl/MacIOSurfaceTextureClientOGL.cpp
@@ +75,5 @@
> +  virtual void DeallocateSharedData(ISurfaceAllocator*) MOZ_OVERRIDE
> +  {
> +    mSurface = nullptr;
> +  }
> + 

nit: white space
Attachment #8369261 - Flags: review?(nical.bugzilla) → review+
Tracking for Fx27 as this is a 27 specific top mac crasher, but not something we can hold tomorrow's release for. In case we have chemspill/dot release, will consider this as a ride along if the patch sticks in Fx28 and is low risk.
https://hg.mozilla.org/mozilla-central/rev/2d0317f079e3
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
I still see crashes with this in Firefox 30.0a1 after this landed. 

For example:
https://crash-stats.mozilla.com/report/index/ebc9ad2f-beff-496e-919a-a93512140206
> bp-ebc9ad2f-beff-496e-919a-a93512140206

This is so far the only crash in a build with this bug's patch.  I suspect there will be more.  But it'll be a week before we can tell whether or not the frequency has changed.
> https://hg.mozilla.org/mozilla-central/rev/2d0317f079e3

There have been 6 crashes with this patch (on the 30 branch) in the week since it landed:

bp-9bea4c69-d7ed-4f7b-8197-64d3c2140209
bp-ebc9ad2f-beff-496e-919a-a93512140206
bp-15a84672-e3e6-47d2-bf8f-9b8912140211
bp-bfcd2bf6-dcdd-450f-ab32-e6f9b2140210
bp-f855a4e2-027a-4db5-93f2-304f32140209
bp-4d7d68f1-0aa5-4d4f-be53-45d2f2140211

There were five of these crashes (on the 29 branch) in the week before this patch landed.

So this patch has made no difference at all to these crashes.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
> I'd guess that MacIOSurfaceTextureHostOGL.mSurface is null.

I've confirmed this, by looking at XUL in Hopper Disassembler.

According to the raw crash logs, these crashes take place trying to read memory at offset 0x8, in MacIOSurface::GetDevicePixelWidth().  The assembly code for this method is very short, so I'll paste it in here.  This method is called with a pointer to "this" (a MacIOSurface pointer) in the $rdi register, in this code snippet:  mSurface->GetDevicePixelWidth().  mSurface is MacIOSurfaceTextureHostOGL.mSurface.

push       rbp
mov        rbp, rsp
mov        rdi, qword [ds:rdi+0x8]
pop        rbp
jmp        qword [ds:__ZN15MacIOSurfaceLib6sWidthE] ; MacIOSurfaceLib::sWidth

Almost certainly the crashes happen at "line 3", whose pseudo code is something like this:

$rdi = *(rdi + 0x8)

In other words, MacIOSurface::GetDevicePixelWidth() is called with this == NULL.
Hrm, so it looks like DropTextureData is only called for TEXTURE_DEALLOCATE_CLIENT textures.

Nico, I think what is happening here is that the texture host hasn't been created at the time when we destroy the texture client, because it's still sitting in the message queue.

Creating the texture host is what takes a reference to the shared object, so we have a race condition.

Do we have anything in place to protect against this? I think we could potentially have this issue for a few different texture types.
Flags: needinfo?(nical.bugzilla)
Hmm, I don't think we have anything to protect about destroying a TextureClient we just sent to the compositor before the TextureHost got created.

I am actually surprised that this can happen at all. It seems like if we run into this we shouldn't have shared the TextureClient in the first place.

We could give the TextureData to the TextureChild even when TEXTURE_DEALLOCATE_CLIENT is not set, and tell the TextureChild to forget the texture in this case, instead of deallocating it (since the texture will be deallocated on the host side).
Flags: needinfo?(nical.bugzilla)
Don't we need a synchronous transaction for that though?

I can't even do an extra AddRef on the client side, and skip the AddRef when receiving on the host side since we might be passing cross-process.

MacIOSurface is a bit weird in that it's a c++ object (which we AddRef/Release) but wrapping a system level object which we can't manually refcount. There's no way to guarantee the id number we pass through IPDL stays valid without keeping the c++ object in the client alive.

I think we need an asynchronous version of the delete message so we can get notified when the host has processed all the messages.
Assigning to Matt since it looks like that's who's doing the work of finding the fix here.  Matt, we still have a couple more weeks of Beta where slightly more speculative fixes can be tried to see if they impact the crash volume.
Assignee: nobody → matt.woodrow
Probably wallpapering over the problem a bit, but this should stop people from crashing.

If my guess about the race condition is correct, then we're only trying to render this for a single frame anyway, so not a big loss.
Attachment #8376870 - Flags: review?(nical.bugzilla)
Just puts all the **Host code together instead of mixed through the file.
Attachment #8376871 - Flags: review?(nical.bugzilla)
(In reply to Matt Woodrow (:mattwoodrow) from comment #12)
> Don't we need a synchronous transaction for that though?

TextureClient/Host async shutdown goes as follows:
* client side sends the RemoveTexture message
* host side receives it, deallocates the shared data (or not depending on the TextureFlags)
* host sends __delete__

So if we need to keep the data alive, we can pass it to the TextureParent, and the latter will be deleted when receiving __delete__ asynchronously.

> I think we need an asynchronous version of the delete message so we can get
> notified when the host has processed all the messages.

We already sort of have this if we overload Recv__delete__ on PTextureChild (which we don't currently do but we should).
Comment on attachment 8376870 [details] [diff] [review]
Check for null mSurface

Review of attachment 8376870 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/opengl/MacIOSurfaceTextureHostOGL.cpp
@@ +95,5 @@
>  
>  gfx::SurfaceFormat
>  MacIOSurfaceTextureHostOGL::GetFormat() const {
> +  if (!mSurface) {
> +    return gfx::SurfaceFormat::R8G8B8A8;

Should we return SurfaceFormat::UNKNOWN here?
Attachment #8376870 - Flags: review?(nical.bugzilla) → review+
Attachment #8376871 - Flags: review?(nical.bugzilla) → review+
https://hg.mozilla.org/mozilla-central/rev/48cd98deecd7
https://hg.mozilla.org/mozilla-central/rev/6bd01c3b4945
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Flagging for verification based on crash-stats. We'll need to give this a few days for data collection.
Keywords: verifyme
Comment on attachment 8376871 [details] [diff] [review]
Move some code around

There haven't been any of these crashes in builds that contain this patch (starting with firefox-2014-02-18-03-02-02-mozilla-central).  So it looks like this bug is now actually fixed!
We have just seen this crash on Mac OS X 10.7.5 (x86_64) on Nightly de
 
https://crash-stats.mozilla.com/report/index/1b0e40f2-e96e-4e3e-b752-12f2e2140224
https://crash-stats.mozilla.com/report/index/0fb87fbf-7fdb-40bd-ae63-75d3b2140224
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Those crashes were in the firefox-2014-02-09-03-02-03-mozilla-central nightly.  This bug's actual fix only landed as of the firefox-2014-02-18-03-02-02-mozilla-central nightly.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
I looked at the data for Firefox 30.0a1 in the last week and there are 5 crashes reported. However all of these crashes occur with a build ID of 20140217030202 or earlier. Since the fix landed on February 17 and would have appeared first in the February 18 Nightly builds, I think we can call this fixed.

Source:
https://crash-stats.mozilla.com/report/list?signature=MacIOSurface%3A%3AGetDevicePixelWidth%28%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&version=Firefox%3A30.0a1&hang_type=any&date=2014-02-24+20%3A00%3A00&range_value=1#tab-reports
Comment on attachment 8376870 [details] [diff] [review]
Check for null mSurface

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown
User impact if declined: Occasional crashes with plugins/webgl on OSX.
Testing completed (on m-c, etc.): Been on m-c for a week, confirmed it fixes the issue using crashstats.
Risk to taking this patch (and alternatives if risky): Very low risk, just adds null checks.
String or IDL/UUID changes made by this patch: None
Attachment #8376870 - Flags: approval-mozilla-beta?
Attachment #8376870 - Flags: approval-mozilla-aurora?
Attachment #8376870 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8376870 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I just checked crash-stats and I'm seeing 0 crashes with Firefox 28.0b7 and 0 crashes with Firefox 29.0a2 after 2014-02-25. I'm calling this verified fixed based on these stats.
Status: RESOLVED → VERIFIED
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.