Closed Bug 1219330 Opened 4 years ago Closed 4 years ago

crash in mozilla::layers::TextureClient::CreateForYCbCr with a 0xffffffffe5e5e609 address

Categories

(Core :: Graphics, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
firefox41 --- unaffected
firefox42 + wontfix
firefox43 + fixed
firefox44 --- fixed
firefox45 --- fixed
b2g-v2.5 --- fixed

People

(Reporter: kairo, Assigned: nical)

References

Details

(Keywords: crash, sec-moderate, Whiteboard: [b2g-adv-main2.5-][post-critsmash-triage][adv-main43+])

Crash Data

Attachments

(4 files, 3 obsolete files)

[Tracking Requested - why for this release]:

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-3d56b4aa-6787-47bf-8585-6b15a2151028.
=============================================================

Stack Trace:
0 	xul.dll 	mozilla::layers::TextureClient::CreateForYCbCr(mozilla::layers::ISurfaceAllocator*, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits>, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits>, mozilla::StereoMode, mozilla::layers::TextureFlags) 	gfx/layers/client/TextureClient.cpp
1 	xul.dll 	mozilla::layers::SharedPlanarYCbCrImage::Allocate(mozilla::layers::PlanarYCbCrData&) 	gfx/layers/ipc/SharedPlanarYCbCrImage.cpp
2 	xul.dll 	mozilla::layers::SharedPlanarYCbCrImage::SetData(mozilla::layers::PlanarYCbCrData const&) 	gfx/layers/ipc/SharedPlanarYCbCrImage.cpp
3 	xul.dll 	mozilla::VideoData::Create(mozilla::VideoInfo const&, mozilla::layers::ImageContainer*, mozilla::layers::Image*, __int64, __int64, __int64, mozilla::VideoData::YCbCrBuffer const&, bool, __int64, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&) 	dom/media/MediaData.cpp
4 		@0x20c0f83f 	

It looks like all those stacks are cut off with a hex address at that point.

Crashes with this signature have been present in 42 betas and newer nightly/dev-edition builds with low frequency, but in early 42 RC data, this is the #3 signature with ~2.2% of the overall data.

The address is always 0xffffffffe5e5e609, which points to the e5e5... pattern that is now our memory poison and the eax register is the exact poison "0xe5e5e5e5" so this smells like a UAF and I'm marking this as security.

The signature itself does appear a single-digit amount of times in 41 but not with a poison address, so I consider 41 unaffected. The 43 and 44 cases do have the poison as well, though.

It affects all versions of Windows from XP to 10, and 32bit as well as 64bit builds, in the latter case the rax register is the poison "0xe5e5e5e5e5e5e5e5" (while the address is [mis]reported as 0xffffffffffffffff).

Milan, is this something that looks like a graphics or like a media bug? We should take a look today to find out if this is something we need to get fixed for the 42 release (we are in RC already) or not.
Flags: needinfo?(milan)
Will follow up.
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(milan)
Flags: needinfo?(matt.woodrow)
Duplicate of this bug: 1219362
The 64bit crash reports seem to have better stacks, looks like a VPX decoder problem.

https://crash-stats.mozilla.com/report/index/dfe5867b-6475-494b-857a-f4c3c2151028
Flags: needinfo?(matt.woodrow)
aAllocator has been destroyed at this point.

The poisoned EAX value comes from dereferencing aAllocator (to get the vtable address), and then we crash trying to call IsSameProcess in this vtable.
I can see the VPX decoder on the stack when debugging the 32bit crash report in VS, so I guess it's just a bug with crash-stats.mozilla.org failing to unwind.
FFmpeg PDM would use an almost identical code path, so if you have crashes in VPXDecoder, it will happen on Linux too.

That is schedule a task on its flushable task queue and attempt to decode the frame.

I don't see how we could get to that point should the decoder be shutdown (as pending tasks are flushed first).

The Windows PDM does check that its decoder wasn't destroyed, but none of the other PDM do.

So the question remains: how can the allocator be destroyed at that point. Was gfx shutdown before the decoder ?
It looks like in the caller (SharedPlanarYCbCrImage::Allocate) that mCompositable (an ImageClient instance) is still alive and valid at this point.

So it appears that ImageClient/CompositableClient::mForwarder (CompositableForwarder*) has been destroyed, and we're holding a raw pointer to this.

I guess we've destroyed the ImageBridge for some reason, while we still have an asyncImageContainer around that has an ImageClient for that connection.

Don't think I can get much more from the minidump without a full memory dump.

Nical, any ideas on how we're supposed to handle this?
(In reply to Jean-Yves Avenard [:jya] from comment #6)
> FFmpeg PDM would use an almost identical code path, so if you have crashes
> in VPXDecoder, it will happen on Linux too.

That's believable, but we have a lot less linux users, so might be hard to find crash reports.
Important crash (+ security), tracking.
(In reply to Matt Woodrow (:mattwoodrow) from comment #7)
> I guess we've destroyed the ImageBridge for some reason, while we still have
> an asyncImageContainer around that has an ImageClient for that connection.

On some reports we can see that the main thread is already past the shutdown of all gfx things.

> Nical, any ideas on how we're supposed to handle this?

Sorry to repeat myself in many different bugs, but we need all of the code that holds on to gfx resources to make sure it is shutdown in time. We can probably paper over part of the issue by adding checks in places where we have enough levels of indirection, but there is only so much we can do this way since things that hold on to gfx resources have direct access to the underlying data from any thread. In many of the reports we have a decoder outputting video frames after all of gfx has been shut down, which really needs to be fixed. Since this is a top crash late in beta I am looking into papering over the crash in gfx as much as possible but this will most likely just move a lot of the crashes to another stack.
Flags: needinfo?(nical.bugzilla)
Are you saying that it should be up to the client (here the VPXDecoder) to check that the mozilla::layers::ISurfaceAllocator aAllocator didn't get destroyed?

or that one of the ImageContainer's element didn't get shutdown?
If so how?

Seeing that the only gfx object held by a decoder is ImageContainer, which ref-counted, holding a reference to the object kind of means it's in use from an abstract POV.
That it can be shutdown seems rather counter productive.

FWIW, having VideoData::Create returns null is checked by all video decoders as a failure, which will cause anything to abort.

It appears to me that the only place to check that gfx was shutdown can be done in SharedPlanarYCbCrImage::Allocate(PlanarYCbCrData& aData) which should return false.

So one way could be to have VideoData::Create check that SetData() did succeed 
This is made difficult as SharedPlanarYCbCrImage::SetData() returns a void; returning a bool could help testing.
Alternatively if it's too much trouble to have it return a bool; it could check the data (like checking the actual data pointer to see if it did get set)
hmmm. actually no, layers::SharedPlanarYCbCrImage hold a strong reference to ImageClient, only holds a weak reference to the CompositableForwarder (which is has been destroyed).

So it's up to ImageContainer::CreateImage to return null.

That said; I still don't get how it's possible that gfx gets destroy before the decoder; especially there as it means the video is still playing
This will probably avoid some of the crashes, but it does not fix the real issue. It should be fairly safe to uplift, though.
Attachment #8680613 - Flags: review?(matt.woodrow)
typo.
Attachment #8680613 - Attachment is obsolete: true
Attachment #8680613 - Flags: review?(matt.woodrow)
Attachment #8680623 - Flags: review?(matt.woodrow)
Note that because I have this potential UAF as security, we need a security rating (or a good explanation of why there is no security risk at all) and if the rating is sec-high or sec-crit, we also need sec-approval. That said, I just talked to Sylvestre and we'd like this to land today if possible on m-r and get it into a second RC build for 42.
Nicolas, do you know someone else who could review that? Thanks
Flags: needinfo?(nical.bugzilla)
It's UAF, but at shutdown, so it feels like sec-moderate (perhaps sec-high, but probably not.)
Flags: needinfo?(nical.bugzilla)
Keywords: sec-moderate
putting back the ni to nical :)
Flags: needinfo?(nical.bugzilla)
Comment on attachment 8680623 [details] [diff] [review]
cancel texture creation if the ISurfaceAllocator is disconnected

Review of attachment 8680623 [details] [diff] [review]:
-----------------------------------------------------------------

Seems reasonable for now.
Attachment #8680623 - Flags: review?(matt.woodrow) → review+
(In reply to Jean-Yves Avenard [:jya] from comment #11)
> Are you saying that it should be up to the client (here the VPXDecoder) to
> check that the mozilla::layers::ISurfaceAllocator aAllocator didn't get
> destroyed?

I would say the best is to shut down all systems that hold gfx resources synchronously before we shut gfx down, rather than have to handle media outliving gfx and test on a thread that something hasn't been destroyed on another thread. But someone with more insights into how media resources are managed might have better suggestions. It looks like (looking at MediaShutdownManager) at least some of media is explicitly trying to be shut down synchronously when NS_XPCOM_SHUTDOWN_OBSERVER_ID is fired which is what we need. But it appears that some of the resources are still around after that (as we can see in some of the crash reports here).

> 
> or that one of the ImageContainer's element didn't get shutdown?
> If so how?

I am hoping that if we make 100% sure media is shut down before gfx, the likelyhood of something using ImageContainers off the main thread while we shut gfx down gets pretty low (since AFAIK ImageContainer is used only by media, gfx and plugin stuff).

> 
> Seeing that the only gfx object held by a decoder is ImageContainer, which
> ref-counted, holding a reference to the object kind of means it's in use
> from an abstract POV.
> That it can be shutdown seems rather counter productive.
> 

Yeah, shutdown is awkward because we use ref-counting all over the code base but at the same time shutdowns happens synchronously. Since xpcom's shutdown destroys the machinery that lets us use threads (among other things), there's this "deadline" to handle anything related to compositing, while reference counting + how we share some of the things across threads makes the where and when of deallocation hard to predict.


> FWIW, having VideoData::Create returns null is checked by all video decoders
> as a failure, which will cause anything to abort.

I assume you mean abort as in stop decoding (rather than MOZ_CRASH)?

> 
> It appears to me that the only place to check that gfx was shutdown can be
> done in SharedPlanarYCbCrImage::Allocate(PlanarYCbCrData& aData) which
> should return false.

The patch I attached checks when creating the TextureClient, which is equivalent. My worry is that while we avoid crashing there, we are still in a weird state where gfx is shut down and things that call into gfx are still alive, so we may crash somewhere else.

> 
> So one way could be to have VideoData::Create check that SetData() did
> succeed 
> This is made difficult as SharedPlanarYCbCrImage::SetData() returns a void;
> returning a bool could help testing.

I'll add the boolean in a subsequent patch.
Flags: needinfo?(nical.bugzilla)
Too late for 42, so let's have this patch bake in central for a few days before we uplift it.
Assignee: nobody → nical.bugzilla
Attached patch Handle SetData failure (obsolete) — Splinter Review
This patch makes SetData return a boolean and makes the call sites handle the result (by returning null or by asserting, or both, depending on how the callers handle other errors).
Attachment #8681241 - Flags: review?(jyavenard)
patch is empty.
Attached patch Handle SetData failure (obsolete) — Splinter Review
Uploaded the wrong patch.
Attachment #8681241 - Attachment is obsolete: true
Attachment #8681241 - Flags: review?(jyavenard)
Attachment #8681294 - Flags: review?(jyavenard)
https://hg.mozilla.org/mozilla-central/rev/a6d9d8927dd2
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Comment on attachment 8681294 [details] [diff] [review]
Handle SetData failure

Review of attachment 8681294 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, passing on to :jesup as it touches webrtc

::: dom/media/webrtc/MediaEngineDefault.cpp
@@ +250,5 @@
>  		     0, 0);
>  #endif
>  
> +  if (!ycbcr_image->SetData(data)) {
> +    MOZ_ASSERT(false);

probably would still need to release data; as I'm guessing the called expect that it happened.
I don't know that code though.
Attachment #8681294 - Flags: review?(rjesup)
Attachment #8681294 - Flags: review?(jyavenard)
Attachment #8681294 - Flags: review+
Comment on attachment 8681294 [details] [diff] [review]
Handle SetData failure

Review of attachment 8681294 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/webrtc/MediaEngineDefault.cpp
@@ +250,5 @@
>  		     0, 0);
>  #endif
>  
> +  if (!ycbcr_image->SetData(data)) {
> +    MOZ_ASSERT(false);

This would still need to call ReleaseFrame(data) to free the buffer referenced by data.  It also should return (probably an error, though it really doesn't matter much).

::: media/webrtc/signaling/src/mediapipeline/MediaPipeline.cpp
@@ +1504,5 @@
> +    if (!yuvImage->SetData(yuvData)) {
> +      // This class seems to expect image_ to not be null so we don't early-return
> +      // here, but there's probably something that should be done if we get into
> +      // this branch because we assume SetData always succeeds although it might fail.
> +      MOZ_ASSERT(false);

This should early-return here; image_ holds a pointer to the most-recent-image.
Attachment #8681294 - Flags: review?(rjesup) → review-
How about this (only addressed the comments, the rest of the patch hasn't changed)?
Attachment #8681294 - Attachment is obsolete: true
Attachment #8681836 - Flags: review?(rjesup)
Comment on attachment 8681836 [details] [diff] [review]
Handle SetData failure v2

Review of attachment 8681836 [details] [diff] [review]:
-----------------------------------------------------------------

You missed one of the two comments.... r+ with that fixed too

::: media/webrtc/signaling/src/mediapipeline/MediaPipeline.cpp
@@ +1502,5 @@
>      yuvData.mStereoMode = StereoMode::MONO;
>  
> +    if (!yuvImage->SetData(yuvData)) {
> +      MOZ_ASSERT(false);
> +    }

Needs early return on failure here
Attachment #8681836 - Flags: review?(rjesup) → review+
Group: gfx-core-security → core-security-release
A note on MOZ_ASSERT(false) - you may want to consider using gfxCriticalError() instead, which will assert in debug builds, but also leave a trace in crash reports in case a crash happens afterwards, making it easier to see if this ever happens in a non-debug build...
Whiteboard: [b2g-adv-main2.5-]
Hi nical, Do we know if this helped (maybe less of a sec issue even if it just moved the crash signature in some cases?)   Do you want to uplift this?
Flags: needinfo?(nical.bugzilla)
Comment on attachment 8680623 [details] [diff] [review]
cancel texture creation if the ISurfaceAllocator is disconnected

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: crashes
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: low: simple patch, is already in nightly and aurora. 
[String/UUID change made/needed]:
Flags: needinfo?(nical.bugzilla)
Attachment #8680623 - Flags: approval-mozilla-beta?
Comment on attachment 8680623 [details] [diff] [review]
cancel texture creation if the ISurfaceAllocator is disconnected

Crash fix, baked on m-c and m-a, let's take it for beta.
Attachment #8680623 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Neither of the patches that landed on m-c seem to be on aurora. (They landed after m-c became 45, and I don't see a mention of this bug number in aurora's shortlog.)
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(lhenry)
Half-baked then.  Well, let's land it on aurora too.
Flags: needinfo?(lhenry)
Comment on attachment 8680623 [details] [diff] [review]
cancel texture creation if the ISurfaceAllocator is disconnected

Uplifting to aurora as well.
Attachment #8680623 - Flags: approval-mozilla-aurora+
This needs both patches uplifted, right?
(In reply to Wes Kocher (:KWierso) from comment #35)
> Neither of the patches that landed on m-c seem to be on aurora. (They landed
> after m-c became 45, and I don't see a mention of this bug number in
> aurora's shortlog.)

Ah sorry, I thought it landed before the uplift, but I should have checked.

I think that the first patch is the most important to avoid the crash, the other one is good to take too, but more likely to cause merge conflicts since it touches more places. I am hoping that the first one is enough as far as uplifting is concerned but the other one is not too risky in principle so we can uplift as well.
Flags: needinfo?(nical.bugzilla)
(In reply to Nicolas Silva [:nical] from comment #39)
> (In reply to Wes Kocher (:KWierso) from comment #35)
> > Neither of the patches that landed on m-c seem to be on aurora. (They landed
> > after m-c became 45, and I don't see a mention of this bug number in
> > aurora's shortlog.)
> 
> Ah sorry, I thought it landed before the uplift, but I should have checked.
> 
> I think that the first patch is the most important to avoid the crash, the
> other one is good to take too, but more likely to cause merge conflicts
> since it touches more places. I am hoping that the first one is enough as
> far as uplifting is concerned but the other one is not too risky in
> principle so we can uplift as well.

both patches applied cleanly to aurora so i uplifted them both
I'm hitting a conflict uplifting to beta. Can we get a rebased patch for it?
Flags: needinfo?(nical.bugzilla)
Comment on attachment 8688070 [details] [diff] [review]
Part 1 - beta patch

Review of attachment 8688070 [details] [diff] [review]:
-----------------------------------------------------------------

thanks
Attachment #8688070 - Flags: review?(nical.bugzilla) → review+
Comment on attachment 8688071 [details] [diff] [review]
Part 2- beta patch

Review of attachment 8688071 [details] [diff] [review]:
-----------------------------------------------------------------

thanks
Attachment #8688071 - Flags: review?(nical.bugzilla) → review+
Flags: needinfo?(nical.bugzilla)
Note, this should make it into the beta 5 release this Friday.
Whiteboard: [b2g-adv-main2.5-] → [b2g-adv-main2.5-][post-critsmash-triage]
Whiteboard: [b2g-adv-main2.5-][post-critsmash-triage] → [b2g-adv-main2.5-][post-critsmash-triage][adv-main43+]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.