Closed Bug 1027103 Opened 6 years ago Closed 6 years ago

Google Maps Street View crashes in gfxContext::PushClipsToDT(mozilla::gfx::DrawTarget*)

Categories

(Core :: Graphics, defect, critical)

32 Branch
x86_64
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla33
Tracking Status
firefox31 + wontfix
firefox32 + verified
firefox33 + verified

People

(Reporter: GPHemsley, Assigned: bas.schouten)

References

()

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Viewing Google Maps Street View reproduceably crashes.

It seems Google recently unveiled a new interface for Google Maps, including Street View, and this new interface crashes when advancing along a road in Street View.

bp-be9e155d-a3e4-4373-925c-d9f172140618
bp-ca618fde-12e6-40c0-8dd6-cc7722140618

Likely related to one or more of the following bugs:

bug 974656
bug 1010262
bug 1011864
bug 1011348
bug 805406
Bas, after bug 1011864 and bug 1011348, we have another case here where the PushClipsToDT issue can be reproduced, and this one is a case that many people are likely to run into if those steps are reproducible by them. Is this helpful to find patches to reduce the cases of us running into this?
Flags: needinfo?(bas)
Keywords: crash
Looks like we should null-check the parameter aDT. Nothing seems to ensure that gfxContext::mDT is not null when we hit this path.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #1)
> Bas, after bug 1011864 and bug 1011348, we have another case here where the
> PushClipsToDT issue can be reproduced, and this one is a case that many
> people are likely to run into if those steps are reproducible by them. Is
> this helpful to find patches to reduce the cases of us running into this?

Not in itself, one of those reproducible cases is just other code being silly. (i.e. a window open of size 999999etc.). Yes, we fail to create a surface that's bigger than the amount of memory we can address, essentially we instantly OOM. Since we know OOM to be the biggest cause of the PushClipsToDT problem that's not surprising.

The other one we'll have to have a look but it seems to be linux only I doubt it's related to a more generic case where we're running into it.

In this case if I use the STR, my memory quickly jumps on 1.1 GB on my AMD machine in a clean session. Of those 1.1 GB about 400-600MB seem to be in JS. There's some 100MB of VRAM reported to be used by WebGL. 112 active WebGL textures. We know some intel drivers may use a little more memory as well. I think we're again just flooding our address space with stuff and the crash is happening because of that. So I suspect this is just a generic OOM. But I couldn't reproduce the crash in a clean session.

Having said that, in a session I had running for 2 days I was already using 1.2GB, 400MB decommited JS areas and 500 MB other JS stuff. There when I loaded Google maps, and didn't even go into streetview it jumped to 1600MB, in streetview it's then pretty trivial to OOM :). This seems like the most likely scenario by which a user would run into this.
Flags: needinfo?(bas)
Benjamin, if many of those crashes are generic OOM as Bas says, why do we not get a "OOM | foo" signature on them? Do we need to adjust our classifier or do we need product patches to improve that, or both?
Flags: needinfo?(benjamin)
We don't know from the crash signature that this is OOM, and we don't have an OOMAllocationSize annotation, so there's currently no way for crash-stats to know for sure that this is OOM instead of some other crash in this method.

One possible way to fix this is to move the crash earlier to a spot that is more likely to know the allocation size involved. Here's the codepaths:

http://hg.mozilla.org/mozilla-central/annotate/bdac18bd6c74/gfx/thebes/gfxContext.cpp#l2283 PushNewDT

Here we know the width/height/format of the target but we'd have to extrapolate that to an actual allocation size.

At http://hg.mozilla.org/mozilla-central/annotate/bdac18bd6c74/gfx/2d/DrawTargetSkia.cpp#l597 DrawTargetSkia::CreateSimilarDrawTarget we know that this is a skia/in-memory target.

Or if D2D is enabled, we could be at http://hg.mozilla.org/mozilla-central/annotate/bdac18bd6c74/gfx/2d/DrawTargetD2D.cpp#l1256 and we could be failing to create a surface which might be in process address space, card memory, or both.

Another reason that the D2D version can fail, entirely apart from OOM, is if the graphics card resets. This can be due to driver updates, driver crashes, bad behavior in Firefox, or bad behavior in other programs. I really believe that the we should make these surface failures truly fallible and recoverable.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #5)
> We don't know from the crash signature that this is OOM, and we don't have
> an OOMAllocationSize annotation, so there's currently no way for crash-stats
> to know for sure that this is OOM instead of some other crash in this method.
> 
> One possible way to fix this is to move the crash earlier to a spot that is
> more likely to know the allocation size involved. Here's the codepaths:
> 
> http://hg.mozilla.org/mozilla-central/annotate/bdac18bd6c74/gfx/thebes/
> gfxContext.cpp#l2283 PushNewDT
> 
> Here we know the width/height/format of the target but we'd have to
> extrapolate that to an actual allocation size.

With a lot of Intel drivers we don't know if address space consumption will be triggered for some reason, but I think your suggestion is a good one, if I can add something to improve OOM annotation in crash reports, please give me some pointers and I'll make that work. There's still some threat of false positives though.

> Or if D2D is enabled, we could be at
> http://hg.mozilla.org/mozilla-central/annotate/bdac18bd6c74/gfx/2d/
> DrawTargetD2D.cpp#l1256 and we could be failing to create a surface which
> might be in process address space, card memory, or both.
> 
> Another reason that the D2D version can fail, entirely apart from OOM, is if
> the graphics card resets. This can be due to driver updates, driver crashes,
> bad behavior in Firefox, or bad behavior in other programs. I really believe
> that the we should make these surface failures truly fallible and
> recoverable.

The driver reset case -should- be fixed (i.e. it should safely recover). But I most definitely believe there could still be a situation in which they're not. Making these allocations truly fallible/recoverable is possible, but unless this for some reason someone just requesting an insane size surface, there's not much we can do to truly recover. It would just mean we will artifact badly, and at the point that happens, the question is whether you want a user to have a non-functional browser or to crash?

Having said that, I have an idea, how about if the call fails, initially, we attempt to create a 64x64 surface. If for some reason the surface requested was very large this call will then succeed, and since we still have balance groups the amount of artifacting will be limited. If this call -also- fails it means we're probably thoroughly done for and won't be able to recover well anyway, and crashing is probably the best thing to do. (Unless there's a bug with driver resets left, which, considering the ratio of these bugs which are with D2D on, isn't that high anymore these days, and we should address that issue separately)

What do you think of that idea?
Flags: needinfo?(benjamin)
(In reply to Bas Schouten (:bas.schouten) from comment #6)
> If this call -also- fails
> it means we're probably thoroughly done for and won't be able to recover
> well anyway, and crashing is probably the best thing to do.

In this case, if we are pretty sure we are OOM, we probably should crash with a stack and/or annotation that we can easily detect as OOM on the crash-stats side.
Benjamin should be able to give you all the pointers on what needs to be done so we can determine that.
> With a lot of Intel drivers we don't know if address space consumption will
> be triggered for some reason, but I think your suggestion is a good one, if
> I can add something to improve OOM annotation in crash reports, please give
> me some pointers and I'll make that work. There's still some threat of false
> positives though.

I think in this case the subsequent code will currently absolutely fail if we return null, so it should be safe to crash earlier.

To add the OOM annotation, use NS_ABORT_OOM(computedsize).

> The driver reset case -should- be fixed (i.e. it should safely recover). But

This is definitely not the case. I can reliably trigger Firefox crashes while restarting my graphics card in nightly builds using "devcon restart <deviceID>". The exact signature varies. bp-3b919eab-a9ff-4889-b222-ff9cf2140620 is the most recent.

I think that we do recover if the card reset happens in between painting, but if it happens during painting we still end up trying to create surfaces on the "old" device and end up crashing in various places on null surfaces.

> Having said that, I have an idea, how about if the call fails, initially, we
> attempt to create a 64x64 surface. If for some reason the surface requested
> was very large this call will then succeed, and since we still have balance
> groups the amount of artifacting will be limited.

This sounds like a good thing to try.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> > The driver reset case -should- be fixed (i.e. it should safely recover). But
> 
> This is definitely not the case. I can reliably trigger Firefox crashes
> while restarting my graphics card in nightly builds using "devcon restart
> <deviceID>". The exact signature varies.
> bp-3b919eab-a9ff-4889-b222-ff9cf2140620 is the most recent.

This doesn't work on my main machine. I'll try it on some other machines. This -shouldn't- be the case as per the way things should work according to DirectX documentation, but who knows, there might be something special here.

> I think that we do recover if the card reset happens in between painting,
> but if it happens during painting we still end up trying to create surfaces
> on the "old" device and end up crashing in various places on null surfaces.

That's definitely the most likely (although DirectX docs suggests everything should succeed until your next presentation or cooperative level tests) scenario, but even if that is still broken, the scenario should be fairly rare even when a driver reset does occur.

> > Having said that, I have an idea, how about if the call fails, initially, we
> > attempt to create a 64x64 surface. If for some reason the surface requested
> > was very large this call will then succeed, and since we still have balance
> > groups the amount of artifacting will be limited.
> 
> This sounds like a good thing to try.

I'll create a separate bug and add a patch.
Depends on: 1028491
I am experiencing this issue (cf crash 3a7d6cdb-6e69-4af7-ae72-789862140622) with 31.
This bug came up during the sign off of beta5. 
Bas, since it seems that you have been working on,  I assigned you this bug. It would be nice to have a fix for this beta6 next week.  Thanks
Assignee: nobody → bas
(In reply to Sylvestre Ledru [:sylvestre] from comment #12)
> This bug came up during the sign off of beta5. 
> Bas, since it seems that you have been working on,  I assigned you this bug.
> It would be nice to have a fix for this beta6 next week.  Thanks

I just want to make one thing very clear. What's happening here is an OOM from everything I've been able to tell. What I'm doing is -not- going to prevent us from crashing. I'm simply fixing a stack trace that's common in the situation of an OOM from actually more obviously showing it's an OOM.

The STR in this bug will still lead us to OOM, that OOM will still most likely occur in the graphics area (simply because graphics regularly requests relatively large blocks of contiguous memory). Everything will stay the same from the end user perspective, but our data will be better.

The only thing that can be done to really 'fix' this bug is to reduce the memory usage of google maps, and as far as about:memory tells us, that fix is not likely to be found in graphics.
Need-info to make sure this gets read.
Flags: needinfo?(sledru)
OK. Thanks for the information.
For curiosity (and sorry if the question is stupid) is that an OOM in the graphic card memory or of the system?
Would it help to let the Google Maps developers now that they are using a lot of memory and that should be improved it?
Flags: needinfo?(sledru)
(In reply to Sylvestre Ledru [:sylvestre] from comment #15)
> OK. Thanks for the information.
> For curiosity (and sorry if the question is stupid) is that an OOM in the
> graphic card memory or of the system?
> Would it help to let the Google Maps developers now that they are using a
> lot of memory and that should be improved it?

We're running out of address space on the system. Since some drivers can map GPU memory into your address space that can be caused partially by GPU memory usage.

The website does seem to use a fair amount of GPU memory (about 300 MB on nightly). But also a good amount of JS memory.

The total memory usage appears to be a little better for me on nightly than on release, fwiw.
So it seems I'm confusing two things. I could repro an OOM crash on streetview, but the original report in this bug is a linux only crash. I'll look into that separately, this shouldn't be related to OOM because they occur on 64-bit linux builds.
I have been able to reproduce on a different machine:
https://crash-stats.mozilla.com/report/index/431cfa68-79a0-40a2-9f9e-312d82140630
Also GNU/Linux Debian 64 but NVidia instead of Intel gfx card.
No longer depends on: 1028491
So far I haven't been able to reproduce the crash on my main computer, and only got my secondary laptop to entirely freeze (not just firefox). I'll try again on the secondary laptop and also on another computer with less RAM.

In the mean time, I want to rule out the possibility of mDT to be null so here is a little patch that asserts in every place we assume mDT to be non-null, in the hope to get more useful stack traces if it's the case, and if anything, let me concentrate on the problem rather than try to follow every code paths to check if it is possible.
Attachment #8449580 - Flags: review?(bas)
(In reply to Nicolas Silva [:nical] from comment #19)
> Created attachment 8449580 [details] [diff] [review]
> Add null-check assertions
> 
> So far I haven't been able to reproduce the crash on my main computer, and
> only got my secondary laptop to entirely freeze (not just firefox). I'll try
> again on the secondary laptop and also on another computer with less RAM.
> 
> In the mean time, I want to rule out the possibility of mDT to be null so
> here is a little patch that asserts in every place we assume mDT to be
> non-null, in the hope to get more useful stack traces if it's the case, and
> if anything, let me concentrate on the problem rather than try to follow
> every code paths to check if it is possible.

Maybe I was unclear.. we know that mDT is null.. If you look at the crash-address it's 0x0, which is the vtable ptr location for an object at 0x0. Then when you look at the line (http://hg.mozilla.org/releases/mozilla-aurora/annotate/fe74f647e986/gfx/thebes/gfxContext.cpp#l2128) GetDeviceTransform() is non-virtual and mStateStack or any offset there-of would never be 0x0. So the only candidate for being null here is aDT.

aDT we know from http://hg.mozilla.org/releases/mozilla-aurora/annotate/fe74f647e986/gfx/thebes/gfxContext.cpp#l1597 to be equal to mDT.

All we really need to figure out is why CreateSimilarDT fails.
Comment on attachment 8449580 [details] [diff] [review]
Add null-check assertions

Review of attachment 8449580 [details] [diff] [review]:
-----------------------------------------------------------------

Most of these checks seem excessive. I've marked the one that seems sane, but most of them I think have no real point, all of those will crash on the next line at 0x0 anyway when it tries to access the vtable ptr for mDT as it calls GetBackendType().

::: gfx/thebes/gfxContext.cpp
@@ +109,5 @@
>    , mDT(aTarget)
>    , mOriginalDT(aTarget)
>  {
>    MOZ_COUNT_CTOR(gfxContext);
> +  MOZ_ASSERT(aTarget);

this check is sensible

@@ +2298,5 @@
>    CurrentState().drawTarget = newDT;
>    CurrentState().deviceOffset = clipBounds.TopLeft();
>  
>    mDT = newDT;
> +  MOZ_ASSERT(mDT);

This check is sensible, but bug 1028491 has a better solution for this portion of the code.
Attachment #8449580 - Flags: review?(bas) → review-
(In reply to Bas Schouten (:bas.schouten) from comment #21)
> Comment on attachment 8449580 [details] [diff] [review]
> Add null-check assertions
> 
> Review of attachment 8449580 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Most of these checks seem excessive. I've marked the one that seems sane,
> but most of them I think have no real point, all of those will crash on the
> next line at 0x0 anyway when it tries to access the vtable ptr for mDT as it
> calls GetBackendType().

Or other functions on the DT :).
(In reply to Bas Schouten (:bas.schouten) from comment #21)
> Comment on attachment 8449580 [details] [diff] [review]
> Add null-check assertions
> 
> Review of attachment 8449580 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Most of these checks seem excessive. I've marked the one that seems sane,
> but most of them I think have no real point, all of those will crash on the
> next line at 0x0 anyway when it tries to access the vtable ptr for mDT as it
> calls GetBackendType().

True. My incentive to over-assert everywhere is to make it extra clear what the contract is (mDT must never be null if mCairo is null, and if it happens it is the responsibility of the callers, not a forgotten null-check in gfxContext), so that the reader doesn't need to go through the whole class to understand it. Some of these assertions serve the purpose of documenting more than actually catching invalid states.
This is on the computer that doesn't crash but on the test case I just ran into this situation:

I haven't yet figured out why, but there is a canvas2d (of size 845x762) that does a DrawImage with a very big clip (extents = { x:350, y:445, w:631559, y:379313}) which causes cairo to try to allocate an A8 mask of the size of the clip extents (and fail at it).

So we end up here: http://dxr.mozilla.org/mozilla-central/source/gfx/cairo/cairo/src/cairo-surface-fallback.c?from=_composite_trap_region#502

with a 845 by 762 destination surface (the canvas), an enormous clip path and a failed attempt at creating the mask.

Perhaps we could teach cairo to limit the mask to the size of the destination.
fwiw, I observed the crash on my computer and know how to compile m-c with a patch applied. I'm happy to help reproducing or narrowing it down.
Still in the same stack, I forgot to mention that the src pattern has a surface of size 514 by 514 (so it doesn't explain the crazy clip extents).
It is getting more and more interesting.
The Canvas2D and gfxContext issues appear to be related.

I finally managed to reproduce the crash in gfxContext. When this happens we fail to create a similar draw target of a DrawTarget which cairo status is CAIRO_STATUS_INVALID_SIZE.

After inserting some checks in DrawTargetCairo I identified that the DrawTargetCairo that does not belong to a CanvasRenderingContext2D first gets into an invalid state inside of BasicCanvasLayer::Paint when we are basically painting the canvas into the gfxContex. The pushClip crash happens later.
Depends on: 1034584
Depends on: 1034593
Just for the record, this type of the bug, with the new patches in bug 1028491 for reporting OOM, would now be reported as an OOM, we'll need to keep an eye out for this OOM stack trace as to make sure we don't start missing other real bugs.
(In reply to Bas Schouten (:bas.schouten) from comment #28)
> Just for the record, this type of the bug, with the new patches in bug
> 1028491 for reporting OOM, would now be reported as an OOM, we'll need to
> keep an eye out for this OOM stack trace as to make sure we don't start
> missing other real bugs.

We are keeping an eye on any OOM signatures of significance all the time anyhow. Can that patch be uplifted to more stable branches where we'd possibly get better data?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #29)
> (In reply to Bas Schouten (:bas.schouten) from comment #28)
> > Just for the record, this type of the bug, with the new patches in bug
> > 1028491 for reporting OOM, would now be reported as an OOM, we'll need to
> > keep an eye out for this OOM stack trace as to make sure we don't start
> > missing other real bugs.
> 
> We are keeping an eye on any OOM signatures of significance all the time
> anyhow. Can that patch be uplifted to more stable branches where we'd
> possibly get better data?

It should be fairly safe, yes.
Summary: Google Maps Street View crashes in gfxContext::PushClipsToDT(mozilla::gfx::DrawTarget*) → Google Maps Street View displays a black screen
Sylvestre, why did you change the summary away from what the discussion in this bug was about all along? If we don't crash any more, maybe this is a dupe of one of the bugs that actually landed the fixes for crashing.

The graphics glitches are AFAIK bug 1034593 and not the one here.
Summary: Google Maps Street View displays a black screen → Google Maps Street View displays a black screen on Linux
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #31)
> Sylvestre, why did you change the summary away from what the discussion in
> this bug was about all along? If we don't crash any more, maybe this is a
> dupe of one of the bugs that actually landed the fixes for crashing.
> 
> The graphics glitches are AFAIK bug 1034593 and not the one here.

Restoring the original bug summary until the current status is determined.
Summary: Google Maps Street View displays a black screen on Linux → Google Maps Street View crashes in gfxContext::PushClipsToDT(mozilla::gfx::DrawTarget*)
Gordon, we discussed about this on IRC. I updated the title to match the release notes (we have this bug in the known issue section).
Moreover, the initial bug "Google Maps Street View crashes in gfxContext::PushClipsToDT(mozilla::gfx::DrawTarget*)" has been fixed by 1034584.
Summary: Google Maps Street View crashes in gfxContext::PushClipsToDT(mozilla::gfx::DrawTarget*) → Google Maps Street View displays a black screen on Linux
Nicolas, is there more to happen with the patch here or is this bug fixed (and therefore a dupe nowadays) by one of the dependencies?
Flags: needinfo?(nical.bugzilla)
Summary: Google Maps Street View displays a black screen on Linux → Google Maps Street View crashes in gfxContext::PushClipsToDT(mozilla::gfx::DrawTarget*)
OK. I won't waste any more time fighting. Using 1034593 for the release notes.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #34)
> Nicolas, is there more to happen with the patch here or is this bug fixed
> (and therefore a dupe nowadays) by one of the dependencies?

Thi bug is fixed by the combination of its two dependencies.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(nical.bugzilla)
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
I could not reproduce the original issue on Ubuntu 13.04 x64, but I also cannot see any new crashes in Socorro, with [@ gfxContext::PushClipsToDT(mozilla::gfx::DrawTarget*)] or [@ mozilla::gfx::DrawTargetD2D::MaskSurface(mozilla::gfx::Pattern const&, mozilla::gfx::SourceSurface*, mozilla::gfx::PointTyped<mozilla::gfx::UnknownUnits>, mozilla::gfx::DrawOptions const&)], since the 32.0 build in July 22nd (ID: 20140722195627).
Status: RESOLVED → VERIFIED
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.