Closed Bug 1028491 Opened 5 years ago Closed 5 years ago

Abort for OOM when PushNewDT fails badly

Categories

(Core :: Graphics, defect)

x86_64
Windows 8.1
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- fixed
firefox32 --- fixed
firefox33 --- fixed

People

(Reporter: bas.schouten, Assigned: bas.schouten)

Details

(Whiteboard: [qa-])

Attachments

(1 file)

When PushNewDT tries to create a surface, currently we will soon crash if the surface creation fails. The most common reason for this surface creation to fail, is an OOM situation.

What we could try is to create a 'reasonable size' surface, this would balance all the rendering calls and everything, so some artifacting would occur due to incomplete drawing. But things would generally be 'okay'. If this fails, we're in a more catastrophic failure mode - either we ran completely out of memory or we can't create surfaces anymore. In this case we can abort for OOM, since this will be the most common cause.
This patch implements the suggested approach.
Attachment #8443853 - Flags: review?(jmuizelaar)
Blocks: 1027103
Comment on attachment 8443853 [details] [diff] [review]
Try to create a reasonably sized surface, otherwise OOM abort.

Review of attachment 8443853 [details] [diff] [review]:
-----------------------------------------------------------------

I'm worried that this will break some invariants and cause other problems down the line. Can you assuage those fears?
Tracking because it blocks the resolution of a critical bug
(In reply to Jeff Muizelaar [:jrmuizel] from comment #2)
> Comment on attachment 8443853 [details] [diff] [review]
> Try to create a reasonably sized surface, otherwise OOM abort.
> 
> Review of attachment 8443853 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'm worried that this will break some invariants and cause other problems
> down the line. Can you assuage those fears?

Not really, except that the behavior here will be strictly better than any alternative :). There will be some drawing artifacts if the surface being requested is too big for memory, but strictly for this pushed group. Once it's popped things are back to normal again.
Removing the dependency + tracking (cf comment #17 on bug 1027103)
Attachment #8443853 - Flags: review?(jmuizelaar) → review+
https://hg.mozilla.org/mozilla-central/rev/466653e7d7eb
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Bas, you said in bug 1027103 that this "should be fairly safe" to uplift, could you request approval and get it into Aurora or possibly even Beta so we get better data for those OOM crashes? Thanks!
Flags: needinfo?(bas)
Comment on attachment 8443853 [details] [diff] [review]
Try to create a reasonably sized surface, otherwise OOM abort.

Approval Request Comment
[Feature/regressing bug #]: Non-regression
[User impact if declined]: More confusing data surrounding PushClipsToDT
[Describe test coverage new/current, TBPL]: Nightly
[Risks and why]: Missing actual bugs due to OOM false positives
[String/UUID change made/needed]: None.
Attachment #8443853 - Flags: approval-mozilla-beta?
Attachment #8443853 - Flags: approval-mozilla-aurora?
Flags: needinfo?(bas)
Attachment #8443853 - Flags: approval-mozilla-beta?
Attachment #8443853 - Flags: approval-mozilla-beta+
Attachment #8443853 - Flags: approval-mozilla-aurora?
Attachment #8443853 - Flags: approval-mozilla-aurora+
Bas, please advise if there's a good way for QA to test this change.
Whiteboard: [qa-]
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #11)
> Bas, please advise if there's a good way for QA to test this change.

I think the only thing we can test is that new signatures might appear and PushClipsToDT should drop.
From very early data on 31.0b8, it looks like between this and bug 1034584, the PushClipsToDT signature practically vanished on both desktop and Android!
I also can't see significant new signatures appearing, which surprises me.
I can consistently hit this abort by going to maps.bing.com and zooming in a few levels. I get an "OOM | small" of 16k: bp-d5ceda8f-a67d-4099-ae94-544652140731

The strange thing is that it happens even in a fresh launch with tons of virtual address space and physical memory available. In the dump above, I have 1875MB free VA, in large contiguous blocks. I don't understand how a 16k allocation failed.

Bas: Is this abort supposed to mean an "ordinary" OOM? If so, then I am completely puzzled how it can happen. Or, is there something special about this type of failure, e.g. related to gfx memory or something?

KaiRo: Although you didn't see any new signatures, did you see a rise in OOM|small?
Flags: needinfo?(kairo)
Flags: needinfo?(bas)
(In reply to David Major [:dmajor] from comment #14)
> KaiRo: Although you didn't see any new signatures, did you see a rise in
> OOM|small?

We did for sure on Android, it's very much possible there was a rise elsewhere as well.
(I don't have much time to look at that though, given I'm on the airport on the way to some vacation.)
Flags: needinfo?(kairo)
I'm hitting what I believe to be this problem constantly on Android

#0  0x7e05b500 in NS_ABORT_OOM (aSize=<optimized out>) at /home/aaronmt/Mozilla/Fennec/xpcom/base/nsDebugImpl.cpp:625
#1  0x7e6ac64e in gfxContext::PushNewDT (this=this@entry=0x8556f710, content=content@entry=COLOR_ALPHA) at /home/aaronmt/Mozilla/Fennec/gfx/thebes/gfxContext.cpp:1636

I continue to crash on Nightly nonstop on my Nexus 5 (Android 4.4.4) browsing Reddit
Why wasn't this just a debug assertion?
Ah, nm, I understand. I don't get why this would cause Aaron's problem, though, unless he was about to OOM anyway.
There's other possible causes, the best way to check what's going on is by looking at why the DrawTarget creation is failing.
Flags: needinfo?(bas)
You need to log in before you can comment on or make changes to this bug.