Closed Bug 595593 Opened 11 years ago Closed 10 years ago

Repeated use of getContext on canvas elements causes crash

Categories

(Core :: JavaScript Engine, defect)

Hardware: x86, Windows 7
Type: defect
Priority: Not set
Severity: critical

Tracking


RESOLVED FIXED
Tracking Status
blocking2.0 --- -
blocking1.9.2 --- .17+
status1.9.2 --- .17-fixed
blocking1.9.1 --- .19+
status1.9.1 --- .19-fixed

People

(Reporter: pleasestand, Assigned: cjones)

References


Details

(Keywords: crash, testcase, Whiteboard: [sg:dos (trunk)][sg:critical? (1.9 branches)])

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.9) Gecko/20100824 Firefox/3.6.9
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.9) Gecko/20100824 Firefox/3.6.9

Creating new canvas elements and obtaining the "2d" contexts in a tight loop causes both Firefox 3.6.9 and the 11-Sep-2010 nightly to crash.

Reproducible: Always

Steps to Reproduce:
1. Open the javascript: URL provided.
Actual Results:  
After several seconds, Firefox/Minefield crashes.

Expected Results:  
After several more seconds, the script should have timed out, which would have made stopping it possible.
It seems to be an OOM crash, judging by my Windows XP Task Manager (the page file fills up) and a WinDbg stack trace:

0012d010 781399cd MOZCRT19!arena_malloc_small(struct arena_s * arena = 0xffffffff, unsigned int size = 0x20, int zero = 0)+0x6c [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\jemalloc.c @ 3750]
0012d020 004d105a MOZCRT19!malloc(unsigned int size = 0x1057da64)+0x3d [e:\builds\moz2_slave\mozilla-central-win32-nightly\build\obj-firefox\memory\jemalloc\crtsrc\jemalloc.c @ 5873]
This is similar to that GLContext thing, but with just pure memory exhaustion....

_This_ case really needs to be handled via GC and the like, no?
Target Milestone: --- → flash10
If we had xpcomgc, maybe; I don't see how JS GC can handle it without it being involved in our OOM handling -- we only use a tiny amount of the JS heap.  Only way we could solve this is if we were able to do a GC during malloc() failure.

(Note that canvas defaults to 300x200 [i have no idea why] if no dimensions are specified, so we're using 300*200*4 ~= 1/4MB per canvas element here.)
Target Milestone: flash10 → ---
The JS GC has an "allocated memory" tracker.  We could tell it about the memory we're allocating here, the way built-in JS objects do (rough sketch below).

> Note that canvas defaults to 300x200 [i have no idea why]

It's 300x150, which is the default replaced element size in CSS, because that's how big IE4/NS4 made iframes by default.
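
(For concreteness, the "tell the GC about it" suggestion above would look roughly like the sketch below.  JS_updateMallocCounter() is a real JSAPI entry point; the helper name and the exact call site are illustrative only, not a patch.)

  #include "jsapi.h"    // for JS_updateMallocCounter()
  #include <stddef.h>

  // Report the canvas backing store to SpiderMonkey's malloc-bytes
  // heuristic so that churning through contexts builds up GC pressure.
  // Hypothetical helper; the real change would live in the canvas code.
  static void ReportCanvasBytesToGC(JSContext* cx, size_t width, size_t height)
  {
    // 4 bytes per pixel for an ARGB32 surface (~180 KB for the default
    // 300x150 canvas).
    JS_updateMallocCounter(cx, width * height * 4);
  }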
Yeah, and we are really trying to get rid of that allocated-memory counter because it's total clown shoes. It doesn't count fragmentation and is mostly just guessing for realloc(). What we really want is actual working-set measurement and management. I have an old patch for that if anyone cares.

For this specific bug, while I agree that good working-set management is great for avoiding OOM, OOM on large allocations should also get direct failure handling. Why do we crash here? What's the full stack?
Status: UNCONFIRMED → NEW
Ever confirmed: true
XtC4UaLL, want to attach the full stack?
Attached file WinDbg Log
I captured this log against Mozilla/5.0 (Windows NT 5.1; rv:2.0b10pre) Gecko/20110124 Firefox/4.0b10pre ID:20110124030332 (http://hg.mozilla.org/mozilla-central/rev/80266029824b).

Alternatively, here is a Breakpad crash ID:
bp-fb3059b5-7fbd-4a87-889c-446eb2110124 [@ mozalloc_handle_oom() ]
Large/variable allocations must use fallible_t and check the result.

/be
That seems to be crashing during the constructor of the 

  nsTArray<ContextState> mStyleStack;

member.  This is an infallible array initialized thusly:

  mStyleStack(20)

in the nsCanvasRenderingContext2D constructor (why is this not an auto array?  Who knows.)

In any case, sizeof(ContextState) is 88 bytes over here on Mac 64-bit; I'm not sure whether the 0x79 in that log is the correct size on Windows or just the debugger lying.  Either way, that's a 1760-byte allocation.  Does that count as "large"?  :(
Not large, obviously, but then why did we crash?

/be
The testcase runs an infinite loop doing the following:

1) createElement("canvas"): 80 bytes plus whatever allocations js-wrapping does.
2) getContext("2d"): 
          68 bytes for the object itself
        1760 bytes for the ContextState array
     180,000 bytes for the graphics surface backing store
             plus whatever allocations js-wrapping does

The image surface allocation is null-checked (e.g. see the fallible allocation in gfxImageSurface::gfxImageSurface); failure to allocate the image buffer is silent and doesn't throw back into JS.  Perhaps it should!
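
(To make "perhaps it should" concrete, here is a hedged sketch of propagating that failure as an nsresult so the DOM glue turns it into a JS exception.  The type and function names are illustrative stand-ins, not the actual gfxImageSurface/getContext code, and the error constants are redefined locally just to keep the sketch self-contained.)

  #include <new>        // std::nothrow
  #include <cstddef>
  #include <cstdint>

  typedef uint32_t nsresult;                        // local stand-in
  const nsresult NS_OK                  = 0;
  const nsresult NS_ERROR_OUT_OF_MEMORY = 0x8007000E;

  struct ImageSurface {
    unsigned char* mData;
    ImageSurface(size_t stride, size_t height)
      : mData(new (std::nothrow) unsigned char[stride * height]) {}
    ~ImageSurface() { delete[] mData; }
    bool HasData() const { return mData != nullptr; }
  };

  // Instead of silently keeping an unusable surface around, fail the
  // context setup, so the JS getContext("2d") call throws.
  nsresult InitBackingStore(size_t stride, size_t height, ImageSurface** out)
  {
    ImageSurface* surf = new (std::nothrow) ImageSurface(stride, height);
    if (!surf || !surf->HasData()) {
      delete surf;
      return NS_ERROR_OUT_OF_MEMORY;
    }
    *out = surf;
    return NS_OK;
  }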

Of the remaining allocations, the ContextState array allocation is most likely to fail, no?  So it would be the most likely place to crash.

The 180KB allocations make sure that we don't create enough JS garbage in the process of doing all this to trigger GC, I assume.
Do we know that the 3.6.x crash is safe?  It's not using infallible tarray.
A last resort could be to alloc and init the canvas context in a fallible factory function, and use a fallible tarray for mStyleStack (rough sketch below).
I suspect that on trunk making image surface init failure throw to JS would be quite enough.

I'm not sure what the answer to comment 12 is.  :(
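
(A rough sketch of the fallible-factory shape suggested above.  The class and method names are illustrative, not a real Gecko signature; the point is that every allocation that can fail happens inside Create(), which returns null instead of aborting.)

  #include <new>        // std::nothrow
  #include <cstddef>

  class CanvasContext2D {
  public:
    static CanvasContext2D* Create(int width, int height)
    {
      CanvasContext2D* ctx = new (std::nothrow) CanvasContext2D();
      if (!ctx)
        return nullptr;
      if (!ctx->Init(width, height)) {     // fallible style stack + surface setup
        delete ctx;
        return nullptr;                    // caller reports OOM to script
      }
      return ctx;
    }
    ~CanvasContext2D() { delete[] mData; }

  private:
    CanvasContext2D() : mData(nullptr) {}
    bool Init(int width, int height)
    {
      mData = new (std::nothrow) unsigned char[size_t(width) * size_t(height) * 4];
      return mData != nullptr;
    }
    unsigned char* mData;
  };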
Alright, so I think the real fix here is working set management. The 2d context is one example of a DOM node that allocates a small JS object header and with it a crap ton of C++ storage. The latter is never accounted for. I am sure there are other ways to trigger this. Just loop over something that allocates a WN that happens to be backed by a large-ish C++ struct. If the test is changed to keep the objects all alive, what do we do then? A GC couldn't save us. Again, I really think we need true working set management. I will refresh my patch. Triaging to XPConnect and taking.
Actually JS engine is probably more appropriate.
Assignee: nobody → general
Component: Canvas: 2D → JavaScript Engine
QA Contact: canvas.2d → general
Assignee: general → gal
> If the test is changed to keep the objects all alive, what do we do then?

Run out of memory in an honest-to-God way.  And if the script catches OOM exceptions, even my proposal from comment 11 doesn't help and we crash.  That's why we need content processes...  A web page that really wants to can _always_ eat up RAM, unless we start throwing uncatchable exceptions and disabling event delivery or some such.
Actually, we should hide this bug. The private comments give away that something is up, and there is a public test case. If it's safe, we can re-open.
Group: core-security
We're not redoing GC memory pressure feedback at this stage of the release! That is guaranteed to cause latent bugs to bite.

/be
Who said anything about "the release"?  This isn't a blocker; it's not even nominated.  We need to answer the question in comment 12, but that's all we need to do for 2.0 here.  I wouldn't even make the "image surface init fail should throw to JS" change for 2.0 at this point!
bz: why wouldn't we do what you proposed in comment 14 if we think there's a risk of exploitability here?

If we don't, let's not make this bug s-s!

/be
> bz: why wouldn't we do what you proposed in comment 14 

Because of compat worries?

This bug is s-s because we don't know the answer to your question in comment 21!
Someone needs to find out. If we can't find out, we should do what comment 14 says. We should probably just do that anyway instead of going back and forth on it here!

Something does not add up. This bug is a crash bug. It seems to have a specific fix: fallible allocation and OOM failure propagation. It *might* go away with a dreamy GC scheduling architecture from the future -- but it might not. We do not know yet. So we're to sit on an s-s bug that may have already leaked until the next Firefox release?

Let's please just fix the allocation site.

/be
I do not understand "compat worries". We already have OOM as an uncatchable failure. What code could possibly demand that it not happen for this bug's test?

/be
Brendan, we know this bug is not a security issue on trunk.

I just looked at the code.  Failure of that mStyleStack buffer allocation is OK for 1.9.2 in and of itself.  However, failure of the append in InitializeWithSurface is not; that needs to be fixed.  Also, failure of the append in Save() seems to be unsafe; it won't die right there, but after that, access to CurrentState() will index out of bounds (see the sketch at the end of this comment).

> What code could possibly demand that it not happen for this bug's test?

If we're unlucky, the answer is "facebook".  ;)  Seriously...

In any case, we need a 1.9.2 patch here to throw on those append failures.  I'd be willing to consider a patch to make trunk and maybe branch throw on image surface alloc failure.  It'd need to land before Friday.

Chris, want to work on this?
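
(For reference, a hedged sketch of the hazard and the shape of the fix being asked for.  This is not the attached patch; FallibleStack and the locally redefined error constants are stand-ins for nsTArray<ContextState> and the XPCOM types, and the real branch API differs in detail.)

  #include <cstdlib>
  #include <cstddef>
  #include <cstdint>

  typedef uint32_t nsresult;                        // local stand-in
  const nsresult NS_OK                  = 0;
  const nsresult NS_ERROR_OUT_OF_MEMORY = 0x8007000E;

  struct ContextState { /* stroke/fill style, transform, clip, ... */ };

  // Minimal stand-in for nsTArray<ContextState> with a fallible append.
  struct FallibleStack {
    ContextState* elems;
    size_t length, capacity;
    FallibleStack() : elems(nullptr), length(0), capacity(0) {}

    bool Append(const ContextState& s) {            // false on OOM, never aborts
      if (length == capacity) {
        size_t newCap = capacity ? capacity * 2 : 8;
        void* p = realloc(elems, newCap * sizeof(ContextState));
        if (!p)
          return false;
        elems = static_cast<ContextState*>(p);
        capacity = newCap;
      }
      elems[length++] = s;
      return true;
    }
    ContextState& Top() { return elems[length - 1]; }   // plays the CurrentState() role
  };

  // Before: Save() drops the append's return value; the real code also bumps
  // a save-count index, so later CurrentState() accesses land past the end
  // of the buffer.  After: check the append and hand OOM back to the caller,
  // so the script's save() throws instead of corrupting memory.
  nsresult Save(FallibleStack& styleStack, const ContextState& current)
  {
    if (!styleStack.Append(current))
      return NS_ERROR_OUT_OF_MEMORY;
    return NS_OK;
  }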
blocking1.9.2: --- → ?
bz: thanks for diagnosing. I still don't understand how OOMing can break a site, where indexing out of bounds makes that site work. Seems this is spot-fixable, in any case, and not something requiring GC-from-the-future -- confirm or deny.

/be
Yeah, I think fixing this particular thing doesn't require magic.
Dammit, what happened to bugzilla? I can't reassign to Chris. Andreas, can you do it?

/be
Assignee: gal → jones.chris.g
Wow, lots of excitement in this bug. I just made a patch to feed this one allocation into the engine's memory pressure heuristic (mallocBytes). But if there is a WebGL only fix, even better. Handing over to Chris.
I didn't see any other problems beyond what Boris pointed out in comment 25.  In SetDimensions(), if CheckSurfaceSize() fails, the context soldiers on with the invalid width/height values.  That makes me a bit nervous, but it looks like the mValid check is done properly where the dims are used, so no problems there (a sketch of that pattern is at the end of this comment).

This code has a curious mix of vanilla operator new(), sometimes null-checked and sometimes not (it doesn't matter; the null checks are pointless), and operator new(nothrow) with meaningful null checks.  Just for kicks, I ran the testcase in an rlimit-ed setup a few times to see what cropped up.  I saw
 - crashing below nsMemoryImpl::FlushMemory, below a failed nsTArray::SetCapacity (for mStyleStack)
 - std::bad_alloc from NS_NewCSSParser (for mCSSParser)
 - std::bad_alloc from gfxPlatformGtk::CreateOffscreenSurface (below SetDimensions())
from ~10 runs.

Based on failed SetCapacity() not being only theoretical, I'd definitely blocking+ this for 1.9.2, if it wasn't otherwise planned to be.  I don't particularly care about the other two for 1.9.2.
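
(For readers following along, the mValid pattern referred to above looks roughly like this.  Names and the size limit are simplified stand-ins, not the actual nsCanvasRenderingContext2D code.)

  #include <cstdint>

  typedef uint32_t nsresult;                        // local stand-in
  const nsresult NS_OK            = 0;
  const nsresult NS_ERROR_FAILURE = 0x80004005;

  class Context2D {
  public:
    Context2D() : mValid(false), mWidth(0), mHeight(0) {}

    nsresult SetDimensions(int32_t width, int32_t height)
    {
      // CheckSurfaceSize() stands in for the real size/overflow check.
      if (!CheckSurfaceSize(width, height)) {
        mValid = false;                 // remember the failure...
        return NS_ERROR_FAILURE;
      }
      mWidth = width;
      mHeight = height;
      mValid = true;
      return NS_OK;
    }

    nsresult FillRect()
    {
      if (!mValid)                      // ...and refuse to draw everywhere it matters
        return NS_ERROR_FAILURE;
      return NS_OK;                     // the real code would touch the backing store here
    }

  private:
    static bool CheckSurfaceSize(int32_t w, int32_t h)
    {
      return w > 0 && h > 0 && w <= 32767 && h <= 32767;  // illustrative limit
    }
    bool mValid;
    int32_t mWidth, mHeight;
  };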
Attachment #507055 - Flags: review?(vladimir)
What about the 1.9.1 branch? Looks like the patched code is the same.
blocking1.9.1: --- → ?
blocking1.9.2: ? → .15+
Keywords: crash, testcase
Yes, we should take this on both.
blocking1.9.1: ? → .18+
Whiteboard: [sg:dos (trunk)][sg:critical? (1.9 branches)]
Comment on attachment 507055 [details] [diff] [review]
Check for failed tarray realloc

Looks fine, though please nuke the bogus { }'s around the single return :-)

Weird that this particular nsTArray is an issue; this feels a bit like wallpaper, but better than nothing.
Attachment #507055 - Flags: review?(vladimir) → review+
Alloc failures for that particular tarray will cause nsCanvasRenderingContext2D to read outside the array bounds.  There are a gazillion other OOM sites here that are just safe crashers.
blocking2.0: --- → -
Note that I do think we should have a separate patch (in a separate bug?) for throwing on image surface alloc failure.
Boris or Andreas, one of you guys want to file the followup on the surface allocs?  I'm not sure what the final plan is for that.
This was checked in on branches a while ago (comment 36)
Group: core-security
Blocks: 650079
Flags: in-testsuite-