Firefox 23/Aurora/Nightly crashes on executing an Emscripten application.

RESOLVED FIXED in Firefox 26

Status

()

defect
--
critical
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: jujjyl, Assigned: djvj)

Tracking

(4 keywords)

23 Branch
mozilla26
x86
Windows 7
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox22 unaffected, firefox23 wontfix, firefox24 affected, firefox25 affected, firefox26 fixed, firefox-esr17 unaffected, b2g18 unaffected)

Details

()

Attachments

(2 attachments, 2 obsolete attachments)

User Agent: Opera/9.80 (Windows NT 6.1; WOW64) Presto/2.12.388 Version/12.16

Steps to reproduce:

Run this on a Windows system: http://clb.demon.fi/dump/bugs/emcc_crash/Tests_d.html



Actual results:

Tested on Firefox 23 Stable, Firefox 24.0a1 (2013-06-16) Nightly, Firefox 26.0a1 (2013-08-14) Nightly, Firefox 25.0a2 (2013-08-14) Aurora, and they all crash.

Firefox 22 Stable works.

See a screenshot of the crash site: https://dl.dropboxusercontent.com/u/40949268/Bugs/Greenshot_2013-08-15_08-21-49.png

Also, a partial call stack of an old manually built Nightly version (Firefox 24.0a1 (2013-06-16) Nightly): https://dl.dropboxusercontent.com/u/40949268/Bugs/Greenshot_2013-08-15_08-29-13.png


Expected results:

The test suite should finish without crashing the browser.
Posted file stack trace
Keywords: crash
The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=97cfc16ba5dc&tochange=c232bec6974d
Severity: normal → critical
Status: UNCONFIRMED → NEW
Component: Untriaged → General
Ever confirmed: true
Product: Firefox → Core
Hardware: x86_64 → x86
Version: 26 Branch → 23 Branch
Regression window(m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/b31bfbb2bdc4
Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20130403 Firefox/23.0 ID:20130403083520
Crash:
http://hg.mozilla.org/mozilla-central/rev/b5cb88ccd907
Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20130403 Firefox/23.0 ID:20130403084327
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b31bfbb2bdc4&tochange=b5cb88ccd907

Regression window(ionmonkey)
Good:
http://hg.mozilla.org/projects/ionmonkey/rev/ff681c5eadaf
Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20130320 Firefox/22.0 ID:20130320040225
Crash:
http://hg.mozilla.org/projects/ionmonkey/rev/280e5ed3f0b7
Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20130321 Firefox/22.0 ID:20130321040202
Pushlog:
http://hg.mozilla.org/projects/ionmonkey/pushloghtml?fromchange=ff681c5eadaf&tochange=280e5ed3f0b7
Assignee: nobody → general
Component: General → JavaScript Engine
Assignee: general → kvijayan
Group: core-security
Spent most of the day tracking this down with Jukka, and I think I know what's happening.

There is a major issue and a minor issue.  Major issue first:

Baseline tries to compile a script with a LOT of locals (more than 16k, meaning it allocates a stack frame of size 128k or so), succeeds, and promptly crashes when it tries to execute, presumably with a stack range issue.

Baseline needs to limit the number of locals in scripts it'll compile to something reasonable, like a 128 or so.

The minor issue:

Jukka first encountered this issue when testing some emscripten compiled code.  The function in question should have been handled by Odin and never been run by Baseline.  Should look into why this is happening.
An updated test case of the original that we modified with Kannan to contain a bit of debug logging is available at URL

http://clb.demon.fi/dump/bugs/emcc_crash_with_debuglogging/Tests_d.html?1&1&TestFileNormalizePath

Note the GET parameters, which are convenient to limit the test suite to run only the crashing test.

The crash occurs as soon as the function __Z30TestFunc_TestFileNormalizePathR4Test (line 28309) gets optimized by baseline JIT. This occurs in the while(1) loop (line 28315) when label==69.

I wonder if Odin has some parameter that it will not try to compile the function if it's too long? Luke, are you interested in taking a peek, is that expected behavior? Note that this Tests_d.html was built before Alon landed the outlining feature to the emscripten compiler.
Added Luke Wagner as CC. Feel free to ignore if this is not related to your area :)
Sure: Odin only kicks in when there is a "use asm" directive.  It would appear there is none in this testcase.  Perhaps you have an old build of Emscripten (b/c it should build asm.js by default now)?  Note also, when Odin sees "use asm", it'll emit message to the Web Console (on success or failure); if you don't see anything, than either there was no "use asm" or it's not in the right place (the first statement of a function, although we try to warn about this too).
Oh, sorry Luke. I knew about the "use asm" directive, but never happened to check it for this file. 

Also, I now realize I had gone and disabled asm.js altogether to workaround https://github.com/kripken/emscripten/issues/1423 , but it's now been fixed and I think I can restore asm.js builds for MathGeoLib unit tests. So the omission of asm.js in that file is as expected. Sorry for the noise :)
As an incredibly wild attempt, I just took the offending function and stripped it to a .html file of its own, and seems that I got lucky, and the crash is reproduced there.

See the file http://clb.demon.fi/dump/bugs/emcc_crash_minimal/crash.html
Posted patch fix-bug-905523.patch (obsolete) — Splinter Review
Simple fix.  Testing.
Kannan: I was thinking that this might not be a great fix since it'll make the performance cliff for functions with many variables (a pretty common occurrence in Emscripten code) much bigger.  Does the JIT have some assumption like "I can always push one frame without stack-overflow-checking" that is being invalidated by large frames?
Is this always going to result in a controlled crash, or could you end up bypassing some kind of stack overflow check?
fwiw the testcase in comment 9 does not crash me in Nightly on a Win 7 64-bit system.
(In reply to Luke Wagner [:luke] from comment #11)
> Kannan: I was thinking that this might not be a great fix since it'll make
> the performance cliff for functions with many variables (a pretty common
> occurrence in Emscripten code) much bigger.  Does the JIT have some
> assumption like "I can always push one frame without
> stack-overflow-checking" that is being invalidated by large frames?

Baseline just does the stack overflow check AFTER initializing the locals in the frame.  We may be able to move that to before initialization, but I'm not positive if it's OK to do that with an un-initialized frame (moving it to before would also move it before scope chain initialization for the frame).

After talking with Jan, this seems like it should be OK, since we can just keep the frame size small and pretend that it has zero locals when doing the stack check, and then move forward from there.
(In reply to Kannan Vijayan [:djvj] from comment #14)
Great!
Since I'm fixing this for baseline, I wonder if a similar issue exists for Ion.  Ion adds an OverRecursed check on frame entry, but one can imagine a particularly large expression inside the ion code (forcing a large number of non-overlapping spills) causing a single frame to extend past the threshold between the ionStackTop and the real stack limit.

As far as I can tell, Ion doesn't impose any real limits on locals or maximum stack depth.
Posted patch fix-bug-905523.patch (obsolete) — Splinter Review
More robust fix.  Passes jit-tests on 32-bit and 64-bit linux.  Running in try now.

Changes:
1. StackCheck call moved before local slots get initialized in the frame.
2. StackCheck changed to be predictive.  We take the frame size, subtract it from the current SP, and check that against the ionStackLimit.
3. Noticed that intialization of locals to UndefinedValue was a fully unrolled loop.  For a script with 10k locals, we'd emit 10k PushValues.  Not good.  This was changed to a partially unrolled loop (4 pushes per iteration) in cases where more than 4 locals are present.
4. Minor changes to the way ScopeChains are initialized in Baseline jit.  For Eval and global scripts, the scope chain is passed in using R1.scratchReg().  Moving the StackCheck to happen before the ScopeChain initialization was causing this to get clobbered.  The behaviour has been changed to store any scope chain in the frame immediately (and later updating it if it needs updating), instead of keeping it live in R1.scratchReg() until scopeChain initialization code is reached.
Attachment #796687 - Attachment is obsolete: true
Comment on attachment 796901 [details] [diff] [review]
fix-bug-905523.patch

Review of attachment 796901 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/VMFunctions.cpp
@@ +135,5 @@
> +    //
> +    // Note that we can reach here if ionStackLimit is MAXADDR, but interrupt
> +    // has not yet been set to 1. That's okay; it will be set to 1 very shortly,
> +    // and in the interim we might just fire a few useless calls to
> +    // CheckOverRecursed.

This documentation has just been fixed to reflect the intent of this function.
This patch still wasn't fixing the windows crash.  Looked into it a bit more and discovered that I hadn't handled the OSR case for frames being pushed onto the C stack.

I fixed the handling of that case, and it still crashed.  So I changed it to print out the following info before any frame was pushed: the current stack pointer, the tentative stack pointer after the frame push, and the result of js::GetNativeStackLimit.

Debugger shows that the "tentative stack pointer" is still within the native stack limit, but the stack address space is not mapped in at all.

I'm wondering if windows keeps a "protected zone" of mapped stack space, which, when written into, causes it to map more pages into the stack region.  If we bypass this protected zone entirely with a single huge frame, it might treat it as a bug (or an exploit attempt), and crash.

Testing this hypothesis by "filling" OSR stack frames incrementally when they're pushed.
Several fixes:

1. When Baseline pushes the frame locals on entry, it used to emit a fully unrolled loop.  (i.e. 500 locals meant 500 separate PUSH instructions emitted).  This has been changed to do a partially unrolled loop, with a maximum of 4 pushes per iteration.  Reduces jitcode size on functions with large numbers of locals.

2. Stack check has been moved to BEFORE the baseline frame is pushed, not after.  Also, stack check has been modified to operate using the "expected stack pointer after pushing the frame" instead of the "current stack pointer".

3. Baseline OSR entry has been modified to do the above as well.

4. During baseline OSR, we blindly subtract the frame size from the stack pointer, and then do a VMCall (passing a pointer into the frame) to allow a C++ function to initialize the baseline frame from the interpreter frame.  This code has been changed so that in windows, the "frame memory" is touched every 1k bytes, to ensure that the stack memory is paged in.
Attachment #796901 - Attachment is obsolete: true
Attachment #800987 - Flags: review?
Attachment #800987 - Flags: review? → review?(efaustbmo)
Comment on attachment 800987 [details] [diff] [review]
fix-bug-905523.patch

Review of attachment 800987 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. I am happy to see that Jan has also thought about whether calling out with the lied-about frame size is safe. r=me.

::: js/src/jit/arm/Trampoline-arm.cpp
@@ +242,5 @@
> +        // Can't push large frames blindly on windows.  Touch frame memory incrementally.
> +        masm.ma_lsl(Imm32(3), numStackValues, scratch);
> +        masm.subPtr(scratch, framePtr);
> +        {
> +            masm.ma_sub(sp, Imm32(1024), scratch);

Can we change this value to 4095? It seems silly to check less than PAGE_SIZE increments, and I am pretty sure there's an edgecase with being exactly at the edge of the mapped page and using PAGE_SIZE exactly.

::: js/src/jit/x64/Trampoline-x64.cpp
@@ +179,5 @@
> +        masm.lshiftPtr(Imm32(3), scratch);
> +        masm.subPtr(scratch, framePtr);
> +        {
> +            masm.movePtr(rsp, scratch);
> +            masm.subPtr(Imm32(1024), scratch);

Again, this value as well.

::: js/src/jit/x86/Trampoline-x86.cpp
@@ +172,5 @@
> +        masm.shll(Imm32(3), scratch);
> +        masm.subPtr(scratch, framePtr);
> +        {
> +            masm.movePtr(esp, scratch);
> +            masm.subPtr(Imm32(1024), scratch);

And here.
Attachment #800987 - Flags: review?(efaustbmo) → review+
Comment on attachment 800987 [details] [diff] [review]
fix-bug-905523.patch

Review of attachment 800987 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. I am happy to see that Jan has also thought about whether calling out with the lied-about frame size is safe. r=me.

::: js/src/jit/arm/Trampoline-arm.cpp
@@ +242,5 @@
> +        // Can't push large frames blindly on windows.  Touch frame memory incrementally.
> +        masm.ma_lsl(Imm32(3), numStackValues, scratch);
> +        masm.subPtr(scratch, framePtr);
> +        {
> +            masm.ma_sub(sp, Imm32(1024), scratch);

Can we change this value to 4095? It seems silly to check less than PAGE_SIZE increments, and I am pretty sure there's an edgecase with being exactly at the edge of the mapped page and using PAGE_SIZE exactly.

::: js/src/jit/x64/Trampoline-x64.cpp
@@ +179,5 @@
> +        masm.lshiftPtr(Imm32(3), scratch);
> +        masm.subPtr(scratch, framePtr);
> +        {
> +            masm.movePtr(rsp, scratch);
> +            masm.subPtr(Imm32(1024), scratch);

Again, this value as well.

::: js/src/jit/x86/Trampoline-x86.cpp
@@ +172,5 @@
> +        masm.shll(Imm32(3), scratch);
> +        masm.subPtr(scratch, framePtr);
> +        {
> +            masm.movePtr(esp, scratch);
> +            masm.subPtr(Imm32(1024), scratch);

And here.
https://hg.mozilla.org/mozilla-central/rev/0452b5b504d0

This probably missed the boat for Fx24, but you still may want to see about getting it landed on esr24. Also, why did this land without security approval?
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
The boat for Fx24 and esr24 has sailed as we have already gone to build with our final Release candidates and this will be a wontfix there.

Needinfo'ing :dveditz here to get the security rating here and get in touch with me asap if this needs to go in Fx24 or ESR24 given the landing on central.
Flags: needinfo?(dveditz)
Kannan: please retroactively request sec-approval? on the patch and answer the questions, which are trying to get at determining the risk of the patch vs. the risk of the vulnerability being fixed. The patch looks more complex than we'd like to take with no beta testing whatsoever. I'm hoping the security problem is not so bad because clearly if this was originally found via Emscripten-generated code we have to assume others will be able to find it as well. But without knowing what is actually going on it seems more likely that an access violation (I'm assuming?) in JITted code is exploitable.
Flags: needinfo?(dveditz) → needinfo?(kvijayan)
Didn't mean to clear my needinfo... still need to answer comment 25 after Kannan answers comment 26
Flags: needinfo?(dveditz)
(In reply to Daniel Veditz [:dveditz] from comment #26)
> Kannan: please retroactively request sec-approval? on the patch and answer
> the questions, which are trying to get at determining the risk of the patch
> vs. the risk of the vulnerability being fixed. The patch looks more complex
> than we'd like to take with no beta testing whatsoever. I'm hoping the
> security problem is not so bad because clearly if this was originally found
> via Emscripten-generated code we have to assume others will be able to find
> it as well. But without knowing what is actually going on it seems more
> likely that an access violation (I'm assuming?) in JITted code is
> exploitable.

The issue is a pure DoS.  It's "exploitable" in that an attacker could generate code to crash the browser, but that's the extent of the damage possible.  It's a pathological case that's unlikely to be hit accidentally: it's impossible on JS functions with <512 local vars, and even then unlikely on JS functions with <10k local vars.  In this case, the trigger was a JS function with 30k locals.

I think we can remove the security flag on the bug and not uplift it for 24.  Do you still want me to request a sec-approval?
Flags: needinfo?(kvijayan)
Unhiding. If it's not a security issue, then sec-approval is not required.
Group: core-security
Assuming regressionwindow-wanted is satisfied by comment 3. Please re-add if something more is needed.
Depends on: 918603
Flags: needinfo?(dveditz)
Keywords: csec-dos
You need to log in before you can comment on or make changes to this bug.