Closed Bug 474529 Opened 16 years ago Closed 15 years ago

TM: Avoid artificial OOM conditions

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla1.9.2a1

People

(Reporter: BijuMailList, Assigned: graydon)

Details

(4 keywords, Whiteboard: [sg:critical?][crashtest?])

Attachments

(3 files)

Attached file arrayCreation.html
I have modified attachment 280909 [details] on bug 396191 by adding one more iteration.
Now it crashes Firefox. 

Steps:-
1. open attached arrayCreation.html 

Result:- crashes on the last iteration

Expected:- dont crash
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090119 Minefield/3.2a1pre ID:20090119074946

crash report:
Signature	nanojit::LirBufWriter::insLinkToFar(nanojit::LOpcode, nanojit::LIns*)
UUID	c0652bb7-9fc9-42e4-bac3-202f32090120

perhaps related to bug 470310 ?
Keywords: crash, testcase
Summary: Modified test on Bug 396191 crashes firefox → TM: Modified test on Bug 396191 crashes firefox
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090122 Minefield/3.2a1pre ID:20090122021228

Still crashes with a build from the tracemonkey tree (I don't know if the patch has made it to mozilla-central yet though). Crash happens one frame earlier than before.

crash report:
Signature    nanojit::LirBufWriter::insLinkTo(nanojit::LOpcode, nanojit::LIns*)
bp-a9ef17cb-f6de-43a9-a467-2d5a22090123
(I tested this on latest TM checkout debug build on WinXP SP3)

Assertion failed: _buf->_nextPage (c:/tracemonkey/js/src/nanojit/LIR.cpp:168)

=====

(I tested this on latest-trunk mozilla-central nightly build on WinXP SP3)

Turning security-sensitive and nominating blocking1.9.1? just-in-case, due to
!exploitable PROBABLY_EXPLOITABLE result. I also confirm this as being JIT-only, as turning JIT content off doesn't crash.


1:002> !exploitable -v
HostMachine\HostUser
Executing Processor Architecture is x86
Debuggee is in User Mode
Debuggee is a live user mode debugging session on the local machine
Event Type: Exception
*** WARNING: Unable to verify checksum for C:\Documents and Settings\Administrator\Desktop\firefox\xul.dll
Exception Faulting Address: 0x4
First Chance Exception Type: STATUS_ACCESS_VIOLATION (0xC0000005)
Exception Sub-Type: Write Access Violation

Exception Hash (Major/Minor): 0x563a0616.0x32486061

Stack Trace:
js3250!nanojit::LirBufWriter::insLinkToFar+0x6
js3250!nanojit::LirBufWriter::insLinkTo+0x63
js3250!nanojit::LirBufWriter::skip+0xc9
js3250!TraceRecorder::snapshot+0x4b8
js3250!TraceRecorder::box_jsval+0x9b
js3250!TraceRecorder::newArray+0x1e8
js3250!TraceRecorder::functionCall+0x13a
js3250!TraceRecorder::monitorRecording+0xa82
js3250!js_Interpret+0x1d62
js3250!js_Invoke+0x289
js3250!js_InternalInvoke+0x102
js3250!JS_CallFunctionValue+0x27
xul!nsJSContext::CallEventHandler+0x162
xul!nsJSEventListener::HandleEvent+0x230
xul!nsEventListenerManager::HandleEventSubType+0x6b
xul!nsEventListenerManager::HandleEvent+0x222
xul!nsEventTargetChainItem::HandleEvent+0xb4
xul!nsEventTargetChainItem::HandleEventTargetChain+0x1eb
xul!nsEventDispatcher::Dispatch+0x4f5
xul!DocumentViewerImpl::LoadComplete+0x10c
xul!nsDocShell::EndPageLoad+0x63
xul!nsWebShell::EndPageLoad+0xa9
xul!nsDocShell::OnStateChange+0x83
xul!nsDocLoader::FireOnStateChange+0x123
xul!nsDocLoader::doStopDocumentLoad+0x1c
xul!nsDocLoader::DocLoaderIsEmpty+0x130
xul!nsDocLoader::OnStopRequest+0xc8
xul!nsLoadGroup::RemoveRequest+0xb8
xul!nsDocument::DoUnblockOnload+0x3b
Instruction Address: 0x480156

Description: User Mode Write AV near NULL
Short Description: WriteAV
Exploitability Classification: PROBABLY_EXPLOITABLE
Recommended Bug Title: Probably Exploitable - User Mode Write AV near NULL starting at js3250!nanojit::LirBufWriter::insLinkToFar+0x6 (Hash=0x563a0616.0x32486061)

User mode write access violations that are near NULL are probably exploitable.
Group: core-security
Flags: blocking1.9.1?
Keywords: assertion
Summary: TM: Modified test on Bug 396191 crashes firefox → TM: Crash [@ nanojit::LirBufWriter::insLinkToFar] or "Assertion failed: _buf->_nextPage"
Whiteboard: [sg:critical?]
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P1
Assignee: general → graydon
This is another case in which we have run out of official memory in the LIR buffer, it has set it's "I ran out of memory" flag, and we are still piling new data into it, because it'll be a while until we next check the flag.

As with all such cases, the "correct" fix is to either teach NJ to be infallible with memory allocation (say, make it abort the process when it *really* runs out of memory, and never OOM in this unofficial self-mutilating fashion due to merely passing its high water mark), or teach all the callers of it to check OOM after every call, or teach NJ to return NULL when it OOMs and check for NULL after every call.

I expect the response to these suggestions will be "all those are expensive and/or intrusive and/or poor style", so we'll carry on playing russian roulette with the notion that we can check the OOM status "often enough" that we don't keep reproducing this situation. If we do, we will get what we pay for: more bugs exactly like this one.
Easy fix. Check for the OOM flag in the loop. We do that every else too. We could actually do a simple static analysis to prove that we do this inside all loops, where it really hurts. We have 4k reserve so for straight-line code we should be good.
(In reply to comment #5)
> Easy fix. Check for the OOM flag in the loop. We do that every else too. We
> could actually do a simple static analysis to prove that we do this inside all
> loops, where it really hurts. We have 4k reserve so for straight-line code we
> should be good.

No. I appreciate what you're trying to do here -- try to find "the last
bug" in our OOM-checking code -- but I want to propose a counterargument
that I think is more correct.

Currently we have two conditions being modeled via the OOM flag in NJ. One
is "we really ran out of memory", like no more system memory, and one is
just exceeding our 16mb or 32mb highwater mark for the TM code cache + LIR
buffers.

I think this is a mistake. I think that TM is actually not taking very good
care to check the OOM flag often enough, and I think there are lots of
cases still lurking in which it doesn't and won't. And we're about to ship
a product. Your hedging language here suggests you actually lack confidence
in the notion that we're looking at "the last" of these bugs too. I think
we're not. I'd bet my new bike we're not.

What I'd propose is -- at least for now -- modeling these two conditions
separately. Currently we upgrade every highwater-overflow (which is
frequent, predictable, guaranteed to be happening every few minutes on
millions of desktops) to a full-scale OOM.  This is unnecessary. We have,
in almost every one of these cases (say 99.9999%) lots of memory left. More
than a page. Probably thousands of pages. Certainly enough that there's no
sense turning a modest bug (exceeding highwater) into a likely-exploitable
memory error.

I don't want to patch over the problem of catching *real* OOMs
indefinitely, but they're so comparatively rare that I don't think it's
wise to be doing what we're currently doing. I've got a patch I'm about to
post that does both what you ask and also, along the lines I'm suggesting,
backs off using the OOM flag for anything other than a *real* OOM. Instead,
it tells the fragmento to use as much memory as it can (up to 1<<32), and
probes it from time to time to flush when it notices we're over the
vastly-lower limit.

This is *much more practically safe* when sent out to the world. Please
consider that fact.  A followup bug to "audit the OOM paths" is entirely
reasonable. But I think opening users up to this is not. It's not going to
spontaneously audit all those paths and make them correct to ship. It'll
just make a lot of people vulnerable.
Sure, the proposed approach is much more effective. As long we check often enough, even mobile should not have an issue. Even under extreme memory pressure this should not behave worse than what we have now (we have a bunch of unchecked malloc/reallocs that crash when hit).
Attachment #370344 - Flags: review?(gal) → review+
Comment on attachment 370344 [details] [diff] [review]
A conservative fix that removes the artificial-OOM behavior

I am noting with amusement that you actually added the !lirbuf->outOMem() check as well, as suggested ;)
Summary: TM: Crash [@ nanojit::LirBufWriter::insLinkToFar] or "Assertion failed: _buf->_nextPage" → TM: Avoid artificial OOM conditions
http://hg.mozilla.org/mozilla-central/rev/fe60e7460cd1
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
better implemented as a crash test and checked in when fixed.
Flags: in-testsuite?
Whiteboard: [sg:critical?] → [sg:critical?][crashtest?]
Verified fixed with testcase given in comment 0 on trunk and 1.9.1 with the
following debug builds:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2a1pre)
Gecko/20090422 Minefield/3.6a1pre ID:20090422224452

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b4pre)
Gecko/20090422 Shiretoko/3.5b4pre ID:20090422122043
Status: RESOLVED → VERIFIED
OS: Windows XP → All
Hardware: x86 → All
Target Milestone: --- → mozilla1.9.2a1
Flags: in-testsuite? → in-testsuite+
Group: core-security
Flags: wanted1.9.0.x-
test checked into 1.9.0, 1.9.1, 1.9.2, tracemonkey. 1.9.3 will get picked up in the next merge.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: