Assertion failure: !ontrace(), at jstracer.cpp:2800

Status

VERIFIED FIXED
Product: Core
Component: JavaScript Engine
Priority: --
Severity: critical
Reported: 8 years ago
Last modified: 4 years ago

People

(Reporter: decoder, Assigned: billm)

Tracking

(Blocks: 1 bug)
Version: Trunk
Platform: x86_64 Linux
Points: ---
Bug Flags: sec-bounty +

Firefox Tracking Flags

(blocking2.0 final+)

Details

(Whiteboard: [sg:critical?][hardblocker][fixed-in-tracemonkey][has patch])

Attachments

(4 attachments)

(Reporter)

Description

8 years ago
The following code asserts on tracemonkey tip:

options('tracejit');

gczeal(2);

function foo() {
        return /foo/;
}

function test() {
    var obj = {};
    for (var i = 0; i < 50; i++) {
        obj["_" + i] = "J";
    }
}

test();
Group: core-security
Jason, you made this a security bug without explanation, so you get the bug: either fix it, or explain why this is a security bug and how severe. Feel free to pass this to someone else if there's a better owner here.
Assignee: general → jorendorff
Whiteboard: [sg:critical?]
This assertion is pretty scary because we're garbage collecting traces while on trace. An opt gczeal build doesn't crash for me, but this needs investigation.

What prevents us from collecting live traces? Do we mark the running trace (and all its depending traces) as live?

And if nothing prevents that, what prevents us from GC'ing while on trace? There's a JS_ON_TRACE check, but this looks at only the compartment of the given cx.
blocking2.0: --- → final+

Comment 3

8 years ago
Walking the thread JS stack should reveal all the compartments currently running code, and we can check all of them.
Andreas, any idea on the first question?

Comment 5

8 years ago
We should not be flushing the trace cache as long as any code is running from that compartment.
Here's what seems to be happening. We're in the shell, so we only have atomsCompartment and the main compartment. We do an AtomizeString on trace from the main compartment. Somehow cx->compartment gets set to atomsCompartment. So when we check JS_ON_TRACE(cx) in the allocation routine, we get false, even though the current thread is actually on trace.

The way we're supposed to protect against a GC while on trace is as follows. A thread that's on trace should never initiate a GC. (I'm not sure what happens if there's no memory available). When a GC starts, the other threads may be on trace, but we tell them to LeaveTrace.

All this seems to be somewhat broken in the compartment world. I'll think about how to best handle this.
I guess I found the same issue with bug 625794.
The stack there looks like:

Assertion failure: !ontrace(), at ../jstracer.cpp:2823

Program received signal SIGABRT, Aborted.
0x00007ffff7bcf7bb in raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
42	../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
	in ../nptl/sysdeps/unix/sysv/linux/pt-raise.c
(gdb) bt
#0  0x00007ffff7bcf7bb in raise (sig=<value optimized out>)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000594800 in JS_Assert (s=0x747f25 "!ontrace()", 
    file=0x746a3b "../jstracer.cpp", ln=2823) at ../jsutil.cpp:83
#2  0x00000000005c0c96 in js::TraceMonitor::sweep (this=0xabf598, cx=0xaba4f0)
    at ../jstracer.cpp:2823
#3  0x00000000004644aa in JSCompartment::sweep (this=0xabf250, cx=0xaba4f0, 
    releaseInterval=0) at ../jscompartment.cpp:413
#4  0x00000000004ad7de in SweepCrossCompartmentWrappers (cx=0xaba4f0)
    at ../jsgc.cpp:2185
#5  0x00000000004ae0a1 in MarkAndSweep (cx=0xaba4f0, gckind=GC_NORMAL)
    at ../jsgc.cpp:2463
#6  0x00000000004ae428 in GCUntilDone (cx=0xaba4f0, comp=0x0, gckind=GC_NORMAL)
    at ../jsgc.cpp:2738
#7  0x00000000004ae603 in js_GC (cx=0xaba4f0, comp=0x0, gckind=GC_NORMAL)
    at ../jsgc.cpp:2809
#8  0x00000000004abdcb in RunLastDitchGC (cx=0xaba4f0) at ../jsgc.cpp:1117
#9  0x00000000004b835d in RefillTypedFreeList<JSShortString> (cx=0xaba4f0, 
    thingKind=8) at ../jsgc.cpp:1137
#10 0x00000000004abef2 in RefillFinalizableFreeList (cx=0xaba4f0, thingKind=8)
    at ../jsgc.cpp:1198
#11 0x00000000004de342 in NewFinalizableGCThing<JSShortString> (cx=0xaba4f0, 
    thingKind=8) at ../jsgcinlines.h:127
#12 0x00000000004de119 in js_NewGCShortString (cx=0xaba4f0)
    at ../jsgcinlines.h:154
#13 0x00000000005758fe in NewShortString (cx=0xaba4f0, chars=0x7ffff6812050, 
    length=1) at ../jsstr.cpp:3440
#14 0x00000000005760bb in js_NewStringCopyN (cx=0xaba4f0, s=0x7ffff6812050, 
    n=1) at ../jsstr.cpp:3593
#15 0x0000000000450d77 in js_AtomizeString (cx=0xaba4f0, 
    strArg=0x7ffff6812040, flags=0) at ../jsatom.cpp:522
#16 0x00000000005e1add in RootedStringToId (cx=0xaba4f0, namep=0xad9f10, 
    idp=0x7fffffffcdd0) at ../jstracer.cpp:12280
#17 0x00000000005e43d4 in SetPropertyByName (cx=0xaba4f0, obj=0x7ffff680d0d0, 
    namep=0xad9f10, vp=0x7fffffffce08, strict=0) at ../jstracer.cpp:12840

Comment 8

8 years ago
Why did the tracerState move into the compartment? I thought we wanted to leave that on the thread.
(In reply to comment #8)
> Why did the tracerState move into the compartment? I thought we wanted to leave
> that on the thread.

That was bug 623050, which fixed it so that we don't flush the trace cache while we're still on trace.

Comment 10

8 years ago
I see. So this was to deal with workers, right? Because they can roam between threads.

Comment 11

8 years ago
bent, do workers roam between threads while JS code is executing?
(In reply to comment #10)
> I see. So this was to deal with workers, right? Because they can roam between
> threads.

Bug 623050 was meant to deal with the possibility (that doesn't actually happen in Firefox) that we'd be on trace in some compartment, leave the compartment and pause for some reason while still on trace, and have another thread enter the compartment and do a flush. Now that tracerState is in the compartment, we can prevent the flush from happening.

However, this bug here is pretty much orthogonal to that one (and comment 0 is from before bug 623050 landed). This bug has existed since we moved TraceMonitor to the compartment. The basic problem is that JS_ON_TRACE gives an incorrect answer if we change compartments while on trace. I just noticed the following comment above the JS_ON_TRACE definition:
  /*
   * N.B. JS_ON_TRACE(cx) is true if JIT code is on the stack in the current
   * thread, regardless of whether cx is the context in which that trace is
   * executing.  cx must be a context on the current thread.
   */
We're blatantly violating this comment right now.

It would be easy to add a flag to ThreadData that says whether the current thread is on trace. However, there are other places in the engine where we find the current traceMonitor via cx->compartment, and it now seems likely that these places might be getting the wrong TraceMonitor if a compartment switch happened.

I think a more complete solution would be to add a traceCompartment field to ThreadData. This would be NULL if the thread is not on trace, and otherwise it would point to the (single) compartment that is on trace. Then the definition of JS_TRACE_MONITOR would return JS_THREAD_DATA(cx)->traceCompartment.traceMonitor. It would be an asserted error to use the macro if you're not on trace.

For this to work, we need to ensure that a thread isn't running two traces at once in different compartments. I think we could detect this and prevent it (by deep-bailing the existing trace when running the newer one?). Maybe it's already impossible? It definitely doesn't seem like something we want.

Does this sound reasonable? It should be a short patch.

Comment 13

8 years ago
Sounds reasonable. Let's whip up a patch.
I forgot that we also use JS_TRACE_MONITOR all the time when recording. It seems like there are two things to worry about here:

1. Can a worker switch threads while recording?
2. While recording, can a compartment switch happen?

It probably makes sense to have two classes of macros: those used during recording and those used while on trace. I'm just not sure what to do in the recording case. If we go through the compartment, we get killed by possibility #2 above. If we go through the thread, #1 kills us.
bent should confirm, but I'm pretty sure workers never switch threads while they're executing JS, they only switch between invocations.
(In reply to comment #15)
> bent should confirm, but I'm pretty sure workers never switch threads while
> they're executing JS, they only switch between invocations.

So then when we're recording, we can find the traceMonitor by going through the thread. In that case, we somehow need to prevent ourselves from recording in two compartments at once on the same thread. This seems possible, though.

I'll try to put something together tonight.
Assignee: jorendorff → wmccloskey
(In reply to comment #15)
> bent should confirm, but I'm pretty sure workers never switch threads while
> they're executing JS, they only switch between invocations.

Confirmed, they do not switch while running.
Created attachment 504063 [details] [diff] [review]
patch

This adds the tracerCompartment field. However, it uses a looser invariant than what I described above. When it wants a traceMonitor, first it looks in the cx->thread->tracerCompartment. If that's NULL, it looks in cx->compartment. I decided not to make it look at tracerCompartment alone, because there are places where we look up the traceMonitor when we're not recording or executing a trace.

Ideally, we would go through every use of JS_TRACE_MONITOR and fix it to be explicit about which traceMonitor it actually wants. But when I tried to do that, I found that it's pretty hard. I'm pretty sure this patch gives every caller what they need, though.

One tricky part is what happens if a thread starts recording, switches compartments, and then tries to record again (i.e., nested recording on different compartments). In this case, we will abort the outer recording when we re-enter the interpreter through Invoke. The same happens for profiling and trace execution.

I'll be away for the next four days, so please feel free to steal this.
Attachment #504063 - Flags: review?(gal)

Comment 19

8 years ago
This is insufficient. We need a list here. While only one compartment executes, a deep bailed compartment might still have code on the stack to which we will return as we unwind, so we can't flush that compartment's trace cache either.
Whiteboard: [sg:critical?] → [sg:critical?][hardblocker]
(In reply to comment #19)
> This is insufficient. We need a list here. While only one compartment executes,
> a deep bailed compartment might still have code on the stack to which we will
> return as we unwind, so we can't flush that compartment's trace cache either.

The fix for bug 623050 should have eliminated the possibility of flushing while we're on trace, even during a deep bail. It seems pretty air-tight to me.

This bug is about doing a GC while we're on trace (which, admittedly, has similar consequences). I think I understand your concern: if we only keep one tracerCompartment per thread, what happens if a thread tries to run two traces from different compartments, one inside the other? This is the "tricky" situation I described above. Here's what happens in more detail.

- The outer trace starts to run in compartment C1, which sets the thread's tracerCompartment to C1.
- The inner trace starts in compartment C2. It sees the outer trace and calls LeaveTrace, causing the outer trace to deep bail. The thread's tracerCompartment is set to C2.
- The inner trace finishes, and tracerCompartment is set to NULL. This sounds scary, but it's okay.
- When we do the DeepBail of the outer trace, we call JS_TRACE_MONITOR. It checks tracerCompartment, which is NULL, so it falls back on cx->compartment. So in order for the code to be correct, we need to ensure that cx->compartment == C1 during the deep bail. There doesn't seem to be any way to switch compartments during a deep bail, and I don't think we'd ever introduce such behavior in the future.

However, if this argument is wrong (or just too complex), I can add a stack of compartments to each thread. Or maybe we could just do a stack-walk of some sort when popping tracerCompartment. I'm not sure how exactly this would work.

Comment 21

8 years ago
Comment on attachment 504063 [details] [diff] [review]
patch

>+#ifdef JS_TRACER
>+static inline js::TraceMonitor &
>+JS_TRACE_MONITOR(JSContext *cx)
>+{
>+    JSCompartment *c = JS_THREAD_DATA(cx)->tracerCompartment;
>+    if (c == NULL)
>+        c = cx->compartment;

Why the discrepancy? Shouldn't this be an assert instead?

>+static inline bool
>+JS_ON_TRACE(JSContext *cx)
>+{
>+#ifdef JS_TRACER
>+    if (cx->compartment || JS_THREAD_DATA(cx)->tracerCompartment)
>+        return JS_TRACE_MONITOR(cx).ontrace();
>+#endif
>+    return false;
>+}

Again why?
There are a number of places where we call JS_TRACE_MONITOR and the like when we're not doing any sort of tracing activity. For example, when we're in the tracer but we haven't decided to record yet, we call this macro a lot. In that case, we just want to use cx->compartment.

Updated

8 years ago
Attachment #504063 - Flags: review?(gal) → review+
http://hg.mozilla.org/tracemonkey/rev/f3b470fb91a9
Whiteboard: [sg:critical?][hardblocker] → [sg:critical?][hardblocker][fixed-in-tracemonkey]
Backed out. I guess I screwed something up.

http://hg.mozilla.org/tracemonkey/rev/30c2d7eaef7d
Whiteboard: [sg:critical?][hardblocker][fixed-in-tracemonkey] → [sg:critical?][hardblocker]
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/f3b470fb91a9
http://hg.mozilla.org/mozilla-central/rev/30c2d7eaef7d (backout)
Note: not marking as fixed because last changeset is a backout.
Created attachment 508079 [details] [diff] [review]
patch v2, part 1

I decided to redo the previous patch. This patch is much more explicit about which TraceMonitor we want at any given time.

I've broken it into two pieces. This first piece just rewrites the tracer to call the JS_TRACE_MONITOR macro much less. Instead, it passes around the |tm| through arguments, fields, etc. If it is correct, this patch should not change program behavior at all.

Please be careful when reviewing it, though. In a number of places, it relies on the recorder's |traceMonitor| field. Every time it does, we need to make sure that the recorder hasn't already been freed at that point due to an abort. I think that's the only tricky part of the patch.

There's one more piece that I'll post next.
Attachment #508079 - Flags: review?(gal)
Created attachment 508080 [details] [diff] [review]
patch v2, part 2

This patch actually tries to fix the bug. It adds fields to ThreadData that store the compartment that is currently executing on trace, recording, or profiling. The TRACE_RECORDER, TRACE_PROFILER, and JS_ON_TRACE macros use these fields.

The JS_TRACE_MONITOR macro has been eliminated. In its place are JS_TRACE_MONITOR_ON_TRACE and JS_TRACE_MONITOR_FROM_CONTEXT. The first should only be called while on trace, and it fetches the trace monitor from the thread data. The second should only be called in places where you know that you want the trace monitor corresponding to cx->compartment.

What to watch out for:
1. We should only call JS_TRACE_MONITOR_ON_TRACE when we're actually on trace. Otherwise we'll get an assertion failure. I'm especially worried about functions like SetBuiltinError and WasBuiltinSuccessful, since we may deep bail while we're running the given native. I've tried to fetch the trace monitor as early in the function as possible to avoid this.

2. For js_SetTraceableNativeFailed, I used the trace monitor from cx->compartment, but I have no idea if this is right. It depends on the caller, and this seems to be called from outside JS.

An additional note: If you think the risk that these patches will introduce regressions is too high, I also have a fixed version of the v1 patch we could use. The v2 patches have passed two tryserver runs, though.
Attachment #508080 - Flags: review?(gal)

Updated

8 years ago
Whiteboard: [sg:critical?][hardblocker] → [sg:critical?][hardblocker][has patch]

Comment 29

8 years ago
Comment on attachment 508079 [details] [diff] [review]
patch v2, part 1

> JS_FRIEND_API(void)
> js_SetTraceableNativeFailed(JSContext *cx)
> {
>-    SetBuiltinError(cx);
>+    /*
>+     * We might not be on trace (we might have deep bailed) so we hope
>+     * cx->compartment is correct.

Can we assert instead of hoping here?

>+     */
>+    SetBuiltinError(&JS_TRACE_MONITOR(cx));
> }
> 
>+/*
>+ * N.B. JS_ON_TRACE(cx) is true if JIT code is on the stack in the current
>+ * thread, regardless of whether cx is the context in which that trace is
>+ * executing.  cx must be a context on the current thread.
               ^ two spaces
Attachment #508079 - Flags: review?(gal) → review+
(In reply to comment #29)
> > JS_FRIEND_API(void)
> > js_SetTraceableNativeFailed(JSContext *cx)
> > {
> >-    SetBuiltinError(cx);
> >+    /*
> >+     * We might not be on trace (we might have deep bailed) so we hope
> >+     * cx->compartment is correct.
> 
> Can we assert instead of hoping here?

I'm not sure what should be asserted. The thing we need to worry about is this: some traceable native that switches compartments and then calls js_SetTraceableNativeFailed. This seems unlikely, but it's also hard to protect against.

One idea is to use JS_TRACE_MONITOR_ON_TRACE rather than JS_TRACE_MONITOR_FROM_CONTEXT. That looks up the TraceMonitor in a way that is immune to compartment switches. However, it won't work if we deep bail before we call js_SetTraceableNativeFailed.

Either approach requires knowing more about the callers of js_SetTraceableNativeFailed. Do they ever deep bail? Do they ever switch compartments? I looked up the callers in mxr. There are a bunch of calls from WebGL code as well as a call from the script that generates quick stubs (I think). This latter one is a total black box to me.

I'm ccing Boris in case he knows.

Comment 31

8 years ago
Risky patches probably need a beta. Should this move to blocking beta?

Updated

8 years ago
Attachment #508080 - Flags: review?(gal) → review+
> Either approach requires knowing more about the callers of
> js_SetTraceableNativeFailed. Do they ever deep bail? 

In general, probably yes.  That is, one could call a traceable native (one of the quickstubs say) that first deep bails and then calls js_SetTraceableNativeFailed.

You can inspect the quickstub code in js/src/xpconnect/src/dom_quickstubs.cpp in your objdir, but I'm not sure how much benefit that would bring.  It basically calls js_SetTraceableNativeFailed if unwrapping |this| fails or if calling the actual C++ DOM method fails.  And in general the C++ DOM method might be reentering script, etc.

I have no idea whether any of that can switch compartments (without unswitching before we get back to the quickstub, which I presume is the relevant thing).

Comment 33

8 years ago
Compartment switching is always balanced. If we come out of xpconnect/quickstub and we have a different compartment set, that's a bug.
Based on these comments, I'm pretty sure that this is safe. I can't think of an easy way to assert it though.

http://hg.mozilla.org/tracemonkey/rev/c7e8f00451a5
http://hg.mozilla.org/tracemonkey/rev/ee3a4f429942
Whiteboard: [sg:critical?][hardblocker][has patch] → [sg:critical?][hardblocker][fixed-in-tracemonkey]
Created attachment 508848 [details] [diff] [review]
fix for perf regression

This fixes the perf regression from patch v2, part 1. I added the stopProfiling call in LoopProfile::profileLoopEdge because it seemed like the right thing to do, but I guess it regressed performance. Removing it shouldn't have any correctness implications.
Attachment #508848 - Flags: review?(dvander)
Attachment #508848 - Flags: review?(dvander) → review+

Updated

8 years ago
Whiteboard: [sg:critical?][hardblocker][fixed-in-tracemonkey] → [sg:critical?][hardblocker][fixed-in-tracemonkey][has patch]
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
(Reporter)

Updated

7 years ago
Blocks: 676763
(Reporter)

Comment 38

6 years ago
Old tracer bug, opening and marking verified.
Group: core-security
Status: RESOLVED → VERIFIED
Flags: sec-bounty+