Closed Bug 972725 Opened 10 years ago Closed 10 years ago

Crashes in Fedora's gnome-shell apparently caused by mozjs

Categories

(Core :: JavaScript Engine, defect)

24 Branch
x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: adamw, Unassigned)

Details

Attachments

(3 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0 (Beta/Release)
Build ID: 20140203105753

Steps to reproduce:

Use GNOME Shell on Fedora 20 or 21 for a while (affects many, but not all, users).


Actual results:

Shell crashes fairly often (every few hours), with no consistent trigger, always in a javascript function of some kind.


Expected results:

Shell shouldn't crash.

The Fedora reports for this are https://bugzilla.redhat.com/show_bug.cgi?id=1034467 and https://bugzilla.redhat.com/show_bug.cgi?id=1028813 (there are a few other dupes). The Fedora GNOME devs say the backtraces indicate some kind of underlying memory corruption issue. GNOME's javascript stuff goes through a library named gjs which in turn sits on top of mozjs, and when I asked #jsapi about this issue, they suggested filing a bug as this may be a mozjs issue.

We have seen these crashes with GNOME 3.10 on mozjs17 (Fedora 20), GNOME 3.11.x on mozjs17 (Fedora Rawhide prior to this week), and GNOME 3.11.x on mozjs24 (Fedora Rawhide as of this week).

I've reproduced the crash with GNOME Shell running through valgrind, with recommended parameters:

env G_SLICE=always-malloc valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=12 --vex-iropt-register-updates=allregs-at-mem-access --log-file=/home/adamw/gnome-shell_valgrind.log --smc-check=all-non-file gnome-shell --replace

and will attach the log. An extract of the most obviously relevant bit reads:

==14803== Process terminating with default action of signal 11 (SIGSEGV)
==14803==  Bad permissions for mapped region at address 0x17D54060
==14803==    at 0x17D54060: ???
==14803==    by 0x3692F67384: EnterBaseline(JSContext*, js::jit::EnterJitData&) [clone .part.191] (BaselineJIT.cpp:105)
==14803==    by 0x3692F676D9: js::jit::EnterBaselineMethod(JSContext*, js::RunState&) (BaselineJIT.cpp:81)
==14803==    by 0x3692CFD182: Interpret(JSContext*, js::RunState&) (Interpreter.cpp:2334)
==14803==    by 0x3692CFDBC7: js::RunScript(JSContext*, js::RunState&) (Interpreter.cpp:438)
==14803==    by 0x3692CEFD2B: js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) (Interpreter.cpp:500)
==14803==    by 0x3692E0EADE: js::CallOrConstructBoundFunction(JSContext*, unsigned int, JS::Value*) (jsfun.cpp:1212)
==14803==    by 0x3692CEFD72: js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:321)
==14803==    by 0x3692E0EEDF: js_fun_apply(JSContext*, unsigned int, JS::Value*) (jsfun.cpp:982)
==14803==    by 0x3692CEFD72: js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:321)
==14803==    by 0x3692CFFBED: js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value*, JS::Value*) (Interpreter.cpp:531)
==14803==    by 0x3692F5F9DF: js::jit::DoCallFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICCall_Fallback*, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) (BaselineIC.cpp:7007)
Oh, for clarity, the log is from GNOME Shell 3.11.5 with gjs 1.39.3 and mozjs24 24.2.0 (current Fedora Rawhide), with the mozjs build done with --enable-valgrind.
==14803== Conditional jump or move depends on uninitialised value(s)
==14803==    at 0x3692821904: ??? (Value.h:1736)

This should not happen, can you try launching the gnome-shell with --vgdb-error=0 and hook a gdb on it to find from where this uninitialized value is coming from?

==14803== Process terminating with default action of signal 11 (SIGSEGV)
==14803==  Bad permissions for mapped region at address 0x17D54060
==14803==    at 0x17D54060: ???
==14803==    by 0x3692F67384: EnterBaseline(JSContext*, js::jit::EnterJitData&) [clone .part.191] 

A part from the above, I wonder it you are adding the segfault handler that we added we introduced with asmjs?
When we use the interruption callback, we protect the access to the memory to avoid checking a boolean at each loop iterations.  If the SEGV handler is not registered all the interruption callbacks are likely to cause this error if they happen during the execution of jitted code.
Flags: needinfo?(luke)
If the crashes happen with mozjs17, then this predates the baseline JIT, asm.js optimizations and even IonMonkey.

Given that the crashes persist on mozjs24 (despite massive changes to all things jit), I'd expect this is not a jit problem.  The most likely offender is if the embedding is using multiple global objects and not entering/leaving compartments correctly.  This would cause random corruption that would lead to crashes in jit code.  We have pretty good assertions for this, but you have to run a debug build.
Flags: needinfo?(luke)
Luke: yup, they definitely happen with mozjs17 (we get a new CC on the F20 bug, https://bugzilla.redhat.com/show_bug.cgi?id=js-shape-finalize-20 , at least once a day most days, and I haven't gone through and cleaned out probable dupes for a while - due to the nature of the bug, the backtraces aren't always identical, so I go through BZ and dupe off anything in JS:: every so often).

So right now in Valgrind Parameter Bingo, you want me to add --vgdb-errror=0 , attach a gdb when the crash happens, and get a backtrace? I can do that, but if just a 'regular' backtrace of one of the crashes would suffice, you can pick any of the dozens attached to the Fedora bugs and their duplicates.

You also want me to rebuild mozjs24 with --enable-debug , yes?
(In reply to Adam Williamson from comment #5)
> You also want me to rebuild mozjs24 with --enable-debug , yes?

Yes, that would add in tons of asserts that would hopefully catch the bug closer to the source (esp. if it is a cross-compartment problem).
(In reply to Adam Williamson from comment #5)
> So right now in Valgrind Parameter Bingo, you want me to add --vgdb-errror=0
> , attach a gdb when the crash happens, and get a backtrace? I can do that,
> but if just a 'regular' backtrace of one of the crashes would suffice, you
> can pick any of the dozens attached to the Fedora bugs and their duplicates.

Not a backtrace, valgrind already give us all the backtraces we want, but this is a way to debug issues within gdb at the point where valgrind see error messages such as uninitialized variables.
I'll sit in IRC, then, as you're gonna have to talk me through whatever you want doing with gdb. I'm just the QA monkey, so far as gdb is concerned I know how to do a 'thread apply all bt full', anything past that you have to send me the play-by-play :)

about to launch shell with the rebuilt mozjs now, the time-to-crash is unpredictable but it shouldn't be more than a few hours.
So I have mozjs24 built with --enable-debug and gjs rebuilt against that mozjs24, but if I add --vgdb-error=0 to my valgrind invocation, attach gdb to the valgrind process as explained in the valgrind log, and run 'continue', shell never manages to fully start up; the old shell process is killed but then the new one never gets to the point of painting any window decorations (I waited for a couple hours). That was what I also saw when trying with --num-callers=20 , so I guess it's just too much for the hardware to handle or something. Should I try reducing --num-callers further, or what?

In the meantime I can try and get a straight valgrind log with the --debug build, I guess. But I'm also knee deep in systemd debugging today. le sigh.
(In reply to Adam Williamson from comment #9)
If you have an --enable-debug build, just running it (no valgrind) should be good enough to hit the assertions.  Good luck with systemd; switching from Upstart?
(In reply to Adam Williamson from comment #9)
> So I have mozjs24 built with --enable-debug and gjs rebuilt against that
> mozjs24, but if I add --vgdb-error=0 to my valgrind invocation, attach gdb
> to the valgrind process as explained in the valgrind log, and run
> 'continue', shell never manages to fully start up; the old shell process is
> killed but then the new one never gets to the point of painting any window
> decorations (I waited for a couple hours). That was what I also saw when
> trying with --num-callers=20 , so I guess it's just too much for the
> hardware to handle or something. Should I try reducing --num-callers
> further, or what?

Every time valgrind reports an error, it will stop in gdb, and this is what is wanted to have more contextual information on the origin of the uninitialized value which is flowing into Value.h (which is the first JS error)

As Valgrind will stop the execution each time it see an error, you will have to type 'continue' in gdb as long as you do not hit this first backtrace which involve the unitialize value in Value.h.  The current stack is this one:

==14803== Conditional jump or move depends on uninitialised value(s)
==14803==    at 0x3692821904: ??? (Value.h:1736)
==14803==    by 0x3692824860: gjs_value_from_g_argument (arg.cpp:2437)
==14803==    by 0x369282B5A1: ??? (function.cpp:1086)
==14803==    by 0x369282C0F0: ??? (function.cpp:1203)
==14803==    by 0x3692CEFEF0: js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) (jscntxtinlines.h:321)

So, this is probably an error which need to be fixed in the way the JS API is used in gjs_value_from_g_argument, the trick with --vgdb-error=0, is something which can be used by developers, to find out more about the context of valgrind error reports. (such as running 'where full' in gdb, instead of 'bt')

Also, have you tried to just run without valgrind, to see if any of the assertions are triggering?
nbp: gdb wasn't showing any errors or returning to its prompt, I checked. After the initial 'continue' it just...sits there.
So I ran the --debug build through valgrind all day yesterday, and it didn't crash once. Either I got very (un)lucky (depending on which way you want to look at it...), or this may be a bit Heisenbug-ish - the crash doesn't happen if you run with the asserts that would help debug the crash...

I'll give it another shot later today.
Tried this again today, still can't crash it when running via valgrind with a --enable-debug build of mozjs24. However, it's noticeable that from time to time the desktop entirely hangs for quite a considerable period - two or three minutes. Then suddenly it pops back to working. The frequency of these hangs is rather similar to the normal frequency of the crashes, so it seems like perhaps when running through valgrind with --enable-debug, these mysterious hangs happen *instead of* the crashes.

I don't see any console output or output in the valgrind log that seems to be contemporaneous with the hangs, though.
This is an AssertHeapIsIdleOrIterating  backtrace with mozjs24 built with --enable-debug and gnome-shell on rawhide.
I am a bit surprised by the stack that you obtained, but it explain what Adam reported, as being slow where it used to crash.  The reason seems to be that the stack allocated for running the compiled JS code is not enough, and when you enter a JIT-ed function, we check that we are not about to overflow the stack.

It appears that during this check we are also interrupted by the interruption callback, which is a mechanism used to execute some code while the JS is not finished. We use the same mechanism in Firefox to trigger the alert box.

Can you profile the JS code, within the Gnome shell?

We have an API [1] with which you can enable the SPS profiler, this would maintain an additional stack which follow every JavaScript function execution.  Also, you can have a look at Gecko's profiler[2] implementation, which is using the SPS Profiler.  It is using a timer, which send a SIGPROF to the main thread such as it collect samples of the stack.

[1] http://dxr.mozilla.org/mozilla-central/source/js/public/ProfilingStack.h
[2] http://dxr.mozilla.org/mozilla-central/source/tools/profiler/GeckoProfiler.h

Also, if you are able to reproduce this error with a debug build, can you run

(gdb) call js_DumpBacktrace(cx)

under gdb, when the program fails.  You can usually find a "cx" (JSContext*) in the C++ stack, such as the new attached one.

Terrence: Any idea what could cause AssertHeapIsIdleOrIterating to fail during an interruption callback.
Flags: needinfo?(terrence)
Setting up SPS profiling in shell is beyond me. 
The bug is not really reproducible here, the trace was from a core file. I'll try running under gdb for a while.
(In reply to Nicolas B. Pierron [:nbp] from comment #16)
> Terrence: Any idea what could cause AssertHeapIsIdleOrIterating to fail
> during an interruption callback.

This asserts that we are not actively in a GC or otherwise tracing the heap, e.g. for JS_IterateCompartmentsCellIters for about:memory, JS_DumpHeap for debugging, etc. So this would imply that script is running from a finalizer, a JSTracer callback, while using CellIter, or perhaps someone is just calling the interrupt hook manually in one of this places. This is not allowed and, as far as I know there is nowhere in SpiderMonkey proper where this can happen. However, I think in the past we've had trouble with finalizers in Gecko accidentally trying to run scripts.
Flags: needinfo?(terrence)
I reproduced the crash with a --debug build, with shell running inside gdb. I get this:

Assertion failure: !rt->isHeapCollecting(), at /builddir/build/BUILD/mozjs-24.2.0/js/src/jsapi.cpp:206

Program received signal SIGSEGV, Segmentation fault.
0x0000003ac9652590 in AssertHeapIsIdleOrIterating (rt=<optimized out>)
    at /usr/src/debug/mozjs-24.2.0/js/src/jsapi.cpp:206
206	    JS_ASSERT(!rt->isHeapCollecting());

I couldn't do the "call js_DumpBacktrace(cx)" bit because of the chicken-and-egg problem...the terminal on which I was running gnome-shell through gdb was inside the gnome-shell session :) I'll try and set it up so I can actually access the running gdb while the crashed shell process is still around next time.
Great! A simple |bt full| from within gdb should point right at the offending code.
Welp, I had to kill the gnome-shell process itself to be able to get at gdb, so I don't know if the backtrace generated after that is useful, but here it is:

(gdb) bt full
#0  0x0000003ac9652590 in AssertHeapIsIdleOrIterating (rt=<optimized out>)
    at /usr/src/debug/mozjs-24.2.0/js/src/jsapi.cpp:206
No locals.
#1  AssertHeapIsIdleOrIterating (
    cx=<error reading variable: Cannot access memory at address 0x7fffffff7bc8>) at /usr/src/debug/mozjs-24.2.0/js/src/jsapi.cpp:212
No locals.
#2  JSAutoCompartment::JSAutoCompartment (this=<optimized out>, 
    cx=<optimized out>, target=<optimized out>)
    at /usr/src/debug/mozjs-24.2.0/js/src/jsapi.cpp:1459
No locals.

I now have a gnome-shell running through gdb from a VT, so next time it crashes I should be able to do the backtrace without killing the shell process first...
Here's a better 'bt full' after catching the crash in gdb.
The problems seems to be related to a known Bug, where the finalizer is asserting that the heap is idle.  Do we have any way to work-around such issues?

#0  0x0000003ac9652590 in AssertHeapIsIdleOrIterating (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsapi.cpp:206
#1  AssertHeapIsIdleOrIterating (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsapi.cpp:212
#2  JSAutoCompartment::JSAutoCompartment (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsapi.cpp:1459


#3  0x0000003f4043a9f8 in closure_marshal (…) at gi/value.cpp:80
[…]
#17 0x0000003f404318a0 in object_instance_finalize (fop=<optimized out>, obj=0x7fffba918ca0) at gi/object.cpp:1378


#18 0x0000003ac96f1e7c in finalize (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsobjinlines.h:213
#19 js::gc::Arena::finalize<JSObject> (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsgc.cpp:331
[…]
#26 IncrementalCollectSlice (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsgc.cpp:4289
#27 0x0000003ac96ff337 in GCCycle (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsgc.cpp:4422
#28 0x0000003ac96ff917 in Collect (…) at /usr/src/debug/mozjs-24.2.0/js/src/jsgc.cpp:4558
Flags: needinfo?(terrence)
Comment on attachment 8394929 [details]
'bt full' after catching a crash in gdb with a --debug build of mozjs

SpiderMonkey embeddings /must not/ call back into the API from a finalizer, full stop. We do allow API usage, including running (almost) arbitrary script code, during GC, but /only/ during the JSGCCallback when the phase is JSGC_END. Gecko has the same need: it implements something called "delayed finalization." The idea is that when finalizers need to interact with SpiderMonkey they push the operation into a list, then run these operations in order when they get the JSGC_END callback. I guess gnome-shell needs something similar.
Flags: needinfo?(terrence)
GNOME think they have a fix for this, now:

https://bugzilla.gnome.org/show_bug.cgi?id=725024
https://git.gnome.org/browse/gjs/commit/?id=dc31b7cfda356b2691dc90440ea54b461c1b90b4

the bug's still affecting Fedora 20 and we need to backport a fix somehow or other, but I think we're getting to the bottom of it...might be OK to close this report now, I guess?
Great to hear it! Feel free to re-open if you need us to do more here.
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: