Open Bug 1337988 Opened 7 years ago Updated 2 years ago

Crash @setObejctNoCheck on ARM while creating startup cache

Categories

(Core :: JavaScript: GC, defect, P3)

defect

Tracking

()

Tracking Status
firefox53 --- affected

People

(Reporter: glandium, Unassigned)

References

Details

(Keywords: triage-deferred)

On non-Android ARM linux (Debian armhf):

Program terminated with signal SIGBUS, Bus error.
#0  JS::Value::setObjectNoCheck (obj=0xf128d050, this=<optimized out>)
    at /home/glandium/mozilla-central/obj-armv8l-unknown-linux-gnueabihf/dist/include/js/Value.h:377
377             data.asBits = bitsFromTagAndPayload(JSVAL_TAG_OBJECT, PayloadType(obj));
(gdb) bt
#0  JS::Value::setObjectNoCheck (obj=0xf128d050, this=<optimized out>)
    at /home/glandium/mozilla-central/obj-armv8l-unknown-linux-gnueabihf/dist/include/js/Value.h:377
#1  JS::Value::setObject (obj=..., this=<optimized out>)
    at /home/glandium/mozilla-central/obj-armv8l-unknown-linux-gnueabihf/dist/include/js/Value.h:372
#2  js::MutableWrappedPtrOperations<JS::Value, JS::MutableHandle<JS::Value> >::setObject (obj=..., this=<synthetic pointer>)
    at /home/glandium/mozilla-central/obj-armv8l-unknown-linux-gnueabihf/dist/include/js/Value.h:1368
#3  mozJSComponentLoader::Import (this=0xf0f60a80, registryLocation=..., targetValArg=..., cx=0xf3910000, optionalArgc=0 '\000', 
    retval=...) at /home/glandium/mozilla-central/js/xpconnect/loader/mozJSComponentLoader.cpp:1084
#4  0xf5a2e6a4 in nsXPCComponents_Utils::Import (this=<optimized out>, registryLocation=..., targetObj=..., cx=0xf3910000, 
    optionalArgc=0 '\000', retval=...) at /home/glandium/mozilla-central/js/xpconnect/src/XPCComponents.cpp:2503
#5  0xf563f1b2 in NS_InvokeByIndex (that=0x1, methodIndex=2, paramCount=4045983824, params=0xfff8ace0)
    at /home/glandium/mozilla-central/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:413
#6  0x00000000 in ?? ()

Because NS_InvokeByIndex contains assembly code and no cfi info, the stack is wrong from there, but by adding a breakpoint, here is the right backtrace above that point:

#0  NS_InvokeByIndex (that=0xf121aab0, methodIndex=9, paramCount=5, params=0xfffecdd0)
    at /home/glandium/mozilla-central/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:359
#1  0xf5bd0624 in CallMethodHelper::Invoke (this=0xfffecda0)
    at /home/glandium/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:2010
#2  CallMethodHelper::Call (this=0xfffecda0) at /home/glandium/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:1329
#3  XPCWrappedNative::CallMethod (ccx=..., mode=mode@entry=XPCWrappedNative::CALL_METHOD)
    at /home/glandium/mozilla-central/js/xpconnect/src/XPCWrappedNative.cpp:1296
#4  0xf5bd3412 in XPC_WN_CallMethod (cx=<optimized out>, argc=1, vp=0xf107e0b8)
    at /home/glandium/mozilla-central/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:983
#5  0xf6c7fc64 in js::CallJSNative (args=..., native=<optimized out>, cx=0xf3a12800)
    at /home/glandium/mozilla-central/js/src/jscntxtinlines.h:281
#6  js::InternalCallOrConstruct (cx=0xf3a12800, args=..., construct=<optimized out>)
    at /home/glandium/mozilla-central/js/src/vm/Interpreter.cpp:460
#7  0xf6c7c828 in js::CallFromStack (args=..., cx=<optimized out>) at /home/glandium/mozilla-central/js/src/vm/Interpreter.cpp:511
#8  Interpret (cx=0xf3a12800, state=...) at /home/glandium/mozilla-central/js/src/vm/Interpreter.cpp:2957
#9  0xf6c7fa2e in js::RunScript (cx=cx@entry=0xf3a12800, state=...) at /home/glandium/mozilla-central/js/src/vm/Interpreter.cpp:406
#10 0xf6c80e8a in js::ExecuteKernel (result=0x0, evalInFrame=..., newTargetValue=..., envChainArg=..., script=..., cx=0xf3a12800)
    at /home/glandium/mozilla-central/js/src/vm/Interpreter.cpp:687
#11 js::Execute (cx=cx@entry=0xf3a12800, script=..., script@entry=..., envChainArg=..., rval=rval@entry=0x0)
    at /home/glandium/mozilla-central/js/src/vm/Interpreter.cpp:720
#12 0xf6e02aba in ExecuteScript (cx=cx@entry=0xf3a12800, scope=scope@entry=..., script=..., rval=rval@entry=0x0)
    at /home/glandium/mozilla-central/js/src/jsapi.cpp:4440
#13 0xf6e0a4f4 in JS_ExecuteScript (cx=cx@entry=0xf3a12800, scriptArg=scriptArg@entry=...)
    at /home/glandium/mozilla-central/js/src/jsapi.cpp:4473
#14 0xf5bac0fc in mozJSComponentLoader::ObjectForLocation (this=this@entry=0xf1ddfa00, aInfo=..., aComponentFile=<optimized out>, 
    aObject=..., aTableScript=..., aLocation=0xf1073c84, aPropagateExceptions=aPropagateExceptions@entry=true, 
    aException=aException@entry=...) at /home/glandium/mozilla-central/js/xpconnect/loader/mozJSComponentLoader.cpp:969
#15 0xf5bac71a in mozJSComponentLoader::ImportInto (this=this@entry=0xf1ddfa00, aLocation=..., targetObj=..., targetObj@entry=..., 
    callercx=<optimized out>, callercx@entry=0xf3a12800, vp=vp@entry=...)
    at /home/glandium/mozilla-central/js/xpconnect/loader/mozJSComponentLoader.cpp:1200
#16 0xf5bacc22 in mozJSComponentLoader::Import (this=0xf1ddfa00, registryLocation=..., targetValArg=..., cx=0xf3a12800, 
    optionalArgc=0 '\000', retval=...) at /home/glandium/mozilla-central/js/xpconnect/loader/mozJSComponentLoader.cpp:1076
#17 0xf5bb16a4 in nsXPCComponents_Utils::Import (this=<optimized out>, registryLocation=..., targetObj=..., cx=0xf3a12800, 
    optionalArgc=0 '\000', retval=...) at /home/glandium/mozilla-central/js/xpconnect/src/XPCComponents.cpp:2503
#18 0xf57c21b2 in NS_InvokeByIndex (that=0xf121aab0, methodIndex=9, paramCount=5, params=0xfffed948)
    at /home/glandium/mozilla-central/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:413
#19 0x00000000 in ?? ()

(That's a different NS_InvokeByIndex, that went normally)

The crash happens because the retval in the nsXPCComponents_Utils::Import stack frame (#4 in comment 1) is:
{<js::MutableHandleBase<JS::Value, JS::MutableHandle<JS::Value> >> = {<js::MutableWrappedPtrOperations<JS::Value, JS::MutableHandle<JS::Value> >> = {<js::WrappedPtrOperations<JS::Value, JS::MutableHandle<JS::Value> >> = {<No data fields>}, <No data fields>}, <No data fields>}, ptr = 0x2}

I bisected this to bug 1308236, and backing that out fixes the crash.

Building with GCC6 causes bug 1337263 before this one, but building with GCC5, this crash still happens (while bug 1337263 doesn't).
I'm not sure why we don't see this on Android, btw. Maybe it's CPU dependent.
I'm running out of time today to take a look at this, so CC'ing jonco for now, but otherwise I can take a look at this tomorrow.
Is that=0x1 and paramCount=4045983824 an artifact of missing CFI info, or is it really getting trashed? I'm guessing bad CFI, because either of those should lead to a much more immediate crash.

(Side question: why *don't* we have CFI info there, at least enough to walk the stack? Could we drop in a couple of .cfi_offset directives and related stuff? I don't know ARM enough to do it.)

Is there a way I can replicate this with a try push? I don't know what non-Android ARM is here.

Is your second stack calling the same method via NS_InvokeByIndex?

Can you do p sizeof(retval) and ptype retval?

I can't think of how you could end up with a 0x2 pointer to a Value in any case. The bad CFI in the caller's frame wouldn't mangle that, would it?

I guess I'd have to trace through the parameter conversion code or something. Unless that operator== patch somehow confuses a Something<Value> with a Something<pointer>? But I read through the patch again and didn't see anything like that.
(In reply to Steve Fink [:sfink] [:s:] from comment #3)
> Is that=0x1 and paramCount=4045983824 an artifact of missing CFI info, or is
> it really getting trashed? I'm guessing bad CFI, because either of those
> should lead to a much more immediate crash.

See the sentence between the two backtraces.
 
> (Side question: why *don't* we have CFI info there, at least enough to walk
> the stack? Could we drop in a couple of .cfi_offset directives and related
> stuff? I don't know ARM enough to do it.)

Yes, we could. We actually should, because it affects crash-reporter stacks on android as well, when NS_InvokeByIndex is involved. That said, I'm not sure why

> Is there a way I can replicate this with a try push?

I don't think so, unfortunately. I'm using a Debian ARM porterbox. A raspberry Pi or something similar running a Linux distro would do the trick too.

I'll try GCC 4.9, which is the version we use on Android, to see if it's a GCC version thing or something else.

> I don't know what non-Android ARM is here.
> 
> Is your second stack calling the same method via NS_InvokeByIndex?

See the sentence between the two backtraces.

> Can you do p sizeof(retval) and ptype retval?

Its type is in the pasted value.

> I can't think of how you could end up with a 0x2 pointer to a Value in any
> case. The bad CFI in the caller's frame wouldn't mangle that, would it?

No.
(In reply to Mike Hommey [:glandium] from comment #1)
> I'm not sure why we don't see this on Android, btw. Maybe it's CPU dependent.

Building with GCC 4.9, like the Android builds do, doesn't crash.
Building with GCC 5 and -O0 still crashes.
(In reply to Mike Hommey [:glandium] from comment #4)
> (In reply to Steve Fink [:sfink] [:s:] from comment #3)
> > Is that=0x1 and paramCount=4045983824 an artifact of missing CFI info, or is
> > it really getting trashed? I'm guessing bad CFI, because either of those
> > should lead to a much more immediate crash.
> 
> See the sentence between the two backtraces.

I read it, I just wasn't clear on whether it would only affect the stack trace or whether it would mangle the reported stack values. It makes sense that it would mangle the on-stack params.

> I'll try GCC 4.9, which is the version we use on Android, to see if it's a
> GCC version thing or something else.
> 
> > I don't know what non-Android ARM is here.
> > 
> > Is your second stack calling the same method via NS_InvokeByIndex?
> 
> See the sentence between the two backtraces.

Just confirming that by "right backtrace" you didn't just mean a different method invocation from the same stack. Sounds like this is the real deal.

> > Can you do p sizeof(retval) and ptype retval?
> 
> Its type is in the pasted value.

I wanted the full descriptions of the types, to spot any added fields or something. Though I guess I'd need it for the various layers, not just the toplevel type. (I'm sure it's a vain hope, but if the sizeof were larger than expected, it would point to something like this.) I guess I really want something like pahole output, to confirm there's no alignment weirdness going on. Though that still feels like a long shot; surely it would mess up way more things.

> > I can't think of how you could end up with a 0x2 pointer to a Value in any
> > case. The bad CFI in the caller's frame wouldn't mangle that, would it?
> 
> No.

I have seen some mangled gdb output for these sorts of types, when in fact the actual value was fine. But given that it crashed on this, that seems unlikely. It didn't happen to give you the crashing address? And would it ordinarily give a SIGBUS for a 0x2 or similar dereference, as opposed to a SIGSEGV? The latter is what I expect on x86 linux.
We have the same problem on Fedora24/25 and armv7hl. From some reason it's working on Fedora 26 (gcc7) but fails on F25 (gcc6). As Fedora 26 has gcc7 we use some tricks to fix builds there so that may be the cause F26/gcc7 works. I'll investigate, build logs are here if anyone want's to take a look:

F26/gcc7/works:
https://kojipkgs.fedoraproject.org//packages/firefox/53.0.2/4.fc26/data/logs/armv7hl/build.log

F25/gcc6/fails:
https://kojipkgs.fedoraproject.org//work/tasks/5002/19505002/build.log
> I have seen some mangled gdb output for these sorts of types, when in fact
> the actual value was fine. But given that it crashed on this, that seems
> unlikely. It didn't happen to give you the crashing address? And would it
> ordinarily give a SIGBUS for a 0x2 or similar dereference, as opposed to a
> SIGSEGV? The latter is what I expect on x86 linux.

I put breakpoint to nsXPCComponents_Utils::Import() and when stepping here the retval is 0x2, 0x4, 0x6 and then I get then SIGSEGV crash setObjectNoCheck(). So looks like nsXPCComponents_Utils::Import() is calling recursively and the parameter is malformed somehow.

I have -g -O1 build (-O0 unfortunately fails with internal compiler error).
Keywords: triage-deferred
Priority: -- → P3
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.