Crash in js::jit::DoTypeMonitorFallback
Categories
(Core :: JavaScript Engine: JIT, defect, P3)
Tracking
()
People
(Reporter: u279076, Unassigned)
References
Details
(Keywords: crash)
Crash Data
Updated•8 years ago
|
Comment 1•8 years ago
|
||
Comment 2•8 years ago
|
||
Comment 3•6 years ago
|
||
Comment 4•6 years ago
|
||
Comment 5•5 years ago
|
||
tl;dr -- Crashes happen because gcc mis-implements part of the AAPCS
ABI (gcc bug 77728). This causes gcc to pass (and then expect)
certain function arguments in the wrong register(s). XPCOM methods
miscompiled in this way can't be correctly invoked by the code in
xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp because the latter
constructs an AAPCS-compliant function call rather than a buggy one.
(I apologize in advance for the length of what follows. I'm not
intimately familiar with the mozilla codebase, and I usually work in C
rather than C++. Those things taken together, I figured it was better
to take a "show your work" approach in this write-up.)
--
I've encountered failures on my hard-float Asus C201P Chromebook
running debian 'stretch'.
Using debian's 60esr package, the program immediately dies at startup:
noel@stretch$ firefox --new-instance
ExceptionHandler::GenerateDump cloned child 554
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
I downloaded the debian-managed sources and rebuilt the code with
debugging support using the distribution-provided gcc-6.3.0 (target
arm-linux-gnueabihf). Running under gdb produces:
Thread 1 "firefox-esr" received signal SIGSEGV, Segmentation fault.
0xb3b7b87a in JS::MutableHandle<JS::Value>::set (v=..., this=<synthetic pointer>)
at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:581
581 *ptr = v;
(gdb) bt
#0 0xb3b7b87a in JS::MutableHandle<JS::Value>::set(JS::Value const&) (v=..., this=<synthetic pointer>) at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:581
#1 0xb3b7b87a in js::jit::DoTypeMonitorFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICTypeMonitor_Fallback*, JS::HandleValue, JS::MutableHandleValue) (cx=0xb0b12800, frame=0xbeffd248, stub=0xb0b90010, value=JSVAL_VOID, res=<error reading variable: Cannot access memory at address 0x0>) at /usr/local/src/firefox-debian-git-fail/js/src/jit/SharedIC.cpp:2388
#2 0x3dd7b46c in ()
(gdb) up
#1 js::jit::DoTypeMonitorFallback (cx=0xb0b12800, frame=0xbeffd248, stub=0xb0b90010,
value=JSVAL_VOID, res=<error reading variable: Cannot access memory at address 0x0>)
at /usr/local/src/firefox-debian-git-fail/js/src/jit/SharedIC.cpp:2388
2388 res.set(value);
(gdb) call js::DumpBacktrace(cx)
#0 [Thread 0xac2ff450 (LWP 20923) exited]
beffd270 b resource://gre/modules/XPCOMUtils.jsm:103 (add66918 @ 0)
#1 adbfa018 i file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/components/nsAsyncShutdown.js:259 (adda8230 @ 466)
(gdb) js
0 XPCU_generateQI(interfaces = nsIAsyncShutdownService) ["resource://gre/modules/XPCOMUtils.jsm":103]
this = [object Object]
1 <TOP LEVEL> ["file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/components/nsAsyncShutdown.js":259]
I then disabled JIT, setting javascript.options.ion and
javascript.options.baselinejit to false, and re-ran:
Thread 1 "firefox-esr" received signal SIGSEGV, Segmentation fault.
JS::Rooted<JS::Value>::registerWithRootLists (roots=..., this=0xbeffc5f8)
at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:947
947 this->prev = *stack;
(gdb) bt
#0 0xb1f7a590 in JS::Rooted<JS::Value>::registerWithRootLists(mozilla::EnumeratedArray<JS::RootKind, (JS::RootKind)14, JS::Rooted<void*>*>&) (roots=..., this=0xbeffc5f8)
at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:947
#1 0xb1f7a590 in JS::Rooted<JS::Value>::Rooted<JSContext*>(JSContext* const&) (cx=<synthetic pointer>: 0x1, this=0xbeffc5f8)
at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:965
#2 0xb1f7a590 in nsJSCID::GetService(JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>) (this=<optimized out>, iidval=..., cx=0x1, optionalArgc=0 '\000', retval=$jsval(-1.6558312099286871e-45))
at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCJSID.cpp:696
#3 0xb1acd254 in NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*) (that=0x0, methodIndex=3064680352, paramCount=0, params=0xbeffc7d0)
at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:413
#4 0xb0b12800 in ()
(gdb) js
0 anonymous() ["resource://gre/modules/Services.jsm":20]
this = [object Object]
1 get() ["resource://gre/modules/XPCOMUtils.jsm":194]
this = [object Object]
2 earlyBlankFirstPaint() ["file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js":12]
3 <TOP LEVEL> ["file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js":11]
frame #1 has cx=0x1. That should instead be a JSContext* pointing to
somewhere in the heap. I re-ran things and fiddled around until I got
the program to halt at the NS_InvokeByIndex just before the crash.
Thread 1 "firefox-esr" hit Breakpoint 1, NS_InvokeByIndex (that=0xac40b0f0, methodIndex=3, paramCount=1, params=0xbeffb2a8) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:359
359 {
(gdb) ig 1 23
Will ignore next 23 crossings of breakpoint 1.
(gdb) c
Continuing.
[Thread 0xac2ff450 (LWP 27022) exited]
Thread 1 "firefox-esr" hit Breakpoint 1, NS_InvokeByIndex (that=0xac40b3b0, methodIndex=11, paramCount=4, params=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:359
359 {
(gdb) bt 10
#0 0xb1acd208 in NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*) (that=0xac40b3b0, methodIndex=11, paramCount=4, params=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:359
#1 0xb1f94a86 in CallMethodHelper::Invoke() (this=0xbeffc7a0) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNative.cpp:1951
#2 0xb1f94a86 in CallMethodHelper::Call() (this=0xbeffc7a0) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNative.cpp:1267
#3 0xb1f94a86 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) (ccx=..., mode=mode@entry=XPCWrappedNative::CALL_METHOD) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNative.cpp:1234
#4 0xb1f97d02 in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) (cx=0xb0b12800, argc=<optimized out>, vp=0xadbfa208) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:913
#5 0xb39fff0e in js::CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), JS::CallArgs const&) (args=..., native=0xb1f97bb1 <XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*)>, cx=0xb0b12800) at /usr/local/src/firefox-debian-git-fail/js/src/vm/JSContext-inl.h:290
#6 0xb39fff0e in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (cx=0xb0b12800, args=..., construct=<optimized out>) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:468
#7 0xb39fcf04 in js::CallFromStack(JSContext*, JS::CallArgs const&) (args=..., cx=<optimized out>) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:523
#8 0xb39fcf04 in Interpret(JSContext*, js::RunState&) (cx=0xb0b12800, state=...) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:3115
#9 0xb39ffbc6 in js::RunScript(JSContext*, js::RunState&) (cx=0xb0b12800, state=...) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:418
#10 0xb39fffcc in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (cx=0xb0b12800, args=..., construct=<optimized out>) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:490
(gdb) call js::DumpBacktrace((JSContext*)0xb0b12800)
#0 adbfa1c8 i resource://gre/modules/Services.jsm:20 (add66438 @ 27)
#1 adbfa150 i resource://gre/modules/XPCOMUtils.jsm:194 (add66bf0 @ 35)
#2 adbfa088 i file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js:12 (add8d0f8 @ 48)
#3 adbfa018 i file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js:11 (add8d090 @ 167)
(gdb) n
361 vtable_func func = vtable[methodIndex];
(gdb) p methodIndex
$42 = 11
(gdb) n
413 );
(gdb) p func
$43 = (vtable_func) 0xb1f7a4ed <nsJSCID::GetService(JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>)>
(gdb) n
395 [param_count_plus_2] "r" (paramCount + 2),
(gdb) n
413 );
(gdb) n
invoke_copy_to_stack (stk=0xbeffc630, end=0xbeffc660, paramCount=4, s=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:316
316 for (uint32_t i = 0; i < paramCount; i++, s++) {
(gdb) p s
$44 = (nsXPTCVariant *) 0xbeffc7d0
(gdb) p s->val.p
$45 = (void *) 0xadd6a5a0
(gdb) n
310 {
(gdb)
316 for (uint32_t i = 0; i < paramCount; i++, s++) {
(gdb)
317 if (s->IsPtrData()) {
(gdb) p s->val.p
$46 = (void *) 0xadd6a5a0
(gdb) n
318 copy_word(ireg_args, stk, end, (uint32_t)s->ptr);
(gdb)
348 copy_word(ireg_args, stk, end, reinterpret_cast<uint32_t>(s->val.p));
(gdb)
316 for (uint32_t i = 0; i < paramCount; i++, s++) {
(gdb)
317 if (s->IsPtrData()) {
(gdb) p s->val.p
$47 = (void *) 0xb0b12800
(gdb) # ^^^ JSContext*
(gdb) n
323 switch (s->type)
(gdb) n
348 copy_word(ireg_args, stk, end, reinterpret_cast<uint32_t>(s->val.p));
(gdb) s
copy_word (data=2964400128, end=0xbeffc660, stack_args=<synthetic pointer>: 0xbeffc630, ireg_args=<synthetic pointer>: 0xbeffc658) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:211
211 if (ireg_args < end) {
(gdb) p ireg_args
$53 = (uint32_t *&) <synthetic pointer>: 0xbeffc658
(gdb) p end
$54 = (uint32_t *) 0xbeffc660
(gdb) n
invoke_copy_to_stack (stk=0xbeffc630, end=<optimized out>, paramCount=4, s=0xbeffc7e0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:348
348 copy_word(ireg_args, stk, end, reinterpret_cast<uint32_t>(s->val.p));
(gdb) n
316 for (uint32_t i = 0; i < paramCount; i++, s++) {
(gdb) p ireg_args
$55 = (uint32_t *) 0xbeffc65c
(gdb) p *(ireg_args - 1)
$58 = 0xb0b12800
(gdb) # ^^^ copy_word decided to put it into a register
(gdb) fin
Run till exit from #0 invoke_copy_to_stack (stk=0xbeffc630, end=<optimized out>, paramCount=4, s=0xbeffc7e0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:316
0xb1acd244 in NS_InvokeByIndex (that=0xbeffc634, methodIndex=3204433504, paramCount=4, params=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:413
413 );
(gdb) disas/m
Dump of assembler code for function NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*):
359 {
0xb1acd208 <+0>: stmdb sp!, {r4, r5, r6, r7, r8, r9, lr}
0xb1acd20c <+4>: mov r4, r0
0xb1acd20e <+6>: mov r7, r3
360 vtable_func *vtable = *reinterpret_cast<vtable_func **>(that);
361 vtable_func func = vtable[methodIndex];
0xb1acd210 <+8>: ldr r3, [r0, #0]
362 // 'register uint32_t result asm("r0")' could be used here, but it does not
363 // seem to be reliable in all cases: http://gcc.gnu.org/PR46164
364 nsresult result;
365 asm (
366 "mov r3, sp\n"
367 "mov %[stack_space_size], %[param_count_plus_2], lsl #3\n"
368 "tst r3, #4\n" /* check stack alignment */
369
370 "add %[stack_space_size], #(4 * 16)\n" /* space for VFP registers */
371 "mov r3, %[params]\n"
372
373 "it ne\n"
374 "addne %[stack_space_size], %[stack_space_size], #4\n"
375 "sub r0, sp, %[stack_space_size]\n" /* allocate space on stack */
376
377 "sub r2, %[param_count_plus_2], #2\n"
378 "mov sp, r0\n"
379
380 "add r1, r0, %[param_count_plus_2], lsl #3\n"
381 "blx %[invoke_copy_to_stack]\n"
382
383 "add ip, sp, %[param_count_plus_2], lsl #3\n"
384 "mov r0, %[that]\n"
385 "ldmdb ip, {r1, r2, r3}\n"
386 "vldm ip, {d0, d1, d2, d3, d4, d5, d6, d7}\n"
387 "blx %[func]\n"
388
389 "add sp, sp, %[stack_space_size]\n" /* cleanup stack */
390 "mov %[stack_space_size], r0\n" /* it's actually 'result' variable */
391 : [stack_space_size] "=&r" (result)
392 : [func] "r" (func),
393 [that] "r" (that),
394 [params] "r" (params),
395 [param_count_plus_2] "r" (paramCount + 2),
0xb1acd216 <+14>: add.w r8, r2, #2
396 [invoke_copy_to_stack] "r" (invoke_copy_to_stack)
397 : "cc", "memory",
398 // Mark all the scratch registers as clobbered because they may be
399 // modified by the functions, called from this inline assembly block
400 "r0", "r1", "r2", "r3", "ip", "lr",
401 "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7",
402 // Also unconditionally mark d16-d31 registers as clobbered even though
403 // they actually don't exist in vfpv2 and vfpv3-d16 variants. There is
404 // no way to identify VFP variant using preprocessor at the momemnt
405 // (see http://gcc.gnu.org/PR46128 for more details), but fortunately
406 // current versions of gcc do not seem to complain about these registers
407 // even when this code is compiled with '-mfpu=vfpv3-d16' option.
408 // If gcc becomes more strict in the future and/or provides a way to
409 // identify VFP variant, the following d16-d31 registers list needs
410 // to be wrapped into some #ifdef
411 "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23",
412 "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31"
413 );
0xb1acd212 <+10>: ldr.w r6, [r3, r1, lsl #2]
0xb1acd21a <+18>: ldr.w r9, [pc, #68] ; 0xb1acd260 <NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*)+88>
0xb1acd21e <+22>: add r9, pc
0xb1acd220 <+24>: mov r3, sp
0xb1acd222 <+26>: mov.w r5, r8, lsl #3
0xb1acd226 <+30>: tst.w r3, #4
0xb1acd22a <+34>: add.w r5, r5, #64 ; 0x40
0xb1acd22e <+38>: mov r3, r7
0xb1acd230 <+40>: it ne
0xb1acd232 <+42>: addne r5, #4
0xb1acd234 <+44>: sub.w r0, sp, r5
0xb1acd238 <+48>: sub.w r2, r8, #2
0xb1acd23c <+52>: mov sp, r0
0xb1acd23e <+54>: add.w r1, r0, r8, lsl #3
0xb1acd242 <+58>: blx r9
=> 0xb1acd244 <+60>: add.w r12, sp, r8, lsl #3
0xb1acd248 <+64>: mov r0, r4
0xb1acd24a <+66>: ldmdb r12, {r1, r2, r3}
0xb1acd24e <+70>: vldmia r12, {d0-d7}
0xb1acd252 <+74>: blx r6
0xb1acd254 <+76>: add sp, r5
0xb1acd256 <+78>: mov r5, r0
414 return result;
415 }
0xb1acd258 <+80>: mov r0, r5
0xb1acd25a <+82>: ldmia.w sp!, {r4, r5, r6, r7, r8, r9, pc}
0xb1acd25e <+86>: nop
0xb1acd260 <+88>: ; <UNDEFINED> instruction: 0xfffffed3
End of assembler dump.
(gdb) stepi
0xb1acd248 413 );
(gdb) stepi
0xb1acd24a 413 );
(gdb) stepi
0xb1acd24e 413 );
(gdb) stepi
0xb1acd252 413 );
(gdb) stepi
Thread 1 "firefox-esr" hit Breakpoint 5, nsJSCID::GetService (this=0xac40b3b0, iidval=$jsval(0), cx=0x1, optionalArgc=0 '\000', retval=<error reading variable: Cannot access memory at address 0x225>) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCJSID.cpp:667
667 {
(gdb) # ^^^ cx is 0x1 again
(gdb) info reg
r0 0xac40b3b0 2889921456
r1 0xbeffc7d0 3204433872
r2 0xb0b12800 2964400128
r3 0x1 1
r4 0xac40b3b0 2889921456
r5 0x74 116
r6 0xb1f7a4ed 2985796845
r7 0xbeffc7d0 3204433872
r8 0x6 6
r9 0xb1acd0f5 2980892917
r10 0xbeffc730 3204433712
r11 0xbeffc70c 3204433676
r12 0xbeffc660 3204433504
sp 0xbeffc630 0xbeffc630
lr 0xb1acd255 -1314074027
pc 0xb1f7a4ec 0xb1f7a4ec <nsJSCID::GetService(JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>)>
cpsr 0x600c0030 1611399216
(gdb) # ^^^ cx got populated from r3
(gdb) # ^^^ but NS_InvokeByIndex stuffed it into r2
(gdb) c
Continuing.
Thread 1 "firefox-esr" received signal SIGSEGV, Segmentation fault.
JS::Rooted<JS::Value>::registerWithRootLists (roots=..., this=0xbeffc5f8) at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:947
947 this->prev = *stack;
nsJSCID::GetService takes the current JSContext* as its second param.
That value can be found in the backtrace leading up to the invocation
of GetService -- 0xb0b12800. NS_InvokeByIndex has the job of manually
setting up for the call to GetService. It uses invoke_copy_to_stack
to put the incoming params in the correct places for the ABI; that
code decides to put 0xb0b12800 into a register. After stepping into
GetService we can confirm this: 0xb0b12800 is in r2. We expected the
'cx' arg to be set to this value, but instead find that it's 0x1,
which could have only come from r3.
I looked at the AAPCS ABI specification to see if I could figure out
how this argument should be passed. With the caveat that I'm
primarily a C developer with very little C++ experience, it seems to
me that NS_InvokeByIndex is doing the right thing by putting the value
into r2, which means GetService is loading it from the wrong place.
If that's true, this must be a compiler bug.
Searching for "gcc AAPCS bug" leads directly to:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728
This appears to match our crash. If GetService wrongly expects
doubleword alignment on one of the passed-via-register arguments, that
could explain the how cx got set from the wrong register. I say
"could" because I'm not an experienced C++ coder. I understand the
definition of a "composite type" from the AAPCS spec, but recognizing
such a thing in the wild is a different matter. My guess is that one
or both of the HandleValue or MutableHandleValue arguments passed to
GetService may be composite types.
Given the above, I decided to take a more pragmatic approach. I found
some code in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80236
(marked as a duplicate of 77728) which tests for the existence of the
problem. Running that on my device:
noel@stretch:/usr/local/src/gcc-bug$ gcc -v 2>&1 | tail -1
gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
noel@stretch:/usr/local/src/gcc-bug$ g++ -g -O0 bar.cpp main.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
Segmentation fault
139
Kaboom.
I downloaded the gcc-6.3.0 debian sources and found they'd been
patched with https://gcc.gnu.org/viewcvs?rev=247639&root=gcc&view=rev
already. However, that's the patch from 77728 which only adds
warnings telling the user that future versions of gcc will change the
ABI. I went ahead and applied the ABI-changing fixes and rebuilt.
Re-running the test code:
noel@stretch:/usr/local/src/gcc-bug$ /usr/local/gcc6patched/bin/g++ -g -O0 bar.cpp main.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
0
Success!
Emboldened by this, I rebuilt firefox using the patched compiler. The
program now starts up normally and appears to be running fine after
several days.
---
While fixing the compiler bug allows firefox to run properly, I'm not
sure it's a tenable solution. As gcc bug 77728 notes, it _is_ an ABI
change. It would be safe to compile firefox with a patched compiler
if it were possible to know that the firefox code would never invoke
any method in a system-provided library that had been miscompiled due
to the bug. This doesn't seem likely.
From the OS packaging point of view, things look even worse. No
distribution is going to be happy recompiling all of the pacakges
affected by this bug. If that happens, then great -- this bug gets
marked WONTFIX.
---
I spent some time trying to figure out if it were possible to patch
the firefox sources to work around this compiler bug. In the end, I
don't believe it would be feasible. The discussion of gcc bug 77728
mentions in passing that the problem might not occur depending on the
order in which the sources are compiled. I can observe that on my
device directly, using that test code from 80236:
noel@stretch:/usr/local/src/gcc-bug$ g++ -g -O0 bar.cpp main.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
Segmentation fault
139
noel@stretch:/usr/local/src/gcc-bug$ g++ -g -O0 main.cpp bar.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
0
crashes happen when caller and callee have different ideas about how
parameters are passed. If both have the same idea, things will still
work, even though that ABI is buggy according to the spec.
There's probably a way to rearrange #include directives and/or amend
the contents of those files in the firefox sources to ensure that gcc
compiles these data types in the same way. Even if we assume this,
the xptcinvoke_arm.cpp code would need to be updated to have awareness
of the compiler bug. invoke_copy_to_stack needs to know whether or
not a param gets the buggy or non-buggy placement, which implies the
vtable needs to be extended to contain such information, which imples
changes to the idl compiler... Yuck. even if it could be made to
work, it the whole thing would be extremely fragile.
---
I believe these issues affect most linux armhf distributions using
gcc-5 or gcc-6, including:
- debian 'stretch'
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=909498
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919769
- ubuntu 16.xx 'xenial', based on debian 'stretch'
https://bugzilla.mozilla.org/show_bug.cgi?id=1391802
- crouton, ubuntu 'xenial' running via chroot
https://github.com/dnschneid/crouton/issues/3347
- ubuntu mate, a raspberry pi distribution based on ubuntu 16.xx
https://bugs.launchpad.net/ubuntu-mate/+bug/1714616
FWIW, in my initial testing I ran the program over and over again to
check that the crash was reliably repeatable. The vast majority of
SIGSEGVs happened within registerWithRootLists, but there were a
handful of other places as well:
JS::MutableHandle<JS::Value>::set
js::GetObjectClass
js::NativeObject::updateShapeAfterMovingGC
mozJSComponentLoader::Import
This was more likely to happen when the machine was loaded. The
simplest explanation is that there's a race between the the multiple
threads of execution, with the registerWithRootLists thread usually
being first to invoke a call triggering the SIGSEGV. If that's
correct, this compiler bug may also be the root cause of these other
mozilla tickets:
https://bugzilla.mozilla.org/show_bug.cgi?id=1391802
https://bugzilla.mozilla.org/show_bug.cgi?id=1452128
---
On a semi-related note, firefox version 52 appears to have no problems
running on my system -- I've used it with months of uptime between
restarts. I'm still trying to figure out exactly which specific
revision caused things to stop working. It's already clear that this
compiler bug is _why_ it stopped working. If I build both versions
from source and grep the build logs for the warning message added in
gcc bug 77728, there's a striking difference:
building 52esr:
noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.01 | wc -l
130
noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.01 | sed -e "s/^.*for argument of type '//g" -e "s/'.*$//g" | sort | uniq -c | sort -nr | less
22 webrtc::VCMSessionInfo::PacketIterator {aka std::_List_iterator<webrtc::VCMPacket>}
16 __gnu_cxx::__normal_iterator<mozilla::layers::CheckerboardEvent::PropertyValue*, std::vector<mozilla::layers::CheckerboardEvent::PropertyValue> >
16 __gnu_cxx::__normal_iterator<mozilla::JsepTrackPair*, std::vector<mozilla::JsepTrackPair> >
15 mozilla::jsipc::ObjectId
14 __gnu_cxx::__normal_iterator<mozilla::NrIceCandidatePair*, std::vector<mozilla::NrIceCandidatePair> >
10 __gnu_cxx::__normal_iterator<google_breakpad::Module::Line*, std::vector<google_breakpad::Module::Line> >
8 __gnu_cxx::__normal_iterator<const mozilla::JsepTrackPair*, std::vector<mozilla::JsepTrackPair> >
4 std::move_iterator<webrtc::EncodedImage*>
4 const mozilla::jsipc::ObjectId
building 60esr:
noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.06 | wc -l
41961
noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.06 | sed -e "s/^.*for argument of type '//g" -e "s/'.*$//g" | sort | uniq -c | sort -nr | less
14900 JS::HandleValue {aka JS::Handle<JS::Value>}
9621 JS::MutableHandle<JS::Value>
8042 JS::MutableHandleValue {aka JS::MutableHandle<JS::Value>}
7146 JS::Handle<JS::Value>
1122 JS::MutableHandle<JS::PropertyDescriptor>
418 const HandleValue {aka const JS::Handle<JS::Value>}
383 JS::Handle<JS::PropertyDescriptor>
64 js::MaybeRooted<JS::Value, (js::AllowGC)1u>::HandleType {aka JS::Handle<JS::Value>}
22 mozilla::jsipc::ObjectId
The v52 code contains very few calls affected by the compiler bug in
comparison to v60.
I'd like to bisect the code to find out where this happened. I've not
done that yet because a full build and testsuite tun takes 14 hours on
this machine. I've reduced that to 4 hours using --disable-tests, but
that's still long. I presume there is a theoretical 'minimum build'
of firefox, but online searches haven't yielded much. So I'm asking
for advice/direction on this while I have people's attention. (Thanks
in advance, and feel free to mail me directly if you'd rather not
clutter the ticket with such things.)
Comment 6•5 years ago
|
||
(In reply to Noel Cragg from comment #5)
That's incredible detective work. Thanks a lot!
glandium, can you look at this? IIUC certain GCC-for-ARM versions are buggy. I know our official builds use Clang now but what's the status upstream?
Comment 7•5 years ago
|
||
I wonder if we can rewrite NS_InvokeByIndex for the __ARM_PCS_VFP similarly to how it's written for the non non-__ARM_PCS_VFP case such that this becomes a non-problem. Nathan, what do you think? ISTR you don't really like branch that doesn't use assembly. The only alternative I can think of is detecting the bug at configure time and having different asms to handle the different ABIs.
Amazing work Noel. Thanks for rootcausing this bug that has been plaguing armhf systems for almost two years. The explanation of this being tied to the AAPCS ABI change wraps up so many loose ends: why this has always affected armhf but not aarch64, what exactly was going wrong in xptcinvoke_arm.cpp, how this relates to VFP registers (particularly in Skia code), why building with clang makes the problem disappear, why moving to gcc 7 or 8 became mandatory.
See this ticket too: https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/1711337/+index?comments=all
[Noel Cragg] It would be safe to compile firefox with a patched compiler
if it were possible to know that the firefox code would never invoke
any method in a system-provided library that had been miscompiled due
to the bug.
Please correct me on this, but as I understand it we only require that Firefox never invoke any system library that is both XPCOM-based and has a C++ interface. There may be very few libraries that fit this category outside of the Mozilla ecosystem.
[glandium] rewrite NS_InvokeByIndex for the __ARM_PCS_VFP similarly to
how it's written for the non non-__ARM_PCS_VFP case such that this
becomes a non-problem
Well, there was not any such ABI change for non-armhf architectures, so I don't think we have an example to pull from.
[Noel Cragg] From the OS packaging point of view, things look even worse. No
distribution is going to be happy recompiling all of the pacakges
affected by this bug.
Agreed. It may viable for Ubuntu 16 and Debian 9 to build Firefox/Thunderbird with a patched gcc 5 or 6 compiler while keeping all existing packages on the old ABI, but even then the distros have to deal with an additional compiler build.
[glandium] The only alternative I can think of is detecting the bug at configure time
and having different asms to handle the different ABIs.
This sounds viable to me.
[Noel Cragg] I'd like to bisect the code to find out where this happened.
We have a lot of info on where the problem started in ticket 1711337 linked above. When Skia is enabled, that regression point is likely Firefox 55. It was possible to disable Skia at runtime and avoid the crash up until Firefox 57. After Firefox 58 the config flag no longer worked and it was only possible to avoid this path by unsetting SK_JUMPER_USE_ASSEMBLY at compile time. With Firefox 59 and beyond, likely several additional such crash points were introduced and it became too hard to track without a complete understanding of the rootcause.
[Noel Cragg] I've not
done that yet because a full build and testsuite tun takes 14 hours on
this machine. I've reduced that to 4 hours using --disable-tests, but
that's still long. I presume there is a theoretical 'minimum build'
of firefox, but online searches haven't yielded much. So I'm asking
for advice/direction on this while I have people's attention.
Sounds like you're compiling natively on the Asus C201P Chromebook? The main way to reduce iteration time for armhf builds is to cross-compile.
Comment 9•5 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #7)
I wonder if we can rewrite NS_InvokeByIndex for the __ARM_PCS_VFP similarly to how it's written for the non non-__ARM_PCS_VFP case such that this becomes a non-problem. Nathan, what do you think?
I'm not clear on where __ARM_PCS_VFP
enters the issue; AFAICT, our Linux ARM support just punts on __ARM_PCS_VFP
configurations (hard-float, I assume):
https://searchfox.org/mozilla-central/source/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp#22
Are we winding up with hard-float ABI code being called into by NS_InvokeByIndex
somehow (!?)?
Or we just have a problem with jsval
s being over-aligned by GCC, and that's causing problems with respect to the AAPCS?
ISTR you don't really like branch that doesn't use assembly. The only alternative I can think of is detecting the bug at configure time and having different asms to handle the different ABIs.
I'm not sure what "branch that doesn't use assembly" means in this context.
Does clang have this same bug? Or are the crashes we're seeing caused by something else that we haven't diagnosed yet?
Comment 10•5 years ago
|
||
> [jdonald.x] Please correct me on this, but as I understand it we
> only require that Firefox never invoke any system library that is
> both XPCOM-based and has a C++ interface.
It's not as far-reaching as that. This only affects routines which
pass references to "composite type" data. As far as I understand
this, a composite type is a C++ container where the items inside are
not simple/basic. Said the other way around, this is a collection of
items where it would be impossible to use a standard C "struct".
> [:froydnj] Or we just have a problem with jsvals being over-aligned
> by GCC, and that's causing problems with respect to the AAPCS?
I think the only bug is mis-handling of over-aligned data.
clang doesn't have this bug, as evidenced by jdonald.x's builds. The
author of the gcc bugfix in ticket 77728 also notes that "clang 4.0.0
(trunk 291659) generates the same code as patched gcc."
I don't know if C++ it has reflection available at compile time which
could determine if a specific pointer will be affected by this
alignment bug. If it does, a new nsXPTType could be added (perhaps
T_COMPOSITE or T_AGGREGATE) and invoke_copy_to_stack() could be
modified to invoke copy_dword as a workaround for the bug.
Comment 11•5 years ago
|
||
(In reply to Noel Cragg from comment #10)
[:froydnj] Or we just have a problem with jsvals being over-aligned
by GCC, and that's causing problems with respect to the AAPCS?I think the only bug is mis-handling of over-aligned data.
I don't know if C++ it has reflection available at compile time which
could determine if a specific pointer will be affected by this
alignment bug. If it does, a new nsXPTType could be added (perhaps
T_COMPOSITE or T_AGGREGATE) and invoke_copy_to_stack() could be
modified to invoke copy_dword as a workaround for the bug.
I think we already have (some) support for this, as T_JSVAL
is a separate nsXPTType
field. I believe we just lump it in with a bunch of other stuff here:
https://searchfox.org/mozilla-central/source/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp#77
Perhaps we should be handling it separately, with either a GCC version check or some sort of sizeof/alignof
check on JS types.
(What I'm still unclear on is why hard-float is significant here, because the above code can't actually be compiled with the hard-float ABI...)
Comment 12•5 years ago
|
||
> [:froydnj] What I'm still unclear on is why hard-float is
> significant here, because the above code can't actually be compiled
> with the hard-float ABI...
Line 77 is part of invoke_copy_to_stack() used for the old ABI.
There's another one at line 308 used for the AAPCS ABI.
As far as the logic of the code in xptcinvoke_arm.cpp is concerned, I
don't think hard float is at all significant.
It keeps being mentioned because it was the big, shiny difference
between the systems that worked (arm-linux-gnueabi) and those that
didn't (arm-linux-gnueabihf). It coincidentally happened at the same
time as the real problem, which was the switch from the the old APCS
ABI (gcc working) to the new AAPCS ABI (gcc buggy).
> [:froydnj] I think we already have (some) support for this, as
> T_JSVAL is a separate nsXPTType field.
A new case could be added before the 'default:' on line ~346K:
#if GCC_VERSION >= 50200 && GCC_VERSION < 70000
case nsXPTType::T_JSVAL:
copy_dword(ireg_args, stk, end, reinterpret_cast<uint64_t>(s->val.p));
break;
#endif
That might fix things, but only if jsval is the only argument across
_all_ XPCOM calls that gets wrongly aligned by the gcc bug. I'm not
sure if the latter is true or not, but I'll try applying that patch
and see what happens...
Comment 13•5 years ago
|
||
I tried several variations on the above (T_JSVAL, the "all the others
are plain pointer types" clause, the IsPtrData clause, etc.) without
success. The gcc bug didn't line up with any of those combinations.
I think I need to spend some time doing some C++ programming -- it'd
be useful to be able to recognize the constructs in the source code
which are affected by the bug. That seems like it'd be more
time-effective than trying to repeatedly crawl backwards from a
corrupted stack after a segfault.
I'm also working through the xpidl code to try and understand its
types. Hopefully there's a 1:1 mapping between some set of idl types
and the arguments affected by the gcc bug.
Comment 14•4 years ago
|
||
Code no longer exists.
Updated•3 years ago
|
Description
•