Closed Bug 1309328 Opened 8 years ago Closed 3 years ago

Crash in js::jit::DoTypeMonitorFallback

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

49 Branch
Unspecified
Android
defect

Tracking

()

RESOLVED INVALID
Tracking Status
firefox49 --- wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr60 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix

People

(Reporter: u279076, Unassigned)

References

Details

(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is 
report bp-da000eee-1c7d-4bf1-9418-7b08a2161009.
=============================================================
 0 		@0xd44ef020 	
1 	libxul.so 	js::jit::DoTypeMonitorFallback 	js/src/jit/SharedIC.cpp:4023
2 		@0xffffff80 	
=============================================================
More reports: https://crash-stats.mozilla.com/signature/?product=FennecAndroid&signature=js%3A%3Ajit%3A%3ADoTypeMonitorFallback

My partner hit this crash on his Samsung Galaxy S6 Edge running Android 6.0.1 with Fennec 49.0. Unfortunately I don't have steps to reproduce this crash - he just unlocked his phone, tapped the icon to switch to an already running Fennec and it crashed. At the time it only had a single tab open to dailyhive.com.

Looking at other reports of this crash it seems to affect all other Fennec versions with 823 reports in Fennec 49 over the last week, ranking at #48.
Priority: -- → P3
Crash volume for signature 'js::jit::DoTypeMonitorFallback':
 - nightly (version 52): 7 crashes from 2016-09-19.
 - aurora  (version 51): 16 crashes from 2016-09-19.
 - beta    (version 50): 216 crashes from 2016-09-20.
 - release (version 49): 1058 crashes from 2016-09-05.
 - esr     (version 45): 15 crashes from 2016-07-25.

Crash volume on the last weeks (Week N is from 10-17 to 10-23):
            W. N-1  W. N-2  W. N-3  W. N-4
 - nightly       0       0       2       1
 - aurora        5       3       2       2
 - beta         72      77      36       8
 - release     346     269     234      81
 - esr           1       2       4       3

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #274      #496
 - aurora  #277      #272
 - beta    #296      #147
 - release #204      #103
 - esr     #2757
Mass wontfix for bugs affecting firefox 52.
Closing because no crashes reported for 12 weeks.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
There are still some crashes so reopen it.
tl;dr -- Crashes happen because gcc mis-implements part of the AAPCS
ABI (gcc bug 77728).  This causes gcc to pass (and then expect)
certain function arguments in the wrong register(s).  XPCOM methods
miscompiled in this way can't be correctly invoked by the code in
xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp because the latter
constructs an AAPCS-compliant function call rather than a buggy one.

(I apologize in advance for the length of what follows.  I'm not
intimately familiar with the mozilla codebase, and I usually work in C
rather than C++.  Those things taken together, I figured it was better
to take a "show your work" approach in this write-up.)

--

I've encountered failures on my hard-float Asus C201P Chromebook
running debian 'stretch'.

Using debian's 60esr package, the program immediately dies at startup:

  noel@stretch$ firefox --new-instance
  ExceptionHandler::GenerateDump cloned child 554
  ExceptionHandler::SendContinueSignalToChild sent continue signal to child
  ExceptionHandler::WaitForContinueSignal waiting for continue signal...

I downloaded the debian-managed sources and rebuilt the code with
debugging support using the distribution-provided gcc-6.3.0 (target
arm-linux-gnueabihf).  Running under gdb produces:

  Thread 1 "firefox-esr" received signal SIGSEGV, Segmentation fault.
  0xb3b7b87a in JS::MutableHandle<JS::Value>::set (v=..., this=<synthetic pointer>)
      at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:581
  581	        *ptr = v;
  (gdb) bt
  #0  0xb3b7b87a in JS::MutableHandle<JS::Value>::set(JS::Value const&) (v=..., this=<synthetic pointer>) at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:581
  #1  0xb3b7b87a in js::jit::DoTypeMonitorFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICTypeMonitor_Fallback*, JS::HandleValue, JS::MutableHandleValue) (cx=0xb0b12800, frame=0xbeffd248, stub=0xb0b90010, value=JSVAL_VOID, res=<error reading variable: Cannot access memory at address 0x0>) at /usr/local/src/firefox-debian-git-fail/js/src/jit/SharedIC.cpp:2388
  #2  0x3dd7b46c in  ()
  (gdb) up
  #1  js::jit::DoTypeMonitorFallback (cx=0xb0b12800, frame=0xbeffd248, stub=0xb0b90010, 
      value=JSVAL_VOID, res=<error reading variable: Cannot access memory at address 0x0>)
      at /usr/local/src/firefox-debian-git-fail/js/src/jit/SharedIC.cpp:2388
  2388	    res.set(value);
  (gdb) call js::DumpBacktrace(cx)
  #0 [Thread 0xac2ff450 (LWP 20923) exited]
        beffd270 b   resource://gre/modules/XPCOMUtils.jsm:103 (add66918 @ 0)
  #1       adbfa018 i   file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/components/nsAsyncShutdown.js:259 (adda8230 @ 466)
  (gdb) js
  0 XPCU_generateQI(interfaces = nsIAsyncShutdownService) ["resource://gre/modules/XPCOMUtils.jsm":103]
      this = [object Object]
  1 <TOP LEVEL> ["file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/components/nsAsyncShutdown.js":259]

I then disabled JIT, setting javascript.options.ion and
javascript.options.baselinejit to false, and re-ran:

  Thread 1 "firefox-esr" received signal SIGSEGV, Segmentation fault.
  JS::Rooted<JS::Value>::registerWithRootLists (roots=..., this=0xbeffc5f8)
      at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:947
  947	        this->prev = *stack;
  (gdb) bt
  #0  0xb1f7a590 in JS::Rooted<JS::Value>::registerWithRootLists(mozilla::EnumeratedArray<JS::RootKind, (JS::RootKind)14, JS::Rooted<void*>*>&) (roots=..., this=0xbeffc5f8)
      at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:947
  #1  0xb1f7a590 in JS::Rooted<JS::Value>::Rooted<JSContext*>(JSContext* const&) (cx=<synthetic pointer>: 0x1, this=0xbeffc5f8)
      at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:965
  #2  0xb1f7a590 in nsJSCID::GetService(JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>) (this=<optimized out>, iidval=..., cx=0x1, optionalArgc=0 '\000', retval=$jsval(-1.6558312099286871e-45))
      at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCJSID.cpp:696
  #3  0xb1acd254 in NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*) (that=0x0, methodIndex=3064680352, paramCount=0, params=0xbeffc7d0)
      at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:413
  #4  0xb0b12800 in  ()
  (gdb) js
  0 anonymous() ["resource://gre/modules/Services.jsm":20]
      this = [object Object]
  1 get() ["resource://gre/modules/XPCOMUtils.jsm":194]
      this = [object Object]
  2 earlyBlankFirstPaint() ["file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js":12]
  3 <TOP LEVEL> ["file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js":11]

frame #1 has cx=0x1.  That should instead be a JSContext* pointing to
somewhere in the heap.  I re-ran things and fiddled around until I got
the program to halt at the NS_InvokeByIndex just before the crash.

  Thread 1 "firefox-esr" hit Breakpoint 1, NS_InvokeByIndex (that=0xac40b0f0, methodIndex=3, paramCount=1, params=0xbeffb2a8) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:359
  359	{
  (gdb) ig 1 23
  Will ignore next 23 crossings of breakpoint 1.
  (gdb) c
  Continuing.
  [Thread 0xac2ff450 (LWP 27022) exited]

  Thread 1 "firefox-esr" hit Breakpoint 1, NS_InvokeByIndex (that=0xac40b3b0, methodIndex=11, paramCount=4, params=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:359
  359	{
  (gdb) bt 10
  #0  0xb1acd208 in NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*) (that=0xac40b3b0, methodIndex=11, paramCount=4, params=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:359
  #1  0xb1f94a86 in CallMethodHelper::Invoke() (this=0xbeffc7a0) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNative.cpp:1951
  #2  0xb1f94a86 in CallMethodHelper::Call() (this=0xbeffc7a0) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNative.cpp:1267
  #3  0xb1f94a86 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) (ccx=..., mode=mode@entry=XPCWrappedNative::CALL_METHOD) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNative.cpp:1234
  #4  0xb1f97d02 in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) (cx=0xb0b12800, argc=<optimized out>, vp=0xadbfa208) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:913
  #5  0xb39fff0e in js::CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), JS::CallArgs const&) (args=..., native=0xb1f97bb1 <XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*)>, cx=0xb0b12800) at /usr/local/src/firefox-debian-git-fail/js/src/vm/JSContext-inl.h:290
  #6  0xb39fff0e in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (cx=0xb0b12800, args=..., construct=<optimized out>) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:468
  #7  0xb39fcf04 in js::CallFromStack(JSContext*, JS::CallArgs const&) (args=..., cx=<optimized out>) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:523
  #8  0xb39fcf04 in Interpret(JSContext*, js::RunState&) (cx=0xb0b12800, state=...) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:3115
  #9  0xb39ffbc6 in js::RunScript(JSContext*, js::RunState&) (cx=0xb0b12800, state=...) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:418
  #10 0xb39fffcc in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) (cx=0xb0b12800, args=..., construct=<optimized out>) at /usr/local/src/firefox-debian-git-fail/js/src/vm/Interpreter.cpp:490
  (gdb) call js::DumpBacktrace((JSContext*)0xb0b12800)
  #0       adbfa1c8 i   resource://gre/modules/Services.jsm:20 (add66438 @ 27)
  #1       adbfa150 i   resource://gre/modules/XPCOMUtils.jsm:194 (add66bf0 @ 35)
  #2       adbfa088 i   file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js:12 (add8d0f8 @ 48)
  #3       adbfa018 i   file:///usr/local/src/firefox-debian-git-fail/build-browser/dist/bin/browser/components/nsBrowserGlue.js:11 (add8d090 @ 167)
  (gdb) n
  361	  vtable_func func = vtable[methodIndex];
  (gdb) p methodIndex
  $42 = 11
  (gdb) n
  413	  );
  (gdb) p func
  $43 = (vtable_func) 0xb1f7a4ed <nsJSCID::GetService(JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>)>
  (gdb) n
  395	      [param_count_plus_2]   "r"   (paramCount + 2),
  (gdb) n
  413	  );
  (gdb) n
  invoke_copy_to_stack (stk=0xbeffc630, end=0xbeffc660, paramCount=4, s=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:316
  316	  for (uint32_t i = 0; i < paramCount; i++, s++) {
  (gdb) p s
  $44 = (nsXPTCVariant *) 0xbeffc7d0
  (gdb) p s->val.p
  $45 = (void *) 0xadd6a5a0
  (gdb) n
  310	{
  (gdb) 
  316	  for (uint32_t i = 0; i < paramCount; i++, s++) {
  (gdb) 
  317	    if (s->IsPtrData()) {
  (gdb) p s->val.p
  $46 = (void *) 0xadd6a5a0
  (gdb) n
  318	      copy_word(ireg_args, stk, end, (uint32_t)s->ptr);
  (gdb) 
  348	        copy_word(ireg_args, stk, end, reinterpret_cast<uint32_t>(s->val.p));
  (gdb) 
  316	  for (uint32_t i = 0; i < paramCount; i++, s++) {
  (gdb) 
  317	    if (s->IsPtrData()) {
  (gdb) p s->val.p
  $47 = (void *) 0xb0b12800
  (gdb) # ^^^ JSContext*
  (gdb) n
  323	    switch (s->type)
  (gdb) n
  348	        copy_word(ireg_args, stk, end, reinterpret_cast<uint32_t>(s->val.p));
  (gdb) s
  copy_word (data=2964400128, end=0xbeffc660, stack_args=<synthetic pointer>: 0xbeffc630, ireg_args=<synthetic pointer>: 0xbeffc658) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:211
  211	  if (ireg_args < end) {
  (gdb) p ireg_args
  $53 = (uint32_t *&) <synthetic pointer>: 0xbeffc658
  (gdb) p end
  $54 = (uint32_t *) 0xbeffc660
  (gdb) n
  invoke_copy_to_stack (stk=0xbeffc630, end=<optimized out>, paramCount=4, s=0xbeffc7e0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:348
  348	        copy_word(ireg_args, stk, end, reinterpret_cast<uint32_t>(s->val.p));
  (gdb) n
  316	  for (uint32_t i = 0; i < paramCount; i++, s++) {
  (gdb) p ireg_args
  $55 = (uint32_t *) 0xbeffc65c
  (gdb) p *(ireg_args - 1)
  $58 = 0xb0b12800
  (gdb) # ^^^ copy_word decided to put it into a register
  (gdb) fin
  Run till exit from #0  invoke_copy_to_stack (stk=0xbeffc630, end=<optimized out>, paramCount=4, s=0xbeffc7e0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:316
  0xb1acd244 in NS_InvokeByIndex (that=0xbeffc634, methodIndex=3204433504, paramCount=4, params=0xbeffc7d0) at /usr/local/src/firefox-debian-git-fail/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp:413
  413	  );
  (gdb) disas/m
  Dump of assembler code for function NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*):
  359	{
     0xb1acd208 <+0>:	stmdb	sp!, {r4, r5, r6, r7, r8, r9, lr}
     0xb1acd20c <+4>:	mov	r4, r0
     0xb1acd20e <+6>:	mov	r7, r3

  360	  vtable_func *vtable = *reinterpret_cast<vtable_func **>(that);
  361	  vtable_func func = vtable[methodIndex];
     0xb1acd210 <+8>:	ldr	r3, [r0, #0]

  362	  // 'register uint32_t result asm("r0")' could be used here, but it does not
  363	  //  seem to be reliable in all cases: http://gcc.gnu.org/PR46164
  364	  nsresult result;
  365	  asm (
  366	    "mov    r3, sp\n"
  367	    "mov    %[stack_space_size], %[param_count_plus_2], lsl #3\n"
  368	    "tst    r3, #4\n" /* check stack alignment */
  369	
  370	    "add    %[stack_space_size], #(4 * 16)\n" /* space for VFP registers */
  371	    "mov    r3, %[params]\n"
  372	
  373	    "it     ne\n"
  374	    "addne  %[stack_space_size], %[stack_space_size], #4\n"
  375	    "sub    r0, sp, %[stack_space_size]\n" /* allocate space on stack */
  376	
  377	    "sub    r2, %[param_count_plus_2], #2\n"
  378	    "mov    sp, r0\n"
  379	
  380	    "add    r1, r0, %[param_count_plus_2], lsl #3\n"
  381	    "blx    %[invoke_copy_to_stack]\n"
  382	
  383	    "add    ip, sp, %[param_count_plus_2], lsl #3\n"
  384	    "mov    r0, %[that]\n"
  385	    "ldmdb  ip, {r1, r2, r3}\n"
  386	    "vldm   ip, {d0, d1, d2, d3, d4, d5, d6, d7}\n"
  387	    "blx    %[func]\n"
  388	
  389	    "add    sp, sp, %[stack_space_size]\n" /* cleanup stack */
  390	    "mov    %[stack_space_size], r0\n" /* it's actually 'result' variable */
  391	    : [stack_space_size]     "=&r" (result)
  392	    : [func]                 "r"   (func),
  393	      [that]                 "r"   (that),
  394	      [params]               "r"   (params),
  395	      [param_count_plus_2]   "r"   (paramCount + 2),
     0xb1acd216 <+14>:	add.w	r8, r2, #2

  396	      [invoke_copy_to_stack] "r"   (invoke_copy_to_stack)
  397	    : "cc", "memory",
  398	      // Mark all the scratch registers as clobbered because they may be
  399	      // modified by the functions, called from this inline assembly block
  400	      "r0", "r1", "r2", "r3", "ip", "lr",
  401	      "d0",  "d1",  "d2",  "d3",  "d4",  "d5",  "d6",  "d7",
  402	      // Also unconditionally mark d16-d31 registers as clobbered even though
  403	      // they actually don't exist in vfpv2 and vfpv3-d16 variants. There is
  404	      // no way to identify VFP variant using preprocessor at the momemnt
  405	      // (see http://gcc.gnu.org/PR46128 for more details), but fortunately
  406	      // current versions of gcc do not seem to complain about these registers
  407	      // even when this code is compiled with '-mfpu=vfpv3-d16' option.
  408	      // If gcc becomes more strict in the future and/or provides a way to
  409	      // identify VFP variant, the following d16-d31 registers list needs
  410	      // to be wrapped into some #ifdef
  411	      "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23",
  412	      "d24", "d25", "d26", "d27", "d28", "d29", "d30", "d31"
  413	  );
     0xb1acd212 <+10>:	ldr.w	r6, [r3, r1, lsl #2]
     0xb1acd21a <+18>:	ldr.w	r9, [pc, #68]	; 0xb1acd260 <NS_InvokeByIndex(nsISupports*, uint32_t, uint32_t, nsXPTCVariant*)+88>
     0xb1acd21e <+22>:	add	r9, pc
     0xb1acd220 <+24>:	mov	r3, sp
     0xb1acd222 <+26>:	mov.w	r5, r8, lsl #3
     0xb1acd226 <+30>:	tst.w	r3, #4
     0xb1acd22a <+34>:	add.w	r5, r5, #64	; 0x40
     0xb1acd22e <+38>:	mov	r3, r7
     0xb1acd230 <+40>:	it	ne
     0xb1acd232 <+42>:	addne	r5, #4
     0xb1acd234 <+44>:	sub.w	r0, sp, r5
     0xb1acd238 <+48>:	sub.w	r2, r8, #2
     0xb1acd23c <+52>:	mov	sp, r0
     0xb1acd23e <+54>:	add.w	r1, r0, r8, lsl #3
     0xb1acd242 <+58>:	blx	r9
  => 0xb1acd244 <+60>:	add.w	r12, sp, r8, lsl #3
     0xb1acd248 <+64>:	mov	r0, r4
     0xb1acd24a <+66>:	ldmdb	r12, {r1, r2, r3}
     0xb1acd24e <+70>:	vldmia	r12, {d0-d7}
     0xb1acd252 <+74>:	blx	r6
     0xb1acd254 <+76>:	add	sp, r5
     0xb1acd256 <+78>:	mov	r5, r0

  414	  return result;
  415	}
     0xb1acd258 <+80>:	mov	r0, r5
     0xb1acd25a <+82>:	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, pc}
     0xb1acd25e <+86>:	nop
     0xb1acd260 <+88>:			; <UNDEFINED> instruction: 0xfffffed3

  End of assembler dump.
  (gdb) stepi
  0xb1acd248	413	  );
  (gdb) stepi
  0xb1acd24a	413	  );
  (gdb) stepi
  0xb1acd24e	413	  );
  (gdb) stepi
  0xb1acd252	413	  );
  (gdb) stepi

  Thread 1 "firefox-esr" hit Breakpoint 5, nsJSCID::GetService (this=0xac40b3b0, iidval=$jsval(0), cx=0x1, optionalArgc=0 '\000', retval=<error reading variable: Cannot access memory at address 0x225>) at /usr/local/src/firefox-debian-git-fail/js/xpconnect/src/XPCJSID.cpp:667
  667	{
  (gdb) # ^^^ cx is 0x1 again
  (gdb) info reg
  r0             0xac40b3b0	2889921456
  r1             0xbeffc7d0	3204433872
  r2             0xb0b12800	2964400128
  r3             0x1	1
  r4             0xac40b3b0	2889921456
  r5             0x74	116
  r6             0xb1f7a4ed	2985796845
  r7             0xbeffc7d0	3204433872
  r8             0x6	6
  r9             0xb1acd0f5	2980892917
  r10            0xbeffc730	3204433712
  r11            0xbeffc70c	3204433676
  r12            0xbeffc660	3204433504
  sp             0xbeffc630	0xbeffc630
  lr             0xb1acd255	-1314074027
  pc             0xb1f7a4ec	0xb1f7a4ec <nsJSCID::GetService(JS::Handle<JS::Value>, JSContext*, unsigned char, JS::MutableHandle<JS::Value>)>
  cpsr           0x600c0030	1611399216
  (gdb) # ^^^ cx got populated from r3
  (gdb) # ^^^ but NS_InvokeByIndex stuffed it into r2
  (gdb) c
  Continuing.

  Thread 1 "firefox-esr" received signal SIGSEGV, Segmentation fault.
  JS::Rooted<JS::Value>::registerWithRootLists (roots=..., this=0xbeffc5f8) at /usr/local/src/firefox-debian-git-fail/build-browser/dist/include/js/RootingAPI.h:947
  947	        this->prev = *stack;

nsJSCID::GetService takes the current JSContext* as its second param.
That value can be found in the backtrace leading up to the invocation
of GetService -- 0xb0b12800.  NS_InvokeByIndex has the job of manually
setting up for the call to GetService.  It uses invoke_copy_to_stack
to put the incoming params in the correct places for the ABI; that
code decides to put 0xb0b12800 into a register.  After stepping into
GetService we can confirm this: 0xb0b12800 is in r2.  We expected the
'cx' arg to be set to this value, but instead find that it's 0x1,
which could have only come from r3.

I looked at the AAPCS ABI specification to see if I could figure out
how this argument should be passed.  With the caveat that I'm
primarily a C developer with very little C++ experience, it seems to
me that NS_InvokeByIndex is doing the right thing by putting the value
into r2, which means GetService is loading it from the wrong place.
If that's true, this must be a compiler bug.

Searching for "gcc AAPCS bug" leads directly to:
  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

This appears to match our crash.  If GetService wrongly expects
doubleword alignment on one of the passed-via-register arguments, that
could explain the how cx got set from the wrong register.  I say
"could" because I'm not an experienced C++ coder.  I understand the
definition of a "composite type" from the AAPCS spec, but recognizing
such a thing in the wild is a different matter.  My guess is that one
or both of the HandleValue or MutableHandleValue arguments passed to
GetService may be composite types.

Given the above, I decided to take a more pragmatic approach.  I found
some code in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80236
(marked as a duplicate of 77728) which tests for the existence of the
problem.  Running that on my device:

  noel@stretch:/usr/local/src/gcc-bug$ gcc -v 2>&1 | tail -1
  gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) 
  noel@stretch:/usr/local/src/gcc-bug$ g++ -g -O0 bar.cpp main.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
  Segmentation fault
  139

Kaboom.

I downloaded the gcc-6.3.0 debian sources and found they'd been
patched with https://gcc.gnu.org/viewcvs?rev=247639&root=gcc&view=rev
already.  However, that's the patch from 77728 which only adds
warnings telling the user that future versions of gcc will change the
ABI.  I went ahead and applied the ABI-changing fixes and rebuilt.
Re-running the test code:

  noel@stretch:/usr/local/src/gcc-bug$ /usr/local/gcc6patched/bin/g++ -g -O0 bar.cpp main.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
  0

Success!

Emboldened by this, I rebuilt firefox using the patched compiler.  The
program now starts up normally and appears to be running fine after
several days.

---

While fixing the compiler bug allows firefox to run properly, I'm not
sure it's a tenable solution.  As gcc bug 77728 notes, it _is_ an ABI
change.  It would be safe to compile firefox with a patched compiler
if it were possible to know that the firefox code would never invoke
any method in a system-provided library that had been miscompiled due
to the bug.  This doesn't seem likely.

From the OS packaging point of view, things look even worse.  No
distribution is going to be happy recompiling all of the pacakges
affected by this bug.  If that happens, then great -- this bug gets
marked WONTFIX.

---

I spent some time trying to figure out if it were possible to patch
the firefox sources to work around this compiler bug.  In the end, I
don't believe it would be feasible.  The discussion of gcc bug 77728
mentions in passing that the problem might not occur depending on the
order in which the sources are compiled.  I can observe that on my
device directly, using that test code from 80236:

  noel@stretch:/usr/local/src/gcc-bug$ g++ -g -O0 bar.cpp main.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
  Segmentation fault
  139
  noel@stretch:/usr/local/src/gcc-bug$ g++ -g -O0 main.cpp bar.cpp -o crashTest 2>/dev/null; ./crashTest; echo $?
  0

crashes happen when caller and callee have different ideas about how
parameters are passed.  If both have the same idea, things will still
work, even though that ABI is buggy according to the spec.

There's probably a way to rearrange #include directives and/or amend
the contents of those files in the firefox sources to ensure that gcc
compiles these data types in the same way.  Even if we assume this,
the xptcinvoke_arm.cpp code would need to be updated to have awareness
of the compiler bug.  invoke_copy_to_stack needs to know whether or
not a param gets the buggy or non-buggy placement, which implies the
vtable needs to be extended to contain such information, which imples
changes to the idl compiler...  Yuck.  even if it could be made to
work, it the whole thing would be extremely fragile.

---

I believe these issues affect most linux armhf distributions using
gcc-5 or gcc-6, including:

 - debian 'stretch'
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=909498
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919769

 - ubuntu 16.xx 'xenial', based on debian 'stretch'
   https://bugzilla.mozilla.org/show_bug.cgi?id=1391802

 - crouton, ubuntu 'xenial' running via chroot
   https://github.com/dnschneid/crouton/issues/3347

 - ubuntu mate, a raspberry pi distribution based on ubuntu 16.xx
   https://bugs.launchpad.net/ubuntu-mate/+bug/1714616

FWIW, in my initial testing I ran the program over and over again to
check that the crash was reliably repeatable.  The vast majority of
SIGSEGVs happened within registerWithRootLists, but there were a
handful of other places as well:

  JS::MutableHandle<JS::Value>::set
  js::GetObjectClass
  js::NativeObject::updateShapeAfterMovingGC
  mozJSComponentLoader::Import

This was more likely to happen when the machine was loaded.  The
simplest explanation is that there's a race between the the multiple
threads of execution, with the registerWithRootLists thread usually
being first to invoke a call triggering the SIGSEGV.  If that's
correct, this compiler bug may also be the root cause of these other
mozilla tickets:

https://bugzilla.mozilla.org/show_bug.cgi?id=1391802
https://bugzilla.mozilla.org/show_bug.cgi?id=1452128

---

On a semi-related note, firefox version 52 appears to have no problems
running on my system -- I've used it with months of uptime between
restarts.  I'm still trying to figure out exactly which specific
revision caused things to stop working.  It's already clear that this
compiler bug is _why_ it stopped working.  If I build both versions
from source and grep the build logs for the warning message added in
gcc bug 77728, there's a striking difference:

  building 52esr:
  noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.01 | wc -l
  130
  noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.01 | sed -e "s/^.*for argument of type '//g" -e "s/'.*$//g" | sort | uniq -c | sort -nr | less
       22 webrtc::VCMSessionInfo::PacketIterator {aka std::_List_iterator<webrtc::VCMPacket>}
       16 __gnu_cxx::__normal_iterator<mozilla::layers::CheckerboardEvent::PropertyValue*, std::vector<mozilla::layers::CheckerboardEvent::PropertyValue> >
       16 __gnu_cxx::__normal_iterator<mozilla::JsepTrackPair*, std::vector<mozilla::JsepTrackPair> >
       15 mozilla::jsipc::ObjectId
       14 __gnu_cxx::__normal_iterator<mozilla::NrIceCandidatePair*, std::vector<mozilla::NrIceCandidatePair> >
       10 __gnu_cxx::__normal_iterator<google_breakpad::Module::Line*, std::vector<google_breakpad::Module::Line> >
        8 __gnu_cxx::__normal_iterator<const mozilla::JsepTrackPair*, std::vector<mozilla::JsepTrackPair> >
        4 std::move_iterator<webrtc::EncodedImage*>
        4 const mozilla::jsipc::ObjectId

  building 60esr:
  noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.06 | wc -l
  41961
  noel@stretch$ fgrep 'will change in GCC 7.1' ~/log.06 | sed -e "s/^.*for argument of type '//g" -e "s/'.*$//g" | sort | uniq -c | sort -nr | less
    14900 JS::HandleValue {aka JS::Handle<JS::Value>}
     9621 JS::MutableHandle<JS::Value>
     8042 JS::MutableHandleValue {aka JS::MutableHandle<JS::Value>}
     7146 JS::Handle<JS::Value>
     1122 JS::MutableHandle<JS::PropertyDescriptor>
      418 const HandleValue {aka const JS::Handle<JS::Value>}
      383 JS::Handle<JS::PropertyDescriptor>
       64 js::MaybeRooted<JS::Value, (js::AllowGC)1u>::HandleType {aka JS::Handle<JS::Value>}
       22 mozilla::jsipc::ObjectId

The v52 code contains very few calls affected by the compiler bug in
comparison to v60.

I'd like to bisect the code to find out where this happened.  I've not
done that yet because a full build and testsuite tun takes 14 hours on
this machine.  I've reduced that to 4 hours using --disable-tests, but
that's still long.  I presume there is a theoretical 'minimum build'
of firefox, but online searches haven't yielded much.  So I'm asking
for advice/direction on this while I have people's attention.  (Thanks
in advance, and feel free to mail me directly if you'd rather not
clutter the ticket with such things.)

(In reply to Noel Cragg from comment #5)

That's incredible detective work. Thanks a lot!

glandium, can you look at this? IIUC certain GCC-for-ARM versions are buggy. I know our official builds use Clang now but what's the status upstream?

Flags: needinfo?(mh+mozilla)

I wonder if we can rewrite NS_InvokeByIndex for the __ARM_PCS_VFP similarly to how it's written for the non non-__ARM_PCS_VFP case such that this becomes a non-problem. Nathan, what do you think? ISTR you don't really like branch that doesn't use assembly. The only alternative I can think of is detecting the bug at configure time and having different asms to handle the different ABIs.

Flags: needinfo?(mh+mozilla) → needinfo?(nfroyd)
See Also: → 1452128

Amazing work Noel. Thanks for rootcausing this bug that has been plaguing armhf systems for almost two years. The explanation of this being tied to the AAPCS ABI change wraps up so many loose ends: why this has always affected armhf but not aarch64, what exactly was going wrong in xptcinvoke_arm.cpp, how this relates to VFP registers (particularly in Skia code), why building with clang makes the problem disappear, why moving to gcc 7 or 8 became mandatory.

See this ticket too: https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/1711337/+index?comments=all

[Noel Cragg] It would be safe to compile firefox with a patched compiler
if it were possible to know that the firefox code would never invoke
any method in a system-provided library that had been miscompiled due
to the bug.

Please correct me on this, but as I understand it we only require that Firefox never invoke any system library that is both XPCOM-based and has a C++ interface. There may be very few libraries that fit this category outside of the Mozilla ecosystem.

[glandium] rewrite NS_InvokeByIndex for the __ARM_PCS_VFP similarly to
how it's written for the non non-__ARM_PCS_VFP case such that this
becomes a non-problem

Well, there was not any such ABI change for non-armhf architectures, so I don't think we have an example to pull from.

[Noel Cragg] From the OS packaging point of view, things look even worse. No
distribution is going to be happy recompiling all of the pacakges
affected by this bug.

Agreed. It may viable for Ubuntu 16 and Debian 9 to build Firefox/Thunderbird with a patched gcc 5 or 6 compiler while keeping all existing packages on the old ABI, but even then the distros have to deal with an additional compiler build.

[glandium] The only alternative I can think of is detecting the bug at configure time
and having different asms to handle the different ABIs.

This sounds viable to me.

[Noel Cragg] I'd like to bisect the code to find out where this happened.

We have a lot of info on where the problem started in ticket 1711337 linked above. When Skia is enabled, that regression point is likely Firefox 55. It was possible to disable Skia at runtime and avoid the crash up until Firefox 57. After Firefox 58 the config flag no longer worked and it was only possible to avoid this path by unsetting SK_JUMPER_USE_ASSEMBLY at compile time. With Firefox 59 and beyond, likely several additional such crash points were introduced and it became too hard to track without a complete understanding of the rootcause.

[Noel Cragg] I've not
done that yet because a full build and testsuite tun takes 14 hours on
this machine. I've reduced that to 4 hours using --disable-tests, but
that's still long. I presume there is a theoretical 'minimum build'
of firefox, but online searches haven't yielded much. So I'm asking
for advice/direction on this while I have people's attention.

Sounds like you're compiling natively on the Asus C201P Chromebook? The main way to reduce iteration time for armhf builds is to cross-compile.

(In reply to Mike Hommey [:glandium] from comment #7)

I wonder if we can rewrite NS_InvokeByIndex for the __ARM_PCS_VFP similarly to how it's written for the non non-__ARM_PCS_VFP case such that this becomes a non-problem. Nathan, what do you think?

I'm not clear on where __ARM_PCS_VFP enters the issue; AFAICT, our Linux ARM support just punts on __ARM_PCS_VFP configurations (hard-float, I assume):

https://searchfox.org/mozilla-central/source/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp#22

Are we winding up with hard-float ABI code being called into by NS_InvokeByIndex somehow (!?)?

Or we just have a problem with jsvals being over-aligned by GCC, and that's causing problems with respect to the AAPCS?

ISTR you don't really like branch that doesn't use assembly. The only alternative I can think of is detecting the bug at configure time and having different asms to handle the different ABIs.

I'm not sure what "branch that doesn't use assembly" means in this context.

Does clang have this same bug? Or are the crashes we're seeing caused by something else that we haven't diagnosed yet?

Flags: needinfo?(nfroyd)
> [jdonald.x] Please correct me on this, but as I understand it we
> only require that Firefox never invoke any system library that is
> both XPCOM-based and has a C++ interface.

It's not as far-reaching as that.  This only affects routines which
pass references to "composite type" data.  As far as I understand
this, a composite type is a C++ container where the items inside are
not simple/basic.  Said the other way around, this is a collection of
items where it would be impossible to use a standard C "struct".

> [:froydnj] Or we just have a problem with jsvals being over-aligned
> by GCC, and that's causing problems with respect to the AAPCS?

I think the only bug is mis-handling of over-aligned data.

clang doesn't have this bug, as evidenced by jdonald.x's builds.  The
author of the gcc bugfix in ticket 77728 also notes that "clang 4.0.0
(trunk 291659) generates the same code as patched gcc."

I don't know if C++ it has reflection available at compile time which
could determine if a specific pointer will be affected by this
alignment bug.  If it does, a new nsXPTType could be added (perhaps
T_COMPOSITE or T_AGGREGATE) and invoke_copy_to_stack() could be
modified to invoke copy_dword as a workaround for the bug.

(In reply to Noel Cragg from comment #10)

[:froydnj] Or we just have a problem with jsvals being over-aligned
by GCC, and that's causing problems with respect to the AAPCS?

I think the only bug is mis-handling of over-aligned data.

I don't know if C++ it has reflection available at compile time which
could determine if a specific pointer will be affected by this
alignment bug. If it does, a new nsXPTType could be added (perhaps
T_COMPOSITE or T_AGGREGATE) and invoke_copy_to_stack() could be
modified to invoke copy_dword as a workaround for the bug.

I think we already have (some) support for this, as T_JSVAL is a separate nsXPTType field. I believe we just lump it in with a bunch of other stuff here:

https://searchfox.org/mozilla-central/source/xpcom/reflect/xptcall/md/unix/xptcinvoke_arm.cpp#77

Perhaps we should be handling it separately, with either a GCC version check or some sort of sizeof/alignof check on JS types.

(What I'm still unclear on is why hard-float is significant here, because the above code can't actually be compiled with the hard-float ABI...)

> [:froydnj] What I'm still unclear on is why hard-float is
> significant here, because the above code can't actually be compiled
> with the hard-float ABI...

Line 77 is part of invoke_copy_to_stack() used for the old ABI.
There's another one at line 308 used for the AAPCS ABI.

As far as the logic of the code in xptcinvoke_arm.cpp is concerned, I
don't think hard float is at all significant.

It keeps being mentioned because it was the big, shiny difference
between the systems that worked (arm-linux-gnueabi) and those that
didn't (arm-linux-gnueabihf).  It coincidentally happened at the same
time as the real problem, which was the switch from the the old APCS
ABI (gcc working) to the new AAPCS ABI (gcc buggy).

> [:froydnj] I think we already have (some) support for this, as
> T_JSVAL is a separate nsXPTType field.

A new case could be added before the 'default:' on line ~346K:

  #if GCC_VERSION >= 50200 && GCC_VERSION < 70000
    case nsXPTType::T_JSVAL:
      copy_dword(ireg_args, stk, end, reinterpret_cast<uint64_t>(s->val.p));
      break;
  #endif

That might fix things, but only if jsval is the only argument across
_all_ XPCOM calls that gets wrongly aligned by the gcc bug.  I'm not
sure if the latter is true or not, but I'll try applying that patch
and see what happens...
I tried several variations on the above (T_JSVAL, the "all the others
are plain pointer types" clause, the IsPtrData clause, etc.) without
success.  The gcc bug didn't line up with any of those combinations.

I think I need to spend some time doing some C++ programming -- it'd
be useful to be able to recognize the constructs in the source code
which are affected by the bug.  That seems like it'd be more
time-effective than trying to repeatedly crawl backwards from a
corrupted stack after a segfault.

I'm also working through the xpidl code to try and understand its
types.  Hopefully there's a 1:1 mapping between some set of idl types
and the arguments affected by the gcc bug.

Code no longer exists.

Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.