Closed Bug 1247399 Opened 4 years ago Closed 4 years ago

No crash report on Android test crash: ERROR: Minidump ... has no thread list

Categories

(Testing :: General, defect)

ARM
Android
defect
Not set

Tracking

(firefox46 affected, firefox47 fixed)

RESOLVED FIXED
mozilla47
Tracking Status
firefox46 --- affected
firefox47 --- fixed

People

(Reporter: gbrown, Assigned: snorp)

References

Details

Attachments

(5 files)

http://archive.mozilla.org/pub/mobile/tinderbox-builds/mozilla-central-android-api-15/1455116964/mozilla-central_ubuntu64_vm_armv7_mobile_test-robocop-2-bm67-tests1-linux64-build28.txt.gz

09:37:52     INFO -  mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/aYx8UzeMREmv43jdMrZKqw/artifacts/public/build/fennec-47.0a1.en-US.android-arm.crashreporter-symbols.zip
09:37:57     INFO -  mozcrash Copy/paste: /builds/slave/test/build/linux64-minidump_stackwalk /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp /tmp/tmp2jt3Po
09:37:58     INFO -  mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp
09:37:58     INFO -  mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/7da8168e-8ec2-1b82-0270fc93-352c3776.extra
09:37:58  WARNING -  PROCESS-CRASH | testMailToContextMenu | application crashed [None]
09:37:58     INFO -  Crash dump filename: /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp
09:37:58     INFO -  stderr from minidump_stackwalk:
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4197: INFO: Minidump opened minidump /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4314: INFO: Minidump not byte-swapping minidump
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 15 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 7 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 7 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 1197932545 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 6 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 1197932546 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 4 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 5 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 3 not present
09:37:58     INFO -  2016-02-10 09:37:58: minidump_processor.cc:140: ERROR: Minidump /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp has no thread list
09:37:58     INFO -  2016-02-10 09:37:58: stackwalk.cc:131: ERROR: MinidumpProcessor::Process failed
09:37:58     INFO -  2016-02-10 09:37:58: minidump.cc:4169: INFO: Minidump closing minidump
09:37:58     INFO -  0 ERROR runApp() exited with code 1

09:38:00     INFO -  02-10 09:37:32.058 W/google-breakpad( 1648): ExceptionHandler::GenerateDump waitpid failed:
09:38:00     INFO -  02-10 09:37:32.058 W/google-breakpad( 1648): No child processes
09:38:00     INFO -  02-10 09:37:32.068 W/google-breakpad( 1648):
09:38:00     INFO -  02-10 09:37:32.089 W/google-breakpad( 1724): ExceptionHandler::WaitForContinueSignal sys_read failed:
09:38:00     INFO -  02-10 09:37:32.089 W/google-breakpad( 1724): Bad file number
09:38:00     INFO -  02-10 09:37:32.108 W/google-breakpad( 1724):
09:38:00     INFO -  02-10 09:37:32.148 F/libc    ( 1648): Fatal signal 11 (SIGSEGV) at 0xe5e5e5ed (code=1), thread 1648 (.mozilla.fennec)
09:38:00     INFO -  02-10 09:37:32.308 I/DEBUG   (   35): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
09:38:00     INFO -  02-10 09:37:32.308 I/DEBUG   (   35): Build fingerprint: 'generic/sdk/generic:4.3.1/JLS36I/eng.gbrown.20150308.182649:eng/test-keys'
09:38:00     INFO -  02-10 09:37:32.308 I/DEBUG   (   35): Revision: '0'
09:38:00     INFO -  02-10 09:37:32.318 I/DEBUG   (   35): pid: 1648, tid: 1648, name: UNKNOWN  >>> org.mozilla.fennec <<<
09:38:00     INFO -  02-10 09:37:32.318 I/DEBUG   (   35): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr e5e5e5ed
09:38:00     INFO -  02-10 09:37:33.488 I/DEBUG   (   35):     r0 bee4a5b8  r1 0000012d  r2 0000050f  r3 3b9aca00
09:38:00     INFO -  02-10 09:37:33.518 I/DEBUG   (   35):     r4 e5e5e5e5  r5 52b7e140  r6 9e657dc3  r7 0000012d
09:38:00     INFO -  02-10 09:37:33.518 I/DEBUG   (   35):     r8 bee4a5e0  r9 4c466df0  sl 2a00d0a0  fp bee4a5f4
09:38:00     INFO -  02-10 09:37:33.518 I/DEBUG   (   35):     ip 4c466df8  sp bee4a5b0  lr 527957ff  pc 5ab0e626  cpsr 40000030
09:38:00     INFO -  02-10 09:37:33.528 I/DEBUG   (   35):     d0  0000000080000000  d1  0000000042480000
09:38:00     INFO -  02-10 09:37:33.528 I/DEBUG   (   35):     d2  3f80000000000000  d3  000000003f800000
09:38:00     INFO -  02-10 09:37:33.528 I/DEBUG   (   35):     d4  3f80000000000000  d5  43b4800000000000
09:38:00     INFO -  02-10 09:37:33.528 I/DEBUG   (   35):     d6  0000000080000000  d7  0000000080000000
09:38:00     INFO -  02-10 09:37:33.539 I/DEBUG   (   35):     d8  408f400000000000  d9  000000003f000000
09:38:00     INFO -  02-10 09:37:33.539 I/DEBUG   (   35):     d10 0000000000000000  d11 0000000000000000
09:38:00     INFO -  02-10 09:37:33.548 I/DEBUG   (   35):     d12 0000000000000000  d13 0000000000000000
09:38:00     INFO -  02-10 09:37:33.548 I/DEBUG   (   35):     d14 0000000000000000  d15 0000000000000000
09:38:00     INFO -  02-10 09:37:33.548 I/DEBUG   (   35):     scr 60000013
09:38:00     INFO -  02-10 09:37:33.599 I/DEBUG   (   35):
09:38:00     INFO -  02-10 09:37:33.599 I/DEBUG   (   35): backtrace:
09:38:00     INFO -  02-10 09:37:33.619 I/DEBUG   (   35):     #00  pc 00f95626  /dev/ashmem/libxul.so (deleted)
09:38:00     INFO -  02-10 09:37:33.619 I/DEBUG   (   35):     #01  pc 0001dc4c  /system/lib/libdvm.so (dvmPlatformInvoke+112)
09:38:00     INFO -  02-10 09:37:33.619 I/DEBUG   (   35):     #02  pc 0004dcab  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+394)
09:38:00     INFO -  02-10 09:37:33.619 I/DEBUG   (   35):     #03  pc 000385e1  /system/lib/libdvm.so (dvmCheckCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+8)
09:38:00     INFO -  02-10 09:37:33.629 I/DEBUG   (   35):     #04  pc 0004f699  /system/lib/libdvm.so (dvmResolveNativeMethod(unsigned int const*, JValue*, Method const*, Thread*)+184)
09:38:00     INFO -  02-10 09:37:33.629 I/DEBUG   (   35):     #05  pc 00027060  /system/lib/libdvm.so
09:38:00     INFO -  02-10 09:37:33.629 I/DEBUG   (   35):     #06  pc 0002b580  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+184)
09:38:00     INFO -  02-10 09:37:33.639 I/DEBUG   (   35):     #07  pc 0005ff7b  /system/lib/libdvm.so (dvmInvokeMethod(Object*, Method const*, ArrayObject*, ArrayObject*, ClassObject*, bool)+350)
09:38:00     INFO -  02-10 09:37:33.639 I/DEBUG   (   35):     #08  pc 00067a9f  /system/lib/libdvm.so
09:38:00     INFO -  02-10 09:37:33.639 I/DEBUG   (   35):     #09  pc 00027060  /system/lib/libdvm.so
09:38:00     INFO -  02-10 09:37:33.649 I/DEBUG   (   35):     #10  pc 0002b580  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+184)
09:38:00     INFO -  02-10 09:37:33.649 I/DEBUG   (   35):     #11  pc 0005fcbd  /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+292)
09:38:00     INFO -  02-10 09:37:33.649 I/DEBUG   (   35):     #12  pc 000499ab  /system/lib/libdvm.so
09:38:00     INFO -  02-10 09:37:33.658 I/DEBUG   (   35):     #13  pc 0003cb3d  /system/lib/libdvm.so
09:38:00     INFO -  02-10 09:37:33.658 I/DEBUG   (   35):     #14  pc 0004b68f  /system/lib/libandroid_runtime.so
09:38:00     INFO -  02-10 09:37:33.658 I/DEBUG   (   35):     #15  pc 0004c30f  /system/lib/libandroid_runtime.so (android::AndroidRuntime::start(char const*, char const*)+378)
09:38:00     INFO -  02-10 09:37:33.669 I/DEBUG   (   35):     #16  pc 0000105b  /system/bin/app_process
09:38:00     INFO -  02-10 09:37:33.669 I/DEBUG   (   35):     #17  pc 0000db4f  /system/lib/libc.so (__libc_init+50)
09:38:00     INFO -  02-10 09:37:33.669 I/DEBUG   (   35):     #18  pc 00000d7c  /system/bin/app_process
As discussed earlier on irc:

<ted> gbrown: so the way the exception handler works on linux is that it calls sys_clone, which is like halfway between a fork and a thread
<ted> https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#525
<ted> then that new thread does the minidump writing of its parent
<ted> so after sys_clone, the new thread waits for a signal from the parent by reading one end of a pipe:
<ted> https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#572
<ted> that "ExceptionHandler::WaitForContinueSignal sys_read failed:" indicates that that went wrong somehow
<ted> then the parent process, after sending that signal on the pipe, calls waitpid to wait on the child: https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#538
<ted> so it makes sense that that would fail if the child thread failed because it would exit early
<ted> but *why* that's broken i don't know
<ted> gbrown: file a bug to track this?
<gbrown> ted: sure. can do ... after lunch
<ted> gbrown: the ordering of those log messages is...odd
<ted> although maybe that's just buffering or something?
Oh, that's interesting! It wrote a dump file? I wonder if this is a re-entrant crash, Breakpad crashing trying to write a dump?
https://hg.mozilla.org/mozilla-central/rev/99dae519223a
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Status: RESOLVED → REOPENED
Keywords: leave-open
OS: Unspecified → Android
Hardware: Unspecified → ARM
Resolution: FIXED → ---
Target Milestone: mozilla47 → ---
The logging appears in a failed report here[0]

02-21 18:50:16.119  8265  8675 W google-breakpad: ExceptionHandler::GenerateDump cloned child 
02-21 18:50:16.120  8265  8675 W google-breakpad: 18446744073179454268(�L�
02-21 18:50:16.120  8265  8675 W google-breakpad: 
02-21 18:50:16.120  8265  8675 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
02-21 18:50:16.120 10823  8675 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
02-21 18:50:16.120 10823  8675 W google-breakpad: ExceptionHandler::WaitForContinueSignal sys_read failed:
02-21 18:50:16.120  8265  8675 W google-breakpad: ExceptionHandler::GenerateDump waitpid failed:
02-21 18:50:16.120 10823  8675 W google-breakpad: Bad file descriptor
02-21 18:50:16.120  8265  8675 W google-breakpad: No child processes

It looks like we fail to print the child PID here somehow (garbage in the string -- should've cleared to 0 first, whoops). I also don't see the "I'm the child" message anywhere. The child clearly does run, however. I think the only thing that can explain a bad fd in the child is if waitpid() returns immediately (with ECHILD, apparently) and we close the fds. Why would it return ECHILD if it obviously has children? The man page[1] says this can happen if the SIGCHLD is set to SIG_IGN, but even then it should wait until the child process has exited. Kernel bug seems unlikely, but wtf?

Then I was thinking, what does the log look like with a successful dump? Nearly identical, as seen here[2]

02-21 01:32:21.254  7910  8056 W google-breakpad: ExceptionHandler::GenerateDump cloned child 
02-21 01:32:21.254  7910  8056 W google-breakpad: 1983892532�C
02-21 01:32:21.254  7910  8056 W google-breakpad: 
02-21 01:32:21.254  7910  8056 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child
02-21 01:32:21.254  7910  8056 W google-breakpad: ExceptionHandler::GenerateDump waitpid failed:
02-21 01:32:21.254  7910  8056 W google-breakpad: No child processes
02-21 01:32:21.254  7910  8056 W google-breakpad: 
02-21 01:32:21.254 17286 17286 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal...
02-21 01:32:21.254 17286 17286 W google-breakpad: ExceptionHandler::WaitForContinueSignal sys_read failed:
02-21 01:32:21.254 17286 17286 W google-breakpad: Bad file number
02-21 01:32:21.254 17286 17286 W google-breakpad: 

It seems likely that the continuation signaling is just completely busted, and the child is racing to create the dump before the parent has enabled tracing. The child continues with the dump even if WaitForContinueSignal() fails.I still have no idea why we would get ECHILD from waitpid() unless it's the SIG_IGN thing or some kind of kernel bug, but somehow it is working out most of the time.

[0] https://crash-stats.mozilla.com/report/index/210ae0e8-eba6-44d4-aa04-ad3522160221
[1] http://linux.die.net/man/2/waitpid
[2] https://crash-stats.mozilla.com/report/index/b2029df2-2c15-490d-91fe-d5f042160221
Sleeping for 1s before the waitpid() avoids the bad fd issues in the child. So the real problem is definitely waitpid()'s immediate return.
We don't do anything with SIGCHLD or use SA_NOCLDWAIT, AFAICT, so I don't think that's a factor.
I'm not sure what the difference is between the success and failure cases at this point. If the child successfully reads the continuation fd then ptracing has been enabled, but that's also the case if the parent has closed the fd, so it shouldn't make a difference in the result. AFAICT, waitpid() fails with ECHILD every time no matter what. I guess when the parent process exits, it will cause the child to be killed too, but I would think we should lose that race more often than we appear to be.
Ah, I think I just figured it out. The sys_clone() return value is just total garbage, so we pass the wrong value to waitpid().
Using clone() from libc works fine, so the implementation in breakpad is busted.
glandium, do you see anything in the breakpad sys_clone()[0] that would screw up the return val?

[0] https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h#2367
Flags: needinfo?(mh+mozilla)
The disassembled breakpad sys_clone() as it appears in Gecko:

     df0:       b480            push    {r7}
     df2:       f1bc 0f00       cmp.w   ip, #0
     df6:       bf18            it      ne
     df8:       2900            cmpne   r1, #0
     dfa:       bf08            it      eq
     dfc:       f06f 0715       mvneq.w r7, #21
     e00:       d00e            beq.n   e20 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10c>
     e02:       f841 7d04       str.w   r7, [r1, #-4]!
     e06:       f841 cd04       str.w   ip, [r1, #-4]!
     e0a:       f04f 0778       mov.w   r7, #120        ; 0x78
     e0e:       df00            svc     0
     e10:       0007            movs    r7, r0
     e12:       d105            bne.n   e20 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10c>
     e14:       9801            ldr     r0, [sp, #4]
     e16:       9f00            ldr     r7, [sp, #0]
     e18:       47b8            blx     r7
     e1a:       f04f 0701       mov.w   r7, #1
     e1e:       df00            svc     0
     e20:       bc80            pop     {r7}
Attached file clone-test.tar.bz2
This is a small test that uses sys_clone() from linux_syscall_support.h. This seems to work, indicating we have some problem with the Gecko build.
I suppose it could also be something special about calling sys_clone() from within a signal handler....
It seems to work in Gecko outside of the signal handler. I basically pasted the main.c from the test into AndroidBridge and it works fine. Ugh....
What does the disassembly of the clone function from libc look like?
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #19)
> What does the disassembly of the clone function from libc look like?

000174b4 <__bionic_clone>:
   174b4:       e1a0c00d        mov     ip, sp
   174b8:       e92d00f0        push    {r4, r5, r6, r7}
   174bc:       e89c0070        ldm     ip, {r4, r5, r6}
   174c0:       e9210060        stmdb   r1!, {r5, r6}
   174c4:       e3a07078        mov     r7, #120        ; 0x78
   174c8:       ef000000        svc     0x00000000
   174cc:       e1b00000        movs    r0, r0
   174d0:       0a000004        beq     174e8 <__bionic_clone+0x34>
   174d4:       e8bd00f0        pop     {r4, r5, r6, r7}
   174d8:       e3700a01        cmn     r0, #4096       ; 0x1000
   174dc:       912fff1e        bxls    lr
   174e0:       e2600000        rsb     r0, r0, #0
   174e4:       ea012c40        b       625ec <__sync_lock_release_1+0xd0>
   174e8:       e3a0e000        mov     lr, #0
   174ec:       e8bd0003        pop     {r0, r1}
   174f0:       ea012c41        b       625fc <__sync_lock_release_1+0xe0>
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #15)
>      e02:       f841 7d04       str.w   r7, [r1, #-4]!
>      e06:       f841 cd04       str.w   ip, [r1, #-4]!

So, those correspond to:

/* Push "arg" and "fn" onto the stack that will be
 * used by the child.
 */
 "str   %5,[%3,#-4]!\n"
 "str   %2,[%3,#-4]!\n"

from https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h#2390-2394

If the pasted disassembly is really all of sys_clone, I don't see how "arg" and "fn" are supposed to end up in r7 and ip...
This patch fixes it for me. I think the stack arithmetic must be wrong when copying r0 into the return value.

diff --git a/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h b/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h
index 8a42c1c..cb91db5 100644
--- a/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h
+++ b/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h
@@ -2367,7 +2367,7 @@ struct kernel_statfs {
     LSS_INLINE int LSS_NAME(clone)(int (*fn)(void *), void *child_stack,
                                    int flags, void *arg, int *parent_tidptr,
                                    void *newtls, int *child_tidptr) {
-      long __res;
+      register long __res __asm__("r10") = 0;
       {
         register int   __flags __asm__("r0") = flags;
         register void *__stack __asm__("r1") = child_stack;
(In reply to Mike Hommey [:glandium] from comment #21)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #15)
> >      e02:       f841 7d04       str.w   r7, [r1, #-4]!
> >      e06:       f841 cd04       str.w   ip, [r1, #-4]!
> 
> So, those correspond to:
> 
> /* Push "arg" and "fn" onto the stack that will be
>  * used by the child.
>  */
>  "str   %5,[%3,#-4]!\n"
>  "str   %2,[%3,#-4]!\n"
> 
> from
> https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-
> breakpad/src/third_party/lss/linux_syscall_support.h#2390-2394
> 
> If the pasted disassembly is really all of sys_clone, I don't see how "arg"
> and "fn" are supposed to end up in r7 and ip...

Hmm, yeah. Maybe I missed some of the setup code? I'll look again.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #22)
> This patch fixes it for me. I think the stack arithmetic must be wrong when
> copying r0 into the return value.

By that I mean __res.
It looks like the bionic one just leaves the return value in r0, which I guess is fine because it's a function call and not inlined like the lss one. I would guess r0 is the standard location for function return values?

The part that looks fishy to me in the lss version is we copy r0 to sp + 4, but earlier we 'push {r7}' which I assume increases the sp. As a result, sp+4 can't possibly be right? Putting __res in r10 then works around that issue. I'm just flailing around and generally have no idea what I'm doing, though, so this is probably wrong.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #25)
> The part that looks fishy to me in the lss version is we copy r0 to sp + 4,
> but earlier we 'push {r7}' which I assume increases the sp. As a result,
> sp+4 can't possibly be right?

I guess since stacks grow down, it would *decrease* sp. At any rate, the location of __res is somehow wrong.
You mean the code at e14? that's preparing another function call in the child process, and that's a different stack.
Ah, yeah, that's what I was looking at. Where does it copy into __res? I have some more disassembly from the area below:

de4:       4641            mov     r1, r8
     de6:       4613            mov     r3, r2
     de8:       4614            mov     r4, r2
     dea:       44fa            add     sl, pc
     dec:       f10d 0c24       add.w   ip, sp, #36     ; 0x24
     df0:       b480            push    {r7}
     df2:       f1ba 0f00       cmp.w   sl, #0
     df6:       bf18            it      ne
     df8:       2900            cmpne   r1, #0
     dfa:       bf08            it      eq
     dfc:       f06f 0a15       mvneq.w sl, #21
     e00:       d00f            beq.n   e22 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10e>
     e02:       f841 cd04       str.w   ip, [r1, #-4]!
     e06:       f841 ad04       str.w   sl, [r1, #-4]!
     e0a:       f04f 0778       mov.w   r7, #120        ; 0x78
     e0e:       df00            svc     0
     e10:       ea5f 0a00       movs.w  sl, r0
     e14:       d105            bne.n   e22 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10e>
     e16:       9801            ldr     r0, [sp, #4]
     e18:       9f00            ldr     r7, [sp, #0]
     e1a:       47b8            blx     r7
     e1c:       f04f 0701       mov.w   r7, #1
     e20:       df00            svc     0
     e22:       bc80            pop     {r7}
     e24:       f51a 5f80       cmn.w   sl, #4096       ; 0x1000
     e28:       4654            mov     r4, sl
     e2a:       d905            bls.n   e38 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x124>
     e2c:       f7ff fffe       bl      0 <__errno>
     e30:       f04f 3aff       mov.w   sl, #4294967295 ; 0xffffffff
     e34:       4264            negs    r4, r4
     e36:       6004            str     r4, [r0, #0]
     e38:       f1ba 3fff       cmp.w   sl, #4294967295 ; 0xffffffff
     e3c:       d107            bne.n   e4e <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x13a>
     e3e:       6e30            ldr     r0, [r6, #96]   ; 0x60
     e40:       2400            movs    r4, #0
Whoops that's from my modified version. Just a sec.
I guess r10 is an alias for sl so my change seems pretty bad....
So here is the actual problem: gcc assigned r7 to _res. So, the movs r7, r0 that follows the svc is the assignment to _res. And when r0 != 0, we jump to the pop {r7}, which restores the value of r7 that was pushed at the beginning of the function, and after that, we're likely re-copying r7 to r0 to make that the result (that doesn't appear in your disassembly, btw, but has to be there somewhere).

So yes, it makes sense that assigning a register for _res would fix it... OTOH, the whole thing with push/pop r7 is fishy. It would be interesting to see why they added that upstream (as in, what they were trying to fix), because it looks like their fix is fishy. It seems to me that removing the push/pop r7 and readding r7 to clobbers would fix this issue too, but without knowing what those were supposed to fix, it's hard to tell if it would do more good than harm.
Note that I wrote comment 32 before comment 28 and following.
(In reply to Mike Hommey [:glandium] from comment #32)
I guess maybe they wanted to preserve the frame pointer? It does seem like the easiest fix is to use r8 or something for __res. Or force it onto the stack somehow.
The change to push/pop r7 is here: https://bugs.chromium.org/p/linux-syscall-support/issues/detail?id=2

It seems they are in fact doing that to preserve the fp. Mike what do you think is the best fix here? Assign __res to r8?
So, they are, in fact, trying to fix the same thing that we had worked around by building with -fomit-frame-pointer per bug 633436. And I removed that thing in bug 1245055 because it looks like the compiler bug that made it fail to build is gone... So my take on this is that we could go with removing their fix for that now non-existing bug.
Yup, this works fine now. We should try to upstream this too.
Attachment #8723127 - Flags: review?(mh+mozilla)
Comment on attachment 8723127 [details] [diff] [review]
Put r7 in clobber registers for breakpad's sys_clone()

Review of attachment 8723127 [details] [diff] [review]:
-----------------------------------------------------------------

You could title this "Revert upstream fix for their equivalent of bug 633436 because it does more harm than good" :)
Attachment #8723127 - Flags: review?(mh+mozilla) → review+
Assignee: nobody → snorp
Status: REOPENED → ASSIGNED
See Also: → 1245570
Comment on attachment 8723127 [details] [diff] [review]
Put r7 in clobber registers for breakpad's sys_clone()

Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: Possibly broken crash reports
[Describe test coverage new/current, TreeHerder]: nightly
[Risks and why]: Low
[String/UUID change made/needed]: None
Attachment #8723127 - Flags: approval-mozilla-aurora?
Marking 46 as affected
Comment on attachment 8723127 [details] [diff] [review]
Put r7 in clobber registers for breakpad's sys_clone()

May improve android crash reporting. 
Please uplift to aurora.
Attachment #8723127 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I'm hitting conflicts applying this to aurora. Can we get a rebased patch?
Flags: needinfo?(snorp)
We actually don't have this problem on 46, because the version of breakpad we use there didn't break it yet.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Flags: needinfo?(snorp)
Resolution: --- → FIXED
We were having other busted minidumps before the Breakpad update though, right? bug 1164052 was filed 10 months ago. Did we just swap one set of bustage for another, and then your couple of fixes cleaned that up?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #46)
> We were having other busted minidumps before the Breakpad update though,
> right? bug 1164052 was filed 10 months ago. Did we just swap one set of
> bustage for another, and then your couple of fixes cleaned that up?

It sounds like it, yeah. We still have busted dumps in 46+, but hopefully fewer?
Depends on: 1258269
See Also: → 1360392
See Also: 1360392
You need to log in before you can comment on or make changes to this bug.