Closed Bug 1247399 Opened 9 years ago Closed 9 years ago

No crash report on Android test crash: ERROR: Minidump ... has no thread list

Categories

(Testing :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(firefox46 affected, firefox47 fixed)

RESOLVED FIXED
mozilla47
Tracking Status
firefox46 --- affected
firefox47 --- fixed

People

(Reporter: gbrown, Assigned: snorp)

References

Details

Attachments

(5 files)

http://archive.mozilla.org/pub/mobile/tinderbox-builds/mozilla-central-android-api-15/1455116964/mozilla-central_ubuntu64_vm_armv7_mobile_test-robocop-2-bm67-tests1-linux64-build28.txt.gz 09:37:52 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/aYx8UzeMREmv43jdMrZKqw/artifacts/public/build/fennec-47.0a1.en-US.android-arm.crashreporter-symbols.zip 09:37:57 INFO - mozcrash Copy/paste: /builds/slave/test/build/linux64-minidump_stackwalk /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp /tmp/tmp2jt3Po 09:37:58 INFO - mozcrash Saved minidump as /builds/slave/test/build/blobber_upload_dir/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp 09:37:58 INFO - mozcrash Saved app info as /builds/slave/test/build/blobber_upload_dir/7da8168e-8ec2-1b82-0270fc93-352c3776.extra 09:37:58 WARNING - PROCESS-CRASH | testMailToContextMenu | application crashed [None] 09:37:58 INFO - Crash dump filename: /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp 09:37:58 INFO - stderr from minidump_stackwalk: 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4197: INFO: Minidump opened minidump /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4314: INFO: Minidump not byte-swapping minidump 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 15 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 7 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 7 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 1197932545 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 6 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 1197932546 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 4 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 5 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4765: INFO: GetStream: type 3 not present 09:37:58 INFO - 2016-02-10 09:37:58: minidump_processor.cc:140: ERROR: Minidump /tmp/tmpZHq991/7da8168e-8ec2-1b82-0270fc93-352c3776.dmp has no thread list 09:37:58 INFO - 2016-02-10 09:37:58: stackwalk.cc:131: ERROR: MinidumpProcessor::Process failed 09:37:58 INFO - 2016-02-10 09:37:58: minidump.cc:4169: INFO: Minidump closing minidump 09:37:58 INFO - 0 ERROR runApp() exited with code 1 09:38:00 INFO - 02-10 09:37:32.058 W/google-breakpad( 1648): ExceptionHandler::GenerateDump waitpid failed: 09:38:00 INFO - 02-10 09:37:32.058 W/google-breakpad( 1648): No child processes 09:38:00 INFO - 02-10 09:37:32.068 W/google-breakpad( 1648): 09:38:00 INFO - 02-10 09:37:32.089 W/google-breakpad( 1724): ExceptionHandler::WaitForContinueSignal sys_read failed: 09:38:00 INFO - 02-10 09:37:32.089 W/google-breakpad( 1724): Bad file number 09:38:00 INFO - 02-10 09:37:32.108 W/google-breakpad( 1724): 09:38:00 INFO - 02-10 09:37:32.148 F/libc ( 1648): Fatal signal 11 (SIGSEGV) at 0xe5e5e5ed (code=1), thread 1648 (.mozilla.fennec) 09:38:00 INFO - 02-10 09:37:32.308 I/DEBUG ( 35): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** 09:38:00 INFO - 02-10 09:37:32.308 I/DEBUG ( 35): Build fingerprint: 'generic/sdk/generic:4.3.1/JLS36I/eng.gbrown.20150308.182649:eng/test-keys' 09:38:00 INFO - 02-10 09:37:32.308 I/DEBUG ( 35): Revision: '0' 09:38:00 INFO - 02-10 09:37:32.318 I/DEBUG ( 35): pid: 1648, tid: 1648, name: UNKNOWN >>> org.mozilla.fennec <<< 09:38:00 INFO - 02-10 09:37:32.318 I/DEBUG ( 35): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr e5e5e5ed 09:38:00 INFO - 02-10 09:37:33.488 I/DEBUG ( 35): r0 bee4a5b8 r1 0000012d r2 0000050f r3 3b9aca00 09:38:00 INFO - 02-10 09:37:33.518 I/DEBUG ( 35): r4 e5e5e5e5 r5 52b7e140 r6 9e657dc3 r7 0000012d 09:38:00 INFO - 02-10 09:37:33.518 I/DEBUG ( 35): r8 bee4a5e0 r9 4c466df0 sl 2a00d0a0 fp bee4a5f4 09:38:00 INFO - 02-10 09:37:33.518 I/DEBUG ( 35): ip 4c466df8 sp bee4a5b0 lr 527957ff pc 5ab0e626 cpsr 40000030 09:38:00 INFO - 02-10 09:37:33.528 I/DEBUG ( 35): d0 0000000080000000 d1 0000000042480000 09:38:00 INFO - 02-10 09:37:33.528 I/DEBUG ( 35): d2 3f80000000000000 d3 000000003f800000 09:38:00 INFO - 02-10 09:37:33.528 I/DEBUG ( 35): d4 3f80000000000000 d5 43b4800000000000 09:38:00 INFO - 02-10 09:37:33.528 I/DEBUG ( 35): d6 0000000080000000 d7 0000000080000000 09:38:00 INFO - 02-10 09:37:33.539 I/DEBUG ( 35): d8 408f400000000000 d9 000000003f000000 09:38:00 INFO - 02-10 09:37:33.539 I/DEBUG ( 35): d10 0000000000000000 d11 0000000000000000 09:38:00 INFO - 02-10 09:37:33.548 I/DEBUG ( 35): d12 0000000000000000 d13 0000000000000000 09:38:00 INFO - 02-10 09:37:33.548 I/DEBUG ( 35): d14 0000000000000000 d15 0000000000000000 09:38:00 INFO - 02-10 09:37:33.548 I/DEBUG ( 35): scr 60000013 09:38:00 INFO - 02-10 09:37:33.599 I/DEBUG ( 35): 09:38:00 INFO - 02-10 09:37:33.599 I/DEBUG ( 35): backtrace: 09:38:00 INFO - 02-10 09:37:33.619 I/DEBUG ( 35): #00 pc 00f95626 /dev/ashmem/libxul.so (deleted) 09:38:00 INFO - 02-10 09:37:33.619 I/DEBUG ( 35): #01 pc 0001dc4c /system/lib/libdvm.so (dvmPlatformInvoke+112) 09:38:00 INFO - 02-10 09:37:33.619 I/DEBUG ( 35): #02 pc 0004dcab /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+394) 09:38:00 INFO - 02-10 09:37:33.619 I/DEBUG ( 35): #03 pc 000385e1 /system/lib/libdvm.so (dvmCheckCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+8) 09:38:00 INFO - 02-10 09:37:33.629 I/DEBUG ( 35): #04 pc 0004f699 /system/lib/libdvm.so (dvmResolveNativeMethod(unsigned int const*, JValue*, Method const*, Thread*)+184) 09:38:00 INFO - 02-10 09:37:33.629 I/DEBUG ( 35): #05 pc 00027060 /system/lib/libdvm.so 09:38:00 INFO - 02-10 09:37:33.629 I/DEBUG ( 35): #06 pc 0002b580 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+184) 09:38:00 INFO - 02-10 09:37:33.639 I/DEBUG ( 35): #07 pc 0005ff7b /system/lib/libdvm.so (dvmInvokeMethod(Object*, Method const*, ArrayObject*, ArrayObject*, ClassObject*, bool)+350) 09:38:00 INFO - 02-10 09:37:33.639 I/DEBUG ( 35): #08 pc 00067a9f /system/lib/libdvm.so 09:38:00 INFO - 02-10 09:37:33.639 I/DEBUG ( 35): #09 pc 00027060 /system/lib/libdvm.so 09:38:00 INFO - 02-10 09:37:33.649 I/DEBUG ( 35): #10 pc 0002b580 /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+184) 09:38:00 INFO - 02-10 09:37:33.649 I/DEBUG ( 35): #11 pc 0005fcbd /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+292) 09:38:00 INFO - 02-10 09:37:33.649 I/DEBUG ( 35): #12 pc 000499ab /system/lib/libdvm.so 09:38:00 INFO - 02-10 09:37:33.658 I/DEBUG ( 35): #13 pc 0003cb3d /system/lib/libdvm.so 09:38:00 INFO - 02-10 09:37:33.658 I/DEBUG ( 35): #14 pc 0004b68f /system/lib/libandroid_runtime.so 09:38:00 INFO - 02-10 09:37:33.658 I/DEBUG ( 35): #15 pc 0004c30f /system/lib/libandroid_runtime.so (android::AndroidRuntime::start(char const*, char const*)+378) 09:38:00 INFO - 02-10 09:37:33.669 I/DEBUG ( 35): #16 pc 0000105b /system/bin/app_process 09:38:00 INFO - 02-10 09:37:33.669 I/DEBUG ( 35): #17 pc 0000db4f /system/lib/libc.so (__libc_init+50) 09:38:00 INFO - 02-10 09:37:33.669 I/DEBUG ( 35): #18 pc 00000d7c /system/bin/app_process
As discussed earlier on irc: <ted> gbrown: so the way the exception handler works on linux is that it calls sys_clone, which is like halfway between a fork and a thread <ted> https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#525 <ted> then that new thread does the minidump writing of its parent <ted> so after sys_clone, the new thread waits for a signal from the parent by reading one end of a pipe: <ted> https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#572 <ted> that "ExceptionHandler::WaitForContinueSignal sys_read failed:" indicates that that went wrong somehow <ted> then the parent process, after sending that signal on the pipe, calls waitpid to wait on the child: https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#538 <ted> so it makes sense that that would fail if the child thread failed because it would exit early <ted> but *why* that's broken i don't know <ted> gbrown: file a bug to track this? <gbrown> ted: sure. can do ... after lunch <ted> gbrown: the ordering of those log messages is...odd <ted> although maybe that's just buffering or something?
Oh, that's interesting! It wrote a dump file? I wonder if this is a re-entrant crash, Breakpad crashing trying to write a dump?
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Status: RESOLVED → REOPENED
Keywords: leave-open
OS: Unspecified → Android
Hardware: Unspecified → ARM
Resolution: FIXED → ---
Target Milestone: mozilla47 → ---
The logging appears in a failed report here[0] 02-21 18:50:16.119 8265 8675 W google-breakpad: ExceptionHandler::GenerateDump cloned child 02-21 18:50:16.120 8265 8675 W google-breakpad: 18446744073179454268(�L� 02-21 18:50:16.120 8265 8675 W google-breakpad: 02-21 18:50:16.120 8265 8675 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child 02-21 18:50:16.120 10823 8675 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal... 02-21 18:50:16.120 10823 8675 W google-breakpad: ExceptionHandler::WaitForContinueSignal sys_read failed: 02-21 18:50:16.120 8265 8675 W google-breakpad: ExceptionHandler::GenerateDump waitpid failed: 02-21 18:50:16.120 10823 8675 W google-breakpad: Bad file descriptor 02-21 18:50:16.120 8265 8675 W google-breakpad: No child processes It looks like we fail to print the child PID here somehow (garbage in the string -- should've cleared to 0 first, whoops). I also don't see the "I'm the child" message anywhere. The child clearly does run, however. I think the only thing that can explain a bad fd in the child is if waitpid() returns immediately (with ECHILD, apparently) and we close the fds. Why would it return ECHILD if it obviously has children? The man page[1] says this can happen if the SIGCHLD is set to SIG_IGN, but even then it should wait until the child process has exited. Kernel bug seems unlikely, but wtf? Then I was thinking, what does the log look like with a successful dump? Nearly identical, as seen here[2] 02-21 01:32:21.254 7910 8056 W google-breakpad: ExceptionHandler::GenerateDump cloned child 02-21 01:32:21.254 7910 8056 W google-breakpad: 1983892532�C 02-21 01:32:21.254 7910 8056 W google-breakpad: 02-21 01:32:21.254 7910 8056 W google-breakpad: ExceptionHandler::SendContinueSignalToChild sent continue signal to child 02-21 01:32:21.254 7910 8056 W google-breakpad: ExceptionHandler::GenerateDump waitpid failed: 02-21 01:32:21.254 7910 8056 W google-breakpad: No child processes 02-21 01:32:21.254 7910 8056 W google-breakpad: 02-21 01:32:21.254 17286 17286 W google-breakpad: ExceptionHandler::WaitForContinueSignal waiting for continue signal... 02-21 01:32:21.254 17286 17286 W google-breakpad: ExceptionHandler::WaitForContinueSignal sys_read failed: 02-21 01:32:21.254 17286 17286 W google-breakpad: Bad file number 02-21 01:32:21.254 17286 17286 W google-breakpad: It seems likely that the continuation signaling is just completely busted, and the child is racing to create the dump before the parent has enabled tracing. The child continues with the dump even if WaitForContinueSignal() fails.I still have no idea why we would get ECHILD from waitpid() unless it's the SIG_IGN thing or some kind of kernel bug, but somehow it is working out most of the time. [0] https://crash-stats.mozilla.com/report/index/210ae0e8-eba6-44d4-aa04-ad3522160221 [1] http://linux.die.net/man/2/waitpid [2] https://crash-stats.mozilla.com/report/index/b2029df2-2c15-490d-91fe-d5f042160221
Sleeping for 1s before the waitpid() avoids the bad fd issues in the child. So the real problem is definitely waitpid()'s immediate return.
We don't do anything with SIGCHLD or use SA_NOCLDWAIT, AFAICT, so I don't think that's a factor.
I'm not sure what the difference is between the success and failure cases at this point. If the child successfully reads the continuation fd then ptracing has been enabled, but that's also the case if the parent has closed the fd, so it shouldn't make a difference in the result. AFAICT, waitpid() fails with ECHILD every time no matter what. I guess when the parent process exits, it will cause the child to be killed too, but I would think we should lose that race more often than we appear to be.
Ah, I think I just figured it out. The sys_clone() return value is just total garbage, so we pass the wrong value to waitpid().
Using clone() from libc works fine, so the implementation in breakpad is busted.
glandium, do you see anything in the breakpad sys_clone()[0] that would screw up the return val? [0] https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h#2367
Flags: needinfo?(mh+mozilla)
The disassembled breakpad sys_clone() as it appears in Gecko: df0: b480 push {r7} df2: f1bc 0f00 cmp.w ip, #0 df6: bf18 it ne df8: 2900 cmpne r1, #0 dfa: bf08 it eq dfc: f06f 0715 mvneq.w r7, #21 e00: d00e beq.n e20 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10c> e02: f841 7d04 str.w r7, [r1, #-4]! e06: f841 cd04 str.w ip, [r1, #-4]! e0a: f04f 0778 mov.w r7, #120 ; 0x78 e0e: df00 svc 0 e10: 0007 movs r7, r0 e12: d105 bne.n e20 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10c> e14: 9801 ldr r0, [sp, #4] e16: 9f00 ldr r7, [sp, #0] e18: 47b8 blx r7 e1a: f04f 0701 mov.w r7, #1 e1e: df00 svc 0 e20: bc80 pop {r7}
Attached file clone-test.tar.bz2
This is a small test that uses sys_clone() from linux_syscall_support.h. This seems to work, indicating we have some problem with the Gecko build.
I suppose it could also be something special about calling sys_clone() from within a signal handler....
It seems to work in Gecko outside of the signal handler. I basically pasted the main.c from the test into AndroidBridge and it works fine. Ugh....
What does the disassembly of the clone function from libc look like?
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #19) > What does the disassembly of the clone function from libc look like? 000174b4 <__bionic_clone>: 174b4: e1a0c00d mov ip, sp 174b8: e92d00f0 push {r4, r5, r6, r7} 174bc: e89c0070 ldm ip, {r4, r5, r6} 174c0: e9210060 stmdb r1!, {r5, r6} 174c4: e3a07078 mov r7, #120 ; 0x78 174c8: ef000000 svc 0x00000000 174cc: e1b00000 movs r0, r0 174d0: 0a000004 beq 174e8 <__bionic_clone+0x34> 174d4: e8bd00f0 pop {r4, r5, r6, r7} 174d8: e3700a01 cmn r0, #4096 ; 0x1000 174dc: 912fff1e bxls lr 174e0: e2600000 rsb r0, r0, #0 174e4: ea012c40 b 625ec <__sync_lock_release_1+0xd0> 174e8: e3a0e000 mov lr, #0 174ec: e8bd0003 pop {r0, r1} 174f0: ea012c41 b 625fc <__sync_lock_release_1+0xe0>
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #15) > e02: f841 7d04 str.w r7, [r1, #-4]! > e06: f841 cd04 str.w ip, [r1, #-4]! So, those correspond to: /* Push "arg" and "fn" onto the stack that will be * used by the child. */ "str %5,[%3,#-4]!\n" "str %2,[%3,#-4]!\n" from https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h#2390-2394 If the pasted disassembly is really all of sys_clone, I don't see how "arg" and "fn" are supposed to end up in r7 and ip...
This patch fixes it for me. I think the stack arithmetic must be wrong when copying r0 into the return value. diff --git a/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h b/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h index 8a42c1c..cb91db5 100644 --- a/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h +++ b/toolkit/crashreporter/google-breakpad/src/third_party/lss/linux_syscall_support.h @@ -2367,7 +2367,7 @@ struct kernel_statfs { LSS_INLINE int LSS_NAME(clone)(int (*fn)(void *), void *child_stack, int flags, void *arg, int *parent_tidptr, void *newtls, int *child_tidptr) { - long __res; + register long __res __asm__("r10") = 0; { register int __flags __asm__("r0") = flags; register void *__stack __asm__("r1") = child_stack;
(In reply to Mike Hommey [:glandium] from comment #21) > (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #15) > > e02: f841 7d04 str.w r7, [r1, #-4]! > > e06: f841 cd04 str.w ip, [r1, #-4]! > > So, those correspond to: > > /* Push "arg" and "fn" onto the stack that will be > * used by the child. > */ > "str %5,[%3,#-4]!\n" > "str %2,[%3,#-4]!\n" > > from > https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google- > breakpad/src/third_party/lss/linux_syscall_support.h#2390-2394 > > If the pasted disassembly is really all of sys_clone, I don't see how "arg" > and "fn" are supposed to end up in r7 and ip... Hmm, yeah. Maybe I missed some of the setup code? I'll look again.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #22) > This patch fixes it for me. I think the stack arithmetic must be wrong when > copying r0 into the return value. By that I mean __res.
It looks like the bionic one just leaves the return value in r0, which I guess is fine because it's a function call and not inlined like the lss one. I would guess r0 is the standard location for function return values? The part that looks fishy to me in the lss version is we copy r0 to sp + 4, but earlier we 'push {r7}' which I assume increases the sp. As a result, sp+4 can't possibly be right? Putting __res in r10 then works around that issue. I'm just flailing around and generally have no idea what I'm doing, though, so this is probably wrong.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #25) > The part that looks fishy to me in the lss version is we copy r0 to sp + 4, > but earlier we 'push {r7}' which I assume increases the sp. As a result, > sp+4 can't possibly be right? I guess since stacks grow down, it would *decrease* sp. At any rate, the location of __res is somehow wrong.
You mean the code at e14? that's preparing another function call in the child process, and that's a different stack.
Ah, yeah, that's what I was looking at. Where does it copy into __res? I have some more disassembly from the area below: de4: 4641 mov r1, r8 de6: 4613 mov r3, r2 de8: 4614 mov r4, r2 dea: 44fa add sl, pc dec: f10d 0c24 add.w ip, sp, #36 ; 0x24 df0: b480 push {r7} df2: f1ba 0f00 cmp.w sl, #0 df6: bf18 it ne df8: 2900 cmpne r1, #0 dfa: bf08 it eq dfc: f06f 0a15 mvneq.w sl, #21 e00: d00f beq.n e22 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10e> e02: f841 cd04 str.w ip, [r1, #-4]! e06: f841 ad04 str.w sl, [r1, #-4]! e0a: f04f 0778 mov.w r7, #120 ; 0x78 e0e: df00 svc 0 e10: ea5f 0a00 movs.w sl, r0 e14: d105 bne.n e22 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x10e> e16: 9801 ldr r0, [sp, #4] e18: 9f00 ldr r7, [sp, #0] e1a: 47b8 blx r7 e1c: f04f 0701 mov.w r7, #1 e20: df00 svc 0 e22: bc80 pop {r7} e24: f51a 5f80 cmn.w sl, #4096 ; 0x1000 e28: 4654 mov r4, sl e2a: d905 bls.n e38 <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x124> e2c: f7ff fffe bl 0 <__errno> e30: f04f 3aff mov.w sl, #4294967295 ; 0xffffffff e34: 4264 negs r4, r4 e36: 6004 str r4, [r0, #0] e38: f1ba 3fff cmp.w sl, #4294967295 ; 0xffffffff e3c: d107 bne.n e4e <_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x13a> e3e: 6e30 ldr r0, [r6, #96] ; 0x60 e40: 2400 movs r4, #0
Whoops that's from my modified version. Just a sec.
I guess r10 is an alias for sl so my change seems pretty bad....
So here is the actual problem: gcc assigned r7 to _res. So, the movs r7, r0 that follows the svc is the assignment to _res. And when r0 != 0, we jump to the pop {r7}, which restores the value of r7 that was pushed at the beginning of the function, and after that, we're likely re-copying r7 to r0 to make that the result (that doesn't appear in your disassembly, btw, but has to be there somewhere). So yes, it makes sense that assigning a register for _res would fix it... OTOH, the whole thing with push/pop r7 is fishy. It would be interesting to see why they added that upstream (as in, what they were trying to fix), because it looks like their fix is fishy. It seems to me that removing the push/pop r7 and readding r7 to clobbers would fix this issue too, but without knowing what those were supposed to fix, it's hard to tell if it would do more good than harm.
Note that I wrote comment 32 before comment 28 and following.
(In reply to Mike Hommey [:glandium] from comment #32) I guess maybe they wanted to preserve the frame pointer? It does seem like the easiest fix is to use r8 or something for __res. Or force it onto the stack somehow.
The change to push/pop r7 is here: https://bugs.chromium.org/p/linux-syscall-support/issues/detail?id=2 It seems they are in fact doing that to preserve the fp. Mike what do you think is the best fix here? Assign __res to r8?
So, they are, in fact, trying to fix the same thing that we had worked around by building with -fomit-frame-pointer per bug 633436. And I removed that thing in bug 1245055 because it looks like the compiler bug that made it fail to build is gone... So my take on this is that we could go with removing their fix for that now non-existing bug.
Yup, this works fine now. We should try to upstream this too.
Attachment #8723127 - Flags: review?(mh+mozilla)
Comment on attachment 8723127 [details] [diff] [review] Put r7 in clobber registers for breakpad's sys_clone() Review of attachment 8723127 [details] [diff] [review]: ----------------------------------------------------------------- You could title this "Revert upstream fix for their equivalent of bug 633436 because it does more harm than good" :)
Attachment #8723127 - Flags: review?(mh+mozilla) → review+
Assignee: nobody → snorp
Status: REOPENED → ASSIGNED
See Also: → 1245570
Comment on attachment 8723127 [details] [diff] [review] Put r7 in clobber registers for breakpad's sys_clone() Approval Request Comment [Feature/regressing bug #]: N/A [User impact if declined]: Possibly broken crash reports [Describe test coverage new/current, TreeHerder]: nightly [Risks and why]: Low [String/UUID change made/needed]: None
Attachment #8723127 - Flags: approval-mozilla-aurora?
Comment on attachment 8723127 [details] [diff] [review] Put r7 in clobber registers for breakpad's sys_clone() May improve android crash reporting. Please uplift to aurora.
Attachment #8723127 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I'm hitting conflicts applying this to aurora. Can we get a rebased patch?
Flags: needinfo?(snorp)
We actually don't have this problem on 46, because the version of breakpad we use there didn't break it yet.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago9 years ago
Flags: needinfo?(snorp)
Resolution: --- → FIXED
We were having other busted minidumps before the Breakpad update though, right? bug 1164052 was filed 10 months ago. Did we just swap one set of bustage for another, and then your couple of fixes cleaned that up?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #46) > We were having other busted minidumps before the Breakpad update though, > right? bug 1164052 was filed 10 months ago. Did we just swap one set of > bustage for another, and then your couple of fixes cleaned that up? It sounds like it, yeah. We still have busted dumps in 46+, but hopefully fewer?
Depends on: 1258269
See Also: → 1360392
See Also: 1360392
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: