Spun off bug 817946 comment 70.
I don't think I have seen a complete, non-corrupt stack in B2G crash reports in the last few weeks. Here are a few examples of how they look right now.

The most common kind of stack is what we are seeing in bp-ba8cd560-c234-4b5b-9cc6-44ede2121212:

 0  @0x40788ffe
 1  libmozglue.so  malloc_mutex_unlock  jemalloc.c:1649
 2  libmozglue.so  arena_malloc         jemalloc.c:4151
 3  libmozglue.so  imalloc              jemalloc.c:4231
 4  libmozglue.so  realloc              jemalloc.c:6551
 5  @0x13adbfa
 6  libmozglue.so  malloc_mutex_unlock  jemalloc.c:1649
 7  libmozglue.so  arena_dalloc         jemalloc.c:4634
 8  @0x2
 9  libmozglue.so  malloc_mutex_unlock  jemalloc.c:1649
10  libmozglue.so  arena_dalloc         jemalloc.c:4634
11  libmozglue.so  arena_malloc         jemalloc.c:4151
12  @0xffffffff

A pile of jemalloc frames as the only thing we can find by scanning is pretty common there. Note that when you look at the other threads, some at least go back to pthread_create, but some end in garbage or have garbage in them.

For bug 819823, I got bp-17322233-a8cb-4b77-acdc-f52502121210 and bp-555cf5d3-372f-4ce3-83ff-2f0ae2121209 with stacks like this:

 0  @0x40d4ed40
 1  libnspr4.so  PR_AtomicDecrement  pratom.c:280
 2  libnspr4.so  pt_PostNotifies     ptsynch.c:125
 3  libnspr4.so  PR_Unlock           ptsynch.c:205
 4  libnspr4.so  PR_ExitMonitor      ptsynch.c:557
 5  @0x40e1d665

Cervantes Yu posted in comment #1 of that bug that, with gdb, he found that this was actually crashing in mozilla::dom::AudioChannelAgent::StopPlaying and didn't have NSPR in the stack of the crashing thread at all.

From what I can tell, this corruption is affecting all crash reports from B2G. The most useful stacks I've seen are something like the one in bp-3f72dc51-6d41-46b4-8d9b-06ac22121212, and even there the stack ends in garbage after 4 frames.
Chris, can you take this? Maybe Gabriele can?
Since I'm also working on generating stacks for the profiler, I can take a look at this one too.
Can we first verify that this works in the "normal" cases where there is no stack corruption, e.g. by intentionally crashing using MOZ_CRASH? If *that* case doesn't work, then we need to look at a particular minidump and figure out where the processing is failing (symbols or memory, or the stackwalk program itself).
I'm fairly certain that when I first tested this code I tested by using kill -ABRT <pid of b2g> and getting stacks out of the resulting dumps. I'll do a fresh build for my panda and sanity check this locally.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> I'm fairly certain that when I first tested this code I tested by using
> kill -ABRT <pid of b2g> and getting stacks out of the resulting dumps.
> I'll do a fresh build for my panda and sanity check this locally.

It would be a hilarious/cruel joke if stacks work on panda and don't on the actual phones...
I did my initial testing on an SGS2, FWIW. I don't have an otoro or unagi, but if one of you do (and have a local build), I can give you some real simple steps to test.
I do have an Unagi. That's what I have been using since November. Let me know how we can verify that.
Okay:

1) Install a locally-built build.
2) Find the pid of the b2g process and kill -ABRT it on the phone.
3) adb pull the minidump file from /data/b2g/whatever/Crash Reports/pending/ before you click anything on the crash UI which might delete it.
4) Run ./build.sh buildsymbols in your B2G dir.
5) Download and build the Breakpad source to get minidump_stackwalk: http://code.google.com/p/google-breakpad/source/checkout
6) Run google-breakpad/src/processor/minidump_stackwalk /path/to/minidump /path/to/B2G/objdir-gecko/dist/crashreporter-symbols

And see if you get usable stack traces out.
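If it helps, here's a rough Python automation of steps 2 and 3 (just a sketch: it assumes adb is on PATH, that the main process shows up as /system/b2g/b2g in ps output, and the "mozilla" profile directory name is a guess -- substitute whatever is actually under /data/b2g/ on your device):

    import subprocess

    def adb(*args):
        return subprocess.check_output(("adb",) + args, text=True)

    # Step 2: find the b2g pid and SIGABRT it so Breakpad writes a minidump.
    for line in adb("shell", "ps").splitlines():
        if line.rstrip().endswith("/system/b2g/b2g"):
            adb("shell", "kill", "-ABRT", line.split()[1])
            break

    # Step 3: pull the pending minidumps before the crash UI can delete them.
    adb("pull", "/data/b2g/mozilla/Crash Reports/pending/", "./dumps/")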
Perhaps more interesting: install a nightly that's supposed to have symbols, run the same steps, and then attach/forward the minidump so that we can run it on the server.
OK, interesting data from https://crash-stats.mozilla.com/report/index/17322233-a8cb-4b77-acdc-f52502121210 (mentioned in comment #1 as diagnosed). The crash address is @0x40d4ed40. In the raw dump:

Module|libxul.so||libxul.so|2220D954C3995C395D6E424665870D0A0|0x40715000|0x40779fff|0

which would put the size of libxul.so at 0x64FFF, but according to the matching .sym file, the actual size of libxul.so should be closer to 0xeedd66 (that's the last line record). In which case this crash is actually at libxul + 0x639d40, which is (probably correctly!):

FUNC 639d28 30 0 mozilla::dom::AudioChannelAgent::StopPlaying

Bazinga! Although I don't know why the mapping table of the minidump is wrong. Maybe something about how we do preloading?
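The arithmetic is easy to sanity-check; a throwaway snippet with the constants copied from the raw dump and .sym file above:

    base       = 0x40715000  # base_of_image of libxul.so from the minidump
    crash_addr = 0x40d4ed40  # crashing instruction pointer
    bad_size   = 0x65000     # size_of_image recorded in the minidump

    offset = crash_addr - base
    print(hex(offset))        # 0x639d40 -> FUNC 639d28: AudioChannelAgent::StopPlaying
    print(offset < bad_size)  # False: the crash lands outside the recorded module range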
Are we prelinking anything yet?
We get the module size info from /proc/self/maps. Conveniently, Breakpad also jams a copy of maps wholesale into the dump file. If you run minidump_dump on that dump you can see what it looks like. That definitely sounds like it could be our smoking gun!
module
  MDRawModule
    base_of_image                = 0x40715000
    size_of_image                = 0x65000
    checksum                     = 0x0
    time_date_stamp              = 0x0
    module_name_rva              = 0x459f8
    version_info.signature       = 0x0
    version_info.struct_version  = 0x0
    version_info.file_version    = 0x0:0x0
    version_info.product_version = 0x0:0x0
    version_info.file_flags_mask = 0x0
    version_info.file_flags      = 0x0
    version_info.file_os         = 0x0
    version_info.file_type       = 0x0
    version_info.file_subtype    = 0x0
    version_info.file_date       = 0x0:0x0
    cv_record.data_size          = 34
    cv_record.rva                = 0x459d0
    misc_record.data_size        = 0
    misc_record.rva              = 0x0
    (code_file)                  = "/system/b2g/libxul.so"
    (code_identifier)            = "id"
    (cv_record).cv_signature     = 0x53445352
    (cv_record).signature        = 2220d954-c399-5c39-5d6e-424665870d0a
    (cv_record).age              = 0
    (cv_record).pdb_file_name    = "libxul.so"
    (misc_record)                = (null)
    (debug_file)                 = "libxul.so"
    (debug_identifier)           = "2220D954C3995C395D6E424665870D0A0"
    (version)                    = ""

And from MD_LINUX_MAPS output:

40715000-4077a000 r-xp 00000000 1f:05 1193    /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0
409a3000-41841000 r-xp 00064000 1f:05 1193    /system/b2g/libxul.so
41841000-419e4000 rw-p 00f02000 1f:05 1193    /system/b2g/libxul.so
419e4000-41a4d000 rw-p 00000000 00:00 0

so the libxul.so mapping is discontinuous and the "module" mapping is only picking up the first of the 3 mappings.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> And see if you get usable stack traces out.

From what I can see, I get a usable stack trace, like this:

Thread 1
 0  libc.so + 0xd6e8
     r4 = 0x403fc458    r5 = 0x403fc454    r6 = 0xfffffd3e    r7 = 0x000000f0
     r8 = 0x403fb084    r9 = 0x403fb23b   r10 = 0x403fb084    fp = 0x403fc228
     sp = 0x100ffe70    lr = 0x400db55c    pc = 0x400d66e8
    Found by: given as instruction pointer in context
 1  libc.so!__pthread_cond_timedwait [pthread.c : 1500 + 0xa]
     sp = 0x100ffe90    pc = 0x400db610
    Found by: stack scanning
 2  gralloc.msm7627a.so!disp_loop [framebuffer.cpp : 143 + 0x7]
     r4 = 0x403fc250    r5 = 0x00000001    r6 = 0x403fc228
     sp = 0x100ffea8    pc = 0x403fa271
    Found by: call frame info
 3  libc.so!__thread_entry [pthread.c : 217 + 0x6]
     r4 = 0x100fff00    r5 = 0x403fa229    r6 = 0x403fc250    r7 = 0x00000078
     r8 = 0x403fa229    r9 = 0x403fc250   r10 = 0x00100000    fp = 0x00000001
     sp = 0x100ffef0    pc = 0x400dbe18
    Found by: call frame info
 4  libc.so!pthread_create [pthread.c : 357 + 0xe]
     r4 = 0x100fff00    r5 = 0x00913ef8    r6 = 0x40102e8c    r7 = 0x00000078
     r8 = 0x403fa229    r9 = 0x403fc250   r10 = 0x00100000    fp = 0x00000001
     sp = 0x100fff00    pc = 0x400db96c
    Found by: call frame info
->me for now
Now looking at https://crash-stats.mozilla.com/report/index/eae8602a-5346-4edd-a614-b49cb2121213 because I found matching binary bits.

40715000-4077a000 r-xp 00000000 b3:13 8561    /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0
409a3000-41843000 r-xp 00064000 b3:13 8561    /system/b2g/libxul.so
41843000-419e6000 rw-p 00f03000 b3:13 8561    /system/b2g/libxul.so

In terms of offsets: part 1 is 0 - 0x65000, then an empty mapping up to 0x28e000, r-x up to 0x41843000, rw- up to 0x12d1000.

The crash is at offset 0x45168c, nsStyleTransformMatrix::TransformFunctionOf.

Section headers according to readelf:

Section Headers:
  [Nr] Name               Type            Addr     Off      Size    ES Flg Lk Inf Al
  [ 0]                    NULL            00000000 000000   000000  00      0   0  0
  [ 1] .dynsym            DYNSYM          00000134 000134   00d880  10   A  2   1  4
  [ 2] .dynstr            STRTAB          0000d9b4 00d9b4   01791d  00   A  0   0  1
  [ 3] .hash              HASH            000252d4 0252d4   00563c  04   A  1   0  4
  [ 4] .gnu.version       VERSYM          0002a910 02a910   001b10  02   A  1   0  2
  [ 5] .gnu.version_d     VERDEF          0002c420 02c420   00001c  00   A  2   1  4
  [ 6] .gnu.version_r     VERNEED         0002c43c 02c43c   000250  00   A  2   4  4
  [ 7] .elfhack.text.v0   PROGBITS        0002c68c 02c68c   000048  00  AX  0   0  4
  [ 8] .elfhack.data.v0   PROGBITS        0002c6d8 02c6d8   033fa8  08   A  0   0  8
  [ 9] .rel.dyn           REL             00060680 060680   001620  08   A  1   0  4
  [10] .rel.plt           REL             00061ca0 061ca0   002910  08   A  1  11  4
  [11] .plt               PROGBITS        0028eaec 064aec   003dac  00  AX  0   0  4
  [12] .text              PROGBITS        002928c0 0688c0   c5a958  00  AX  0   0 64
  [13] .rodata            PROGBITS        00eed220 cc3220   23d4ac  00   A  0   0 16
  [14] .ARM.extab         PROGBITS        0112a6cc f006cc   00003c  00   A  0   0  4
  [15] .ARM.exidx         ARM_EXIDX       0112a708 f00708   000058  08  AL 12   0  4
  [16] .eh_frame          PROGBITS        0112a760 f00760   000034  00   A  0   0  4
  [17] .eh_frame_hdr      PROGBITS        0112a794 f00794   000014  00   A  0   0  4
  [18] .dynamic           DYNAMIC         0112b7b0 f007b0   0001a0  08  WA  2   0  4
  [19] .data              PROGBITS        0112b950 f00950   04b4f0  00  WA  0   0 16
  [20] .data.rel.ro       PROGBITS        01176e40 f4be40   13318c  00  WA  0   0  8
  [21] .init_array        INIT_ARRAY      012a9fcc 107efcc  000294  00  WA  0   0  4
  [22] .data.rel.ro.loca  PROGBITS        012aa260 107f260  01d2b0  00  WA  0   0  8
  [23] .got               PROGBITS        012c7510 109c510  00640c  00  WA  0   0  4
  [24] .bss               NOBITS          012cd920 10a291c  069308  00  WA  0   0 16
  [25] .comment           PROGBITS        00000000 10a291c  000012  01  MS  0   0  1
  [26] .note.gnu.gold-ve  NOTE            00000000 10a2930  000018  00      0   0  4
  [27] .ARM.attributes    ARM_ATTRIBUTES  00000000 10a2948  000032  00      0   0  1
  [28] .shstrtab          STRTAB          00000000 10a297a  000131  00      0   0  1

So the first mapping is sections 1-10, which are mapped at the same offset in the file and in memory. Starting with section 11 (.plt), the image and the in-memory location don't match: the image is basically continuous at 0x64aec while the mapping is at 0x28eaec. Rounding down to the nearest page would give us offset 0x28e000.

In my x86-64 libxul.so, .plt is mapped at its direct location and the discontinuous mappings don't start until you get to unusual sections (.tbss and .ctors).

glandium, do you know whether the unusual mapping is expected (in which case we should fix this in breakpad) or whether this should be fixed to produce continuous sections in the linker?
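To make the rounding explicit (a throwaway check; 0x28eaec is the .plt Addr from the section headers above, and 4 KiB pages are assumed):

    PAGE = 0x1000
    plt_vaddr = 0x28eaec
    print(hex(plt_vaddr & ~(PAGE - 1)))  # 0x28e000: exactly the offset of the
                                         # third mapping (0x409a3000 - 0x40715000)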
Or I could have just looked at the "Section to Segment mapping" output from readelf.

B2G-ARM:

 Section to Segment mapping:
  Segment Sections...
   00
   01     .dynsym .dynstr .hash .gnu.version .gnu.version_d .gnu.version_r .elfhack.text.v0 .elfhack.data.v0 .rel.dyn .rel.plt
   02     .plt .text .rodata .ARM.extab .ARM.exidx .eh_frame .eh_frame_hdr
   03     .dynamic .data .data.rel.ro .init_array .data.rel.ro.local .got .bss
   04     .dynamic
   05     .eh_frame_hdr
   06
   07     .ARM.exidx

x86-64:

 Section to Segment mapping:
  Segment Sections...
   00     .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
   01     .ctors .dtors .jcr .data.rel.ro .dynamic .got .got.plt .data .bss
   02     .dynamic
   03     .tbss
   04     .eh_frame_hdr
   05
(In reply to Benjamin Smedberg [:bsmedberg] from comment #17)
> glandium, do you know whether the unusual mapping is expected (in which case
> we should fix this in breakpad) or whether this should be fixed to produce
> continuous sections in the linker?

This is elfhack at work. Breakpad works fine on Linux because the hole is still mapped from the file, so breakpad is happy. It works on Android because we have something to tell breakpad what the whole range is for libraries. I've been meaning to fix breakpad to handle that itself for a while...

That being said, if we're going to prelink on b2g, elfhack will be disabled, so the problem may go away, although I'm not sure what Android prelinking does to the mappings, since (aiui) it removes relocations.

As for address calculation, it's pretty straightforward: take the address of your crash, subtract the base address of the library (the start address of its first mapping), and you get the virtual address of where you are; then you just have to match that result with addr2line.
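To illustrate the breakpad-side fix idea, here's a rough Python sketch (not breakpad's actual code, just the shape of it): derive a module's full extent from every maps entry that names the same file, instead of stopping at the first mapping.

    import re
    from collections import defaultdict

    # Matches "start-end perms offset dev inode path" lines from /proc/<pid>/maps.
    MAPS_LINE = re.compile(r'^([0-9a-f]+)-([0-9a-f]+) \S+ \S+ \S+ \S+\s+(/\S+)', re.M)

    def module_ranges(maps_text):
        """Return {path: (base, size)} spanning every mapping of each file."""
        spans = defaultdict(list)
        for start, end, path in MAPS_LINE.findall(maps_text):
            spans[path].append((int(start, 16), int(end, 16)))
        result = {}
        for path, v in spans.items():
            base = min(s for s, _ in v)
            result[path] = (base, max(e for _, e in v) - base)
        return result

    maps = """\
    40715000-4077a000 r-xp 00000000 1f:05 1193 /system/b2g/libxul.so
    4077a000-409a3000 r-xp 00000000 00:00 0
    409a3000-41841000 r-xp 00064000 1f:05 1193 /system/b2g/libxul.so
    41841000-419e4000 rw-p 00f02000 1f:05 1193 /system/b2g/libxul.so"""

    base, size = module_ranges(maps)["/system/b2g/libxul.so"]
    print(hex(base), hex(size))  # 0x40715000 0x12cf000 instead of the bogus 0x65000

The anonymous mapping (the elfhack hole) has no pathname and simply gets skipped; the file-backed mappings on either side of it still produce the right overall range.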
From our discussion Friday, I believe that the immediate solution here is that we are going to disable elfhack. Glandium was going to check with cjones to confirm that. ->glandium
That's fine as a temporary workaround, but elfhack saves us ~3MB of disk space which is very hard to leave on the table. glandium mentioned on IRC that he had a hack in mind.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #21)
> glandium mentioned on IRC that he had a hack in mind.

A hack on the elfhack side, but I won't know whether that actually works until at least tomorrow. Until then, we can probably disable elfhack, but sadly, since we're using gonk-misc/default-gecko-config instead of in-tree mozconfigs, that's inconvenient to do. If you want it disabled before tomorrow, you'll have to find someone to figure out how best to do it on the b2g18 branch, because I'm going to be offline very soon now.

That being said, we have had similar issues with elfhack in the past on Linux and Android (surprise), and we were able to reprocess broken crash reports in bug 637680. The Android minidump fixer program from there should work for b2g, provided the breakpad APIs it uses haven't changed in the meantime.
Due to bug 822432, this is no longer a blocker; we have a server-side workaround.
(In reply to Robert Kaiser (:firstname.lastname@example.org) from comment #24)
> After bug 822584 landed, is bug 822432 still needed?

Depends on whether there are still incoming crashes from builds prior to the landing (but maybe we don't care about those).
If that got landed on the b2g18 branch then it probably covers most of the things we care about. Not sure how hard it'd be to figure out if we're still getting broken crashes.
I'd think we don't care about B2G crashes from builds before the original 1/15 "code freeze", and bug 822584 landed on 1/14 on b2g18 from what I see, so if that means we don't need the reprocessing any more, we can tell the Socorro folks that it can be shut off for now. BTW, what is this bug still around for?
(In reply to Robert Kaiser (:email@example.com) from comment #27)
> BTW, what is this bug still around for?

The actual breakpad issue.