Closed Bug 821353 Opened 12 years ago Closed 11 years ago

Breakpad can't deal with elfhacked binaries on ARM (B2G crash stacks are broken/corrupted)

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)

Tracking

(blocking-basecamp:-)

RESOLVED DUPLICATE of bug 689178
B2G C3 (12dec-1jan)
blocking-basecamp -

People

(Reporter: davidb, Assigned: glandium)

References

Details

Severity: normal → major
Priority: -- → P1
blocking-basecamp: --- → ?
Summary: B2G crash stacks are borken/corrupted → B2G crash stacks are broken/corrupted
I don't think I have seen a complete, non-corrupt stack in B2G crash reports in the last few weeks.

Here's a few examples for how they look right now:

The most common kind of stack is what we are seeing in bp-ba8cd560-c234-4b5b-9cc6-44ede2121212:

0 		@0x40788ffe 	
1 	libmozglue.so 	malloc_mutex_unlock 	jemalloc.c:1649
2 	libmozglue.so 	arena_malloc 	jemalloc.c:4151
3 	libmozglue.so 	imalloc 	jemalloc.c:4231
4 	libmozglue.so 	realloc 	jemalloc.c:6551
5 		@0x13adbfa 	
6 	libmozglue.so 	malloc_mutex_unlock 	jemalloc.c:1649
7 	libmozglue.so 	arena_dalloc 	jemalloc.c:4634
8 		@0x2 	
9 	libmozglue.so 	malloc_mutex_unlock 	jemalloc.c:1649
10 	libmozglue.so 	arena_dalloc 	jemalloc.c:4634
11 	libmozglue.so 	arena_malloc 	jemalloc.c:4151
12 		@0xffffffff 	


Seeing a lot of jemalloc frames as the only thing we can find by stack scanning is pretty common there. Note that when you look at the other threads, some at least go back to pthread_create, but some end in garbage or have garbage frames in them.


For bug 819823, I did get bp-17322233-a8cb-4b77-acdc-f52502121210 and bp-555cf5d3-372f-4ce3-83ff-2f0ae2121209 with stacks like this:

0 		@0x40d4ed40 	
1 	libnspr4.so 	PR_AtomicDecrement 	pratom.c:280
2 	libnspr4.so 	pt_PostNotifies 	ptsynch.c:125
3 	libnspr4.so 	PR_Unlock 	ptsynch.c:205
4 	libnspr4.so 	PR_ExitMonitor 	ptsynch.c:557
5 		@0x40e1d665

Cervantes You posted in comment #1 of that bug that, with gdb, he found that this was actually crashing in mozilla::dom::AudioChannelAgent::StopPlaying and didn't have NSPR in the stack of the crashing thread at all.


From what I can tell, this corruption is affecting all crash reports from B2G. The most useful stacks I've seen are something like bp-3f72dc51-6d41-46b4-8d9b-06ac22121212, and even there the stack ends in garbage after 4 frames.
Summary: B2G crash stacks are broken/corrupted → B2G crash stacks are borken/corrupted
Chris, can you take this?  Maybe Gabriele can?
Assignee: nobody → jones.chris.g
blocking-basecamp: ? → +
Target Milestone: --- → B2G C3 (12dec-1jan)
Since I'm also working on generating stacks for the profiler I can take a look at this one too.
Can we first verify that this works in the "normal" case where there is no stack corruption, e.g. by intentionally crashing using MOZ_CRASH? If *that* case doesn't work, then we need to look at a particular minidump and figure out where the processing is failing (symbols, memory, or the stackwalk program itself).
Assignee: jones.chris.g → gsvelto
I'm fairly certain that when I first tested this code I tested by using kill -ABRT <pid of b2g> and getting stacks out of the resulting dumps. I'll do a fresh build for my panda and sanity check this locally.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> I'm fairly certain that when I first tested this code I tested by using kill
> -ABRT <pid of b2g> and getting stacks out of the resulting dumps. I'll do a
> fresh build for my panda and sanity check this locally.

It would be a hilarious/cruel joke if stacks work on panda and don't on the actual phones...
Summary: B2G crash stacks are borken/corrupted → B2G crash stacks are broken/corrupted
I did my initial testing on an SGS2, FWIW. I don't have an otoro or unagi, but if one of you does (and has a local build), I can give you some really simple steps to test.
I do have an Unagi. That's what I have been using since November. Let me know how we can verify that.
Okay:
1) Install a locally-built build
2) Find the pid of the b2g process and kill -ABRT it on the phone
3) adb pull the minidump file from /data/b2g/whatever/Crash Reports/pending/ before you click anything on the crash UI which might delete it.
4) Run ./build.sh buildsymbols in your B2G dir
5) Download and build the Breakpad source to get minidump_stackwalk: http://code.google.com/p/google-breakpad/source/checkout
6) Run google-breakpad/src/processor/minidump_stackwalk /path/to/minidump /path/to/B2G/objdir-gecko/dist/crashreporter-symbols

And see if you get usable stack traces out.
Perhaps more interesting is to install a nightly that's supposed to have symbols and run the same steps and then attach/forward the minidump so that we can run it on the server.
ok, interesting data from https://crash-stats.mozilla.com/report/index/17322233-a8cb-4b77-acdc-f52502121210 (mentioned in comment #1 as diagnosed):

The crash address is @0x40d4ed40
In the raw dump:
Module|libxul.so||libxul.so|2220D954C3995C395D6E424665870D0A0|0x40715000|0x40779fff|0

which would put the size of libxul.so at 0x65000 (last offset 0x64FFF)

but according to the matching .sym file, the actual size of libxul.so should be closer to 0xeedd66 (that's the last line record). In that case this crash is actually libxul + 0x639d40, which is (probably correctly!) FUNC 639d28 30 0 mozilla::dom::AudioChannelAgent::StopPlaying

Bazinga! I don't know why the module mapping table of the minidump is wrong, though. Maybe something about how we do preloading?
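
Spelling that lookup out as a quick Python sketch (the .sym file path here is hypothetical; the FUNC record format — "FUNC <address> <size> <param_size> <name>", hex fields — is what the Breakpad symbol dumper emits, as seen above):

# Minimal sketch of the by-hand calculation above; sym_path is hypothetical.
def find_func(sym_path, crash_addr, module_base):
    offset = crash_addr - module_base  # 0x40d4ed40 - 0x40715000 = 0x639d40
    with open(sym_path) as f:
        for line in f:
            if not line.startswith("FUNC "):
                continue
            # FUNC <address> <size> <parameter_size> <name>
            _, addr, size, _params, name = line.split(" ", 4)
            start, length = int(addr, 16), int(size, 16)
            if start <= offset < start + length:
                return offset, name.strip()
    return offset, None

offset, name = find_func("libxul.so.sym", 0x40d4ed40, 0x40715000)
print(hex(offset), name)
# 0x639d40 mozilla::dom::AudioChannelAgent::StopPlaying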
Are we prelinking anything yet?
We get the module size info from /proc/self/maps. Conveniently, Breakpad also jams a copy of maps wholesale into the dump file. If you run minidump_dump on that dump you can see what it looks like. That definitely sounds like it could be our smoking gun!
module[44]
MDRawModule
  base_of_image                   = 0x40715000
  size_of_image                   = 0x65000
  checksum                        = 0x0
  time_date_stamp                 = 0x0
  module_name_rva                 = 0x459f8
  version_info.signature          = 0x0
  version_info.struct_version     = 0x0
  version_info.file_version       = 0x0:0x0
  version_info.product_version    = 0x0:0x0
  version_info.file_flags_mask    = 0x0
  version_info.file_flags         = 0x0
  version_info.file_os            = 0x0
  version_info.file_type          = 0x0
  version_info.file_subtype       = 0x0
  version_info.file_date          = 0x0:0x0
  cv_record.data_size             = 34
  cv_record.rva                   = 0x459d0
  misc_record.data_size           = 0
  misc_record.rva                 = 0x0
  (code_file)                     = "/system/b2g/libxul.so"
  (code_identifier)               = "id"
  (cv_record).cv_signature        = 0x53445352
  (cv_record).signature           = 2220d954-c399-5c39-5d6e-424665870d0a
  (cv_record).age                 = 0
  (cv_record).pdb_file_name       = "libxul.so"
  (misc_record)                   = (null)
  (debug_file)                    = "libxul.so"
  (debug_identifier)              = "2220D954C3995C395D6E424665870D0A0"
  (version)                       = ""

And from MD_LINUX_MAPS output:

40715000-4077a000 r-xp 00000000 1f:05 1193       /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0 
409a3000-41841000 r-xp 00064000 1f:05 1193       /system/b2g/libxul.so
41841000-419e4000 rw-p 00f02000 1f:05 1193       /system/b2g/libxul.so
419e4000-41a4d000 rw-p 00000000 00:00 0 

so the libxul.so mapping is discontinuous and the "module" mapping is only picking up the first of the 3 mappings.
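
For illustration, here's a small Python sketch (input inlined from the MD_LINUX_MAPS lines above) that folds all file-backed mappings of a module together, which is roughly what a fix would have to do instead of taking only the first mapping:

import re

MAPS = """\
40715000-4077a000 r-xp 00000000 1f:05 1193       /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0
409a3000-41841000 r-xp 00064000 1f:05 1193       /system/b2g/libxul.so
41841000-419e4000 rw-p 00f02000 1f:05 1193       /system/b2g/libxul.so
419e4000-41a4d000 rw-p 00000000 00:00 0
"""

def module_extents(maps_text):
    extents = {}  # path -> (lowest start, highest end)
    for line in maps_text.splitlines():
        m = re.match(r"([0-9a-f]+)-([0-9a-f]+) \S+ \S+ \S+ \S+\s+(/\S+)", line)
        if not m:
            continue  # anonymous mappings (the elfhack hole) carry no path
        start, end, path = int(m.group(1), 16), int(m.group(2), 16), m.group(3)
        lo, hi = extents.get(path, (start, end))
        extents[path] = (min(lo, start), max(hi, end))
    return extents

for path, (lo, hi) in module_extents(MAPS).items():
    print("%s base=%#x size=%#x" % (path, lo, hi - lo))
# /system/b2g/libxul.so base=0x40715000 size=0x12cf000 (vs. 0x65000 in the dump)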
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)

> 
> And see if you get usable stack traces out.


From what I can see, I get a usable stack trace.

Like this:
Thread 1
 0  libc.so + 0xd6e8
     r4 = 0x403fc458    r5 = 0x403fc454    r6 = 0xfffffd3e    r7 = 0x000000f0
     r8 = 0x403fb084    r9 = 0x403fb23b   r10 = 0x403fb084    fp = 0x403fc228
     sp = 0x100ffe70    lr = 0x400db55c    pc = 0x400d66e8
    Found by: given as instruction pointer in context
 1  libc.so!__pthread_cond_timedwait [pthread.c : 1500 + 0xa]
     sp = 0x100ffe90    pc = 0x400db610
    Found by: stack scanning
 2  gralloc.msm7627a.so!disp_loop [framebuffer.cpp : 143 + 0x7]
     r4 = 0x403fc250    r5 = 0x00000001    r6 = 0x403fc228    sp = 0x100ffea8
     pc = 0x403fa271
    Found by: call frame info
 3  libc.so!__thread_entry [pthread.c : 217 + 0x6]
     r4 = 0x100fff00    r5 = 0x403fa229    r6 = 0x403fc250    r7 = 0x00000078
     r8 = 0x403fa229    r9 = 0x403fc250   r10 = 0x00100000    fp = 0x00000001
     sp = 0x100ffef0    pc = 0x400dbe18
    Found by: call frame info
 4  libc.so!pthread_create [pthread.c : 357 + 0xe]
     r4 = 0x100fff00    r5 = 0x00913ef8    r6 = 0x40102e8c    r7 = 0x00000078
     r8 = 0x403fa229    r9 = 0x403fc250   r10 = 0x00100000    fp = 0x00000001
     sp = 0x100fff00    pc = 0x400db96c
    Found by: call frame info
->me for now
Assignee: gsvelto → benjamin
Now looking at https://crash-stats.mozilla.com/report/index/eae8602a-5346-4edd-a614-b49cb2121213 because I found matching binary bits.

40715000-4077a000 r-xp 00000000 b3:13 8561       /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0 
409a3000-41843000 r-xp 00064000 b3:13 8561       /system/b2g/libxul.so
41843000-419e6000 rw-p 00f03000 b3:13 8561       /system/b2g/libxul.so

In terms of offsets relative to the base:
part 1 (r-x), 0-0x65000
anonymous hole up to 0x28e000
r-x up to 0x112e000 (absolute address 0x41843000)
rw- up to 0x12d1000

Crash is at offset 0x45168c nsStyleTransformMatrix::TransformFunctionOf

Section headers according to readelf:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .dynsym           DYNSYM          00000134 000134 00d880 10   A  2   1  4
  [ 2] .dynstr           STRTAB          0000d9b4 00d9b4 01791d 00   A  0   0  1
  [ 3] .hash             HASH            000252d4 0252d4 00563c 04   A  1   0  4
  [ 4] .gnu.version      VERSYM          0002a910 02a910 001b10 02   A  1   0  2
  [ 5] .gnu.version_d    VERDEF          0002c420 02c420 00001c 00   A  2   1  4
  [ 6] .gnu.version_r    VERNEED         0002c43c 02c43c 000250 00   A  2   4  4
  [ 7] .elfhack.text.v0  PROGBITS        0002c68c 02c68c 000048 00  AX  0   0  4
  [ 8] .elfhack.data.v0  PROGBITS        0002c6d8 02c6d8 033fa8 08   A  0   0  8
  [ 9] .rel.dyn          REL             00060680 060680 001620 08   A  1   0  4
  [10] .rel.plt          REL             00061ca0 061ca0 002910 08   A  1  11  4
  [11] .plt              PROGBITS        0028eaec 064aec 003dac 00  AX  0   0  4
  [12] .text             PROGBITS        002928c0 0688c0 c5a958 00  AX  0   0 64
  [13] .rodata           PROGBITS        00eed220 cc3220 23d4ac 00   A  0   0 16
  [14] .ARM.extab        PROGBITS        0112a6cc f006cc 00003c 00   A  0   0  4
  [15] .ARM.exidx        ARM_EXIDX       0112a708 f00708 000058 08  AL 12   0  4
  [16] .eh_frame         PROGBITS        0112a760 f00760 000034 00   A  0   0  4
  [17] .eh_frame_hdr     PROGBITS        0112a794 f00794 000014 00   A  0   0  4
  [18] .dynamic          DYNAMIC         0112b7b0 f007b0 0001a0 08  WA  2   0  4
  [19] .data             PROGBITS        0112b950 f00950 04b4f0 00  WA  0   0 16
  [20] .data.rel.ro      PROGBITS        01176e40 f4be40 13318c 00  WA  0   0  8
  [21] .init_array       INIT_ARRAY      012a9fcc 107efcc 000294 00  WA  0   0  4
  [22] .data.rel.ro.loca PROGBITS        012aa260 107f260 01d2b0 00  WA  0   0  8
  [23] .got              PROGBITS        012c7510 109c510 00640c 00  WA  0   0  4
  [24] .bss              NOBITS          012cd920 10a291c 069308 00  WA  0   0 16
  [25] .comment          PROGBITS        00000000 10a291c 000012 01  MS  0   0  1
  [26] .note.gnu.gold-ve NOTE            00000000 10a2930 000018 00      0   0  4
  [27] .ARM.attributes   ARM_ATTRIBUTES  00000000 10a2948 000032 00      0   0  1
  [28] .shstrtab         STRTAB          00000000 10a297a 000131 00      0   0  1

So the first mapping is sections 1-10, which are mapped at the same offset in the file and in memory.

Starting with section 11 (.plt), the file image and the in-memory location don't match: the file is basically contiguous at offset 0x64aec while the mapping is at 0x28eaec. Rounding down to the nearest page would give us offset 0x28e000.
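
A trivial Python check of that arithmetic, using the numbers above and assuming 4 KiB pages:

PAGE = 0x1000
vaddr, file_off = 0x28eaec, 0x64aec  # .plt: in-memory offset vs. file offset
print(hex(vaddr & ~(PAGE - 1)))      # 0x28e000: start of the third r-x mapping
                                     # (0x409a3000 - 0x40715000 = 0x28e000)
print(hex(file_off & ~(PAGE - 1)))   # 0x64000: the file offset that mapping uses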

In my x86-64 libxul.so, .plt is mapped at its direct location and the discontinuous mappings don't start until you get to unusual sections (.tbss and .ctors).

glandium, do you know whether the unusual mapping is expected (in which case we should fix this in breakpad) or whether this should be fixed to produce continuous sections in the linker?
Flags: needinfo?(mh+mozilla)
Or I could have just looked at the "Section to Segment mapping" output from readelf:

B2G-ARM:
Section to Segment mapping:
  Segment Sections...
   00     
   01     .dynsym .dynstr .hash .gnu.version .gnu.version_d .gnu.version_r .elfhack.text.v0 .elfhack.data.v0 .rel.dyn .rel.plt 
   02     .plt .text .rodata .ARM.extab .ARM.exidx .eh_frame .eh_frame_hdr 
   03     .dynamic .data .data.rel.ro .init_array .data.rel.ro.local .got .bss 
   04     .dynamic 
   05     .eh_frame_hdr 
   06     
   07     .ARM.exidx

x86-64:
 Section to Segment mapping:
  Segment Sections...
   00     .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   01     .ctors .dtors .jcr .data.rel.ro .dynamic .got .got.plt .data .bss 
   02     .dynamic 
   03     .tbss 
   04     .eh_frame_hdr 
   05
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #17)
> glandium, do you know whether the unusual mapping is expected (in which case
> we should fix this in breakpad) or whether this should be fixed to produce
> continuous sections in the linker?

This is elfhack at work. Breakpad works fine on Linux because the hole is still mapped from the file, so breakpad is happy. It works on Android because we have something to tell breakpad what the whole range is for libraries. I've been meaning to fix breakpad to handle that itself for a while...

That being said, if we're going to prelink on b2g, elfhack will be disabled, so the problem may go away, although I'm not sure what Android prelinking does to the mappings, since (AIUI) it removes relocations.

As for address calculation, it's pretty straightforward: take the address of your crash, subtract the base address of the library (the start address of its first mapping), and you get the virtual address of where you are; then you just have to match that result with addr2line.
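
A minimal Python sketch of that recipe (the addr2line binary name and the unstripped libxul.so path are assumptions, not from this bug):

import subprocess

def symbolize(crash_addr, module_base, unstripped_lib,
              addr2line="arm-linux-androideabi-addr2line"):
    # Module-relative virtual address, as described above.
    vaddr = crash_addr - module_base
    out = subprocess.check_output(
        [addr2line, "-C", "-f", "-e", unstripped_lib, hex(vaddr)])
    return vaddr, out.decode().strip()

vaddr, where = symbolize(0x40d4ed40, 0x40715000,
                         "objdir-gecko/dist/bin/libxul.so")
print(hex(vaddr), where)  # 0x639d40 -> AudioChannelAgent::StopPlaying, per the
                          # analysis above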
Flags: needinfo?(mh+mozilla)
Summary: B2G crash stacks are broken/corrupted → Breakpad can't deal with elfhacked binaries on ARM (B2G crash stacks are broken/corrupted)
From our discussion Friday, I believe that the immediate solution here is that we are going to disable elfhack. Glandium was going to check with cjones to confirm that. ->glandium
Assignee: benjamin → mh+mozilla
Flags: needinfo?(jones.chris.g)
That's fine as a temporary workaround, but elfhack saves us ~3MB of disk space which is very hard to leave on the table.

glandium mentioned on IRC that he had a hack in mind.
Flags: needinfo?(jones.chris.g) → needinfo?(mh+mozilla)
(In reply to Chris Jones [:cjones] [:warhammer] from comment #21)
> glandium mentioned on IRC that he had a hack in mind.

a hack on the elfhack side. But I won't know whether that actually works until at least tomorrow. Until then, we can probably disable elfhack, but sadly, since we're using gonk-misc/default-gecko-config instead of in-tree mozconfigs, that's inconvenient to do. If you want it disabled before tomorrow, you'll have to find someone to figure out how best to do it on the b2g18 branch, because I'm going to be offline very soon now.

That being said, we have had similar issues with elfhack in the past on linux and android (surprise), and we were able to reprocess broken crash reports in bug 637680. The Android minidump fixer program from there should work for b2g, provided the breakpad APIs it uses haven't changed in the meantime.
Flags: needinfo?(mh+mozilla)
Depends on: 822432
Blocks: 822584
Due to bug 822432 this is no longer a blocker; we have a server-side workaround.
blocking-basecamp: + → -
After bug 822584 landed, is bug 822432 still needed?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #24)
> After bug 822584 landed, is bug 822432 still needed?

That depends on whether there are still incoming crashes from builds prior to the landing (but maybe we don't care about those).
If that got landed on the b2g18 branch then it probably covers most of the things we care about. Not sure how hard it'd be to figure out if we're still getting broken crashes.
I'd think we don't care about B2G crashes from builds before the original 1/15 "code freeze", and bug 822584 landed on b2g18 on 1/14 from what I see, so if that means we don't need the reprocessing any more, we can tell the Socorro folks that it can be shut off for now.

BTW, what is this bug still around for?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #27)
> BTW, what is this bug still around for?

The actual breakpad issue.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE