Closed Bug 821353 Opened 12 years ago Closed 11 years ago

Breakpad can't deal with elfhacked binaries on ARM (B2G crash stacks are broken/corrupted)

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)

Tracking

(blocking-basecamp:-)

RESOLVED DUPLICATE of bug 689178
B2G C3 (12dec-1jan)
blocking-basecamp -

People

(Reporter: davidb, Assigned: glandium)

References

Details

Severity: normal → major
Priority: -- → P1
blocking-basecamp: --- → ?
Summary: B2G crash stacks are borken/corrupted → B2G crash stacks are broken/corrupted
I don't think I have seen a complete, non-corrupt stack in B2G crash reports in the last few weeks.

Here's a few examples for how they look right now:

The most common kind of stack is what we are seeing in bp-ba8cd560-c234-4b5b-9cc6-44ede2121212:

0 		@0x40788ffe 	
1 	libmozglue.so 	malloc_mutex_unlock 	jemalloc.c:1649
2 	libmozglue.so 	arena_malloc 	jemalloc.c:4151
3 	libmozglue.so 	imalloc 	jemalloc.c:4231
4 	libmozglue.so 	realloc 	jemalloc.c:6551
5 		@0x13adbfa 	
6 	libmozglue.so 	malloc_mutex_unlock 	jemalloc.c:1649
7 	libmozglue.so 	arena_dalloc 	jemalloc.c:4634
8 		@0x2 	
9 	libmozglue.so 	malloc_mutex_unlock 	jemalloc.c:1649
10 	libmozglue.so 	arena_dalloc 	jemalloc.c:4634
11 	libmozglue.so 	arena_malloc 	jemalloc.c:4151
12 		@0xffffffff 	


Seeing a lot of jemalloc frames as the only thing we can find by stack scanning is pretty common there. Note that when you look at the other threads, some at least go back to pthread_create, but some end in garbage or have garbage frames in them.


For bug 819823, I did get bp-17322233-a8cb-4b77-acdc-f52502121210 and bp-555cf5d3-372f-4ce3-83ff-2f0ae2121209 with stacks like this:

0 		@0x40d4ed40 	
1 	libnspr4.so 	PR_AtomicDecrement 	pratom.c:280
2 	libnspr4.so 	pt_PostNotifies 	ptsynch.c:125
3 	libnspr4.so 	PR_Unlock 	ptsynch.c:205
4 	libnspr4.so 	PR_ExitMonitor 	ptsynch.c:557
5 		@0x40e1d665

Cervantes You posted in comment #1 of that bug that, with gdb, he found that this was actually crashing in mozilla::dom::AudioChannelAgent::StopPlaying and didn't have NSPR in the stack of the crashing thread at all.


From what I can tell, this corruption is affecting all crash reports from B2G. The most useful stacks I've seen are something like bp-3f72dc51-6d41-46b4-8d9b-06ac22121212, and even there the stack ends in garbage after 4 frames.
Summary: B2G crash stacks are broken/corrupted → B2G crash stacks are borken/corrupted
Chris, can you take this?  Maybe Gabriele can?
Assignee: nobody → jones.chris.g
blocking-basecamp: ? → +
Target Milestone: --- → B2G C3 (12dec-1jan)
Since I'm also working on generating stacks for the profiler I can take a look at this one too.
Can we first verify that this works in the "normal" case where there is no stack corruption, e.g. by intentionally crashing using MOZ_CRASH? If *that* case doesn't work, then we need to look at a particular minidump and figure out where the processing is failing (symbols, memory, or the stackwalk program itself).
Assignee: jones.chris.g → gsvelto
I'm fairly certain that when I first tested this code I tested by using kill -ABRT <pid of b2g> and getting stacks out of the resulting dumps. I'll do a fresh build for my panda and sanity check this locally.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #5)
> I'm fairly certain that when I first tested this code I tested by using kill
> -ABRT <pid of b2g> and getting stacks out of the resulting dumps. I'll do a
> fresh build for my panda and sanity check this locally.

It would be a hilarious/cruel joke if stacks work on panda and don't on the actual phones...
Summary: B2G crash stacks are borken/corrupted → B2G crash stacks are broken/corrupted
I did my initial testing on an SGS2, FWIW. I don't have an otoro or unagi, but if one of you does (and has a local build), I can give you some really simple steps to test.
I do have an Unagi. That's what I have been using since November. Let me know how we can verify that.
Okay:
1) Install a locally-built build
2) Find the pid of the b2g process and kill -ABRT it on the phone
3) adb pull the minidump file from /data/b2g/whatever/Crash Reports/pending/ before you click anything on the crash UI which might delete it.
4) Run ./build.sh buildsymbols in your B2G dir
5) Download and build the Breakpad source to get minidump_stackwalk: http://code.google.com/p/google-breakpad/source/checkout
6) Run google-breakpad/src/processor/minidump_stackwalk /path/to/minidump /path/to/B2G/objdir-gecko/dist/crashreporter-symbols

And see if you get usable stack traces out.
Perhaps more interesting is to install a nightly that's supposed to have symbols and run the same steps and then attach/forward the minidump so that we can run it on the server.
ok, interesting data from https://crash-stats.mozilla.com/report/index/17322233-a8cb-4b77-acdc-f52502121210 (mentioned in comment #1 as diagnosed):

The crash address is @0x40d4ed40
In the raw dump:
Module|libxul.so||libxul.so|2220D954C3995C395D6E424665870D0A0|0x40715000|0x40779fff|0

which would put the size of libxul.so at 0x65000 (last offset 0x64FFF)

but according to the matching .sym file, the actual size of libxul.so should be closer to 0xeedd66 (that's the last line record). In that case this crash is actually libxul + 0x639d40, which is (probably correctly!) FUNC 639d28 30 0 mozilla::dom::AudioChannelAgent::StopPlaying

Bazinga! I don't know why the module mapping table of the minidump is wrong, though. Maybe something about how we do preloading?
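
Spelling that lookup out as a quick Python sketch (the .sym file path here is hypothetical; the FUNC record format — "FUNC <address> <size> <param_size> <name>", hex fields — is what the Breakpad symbol dumper emits, as seen above):

# Minimal sketch of the by-hand calculation above; sym_path is hypothetical.
def find_func(sym_path, crash_addr, module_base):
    offset = crash_addr - module_base  # 0x40d4ed40 - 0x40715000 = 0x639d40
    with open(sym_path) as f:
        for line in f:
            if not line.startswith("FUNC "):
                continue
            # FUNC <address> <size> <parameter_size> <name>
            _, addr, size, _params, name = line.split(" ", 4)
            start, length = int(addr, 16), int(size, 16)
            if start <= offset < start + length:
                return offset, name.strip()
    return offset, None

offset, name = find_func("libxul.so.sym", 0x40d4ed40, 0x40715000)
print(hex(offset), name)
# 0x639d40 mozilla::dom::AudioChannelAgent::StopPlaying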
Are we prelinking anything yet?
We get the module size info from /proc/self/maps. Conveniently, Breakpad also jams a copy of maps wholesale into the dump file. If you run minidump_dump on that dump you can see what it looks like. That definitely sounds like it could be our smoking gun!
module[44]
MDRawModule
  base_of_image                   = 0x40715000
  size_of_image                   = 0x65000
  checksum                        = 0x0
  time_date_stamp                 = 0x0
  module_name_rva                 = 0x459f8
  version_info.signature          = 0x0
  version_info.struct_version     = 0x0
  version_info.file_version       = 0x0:0x0
  version_info.product_version    = 0x0:0x0
  version_info.file_flags_mask    = 0x0
  version_info.file_flags         = 0x0
  version_info.file_os            = 0x0
  version_info.file_type          = 0x0
  version_info.file_subtype       = 0x0
  version_info.file_date          = 0x0:0x0
  cv_record.data_size             = 34
  cv_record.rva                   = 0x459d0
  misc_record.data_size           = 0
  misc_record.rva                 = 0x0
  (code_file)                     = "/system/b2g/libxul.so"
  (code_identifier)               = "id"
  (cv_record).cv_signature        = 0x53445352
  (cv_record).signature           = 2220d954-c399-5c39-5d6e-424665870d0a
  (cv_record).age                 = 0
  (cv_record).pdb_file_name       = "libxul.so"
  (misc_record)                   = (null)
  (debug_file)                    = "libxul.so"
  (debug_identifier)              = "2220D954C3995C395D6E424665870D0A0"
  (version)                       = ""

And from MD_LINUX_MAPS output:

40715000-4077a000 r-xp 00000000 1f:05 1193       /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0 
409a3000-41841000 r-xp 00064000 1f:05 1193       /system/b2g/libxul.so
41841000-419e4000 rw-p 00f02000 1f:05 1193       /system/b2g/libxul.so
419e4000-41a4d000 rw-p 00000000 00:00 0 

so the libxul.so mapping is discontinuous and the "module" mapping is only picking up the first of the 3 mappings.
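
For illustration, here's a small Python sketch (input inlined from the MD_LINUX_MAPS lines above) that folds all file-backed mappings of a module together, which is roughly what a fix would have to do instead of taking only the first mapping:

import re

MAPS = """\
40715000-4077a000 r-xp 00000000 1f:05 1193       /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0
409a3000-41841000 r-xp 00064000 1f:05 1193       /system/b2g/libxul.so
41841000-419e4000 rw-p 00f02000 1f:05 1193       /system/b2g/libxul.so
419e4000-41a4d000 rw-p 00000000 00:00 0
"""

def module_extents(maps_text):
    extents = {}  # path -> (lowest start, highest end)
    for line in maps_text.splitlines():
        m = re.match(r"([0-9a-f]+)-([0-9a-f]+) \S+ \S+ \S+ \S+\s+(/\S+)", line)
        if not m:
            continue  # anonymous mappings (the elfhack hole) carry no path
        start, end, path = int(m.group(1), 16), int(m.group(2), 16), m.group(3)
        lo, hi = extents.get(path, (start, end))
        extents[path] = (min(lo, start), max(hi, end))
    return extents

for path, (lo, hi) in module_extents(MAPS).items():
    print("%s base=%#x size=%#x" % (path, lo, hi - lo))
# /system/b2g/libxul.so base=0x40715000 size=0x12cf000 (vs. 0x65000 in the dump)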
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)

> 
> And see if you get usable stack traces out.


From what I can see, I get a usable stack trace.

Like this:
Thread 1
 0  libc.so + 0xd6e8
     r4 = 0x403fc458    r5 = 0x403fc454    r6 = 0xfffffd3e    r7 = 0x000000f0
     r8 = 0x403fb084    r9 = 0x403fb23b   r10 = 0x403fb084    fp = 0x403fc228
     sp = 0x100ffe70    lr = 0x400db55c    pc = 0x400d66e8
    Found by: given as instruction pointer in context
 1  libc.so!__pthread_cond_timedwait [pthread.c : 1500 + 0xa]
     sp = 0x100ffe90    pc = 0x400db610
    Found by: stack scanning
 2  gralloc.msm7627a.so!disp_loop [framebuffer.cpp : 143 + 0x7]
     r4 = 0x403fc250    r5 = 0x00000001    r6 = 0x403fc228    sp = 0x100ffea8
     pc = 0x403fa271
    Found by: call frame info
 3  libc.so!__thread_entry [pthread.c : 217 + 0x6]
     r4 = 0x100fff00    r5 = 0x403fa229    r6 = 0x403fc250    r7 = 0x00000078
     r8 = 0x403fa229    r9 = 0x403fc250   r10 = 0x00100000    fp = 0x00000001
     sp = 0x100ffef0    pc = 0x400dbe18
    Found by: call frame info
 4  libc.so!pthread_create [pthread.c : 357 + 0xe]
     r4 = 0x100fff00    r5 = 0x00913ef8    r6 = 0x40102e8c    r7 = 0x00000078
     r8 = 0x403fa229    r9 = 0x403fc250   r10 = 0x00100000    fp = 0x00000001
     sp = 0x100fff00    pc = 0x400db96c
    Found by: call frame info
->me for now
Assignee: gsvelto → benjamin
Now looking at https://crash-stats.mozilla.com/report/index/eae8602a-5346-4edd-a614-b49cb2121213 because I found matching binary bits.

40715000-4077a000 r-xp 00000000 b3:13 8561       /system/b2g/libxul.so
4077a000-409a3000 r-xp 00000000 00:00 0 
409a3000-41843000 r-xp 00064000 b3:13 8561       /system/b2g/libxul.so
41843000-419e6000 rw-p 00f03000 b3:13 8561       /system/b2g/libxul.so

In terms of offsets relative to the base:
part 1 (r-x), 0-0x65000
anonymous hole up to 0x28e000
r-x up to 0x112e000 (absolute address 0x41843000)
rw- up to 0x12d1000

Crash is at offset 0x45168c nsStyleTransformMatrix::TransformFunctionOf

Section headers according to readelf:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .dynsym           DYNSYM          00000134 000134 00d880 10   A  2   1  4
  [ 2] .dynstr           STRTAB          0000d9b4 00d9b4 01791d 00   A  0   0  1
  [ 3] .hash             HASH            000252d4 0252d4 00563c 04   A  1   0  4
  [ 4] .gnu.version      VERSYM          0002a910 02a910 001b10 02   A  1   0  2
  [ 5] .gnu.version_d    VERDEF          0002c420 02c420 00001c 00   A  2   1  4
  [ 6] .gnu.version_r    VERNEED         0002c43c 02c43c 000250 00   A  2   4  4
  [ 7] .elfhack.text.v0  PROGBITS        0002c68c 02c68c 000048 00  AX  0   0  4
  [ 8] .elfhack.data.v0  PROGBITS        0002c6d8 02c6d8 033fa8 08   A  0   0  8
  [ 9] .rel.dyn          REL             00060680 060680 001620 08   A  1   0  4
  [10] .rel.plt          REL             00061ca0 061ca0 002910 08   A  1  11  4
  [11] .plt              PROGBITS        0028eaec 064aec 003dac 00  AX  0   0  4
  [12] .text             PROGBITS        002928c0 0688c0 c5a958 00  AX  0   0 64
  [13] .rodata           PROGBITS        00eed220 cc3220 23d4ac 00   A  0   0 16
  [14] .ARM.extab        PROGBITS        0112a6cc f006cc 00003c 00   A  0   0  4
  [15] .ARM.exidx        ARM_EXIDX       0112a708 f00708 000058 08  AL 12   0  4
  [16] .eh_frame         PROGBITS        0112a760 f00760 000034 00   A  0   0  4
  [17] .eh_frame_hdr     PROGBITS        0112a794 f00794 000014 00   A  0   0  4
  [18] .dynamic          DYNAMIC         0112b7b0 f007b0 0001a0 08  WA  2   0  4
  [19] .data             PROGBITS        0112b950 f00950 04b4f0 00  WA  0   0 16
  [20] .data.rel.ro      PROGBITS        01176e40 f4be40 13318c 00  WA  0   0  8
  [21] .init_array       INIT_ARRAY      012a9fcc 107efcc 000294 00  WA  0   0  4
  [22] .data.rel.ro.loca PROGBITS        012aa260 107f260 01d2b0 00  WA  0   0  8
  [23] .got              PROGBITS        012c7510 109c510 00640c 00  WA  0   0  4
  [24] .bss              NOBITS          012cd920 10a291c 069308 00  WA  0   0 16
  [25] .comment          PROGBITS        00000000 10a291c 000012 01  MS  0   0  1
  [26] .note.gnu.gold-ve NOTE            00000000 10a2930 000018 00      0   0  4
  [27] .ARM.attributes   ARM_ATTRIBUTES  00000000 10a2948 000032 00      0   0  1
  [28] .shstrtab         STRTAB          00000000 10a297a 000131 00      0   0  1

So the first mapping is sections 1-10, which are mapped at the same offset in the file and in memory.

Starting with section 11 (.plt), the file image and the in-memory location don't match: the file is basically contiguous at offset 0x64aec while the mapping is at 0x28eaec. Rounding down to the nearest page would give us offset 0x28e000.
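
A trivial Python check of that arithmetic, using the numbers above and assuming 4 KiB pages:

PAGE = 0x1000
vaddr, file_off = 0x28eaec, 0x64aec  # .plt: in-memory offset vs. file offset
print(hex(vaddr & ~(PAGE - 1)))      # 0x28e000: start of the third r-x mapping
                                     # (0x409a3000 - 0x40715000 = 0x28e000)
print(hex(file_off & ~(PAGE - 1)))   # 0x64000: the file offset that mapping uses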

In my x86-64 libxul.so, .plt is mapped at its direct location and the discontinuous mappings don't start until you get to unusual sections (.tbss and .ctors).

glandium, do you know whether the unusual mapping is expected (in which case we should fix this in breakpad) or whether this should be fixed to produce continuous sections in the linker?
Flags: needinfo?(mh+mozilla)
Or I could have just looked at the "Section to Segment mapping" output from readelf:

B2G-ARM:
Section to Segment mapping:
  Segment Sections...
   00     
   01     .dynsym .dynstr .hash .gnu.version .gnu.version_d .gnu.version_r .elfhack.text.v0 .elfhack.data.v0 .rel.dyn .rel.plt 
   02     .plt .text .rodata .ARM.extab .ARM.exidx .eh_frame .eh_frame_hdr 
   03     .dynamic .data .data.rel.ro .init_array .data.rel.ro.local .got .bss 
   04     .dynamic 
   05     .eh_frame_hdr 
   06     
   07     .ARM.exidx

x86-64:
 Section to Segment mapping:
  Segment Sections...
   00     .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   01     .ctors .dtors .jcr .data.rel.ro .dynamic .got .got.plt .data .bss 
   02     .dynamic 
   03     .tbss 
   04     .eh_frame_hdr 
   05
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #17)
> glandium, do you know whether the unusual mapping is expected (in which case
> we should fix this in breakpad) or whether this should be fixed to produce
> continuous sections in the linker?

This is elfhack at work. Breakpad works fine on Linux because the hole is still mapped from the file, so breakpad is happy. It works on Android because we have something to tell breakpad what the whole range is for libraries. I've been meaning to fix breakpad to handle that itself for a while...

That being said, if we're going to prelink on b2g, elfhack will be disabled, so the problem may go away, although I'm not sure what Android prelinking does to the mappings, since (AIUI) it removes relocations.

As for address calculation, it's pretty straightforward: take the address of your crash, subtract the base address of the library (the start address of its first mapping), and you get the virtual address of where you are; then you just have to match that result with addr2line.
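
A minimal Python sketch of that recipe (the addr2line binary name and the unstripped libxul.so path are assumptions, not from this bug):

import subprocess

def symbolize(crash_addr, module_base, unstripped_lib,
              addr2line="arm-linux-androideabi-addr2line"):
    # Module-relative virtual address, as described above.
    vaddr = crash_addr - module_base
    out = subprocess.check_output(
        [addr2line, "-C", "-f", "-e", unstripped_lib, hex(vaddr)])
    return vaddr, out.decode().strip()

vaddr, where = symbolize(0x40d4ed40, 0x40715000,
                         "objdir-gecko/dist/bin/libxul.so")
print(hex(vaddr), where)  # 0x639d40 -> AudioChannelAgent::StopPlaying, per the
                          # analysis above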
Flags: needinfo?(mh+mozilla)
Summary: B2G crash stacks are broken/corrupted → Breakpad can't deal with elfhacked binaries on ARM (B2G crash stacks are broken/corrupted)
From our discussion Friday, I believe that the immediate solution here is that we are going to disable elfhack. Glandium was going to check with cjones to confirm that. ->glandium
Assignee: benjamin → mh+mozilla
Flags: needinfo?(jones.chris.g)
That's fine as a temporary workaround, but elfhack saves us ~3MB of disk space which is very hard to leave on the table.

glandium mentioned on IRC that he had a hack in mind.
Flags: needinfo?(jones.chris.g) → needinfo?(mh+mozilla)
(In reply to Chris Jones [:cjones] [:warhammer] from comment #21)
> glandium mentioned on IRC that he had a hack in mind.

a hack on the elfhack side. But I won't know whether that actually works until at least tomorrow. Until then, we can probably disable elfhack, but sadly, since we're using gonk-misc/default-gecko-config instead of in-tree mozconfigs, that's inconvenient to do. If you want it disabled before tomorrow, you'll have to find someone to figure out how best to do it on the b2g18 branch, because I'm going to be offline very soon now.

That being said, we have had similar issues with elfhack in the past on linux and android (surprise), and we were able to reprocess broken crash reports in bug 637680. The Android minidump fixer program from there should work for b2g, provided the breakpad APIs it uses haven't changed in the meantime.
Flags: needinfo?(mh+mozilla)
Depends on: 822432
Blocks: 822584
Due to bug 822432 this is no longer a blocker; we have a server-side workaround.
blocking-basecamp: + → -
After bug 822584 landed, is bug 822432 still needed?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #24)
> After bug 822584 landed, is bug 822432 still needed?

That depends on whether there are still incoming crashes from builds prior to the landing (but maybe we don't care about those).
If that got landed on the b2g18 branch then it probably covers most of the things we care about. Not sure how hard it'd be to figure out if we're still getting broken crashes.
I'd think we don't care about B2G crashes from builds before the original 1/15 "code freeze", and bug 822584 landed on b2g18 on 1/14 from what I see, so if that means we don't need the reprocessing any more, we can tell the Socorro folks that it can be shut off for now.

BTW, what is this bug still around for?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #27)
> BTW, what is this bug still around for?

The actual breakpad issue.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE