Closed Bug 1309708 Opened 8 years ago Closed 8 years ago

Crash report dialog on first startup after APK update

Categories

(Firefox for Android Graveyard :: General, defect)

51 Branch
defect
Not set
normal

Tracking

(firefox51 fixed, firefox52 fixed)

RESOLVED FIXED
Firefox 52
Tracking Status
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: esawin, Assigned: esawin)

References

Details

Attachments

(4 files, 3 obsolete files)

With bug 1298090 landed, we show the crash report dialog alongside Fennec on the first startup after an APK update.
A workaround is to force-stop Fennec via Android App info before startup after the APK update.

It looks like the PackageReplacedReceiver is being kept alive by Android during Fennec startup and we run into some conflict when trying to open the libraries closed by the receiver.

Keeping the handles open fixes the issue and gives us a little performance boost on first startup, too.

I can investigate more, unless you have an idea already what's happening exactly, glandium?
Attachment #8800425 - Flags: review?(mh+mozilla)
Android can cache processes even if there is no activity/service/whatever currently running in it. That's probably what's happening here too. I think it makes sense to leave the process in a usable state when we are unpacking.
Looks like we're getting a segfault when reading libnss.so.
Here is a clean install and start without __wrap_dlclose for comparison.
(In reply to Eugen Sawin [:esawin] from comment #3)
> Created attachment 8801824 [details]
> Linker logs (with __wrap_dlclose)
> 
> Looks like we're getting a segfault when reading libnss.so.

Where do you see that in the attached log?
(In reply to Mike Hommey [:glandium] from comment #5)
> (In reply to Eugen Sawin [:esawin] from comment #3)
> > Created attachment 8801824 [details]
> > Linker logs (with __wrap_dlclose)
> > 
> > Looks like we're getting a segfault when reading libnss.so.
> 
> Where do you see that in the attached log?

10-17 20:16:28.777 17019 17019 I GeckoLinker: /data/app/org.mozilla.fennec_esawin-2/base.apk!/assets/armeabi-v7a/libnss3.so: Loading segment @0xe1de9000 r-x
...
10-17 20:16:28.779 17019 17019 I GeckoLinker: Relocate /data/app/org.mozilla.fennec_esawin-2/base.apk!/assets/armeabi-v7a/libnss3.so @0xe1de9000
...
10-17 20:16:35.375 17019 17089 I GeckoLinker: Caught segmentation fault @0xe1de9000
Why is the process still doing stuff after that segfault is caught?

10-17 20:16:35.375 17019 17089 I GeckoLinker: Redispatching to registered handler @0xeb75ffed

Obviously, that handler is somehow recovering from the crash, because the same thread does other stuff later:

10-17 20:16:36.582 17019 17089 E GeckoLinker: rabbit GetHandleByPtr: 0xd1c20f71
10-17 20:16:36.582 17019 17089 E GeckoLinker: rabbit GetHandleByPtr: /data/app/org.mozilla.fennec_esawin-2/base.apk!/assets/armeabi-v7a/libnss3.so
10-17 20:16:36.582 17019 17089 E GeckoLinker: rabbit Load: libsoftokn3.so
10-17 20:16:36.582 17019 17089 E GeckoLinker: rabbit Load [100]: libsoftokn3.so
10-17 20:16:36.582 17019 17089 I GeckoLinker: Mapped /data/app/org.mozilla.fennec_esawin-2/base.apk @0xc5dca000
10-17 20:16:36.582 17019 17089 I GeckoLinker: ZipCollection::Register("/data/app/org.mozilla.fennec_esawin-2/base.apk")
10-17 20:16:36.582 17019 17089 I GeckoLinker: /data/app/org.mozilla.fennec_esawin-2/base.apk - GetFile assets/armeabi-v7a/libsoftokn3.so

So it would be good to know what's at 0xeb75ffed, and I think that'd be dalvik. IMHO, this particular segfault message is a red herring.

A backtrace of the crash for which you get the crash report dialog would be a better thing to have.
Flags: needinfo?(esawin)
I should have been more clear, there is no actual crash, Fennec starts up alongside an (empty) crash report dialog. I suspect there is something unclean about closing the lib handle and re-opening it, but I've only got the logs as a clue.
Flags: needinfo?(esawin)
Can you run this under gdb with a breakpoint in SEGVHandler::handler (from mozglue/linker/ElfLoader.cpp), and find out what's trying to read at the original address of libnss3?
(Also, the content of /proc/self/maps when that happens would be good to have.)
Flags: needinfo?(esawin)
I've forced a crash in SEGVHandler::handler to get a stack (easier than attaching during startup), however this issue is not reproducing on debug builds.

On an optimized build, this is what I found:

4001d000-40060000 r-xp 00000000 b3:16 1228       /system/lib/libc.so
40060000-40063000 rw-p 00043000 b3:16 1228       /system/lib/libc.so
40063000-4006e000 rw-p 00000000 00:00 0 
4006e000-4007f000 r--s 00000000 00:0b 3080       /dev/__properties__ (deleted)
400cb000-400f0000 r-xp 00000000 b3:16 422        /system/bin/mksh
400f0000-400f1000 r--p 00024000 b3:16 422        /system/bin/mksh
400f1000-400f2000 rw-p 00025000 b3:16 422        /system/bin/mksh
400f2000-400f6000 rw-p 00000000 00:00 0 
4016e000-4016f000 r--p 00000000 00:00 0 
4017e000-40191000 r-xp 00000000 b3:16 406        /system/bin/linker
40191000-40192000 r--p 00012000 b3:16 406        /system/bin/linker
40192000-40193000 rw-p 00013000 b3:16 406        /system/bin/linker
40193000-401a3000 rw-p 00000000 00:00 0 
412c6000-412ce000 rw-p 00000000 00:00 0          [heap]
bed03000-bed24000 rw-p 00000000 00:00 0          [stack]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

#0  0x5d8a2a98 in SEGVHandler::handler (signum=<optimized out>, info=<optimized out>, context=<optimized out>)
    at /home/esawin/dev/g/mozglue/linker/ElfLoader.cpp:1259
#1  0xffff0514 in ?? ()
#2  0xffff0514 in ?? ()

That doesn't seem very helpful. I'm not sure why debug builds don't hit this issue, could be timing related?
Flags: needinfo?(esawin)
I don't expect a crash from a SEGFAULT signal handler to yield something useful, and the stack trace you got is proof of that. Can you try again with gdb/jimdb?
Flags: needinfo?(esawin)
I get the same stack when backtracing from SEGVHandler::handler without MOZ_CRASH:

40011000-40012000 r--p 00000000 00:00 0 
4002f000-40054000 r-xp 00000000 b3:16 422        /system/bin/mksh
40054000-40055000 r--p 00024000 b3:16 422        /system/bin/mksh
40055000-40056000 rw-p 00025000 b3:16 422        /system/bin/mksh
40056000-4005a000 rw-p 00000000 00:00 0 
4005a000-4006b000 r--s 00000000 00:0b 3398       /dev/__properties__ (deleted)
4010b000-4011e000 r-xp 00000000 b3:16 406        /system/bin/linker
4011e000-4011f000 r--p 00012000 b3:16 406        /system/bin/linker
4011f000-40120000 rw-p 00013000 b3:16 406        /system/bin/linker
40120000-40130000 rw-p 00000000 00:00 0 
40130000-40173000 r-xp 00000000 b3:16 1228       /system/lib/libc.so
40173000-40176000 rw-p 00043000 b3:16 1228       /system/lib/libc.so
40176000-40181000 rw-p 00000000 00:00 0 
41e9e000-41ea6000 rw-p 00000000 00:00 0          [heap]
beed4000-beef5000 rw-p 00000000 00:00 0          [stack]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

SEGVHandler::handler (signum=11, info=0x5db1cc90, context=0x5db1cd10)
    at /home/esawin/dev/g/mozglue/linker/ElfLoader.cpp:1258
1258      DEBUG_LOG("Caught segmentation fault @%p", info->si_addr);
(gdb) bt
#0  SEGVHandler::handler (signum=11, info=0x5db1cc90, context=0x5db1cd10) at /home/esawin/dev/g/mozglue/linker/ElfLoader.cpp:1258
#1  0xffff0514 in ?? ()
#2  0xffff0514 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Flags: needinfo?(esawin)
It's weird that the addresses for info and context are not mapped according to your /proc/$pid/maps dump. Are you sure it's a dump for the right process?

Can you print *info and print *(ucontext_t*)context?
Flags: needinfo?(esawin)
Attached file maps dump
You're right, it was the wrong process. I've attached the full maps dump.

SEGVHandler::handler (signum=11, info=0x5da1cc90, context=0x5da1cd10)
    at /home/esawin/dev/g/mozglue/linker/ElfLoader.cpp:1259
1259      DEBUG_LOG("Caught segmentation fault @%p", info->si_addr);
(gdb) bt
#0  SEGVHandler::handler (signum=11, info=0x5da1cc90, context=0x5da1cd10) at /home/esawin/dev/g/mozglue/linker/ElfLoader.cpp:1259
#1  0xffff0514 in ?? ()
#2  0xffff0514 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) p *info
$1 = {si_signo = 11, si_errno = 0, si_code = 2, _sifields = {_pad = {1571815424, 0 <repeats 28 times>}, _kill = {_pid = 1571815424, 
      _uid = 0}, _timer = {_tid = 1571815424, _overrun = 0, _pad = 0x5da1cca4 "", _sigval = {sival_int = 0, sival_ptr = 0x0}, 
      _sys_private = 0}, _rt = {_pid = 1571815424, _uid = 0, _sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {_pid = 1571815424, 
      _uid = 0, _status = 0, _utime = 0, _stime = 0}, _sigfault = {_addr = 0x5db00000}, _sigpoll = {_band = 1571815424, _fd = 0}}}

(gdb) p *(ucontext_t*)context
No symbol "ucontext_t" in current context.

(gdb) x/40xb context                                                                                                                      
0x5da1cd10:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cd18:     0x00    0x90    0xa1    0x5d    0x00    0x00    0x00    0x00
0x5da1cd20:     0x00    0x40    0x00    0x00    0x0e    0x00    0x00    0x00
0x5da1cd28:     0x17    0x00    0x00    0x00    0x04    0x12    0x00    0x00
0x5da1cd30:     0x00    0x00    0xb0    0x5d    0x21    0x95    0xc4    0x61


libxul.so had been mapped to 0x5db00000 before the dlclose afaics.
Flags: needinfo?(esawin)
By libxul.so I mean /data/app/org.mozilla.fennec_esawin-2.apk!/assets/armeabi-v7a/libxul.so.
(In reply to Eugen Sawin [:esawin] from comment #15)
> (gdb) x/40xb context                                                        
> 
> 0x5da1cd10:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
> 0x5da1cd18:     0x00    0x90    0xa1    0x5d    0x00    0x00    0x00    0x00
> 0x5da1cd20:     0x00    0x40    0x00    0x00    0x0e    0x00    0x00    0x00
> 0x5da1cd28:     0x17    0x00    0x00    0x00    0x04    0x12    0x00    0x00
> 0x5da1cd30:     0x00    0x00    0xb0    0x5d    0x21    0x95    0xc4    0x61

Sadly, this is cut right when it begins to be interesting...  need at least 64 more bytes. Make that 200 for good measure.
Flags: needinfo?(esawin)
#0  SEGVHandler::handler (signum=11, info=0x5da1cc90, context=0x5da1cd10) at /home/esawin/dev/g/mozglue/linker/ElfLoader.cpp:1259
#1  0xffff0514 in ?? ()
#2  0xffff0514 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p *info
$1 = {si_signo = 11, si_errno = 0, si_code = 2, _sifields = {_pad = {1571815424, 0 <repeats 28 times>}, _kill = {_pid = 1571815424, 
      _uid = 0}, _timer = {_tid = 1571815424, _overrun = 0, _pad = 0x5da1cca4 "", _sigval = {sival_int = 0, sival_ptr = 0x0}, 
      _sys_private = 0}, _rt = {_pid = 1571815424, _uid = 0, _sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {_pid = 1571815424, 
      _uid = 0, _status = 0, _utime = 0, _stime = 0}, _sigfault = {_addr = 0x5db00000}, _sigpoll = {_band = 1571815424, _fd = 0}}}
(gdb) x/200xb context
0x5da1cd10:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cd18:     0x00    0x90    0xa1    0x5d    0x00    0x00    0x00    0x00
0x5da1cd20:     0x00    0x40    0x00    0x00    0x0e    0x00    0x00    0x00
0x5da1cd28:     0x17    0x00    0x00    0x00    0x04    0x12    0x00    0x00
0x5da1cd30:     0x00    0x00    0xb0    0x5d    0x21    0x35    0x07    0x61
0x5da1cd38:     0x04    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cd40:     0x00    0x00    0x00    0x00    0xec    0xf8    0xdf    0x5d
0x5da1cd48:     0xf0    0xf8    0xdf    0x5d    0x04    0x00    0x00    0x00
0x5da1cd50:     0xf4    0xf8    0xdf    0x5d    0xfc    0x60    0xa0    0x5d
0x5da1cd58:     0x08    0x28    0x62    0x61    0x30    0xd3    0x61    0x61
0x5da1cd60:     0xfd    0xb8    0x2e    0x5f    0xb4    0xf8    0xdf    0x5d
0x5da1cd68:     0xa9    0xf8    0x5c    0x60    0x7e    0xf8    0x5c    0x60
0x5da1cd70:     0x30    0x00    0x00    0x80    0x00    0x00    0xb0    0x5d
0x5da1cd78:     0x04    0x12    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cd80:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cd88:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cd90:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cd98:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cda0:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cda8:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cdb0:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cdb8:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cdc0:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cdc8:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x5da1cdd0:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
Flags: needinfo?(esawin)
So the interesting data in there is:
lr: 0x605cf8a9
pc: 0x605cf87e

Assuming /proc/pid/maps isn't changed, that means the crash is happening from somewhere within /data/data/org.mozilla.fennec_esawin/cache/libxul.so, which starts at 0x5fb0e000.
That makes pc at 0xac187e and lr at 0xac18a9 within it. You'd need to run addr2line to know what those are.

But that's assuming /proc/pid/maps is unchanged.
Flags: needinfo?(esawin)
Flags: needinfo?(esawin)
Since troubleshooting on IRC didn't get us closer to a proper solution of this issue, I would like to isolate the broadcast receiver into its own process for now to prevent any collisions with Fennec's linker.

With this patch we run the receiver in a separate process (":extract") private to Fennec which avoids any linker issues.
Attachment #8808808 - Flags: review?(snorp)
Comment on attachment 8808808 [details] [diff] [review]
0001-Bug-1309708-2.0-Run-the-library-extraction-in-a-dedi.patch

Review of attachment 8808808 [details] [diff] [review]:
-----------------------------------------------------------------

I was initially ok with this as a band-aid, but it introduces a race if we are starting fennec and unpacking after an update at the same time. I think I prefer the original patch on this bug, but we should still try to figure out the real problem.
Attachment #8808808 - Flags: review?(snorp) → review-
When opening a library via CustomElf::Load, we report the mapping for the crash reporter.
We don't remove the mapping when closing the library handle, which results in stale (and duplicate) reported mappings (when re-opening).
Attachment #8800425 - Attachment is obsolete: true
Attachment #8808808 - Attachment is obsolete: true
Attachment #8800425 - Flags: review?(mh+mozilla)
Attachment #8809229 - Flags: review?(mh+mozilla)
Comment on attachment 8809229 [details] [diff] [review]
0001-Bug-1309708-1.1-Delete-reported-library-mapping-on-c.patch

Review of attachment 8809229 [details] [diff] [review]:
-----------------------------------------------------------------

This makes so much more sense now. Thanks for persevering and finding the root cause. Yet again, bug 689178 is greatly needed.

::: mozglue/android/APKOpen.cpp
@@ +206,5 @@
> +    struct mapping_info *info = &lib_mapping[pos];
> +    if (!strcmp(info->name, name)) {
> +      struct mapping_info *last = &lib_mapping[--mapping_count];
> +      free(info->name);
> +      *info = *last;

You shouldn't decrease mapping_count before this.
Attachment #8809229 - Flags: review?(mh+mozilla) → review+
Addressed comment.
Attachment #8809229 - Attachment is obsolete: true
Attachment #8809390 - Flags: review+
Pushed by esawin@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/459230d2ab75
[1.2] Delete reported library mapping on close. r=glandium
https://hg.mozilla.org/mozilla-central/rev/459230d2ab75
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
Comment on attachment 8809390 [details] [diff] [review]
0001-Bug-1309708-1.2-Delete-reported-library-mapping-on-c.patch

Approval Request Comment
[Feature/regressing bug #]: Old issue, exposed by bug 1298090.
[User impact if declined]: Potential crash on startup with broken/empty crash report.
[Describe test coverage new/current, TreeHerder]: Nightly, Aurora.
[Risks and why]: Low, affects only crash reporter integration.
[String/UUID change made/needed]: None.
Attachment #8809390 - Flags: approval-mozilla-beta?
Comment on attachment 8809390 [details] [diff] [review]
0001-Bug-1309708-1.2-Delete-reported-library-mapping-on-c.patch

Fix an issue related to crash report handling. Beta51+. Should be in 51 beta 4.
Attachment #8809390 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: