Closed Bug 1036286 Opened 5 years ago Closed 5 years ago

Firefox on Android L unexpectedly exit by sighandler

Categories

(Firefox for Android :: General, defect)

ARM
Android
defect
Not set

Tracking

()

VERIFIED FIXED
Firefox 34
Tracking Status
firefox31 - wontfix
firefox32 + fixed
firefox33 + fixed
firefox34 + verified
fennec 32+ ---

People

(Reporter: m_kato, Assigned: glandium)

References

(Blocks 1 open bug)

Details

Attachments

(11 files, 3 obsolete files)

651.75 KB, text/plain
Details
601.53 KB, text/plain
Details
537.26 KB, text/plain
Details
58.58 KB, text/plain
Details
477.31 KB, application/x-sharedlib
Details
704.99 KB, text/plain
Details
1.12 MB, text/plain
Details
104.12 KB, text/plain
Details
2.35 MB, text/plain
Details
7.20 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
4.59 KB, patch
froydnj
: review+
Details | Diff | Splinter Review
Even if landing bug 1030899, Firefox on Android L unexpectedly exits.  If disabling signal handler (sDisableSignalHandling in profile/dirserviceprovider/src/nsProfileLock.cpp = true), it isn't exited.
Summary: Firefox on Android L unexpectedly exit when → Firefox on Android L unexpectedly exit
Summary: Firefox on Android L unexpectedly exit → Firefox on Android L unexpectedly exit by sighandler
this signal handler is for bug 76431, but Android version cannot share profile...
What signal is being handled here?

The signal handler here is just supposed to be a last-ditch correction before we crash to unlock the profile. In general if we've hit the signal handler, the fatal error is already present.
Component: Profile: BackEnd → General
Product: Core → Firefox for Android
tracking-fennec: --- → ?
Not tracking for 31. It is too late for the Android L now. However, we could take the fixes in 32.
Assignee: nobody → snorp
tracking-fennec: ? → 32+
This issue depends on custom linker.  At first, SEGVHandler::handler catches SEGSEGV, then it load code without error.  But, after that another sighandler of profile catches it with same si_addr.  So process is exited by exit().
Glandium, see comment 4
Flags: needinfo?(mh+mozilla)
Please attach logs with MOZ_DEBUG_LINKER set to 1.
Flags: needinfo?(mh+mozilla)
Makoto, can you provide the logs glandium asked for?
Flags: needinfo?(m_kato)
Attached file MOZ_DEBUG_LINKER.txt (obsolete) —
I have been observing the progress of bugs relating to the Android L Preview and wwould like to try to help accelerate the process. Hopefully this log is what you guys needed.
Attachment #8457133 - Attachment description: Full log: Nexus 5, Android L Preview, Fennec 7/16/2014 → MOZ_DEBUG_LINKER.txt
Glandium, see the attached logs
Flags: needinfo?(m_kato) → needinfo?(mh+mozilla)
This log doesn't show anything problematic at the linker level, and doesn't match what Makoto described.
Flags: needinfo?(mh+mozilla) → needinfo?(m_kato)
Some additional information. The device I am testing from is a Nexus 5. The build I gathered the log from was nightly http://ftp.mozilla.org/pub/mozilla.org/mobile/nightly/latest-mozilla-central-android/fennec-33.0a1.multi.android-arm.apk on 7/16/14 (today) since as I understand bug 1030899 is already patched in all builds 32+. 

I gathered the log by starting firefox with "adb shell am start -a android.activity.MAIN -n org.mozilla.fennec/.App --es env0 MOZ_DEBUG_LINKER=1" and simply having adb logcat write to file in the background. 

If I have done something incorrectly or if there is anything else that I could be instructed to try I would be glad to continue helping.
(In reply to Mike Hommey [:glandium] from comment #10)
> This log doesn't show anything problematic at the linker level, and doesn't
> match what Makoto described.

This problem occurs when installing multiple signal handlers. If nsProfileLock::FatalSignalHandler() catches SIGSEGV for custom linker, this signal handler function will call exit() even if that SIGSEGV should handle custom linker only.

So adb log show the following.

I/        ( 1477): Process 32017 exited due to signal (11)


On my debug build, when it accesses PumpMessage::PumpMessage() by new MessageLoopForUI into NS_InitXPCOM2 (the page around this function isn't loaded yet), nsProfileLock::FatalSignalHandler will call exit() after old signal handler on custom linker.

Of course, when I turn off dynamic load of customer linker (modify signalHandlingSlow=true), no SIGSEGV on NS_InitXPCOM2.

But since libart installs another signal handler, ART JIT code causes signal, then nsProfileLock::FatalSignalHandler will call exit()...

I think that Android cannot share user profile, so is it unnecessary to lock profile?
Flags: needinfo?(m_kato)
There are reasons why I'm asking for a log with MOZ_DEBUG_LINKER=1, so please do so.
Flags: needinfo?(m_kato)
Attached file MOZ_DEBUG_LINKER=1
Attachment #8457133 - Attachment is obsolete: true
Flags: needinfo?(m_kato)
Add trace version.

diff --git a/profile/dirserviceprovider/src/nsProfileLock.cpp b/profile/dirserviceprovider/src/nsProfileLock.cpp
--- a/profile/dirserviceprovider/src/nsProfileLock.cpp
+++ b/profile/dirserviceprovider/src/nsProfileLock.cpp
@@ -119,16 +119,18 @@ static struct sigaction SIGSEGV_oldact;
 static struct sigaction SIGTERM_oldact;
 
 void nsProfileLock::FatalSignalHandler(int signo
 #ifdef SA_SIGINFO
                                        , siginfo_t *info, void *context
 #endif
                                        )
 {
+    __android_log_print(ANDROID_LOG_INFO, "Gecko", "Entering nsProfileLock::FatalSignalHandler");
+
     // Remove any locks still held.
     RemovePidLockFiles(true);
 
     // Chain to the old handler, which may exit.
     struct sigaction *oldact = nullptr;
 
     switch (signo) {
       case SIGHUP:
@@ -172,26 +174,28 @@ void nsProfileLock::FatalSignalHandler(i
 
             sigprocmask(SIG_UNBLOCK, &unblock_sigs, nullptr);
 
             raise(signo);
         }
 #ifdef SA_SIGINFO
         else if (oldact->sa_sigaction &&
                  (oldact->sa_flags & SA_SIGINFO) == SA_SIGINFO) {
+            __android_log_print(ANDROID_LOG_INFO, "Gecko", "nsProfileLock::FatalSignalHandler is calling sa_sigaction=%p", oldact->sa_sigaction);
             oldact->sa_sigaction(signo, info, context);
         }
 #endif
         else if (oldact->sa_handler && oldact->sa_handler != SIG_IGN)
         {
             oldact->sa_handler(signo);
         }
     }
 
     // Backstop exit call, just in case.
+    __android_log_print(ANDROID_LOG_INFO, "Gecko", "nsProfileLock::FatalSignalHandler is calling _exit(%d)", signo);
Can you attach a new debug log with the following patch applied?

diff --git a/mozglue/linker/CustomElf.cpp b/mozglue/linker/CustomElf.cpp
--- a/mozglue/linker/CustomElf.cpp
+++ b/mozglue/linker/CustomElf.cpp
@@ -737,16 +737,17 @@ CustomElf::Relocate()
     if (symtab_index != ELF_R_SYM(rel->r_info)) {
       symtab_index = ELF_R_SYM(rel->r_info);
       const Sym sym = symtab[symtab_index];
       if (sym.st_shndx != SHN_UNDEF) {
         symptr = GetPtr(sym.st_value);
       } else {
         /* TODO: handle symbol resolving to nullptr vs. being undefined. */
         symptr = GetSymbolPtrInDeps(strtab.GetStringAt(sym.st_name));
+       DEBUG_LOG("Resolved %s to %p", strtab.GetStringAt(sym.st_name), symptr);
       }
     }
 
     if (symptr == nullptr)
       WARN("%s: Relocation to NULL @0x%08" PRIxAddr,
           GetPath(), rel->r_offset);
 
     /* Apply relocation */
diff --git a/mozglue/linker/ElfLoader.cpp b/mozglue/linker/ElfLoader.cpp
--- a/mozglue/linker/ElfLoader.cpp
+++ b/mozglue/linker/ElfLoader.cpp
@@ -919,16 +919,17 @@ Divert(T func, T new_func)
     start = trampoline + 6;
   }
 
   size_t len = sizeof(trampoline) - (start - trampoline);
   EnsureWritable w(reinterpret_cast<void *>(addr), len + sizeof(void *));
   memcpy(reinterpret_cast<void *>(addr), start, len);
   *reinterpret_cast<void **>(addr + len) = FunctionPtr(new_func);
   cacheflush(addr, addr + len + sizeof(void *), 0);
+  DEBUG_LOG("Diverting function @%p", reinterpret_cast<void *>(addr));
   return true;
 #else
   return false;
 #endif
 }
 #else
 #define sys_sigaction sigaction
 template <typename T>
Flags: needinfo?(m_kato)
Attached file MOZ_DEBUG_LINKER=1 with commnet #16 (obsolete) —
Flags: needinfo?(m_kato)
(In reply to Makoto Kato (:m_kato) from comment #17)
> Created attachment 8457825 [details]
> MOZ_DEBUG_LINKER=1 with commnet #16

This message:
+  DEBUG_LOG("Diverting function @%p", reinterpret_cast<void *>(addr));

doesn't show up in your log. Can you figure why?
humm, rebuild with clobber.
Flags: needinfo?(m_kato)
It seems to be too early point to output log.  Because it cannot output "SEGVHandler latency" too and SEGVHandler instance is global value.

glandium, is "SEGVHandler latency" outputed on your device?
Flags: needinfo?(m_kato)
(In reply to Makoto Kato (:m_kato) from comment #20)
> It seems to be too early point to output log.  Because it cannot output
> "SEGVHandler latency" too and SEGVHandler instance is global value.
> 
> glandium, is "SEGVHandler latency" outputed on your device?

Interestingly, no.
always true on isVerbose() at force.

I confirmed the following is logged.

I/GeckoLinker(27972): Diverting function @0xb6f60770
Attachment #8457825 - Attachment is obsolete: true
Could you attach a new log for the same build, including the Diverting function message, as well as a copy of /system/lib/libc.so and a dump of /proc/$pid/maps for the fennec process (which, obviously, you'd need to get before the crash ; you may want to open(/proc/self/maps) and dump its content from fennec itself)
Flags: needinfo?(m_kato)
Attached file /proc/*/maps
Flags: needinfo?(m_kato)
Attached file libc.so on Android L
Attached file logcat
Assignee: snorp → mh+mozilla
Android L added a libsigchain library it LD_PRELOADs. That library exposes
a different sigaction than libc's. It's used for ART.

faulty.lib gets its sigaction from libsigchain, but after bug 874708, the
libraries it loads simply use libc's sigaction, assuming it would be the
hooked one. In turn, this means libraries loaded by faulty.lib may
override faulty.lib's handler, which is definitely not the intent.

This essentially restores some of the code that bug 874708 removed.

An alternative fix would be to add support for LD_PRELOAD, but that has more
implications and feels more risky. This could be done, if necessary, as a
followup.
Attachment #8461188 - Flags: review?(nfroyd)
Comment on attachment 8461188 [details] [diff] [review]
Make sure libraries loaded by faulty.lib use its own sigaction

Makoto-san, can you test this patch?
Attachment #8461188 - Flags: feedback?(m_kato)
Comment on attachment 8461188 [details] [diff] [review]
Make sure libraries loaded by faulty.lib use its own sigaction

Review of attachment 8461188 [details] [diff] [review]:
-----------------------------------------------------------------

Is this going to break the Samsung GS4 case that bug 874708 was partly trying to address?
Attachment #8461188 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd (:froydnj) from comment #29)
> Is this going to break the Samsung GS4 case that bug 874708 was partly
> trying to address?

No, because it doesn't remove the hook into the system sigaction.
(the problem in that bug was that a system-linker-loaded library was calling the system sysaction ; the patch here only affects faulty.lib-loaded libraries)
Initializing XPCOM works fine. But then, Fennec still exits by exit().

Before calling ART's signal handler, signal seems to be dispatched to nsProfileLock's signal handler.  so it calls exit()....
Comment on attachment 8461188 [details] [diff] [review]
Make sure libraries loaded by faulty.lib use its own sigaction

initialization of XPCOM works fine, but nsProfileLock's signal handler still catches SIGSEGV for ART?. so Fennec exits unexpectedly.  I attach logcat file.
Attachment #8461188 - Flags: feedback?(m_kato)
(In reply to Makoto Kato (:m_kato) from comment #32)
> Created attachment 8461276 [details]
> logcat w/ glandium's fix
> 
> Initializing XPCOM works fine. But then, Fennec still exits by exit().
> 
> Before calling ART's signal handler, signal seems to be dispatched to
> nsProfileLock's signal handler.  so it calls exit()....

Actually, there's progress, because now the linker *is* catching a segfault and dispatching it to the nsProfileLock's signal handler, as it's supposed to do.

I/GeckoLinker( 8775): Caught segmentation fault @0x0
I/GeckoLinker( 8775): Redispatching to registered handler @0x5a4288fd

So, now, you'd need to figure out what /that/ segfault actually is.

The good thing is that if I land this, the next nightly that has the patch in it should trigger the crash reporter and give a proper crash report. So let's land this.
(In reply to Mike Hommey [:glandium] from comment #34)

> The good thing is that if I land this, the next nightly that has the patch
> in it should trigger the crash reporter and give a proper crash report. So
> let's land this.

Since nsProfileLock's signal handler calls exit(), we may not be able to get crash by crash reporter.
(In reply to Makoto Kato (:m_kato) from comment #36)
> Since nsProfileLock's signal handler calls exit(), we may not be able to get
> crash by crash reporter.

Local builds don't have the crash reporter enabled by default.
(In reply to Mike Hommey [:glandium] from comment #37)
> Local builds don't have the crash reporter enabled by default.

Although I test the latest Nightly, it doesn't launch crash reporter due to comment #36.  When nsProfileLock's signal handler is called, oldact->sa_handler isn't SIG_DFL.
(In reply to Makoto Kato (:m_kato) from comment #39)
> (In reply to Mike Hommey [:glandium] from comment #37)
> > Local builds don't have the crash reporter enabled by default.
> 
> Although I test the latest Nightly, it doesn't launch crash reporter due to
> comment #36.  When nsProfileLock's signal handler is called,
> oldact->sa_handler isn't SIG_DFL.

And when oldact->sa_handler isn't SIG_DFL, oldact->sa_sigaction or oldact->sa_handler are called before exit, and one of those should be the crash reporter's. Note something fishy is likely to be happening with the ART segfault handler being registered before any of all that. Anyways, there's still a null deref crash that needs to be identified. The fact that segfault handling is kind of screwed up doesn't change much about that... except if the null deref comes from ART...
(In reply to Mike Hommey [:glandium] from comment #41)
> Makoto-san, can you test this build?
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-
> ce8aa1d9a6a4/try-android/fennec-34.0a1.en-US.android-arm.apk

Installed last night. Finally have working Firefox on L! Posting this using that apk :)

Thanks.
(In reply to Mike Hommey [:glandium] from comment #41)
> Makoto-san, can you test this build?
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-
> ce8aa1d9a6a4/try-android/fennec-34.0a1.en-US.android-arm.apk

It works on Android L.

(In reply to epg from comment #42)
> (In reply to Mike Hommey [:glandium] from comment #41)
> Installed last night. Finally have working Firefox on L! Posting this using
> that apk :)
> 
> Thanks.

thank you for verified.
Flags: needinfo?(m_kato)
Can you test this build? (this one at least doesn't break on older android versions)
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-444c8ae1110e/try-android/fennec-34.0a1.en-US.android-arm.apk
Flags: needinfo?(m_kato)
Attached file logcat
Fennec is launched, but I cannot navigate to any site. Also, I click menu in Fennec, then process is crashed.
Flags: needinfo?(m_kato)
(In reply to Mike Hommey [:glandium] from comment #47)
> Let's try this one?
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-
> 89a9d566cdb1/try-android/fennec-34.0a1.en-US.android-arm.apk

It works.
It's necessary to delay it because for the second part, we need to call dlopen,
and until recently bionic's linker dead-locked when using dlopen from a static
initializer.
Attachment #8467590 - Flags: review?(nfroyd)
Attachment #8461188 - Attachment is obsolete: true
and force-send calls to sigaction from faulty.lib-loaded libraries to the libc.
Attachment #8467604 - Flags: review?(nfroyd)
(In reply to Mike Hommey [:glandium] from comment #49)
> Backout of the original fix.
> https://hg.mozilla.org/integration/mozilla-inbound/rev/350d48e11f31

Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/350d48e11f31
During the channel meeting we discussed that we'd like to get this on 32.0b6 if at all possible by next week.
Comment on attachment 8467590 [details] [diff] [review]
Delay registration of the faulty.lib signal handler until when it's necessary

Review of attachment 8467590 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/linker/ElfLoader.cpp
@@ +970,5 @@
>  {
>    /* Initialize oldStack.ss_flags to an invalid value when used to set
>     * an alternative stack, meaning we haven't got information about the
> +   * original alternative stack and thus don't mean to restore it in
> +   * the desctructor. */

Nit: "destructor"

@@ +1072,5 @@
>    TmpData *data = reinterpret_cast<TmpData*>(that.stackPtr.get());
>    uint64_t latency = ProcessTimeStamp_Now() - data->crash_timestamp;
>    DEBUG_LOG("SEGVHandler latency: %" PRIu64, latency);
>    /* See bug 886736 for timings on different devices, 150 µs is reasonably above
> +   * the latency on "working" devices and seems to be reasonably fast not to

If you're going to re-word this, maybe "seems to be short enough to not incur..."
Attachment #8467590 - Flags: review?(nfroyd) → review+
Comment on attachment 8467604 [details] [diff] [review]
Ensure faulty.lib diverts libc's sigaction instead of a LD_PRELOADed one

Review of attachment 8467604 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozglue/linker/ElfLoader.cpp
@@ +1026,5 @@
> +
> +#if defined(ANDROID)
> +  /* Android > 4.4 comes with a sigaction wrapper in a LD_PRELOADed library
> +   * (libsigchain) for ART. That wrapper kind of does the same trick as we
> +   * do, so we need extra care in handling it.

Thank you for this comment!
Attachment #8467604 - Flags: review?(nfroyd) → review+
https://hg.mozilla.org/mozilla-central/rev/6bdc1e09802a
https://hg.mozilla.org/mozilla-central/rev/5b871cc94344
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 34
For what it's worth Google just published LPV81C, new L Preview build
This should be in last nightly, can you test it?
Flags: needinfo?(m_kato)
(In reply to Mike Hommey [:glandium] from comment #59)
> This should be in last nightly, can you test it?

The latest Nightly works on Nexus 5/L Preview
Flags: needinfo?(m_kato)
Comment on attachment 8467590 [details] [diff] [review]
Delay registration of the faulty.lib signal handler until when it's necessary

Approval Request Comment
[User impact if declined]: Required for the second patch.
[Describe test coverage new/current, TBPL]: It's been tested before landing, and is now in nightlies.
[Risks and why]: This patch delays the initialization of the signal handler used for on-demand decompression of libraries in Firefox for Android. It doesn't affect neither Firefox desktop nor B2G. This code is exercised before loading any part of Gecko and simply delays that initialization from the moment the java code loads libmozglue.so to the moment the java code loads libxul.so. That initialization is not necessary until libxul.so gets loaded. IOW, risk is very low.
[String/UUID change made/needed]: none
Attachment #8467590 - Flags: approval-mozilla-beta?
Attachment #8467590 - Flags: approval-mozilla-aurora?
Comment on attachment 8467604 [details] [diff] [review]
Ensure faulty.lib diverts libc's sigaction instead of a LD_PRELOADed one

Approval Request Comment
[User impact if declined]: Firefox for Android doesn't start at all on Android L (next version of Android)
[Describe test coverage new/current, TBPL]: This was tested before landing and is not on nightlies.
[Risks and why]: This patch changes how sigaction and signal are resolved for libraries we load with our own linker, and how the system sigaction function is hooked. However, in practice, for systems earlier than Android L, the behavior before and after the patch is exactly the same because in all cases, sigaction and signal come from libc.so. IOW, this is almost a no-op for non-Android L, and a fix for Android L. Risk is low.
[String/UUID change made/needed]: none
Attachment #8467604 - Flags: approval-mozilla-beta?
Attachment #8467604 - Flags: approval-mozilla-aurora?
Attachment #8467590 - Flags: approval-mozilla-beta?
Attachment #8467590 - Flags: approval-mozilla-beta+
Attachment #8467590 - Flags: approval-mozilla-aurora?
Attachment #8467590 - Flags: approval-mozilla-aurora+
Attachment #8467604 - Flags: approval-mozilla-beta?
Attachment #8467604 - Flags: approval-mozilla-beta+
Attachment #8467604 - Flags: approval-mozilla-aurora?
Attachment #8467604 - Flags: approval-mozilla-aurora+
Verified against L Preview LPV81C on my Nexus 7 (2013)
Status: RESOLVED → VERIFIED
Keywords: qawanted, verifyme
You need to log in before you can comment on or make changes to this bug.