Closed Bug 1043112 Opened 6 years ago Closed 5 years ago

[B2G] Add support for core dumps in custom builds

Categories

(Firefox OS Graveyard :: GonkIntegration, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
FxOS-S3 (24Jul)

People

(Reporter: aosmond, Assigned: aosmond)

References

Details

Attachments

(3 files, 12 obsolete files)

44 bytes, text/x-github-pull-request
dhylands
: review+
Details | Review
50 bytes, text/x-github-pull-request
aosmond
: review+
Details | Review
1.84 KB, patch
aosmond
: review+
Details | Diff | Splinter Review
Presently when a process crashes on B2G devices, the only trace generated is in the logcat. The build process and run-gdb scripts should be updated to permit enabling core dumping of processes for development purposes. This should not be enabled by default on production builds.
Status: NEW → ASSIGNED
QA Contact: aosmond
Attached file Pull request on B2G (obsolete) —
Attached file Pull request on platform_system_core (obsolete) —
Attached file Pull request on gonk-misc (obsolete) —
Assignee: nobody → aosmond
QA Contact: aosmond
Comment on attachment 8461289 [details] [review]
Pull request on gonk-misc

mwu: I have been able to generate coredumps by manually changing system/core/rootfiles/init.rc to import init.b2g.debug.rc, however I am having trouble figuring out how to automate the patching of said file.

There is an existing mechanism to patch init.rc via the TARGET_PROVIDES_B2G_INIT_RC flag. However if I try to follow the same method and simply always patch init.rc, I see that it does do it, but the file is then overwritten by the original, presumably because there is no way for make to know there is a dependency between gonk-misc/init.b2g.rc and system/core/rootfiles/init.rc. Ideally I would toggle coredumps based on some developer controlled flag (i.e. say B2G_DEBUG=1 or B2G_DEBUG_COREDUMP=1 in .userconfig) but even doing it always is challenging enough :).

Any suggestions would be appreciated! I'm starting to think the best way to handle this is to just put in a .patch file, and say please manually patch system/core/rootfiles/init.rc using this file and build again...
Attachment #8461289 - Flags: feedback?(mwu)
Would it work to have b2g.sh enable core dumps, instead of init.rc?
(In reply to Jed Davis [:jld] from comment #5)
> Would it work to have b2g.sh enable core dumps, instead of init.rc?

We might be able to move some of it but the following bit is the crucial part:

on post-fs-data
    # Make processes dumpable
    setrlimit 4 -1 -1

The coredump file size limit is inherited by child processes, so if we moved the equivalent of that into the b2g.sh (ulimit -c unlimited), then only processes spawned by that shell script will get cores. That gets some of the value, but not all the way for services (at least) I care about, such as mediaserver.
We don't need to have this in a separate init.rc file. It should be part of init.b2g.rc and trigger on some property.
(In reply to Michael Wu [:mwu] from comment #7)
> We don't need to have this in a separate init.rc file. It should be part of
> init.b2g.rc and trigger on some property.

I agree that would be ideal, but:

1) It looks like property events are triggered after the services are started, and so I don't think they won't inherit the correct rlimit 4 setting.
2) init.b2g.rc is not always used. On flame for example, it doesn't import it and so it isn't executed as far as I can tell.
> I agree that would be ideal, but:
> 
> 1) It looks like property events are triggered after the services are
> started, and so I don't think they won't inherit the correct rlimit 4
> setting.
> 2) init.b2g.rc is not always used. On flame for example, it doesn't import
> it and so it isn't executed as far as I can tell.

Err 1), that is to say I don't think they *will* inherit the correct rlimit 4 setting. Processes started after should however.
(In reply to Michael Wu [:mwu] from comment #7)
> We don't need to have this in a separate init.rc file. It should be part of
> init.b2g.rc and trigger on some property.

Hm, there is one possibility that came to mind although testing today has revealed there are some significant roadblocks. In theory if we trigger on a property, we can update the rlimit so that future processes are set, and *then* execute a script which goes through each existing process and sets the max core size using the prlimit API (supported by our kernel). (The only alternative to prlimit is to write to /proc/<pid>/limits but support for this is not in the mainline or Android kernels.)

However our bionic libc version does not support prlimit. It was added early this year although it does seem to have some limitations which may or may not be relevant:

https://github.com/android/platform_bionic/commit/0f461e35f63200641fc53bba222845a84589c024

I'm not sure how easy it would be to get our partners to push this change into their versions of libc since we pull that from their repositories.

And with all that said, we still need a way to shove some changes into the init.rc/init.b2g.rc which isn't clear to me as I pointed out in comment 8. I would be happy if we had a less dynamic solution requiring a rebuild today since we have no simple means to get core files, and could look at improving it in the future (maybe the move to KK will bring some of the necessary changes??).
(In reply to Andrew Osmond [:aosmond] from comment #8)
> 2) init.b2g.rc is not always used. On flame for example, it doesn't import
> it and so it isn't executed as far as I can tell.

That's not a big deal. init.b2g.rc is the thing we support. We can port the change to devices which don't use it.
Attached file Pull request on device-flame (obsolete) —
Turn on using the modified gonk-misc version of init.rc for flame.
(In reply to Michael Wu [:mwu] from comment #11)
> (In reply to Andrew Osmond [:aosmond] from comment #8)
> > 2) init.b2g.rc is not always used. On flame for example, it doesn't import
> > it and so it isn't executed as far as I can tell.
> 
> That's not a big deal. init.b2g.rc is the thing we support. We can port the
> change to devices which don't use it.

I see what you mean now. There are a number of places where TARGET_PROVIDES_B2G_INIT_RC and TARGET_PROVIDES_INIT_RC are set, and I didn't realize *we* controlled the makefile which had the final setting. I've added a new pull request to turn change TARGET_PROVIDES_INIT_RC to true on flame to avoid trampling our updated version :).

However since the vendor init.rc under system/core/rootdir itself already includes the bits we need from init.b2g.rc (specifically the "service b2g" definition), TARGET_PROVIDES_B2G_INIT_RC should remain false.

As such, I modified gonk-misc/Android.mk to generate init.rc when TARGET_PROVIDES_INIT_RC=true. It will import init.b2g.rc if TARGET_PROVIDES_B2G_INIT_RC=false. It will import init.b2g.debug.rc if B2G_DEBUG_CORE is set; perhaps this should be B2G_DEBUG? The core files can be *quite* large and contain personal information depending on the process dumped, so I'm not sure what you think about turning it on in debug builds (based on B2G_DEBUG) by default.

I'm willing to look at this again once we get the libc support to make a runtime switch based on a property possible. Another thing to consider is that if debug symbols are not created (optimized/release build?), the cores are probably useless so we probably only want to support it when B2G_DEBUG is set anyways.
Comment on attachment 8461286 [details] [review]
Pull request on platform_system_core

Obsoleted by gonk-misc/device-flame pull requests
Attachment #8461286 - Attachment is obsolete: true
This won't core dump assertion failures by default because crash report catches SIGABRT.
(In reply to Wander Lairson Costa :wcosta from comment #15)
> This won't core dump assertion failures by default because crash report
> catches SIGABRT.

If I run "kill -s SIGABRT <process>", I see the crash reporter pops up an indicator on screen, but the core dump is still generated. It also captures SIGSEGV with the same effect. Are there other corner cases I'm missing?

I was lucky (?) that I hit an abort error when playing with the camera app, although oddly it claims it is SIGSEGV which brought it down, despite the trace being clear it is an abort:

Program terminated with signal 11, Segmentation fault.
#0  0xb603db36 in ?? ()
warning: Breakpoint address adjusted from 0xb6f3db89 to 0xb6f3db88.
warning: Dynamic linker breakpoint function now found. Pending
breakpoints in dynamically-loaded libraries will now work.
(gdb) bt
#0  0xb603db36 in mozalloc_abort (msg=<optimized out>) at ../../../../B2G_HG/b2g-inbound/memory/mozalloc/mozalloc_abort.cpp:30
#1  0xb489d270 in Abort (
    aMsg=0xbed66434 "[Child 1107] ###!!! ABORT: Element::UnbindFromTree should have destroyed the element transitions object: 'collection->mElement->GetCrossShadowCurrentDoc() == mPresContext->Document()', file ../../../."...) at ../../../../B2G_HG/b2g-inbound/xpcom/base/nsDebugImpl.cpp:482
Attachment #8461285 - Flags: review?(mwu)
Comment on attachment 8461289 [details] [review]
Pull request on gonk-misc

Updated the pull request to trigger turning on core dumping based on a property update, which is not set by default (so no cores). Added utility to expose the kernel system call that is missing from libc.
Attachment #8461289 - Flags: review?(mwu)
Attachment #8461289 - Flags: feedback?(mwu)
Attachment #8461289 - Flags: feedback+
Attachment #8464745 - Flags: review?(mwu)
(In reply to Andrew Osmond [:aosmond] from comment #16)
> (In reply to Wander Lairson Costa :wcosta from comment #15)
> > This won't core dump assertion failures by default because crash report
> > catches SIGABRT.
> 
> If I run "kill -s SIGABRT <process>", I see the crash reporter pops up an
> indicator on screen, but the core dump is still generated. It also captures
> SIGSEGV with the same effect. Are there other corner cases I'm missing?
> 
> I was lucky (?) that I hit an abort error when playing with the camera app,
> although oddly it claims it is SIGSEGV which brought it down, despite the
> trace being clear it is an abort:
> 
> Program terminated with signal 11, Segmentation fault.
> #0  0xb603db36 in ?? ()
> warning: Breakpoint address adjusted from 0xb6f3db89 to 0xb6f3db88.
> warning: Dynamic linker breakpoint function now found. Pending
> breakpoints in dynamically-loaded libraries will now work.
> (gdb) bt
> #0  0xb603db36 in mozalloc_abort (msg=<optimized out>) at
> ../../../../B2G_HG/b2g-inbound/memory/mozalloc/mozalloc_abort.cpp:30
> #1  0xb489d270 in Abort (
>     aMsg=0xbed66434 "[Child 1107] ###!!! ABORT: Element::UnbindFromTree
> should have destroyed the element transitions object:
> 'collection->mElement->GetCrossShadowCurrentDoc() ==
> mPresContext->Document()', file ../../../."...) at
> ../../../../B2G_HG/b2g-inbound/xpcom/base/nsDebugImpl.cpp:482

This is really weird, I tried sending SIGABRT to the b2g process and I got no dumps. When I send SIGQUIT, it core dumps.
(In reply to Wander Lairson Costa :wcosta from comment #18)
> This is really weird, I tried sending SIGABRT to the b2g process and I got
> no dumps. When I send SIGQUIT, it core dumps.

Ah, this seems to be the difference between b2g and its child processes plugin-container. I can reproduce what you are referring to now.

It looks like that signal (along with SIGSEGV and SIGILL) is captured in toolkit/xre/nsSigHandlers.cpp and it doesn't pass the signal along to the kernel when it shuts down. It should not a problem to work around however; I'll see if I can whip together a patch today.
(In reply to Andrew Osmond [:aosmond] from comment #19)
> (In reply to Wander Lairson Costa :wcosta from comment #18)
> > This is really weird, I tried sending SIGABRT to the b2g process and I got
> > no dumps. When I send SIGQUIT, it core dumps.
> 
> Ah, this seems to be the difference between b2g and its child processes
> plugin-container. I can reproduce what you are referring to now.
> 
> It looks like that signal (along with SIGSEGV and SIGILL) is captured in
> toolkit/xre/nsSigHandlers.cpp and it doesn't pass the signal along to the
> kernel when it shuts down.

Yes, this is exactly what I observed.

> It should not a problem to work around however;
> I'll see if I can whip together a patch today.

This would be great, because currently we have no core dumps for assertion failures on debug builds.
(In reply to Wander Lairson Costa :wcosta from comment #20)
> (In reply to Andrew Osmond [:aosmond] from comment #19)
> > (In reply to Wander Lairson Costa :wcosta from comment #18)
> > > This is really weird, I tried sending SIGABRT to the b2g process and I got
> > > no dumps. When I send SIGQUIT, it core dumps.
> > 
> > Ah, this seems to be the difference between b2g and its child processes
> > plugin-container. I can reproduce what you are referring to now.
> > 
> > It looks like that signal (along with SIGSEGV and SIGILL) is captured in
> > toolkit/xre/nsSigHandlers.cpp and it doesn't pass the signal along to the
> > kernel when it shuts down.
> 
> Yes, this is exactly what I observed.
> 
> > It should not a problem to work around however;
> > I'll see if I can whip together a patch today.
> 
> This would be great, because currently we have no core dumps for assertion
> failures on debug builds.

This patch should cause it to generate core dumps as expected now. Reading the comments, it seems like it *should* already be, but if we call exit, it seems to swallow any pending signals. I'm not sure if it behaves differently in other circumstances, so I just changed the behaviour on MOZ_B2G builds.
Attachment #8471845 - Flags: review?(mwu)
Attachment #8471845 - Flags: review?(mwu) → review?(benjamin)
Comment on attachment 8471845 [details] [diff] [review]
Force b2g process to generate core dump

I am about 99% certain that this will break release crashreporting, which also uses a signal handler. If crash reporting is enabled, we must use that instead of saving a core.
Attachment #8471845 - Flags: review?(benjamin) → review-
(In reply to Andrew Osmond [:aosmond] from comment #21)
> Created attachment 8471845 [details] [diff] [review]
> Force b2g process to generate core dump
> 
> (In reply to Wander Lairson Costa :wcosta from comment #20)
> > (In reply to Andrew Osmond [:aosmond] from comment #19)
> > > (In reply to Wander Lairson Costa :wcosta from comment #18)
> > > > This is really weird, I tried sending SIGABRT to the b2g process and I got
> > > > no dumps. When I send SIGQUIT, it core dumps.
> > > 
> > > Ah, this seems to be the difference between b2g and its child processes
> > > plugin-container. I can reproduce what you are referring to now.
> > > 
> > > It looks like that signal (along with SIGSEGV and SIGILL) is captured in
> > > toolkit/xre/nsSigHandlers.cpp and it doesn't pass the signal along to the
> > > kernel when it shuts down.
> > 
> > Yes, this is exactly what I observed.
> > 
> > > It should not a problem to work around however;
> > > I'll see if I can whip together a patch today.
> > 
> > This would be great, because currently we have no core dumps for assertion
> > failures on debug builds.
> 
> This patch should cause it to generate core dumps as expected now. Reading
> the comments, it seems like it *should* already be, but if we call exit, it
> seems to swallow any pending signals. I'm not sure if it behaves differently
> in other circumstances, so I just changed the behaviour on MOZ_B2G builds.

hmm, not exactly as expected. It indeed generates a core dump, but the stack trace recurs from the
kill(getpid(), signo) call:

#0  kill () at bionic/libc/arch-arm/bionic/kill.S:45
#1  0xb55738e4 in nsProfileLock::FatalSignalHandler (signo=11, info=0xbea2e008, context=0xbea2e088) at /home/wander/work/gecko-dev/profile/dirserviceprovider/nsProfileLock.cpp:192
#2  0xb58785a8 in AsmJSFaultHandler (context=0xbea2e088, info=0xbea2e008, signum=11) at ../../../../gecko-dev/js/src/asmjs/AsmJSSignalHandlers.cpp:975
#3  AsmJSFaultHandler (signum=11, info=0xbea2e008, context=0xbea2e088) at ../../../../gecko-dev/js/src/asmjs/AsmJSSignalHandlers.cpp:957
#4  0xb6f0965c in ?? ()
#5  0xb6f0965c in ?? ()
(In reply to Wander Lairson Costa [:wcosta] from comment #23)
> hmm, not exactly as expected. It indeed generates a core dump, but the stack
> trace recurs from the
> kill(getpid(), signo) call:

That's probably a SIGSYS from the sandbox.  You should call raise(3), which uses tgkill(2) to post the signal to the calling thread, rather than an unspecified thread in the same process as kill(2) does.
(In reply to Wander Lairson Costa [:wcosta] from comment #23)
> #3  AsmJSFaultHandler (signum=11, info=0xbea2e008, context=0xbea2e088) at
> ../../../../gecko-dev/js/src/asmjs/AsmJSSignalHandlers.cpp:957
> #4  0xb6f0965c in ?? ()
> #5  0xb6f0965c in ?? ()

I commented a bit too quickly earlier.  There are several things going on here:

1. As seen in the quoted text above, gdb doesn't know how to unwind out of the signal handler on this platform.  So if you want to re-raise a SIGSEGV/SIGBUS that was caused by an actual fault (rather than by kill/tgkill; this is available in the si_code field of the siginfo), you have to return from the signal handler after resetting the signal handling.

2. kill(getpid(), ...) shouldn't be used in a multithreaded process, as it can signal a thread other than the caller; see above about raise().

3. kill(...) is forbidden in sandboxed content processes and will raise SIGSYS; tgkill is allowed only if the tgid is the calling process.
(In reply to Jed Davis [:jld] from comment #25)
> (In reply to Wander Lairson Costa [:wcosta] from comment #23)
> > #3  AsmJSFaultHandler (signum=11, info=0xbea2e008, context=0xbea2e088) at
> > ../../../../gecko-dev/js/src/asmjs/AsmJSSignalHandlers.cpp:957
> > #4  0xb6f0965c in ?? ()
> > #5  0xb6f0965c in ?? ()
> 
> I commented a bit too quickly earlier.  There are several things going on
> here:
> 
> 1. As seen in the quoted text above, gdb doesn't know how to unwind out of
> the signal handler on this platform.  So if you want to re-raise a
> SIGSEGV/SIGBUS that was caused by an actual fault (rather than by
> kill/tgkill; this is available in the si_code field of the siginfo), you
> have to return from the signal handler after resetting the signal handling.
> 
> 2. kill(getpid(), ...) shouldn't be used in a multithreaded process, as it
> can signal a thread other than the caller; see above about raise().
> 
> 3. kill(...) is forbidden in sandboxed content processes and will raise
> SIGSYS; tgkill is allowed only if the tgid is the calling process.

Returning from the handler to re-raise the signal sounds really complex in some common scenarios. How about adding a build switch to not handle faulty signals at all? I guess would be more work, but less tricky.
(In reply to Wander Lairson Costa [:wcosta] from comment #26)
> Returning from the handler to re-raise the signal sounds really complex in
> some common scenarios. How about adding a build switch to not handle faulty
> signals at all? I guess would be more work, but less tricky.

It's not that complicated; Breakpad already has code for it:

https://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc#324
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #22)
> Comment on attachment 8471845 [details] [diff] [review]
> Force b2g process to generate core dump
> 
> I am about 99% certain that this will break release crashreporting, which
> also uses a signal handler. If crash reporting is enabled, we must use that
> instead of saving a core.

I'm not quite sure how this is possible. The signal handler originally calls _exit at the end of the function, which will bring the process down without triggering further handlers. If the crash reporter runs, it must have already done so.

But in any event, at least with his new patch (with or without the MOZ_CRASHREPORTER define), if you look at the backtrace produced from "kill -s SIGSEGV <pid>":
(gdb) bt
#0  syscall () at bionic/libc/arch-arm/bionic/syscall.S:41
#1  0xb5564bfe in tgkill (sig=11, tid=<optimized out>, tgid=1909)
    at ../../../../../../../../../B2G_HG/b2g-inbound/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc:105
#2  google_breakpad::ExceptionHandler::SignalHandler (sig=11, info=0xbe907210, uc=0xbe907290)
    at ../../../../../../../../../B2G_HG/b2g-inbound/toolkit/crashreporter/google-breakpad/src/client/linux/handler/exception_handler.cc:346
#3  0xb555245c in nsProfileLock::FatalSignalHandler (signo=11, info=0xbe907210, context=0xbe907290)
    at /home/aosmond/dev/mozilla/B2G_HG/b2g-inbound/profile/dirserviceprovider/nsProfileLock.cpp:189
#4  0xb58594d4 in AsmJSFaultHandler (context=0xbe907290, info=0xbe907210, signum=11) at ../../../../B2G_HG/b2g-inbound/js/src/asmjs/AsmJSSignalHandlers.cpp:978
#5  AsmJSFaultHandler (signum=11, info=0xbe907210, context=0xbe907290) at ../../../../B2G_HG/b2g-inbound/js/src/asmjs/AsmJSSignalHandlers.cpp:960
#6  0xb6f48388 in ?? ()
#7  0xb6f48388 in ?? ()

It seems to me crash reporter ran just fine :).

(In reply to Jed Davis [:jld] from comment #27)
> It's not that complicated; Breakpad already has code for it:
> 
> https://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-
> breakpad/src/client/linux/handler/exception_handler.cc#324

I've been looking at this closely, reading up on _exit, details on signals and I think I may have over complicated the solution :). It *would* be more or less a copy-paste situation except for the fact that if crash reporter is present, it is doing what we need anyways! We just need to stop calling _exit which prevents the core.

Now there is something I should note in case somebody has a solution (or if this is a known limitation?). If I crash b2g with kill -s SIGSEGV, it will attempt to respawn. But until the device is rebooted, all of these respawn attempts generate a core. If this happens in normal crashes, we will quickly create a lot of core files. I could change the core_pattern definition so that it overwrites the same file each time, but that wouldn't stop the cycle, just the wasted disk space / write cycles. I will attach a stack trace in case anybody wishes to look at it.

(Note the continuous respawning happens even without any changes, so it has nothing to do with the signal handler.)
Attachment #8480274 - Flags: review?(benjamin)
Attachment #8471845 - Attachment is obsolete: true
If you have a crash loop you want to investigate, start b2g manually:

adb shell stop b2g
adb shell b2g.sh
(In reply to Michael Wu [:mwu] from comment #30)
> If you have a crash loop you want to investigate, start b2g manually:
> 
> adb shell stop b2g
> adb shell b2g.sh

Thanks Michael, that was useful :). I have resolved the crash loop and b2g is able to restart itself once more without issue. If there is a crash reporter waiting to be processed when it starts up, it would crash because we hadn't finished initializing everything yet; it wouldn't normally initialize until the first gecko child process was spawned which would be too late. I am very unfamiliar with the startup code for gecko so if there is a better place to put this, by all means make a suggestion (I was considering XRE_main in toolkit/xre/nsAppRunner.cpp).
Attachment #8480274 - Attachment is obsolete: true
Attachment #8480275 - Attachment is obsolete: true
Attachment #8480274 - Flags: review?(benjamin)
Attachment #8480854 - Flags: review?(benjamin)
Comment on attachment 8480854 [details] [diff] [review]
Force b2g process to generate core dump, v3

I don't understand the OOPInit addition. It seems like you're initializing OOP crash reporting unconditionally on startup, and that's not what we want in most cases (at least on desktop builds). Why was that added/why is it necessary?

The other bits here need to be reveiwed by somebody who understands linux syscalls, which is not really me. I'm going to pick jld out of a hat!
Attachment #8480854 - Flags: review?(jld)
Flags: needinfo?(aosmond)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #32)
> Comment on attachment 8480854 [details] [diff] [review]
> Force b2g process to generate core dump, v3
> 
> I don't understand the OOPInit addition. It seems like you're initializing
> OOP crash reporting unconditionally on startup, and that's not what we want
> in most cases (at least on desktop builds). Why was that added/why is it
> necessary?
> 
> The other bits here need to be reveiwed by somebody who understands linux
> syscalls, which is not really me. I'm going to pick jld out of a hat!

It shouldn't be unconditionally. CheckForLastRunCrash is always executed at startup. If there is an unprocessed crash report (e.g. in the scenario where I crash the main b2g process with kill -s SIGSEGV), it wants to move it to the pending directory so that the crash report can be eventually submitted (at least as far as I understand the process). It will *not* have reached the init code I added unless it finds that minidump file and successfully opens/extracts info from it.

The critical call path follows immediately after where I added the init check: CheckForLastRunCrash calls MoveToPending which calls GetPendingDir.

GetPendingDir requires that OOP should be initialized in order to continue and checks this with a MOZ_ASSERT; this check fails without the above patch, and that causes the process to go down in my debug build. b2g gets restarted by the init process (?) and the cycle starts again because the crash report has yet to be moved to the pending folder.

Note the phone won't recover without deleting the crash report files or a reflash (reboots do not help). This can get very frustrating.

Either the MOZ_ASSERT is superfluous in which case we can remove it, or we need OOPInit to be called (or redesign!).

Maybe the test case I am using is somehow flawed, using an external signal to bring it down? I do most of my work in the plugin-container context and those crash reports / core dumps are generated just fine, so I'm not too sure where the best place to try this out would be :).

wcosta: Did you experience a process crash loop due to an assert in GetPending after crashing the process using the aborts mentioned earlier in the thread?
Flags: needinfo?(aosmond) → needinfo?(wcosta)
(In reply to Andrew Osmond [:aosmond] from comment #33)
[snip]
> wcosta: Did you experience a process crash loop due to an assert in
> GetPending after crashing the process using the aborts mentioned earlier in
> the thread?

Yes I did, details on how to reproduce it at Bug 1048402 (it was closed as "won't fix").
Flags: needinfo?(wcosta)
The GetPendingDir change should be moved to a separate bug to avoid confusing this one. The basic assumption here is that GetPendingDir is only called by OOP crash reporting code. It relies on OOPInit in order to initialize `pendingDir` and so that's why we're asserting the relationship.

It seems that CheckForLastRunCrash violates that assumption because it uses GetPendingDir in an entirely separate context. I don't think that initializing all of OOP crash reporting to solve this initialization issue is the correct solution, but we should continue in the new bug so Ted can help sort this through.
Comment on attachment 8480854 [details] [diff] [review]
Force b2g process to generate core dump, v3

Review of attachment 8480854 [details] [diff] [review]:
-----------------------------------------------------------------

::: profile/dirserviceprovider/nsProfileLock.cpp
@@ +33,5 @@
>  #include <rmsdef.h>
>  #endif
>  
> +#if defined(MOZ_B2G) && !defined(MOZ_CRASHREPORTER)
> +#include <sys/syscall.h>

See below about raise().

@@ +200,5 @@
> +        case SIGQUIT:
> +        case SIGILL:
> +        case SIGABRT:
> +        case SIGSEGV:
> +#ifndef MOZ_CRASHREPORTER

This could use a comment to the effect that, if MOZ_CRASHREPORTER *is* defined, then its handler already did this (when called as `oldact->sa_sigaction`, above), so if that handler returned instead of terminating the process, then this handler should also return.

@@ +204,5 @@
> +#ifndef MOZ_CRASHREPORTER
> +            // Retrigger the signal for those that can generate a core dump
> +            signal(signo, SIG_DFL);
> +            if (info->si_code <= 0) {
> +                if (syscall(__NR_tgkill, getpid(), syscall(__NR_gettid), signo) < 0) {

This can just be `raise(signo)`, I think.

In particular, on B2G/Android we use the definition of `raise` in mozglue/build/BionicGlue.cpp to work around Android bugs, and it's exactly the above collection of syscalls.
Attachment #8480854 - Flags: review?(jld) → review+
Comment on attachment 8480854 [details] [diff] [review]
Force b2g process to generate core dump, v3

Review of attachment 8480854 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +2778,5 @@
>  
> +  if (!OOPInitialized()) {
> +    MOZ_ASSERT(NS_IsMainThread());
> +    OOPInit();
> +  }

Is this a fix for bug 820716?
Comment on attachment 8480854 [details] [diff] [review]
Force b2g process to generate core dump, v3

Please move the nsExceptionHandler.cpp changes to a different bug. This one can land with jld's review.
Attachment #8480854 - Flags: review?(benjamin)
(In reply to Jed Davis [:jld] from comment #37)
> Comment on attachment 8480854 [details] [diff] [review]
> Force b2g process to generate core dump, v3
> 
> Review of attachment 8480854 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/crashreporter/nsExceptionHandler.cpp
> @@ +2778,5 @@
> >  
> > +  if (!OOPInitialized()) {
> > +    MOZ_ASSERT(NS_IsMainThread());
> > +    OOPInit();
> > +  }
> 
> Is this a fix for bug 820716?

Yes. Well, an attempt ;).

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #38)
> Comment on attachment 8480854 [details] [diff] [review]
> Force b2g process to generate core dump, v3
> 
> Please move the nsExceptionHandler.cpp changes to a different bug. This one
> can land with jld's review.

Removed; just has the signal handler changes now.
Attachment #8480854 - Attachment is obsolete: true
Attachment #8491972 - Flags: review+
Attached file Pull request on device-flame, v1.1 (obsolete) —
Collapse changes into single commit.
Attachment #8464745 - Attachment is obsolete: true
Attachment #8464745 - Flags: review?(mwu)
Attachment #8491995 - Flags: review?(mwu)
Collapse changes into single commit.
Attachment #8461285 - Attachment is obsolete: true
Attachment #8461285 - Flags: review?(mwu)
Attachment #8491996 - Flags: review?(mwu)
Attached file Pull request on gonk-misc, v1.1 (obsolete) —
Collapse changes into single commit. Note that system/core/init/builtins.c stubbed out the exec keyword for the init process on flame-kk, so you need to run b2g-prlimit manually at present if testing (but it otherwise functions as expected).
Attachment #8461289 - Attachment is obsolete: true
Attachment #8461289 - Flags: review?(mwu)
Attachment #8491998 - Flags: review?(mwu)
Comment on attachment 8491996 [details] [review]
Pull request on B2G, v2

I think Dave might know this script better than I do at this point. Feel free to pass it back to me if it doesn't make sense.
Attachment #8491996 - Flags: review?(mwu) → review?(dhylands)
Comment on attachment 8491996 [details] [review]
Pull request on B2G, v2

I'd like to see error handling for required arguments to core. What if the extra arguments aren't provided? If they point to a file, they should error if the file doesn't exist.

Also, I should be able to use ./run-ddd.sh or ./run-gdb.sh with a core file (ddd is a graphical front-end for gdb).
Attachment #8491996 - Flags: review?(dhylands)
Comment on attachment 8491996 [details] [review]
Pull request on B2G, v2

(In reply to Dave Hylands [:dhylands] from comment #44)
> Comment on attachment 8491996 [details] [review]
> Pull request on B2G, v1.1
> 
> I'd like to see error handling for required arguments to core. What if the
> extra arguments aren't provided? If they point to a file, they should error
> if the file doesn't exist.
> 
> Also, I should be able to use ./run-ddd.sh or ./run-gdb.sh with a core file
> (ddd is a graphical front-end for gdb).

Updated.
Attachment #8491996 - Attachment description: Pull request on B2G, v1.1 → Pull request on B2G, v2
Attachment #8491996 - Flags: review?(dhylands)
Attached file Pull request on gonk-misc, v3 (obsolete) —
Update patch to avoid dependencies on init.rc.

- This makes it easier to work with all platforms (not just flame) at the cost of losing dynamic property updates.
- Set persist.debug.coredump to either "all" (same as before, all processes now will dump cores) or "b2g" (only b2g processes will dump cores), followed by a device reboot, after which it will remain persistent and automatically turn on. To disable coredumps simply clear the persist.debug.coredump property and reboot.
- Can only enable core dumps on userdebug or eng builds.
Attachment #8491995 - Attachment is obsolete: true
Attachment #8491998 - Attachment is obsolete: true
Attachment #8491995 - Flags: review?(mwu)
Attachment #8491998 - Flags: review?(mwu)
Attachment #8499950 - Flags: review?(mwu)
Comment on attachment 8499950 [details] [review]
Pull request on gonk-misc, v3

Review on PR. Looks good overall - just a few things need adjusting.
Attachment #8499950 - Flags: review?(mwu)
Comment on attachment 8499950 [details] [review]
Pull request on gonk-misc, v3

Updated based on comments.
Attachment #8499950 - Attachment description: Pull request on gonk-misc, v2 → Pull request on gonk-misc, v3
Attachment #8499950 - Flags: review?(mwu)
Comment on attachment 8499950 [details] [review]
Pull request on gonk-misc, v3

Looks good. Might want to fold the commits together before merging.
Attachment #8499950 - Flags: review?(mwu) → review+
Comment on attachment 8491996 [details] [review]
Pull request on B2G, v2

Looks reasonable - just a couple minor nits.
Attachment #8491996 - Flags: review?(dhylands) → review+
Fixed nits and squashed pull requests.
Keywords: checkin-needed
(In reply to Carsten Book [:Tomcat] from comment #52)

> https://github.com/mozilla-b2g/gonk-misc/commit/
> 8ae27b558791b8b47559a5336e797c6ca8d0f017

had to revert this change for bustage like https://treeherder.mozilla.org/ui/logviewer.html#?job_id=605782&repo=b2g-inbound
New pull request which should fix the issue on hamachi (and other devices with older kernels).
Attachment #8499950 - Attachment is obsolete: true
Attachment #8501683 - Flags: review+
Keywords: checkin-needed
(In reply to Andrew Osmond [:aosmond] from comment #54)
> Created attachment 8501683 [details] [review]
> Pull request on gonk-misc, v4
> 
> New pull request which should fix the issue on hamachi (and other devices
> with older kernels).

Nit: The actual kernel probably does support it, but the header files in that version of the Android SDK don't.  The prlimit64 syscall (at least on ARM) landed in Linux 2.6.36; every physical B2G device is at least 3.0.x, but the ICS B2G emulator is 2.6.29.

So, for future reference if we ever want this feature on older devices, something like this should work (or else fail cleanly at runtime with ENOSYS if the kernel is actually too old):

#ifndef __NR_prlimit64
#ifdef __ARM_EABI__
#define __NR_prlimit64 369
#endif
#endif
(In reply to Jed Davis [:jld] from comment #56)
> (In reply to Andrew Osmond [:aosmond] from comment #54)
> > Created attachment 8501683 [details] [review]
> > Pull request on gonk-misc, v4
> > 
> > New pull request which should fix the issue on hamachi (and other devices
> > with older kernels).
> 
> Nit: The actual kernel probably does support it, but the header files in
> that version of the Android SDK don't.  The prlimit64 syscall (at least on
> ARM) landed in Linux 2.6.36; every physical B2G device is at least 3.0.x,
> but the ICS B2G emulator is 2.6.29.
> 
> So, for future reference if we ever want this feature on older devices,
> something like this should work (or else fail cleanly at runtime with ENOSYS
> if the kernel is actually too old):
> 
> #ifndef __NR_prlimit64
> #ifdef __ARM_EABI__
> #define __NR_prlimit64 369
> #endif
> #endif

Huh, obviously I didn't look at it closely enough :). I vaguely recalled looking at the versions that introduced the API ages ago, and just assumed that's what the symbol missing was referring to. Especially without a hamachi device at hand to do testing with. Thanks for the tips!

If b2g-prlimit doesn't work, you can still set persist.debug.coredump=b2g and get core dumps for b2g/content processes only, even on the ICS B2G emulator. That depends on setting ulimit for the shell process before it spawns b2g rather than prlimit for arbitrary ones.
Blocks: 1181715
Somehow this was missed when the rest of the patches were landed. Until recently I was able to get core dumps from b2g anyways (but probably not the parent process). This should fix that in addition to letting the debugger capture signals properly.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4372d927f282
Attachment #8491972 - Attachment is obsolete: true
Attachment #8633699 - Flags: review+
Target Milestone: 2.1 S6 (10oct) → FxOS-S3 (24Jul)
You need to log in before you can comment on or make changes to this bug.