Last Comment Bug 644707 - Avoid calling system abort(), generate useful stack traces through a manual segfault instead
: Avoid calling system abort(), generate useful stack traces through a manual s...
Product: Core
Classification: Components
Component: XPCOM (show other bugs)
: Trunk
: x86 Linux
-- normal (vote)
: mozilla8
Assigned To: Alon Zakai (:azakai)
: Nathan Froyd [:froydnj]
Depends on:
Blocks: 664510
  Show dependency treegraph
Reported: 2011-03-24 11:47 PDT by Josh Matthews [:jdm]
Modified: 2011-08-01 05:36 PDT (History)
6 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

Raise an abort signal instead of calling abort. (1.31 KB, patch)
2011-03-24 11:48 PDT, Josh Matthews [:jdm]
no flags Details | Diff | Splinter Review
patch v2 (1.89 KB, patch)
2011-07-06 14:09 PDT, Alon Zakai (:azakai)
ted: review+
Details | Diff | Splinter Review

Description User image Josh Matthews [:jdm] 2011-03-24 11:47:50 PDT
abort can drag us into libc and screw up backtraces on tinderbox.  I've never seen JS_ASSERT fail, so let's use their method for mozalloc_abort.
Comment 1 User image Josh Matthews [:jdm] 2011-03-24 11:48:41 PDT
Created attachment 521576 [details] [diff] [review]
Raise an abort signal instead of calling abort.
Comment 2 User image Benjamin Smedberg [:bsmedberg] 2011-03-24 11:49:12 PDT
raise(SIGABRT) is not caught by the crash reporter, which is why mozalloc_abort uses it only after actually trying to crash first.
Comment 3 User image Benjamin Smedberg [:bsmedberg] 2011-03-24 11:49:46 PDT
Or maybe I'm wrong, ted knows I think.
Comment 4 User image Josh Matthews [:jdm] 2011-03-24 11:50:18 PDT
Comment on attachment 521576 [details] [diff] [review]
Raise an abort signal instead of calling abort.

Oh, boo.
Comment 5 User image Ted Mielczarek [:ted.mielczarek] 2011-03-24 12:08:35 PDT
I really don't remember, honestly. I'm pretty sure whatever we're doing for JS_ASSERT works on tinderbox. It's not hard to test, though, just stick one in somewhere and run with MOZ_CRASHREPORTER=1 in your environment, and see if you get the crash reporter.
Comment 6 User image Alon Zakai (:azakai) 2011-07-06 12:25:21 PDT
I tested on Linux, and both |abort()| and |raise(SIGABRT)| trigger the crash reporter. So that part seems fine. The issue with SIGABRT not caught is apparently limited to OS X, as the source comment says in that file. The patch doesn't change anything on OS X, so should be fine, I am in favor of moving forward with it.

But I do have some other concerns:

1. There are other places that call abort(), not just mozalloc_abort. Should we change all of them? Also, some are in things like jemalloc, chromium ipc code, etc. - not our core codebase. What do we want to do about those?

2. I didn't actually check if crash report stack traces look better with this. One thing that concerns me is that we will still have a stack trace inside libc, won't we? Just it will be inside raise() instead of abort(). That is presumably less deep (if abort calls raise), but it doesn't seem guaranteed to work, unless I am missing something?
Comment 7 User image Ted Mielczarek [:ted.mielczarek] 2011-07-06 13:14:38 PDT
The other possibility here is just to use TouchBadMemory() like we do on OS X. That will give us a reliable crash with a stack entirely in libxul.
Comment 8 User image Alon Zakai (:azakai) 2011-07-06 14:09:12 PDT
Created attachment 544339 [details] [diff] [review]
patch v2

This patch overrides the system abort() with a redirect to mozalloc_abort, in order to catch all calls to abort() in the browser. It also tries to touch bad memory first, then raise(SIGABORT) (except on OS X where this is known not to work), then _exit, in that order.

This seems like the safest thing to do, or can we do better somehow?
Comment 9 User image Josh Matthews [:jdm] 2011-07-06 18:00:47 PDT
I went with raise(SIGABORT) because every crash stack I've seen using JS_ASSERT has worked just fine.
Comment 10 User image Alon Zakai (:azakai) 2011-07-07 10:06:54 PDT
The patch does try raise(SIGABORT), after trying to manually segfault. Manual segfaults seem most likely to work since the stack trace will be entirely in our code.

But in any case, I am fine with either way, both seem likely to be improvements over the current situation. ted, which do you prefer?
Comment 11 User image Ted Mielczarek [:ted.mielczarek] 2011-07-07 10:15:35 PDT
I'm in favor of whatever gets us the best stack traces. Have you tested this on Android to see what the stacks look like?
Comment 12 User image Alon Zakai (:azakai) 2011-07-07 10:37:51 PDT
I don't have access to an Android device at the moment, I should be able to test this in a few days.
Comment 13 User image Ted Mielczarek [:ted.mielczarek] 2011-07-07 13:45:10 PDT
Comment on attachment 544339 [details] [diff] [review]
patch v2

Review of attachment 544339 [details] [diff] [review]:

Seems reasonable.

::: memory/mozalloc/mozalloc_abort.cpp
@@ +81,2 @@
> +    // On *NIX platforms the preffered way to abort is by touching bad memory,

Comment 14 User image Alon Zakai (:azakai) 2011-07-07 14:14:40 PDT
Pushed to m-i

For posterity's sake, we talked on irc and ted ran some tests which showed that we get useful stack data using manual segfaults, but not with raise(SIGABORT) or abort(), confirming the reasoning before.
Comment 15 User image Marco Bonardo [::mak] 2011-07-08 05:55:19 PDT
Comment 16 User image Karl Tomlinson (:karlt) 2011-07-31 19:26:42 PDT
I wondered what the reason was for interposing the global abort symbol even for aborts from within system libraries, because we end up depending on stack scanning through those system libraries anyway, but I guess it may be useful as it means that all aborts will all have the same signature instead of different operating systems having slightly different offsets within
Comment 17 User image Ted Mielczarek [:ted.mielczarek] 2011-08-01 05:36:46 PDT
Note that we could also add symbols to our symbol store for linux-gate. Breakpad has some outdated ones in the source tree:

I landed a patch recently that made linux-gate get a reasonable debug ID again, but I didn't take the time to generate symbol files for it.

Note You need to log in before you can comment on or make changes to this bug.