Closed Bug 148453 Opened 22 years ago Closed 21 years ago

Mozilla no longer dumps core on SEGV signal!

Categories

(Core Graveyard :: Profile: BackEnd, defect)

x86
Linux
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla-bugs, Assigned: ccarlen)

References

Details

(Keywords: dataloss, regression)

Attachments

(2 files, 3 obsolete files)

I've noticed that Mozilla have recently stopped dumping the core when it
crashes; instead it just exits with exit status 11. I believe that this was
caused by the bug 76431 checkin - the SEGV signal gets intercepted by the
profile manager which then deletes the lock file and exits and core is never dumped!

This is pretty bad. It makes it significantly harder to investigate unexpected
(e.g. those that I didn't cause intentially when investigating a previous crash)
crashes.
This is a recent regression. It's also a dataloss - important stack trace
information gets lost instead of being saved in a core file.
Keywords: dataloss, regression
Keywords: mozilla1.0, nsbeta1
This will also presumably affect the stack-trace-on-crash hack which dbaron will
check in one of these days. The code chains signal handlers, so as long as we
run after talkback is initialied, talkback should still work. If the profile
code runs before, then the lock files probably aren't cleaned up - anyone with a
talkback build (or the comm source) want to test that?

Can we reset the sigaction to SIG_DFL, and then raise() from a signal handler?
Or should the final call be to abort(), not _exit ?
Keywords: mozilla1.0.1
re-assign and cc some people from bug 76431
Assignee: Matti → ccarlen
Component: Browser-General → Profile Manager BackEnd
QA Contact: imajes-qa → ktrina
I wonder whether this affects talkback as well?
this doesn't seem to affect talkback, but it prevents platforms without talkback
from getting stacktraces on unreproducible crashes.
Can bug 14989 check-ins help here?
Any progress on this? A recent Mozilla build (2002122113) crashed twice on me
over the last two days without me doing anything distinctive, and I have no idea
what's wrong. :-( It would be *really* nice to get the core dumps back!
Depends on: 76431
Keywords: mozilla1.3
This patch ammends the current nsProfileAccess signal handler to check whether
the old handler was SIG_DFL and if so, then restore that SIG_DFL sigaction and
re-raise the signal to make sure the default action is executed.

This ensures that the core is actually dumped on SIGSEGV/SIGABRT/etc.
Comment on attachment 110048 [details] [diff] [review]
When old action was SIG_DFL, make sure that default handler is actually executed.

Please review. Thanks!
Attachment #110048 - Flags: superreview?
Attachment #110048 - Flags: review?
Comment on attachment 110048 [details] [diff] [review]
When old action was SIG_DFL, make sure that default handler is actually executed.

Oops, made a typo, fixing.
Attachment #110048 - Attachment is obsolete: true
Attachment #110048 - Flags: superreview?
Attachment #110048 - Flags: review?
Fixed a small typo in the patch.
Attachment #110116 - Flags: superreview?
Attachment #110116 - Flags: review?
Attachment #110116 - Flags: superreview?(dveditz)
Attachment #110116 - Flags: superreview?
Attachment #110116 - Flags: review?(blizzard)
Attachment #110116 - Flags: review?
I doubt it (although I do not really know how the talkback works). My code only
makes sure that the default handler is executed properly when there was no
non-default handler specified before the profile was locked. If talkback is
implemented as a signal handler, then profile locking should start with
non-default handler and my code shouldn't change anything...

But if something really strange is going with signal handling, it might help, I
do not know.
Flags: blocking1.3b?
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: major → critical
Flags: blocking1.3b? → blocking1.3b-
QA Contact: ktrina → gbush
Updating for bitrot.
Attachment #110116 - Attachment is obsolete: true
Comment on attachment 113656 [details] [diff] [review]
When original action was SIG_DFL, make sure the default handler is still executed before we exit.

It's a pretty simple fix, please review.
Attachment #113656 - Flags: superreview?(dveditz)
Attachment #113656 - Flags: review?(blizzard)
Attachment #110116 - Flags: superreview?(dveditz)
Attachment #110116 - Flags: review?(blizzard)
Comment on attachment 113656 [details] [diff] [review]
When original action was SIG_DFL, make sure the default handler is still executed before we exit.

Blizzard thinks dbaron should look at it.
Attachment #113656 - Flags: review?(blizzard) → review?(dbaron)
removing obsolete mozilla1.0 keywords and the misleading "dataloss" (which
applies to losing user data not an inability to get debugging core dumps).
Accordingly bumping severity back down to normal, but it would be nice to get fixed.
Severity: critical → normal
adt: nsbeta1-
Keywords: nsbeta1nsbeta1-
Comment on attachment 113656 [details] [diff] [review]
When original action was SIG_DFL, make sure the default handler is still executed before we exit.

>                 struct sigaction act, oldact;
>                 act.sa_handler = FatalSignalHandler;
>-                act.sa_flags = 0;
>+                act.sa_flags = SA_NOMASK;
>                 sigfillset(&act.sa_mask);

I'm surprised this does anything useful, given that |man sigaction| seems to
suggest that, with the full |sa_mask|, it won't, since "In addition [to
sa_mask], the signal which triggered the handler will be blocked, unless the
SA_NODEFER or SA_NOMASK flags are used."  But I'm guessing that one of them is
undone by the new |sigaction| call and the other isn't, or something like that?

But it looks like this code was moved in revision 1.80.  Could you attach a new
patch?	Sorry to take so long to get to this...
Attachment #113656 - Flags: superreview?(dveditz)
Attachment #113656 - Flags: superreview+
Attachment #113656 - Flags: review?(dbaron)
Attachment #113656 - Flags: review?(ccarlen)
Comment on attachment 113656 [details] [diff] [review]
When original action was SIG_DFL, make sure the default handler is still executed before we exit.

>@@ -2131,7 +2135,7 @@
>                 // because mozilla is run via nohup.
>                 struct sigaction act, oldact;
>                 act.sa_handler = FatalSignalHandler;
>-                act.sa_flags = 0;
>+                act.sa_flags = SA_NOMASK;

I moved the patch to where this code is now. Problem is, SA_NOMASK doesn't
exist on OS X. Then probably not BSD either? If this is really needed, you need
to figure out the #ifdefs.
Attachment #113656 - Flags: review?(ccarlen) → review-
I disagree with comment #18---"dataloss" is not necessarily
misleading, it can be right on the point because there 
is _no_ autosave (see bug 16359 and bug 16360). 

My wife lost several hours of work yesterday and I failed 
to find any of it in kernel memory (not sure why, probably 
destroyed it while making a copy of it; I've been successful 
before with other programs crashing). A coredump, I believe,
would have allowed me to recover her text fairly easily.

My Mozilla 1.4 for GNU/Linux does not dump core regardless of 
whether the talkback agent is or is not in use. If people want
coredumps (which they show by raising the ulimit for it from
the normally default 0), then they should get coredumps. Please
fix this. It may be a weird way to recover data but it will be
something. Autosave hasn't happened for 3+ years already. This
should be easier to fix, I think.
Severity: normal → critical
Flags: blocking1.6a?
Flags: blocking1.5?
Flags: blocking1.4.2?
Flags: blocking1.4.1?
Keywords: mozilla1.3dataloss
There's no need to set blocking1.6a yet, the trunk is still open for checkins.

/be
Status: NEW → ASSIGNED
Flags: blocking1.6a?
Flags: blocking1.5?
Flags: blocking1.4.1?
Attachment #113656 - Attachment is obsolete: true
Comment on attachment 133812 [details] [diff] [review]
compiles on solaris and linux

With the patch:
$ ./run-mozilla.sh ./xpcshell
js> Components.classes['@mozilla.org/generic-factory;1'].createInstance()
Segmentation Fault - core dumped
$ uname -a
SunOS DEATHSTARII 5.9 Generic_112233-08 sun4u sparc SUNW,Ultra-80

Yes I have a patch for generic factory too, but it's known broken today, so it
was convenient :)
Attachment #133812 - Flags: superreview?(dbaron)
Attachment #133812 - Flags: review?(bryner)
Comment on attachment 133812 [details] [diff] [review]
compiles on solaris and linux

Sure, sr=dbaron, but don't invent new bracing styles.

    ...
} else if (...)
{
    ...

isn't normal.  Prevailing style here suggests adding a break before the else
and a break before the { following the if.
Attachment #133812 - Flags: superreview?(dbaron) → superreview+
Comment on attachment 133812 [details] [diff] [review]
compiles on solaris and linux

And how about space after , in argument lists (sigaction)?

/be
Comment on attachment 133812 [details] [diff] [review]
compiles on solaris and linux

I'm a little concerned about signal handling recursion should we ever crash in
the signal handler, due to NODEFER being used.	Could you instead just unblock
the signal before calling raise() by calling sigprocmask() ?
unblock the signal in question just before invoking the default handler
Comment on attachment 134344 [details] [diff] [review]
this is what i had in mind

>+        if (oldact->sa_handler == SIG_DFL) {
>+            // Make sure the default sig handler is executed

period at end of comment sentence.

>+            // We need it to get Mozilla to dump core.
>+            sigaction(signo,oldact,NULL);
>+
>+            // Now that we've restored the default handler, unmask the
>+            // signal and invoke it.
>+

Why the extra blank line here, but not previously.  I'd say axe this one, but
in any event be consistent.

>+            sigset_t unblock_sigs;
>+            sigemptyset(&unblock_sigs);
>+            sigaddset(&unblock_sigs, signo);
>+
>+            sigprocmask(SIG_UNBLOCK, &unblock_sigs, NULL);
>+
>+            raise(signo);
>+        } else if (oldact->sa_handler && oldact->sa_handler != SIG_IGN)
>+        {
>+            oldact->sa_handler(signo);
>+        }
>     }
> 
>     // Backstop exit call, just in case.

Sorry for the drive-by nit-picking, it looks good to me otherwise.  Timeless,
darin?

/be
Comment on attachment 134344 [details] [diff] [review]
this is what i had in mind

this patch makes good sense to me.

to be nit picky, the brace style is not consistent near the "else" and there
should be spaces in the parameter list of sigaction ;-)

r/sr=darin fwiw
Comment on attachment 133812 [details] [diff] [review]
compiles on solaris and linux

Minusing, I prefer the patch I attached that doesn't use SA_NODEFER...
Attachment #133812 - Flags: review?(bryner) → review-
Comment on attachment 134344 [details] [diff] [review]
this is what i had in mind

Brendan, what do you think?
Attachment #134344 - Flags: review?(brendan)
Comment on attachment 134344 [details] [diff] [review]
this is what i had in mind

Fix the nit mentioned in comment 27 and r/sr=me.

/be
Attachment #134344 - Flags: review?(brendan) → review+
Attachment #134344 - Flags: superreview?(dbaron)
Comment on attachment 134344 [details] [diff] [review]
this is what i had in mind

>+        } else if (oldact->sa_handler && oldact->sa_handler != SIG_IGN)
>+        {

Put the |else| on a new line after the }, make sure someone's tested this, and
sr=dbaron.
Attachment #134344 - Flags: superreview?(dbaron) → superreview+
checked in.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Brian,

Can you check the correct version of this patch into 1.4.2 and mark fixed1.4.2
please?
Flags: blocking1.4.2? → blocking1.4.2+
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: