Closed Bug 398499 Opened 17 years ago Closed 17 years ago

Crash in DispatchEventToHandlers(EventTargetRec*, OpaqueEventRef*, HandlerCallRec*

Categories

(Core Graveyard :: Widget: Mac, defect)

1.8 Branch
x86
macOS
defect
Not set
critical

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: marcia, Assigned: smichaud)

References

()

Details

(Keywords: crash, regression, verified1.8.1.12, Whiteboard: [radar:5466547] 1.8.1.x-only crash)

Attachments

(5 files, 1 obsolete file)

Attached file Apple crash report (obsolete) —
Seen while testing Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.8pre) Gecko/2007100307 BonEcho/2.0.0.8pre.

STR:
1. Load the URL. Crash.

Apple Crash report attached.
Additional STR:

1. Change focus to another application, such as Thunderbird and put that window on top. Then switch back to FF and that should generate the crash.
I checked the trunk and I do not get this crash when following the same steps. So far now this is 1.8 branch only.
nominating, not sure how critical this is.
Severity: normal → critical
Flags: blocking1.8.1.8?
Not DOM Events issue. Maybe Widget: Mac?
Assignee: nobody → joshmoz
Component: DOM: Events → Widget: Mac
QA Contact: events → mac
Attached file Text version of crash
Attachment #283490 - Attachment is obsolete: true
BTW thanks to Jeremy for pointing me to this bug, he should get the credit for finding it.
Flags: blocking1.8.1.8? → blocking1.8.1.9?
1.8 is widget:mac (Carbon) and trunk is Cocoa.  This crashes in a Carbon system library (which I'm guessing is not used from Cocoa apps) and requires interaction with another app so it makes sense that it crashes only 1.8 and makes me think it's triggered in or related to widget:mac. [mid-air with smaug]

(In reply to comment #1)
> Additional STR:
> 1. Change focus to another application, such as Thunderbird and put that window
> on top. Then switch back to FF and that should generate the crash.
No need to switch back.  AFAICT this also requires mouse activity over any part of the moz app which is covered by another app.

(In reply to comment #3)
> nominating, not sure how critical this is.
fwiw, I'm 98% sure that a stack in my local CrashReporter (apple) log from march was *not* on a MediaWiki site.  And that happens to be the first Firefox crash in the log with HIToolbox 1.4.9 (the previous crash was 1.4.8 and a different stack)  So this is probably not a MediaWiki specific issue.

The CrashReporter log from marcia has a truncated stack (as does my own on OS X 10.4.10).  I'll attach a my own gdb trace for 10.4.10.  I'll walk Marcia through getting a gdb trace if anyone wants it for Leopard.

(In reply to comment #6)
> BTW thanks to Jeremy for pointing me to this bug, he should get the credit for
> finding it.
btw, thanks to Marcia to for reproducing on Leopard! :)  It was hard to miss on branch when browsing MediaWiki diffs :-{

<rdar://problem/5466547>
Whiteboard: [radar:5466547]
Can somebody confirm that this still happens reliably on a fully software-updated Mac OS X 10.5 build 9a559? Does it happen reliably on this build? How likely are users to hit this bug?
(In reply to comment #8)
> Can somebody confirm that this still happens reliably on a fully
> software-updated Mac OS X 10.5 build 9a559? Does it happen reliably on this
> build? How likely are users to hit this bug?

Are you having trouble reproducing?

Marcia repro'd on that build and it was prolly up to date but I'll have to double check with her. (there's prolly enough info in the report to determine the answer if you knew enough about their build changes but I don't :P) See attachment 283577 [details] for a CrashReporter log on Leopard.  I see I forgot to attach my gdb trace (10.4.10) after my last comment; coming right up.

This reliably happens on 10.4.10 on many tens of Mediawiki generated pages (mostly but maybe not exclusively diff pages, I forget) and therefore will likely effect the majority of OS X users that are more than just casual editors on any Mediawiki wiki.  (Haven't tried to reduce to MW specific versions yet)

FWIW, I'm 95% certain a matching stack in my log from several months back is *not* from Mediawiki.

from a message I just sent to an apple guy that responded to the radar:
> btw, i'd love to get some numbers on how many people are actually crashing
> from this using statistics from talkback but I can't get it to run in rosetta
> (I even tried invoking oah/traslate directly and it segfaulted)  ISTR firefox
> once worked in rosetta on this machine and I hear it still works for other
> people so probably a local issue.  Later this week or next I'll try again to
> get ahold of a PPC tester / find a public PPC machine / try with rosetta on
> another box.  Also, I know CrashReporter generates lots of reports and it
> would be nice to get stats from those too.
Attached file gdb stack on 10.4.10
CrashReporter log of crashing on the same page under Rosetta.  Note HIToolbox doesn't even show up in the stack.
I just tested this again on the latest Leopard seed 9a559 on a PPC mac and I crash hitting that URL (it is a bit of a delayed crash). Talkback fires, but I am having trouble submitting the report due to a network error (perhaps a separate bug). I have an Apple report from this PPC crash which I can attach.
Assignee: joshmoz → smichaud
I've reproduced this crash on OS X 10.4.10 (on a PPC Mac).  I'll test
later on Leopard, but it's already pretty clear that this is a bad
bug, and not Leopard-specific.

By the way, Jeremy, what your gdb stack (attachment 285198 [details], comment
#10) has at it's bottom is truly bizarre -- it looks like you've quit
Firefox!  Something similar (and equally bizarre) shows up in a stack
trace posted for a seemingly related bug (in 1.8-branch Firefox,
attachment 285380 [details], bug 400291).  I now suspect that the bottom parts
of both stacks are corrupt.

If possible, Jeremy, please do a debug build of 1.8-branch Firefox and
test with that -- because it will have debug symbols, gdb and
crashreporterd will be less likely to generate corrupt stacks from it.

(I'll do the same, of course ... but I probably won't have time until
tomorrow.)
> Something similar (and equally bizarre) shows up in a stack trace
> posted for a seemingly related bug ...

I should have said "seemingly unrelated".  Aside from the (apparent)
stack-trace corruption, this bug doesn't seem related to bug 400291.
I see this bug (on OS X 10.4.10) on all the Firefox 2.0.0.X versions
that I tested, but not on any of the Firefox 1.5.0.X versions.  So it
seems to have been introduced (a long time ago) on the 1.8 branch (as
opposed to the 1.8.0 branch).
This bug is a regression that took place on the 1.8 branch between the
2006-05-30-07-mozilla1.8 nightly (which doesn't have the bug) and the
2006-05-31-08-mozilla1.8 nightly (which does).

This regression window almost certainly implicates the single patch
that was landed as a fix for bug 106695 and bug 339640:

http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=MOZILLA_1_8_BRANCH&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2006-05-30+12%3A34&maxdate=2006-05-30+12%3A34&cvsroot=%2Fcvsroot
Keywords: regression
This bug isn't Leopard-specific.
Summary: [10.5] Crash in DispatchEventToHandlers(EventTargetRec*, OpaqueEventRef*, HandlerCallRec* → Crash in DispatchEventToHandlers(EventTargetRec*, OpaqueEventRef*, HandlerCallRec*
Here's a stack trace made using a build with debug symbols (built from
current 1.8-branch source code).

It includes both a gdb trace of the top and bottom of the main thread
stack, and a crashreporterd trace of all the threads' stacks.
Attached patch FixSplinter Review
Here's a patch that fixes this bug.

Contrary to appearances, the bug isn't caused by an infinite recursion
... just by a very large non-infinite one :-)

This bug's URL has a very large number of nsMacControl objects on its
page.  For each of these, a handler for kEventWindowActivated and
kEventWindowDeactivated Carbon events is installed on the window to
which the control belongs.  Currently (before my patch), this handler
calls CallNextEventHandler() to invoke all previously installed
handlers.  But if there are n nsMacControl objects in a page/window,
using CallNextEventHandler() makes the handler recurse n-1 times.

This bug's URL has so many controls that this recursion (apparently)
causes a stack overflow.

Calling CallNextEventHandler() from this handler servers no purpose,
so I've eliminated the call.

Even with my patch, this bug's URL still redraws incredibly slowly,
and sometimes fails to redraw at all.  But that's another problem :-(
Attachment #286306 - Flags: review?(joshmoz)
Needless to say, this isn't an Apple bug.

Jeremy, please let them know that.
Attachment #286306 - Flags: review?(joshmoz) → review?(mark)
Comment on attachment 286306 [details] [diff] [review]
Fix

This is OK with me.  Make sure that you test that the sheet show/hide stuff in nsMacWindow still works in conjunction with controls, since I know that it uses {kEventClassWindow, kEventWindowActivated} too.
Attachment #286306 - Flags: review?(mark) → review+
> Make sure that you test that the sheet show/hide stuff in
> nsMacWindow still works in conjunction with controls, since I know
> that it uses {kEventClassWindow, kEventWindowActivated} too.

Please give me some pointers about how to do this -- examples of URLs
(or browser menus/functions) that use this stuff, and so forth.
It's probably enough if you bring up a Bugzilla page and then try to navigate through all of the prefs sheets in Google Toolbar.
(In reply to comment #21)
> Jeremy, please let them know that.
Done.
> It's probably enough if you bring up a Bugzilla page and then try to
> navigate through all of the prefs sheets in Google Toolbar.

Thanks, Mark.

I did this (while occasionally switching to another browser window and
back, or to another app and back), and saw no problems.
Comment on attachment 286306 [details] [diff] [review]
Fix

Roc, Jeremy has already pointed out the following error in my comment.
I'll fix it when I land the patch:

+  // actually "consume" an activate or deactive event here, there should be

should be

+  // actually "consume" an activate or deactivate event here, there should be
Attachment #286306 - Flags: superreview?(roc)
Comment on attachment 286306 [details] [diff] [review]
Fix

(In reply to comment #27)
As long as we're enumerating errata here, I'll add to the pile:
>+  // for each of them.  So CallNextEventHandler() recurses through this
>+  // handler once for each "additional" control, and if there are too many
>+  // controls a stack overflow can result (bmo bug 398499).  Since we never
This sentence could probably use an extra comma or to be split into 2 sentences.

>+  // no problem alwaye returning eventNotHandledErr.
s/alwaye/always/
Roc? :-)
Comment on attachment 286306 [details] [diff] [review]
Fix

sorry, I lost track of this one somehow
Attachment #286306 - Flags: superreview?(roc) → superreview+
Comment on attachment 286306 [details] [diff] [review]
Fix

This patch's risk is minimal (the call to CallNextEventHandler()
didn't serve any useful purpose).

It's benefits are perhaps moderate (very few web pages will have
enough controls to trigger the crash my patch fixes).  Or you could
argue that, since Wikipedia is an important website, the patch's
benefits are substantial.
Attachment #286306 - Flags: approval1.8.1.10?
Personally I don't think we should take this on branch.
Oh?  Why not?

(The problem is branch-only, by the way.)
Because we typically only take crash fixes on branch if they have security implications, and AFAIK this doesn't. We might make exceptions for crashes that are regressions from other branch changes, or for crashes that are very frequent, but this doesn't seem to meet those criteria either.
> We might make exceptions for crashes that are regressions from other branch
> changes

What I mean specifically is that we might take a fix for a crash that was introduced since Firefox 2.0.0.0 shipped. But it seems this bug was in Firefox 2.0.0.0.
> Because we typically only take crash fixes on branch if they have
> security implications, and AFAIK this doesn't.

No, I agree this crash doesn't have security implications.

But this policy seems _way_ too restrictive ... at least in my
opinion.  Even if you sometimes make exceptions to it.

Though it'd be another matter if this bug didn't have a patch.

Given that a patch exists, I'm not sure it makes sense to
distinguish between "primordial" crash bugs and regressions.

(By the way, one reason this bug got a lot of attention is that at
first it seemed to be Leopard-only.)
We've had some pretty bad regressions in the 2.0.0.x series. You could argue that we're not restrictive *enough*.

> Given that a patch exists, I'm not sure it makes sense to
> distinguish between "primordial" crash bugs and regressions.

It does make sense. Crashes that 2.0.0.x has always incurred get worked around by Web developers and users. New crashes are a lot more painful.
You also need to remember that this patch is close to zero risk --
basically a no-brainer.

But it's out of my hands ... do with it what you will :-)
Now that we have a lot of test automation on trunk, the situation for the Firefox 3 branch should be a lot better, so we may want to relax our policy there. Or not. But anyway, this is not the right place for that discussion.
> We've had some pretty bad regressions in the 2.0.0.x series. You
> could argue that we're not restrictive *enough*.

Regressions caused by narrowly-scoped fixes for crash bugs?

I'd be surprised to hear there were more than one or two ... or even
that many.

I _do_ think your approach is correct for new features ... but I'd bet
that almost all of these regressions you talk about came from adding
new features.
> Regressions caused by narrowly-scoped fixes for crash bugs?

Yes.

> I'd be surprised to hear there were more than one or two ...

Be surprised.

> I'd bet that almost all of these regressions you talk about came from adding
> new features.

None of the ones I'm aware of came from adding new features (not surprising, since we haven't added any new features in the 2.0.0.x series except, arguably, some Vista compatibility stuff).
Note that in many of the 2.0.0.x releases we fixed > 50 security bugs each. It's not surprising that some of these fixes caused regressions, even though we each one as safe as we could.
> Note that in many of the 2.0.0.x releases we fixed > 50 security
> bugs each.

Yes, I can see that security bugs deserve priority.  And it's hard to
keep track of the implications of that many different changes.  And
they inevitably (and legitimately) draw attention away from other
bugs (even crash bugs).

And I grant that, even without the crash, 1.8-branch Firefox behaves
very badly on this bug's URL.
Comment on attachment 286306 [details] [diff] [review]
Fix

1.8.1.10 isn't the release to take this in (short, focused). We'll look later.
Attachment #286306 - Flags: approval1.8.1.10? → approval1.8.1.11?
Flags: wanted1.8.1.x+
Flags: wanted1.8.0.x-
Whiteboard: [radar:5466547] → [radar:5466547] 1.8.1.x-only crash
Comment on attachment 286306 [details] [diff] [review]
Fix

approved for 1.8.1.10, a=dveditz for release-drivers
Attachment #286306 - Flags: approval1.8.1.12? → approval1.8.1.12+
I meant 1.8.1.12
Landed on 1.8 branch.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Steven, please use the fixed1.8.1.* keyword (12, in this case) after checking in on the branch.
Keywords: fixed1.8.1.12
I'll try to remember next time :-)
Verified in 1.8.1.12 with Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.12pre) Gecko/2008012208 BonEcho/2.0.0.12pre.
Status: RESOLVED → VERIFIED
Flags: blocking1.8.1.12?
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: