Closed Bug 402505 Opened 17 years ago Closed 16 years ago

Crash [@ libobjc.A.dylib@objc_msgSend AppKit@[NSTSMInputContext interpretKeyEvents:] AppKit@[NSView interpretKeyEvents:] [ChildView keyDown:]]

Categories

(Core :: Widget: Cocoa, defect, P1)

x86
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla1.9beta3

People

(Reporter: djst, Assigned: MatsPalmgren_bugz)

References

Details

(Keywords: crash, testcase, Whiteboard: [sg:critical] using freed memory)

Crash Data

Attachments

(10 files, 7 obsolete files)

45.42 KB, text/html
Details
40.52 KB, text/plain
Details
14.13 KB, text/html
Details
13.18 KB, text/html
Details
1.94 KB, text/plain
Details
34.45 KB, patch
Details | Diff | Splinter Review
189 bytes, text/plain
Details
397 bytes, text/html
Details
35.25 KB, text/html
Details
30.77 KB, patch
jaas
: review+
Details | Diff | Splinter Review
Steps to reproduce

1. Go to http://fsb.se/
2. Click on "Logga in" at the top right corner
3. Click on "Fortsätt" (Continue)
4. In the text box, enter 000000000000
5. Hit Enter ONCE
6. Observe the JavaScript error dialog that pops up, and wait for the next page to load
7. Hit Enter again

Crashes here every time.
Summary: Crash when logging in to fsb.se with keyboard → Crash [@ ChildView keyDown AppKit@0xf516f AppKit@0xf5dc8] when logging in to fsb.se with keyboard
Summary: Crash [@ ChildView keyDown AppKit@0xf516f AppKit@0xf5dc8] when logging in to fsb.se with keyboard → Crash [@ libobjc.A.dylib@0x24d8 AppKit@0xf516f AppKit@0xf5dc8 ChildView keyDown] when logging in to fsb.se with keyboard
Attached file debug crash log
We get more symbols here (as an aside, is there a bug about mozilla's socorro instance not having OS symbols?).

We're in some TSM stuff; I wonder if this might be related to the other enter/return bugs that have popped up recently?
Summary: Crash [@ libobjc.A.dylib@0x24d8 AppKit@0xf516f AppKit@0xf5dc8 ChildView keyDown] when logging in to fsb.se with keyboard → Crash [@ libobjc.A.dylib@objc_msgSend AppKit@[NSTSMInputContext interpretKeyEvents:] AppKit@[NSView interpretKeyEvents:] [ChildView keyDown:]] when logging in to fsb.se with keyboard
I'm not sure the cause. The crashed line is not mine, I have only added the if statement before the line.
I know what the problem is... more in a bit...
Assignee: joshmoz → mats.palmgren
Flags: blocking1.9? → blocking1.9+
Status: NEW → ASSIGNED
Whiteboard: [sg:critical] using freed memory
Target Milestone: --- → mozilla1.9 M10
nsChildView::Destroy is called while handling a keyDown event for it.

The problem is that nsChildView::DispatchEvent() can lead to really
destructive things, like deleting frames,views,widgets,shells etc,
and the caller needs to be prepared for that.
[ChildView dealloc] while handling an event.
Attached file Dangerous call paths... (obsolete) —
Call paths to nsChildView::DispatchEvent() and Rollup() in nsChildView.mm
(Rollup() is also potentially destructive)
Attached patch Patch rev. 1 (obsolete) — Splinter Review
Fixes this crash and other potential crashes.
Note that holding a strong ref on the nsIWidget does NOT protect its
|mView| from being destroyed, we need to hold a separate strong ref
on that where needed (this is what the nsAutoRetainView helper class does).
Attachment #287626 - Flags: review?(joshmoz)
Priority: -- → P2
Mats - can you post a patch that does the minimum required to fix this bug? I want to understand what you're doing better, that would help.
Unfortunately the site has changed so I can't reproduce it anymore -
they seem to have removed the dialog window in step 6.
I tried to make a testcase but failed, so if someone could help with
that it would be appreciated.
Flags: in-testsuite?
Keywords: testcase
Whiteboard: [sg:critical] using freed memory → [sg:critical] using freed memory [needs-testcase]
(In reply to comment #9)
Looking at the "nsChildView::Destroy stack" I would guess that the
minimal patch to fix this particular crash would be to retain 'self'
in the -[ChildView keyDown:] method.

The rest of the patch adds protection to other methods that makes
direct or indirect synchronous calls into Gecko (through
nsChildView::DispatchEvent() or Rollup()).
Comment on attachment 287626 [details] [diff] [review]
Patch rev. 1

The approach here doesn't make much sense to me.

From comment #8:
"Note that holding a strong ref on the nsIWidget does NOT protect its |mView| from being destroyed, we need to hold a separate strong ref on that where needed (this is what the nsAutoRetainView helper class does)."

I don't see how mView could be getting destroyed from under an nsIWidget because nsIWidget retains the view. It may be that there is an incorrect double release happening somehow, but in general we should not have to protect ourselves in this way.

Also, we shouldn't need all of those kungFuDeathGrips because callers to those methods (like "::Resize") should be retaining.

Also, this patch doesn't apply any more, it'll need to be regenerated.
Attachment #287626 - Flags: review?(joshmoz) → review-
Priority: P2 → P4
Severity: critical → major
Summary: Crash [@ libobjc.A.dylib@objc_msgSend AppKit@[NSTSMInputContext interpretKeyEvents:] AppKit@[NSView interpretKeyEvents:] [ChildView keyDown:]] when logging in to fsb.se with keyboard → Crash [@ libobjc.A.dylib@objc_msgSend AppKit@[NSTSMInputContext interpretKeyEvents:] AppKit@[NSView interpretKeyEvents:] [ChildView keyDown:]]
Severity: major → critical
Attached patch fix v1.0 (obsolete) — Splinter Review
From looking at the information posted as attachments on this bug, I suspect this'll do it.
Assignee: mats.palmgren → joshmoz
Attachment #287626 - Attachment is obsolete: true
Attachment #291581 - Flags: review?(smichaud)
Attachment #291581 - Attachment is obsolete: true
Attachment #291581 - Flags: review?(smichaud)
Attached patch fix v1.1 (obsolete) — Splinter Review
After some further thinking and discussion with Steven, this is better.
Attachment #291588 - Flags: review?(smichaud)
Everything's fine with this patch but one thing -- I think you should
restore conversationIdentifier to the way it was before (returning a
long).

Apple's new docs on the NSTextInput protocol have
conversationIdentifier returning an NSInteger (which is an int on
32-bit systems and a long on 64-bit systems).  But keeping its return
value a long works the same way -- a long is 32 bits on a 32-bit
system and (I think) 64 bits on a 64-bit system (an Apple one, at
least).

(Older Apple docs have conversationIdentifier returning a long.)
Comment on attachment 291588 [details] [diff] [review]
fix v1.1

Josh says he'll restore conversationIdentifier to the way it was
(returning a long) on checkin.
Attachment #291588 - Flags: review?(smichaud) → review+
Attached patch fix v1.2 (obsolete) — Splinter Review
Steven talked me out of the long->int change.
Attachment #291588 - Attachment is obsolete: true
Attachment #291590 - Flags: review?(smichaud)
Attachment #291590 - Flags: review?(smichaud) → superreview?(roc)
(In reply to comment #12)
> I don't see how mView could be getting destroyed from under an nsIWidget
> because nsIWidget retains the view.

Please look at attachment 287622 [details] and note that the nsChildView
marked red is Destroy'ed while events on it are handled.
nsChildView::Destroy() releases the native view which is now dangling: 
http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/widget/src/cocoa/nsChildView.mm&root=/cvsroot&mark=499,552#473
and through the magic of nested event loops (nsXULWindow::ShowModal)
there is a real risk that Cocoa will deallocate it - and this is
indeed what caused the crash in this bug.
See attachment 287623 [details]

One could argue that fault really lays with nsView::~nsView that
calls nsChildView::Destroy() prematurely.
I think Olli have fixed that problem in bug 373344 now though,
at least for the view the event was originally dispatched for.
If the nsWeakView in the pres shell has the effect of protecting
the event view from being deleted then it should also protect
against nsChildView::Destroy() being called.

If so, that would fix most cases, leaving perhaps the code that
calls Rollup() to need to be reviewed again under this new assumption.
Attachment #291590 - Flags: superreview?(roc)
Comment on attachment 291590 [details] [diff] [review]
fix v1.2

That explanation makes sense to me. I moved my here patch out to bug 406909.
Attachment #291590 - Attachment is obsolete: true
Mats, as best I can tell, what happens in your attachment 287623 [details] is
impossible with proper checking (in ChildView methods) for mGeckoChild
== NULL.

nsChildView::Destroyed() calls [mView widgetDestroyed], which sets
mView's mGeckoChild variable to NULL.
The problem is the active call frames for the native view that is
being released -- when unwinding we need 'self' to be alive so we can
safely check its 'mGeckoChild' field... this is why it needs to be
retained during event handling.

If you look at attachment 287624 [details], every nsChildView/ChildView (including
'this'/'self') that is used after every call up to DispatchEvent needs
to have a strong ref to make sure the null-checks are on objects that
are still alive.

Hmm, now that I look more closely at Olli's fix for bug 373344 again,
I don't think it really helps this bug.  AFAICT it doesn't delay view
destruction.  It adds a safe way to check if the view is alive and
then passes a null view pointer to PostHandleEvent() in that case.
(sorry, I should have understood that by the nsWeakView name of course)

Which means we're back to square one here.
I'll try to do a new call analysis of the current code and post an
updated patch tomorrow.
Mats, I think I now see how a ChildView object could get dealloced
while calls are still being made on it (from the OS).  But it's not
the way you describe it in your first paragraph in comment #18.

The comment above this line shows why it can't be the cause:

http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/widget/src/cocoa/nsChildView.mm&root=/cvsroot&mark=499#473

But, through various twists and turns, it's possible that this line
might be:

http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/widget/src/cocoa/nsChildView.mm&root=/cvsroot&mark=502#473

The call to delayedTearDown (where mView is released and dealloced) is
postponed until the next time through the main thread's run loop.  So
the call to delayedTearDown can never take place in the same stack as
the call to TearDownView().

But if the OS keeps sending events to the ChildView object's "event
handlers" (e.g. keyDown: or mouseDown:) on subsequent invocations of
the main thread's event loop, they might (eventually) hit a ChildView
object that's already been dealloced (which would make tests of
mGeckoChild == NULL (probably) return the wrong result, even if they
didn't cause crashes themselves).

So the best fix would (as you said in comment #18) be to stop
nsChildView::Destroy() from being called "prematurely".  One way to do
this is your kungFuDeathGrip patch.
Here's something that might make it easier to test for prematurely
dealloced ChildView objects:  OS X supports "scribbling" an arbitrary
value into every dealloced object (so that any attempt to dereference
pointers in it will cause a crash) --
http://developer.apple.com/technotes/tn2004/tn2124.html (which also
has good info on many other things).

At a Terminal prompt:

$ export MallocScribble=1
$ [...]/Minefield.app/Contents/MacOS/firefox-bin
(In reply to comment #22)
> The call to delayedTearDown (where mView is released and dealloced) is
> postponed until the next time through the main thread's run loop.  So
> the call to delayedTearDown can never take place in the same stack as
> the call to TearDownView().

No, I believe you're mistaken.  While a modal dialog is posted we call
NS_ProcessNextEvent() which causes native events to be processed:
http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/xpfe/appshell/src/nsXULWindow.cpp&rev=1.177&root=/cvsroot&mark=397-400#379
Compare the top and bottom of the stack in attachment 287623 [details],
there's native event handling in both places.

Yes, 'delayedTearDown' will in most cases be sufficient, but if you
call into Gecko event handling there's a risk that it will force
native events to be processed (including calling 'delayedTearDown').
IMHO, we should remove 'delayedTearDown' since it wallpapers over real
bugs making them occur less often, but it does not fix them completely.

FWIW, I tested removing it and that seems to work fine, neither of the
testcases in bug 373122 or bug 374260 crashes, and all Mochitests runs
without crashing too (with my patch for this bug).
It's a little late to do that change now though.


> So the best fix would (as you said in comment #18) be to stop
> nsChildView::Destroy() from being called "prematurely".  One way to do
> this is your kungFuDeathGrip patch.

Well, I never claimed that my patch would stop nsChildView::Destroy()
from being called prematurely, and it doesn't.  What it does is simply
to retain the objects we depend on, so we can safely null-check mView
or mGeckoChild for example.
Mostly the same as before...
Attachment #287624 - Attachment is obsolete: true
Attached patch Patch rev. 2 (obsolete) — Splinter Review
Assignee: joshmoz → mats.palmgren
Attachment #291859 - Flags: review?(joshmoz)
(In reply to comment #24)

> No, I believe you're mistaken.

No, I'm not, and your attachment 287623 [details] shows it.

But you misunderstood what I said, and perhaps I wasn't clear enough.

Yes, native events can be processed at just about any point in the
browser -- that's by design (to make sure native events are processed
smoothly).

But no, you'll never see calls to TearDownView and delayedTearDown on
the same object in the same stack.

> Well, I never claimed that my patch would stop
> nsChildView::Destroy() from being called prematurely, and it
> doesn't.  What it does is simply to retain the objects we depend on,
> so we can safely null-check mView or mGeckoChild for example.

Yes, you're right.  Sorry for the confusion.
(In reply to comment #24)

> IMHO, we should remove 'delayedTearDown' since it wallpapers over
> real bugs making them occur less often, but it does not fix them
> completely.

Once again I disagree.  As the comments above delayedTearDown and the
call to delayedTearDown say, it's needed to prevent changes to the
NSView hierarchy during calls to [ChildView drawRect:].

The problems that it fixes don't happen very often, so I'm not
surprised you didn't see any after removing this code.  But see bug
373122, particularly bug 373122 comment #9.
(Following up comment #28)

And delayedTearDown would (theoretically) still be needed even if you
used your kungFuDeathGrip in [ChildView drawRect:] -- only
delayedTearDown guarantees that the ChildView object won't be
dealloced until the next time through the main thread's event loop.

By the way, what guarantees this behavior is calling [NSObject
performSelectorOnMainThread:withObject:waitUntilDone] with
waitUntilDone set to FALSE.
Attached patch logging.diffSplinter Review
Attached file GDB commands
Comment on attachment 292521 [details] [diff] [review]
logging.diff

BTW, this does not change the code in any significant way, it just
adds a 'printf' on entry/exit of the event handling methods and
tracks if we're in an event handler for each view.
Attached file Testcase #1
STEPS TO REPRODUCE
1. click on the link
2. in the alert that pops up: click ok
Steven, allow me demonstrate what is really happening.
If you load the testcase (from local disk) with the attached logging
patch applied and run it in GDB with the attached GDB commands this is
the result.  I have highlighted the interesting stuff and added comments
describing what's going on.

Can you reproduce the results?
Mats, you still haven't shown a case of nsChildView::TearDownView()
and [ChildView delayedTearDown] on the same object in the same stack.
And I still think delayedTearDown is needed for [ChildView drawRect:]
(I'll go into my reasons below).

But you have shown a case of a single method ([ChildView mouseDown:])
remaining on the stack while the main thread run loop is spun multiple
times.

So you do have a point, which seems to be this:

If you call code that spins the current thread's run loop, this will
cause the run loop to be re-entered.  This in turn will cause code to
run that the OS has been told to postpone until the "next time"
through the current thread's run loop.

Above I said that native events can be processed at just about any
point in the browser.  This was an oversimplification.

The OS (of course) spins its own run loop(s).  But the browser also
spins the main thread event loop as it processes Gecko events (the
relevant code is in nsBaseAppShell::OnProcessNextEvent() (in
widget/src/xpwidgets/nsBaseAppShell.cpp) and
nsAppShell::ProcessNextNativeEvent() (in
widget/src/cocoa/nsAppShell.mm)).

So if your code calls code that spins the Gecko event loop (which in
turn spins the main thread's run loop), the "wait until next time"
rule is likely to be broken, and it's unwise to depend on it.

We still need to stop the view hierarchy being changed during calls to
[ChildView drawRect:], and I still think delayedTearDown is the best
way (perhaps the only way) to accomplish this.  As best I can tell,
the Gecko event loop doesn't get spun multiple times as a result of
the call (in [ChildView drawRect:]) to DispatchWindowEvent().  If this
is true, delayedTearDown should continue to work properly there.

But I grant that it's probably best not to use delayedTearDown in
other cases -- as you said, it just muddies the waters.  And I saw
some wierdness while running my own tests that I need to look into (it
looked like delayedTearDown was being called multiple times on the
same object).

So at some point I need to revisit the whole delayedTearDown business
... though I don't think this is urgent.

By the way, in my own tests I just ran a recent trunk build (with
debugging symbols) in gdb, and didn't use your patch.  Like you, I did
see the same call to [ChildView mouseDown:] on the stack through
multiple calls to nsChildView::TearDownView() and [ChildView
delayedTearDown].  But I didn't see any errors when I set
NSZombieEnabled or MallocScribble.
I just wanted to mentioned that I crashed in this stack twice testing the Beta 2 candidate during my smoketest run using Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b2) Gecko/2007121014 Firefox/3.0b2.
Priority: P4 → P2
Target Milestone: mozilla1.9 M10 → mozilla1.9 M11
Attached patch Patch rev. 3 (obsolete) — Splinter Review
Updated to tip; no significant changes from last version.  Please review.
Attachment #291859 - Attachment is obsolete: true
Attachment #294600 - Flags: review?(joshmoz)
Attachment #291859 - Flags: review?(joshmoz)
Priority: P2 → P1
Whiteboard: [sg:critical] using freed memory [needs-testcase] → [sg:critical][need review] using freed memory [needs-testcase]
Comment on attachment 294600 [details] [diff] [review]
Patch rev. 3

I'd like Steven to review this patch so you can both agree before I look at it. Please request review from me when he has given r+.

Sorry this is taking so long but we're both in the middle of other bugs and this requires a lot of time to review.
Attachment #294600 - Flags: review?(joshmoz) → review?(smichaud)
Comment on attachment 294600 [details] [diff] [review]
Patch rev. 3

I'm going to be doing this review now, sorry for the switches.

So far my impression is that this is not an acceptable solution. We need a more systematic approach to protecting ourselves, I think that is possible. More on that soon.
Attachment #294600 - Flags: review?(smichaud) → review?(joshmoz)
Attached patch Patch rev. 4Splinter Review
I spent most of today going over this and I understand much more about what is going on now. I think Mats has the right idea, nice work Mats!

As I was reviewing this I ended up moving Mats' patch piece by piece into my tree, and because I was doing that anyway I just produced a patch with my comments addressed. I did not change very much.

Mats - if you review my edits to your patch then we can consider this r+ from me. Then we can get sr from roc and get this in.

List of changes in my patch:

- modified some comments
- fixed compiler warning unrelated to this bug
- moved location of nsAutoRetainView class definition
- in nsChildView::Resize and nsChildView::Scroll, check mOnDestroyCalled to see if widget was destroyed instead of mView since the former is more logically direct
Attachment #294600 - Attachment is obsolete: true
Attachment #296081 - Flags: review?(mats.palmgren)
Attachment #294600 - Flags: review?(joshmoz)
"more logically direct" -> "logically more direct"

You probably knew what I meant but my typo was killing the part of me that majored in English. I had to say something.
Attachment #296081 - Flags: review?(mats.palmgren) → superreview?(roc)
Comment on attachment 296081 [details] [diff] [review]
Patch rev. 4

Your edits looks fine. Please put an r+ on the patch
so it looks formally reviewed. Thanks.
Attachment #296081 - Flags: review?(joshmoz)
Attachment #296081 - Flags: review?(joshmoz) → review+
Whiteboard: [sg:critical][need review] using freed memory [needs-testcase] → [sg:critical] using freed memory [needs-testcase]
Attachment #296081 - Flags: superreview?(roc) → superreview+
Landed on trunk.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
verified fixed using Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9pre) Gecko/2008050621 Firefox/3.0pre and the steps to reproduce from tenser and the testcase from mats ...no crash -> Verified fixed
Status: RESOLVED → VERIFIED
Keywords: testcase
Whiteboard: [sg:critical] using freed memory [needs-testcase] → [sg:critical] using freed memory
Crash Signature: [@ libobjc.A.dylib@objc_msgSend AppKit@[NSTSMInputContext interpretKeyEvents:] AppKit@[NSView interpretKeyEvents:] [ChildView keyDown:]]
You need to log in before you can comment on or make changes to this bug.