Closed Bug 373122 Opened 13 years ago Closed 13 years ago

Crash and/or Cocoa exception with Flash, overflow: scroll, and display: -moz-box

Categories

(Core :: Widget: Cocoa, defect, critical)

x86
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla1.9alpha8

People

(Reporter: jruderman, Assigned: smichaud)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [sg:critical?] post 1.8-branch)

Attachments

(3 files, 2 obsolete files)

Steps to reproduce:
1. Load the testcase and let it reload a few times.  (It will reload faster if it's local.)

Result:

firefox-bin[8985] *** NSRunLoop ignoring exception '*** -[NSCFArray objectAtIndex:]: index (6) beyond bounds (6)' that raised during posting of delayed perform with target 34b4a950 and selector 'runAppShell'
(this should be considered a crash, see bug 340453)
and Firefox stops painting.

OR

Firefox crashes with a sg:critical-looking stack.
Flags: blocking1.9?
Whiteboard: [sg:critical?]
Attached file testcase
Triggers the bug both opt and debug (Mac trunk).
Regressed between 2007-03-04_1651 and 2007-03-04_1750 (http://hourly-archive.localgho.st/ ftw).

Regression from bug 363253?  The patch for that bug had the checkin comment "Move scrollframe attribute-setting out of reflow to a post-reflow callback".
Blocks: 363253
Keywords: regression
Possibly. It looks to me like the problem is we get into a particular Mac widget setup and then, during Mac painting, a call to WillPaint flushes notifications which does "display:none" on the plugin, destroying its widget, and the Mac widget code gets all confused. I'm not sure what this has to do with bug 363253, but I can see how it could be a problem.
This whole WillPaint and notification-flushing mess is one of the things I really want to fix with the compositor...
Bug 373586 has the same regression range.
The fix for bug 373586 did not take care of this bug.
josh/colin, any thoughts on a fix?  

roc, are there bugs filed for the compositor work you mentioned in comment 4?
Keywords: arch
Breaking on -[NSException raise] should allow you to get a trace from where the exception is getting raised.

Jesse said it best in the original report though: fixing bug 340453 should turn this into just a regular crash, with a nice stack trace.
Flags: blocking1.9? → blocking1.9+
Depends on: 163260
Depends on: 340453
Target Milestone: --- → mozilla1.9beta1
Attached file What causes this bug
Here's a trace of what seems to cause this bug:  The browser modifies the
window's view hierarchy during a call (to [NSView drawRect:], from the OS) to
draw one of the views in that hierarchy.  Specifically it removes a view at
the same level in the hierarchy (one that "belongs" to the same superview,
though not (in this case) the view currently being drawn).  This changes the
size of the relevant "subviews" array.  But the method from which [NSView
drawRect:] is being called ([NSView
_recursiveDisplayRectIfNeededIgnoringOpacity]) isn't aware of this.  So when
it tries to fetch the next subview from the array (the subview after the one
currently being drawn), this causes a "beyond bounds" error.

(The "Posed ..." entries in this trace are artifacts of how I got this
information -- I used poseAsClass to "hook" calls to [NSCFArray
objectAtIndex:], [NSCFArray removeObjectAtIndex:] and [NSView
_recursiveDisplayRectIfNeededIgnoringOpacity].  It was _not_ easy to figure
out how to hook methods of the NSCFArray class.  If anyone's interested in how
I did this, I can attach a short howto.)

In my next message I'll post a preliminary patch that seems to fix this
problem.  With luck it'll also fix bug 374260.
Er... if any ProcessReflowCommands() can trigger ProcessPendingRestyles() (as this stack shows), I think we have a problem.  At the very least, all callers that expect to flush out already-ending layout updates but not style reresolves (PresShell::WillPaint is one of them) will be likely to crash...

See also bug 375436.

I guess one way to go is to say that any layout update might wipe out the frame tree (and run arbitrary script?) and make all consumers deal...
Attached patch Preliminary patch (obsolete) — Splinter Review
Here's a patch that (in my tests) fixes this bug.  It stops the current view
from being removed from its superview during nsChildView::TearDownView() --
instead it postpones this until the next time the OS runs the main run loop
(on the main thread).

I don't know that this is the only way to fix the problem, or even the best
way.  But it works, and I can see nothing wrong with it in principle.
> With luck it'll also fix bug 374260.

It seems to (in a "plain" copy of Minefield, patched only with my
"preliminary patch" (attachment 264783 [details] [diff] [review])).

But my version of Minefield with all my debugging information still
crashes (though in a different location), so I'll need to keep
looking.

I'll post further comments at bug 374260 (unless what I have to say
will also be relevant here).
(In reply to comment #10)

> I guess one way to go is to say that any layout update might wipe
> out the frame tree (and run arbitrary script?) and make all
> consumers deal...

Are you saying that (as things now stand) a call to
PresShell::WillPaint() might destroy the current view and the current
window? :-)
I don't know what the ownership model for those is, assuming you mean NSViews and so forth.

It can absolutely destroy nsIViews.  And possibly nsGlobalWindows.
I meant both Gecko objects and their OS-specific counterparts (in this
case things like NSViews and NSWindows).

I assume that Gecko objects wrap their OS-specific counterparts, so
that the OS-specific objects would have the same ownership model
... but I don't really know.

> It can absolutely destroy nsIViews.

Can it destroy the "current" nsIView (the one the browser and/or the
OS is currently "working on")?  (In this case, "working on" would be
"drawing" (via [NSView drawRect:]).)  I suppose the answer is "yes".
There is no "current" nsIView at that point in the code...
> There is no "current" nsIView at that point in the code...

OK, I misunderstood.  I'm just talking about OS-specific objects (like
NSViews and NSWindows).

(By the way, if there's no "current" nsIView at that point in the code
(of which my trace is a snapshot), I'd guess that my patch can deal
with a call to nsChildView::TearDownView() on the "current" NSView
(when an nsPaint event is dispatched as that NSView is being "drawn"
in [NSView drawRect:]).  Bug 374260 may be a case of this.

I don't know what would happen if the "current" NSWindow got
destroyed, though (the one to which the "current" NSView belongs).)
Here's a revised patch, hopefully final.  I've changed it to
accomodate bug 381087 (which I just spun off from from bug 374260).
I've also added comments.

My previous patch "worked" for bug 381087 (the crash-bug part of bug
374260) -- it stopped the crashes.  But I discovered that it
introduces problems of its own.  Bug 381087 is caused by an NSWindow's
contentView being destroyed before the NSWindow is.  My first patch
caused the contentView to be destroyed after the NSWindow ... but this
could cause a double-free (since the call that destroys an NSWindow
also destroys its contentView).

Since an NSWindow "owns" its contentView (and destroys it when it gets
destroyed), I no either release a ChildView that's a contentView or
remove it from its superview (in nsChildView::TearDownView()).
Attachment #264783 - Attachment is obsolete: true
Attachment #265268 - Flags: review?(joshmoz)
Comment on attachment 265268 [details] [diff] [review]
Revised patch (accomodates bug 381087)

In my tests I'm now leaking ChildView objects that are contentViews.
This wasn't happening before.  I need to find out why.
Attachment #265268 - Flags: review?(joshmoz)
Alright, _this_ patch should (I hope) be the final version.

I wasn't counting retains and releases correctly :-(
Attachment #265268 - Attachment is obsolete: true
Attachment #265296 - Flags: review?(joshmoz)
Attachment #265296 - Flags: review?(joshmoz) → review+
Attachment #265296 - Flags: superreview?(mikepinkerton)
pink, can you sr?
Flags: wanted1.8.1.x-
Flags: wanted1.8.0.x-
Whiteboard: [sg:critical?] → [sg:critical?] post 1.8-branch
+  // Also calling [mView removeFromSuperviewWithoutNeedingDisplay] causes
+  // mView to be released again and dealloced, while remaining win's
+  // contentView. 

If that's the case, then we were always doing multiple releases, since this call was followed directly by a release. How did this ever work?
If you call [mView removeFromSuperviewWithoutNeedingDisplay] in
addition to [mView release], mView will be dealloced immediately
(though since the call to [mView release] was second, that's what
triggered the dealloc).

As far as I know, this didn't cause any problems unless mView was also
its NSWindow's contentView -- since mView was dealloced after it had
been removed from the NSView hierarchy.

(Though, of course, removing an NSView from the NSView hierarchy
during a call to [NSView drawRect:] caused other problems.)
Or maybe you're asking why [NSWindow dealloc] didn't get into trouble
trying to release an invalid contentView ... which in any case is a
valid question.

I'll try to find out.
> I'll try to find out.

The answer seems to be that [NSWindow dealloc] never touches the
window's contentView directly.

All NSWindow objects appear to have an NSNextStepFrame object (a
subclass of NSView) as their top-level NSView -- above the
contentView.  I tested in Cocoa Firefox and Safari, and also found the
following doc:

http://www.cocoadev.com/index.pl?NSNextStepFrame

The only NSView object touched directly by [NSWindow dealloc] is the
NSWindow's NSNextStepFrame object -- [NSWindow dealloc] calls [NSView
release] on it, from which (normally) [NSNextStepFrame dealloc] is
called in turn.  [NSNextStepFrame dealloc] releases all its subviews.
If the NSWindow object's contentView is still part of its NSView
hierarchy, it will get released (and normally also dealloced).  If the
NSWindow's contentView is no longer part of its NSView hierarchy, it
will never get touched.

But if the NSWindow's contentView was previously dealloced (without
calling [NSWindow setContentView:nil]), it still stores a pointer to
the (now deleted) contentView (in its _contentView variable).  And
anything that calls [NSWindow contentView] will get back the invalid
pointer.
Comment on attachment 265296 [details] [diff] [review]
Revised patch (count retains and releases correctly)

sr=pink
Attachment #265296 - Flags: superreview?(mikepinkerton) → superreview+
landed on trunk
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee: joshmoz → smichaud
Group: security
Flags: in-testsuite?
Of note, this seems to have started bug 385408 on OS X (Windows version of the same bug started sometime earlier). See bug 385408 comment 5.
crash test landed
http://hg.mozilla.org/mozilla-central/rev/015bcccac065
Flags: in-testsuite? → in-testsuite+
You need to log in before you can comment on or make changes to this bug.