Closed Bug 311399 Opened 20 years ago Closed 20 years ago

Crash [@ SetOrigin] during rendering context destruction [Mac]

Categories

(Core Graveyard :: GFX: Mac, defect)

PowerPC
macOS
defect
Not set
critical

Tracking

(Not tracked)

VERIFIED FIXED
mozilla1.8.1

People

(Reporter: mark, Assigned: mark)

References

()

Details

(4 keywords)

Crash Data

Attachments

(1 file, 3 obsolete files)

The current Mac topcrasher on the branch is in SetOrigin: http://talkback-public.mozilla.org/reports/firefox/FF15x/smart-analysis.mac I haven't been able to reproduce, but based on the Talkback stack traces, the crash is occurring at rendering context destruction time. The rendering context destructor restores a previous port and origin. I ressurected this code in bug 298677, bug 298677 comment 42 has some of the gory details of the history and why it was removed in the first place. It appears that when this crashes, it's because a bad port is being restored. My proposed solution is to wrap the restoration in ::IsValidPort.
Talkback 10259022 10246827 10243232 10185177 1018022 &c.
Severity: normal → critical
Flags: blocking1.8rc1?
will need something soon for this bug. Sounds like mark is on it.
Flags: blocking1.8rc1? → blocking1.8rc1+
Attached patch Potential fix (obsolete) — Splinter Review
There may be a slight pageload regression with this patch, but it's not nearly as bad as I was anticipating on Tiger. Unpatched pageload numbers were 292, 351, 369, 359, 356. Patched, 367, 354, 375, 353, 354. Filtering out the lows and the highs, there's an average slowdown of 4ms. I'm still looking to see if there's a safe way to solve this problem by using a different save and restore strategy.
I'd really rather avoid the call to IsValidPort(), and figure out why we end up with a bad port. I think IsValidPort() walks the heap to see if the pointer is good, so how fast it is might depend on the size of the malloc heap.
Attached patch Better potential fix (obsolete) — Splinter Review
I looked at this some more and found that whenever mPort is bad, the crash occurs directly in ::SetGWorld. The crash that's occurring here is in the next call, ::SetOrigin, which says to me that, in SAT notation, (looking : the port :: barking : the wrong tree). If the port's not bad, then this might have something to do with the GDevice? This patch turns the direct ::SetGWorld call into a SafeSetPort. Since I can't reproduce the crash, it might be a good idea to take this and let it bake to get some new Talkback data.
Attachment #199307 - Flags: review?
Attachment #199307 - Attachment is patch: true
Attachment #199307 - Attachment mime type: application/octet-stream → text/plain
Attachment #199307 - Flags: review? → review?(sfraser_bugs)
Comment on attachment 199307 [details] [diff] [review] Better potential fix I'm not sure that we have strong evidence that the crash is caused by the graphics device going bad. Since we use GWorlds, which can have their own GDevices, and I'm not really clear on when one needs to reset the GDevice, I'm leary of messing much with this code. Without being able to reproduce the crash, we don't have much to go on.
Attached patch A different strategy (obsolete) — Splinter Review
Here's another way to go. The save/restore stuff in the rendering context was reintroduced in bug 298677 to solve bad origins getting set when the caret was going. This takes a more direct approach. Instead of saving and restoring in the RC, this does it all directly in nsCaret::GetCaretRectAndInvert. We didn't need it in the RC to solve any other case I'm aware of. Hackish? Sure. It's also probably the safest way of solving this problem, it's got a high probability of success, and since none of us are able to reproduce it, I'm not sure how much better we'll be able to do. One other advantage here is that it will work when DONT_REUSE_RENDERING_CONTEXT isn't defined in nsCaret.cpp, although that's not likely to ever happen given the projected lifetime of this caret and QD graphics. While I had the hood up, I noticed that the RC wasn't freed during an early return in GetCaretRectAndInvert's bidi switching branch and fixed it. It's actually possible that the SetOrigin crashes are related to carets in bidi fields, and not freeing the RC before returning could have been responsible for too-late port restoration.
Attachment #199532 - Flags: review?(sfraser_bugs)
Attachment #199307 - Flags: review?(sfraser_bugs)
Whiteboard: [needs review sfraser]
Whiteboard: [needs review sfraser] → [needs review sfraser][needs reproducible testcase]
Simon and Mark both sent me some very helpful e-mails about this bug. After much thought, we're going to minus this bug for now even though it's a top mac crasher. Here's why: We aren't sure that either of these patches will actually fix the crash. We'd be guessing. Furthermore, our experts (mark/simon nor QA) have not been able to trigger this top crash internally making it even harder to verify a potential fix with just a few days to go until the tree closure. What we'd like to see happen is to decide on one of these two patches. Go through the review process to get the patch on the trunk and start monitoring talkback over the next few weeks (or more) to see how successful the patch is. We could then consider taking it on the 1.8 branch in a follow up point release (neither patch involves an API change).
Flags: blocking1.8rc1+ → blocking1.8rc1-
Today's talkback comb revealed a very useful testcase for both this bug and bug 312119, which now appear to be the same thing. TB1084955X (I made up the X to try to coax Bugzilla into making a link.) Visit http://musicstore-koeln.de/en/global/3_42_GEGIT_32_GIT0003878-002/0/0/0/detail/musicstore.html?artikel_cid=GIT0003878-003 . It helps to have an empty cache. While the page is loading, but after the flash movie in the top right begins playing, switch the Ausfuhrungen popup from FI to WR. (Other items in the popup will cause the crash too, the talkback report suggests WR.) Roughly a third of the time, you'll crash in SetOrigin or QDPlatformGlobalToLocal. If you don't crash within the first few tries, it seems easier to dump the cache and try again if you're looking to reproduce. This bug is timing-sensitive and seems more difficult to reproduce in debug builds. I'm testing primarily with Flash 8 now, but was also able to crash this with Flash 7. Patches 1 and 3 alleviate both crashes. Patch 2 does nothing. Setting IsValidPort assertions indicates that the saved port is in fact invalid. There is a saved GDevice in these cases. The crash doesn't occur every time there's an invalid port. The data structures haven't had a chance to be trampled by the time SetOrigin is called. QDPlatformGlobalToLocal is called (by the Flash plugin) a little bit later, giving more of an opportunity to trash memory we have no business pointing to. Talkback data show that we're crashing in QDPlatformGlobalToLocal twice as frequently as in SetOrigin. They're the Mac topcrashers by far, with over 300 hits combined in the past 10 days. When this crashes in SetOrigin, the caller is PresShell::ProcessReflowCommands, freeing a rendering context whose lifetime is that method, but it's a longish-lived method. It's called by ReflowEvent::HandleEvent. I suspect that the invalid port belongs to the old plugin. I think we should go with the third patch, which does solve the crash and gets the save/restore out of the way of other RC users.
Whiteboard: [needs review sfraser][needs reproducible testcase] → [needs review sfraser]
TBID is TB10849555X. Resetting blocking? now that we know more about this bug.
Flags: blocking1.8rc1- → blocking1.8rc1?
Comment on attachment 199532 [details] [diff] [review] A different strategy This is an ugly patch to put into XP code, but it will have to do on the branch. For the trunk, I think we need something more elegant.
Attachment #199532 - Flags: review?(sfraser_bugs) → review+
It's not the plugin's port, it's the nsMacWindow's old and now-gone port. The nsMacWindow belonged to the FI/WR popup in the testcase. We take care to set the port to something safe when necessary in the nsMacWindow destructor, but that's called while ProcessReflow is still way up on the stack, after it created its RC which saved the potentially evil port state. Reflow's RC alone is the problem here, and this probably even crashes with no Flash involvement. This isn't a case of save and restore model not matching stack nesting as we expected, it's just a case of an RC that lives too long and does too much.
Whiteboard: [needs review sfraser]
Attachment #199532 - Flags: superreview?(bzbarsky)
Attachment #199307 - Attachment is obsolete: true
Mark, can you tell us what this patch does, what's the risk associated with taking this patch, and what kind of testing has been done or would need to be done to feel good about taking this on the branch.
Comment on attachment 199532 [details] [diff] [review] A different strategy So I assume doing this in PushState/PopState doesn't work for some reason? That would seem to be the obvious place to handle this sort of state to me.... Is the problem there that it actually doesn't work, or that it's just really slow?
Asa- This patch backs out some save/restore code that was added to the rendering context to resolve some caret-only drawing regressions in bug 298677, and replaces it with save/restore code in the caret itself. That's where the ugliness from a code perspective comes in: the caret is cross-platform, and not really the best place for platform-specific tweaks like this, although it's the most practical and safe spot to do it on the branch right now. I discussed this in comment 7 and Simon brought it up again in comment 11, and that's where bz is heading in comment 14. The reason that this is safe is that it moves save/restore into the caret so that it occurs when it would have in the caret's rendering context, leaving bug 298677 fixed. Since save/restore is no longer in the RC, it can't cause trouble for other users of RCs, which is where this crash and the one in bug 312119 are running into trouble. I can't guarantee it's risk-free because it's possible that other users of the rendering context were relying on the save/restore, although I can say that it's highly unlikely for any of our code to fall into this trap. We set up the gfx state for ourselves as needed, and only need to do save/restore to make sure the system sees the right state. I'll note that from 2002 to July of this year, there was no save/restore here at all, and everything worked just fine, save/restore was only reintroduced to resolve a regression from bug 282940. It's possible, but unlikely, that 282940 caused other regressions that were resolved by leaving the save/restore in the RC (from where it now must be removed, since we can see it's causing crashes). Really, the best way to do this is to not do any save/restore and to set the state up on demand when the system needs it, but that comes with its own problems, and we haven't managed to get that working perfectly yet. Admittedly, even though that would be the right thing to do, it's probably too risky for the branch at this point even if we did get it working. I can say that I'm working with this in my own branch build right now and haven't noticed any problems or crashes. I wouldn't be advocating the patch at this stage of the game if I didn't think it was a good idea.
(In reply to comment #14) > So I assume doing this in PushState/PopState doesn't work for some reason? > That would seem to be the obvious place to handle this sort of state to me.... > Is the problem there that it actually doesn't work, or that it's just really > slow? The caret already does PushState/PopState, and the port origin is already reset by PopState, but that doesn't work here. PushState pushes the state of an already-initialized rendering context. In the case of the caret, the RC is initialized to a window widget that contains the caret. The port, and more importantly, port origin, are set according to that widget. This, in turn, messes with scrollbars if they're manipulated when the caret draws. It's bug 298677. PopState only brings QuickDraw back to the window widget's base state from initialization time, which is not the same as the state in effect before the rendering context was initialized. Per discussion above, it seems that we only need to restore the QD state in this manner for the caret. Best of all would be to rip this save/restore back out entirely and handle it on a system-level event that comes in before a scrollbar is drawn or its position read, but we can't do that without garbage appearing in the scrollbars. That was bug 300058.
Attachment #199532 - Flags: superreview?(bzbarsky) → superreview+
Flags: blocking1.8rc1? → blocking1.8rc1+
Comment on attachment 199532 [details] [diff] [review] A different strategy Asa, did you intend to mark sr+?
Attachment #199532 - Flags: superreview+
Comment on attachment 199532 [details] [diff] [review] A different strategy sr=bzbarsky. Are we landing this on trunk, then? Or just on branch and working on a different fix for trunk?
Attachment #199532 - Flags: superreview+
Attachment #199532 - Flags: approval1.8rc1?
Unless it's an absolute driver precondition, I'd rather not take this on the trunk. I'd like to get rid of save/restore and do some OS event handlers there if we can make them work without glitches. At that point, it'd be worth considering a backport to the 1.8 branch if it's live.
Comment on attachment 199532 [details] [diff] [review] A different strategy a=asa for branch landing. the sooner the better.
Attachment #199532 - Flags: approval1.8rc1? → approval1.8rc1+
Fixed on branch.
Keywords: fixed1.8
This seems to have regressed bug 298677 in cases that don't involve carets, during page loading. Bug 313639.
we backed this out, clearing out the fixed1.8 flag.
Flags: blocking1.8rc1+ → blocking1.8rc1?
Keywords: fixed1.8
Attachment #199532 - Flags: approval1.8rc1+
Depends on: 313639
Let's fix this with Carbon event handlers. Apple says that the drawing artifacts will be fixed.
Flags: blocking1.8rc1? → blocking1.8rc1-
Attached patch Proper fixSplinter Review
I'm putting the patch here and not on one of the other five bugs about this because the primary motivator is to fix the crash. The strategy here is to remove the port save/restore stuff, wherever it may be found. A new kEventControlDraw handler is installed; it and the TrackControl action proc look for a bogus state and work around it when necessary. The rationale is explained in the comments. The most dangerous part about this patch is that it posts a fake mouse-down event while the mouse button is already down in tracking in order to force the Control Manager to reread the scrollbar position. I tried several other approaches, but nothing else short of bypassing the Control Manager and providing our own home-grown tracking function looks like it'll fly. In testing, this fixes the crash, eliminates all drawing artifacts, and is free of aesthetic and functional regressions I was able to cook up. It also takes care of other weird and inappropriate caret-scrollbar and loading-scrollbar interactions. Bug 300047, harmless visual artifacts, will appear in scrollbars on Mac OS X 10.4.0 - 10.4.2. They don't appear in 10.3.9 or 10.4.3.
Attachment #199233 - Attachment is obsolete: true
Attachment #199532 - Attachment is obsolete: true
Attachment #202167 - Flags: superreview?(sfraser_bugs)
Attachment #202167 - Flags: review?(joshmoz)
Comment on attachment 202167 [details] [diff] [review] Proper fix This patch compiles and works fine under my testing. I looked over the code, looks good. It'll be good to have this cleared up.
Attachment #202167 - Flags: review?(joshmoz) → review+
Comment on attachment 202167 [details] [diff] [review] Proper fix hack-o-rama
Attachment #202167 - Flags: superreview?(sfraser_bugs) → superreview+
Yeah, but only the PostEvent part. Live with this on the trunk now.
Blocks: 301478, 304607
Flags: blocking1.8.1?
Moving my stability bugs to blocking1.8.0.1?
Flags: blocking1.8.1? → blocking1.8.0.1?
Since the patch for this bug was checked into the trunk at 2005-11-09 19:46, shouldn't this bug be marked as fixed?
yes
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
*** Bug 312119 has been marked as a duplicate of this bug. ***
*** Bug 319609 has been marked as a duplicate of this bug. ***
Comment on attachment 202167 [details] [diff] [review] Proper fix We really really really want this for 1.5.0.next. It fixes the Mac topcrashers, @ QDPlatformGlobalToLocal() and @ SetOrigin().
Attachment #202167 - Flags: approval1.8.0.1?
Attachment #202167 - Flags: approval1.8.1+
Attachment #202167 - Flags: approval1.8.0.1?
Attachment #202167 - Flags: approval1.8.0.1+
a=dveditz for drivers
Flags: blocking1.8.1+
Flags: blocking1.8.0.1?
Flags: blocking1.8.0.1+
Checked in to 1_8 and 1_8_0.
Target Milestone: --- → mozilla1.8.1
I just crashed on the test page after context clicking the flash ad, then slecting about Flash. Talkback ID: TB13799831W, I won't know until the reports queue gets to my crash whether this is the same bug as this or not.
Tracy, that's a different stack. Looks similar to bug 320998, although I don't recall having seen one of those with Flash on the stack.
Tracy: I just followed your same steps but did not get a crash, tested using today's branch build, Mac OS X 10.4.3 (In reply to comment #37) > I just crashed on the test page after context clicking the flash ad, then > slecting about Flash. > > Talkback ID: TB13799831W, I won't know until the reports queue gets to my crash > whether this is the same bug as this or not. >
okay, since I was straying from test steps and the stack from my crash is different, marking this verified
Status: RESOLVED → VERIFIED
Looks good using Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1b1) Gecko/20060810 BonEcho/2.0b1. I do not crash on the site, and I tried using Mark's steps in Comment 9. adding verified keyword.
Product: Core → Core Graveyard
Crash Signature: [@ SetOrigin]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: