Closed Bug 153815 (streetmap.co.uk) Opened 22 years ago Closed 22 years ago

crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] [@ @0x00000000][@ nsImageWin::DrawToImage] (remaining issues from bug 144315)

Categories

(Core :: Layout, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla1.2alpha

People

(Reporter: alexsavulov, Assigned: kinmoz)

References

()

Details

(4 keywords, Whiteboard: [adt2] custrtm- [ETA 09/09])

Crash Data

Attachments

(7 files, 5 obsolete files)

526 bytes, patch
Details | Diff | Splinter Review
10.06 KB, text/plain
Details
1.44 KB, application/zip
Details
8.92 KB, text/plain
Details
1.79 KB, patch
alexsavulov
: review+
Details | Diff | Splinter Review
3.81 KB, patch
john
: review+
Details | Diff | Splinter Review
6.01 KB, patch
john
: review+
Details | Diff | Splinter Review
crash in nsImageListener::FrameChanged (remaining issues from bug 144315 - see that bug for details) search for a place in GB using http://streetmap.co.uk when yuo get the map, click on the arrows to navigate. i managed to crash pretty fast when i open the link in a new tab.
copying the bug info from bug 144315
Alexandru, were you able to reproduce the crash with http://www.animecity.nu/mozilla/lynxos/index.html ? (Comment 76 of Bug 144315)
Summary: crash in nsImageListener::FrameChanged (remaining issues from bug 144315) → crash in [@ nsImageListener::FrameChanged] (remaining issues from bug 144315)
*** Bug 145542 has been marked as a duplicate of this bug. ***
QA Contact: petersen → moied
arron: i was able to crash at anemicity and _not_ at the MIB site with today's trunk build which i pulled at 3a.m this morning.
Also please do not forget the crash I garnered at http://jslab.org/mozilla/xsldom.xml -- can someone reproduce that crash in a build after the fix from bug 144315? This link involves strictly XML and XSLT.
*** Bug 154126 has been marked as a duplicate of this bug. ***
it would be nice to get a fix for this before 1.0.1 is out the door. pls update the ETA in the Status Whiteboard.
Blocks: 143047
Keywords: mozilla1.0.1
Whiteboard: [adt2 rtm] custrtm- → [adt2 rtm] custrtm- [ETA Needed]
*** Bug 146027 has been marked as a duplicate of this bug. ***
I was able to crash at http://www.animecity.nu/mozilla/lynxos/index.html following the instructions for that testcase. My incident is below: Incident ID 7706258 Stack Signature nsImageBoxListener::OnStopDecode 1a429002 Email Address jpatel@netscape.com Product ID MozillaTrunk Build ID 2002062508 Trigger Time 2002-06-25 14:19:11 Platform Win32 Operating System Windows NT 4.0 build 1381 Module gklayout.dll URL visited http://www.animecity.nu/mozilla/lynxos/index.html User Comments reproducing bug 144315 or 153815. just loaded url and pressed F5 once to clear the table content...then clicked link to go to google.com and pressed back and refresh and back again and boom! Trigger Reason Access violation Source File Name c:/builds/seamonkey/mozilla/layout/xul/base/src/nsImageBoxFrame.cpp Trigger Line No. 876 Stack Trace nsImageBoxListener::OnStopDecode [c:/builds/seamonkey/mozilla/layout/xul/base/src/nsImageBoxFrame.cpp, line 876] imgRequestProxy::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequestProxy.cpp, line 295] imgRequest::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequest.cpp, line 339] imgContainer::Notify [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgContainer.cpp, line 460] nsTimerImpl::Fire [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp, line 352] nsTimerManager::FireNextIdleTimer [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp, line 588] nsAppShell::Run [c:/builds/seamonkey/mozilla/widget/src/windows/nsAppShell.cpp, line 134] nsAppShellService::Run [c:/builds/seamonkey/mozilla/xpfe/appshell/src/nsAppShellService.cpp, line 458] main1 [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp, line 1472] main [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp, line 1808] WinMain [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp, line 1826] WinMainCRTStartup() KERNEL32.dll + 0x1ba06 (0x77f1ba06) Adding Trunk, M1BR and nsImageBoxListener::OnStopDecode to summary for tracking. This is still a topcrash for the MozillaTrunk and Gecko1.0 Branch. BTW: We need a priority and target milestone for this bug.
Summary: crash in [@ nsImageListener::FrameChanged] (remaining issues from bug 144315) → crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageBoxListener::OnStopDecode] (remaining issues from bug 144315)
Alex Vincent: Your test URL (http://jslab.org/mozilla/xsldom.xml) doesn't crash for me with MozillaTrunk build 2002061508 (with fix from bug 144315). I tried loading it and refreshing it numerous times...is there anything else I should be doing to try to crash?
milestone and priority. basicly this bug should have the same values like bug 144315, since is a clone of that bug for the still open issues - we had multiple crash causes producing the same stack (or very simmilar).
Priority: -- → P1
Target Milestone: --- → mozilla1.0.1
Jay Patel -- see bug 153799. I did have several tabs open, plus mail/news.
http://www.sonypictures.com/movies/meninblack/, http://jslab.org/mozilla/xsldom.xml doesn't crash on trunk build 20020625 but, can but these URL crashes on branch 20020625 branch build.
I am still crashing using 2002062508 1.1a+ trunk build for Mac OS9.x. TB7722354K TB7722218E The step is the same as I described in bug 144315.
Incident ID 7722354 Stack Signature .__ptr_glue 71f549a2 Email Address Product ID MozillaTrunk Build ID 2002062508 Trigger Time 2002-06-26 01:58:29 Platform MacOS Operating System MacOS version 9.2.2 Module LAYOUT_DLL URL visited http://book.asahi.com/review/ User Comments http://book.asahi.com/review/?info=b&no=1http://book.asahi.com/review/index.html?info=s&no=4http://book.asahi.com/review/index.html?info=s&no=1http://book.asahi.com/review/index.html?info=s&no=2I kept clicking these links. Trigger Reason PowerPC access violation Source File Name Trigger Line No. Stack Trace .__ptr_glue imgRequestProxy::FrameChanged() [imgRequestProxy.cpp, line 294] imgRequest::FrameChanged() [imgRequest.cpp, line 331] imgContainer::Notify() [imgContainer.cpp, line 459] nsTimerImpl::Fire() [nsTimerImpl.cpp, line 342] handleTimerEvent() [nsTimerImpl.cpp, line 403] PL_HandleEvent() [plevent.c, line 596] PL_ProcessPendingEvents() [plevent.c, line 526] nsEventQueueImpl::ProcessPendingEvents() [nsEventQueue.cpp, line 388] nsMacNSPREventQueueHandler::ProcessPLEventQueue() [nsToolkit.cpp, line 177] nsMacNSPREventQueueHandler::RepeatAction() [nsToolkit.cpp, line 142] Repeater::DoRepeaters() [nsRepeater.cpp, line 136] nsMacMessagePump::DispatchEvent() [nsMacMessagePump.cpp, line 481] nsMacMessagePump::DoMessagePump() [nsMacMessagePump.cpp, line 289] nsAppShell::Run() [nsAppShell.cpp, line 120] nsAppShellService::Run() [nsAppShellService.cpp, line 457] main1() [nsAppRunner.cpp, line 1456] main() [nsAppRunner.cpp, line 1805] .__start
can't repro going to jslab.org, but when i crash going to anemicity.nu && streetmap i get this call stack: nsImageListener::FrameChanged(nsImageListener * const 0x04d13b48, imgIContainer * 0x04cb7aa0, nsISupports * 0x04b0b170, gfxIImageFrame * 0x04d183e8, nsRect * 0x0012fd28) line 2383 + 36 bytes imgRequestProxy::FrameChanged(imgIContainer * 0x04cb7aa0, gfxIImageFrame * 0x04d183e8, nsRect * 0x0012fd28) line 295 imgRequest::FrameChanged(imgRequest * const 0x04cb9fd4, imgIContainer * 0x04cb7aa0, nsISupports * 0x00000000, gfxIImageFrame * 0x04d183e8, nsRect * 0x0012fd28) line 337 imgContainer::Notify(imgContainer * const 0x04cb7aa4, nsITimer * 0x04cf6f60) line 460 nsTimerImpl::Fire() line 343 nsTimerManager::FireNextIdleTimer(nsTimerManager * const 0x01231a88) line 588 nsAppShell::Run(nsAppShell * const 0x010f7100) line 134 nsAppShellService::Run(nsAppShellService * const 0x010f55e0) line 458 main1(int 0x00000001, char * * 0x00306e90, nsISupports * 0x00000000) line 1456 + 32 bytes main(int 0x00000001, char * * 0x00306e90) line 1805 + 37 bytes mainCRTStartup() line 338 + 17 bytes KERNEL32! 77e8d326() which is a little different than jay's. also when going to streetmap i get the following assertions: ###!!! ASSERTION: OnStopDecode called multiple times.: '!(mState & onStopDecode) ', file c:\builds\trunk\mozilla\modules\libpr0n\src\imgRequest.cpp, line 507 ###!!! ASSERTION: initial containing block already created: 'nsnull == mInitialC ontainingBlock', file c:\builds\trunk\mozilla\layout\html\style\src\nsCSSFrameCo nstructor.cpp, line 8918 ###!!! ASSERTION: invlaid divisor: 'PR_FALSE', file c:\builds\trunk\mozilla\layo ut\html\table\src\nsTableFrame.cpp, line 3682 ###!!! ASSERTION: unexpected next reflow command frame: '*iter == mFrames.FirstC hild()', file c:\builds\trunk\mozilla\layout\html\base\src\nsHTMLFrame.cpp, line 522
easily to reproduce at http://debatt.aftenposten.no/Group.asp Click one of the debate links (like "Norsk politikk" or "Norge og EU") When is has loaded: hit navigation buttons for back / forward in succession - untill you crash. I crash fairly quick each time, usually on the first "back".
from my comment #16 wrt assertions, i thought we fixed the problem where InitalReflow was being called twice. using today's trunk build going back to the MIB site, i don't get any assertions (especially the "initial containing block already created"). did we _only_ just fix for the MIB site? :/
*** Bug 154351 has been marked as a duplicate of this bug. ***
*** Bug 154365 has been marked as a duplicate of this bug. ***
Attached patch wallpaper patch — — Splinter Review
talked to waterson yesterday and perhaps a wallpaper patch is the most we can do right now. pics may not render correctly but at least there's no crash. *shrug*
Rendering problem only happens to animated gif not for other areas AFAIK. Next click fixes the rendering problem.
*** Bug 154703 has been marked as a duplicate of this bug. ***
adding johnny since he was involved in making a second InitialReflow() call, hopefully more insight can be gained.
from comment #21: "...most we can do right now..." hmmm, i see 46 crashes (OnStopDecode) and 11 crashes (FrameChanged) in the last 3 days. (over 4200 OnStopDecode crashes since 05-31) We still don't have a reliable testcase so we cannot test this. Before bug 144315 was fixed we thought there is only one crash cause, it turned out that there are more. If someone wants to put that patch in, please explain on the bug why would this solve the crash. Please think of possible regressions too and. Another thing: this bug was the sequel bug for bug 144315. Although I agree that bug 146027 is closely related still there are slightly different issues. Don't forget any of them. (bug 146027 is now a dup of this one) I'm working on it too, still i'd like to find out the real problem since we already have a temporary patch. Thanks.
sorry for the crippled english :-)
Here is the latest Talkback date for the nsImageBoxListener::OnStopDecode crashes (originally bug 146027) on the MozillaTrunk. There are plenty of user comments and urls to help us reproduce this.
Yes, the patch for Bug 144315 only fixed MIB. (Well, not really, but MIB was the only major site causing that particular bug.. I'm sure there were lesser known sites that crashed too) As for Patch Attachment 89482 [details], it's almost identical to the patch I created about a week ago (Patch Attachment 88427 [details] [diff] which I posted to Bug 144315), so I'll copy the comments I made when I attached the patch: I very much doubt the patch is fixing the root of the problem, but rather patching one of the effects of it (perhaps the only effect?). The patch exits the function if mInitialContainingBlock is null. This gets rid of the second assertion entirely, and the page doesn't appear to be affected adversly. I still have yet to be able to produce nsImageBoxListener::OnStopDecode. IMO, I think the crash reports are bogus -- talkback or whatever interprets talkback data is misrepresenting the function names. OnStopDecode never gets called from imgRequestProxy::FrameChanged as the stack trace suggests. In fact, Line 295 - 1 calls mListener->FrameChanged. Perhaps I'm missing something here. http://lxr.mozilla.org/seamonkey/source/modules/libpr0n/src/imgRequestProxy.cpp#294
Arron: no, i think you're right and that we have to focus on the leaked frame issue whatever the talkback reports are showing. i think the most important goal is not to try to find a fix but to have a reliable testcase. QA will need strong evidence when they will verify if the bug was fixed.
Maybe some help for Arron M in Comment #28: "IMO, I think the crash reports are bogus" I downloaded and installed Gecko/20020628, started with a fresh profile, went to the PCF forums (http://forum.pcformat.co.uk/) and crashed in a weird form. I strongly believe it is the advertisements there that make my Mozilla crash, as when they load, I see them as 'noise' (horizontal coloured stripes). This time it made my whole Mozilla crash (plus Windows) into a full screen 'noise' over which the refresh scrolled in visible lines and blocks. Just as if the PC was too hot or the videocard lost its cooling (neither of which is true) Because everything had crashed, I could only reboot. So I don't have a Talkback ID.
"I strongly believe it is the advertisements there that make my Mozilla crash, as when they load, I see them as 'noise' (horizontal coloured stripes)." Bingo. I see the same noise in place of the Flash ads on streetmap.co.uk. This one needs at least a workaround for 1.0.1 - streetmap.co.uk is sorta popular in the UK.
i copied the animecity html files locally and my trunk (today's) still crashed. it definitely requires the second javascript src in the dir index_files to crash. i think the frame tree is being built wrong thus making a second initalReflow call somehow. when i remove the second javascript scr, i don't get the assertion. can someone cc some construct_frame_tree developers b/c i don't know of any? shiva and i believe this may be related to a timer issue. my branch build still doesn't crash :( talkback data shows that branch crashes occurs mostly on the linux platform. i run win2000. a couple of weeks ago, shiva spoke to pavlov re: bug 118004, pavlov was stating something about a leaky frame and so the timer can't clean up correctly...sounds very similar to this, a little too similar. notice in the stack trace that the timer is firing off on a frame that's invalid. thx for your time (it was kind of lengthy) :)
*** Bug 155346 has been marked as a duplicate of this bug. ***
i crash also everytime with the steps in 155346 (build 20020630)
Another user Jan Derk is crashing using a different set of steps than any noted in this bug so far...but she is getting the exact same stack as the other "OnStopDecode" crashes. Here are her words from bug 151163: "I have just replicated the bug on W2k build 2002062308 with "Load links in background on". Talkback information has been sent with my email addres janderk AT digitaldutch DOT com. Yes I run a talkback enabled version and have send in 3 of these crashes. I can replicate this behavior any time by following these steps: Load links in background on" must be enabled in the Tabbed Browsing Options. 1) Go to http://forum.rackshack.net/ 2) Middle mouse click the top left Rackshack logo to load the main page in another tab 3) Click the new tab, before the new page has finished loading. The page now shows only the background for me. It just stops. 4) Hitting the Refresh button raises the exception causing Mozilla to die." And her most recent crash: Incident ID 7925528 Stack Signature nsImageBoxListener::OnStopDecode 56f9cda2 Email Address Product ID MozillaTrunk Build ID 2002062308 Trigger Time 2002-07-02 07:39:11 Platform Win32 Operating System Windows NT 5.0 build 2195 Module gklayout.dll URL visited forum.rackshack.net User Comments I can replicate this behavior any time by following these steps: "Load links in background on" must be enabled in the Tabbed Browsing Options. 1) Go to http://forum.rackshack.net/ 2) Middle mouse click the top left Rackshack logo to load the main page in another tab 3) Click the new tab, before the new page has finished loading. The page now shows only the background for me. It just stops. 4) Hitting the Refresh button raises the exception causing Mozilla to die. Trigger Reason Access violation Source File Name c:/builds/seamonkey/mozilla/layout/xul/base/src/nsImageBoxFrame.cpp Trigger Line No. 876 Stack Trace nsImageBoxListener::OnStopDecode [c:/builds/seamonkey/mozilla/layout/xul/base/src/nsImageBoxFrame.cpp, line 876] imgRequestProxy::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequestProxy.cpp, line 295] imgRequest::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequest.cpp, line 339] imgContainer::Notify [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgContainer.cpp, line 460] nsTimerImpl::Fire [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp, line 352] nsTimerManager::FireNextIdleTimer [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp, line 588] nsAppShell::Run [c:/builds/seamonkey/mozilla/widget/src/windows/nsAppShell.cpp, line 134] nsAppShellService::Run [c:/builds/seamonkey/mozilla/xpfe/appshell/src/nsAppShellService.cpp, line 458] main1 [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp, line 1472] main [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp, line 1808] WinMain [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp, line 1826] WinMainCRTStartup() KERNEL32.DLL + 0xd326 (0x77e8d326) Cc'ing Jan. And removing http://www.sonypictures.com/movies/meninblack/ from url bar, since we are no longer crashing there.
yes, I know, i see the crash too. it would be nice if people that see the crash would work on a minimized testcase that reproduces the bug on the localhost (html, JS, etc) and so make us the debugging work easier. thx!
Attachment #89977 - Attachment mime type: application/octet-stream → application/zip
Comment on attachment 89977 [details] This crashes in my Trunk build (with or without) debugger. Loading from local directory where are jsOuter.js jsInner.js or are they unnecessary?
what took you so long guys? :-) thx for the testcase. yes, is crashing on my machine with a recent build. however, i cannot tell which stack we get yet be cause i have to pull the entire tree again godddamit (the &*#$ing thing needs to be pulled fresh all 2 days when you use unix make on windows).
We had this testcase since Jun 23.. just not zipped. the two js files are probably not needed so long as the <script src=> tags cause the parser to block. I have a better patch than Attachment 89417 [details] [diff], but I'm holding out while I dig deeper.
Arron, well, maybe you could tell us what you think at least if you think you have a better patch. maybe some poeple can help with. this is an important issue, we want to get rid of this crash asap. we're under time pressure. thx, alex
Sure, but you may think I've gone off the deep end. :P Let's see if I can summarize what I've learned: -My Theory: 1) script in index.html stops the parser for it 2) iframe.html starts parsing, but stops at it's script tag. Iframe is now partially loaded 3) index.html continues it's parsing. It hits </script> and does a flush. 4) flush somehow causes both pages to be created 5) due to iframe.html being 1/2 loaded, some stuff gets called twice for it. 6) the animated gif get's created twice 7) User leaves the page. Clean-up occurs (only 1 animated gif gets destroyed?) 8) timer fires on extra animated gif. 9) Crashy, because the stuff that the animated gif is attached to is gone. - It's sort of a timing issue. While index.html is paused/stopped, it has smarts to do a flush. This flush is based on a timer. If this timer fires before the iframe starts to load, everything will be okay (no crash). - Based on the above, I created a patch to force a flush upon every close of <iframe>. That way, the frame is created immediately, and not during the flakey steps during my theory steps. The patch is Attachment 89614 [details] [diff], found in Bug 154018. Now, what I am trying to hunt down is how to stop the stuff in iframe from being created twice. They both get created in nsCSSFrameConstructor::ContentInserted, the first one during the ConstructDocElementFrame call, and the second one during the bm->ProcessAttachedQueue() call. Personally, I don't think the ConstructDocElementFrame should be creating stuff when bm already has it in it's queue to be created (or does it at that point? I don't know yet) Figuring out why these elements are being created twice will determine whether Bug 154018 is the same bug as this one. Right now, Attachment 89614 [details] [diff] fixes this bug and Bug 148827
from what i see, we fail to add the image frame to the frames tree of the <iframe> (or we fail to create the image frame at all what i doubt)
i can't crash following comment# 35 using yesterday's trunk build. :( i applied arron's patch on my trunk and it fixes the crash. YAY! something weird is happening though. an ample amount of the following assertions are popping up now: ###!!! ASSERTION: creating nsImageListener: 'PR_FALSE', file c:\builds\trunk\moz illa\layout\html\base\src\nsImageFrame.cpp, line 2289 ###!!! ASSERTION: deleting nsImageListener: 'PR_FALSE', file c:\builds\trunk\moz illa\layout\html\base\src\nsImageFrame.cpp, line 2295 pages and pages of it, is there something _really_ wrong with that? occasionally these assertion still pops up: ###!!! ASSERTION: OnStopDecode called multiple times.: '!(mState & onStopDecode) ', file c:\builds\trunk\mozilla\modules\libpr0n\src\imgRequest.cpp, line 507 ###!!! ASSERTION: invlaid divisor: 'PR_FALSE', file c:\builds\trunk\mozilla\layo ut\html\table\src\nsTableFrame.cpp, line 3682
ly, did you put those first 2 assertions in there yourself? ;) I can't find them at all in LXR. The 3rd one is a seperate bug. I'm unsure whether a bug report for it has been filed yet. I plan on looking up/creating the bug sometime after this bug is fixed. I'm not sure about the 4th assertion.
duuuuude! i just discovered that for aminated gifs it takes forever to cancel a timer from the moment a page without images is loaded. to verify this: put a printf's in: imgContainer::Notify(nsITimer *timer) @ // do notification to FE to draw this frame, ..... printf("*#*# FRAME CHANGED\n"); observer->FrameChanged(this, nsnull, mCompositingFrame, &dirtyRect); and in: imgContainer::StopAnimation() @ mTimer->Cancel(); printf("*#*# TIMER STOPPED\n"); then load an animated gif after that load about:blank you'll see the FRAME CHANGED string 5 times until the timer gets canceled (TIMER STOPPED) if you load the animated gif named enthralled.gif that is available in the packed test case attached to this bug. that ain't good at all and needs to be cahnged. will try to see if we can get the webshell to do that or so. daaaamn!
arron: i'm such a dork! yeah, they were my own. :( alexandru: shiva and i kind of suspected something like that when we tried to reduce arron's testcase by linking about:blank to the initial page that gets loaded.
hmmm, since we leak frames (nsIFrame) in so many ways (due to bad html or due to some complicated constructs that involve so many different components like in the testcase above - and i'm sure the number of differet cases are infinite), i think that canceling the timers as soon as they're not needed anymore is the right way to go. Not only that we will get rid of this kind of behavior (timer causing a FrameChanged on a defunct frame), but we also get rid of unecessary execution of code. I haven't even dreamed about that such things are happening. Working on this.
indeed, objects do not get destroyed until after the next page has loaded. So during those 50ns that it took for about:blank to load, the gif got in 5 more frames of animation. There was an attempt once to stop animation timers upon stopping the document (Attachment 75533 [details] [diff] of Bug 70030). Stopping the document occurs just before loading a new one. However, the patch caused too big of a pageload regression and was taken out. So, it's back to animation timers being stopped when their imgContainer object is destroyed, which is after the new page loads.
thanks for digging that out Arron! is very important to know. i think to cancel the timers before the imgContainer gets destroyed, as soon as the webshell that contained the document with the animated gif(s) is gone (i'm not 100% sure but, i think for now, this would be the right approach). need to dig more into it.
ah, BTW: 50ns? with my debug build and using viewer.exe is like 500ms until all 6 events (5 framechanged and timerstopped) show up in the console and i don't think is the console's event queue only that delays things like that. hmmm, well this is an obscure part anyway :-)
there is another annoying thing: if the page after the one with the animated gif contains a non-animated image, the freaking timer continues to fire. that is definitively an overkill.
Sorry, I had my nanoseconds mixed up. Although it doesn't invalidate the point that animations still animate until they are destroyed, which is after the next page is loaded. And yes, having animation timers still firing even when the page is no longer available is silly and really shouldn't happen.. but that's a seperate bug, IMO.
Alexandru, In comment #48 you say: > hmmm, since we leak frames (nsIFrame) in so many ways (due to bad html or due > to some complicated constructs that involve so many different components like > in the testcase above - and i'm sure the number of differet cases are > infinite), ... Why do you say that, do you have numbers that show us often leaking lots of frames? I haven't heard of any frame leaks in a *long* time, except in some edge cases that normally result in crashes (like this one). In this case we do "leak" frames it seems, and it does result in a crash, at least some times. Seems to me that someone should concentrate on *why* we leak those (image) frames here and stop worrying about what happens when we do leak image frames.
I seem to recall Ly showing me some stacks that showed why we leaked the frames in this case -- it was a messy issue related to jst's frames in the content model landing. In particular, I think it had something to do with the extra (first?) InitialReflow call happening in DocumentViewerImpl::InitPresentationStuff as called from DocumentViewerImpl::Show, although my memory could be wrong. Posting those stacks to the bug could be useful.
I did a stack traces of the two InitialReflows in streetmap.co.uk and they are in Attachment 88785 [details]. It's a bit outdated though. I'll try to make a similar stack traces for the testcase. I should get that done today.
JST: hmmm, i don't have concrete numbers but i'm pretty confident that this testcase here is not the only example of how we leak frames. i saw bugs before where we had similar problems due to bad html (sometimes blocks in inline elements, sometimes some weird combinations of unclosed comments, scripts, and so on - i would have to dig to find those bugs). yes, i agree with you that trying to go after the frame leakage would be what we have to do. however, that would require more work if we want to cleanup the whole thing. so, going one testcase after the other will be pretty much what we already do since a while. unfortunately for this particular case here we have more than one URL that crashes with pretty much the same stack as the previous bug to this one (there is also another bug that crashes in the nsImageBoxFrame). Now since there is no reduced testcase for every of those pages (there are 4200+ crash reports since May, ok many are dups too) i cannot say if we hit the same kind of frame leakage or not. this is just an estimate though.
My Theory (extended): 1) index.html loads up to <script>. Everything in parser, but not flushed 2) script in index.html stops the parser (for index.html) 3) iframe.html starts parsing, but stops at it's script tag. Iframe is now partially loaded (and probably not flushed). Actually, only doctype, <html>, <head>, and <script> are in the parser for it. 4) index.html continues its parsing. It hits </script> and does a flush. 5) flush somehow causes both pages to be created 6) iframe.html gets flushed twice. Since the script is in the html/head block, the html/head gets created twice and two initialframes result 7) User leaves the page. other page finishes loading and clean-up of old page occurs (animated gif doesn't get destroyed?) 8) timer fires on animated gif. 9) Crashy, because the stuff that the animated gif is attached to is gone. If you are following my Bug 154018, you'll note the theory is very similar, except in that bug's case, the iframe.html's script is inside the body (not inside the head), so the body gets created twice.
oh, BTW with the testcase attached here that image frame actually exists as long as you don't click on the link and it receives the FrameChanged notification and processes them. The thing is, the image frame does not appear in the frame dump of the <iframe>. Only when we go to another URL the frame gets destroyed, but because is not in the frame tree of the <iframe> there is no Destroy performed on the frame looks like and so the timer gets canceled later on when the image container is destroyed.
Please note that this crashes a top partner site - see http://bugscape.mcom.com/show_bug.cgi?id=17165 for more.
For the record (to dbaron and others), this has nothing to do with my iframe loading changes, this happens on both the trunk and on the branch, my iframe loading changes only landed on the trunk.
Attached patch fix for the crashes (obsolete) — — Splinter Review
i observed that be cause of this double InitialReflow we create 2 area frames for the same document (attached to the same canvas frame). while the canvas frame has one of the area frames as child, the other one will receive the new body frame. well is pretty difficult to explain. now i'm tired and want to stop. test the patch, i will add a better explanation soon. have a happy 4th of July celebration.
whoo, more information! Thanks Alexandru! Based on alexsavulov's patch, I've come up with my 3rd revision of what I think is happening. I put stars around what's new. 1) index.html loads up to <script>. Everything in parser, but not flushed 2) script in index.html stops the parser (for index.html) 3) iframe.html starts parsing, but stops at it's script tag. Iframe is now partially loaded (and probably not flushed). Actually, only doctype, <html>, <head>, and <script> are in the parser for it. 4) index.html continues its parsing. It hits </script> and does a flush. 5) flush somehow causes both pages to be created 6) iframe.html gets flushed twice. Since the script is in the html/head block, the html/head gets created twice and two initialframes result. *Thus, 2 frames are added* 7) *index.html finishes.* 8) iframe.html finishes. *image gets added to one of those two frames (probably the wrong one)* 9) User leaves the page. new page finishes loading and clean-up of old page occurs. *Image does not get destroyed. Perhaps the second frame doesn't get destroyed either.* 10) timer fires on animated gif and crashy, because the stuff that the animated gif is attached to is gone. --- So what alexsavulov's patch does is destroys the first frame (step 6). The second frame becomes the first frame and presumably links up properly. Upon destruction, the frame is destroyed, along with the animated gif. IMO, I believe Attachment 89614 [details] [diff] better patches this bug. That patch fixes the problem by forcing a flush at step 1 (right after the </iframe>). The result is the frame being created immediately, and initialreflow never being called twice (thus no need to delete the first 'bad' frame). Plus it gets rid of the content duplication which is what Bug 154018 is about. (the patch in this bug does not get rid of content duplication) Either way, the crashes stop. If it feels like I'm stepping on people's toes or that I'm trying to start a patch war, I'm not, and I apologize in advance. I'm just looking for the best solution.
Arron: no man, you're not starting a patch war. the problem here, is that that freaking initialreflow causes the construction of a frame for the root content a second time no matter what. see: mStyleSet->ContentInserted(mPresContext, nsnull, root, 0); now, since i cannot be sure, that i can prevent all the occurences of a second initialreflow, i thought i prevent the existence of two [area] frames that cause the leaking of frames (having the empty frame assigned as a child to the parent and the frame that receives the rest of the tree assigned to britney that does not know what 2+2 is, not to mention reflow :-). now, while i think that my patch just prevents leaking and is not a jewel, i think that we have to deal with preventing that frame leaking. this is a deep architectural problem that won't be solved just by preventing a flush or another, unfortunately. there will be always another stupid combination of whatever scripts, images, iframes, bad html and so on that will again hit the same freking problem, i think. if you prevent flushing after scripts are loaded and executed, you may regress cases whre java script appends/changes content/styles and so on and this tragedy will never end. i think, that is the reason why javascripts cause this kind of flushing. play with moving the <script>s around and you'll see that you won't crash in all of cases. so better than breaking some dynamic things, let's just prevent frame leakage. that is my oppinion.
Alexandru: Ah, I see where you're coming from now. Yes, in that case, I agree it is good to have your patch in there to verify if there's already an docElementFrame (and destroy it so life can more of less go on) As for my patch, it doesn't prevent flushes, but rather forces a flush after </iframe>. Forcing flushes, while probably adding clock cycles to the page load, do not cause any grief or add any crashes :) But as you suggest, there might be more nasties out there causing this crash, which your patch will catch and deal with until they are fixed properly. Here's hoping the only nasty is iframes (under conditions talked about in these bugs) which will be fixed by Bug 154018. But enough about that, I'll leave the Bug 154018 talk to Bug 154018 from now on. :) Feel free to comment (or make a better patch ;) ) in that bug if you have some ideas.
i agree. bug 154018 is also an issue and although related to this one here be cause of the involvment of the iframe, i think that no matter how we solve this bug, the other one (bug 154018) needs to be repaired too. i will only try to see if i can prevent the destruction of the first area frame and pass everything to it, and not construct the second area frame at all. we'll see how that goes. btw: sorry for interpreting your intention wrong: force flush is right (i had the wrong impression that you did prevent flush)
arron, alex: Is it possible to have multiple frame leaked before the actual(or right) frame is created ?. In that case(not necessarily current scenario) do we need to delete/purge all the pre-existing leaked frames ? Will patch 90178 take care of that too ?
about patch 90178: actually i'm still looking for a way to use the existing frame and not creating the new frame at all if possible. that would be much more efficient.
Whiteboard: [adt2 rtm] custrtm- [ETA Needed] → [adt2 rtm] custrtm- [ETA 07/09]
Looking at recent MozillaTrunk topcrash data, I noticed yet another stack signature (nsImageFrame::FrameChanged) being reported with similar stacks: Here is once stack with nsImageBoxListener::OnStopDecode in it: nsImageFrame::FrameChanged [c:/builds/seamonkey/mozilla/layout/html/base/src/nsImageFrame.cpp line 736] nsImageBoxListener::OnStopDecode [c:/builds/seamonkey/mozilla/layout/xul/base/src/nsImageBoxFrame.cpp line 876] imgRequestProxy::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequestProxy.cpp line 295] imgRequest::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequest.cpp line 339] imgContainer::Notify [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgContainer.cpp line 460] nsTimerImpl::Fire [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp line 352] nsTimerManager::FireNextIdleTimer [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp line 588] nsAppShell::Run [c:/builds/seamonkey/mozilla/widget/src/windows/nsAppShell.cpp line 134] nsAppShellService::Run [c:/builds/seamonkey/mozilla/xpfe/appshell/src/nsAppShellService.cpp line 458] main1 [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp line 1472] main [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp line 1808] WinMain [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp line 1826] WinMainCRTStartup() kernel32.dll + 0x1eb69 (0x77e7eb69) And another stack with nsImageListener::FrameChanged in it: nsImageFrame::FrameChanged [c:/builds/seamonkey/mozilla/layout/html/base/src/nsImageFrame.cpp line 736] nsImageListener::FrameChanged [c:/builds/seamonkey/mozilla/layout/html/base/src/nsImageFrame.cpp line 2390] imgRequestProxy::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequestProxy.cpp line 295] imgRequest::FrameChanged [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgRequest.cpp line 339] imgContainer::Notify [c:/builds/seamonkey/mozilla/modules/libpr0n/src/imgContainer.cpp line 460] nsTimerImpl::Fire [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp line 352] nsTimerManager::FireNextIdleTimer [c:/builds/seamonkey/mozilla/xpcom/threads/nsTimerImpl.cpp line 588] nsAppShell::Run [c:/builds/seamonkey/mozilla/widget/src/windows/nsAppShell.cpp line 134] nsAppShellService::Run [c:/builds/seamonkey/mozilla/xpfe/appshell/src/nsAppShellService.cpp line 458] main1 [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp line 1472] main [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp line 1808] WinMain [c:/builds/seamonkey/mozilla/xpfe/bootstrap/nsAppRunner.cpp line 1826] Are these related to this bug? I'm guessing yes. Also, I noticed that the nsImageBoxListener::OnStopDecode crashes stopped showing up after 7/3. Do we know of any changes on the Trunk that would have caused those crashes to go away (or just move to another part of the code)?
well, this kind of crashes are very related. we can only say, based on the testcases (or URLs) that we have what is going to be fixed by the patch 90178 for example. try to find an URL's/testcase that cause a stack with nsImageBoxListener in it so we can test it. then we can say more about that. still I strongly suspect the same issue.
*** Bug 156555 has been marked as a duplicate of this bug. ***
i cannot land this patch (attachment 90178 [details] [diff] [review]) today yet. still looking for a better alternative if possible.
Whiteboard: [adt2 rtm] custrtm- [ETA 07/09] → [adt2 rtm] custrtm- [ETA 07/11]
Attachment #90178 - Attachment is obsolete: true
after long debugging sessions (until 3am in the morning) I came to the conclusion that what Arron suggested in his patch (attachment 89614 [details] [diff] [review] / bug 154081) is the only thing we can do for the moment. the time presses us and, although this may be not the best thing we can do, we cannot afford to live with this crasher in the next release. I refined Arron's patch a little bit (however it is basicly his proposal), and i also think that we should put it in for the next release until we can find a better patch and actually what caused the regression (if it was not there before). I will open a bug about this if we land the patch, to back it out if we find the cause for the regression (if any) or a proper patch. this is a multi threading issue from what i can tell and the script loading thread is competing against the main thread so we must flush the tags before the script tag is processed. (coarse guess though) need r=/sr=
*** Bug 156897 has been marked as a duplicate of this bug. ***
We're seeing this same stack going one frame deeper, but looks like the same issue. Updating the summary with [@ nsImageFrame::FrameChanged] for tracking.
Summary: crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageBoxListener::OnStopDecode] (remaining issues from bug 144315) → crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] (remaining issues from bug 144315)
Comment on attachment 90930 [details] [diff] [review] a patch similar to the one in bug 154018 (attachment 89614 [details] [diff] [review]) Initial testing of the patch reveals that it appears to patch up this bug, Bug 154018 and Bug 148827. :) I don't quite understand how it works just yet, but that's only due to my little knowledge of content/layout. I'll try to make sense of it later on today. In the meantime, I agree this patch should the next release, along with being in 1.1b if there's time.
Attachment #90930 - Attachment description: a patch similar to the one in bug 154081 (attachment 89614) → a patch similar to the one in bug 154018 (attachment 89614)
Comment on attachment 90930 [details] [diff] [review] a patch similar to the one in bug 154018 (attachment 89614 [details] [diff] [review]) Sorry I can only pick nits, but one of the few coding standards in Mozilla is "When in Rome", and Rome here means 'if (', not 'if('. Please put a space after such C++ keywords. /be
*** Bug 156837 has been marked as a duplicate of this bug. ***
Attached patch improved patch (obsolete) — — Splinter Review
changed teh patch to meet requrements and used caps for IFRAME to avoid confusion with nsIFrame.
Attachment #90930 - Attachment is obsolete: true
Whiteboard: [adt2 rtm] custrtm- [ETA 07/11] → [adt2 rtm] custrtm- [ETA 07/14]
Comment on attachment 91019 [details] [diff] [review] improved patch r=jkeiser, with some comments explaining that this is more than likely a hack and referencing this bug. I have done some investigating and can't figure out what's up yet.
Attachment #91019 - Flags: review+
Attached patch same patch with comments (obsolete) — — Splinter Review
added the comments
Attachment #91019 - Attachment is obsolete: true
Comment on attachment 91162 [details] [diff] [review] same patch with comments transfered review flag
Attachment #91162 - Flags: review+
Comment on attachment 91162 [details] [diff] [review] same patch with comments >@@ -2106,6 +2113,7 @@ > mDeflectedCount = 0; > mLastSampledUserEventTime = 0; > mScrolledToRefAlready = PR_FALSE; >+ mHasIFRAME = PR_FALSE; > } > > HTMLContentSink::~HTMLContentSink() This is unnecessary (along with the 11 lines of code before it) since the class uses NS_DECL_AND_IMPL_ZEROING_OPERATOR_NEW (note the comment at the beginning of the constructor that you're modifying). Other than that, this seems like a reasonable workaround.
Comment on attachment 91162 [details] [diff] [review] same patch with comments One of my concerns with the patch is that we are forcing a flush for every script tag we encounter after an iframe right? That is I don't see any place where mHasIFRAME is ever set to false after an iframe has been seen. I was talking to kmcclusk earlier today about this, and he suggested that alex run jrgm's page load tests to see if there are any noticeable perf differences with the patch. I know that several sites use iframes for ad banners and toolbars. Also, I think harishd owns the content sink, so I think he should take a look at this change. I was under the impression we already forced flushes when we got script tags? Maybe I was misinformed. In any case I've seen warnings in comments in that file regarding flushing and avoiding re-entrancy, so I'd like for him to verify that is a safe place to do it. Since this problem seems timing related I'm not totally convinced that flushing when processing script tags will fix the problem completely?
Oops, I missed that. It does seem like you should set mHasIFRAME to false when you do the flush.
hmm, i thought to let the flag set, but now it seems reasonable to reset it. another <iframe> would set it anyway. good catch. will mail harish.
How about just passing mHasIFRAME to the FlushTags() in HTMLContentSink::PreEvaluateScript() in stead of doing a double flush when loading scripts and iframes? That way this wouldn't add any flush calls, it would just make the flush we already do before executing a script flush reflows in stead of just flusing the sink. And yeah, we should reset the flag...
I tried the current patch, attachment 91162 [details] [diff] [review], on win2k with the page load test. There wasn't a significant difference in results. But a printf says that we only hit that flush on two pages: 8 times on the msn.com copy, and once on the spinner.com copy. There may have been a slight (3%) loss on msn.com, although it's difficult to separate that from noise given 5 loads of that page (i.e., other pages can move that much between two tests with the same binary just due to chance events). But don't let that discount the comments above, since while that test is a reasonable sample, it's doesn't mean that there aren't pages where the extra flushing could trigger a more substantial hit. (Although, not crashing is the primary goal in the short term).
from my understanding of the contentsink mechanics, is that we process the tags from a stack and we don't come back anymore to those tags that were flushed before. i may be wrong too :-) however, if is true, then the performance impact is low i think.
JST: i tried to flush with PR_TRUE in HTMLContentSink::PreEvaluateScript before i proposed this patch and it didn't worked unfortunately. btw: harish is on sabbatical :-(
Re: Comment #88 If I remember correctly, PreEvaluateScript is called once the script is the ready for parsing, which is too late in the process.
Arron is right
Hmm, that kinda blows, oh well...
(Just a followup to my previous comment: in testing with live web pages, I see that we do hit this flush much more than with the static pages, particularly on my.netscape.com. But this 'pattern' does seem more widespread in current web page design.).
added reseting of mHasIFRAME after flushing so, whasup dudes! do we want this patch in? :-) when is that deadline? 07-16? if this patch needs to go in, then let's have it soon on the trunk and let it bake there for a couple of days so we know if we want it on the branch or not.
Comment on attachment 91219 [details] [diff] [review] same patch reseting the mHasIFRAME after flushing transfering the review flag
Attachment #91219 - Flags: review+
if by "dudes" you mean the ADT, then um ... yes, we'd like to evaluate this one for inclusion before the 16th, but we need to have an sr=, it landed and verified on the trunk. thanks!
"dudes" was there just for a little fun. I'll try to get Kin again to sr= this. I think we have now enough feedback from a lot of people familiar with the content sink to go with. (i opened a bug about the constructor: bug 157164 but if david wants i can fix it with the fix fo this bug.)
So I saved the testcase at http://www.animecity.nu/mozilla/lynxos/index.html locally and removed the <script> tag and then loaded it and reloaded it (with the F5 key) several times ... as I suspected, I can still reproduce the same problem, granted not as easily, where the iframe content doesn't load and then loading any other page (even the same page) causes the crash. This tells me that all the proposed patch does is adjust the timing for the test case so that we don't see the problem, and that it's still possible for this double initial reflow to happen. Also, I don't see how the patch can work in all cases since it goes on the assumption that a script tag will be encountered before the iframe's content triggers an initial reflow.
ok, i'm working to see if i can find the checkin that caused this kind of regression. thx for testing.
kin: Excellent. I can confirm the crash on a nightly (2002071208) without the two <script> tags after pressing F5 many many many many times. I'll see if I can get it to crash on my debug build and will post my findings. Alexandru: Try a pull from 04/16/2002 21:15 and then try a pull from 04/16/2002 21:18. I have found that the first one does not crash, but the second one does. (I haven't tried these two builds with the test that kin did)
wait you're telling me that this crash was there before 1.0 (2002053012)? 1.0 works perfect for me!
The patches in between those two times are the ones from Bug 52334. They weren't put into the 1.0 branch. I realize this presents a problem, since the crashes are being seen on the 1.0 branch (although I have never been able to verify 1.0 crashes) I'm more inclined to think that Bug 52334 isn't really the problem, but rather just made a problem in the existing code happen more frequently.
i've never been able to repro the crash on the branch either :(
so the branch crashes now (based on some reports), but 1.0 does not crash. ok, based on a talkback query the nsImageFrame::FrameChanged crashes can be tracked back until mid 2001 or so. that means we always a had at least som kind of crash with this signature during the last let's say 12 months. ok, i'm verifying the branch builds now, mkaply has some old builds too and is testing right now.
So that makes it seem like the crash on the branch is due to some other leak of frames, but the crash on the trunk might still be related to what I mentioned in comment 55.
found branch witnesses: http://bugzilla.mozilla.org/show_bug.cgi?id=144315#c54 however this one was at that time a branch crash too, bryner had to apply his patch to both trunk and branch. unfortunately we all assumed that the remaining crashes are seen on the branch too but there is no proof yet. i did a query on talkback (searching for the signature containing nsImageListener::FrameChanged : - there are 5 Mozilla 1.0 crashes reported 04-17 - the rest of them since then are Mozilla Trunk There are 2 sepparate reports (Arrron and Mkaply) that suspect the fix for bug 52334 tha was causing the regression on the trunk. there is no proof for a branch crash yet.
btw: with a recent debug build, i can crash hitting F5 a lot of times but this is another stack: NTDLL! 77f9eea9() nsWindow::Create(nsWindow * const 0x04034174, nsIWidget * 0x00000000, const nsRect & {...}, nsEventStatus (nsGUIEvent *)* 0x02ae4540 HandleEvent(nsGUIEvent *), nsIDeviceContext * 0x0372ea58, nsIAppShell * 0x00000000, nsIToolkit * 0x00000000, nsWidgetInitData * 0x00000000) line 1533 nsView::CreateWidget(nsView * const 0x04011148, const nsID & {...}, nsWidgetInitData * 0x00000000, void * 0x00000000, int 1, int 0) line 804 DocumentViewerImpl::MakeWindow(nsIWidget * 0x0388f30c, const nsRect & {...}) line 5254 + 70 bytes DocumentViewerImpl::InitInternal(nsIWidget * 0x0388f30c, nsIDeviceContext * 0x0372ea58, const nsRect & {...}, int 1) line 1452 + 16 bytes DocumentViewerImpl::Init(DocumentViewerImpl * const 0x039eaa48, nsIWidget * 0x0388f30c, nsIDeviceContext * 0x0372ea58, const nsRect & {...}) line 1269 nsDocShell::SetupNewViewer(nsDocShell * const 0x03716718, nsIContentViewer * 0x039eaa48) line 4574 + 66 bytes nsDocShell::Embed(nsDocShell * const 0x0371673c, nsIContentViewer * 0x039eaa48, const char * 0x01d994df, nsISupports * 0x00000000) line 3925 + 23 bytes nsDocShell::CreateContentViewer(nsDocShell * const 0x03716718, const char * 0x0012fb78, nsIRequest * 0x03a3d5c0, nsIStreamListener * * 0x0012fbcc) line 4324 + 32 bytes nsDSURIContentListener::DoContent(nsDSURIContentListener * const 0x038cdaa0, const char * 0x0012fb78, int 0, nsIRequest * 0x03a3d5c0, nsIStreamListener * * 0x0012fbcc, int * 0x0012fb64) line 110 + 33 bytes nsDocumentOpenInfo::DispatchContent(nsIRequest * 0x03a3d5c0, nsISupports * 0x00000000) line 358 + 93 bytes nsDocumentOpenInfo::OnStartRequest(nsDocumentOpenInfo * const 0x039c4ba8, nsIRequest * 0x03a3d5c0, nsISupports * 0x00000000) line 228 + 16 bytes nsFileChannel::OnStartRequest(nsFileChannel * const 0x03a3d5c8, nsIRequest * 0x039c4c04, nsISupports * 0x00000000) line 501 + 37 bytes nsOnStartRequestEvent::HandleEvent() line 161 + 53 bytes nsARequestObserverEvent::HandlePLEvent(PLEvent * 0x04019d7c) line 116 PL_HandleEvent(PLEvent * 0x04019d7c) line 596 + 10 bytes PL_ProcessPendingEvents(PLEventQueue * 0x00f13ee8) line 526 + 9 bytes _md_EventReceiverProc(HWND__ * 0x0008099c, unsigned int 49405, unsigned int 0, long 15810280) line 1077 + 9 bytes USER32! 77e148dc() USER32! 77e14aa7() USER32! 77e266fd() nsAppShellService::Run(nsAppShellService * const 0x00f44f58) line 452 main1(int 2, char * * 0x002776b8, nsISupports * 0x00000000) line 1510 + 32 bytes main(int 2, char * * 0x002776b8) line 1859 + 37 bytes mainCRTStartup() line 338 + 17 bytes KERNEL32! 77e992a6()
Appoligies if this sounds like gibberish, but I was in a hurry ... The problem seems to be due to a race between the outer document's creation of the layout frames for the IFRAME and the loading of the IFRAME's document. We run into the double initial reflow problems when the onStartRequest() callback for the IFRAME's sub document is called and a content viewer is created for the sub document before the outer document has a chance to create the layout frame for the IFRAME. In this situation, at the time the outer document creates the IFRAME's layout frames, the mSubShell->SetVisibility(PR_TRUE) call in nsHTMLFrameInnerFrame::CreateDocShell() will actually trigger jst's code that calls InitPresentationStuff() with a hardcoded PR_TRUE that forces an initial reflow. It seems strange to me that we are intiating a reflow while we are still in the midst constructing frames (frame constructor code is still on the stack). The 2nd InitialReflow() comes when the body start tag is encountered for the sub document, since HTMLContentSink for the sub document assumes that it will be the only one calling InitialReflow() when it's StartLayout() method is called. I should also mention that the content viewer created during the OnStartRequest(), skips what seems to be the normal code path that creates a presContext because it doesn't have a widget associated with it, because that gets created when the outer document creates the layout frames for the IFRAME. This race condition may exist on both the trunk and the branch, but the code that initiates the extra InitialReflow() during IFRAME frame construction is trunk only. Things work properly whenever the IFRAME is constructed before it's subdocument's content viewer is created.
BIG QUESTION: I was not able to reproduce this with a branch build yet! Anyone?
I suspect the crashes on the branch are related to a different leak of frames. They should thus be covered in a different bug. It might be worth looking through the URLs in the talkback reports for potential reproducable testcases for that bug.
no crash on the current 1.0 linux branch testing dupe bug 153211 (which is what I based bug 144315 comment 54 on) or http://www.animecity.nu/mozilla/lynxos/index.html.
- BRANCH 1.0 NOT AFFECTED AFAICT - OK, so from what i see on talkback there are no branch 1.0 reports about this crash so we can assume that the 1.0 branch is not affected by this bug. I also was not able to crash with the branch so i'll say the branch is safe. If somoene observed crashes like this on the branch please mail me or add a comment on this bug. Also please try to find the URL's that caused the crash on the branch if any.
there isn't a FrameChanged sig under 1.0Branch but there exists one under the stack sig OnStopDecode. although i haven't been sure that those two stack sigs are related. i believe when jay crashed using the animecity repro steps he had a stack dump with OnStopDecode on top.
*** Bug 157823 has been marked as a duplicate of this bug. ***
reassigning as we agreed at the 07-18-2002 meeting
Assignee: alexsavulov → kin
Blocks: 146027
*** Bug 158228 has been marked as a duplicate of this bug. ***
Attached patch nsDocumentViewer Patch Rev 1 (obsolete) — — Splinter Review
This patch seems to prevent the crashes by preventing the InitialReflow() that happens from with InitPresentationStuff() in the manner jst and I discussed yesterday. I'm still trying to verify that we're not duplicating any frame/view/presShell creation between the time jst's code is hit and the code normally triggered by the content sink had the OnStartRequest() for the sub document not fired till after the iframe's construction. In the meantime if people are up to trying out the patch, by all means go ahead, and provide some feedback. I tested dynamic removal and insertion of an iframe and also dynamic removal and setting of the attribute style="display: none;" with this patch, and all work as they did prior to the patch.
Status: NEW → ASSIGNED
Target Milestone: mozilla1.0.1 → mozilla1.1beta
Spent some time going through several of the urls in the dup'd bugs. These urls seem to reproduce the problem rather easily: http://streetmap.co.uk http://www.mystat.pl and no longer crash with the nsDocumentViewer.cpp patch. I can still ocassionally crash using the url in 157823 but it never hits the code I modified, so that may be a similar stack, but different problem. None of the other urls in the dups I tried crashed ... though most of them never hit the area I modified either.
Comment on attachment 92010 [details] [diff] [review] nsDocumentViewer Patch Rev 1 With this patch, Bug 154018 changes from showing 2 goats, to showing no goats at all. I still believe the double InitialReflow is a red herring and a closer-to-the-core problem is that content is being duplicated. 2 InitialReflows result in this testcase because the script is in the head, so that's what it's trying to create twice. 2 goats display in Bug 154018 because the script is in the body, just after the goat. I still have not found the true reason for the double creation. Mind you, I haven't had time to investigate this bug any further over the past week.
*** Bug 158493 has been marked as a duplicate of this bug. ***
*** Bug 155520 has been marked as a duplicate of this bug. ***
*** Bug 158324 has been marked as a duplicate of this bug. ***
*** Bug 159146 has been marked as a duplicate of this bug. ***
eBay is made almost unsable by this bug (see bug #159146), if that matters. Perhaps someone could add ebay.com to the URL field; it's where most people will probably enounter it. Some more crash IDs if it's any help at this point (all from duped bug #159146): 8628605G, 8628802M, 8629058H, 8629751H, 8629979Z, 8630146H, 8630692Q
*** Bug 159097 has been marked as a duplicate of this bug. ***
M11B is now released. We need a new target milestone. M11B data shows a similar stack ending in nsTableCellFrame::SetColIndex. adding to the summary for Talkback tracking.
Summary: crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] (remaining issues from bug 144315) → crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] (remaining issues from bug 144315)
Alias: streetmap.co.uk
*** Bug 159728 has been marked as a duplicate of this bug. ***
this crash also also occuring with various 0x00000000 signatures in talkback...
Summary: crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] (remaining issues from bug 144315) → crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] [ @0x00000000] (remaining issues from bug 144315)
*** Bug 159958 has been marked as a duplicate of this bug. ***
Current Trunk data is showing crashes with the nsImageWin::DrawToImage signature and stacks similar to the stack in comment #9. Adding that signature to the summary for talkback tracking.
Summary: crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] [ @0x00000000] (remaining issues from bug 144315) → crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] [ @0x00000000][@ nsImageWin::DrawToImage] (remaining issues from bug 144315)
For all those interested, I just updated bug 154018: http://bugzilla.mozilla.org/show_bug.cgi?id=154018#c19 with some things I noted while debugging the test case Arron M pointed out in comment #121. These 2 bugs are definitely related.
Blocks: 154018
Whiteboard: [adt2 rtm] custrtm- [ETA 07/14] → [adt2 rtm] custrtm- [ETA Needed]
another talkback ID, just in case anyone cares: TB8728625Y
*** Bug 160929 has been marked as a duplicate of this bug. ***
*** Bug 161146 has been marked as a duplicate of this bug. ***
*** Bug 161389 has been marked as a duplicate of this bug. ***
*** Bug 162263 has been marked as a duplicate of this bug. ***
*** Bug 159878 has been marked as a duplicate of this bug. ***
Target Milestone: mozilla1.1beta → mozilla1.2alpha
TB95825247G for book.asahi.com shows a different stack trace in 20020818 1.1 RC for MacOS9.x. Why is it calling printpreview context in this build?
I have confirmed that 1.1 branch is crashing in a different stack trace (printpreview) from trunk builds (framechanged). The URL is http://book.asahi.com
*** Bug 164837 has been marked as a duplicate of this bug. ***
From the dupe bug: This crash is when loading all comments in a newsstory on www.os.news.com. Talkback has sendt 4 reports: Talkback ID, date and time TB9922966Y , 27/08-2002 14:11 TB9922887W , 27/08-2002 14:08 TB9922685M , 27/08-2002 14:04 TB9922682G , 27/08-2002 14:02
Keywords: topembed
I can recreate this crash VERY consistently on OS/2 with ebay if anyone is interested in having me look at something.
FYI, I think I have a fix for this bug (Bug 153815) that handles jkeiser's related bug 154018 as well. I'll be attaching it to this bug sometime today.
Blocks: 1.2a
Attached patch nsDocumentViewer Patch Rev 2 — — Splinter Review
Ok here's the patch I mentioned earlier. It basically merges my previous nsDocumentViewer.cpp patch and jkeiser's proposed patch in bug 154018, into one place and wraps them with conditions that will trigger them only when necessary. I tried to comment as much as possible to describe the situations that could trigger these conditions. I tested the patch against all the urls mentioned in this bug as well as the dup'd bugs, but only the following URLs managed to trigger the code in this patch: http://www.mystat.pl/ http://debatt.aftenposten.no/Thread.asp?GroupID=43&Group=Norge%20og%20EU http://www.animecity.nu/mozilla/lynxos/index.html http://listings.ebay.com/aw/plistings/endtoday/all/category29500/page2.html http://animecity.nu/mozilla/2goats I do remember that http://streetmap.co.uk used to trigger my previously proposed patch, but it doesn't anymore, so I'm not sure if they've changed their site. This just tells me that either the sites changed, the cause of the crashes reported in the dup'd bugs are not really caused by the same thing, or because of my slow modem connection there's some variation in timing that's not allowing me to trigger my patch or see the problem. In any case, for what it's worth, I did not crash when loading any of the urls I tried, and all of the content seemed to show up ok. Reviews, comments and testing welcome ... I'd also like to hear from jst and jkeiser since I've filled them in previously with the issues involved.
Attachment #92010 - Attachment is obsolete: true
Aye, this should do it. One thing, I'd feel better personally if there was a boolean mFailedToInitialReflow in the doc or iframe element, which we set when we fail to InitialReflow due to no prescontext, and we check and clear in the code you modified (this will make many of the extra checks unnecessary and make the comments and code a lot easier to follow :). Currently we're only guessing whether it happened or not (though it's a pretty accurate guess), and this guess is liable to change as stuff gets restructured. Making it explicit will make people think about it more when they change things in the source of the problem--since we'd be setting this flag when the InitialReflow fails, if someone makes it not fail anymore they are likely to remove this hack. Vice versa, if InitialReflow fails for another reason down the line they can just set the flag. What do you think?
You have to keep in mind that a doc can have multiple presShells/presContexts associated with it, and InitialReflow() has to be called on each one, so a single boolean flag in the element wouldn't do. Also, at the point this code is called, we know for a fact that there is no presShell associated with the DocumentViewer, and since InitialReflow() is a method on presShell, we know it hasn't been called yet.
I am concerned about fragility--if something else changes to cause such a timing problem we'll be in this deal again. Specifically, this fix addresses the cause of the problem but does not address the root cause--that there is no communication between the document and the iframe on whether to do this reflow and there is no generally reliable *implied* communication (AFAICT). Making said communication explicit would be much better for module boundaries and would shore this code up for future changes. As to doInitialReflow = PR_FALSE, I think an mAlreadyInitiallyReflowed or something would be best--and we check it and set it in both the iframe and in the content sink. If we're committed to doing this stuff in both places, we should at least have the two places communicate what they are doing. Thinking, it makes sense to put this mAlreadyInitiallyReflowed flag into PresContext / PresShell. As to flushing, we should *always* flush pending notifications if we are going to do an InitialReflow() and we are unsure whether there is unflushed content--this will nip any other InitialReflow timing problems in the bud. I would do something like this for the flush: if (!mLoaded) { ... <set doInitialReflow> ... } if (doInitialReflow) { FlushPendingNotifications(); } I would actually do the flush inside InitPresentationStuff() somewhere, perhaps at the beginning, to handle other callers of InitPresentationStuff(). This method will not cost much but that's not the point--it is a *requirement* to call this if you don't know whether everything is flushed or not. This reflow problem here is really an implied contract for InitialReflow(): "all content under the element being reflowed must have been flushed from content sink before we do initial reflow". We should ensure that we fulfill that contract in all cases.
*** Bug 165416 has been marked as a duplicate of this bug. ***
kin: I think we need to do both of the above, but to take care of these important bugs, how about (assuming we agree it's safe and performant) making a patch with the FlushPendingNotifications() at the end there? We can deal with the double-initial-reflow explicitness in another bug.
*** Bug 165616 has been marked as a duplicate of this bug. ***
*** Bug 166106 has been marked as a duplicate of this bug. ***
*** Bug 166149 has been marked as a duplicate of this bug. ***
*** Bug 165222 has been marked as a duplicate of this bug. ***
I actually have an alternate approach in an old tree from a couple of weeks ago, which I abandoned, which did basically what you mentioned above. It made the mDidInitialReflow field of the presShell accessible via a get method, and some changes to the content sink that called the method to prevent the double initial reflow ... but I also had to add some code to the content sink to send insert/append notifications for the body node, which never used to happen. The reason I abandoned this approach, was because a document can have multiple presentations, and I wasn't sure if I could assume that all presShells associated with the doc were all in the same state, so I wasn't sure what the impact of the notifications would be. Looking at things in hind sight, it probably doesn't matter. I'll resurrect that patch (remove printfs and make it based on more recent source) and attatch it.
This is a blocker for 1.2alpha which freezes tonight and is scheduled for release at the end of the week. Kin, can you get something ready for this today?
Comment on attachment 97077 [details] [diff] [review] nsDocumentViewer Patch Rev 2 r=jkeiser, though if we do put in this patch I'd appreciate if you could open a new bug to do the other stuff and upload the patch you mentioned to it.
Attachment #97077 - Flags: review+
Ok here's the alternate approach I mentioned above which requires changes to the PresShell, DocViewer and ContentSink.
*** Bug 166516 has been marked as a duplicate of this bug. ***
Comment on attachment 97675 [details] [diff] [review] Alternate Patch Rev 1 (Modifies PresShell, DocViewer, and ContentSink) I like this patch, except I'm unclear on exactly why we are doing NotifyInsert / NotifyAppend there. What problem is that solving--a new one? If we assumed that mBody was flushed in OpenBody() before, why would it not be flushed now?
adding ebay.com to URL so people can find it This bug makes it impossible for me to recommend 1.1 to any users or generalized embedders, and impossible for me to use it internally.
Cc'ing harishd for comment on the 2 attached patches.
Even though I like the approach in patch rev2, because it does not interfere with the content sink's code path, I'm not sure what kind of an impact it might have on frameset and xml documents. And since the root cause of the problem - race condition - is not easy to predict I believe that the alternate apporach ( Alternate patch rev1 ) is the way to go.
Comment on attachment 97675 [details] [diff] [review] Alternate Patch Rev 1 (Modifies PresShell, DocViewer, and ContentSink) - In nsHTMLContentSink.cpp: + if (didInitialReflow && mCurrentContext->mStackPos > 1) { + PRInt32 parentIndex = mCurrentContext->mStackPos - 2; + nsIHTMLContent *parent = mCurrentContext->mStack[parentIndex].mContent; + PRInt32 numFlushed = mCurrentContext->mStack[parentIndex].mNumFlushed; + PRInt32 insertionPoint = mCurrentContext->mStack[parentIndex].mInsertionPoint; + + // XXX: I have yet to see a case where numFlushed is non-zero and + // insertionPoint is not -1, but this code will try to handle + // those cases too. + + if (insertionPoint != -1) { + NotifyInsert(parent, mBody, insertionPoint - 1); + } + else { + // XXX: Would it be better to use |parent->ChildCount() - 1| so that + // we don't cause notifications for the <head> element and it's + // children? + + NotifyAppend(parent, numFlushed); + } + } This is a bit scary, but the alternative is even scarier (changing SinkContext::FlushTags() to flush at the top of the stack, which it doesn't do at the moment). John, the reason this is needed is that the code that flushes the sink context in the content sink never flushes at the root element level, IOW, the document is never told, by the sink context, about children of the document element. This is probably not the ideal thing to do. But this is what we've got, and I'm afraid to change that, at least at this point, and in conjunction with this change. @@ -3755,6 +3798,24 @@ nsCOMPtr<nsIPresShell> shell; mDocument->GetShellAt(i, getter_AddRefs(shell)); if (shell) { + + // Make sure we don't call InitialReflow() for a shell that has + // already called it. This can happen when the layout frame for + // an iframe is constructed *between* the Embed() call for the + // docshell in the iframe, and the content sink's call to OpenBody(). + // (Bug 153815) + + PRBool didInitialReflow = PR_FALSE; + shell->GetDidInitialReflow(&didInitialReflow); + if (didInitialReflow) { + // XXX: The assumption here is that if something already called + // InitialReflow() on this shell, it also did some of the + // setup below, so we do nothing and just move on to the next + // shell in the list. + + continue; + } + // Make shell an observer for next time shell->BeginObservingDocument(); Kin, I think you should avoid doing this whole loop over all the shells if didInitialReflow is true since if initial reflow already happened the pres shell is already observing the document, and the view manager is already enabled. So just don't loop if initial reflow has already happened. - In nsIPresShell.h: + NS_IMETHOD GetDidInitialReflow(PRBool *aDidInitialReflow) = 0; I'd make this NS_IMETHOD_(PRBool) and return true/false and ignore the nsresult. It's about time we start de-COMtaminating nsIPresShell too, IMO. Your call, tough. sr=jst
Attachment #97675 - Flags: superreview+
Duh. Ignore my ranting about not doing the second loop, I'm on crack, or something. I wasn't reading the thing correctly before commenting. Jkeiser set me straight.
Comment on attachment 97675 [details] [diff] [review] Alternate Patch Rev 1 (Modifies PresShell, DocViewer, and ContentSink) r=jkeiser
Attachment #97675 - Flags: review+
Comment on attachment 97675 [details] [diff] [review] Alternate Patch Rev 1 (Modifies PresShell, DocViewer, and ContentSink) a=asa (on behalf of drivers) for checkin to 1.2a.
Attachment #97675 - Flags: approval+
Fix checked in. I did not make the PRBool return value change, even though it would be nice, in the interests of expediency and getting some testing for this slightly scary change :)
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
We want this on the 1.0 branch, as well, right? If yes, then let's ask Randall or Jud for 1.0 Drivers' approval as well.
No longer blocks: 143047
Whiteboard: [adt2 rtm] custrtm- [ETA Needed] → [adt2] custrtm- [ETA 09/06]
The iframe race condition fixed by attachment 97675 [details] [diff] [review] should not happen on the MOZILLA_1_0_BRANCH, unless jst's changing iframe visibility dynamically are also there.
moied: can you pls verify this as fixed on the trunk? thanks! Adding edt1.0.2 to nominate for for inclusion on the 1. branch.
Keywords: edt1.0.2
Whiteboard: [adt2] custrtm- [ETA 09/06] → [adt2] custrtm- [ETA 09/09]
Thanks Kin. Removing Mozilla1.0.2, and edt1.0.2 nominations for the 1.0 branch.
Is talkback ID TB10483117X in BuildID 2002090415 on WinXP likely to have been caused by this bug?
Ian, if you provide the URL you visited prior to closing the window, I can check in the debugger if it is related.
Ian: Looking at you incident, yes...it is the same crash as reported in this bug with the nsImageListener::FrameChanged stack signature.
Hi, I'd really like to download and test a build with the latest patch, is there a windows version available somewhere compiled? Thanks Nick
kin: It was some page on http://metalink.oracle.com not sure which one because I was all over the place.
*** Bug 167179 has been marked as a duplicate of this bug. ***
*** Bug 167372 has been marked as a duplicate of this bug. ***
Looks like Alternate Patch Rev 1 (attachment 97675 [details] [diff] [review]) caused regression bug 167355 which is a case where we recycle a presShell that has already had it's InitialReflow() method called.
*** Bug 167513 has been marked as a duplicate of this bug. ***
*** Bug 146027 has been marked as a duplicate of this bug. ***
*** Bug 167844 has been marked as a duplicate of this bug. ***
*** Bug 168558 has been marked as a duplicate of this bug. ***
*** Bug 169786 has been marked as a duplicate of this bug. ***
Blocks: majorbugs
*** Bug 172943 has been marked as a duplicate of this bug. ***
*** Bug 172985 has been marked as a duplicate of this bug. ***
*** Bug 166398 has been marked as a duplicate of this bug. ***
*** Bug 175586 has been marked as a duplicate of this bug. ***
jkeiser marked this one fixed on 9/05, but we have incidents with the nsImageWin::DrawToImage signature in builds postdating that checkin. John/Kin, Are there stil issues remaining to be addressed in this bug? M120A (nsImageWin::DrawToImage): 15 Unique Users 14 15 Windows 15 2002091014 (12397598 - 2002091014) - [Windows NT 5.0 build 2195]: Opened web page. URL: www.svt.se (12800902 - 2002091014) - [Windows NT 5.0 build 2195]: I had opened a couple pages in new tabs URL: http://slashdot.org/ M120B (nsImageWin::DrawToImage): 9 Unique Users 9 9 Windows 9 2002101612 Trunk (nsImageWin::DrawToImage): 1 Unique Users 1 1 Windows 1 2002101617
*** Bug 180532 has been marked as a duplicate of this bug. ***
*** Bug 175388 has been marked as a duplicate of this bug. ***
*** Bug 173163 has been marked as a duplicate of this bug. ***
*** Bug 181680 has been marked as a duplicate of this bug. ***
*** Bug 182606 has been marked as a duplicate of this bug. ***
URLs all WFM in 2003041009/win2k, no new dupes, and none of the stack signatures show in talkback summaries... marking verified.
Status: RESOLVED → VERIFIED
No longer blocks: majorbugs
Summary: crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] [ @0x00000000][@ nsImageWin::DrawToImage] (remaining issues from bug 144315) → crash in Trunk M1BR [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] [@ @0x00000000][@ nsImageWin::DrawToImage] (remaining issues from bug 144315)
Crash Signature: [@ nsImageListener::FrameChanged] [@ nsImageFrame::FrameChanged] [@ nsImageBoxListener::OnStopDecode] [@ nsTableCellFrame::SetColIndex] [@ @0x00000000] [@ nsImageWin::DrawToImage]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: