The recent orange on the coffee tinderbox is jrgm's page-load test dying (on the eighth page, or so, in the test). I duplicated this in my (tip) debug build, and got the following stack trace: #0 0x419565b6 in nsImageListener::FrameChanged (this=0x88525b8, aContainer=0x8b084d8, aContext=0x87b77f0, newframe=0x8a17990, dirtyRect=0xbffff370) at nsImageFrame.cpp:1638 #1 0x4179a371 in imgRequestProxy::FrameChanged (this=0x88526d8, container=0x8b084d8, cx=0x0, newframe=0x8a17990, dirtyRect=0xbffff370) at imgRequestProxy.cpp:260 #2 0x41797a7f in imgRequest::FrameChanged (this=0x8852b78, container=0x8b084d8, cx=0x0, newframe=0x8a17990, dirtyRect=0xbffff370) at imgRequest.cpp:375 #3 0x4179378c in imgContainer::Notify (this=0x8b084d8, timer=0x8789a50) at imgContainer.cpp:439 #4 0x4155e370 in nsTimerGtk::FireTimeout (this=0x8789a50) at nsTimerGtk.cpp:186 [and so forth...] It looks as though |this| is getting clobbered in memory, maybe. I'm not certain, but I think it might be related to pavlov's May 3rd fix for 78015 combined with mstoltz's enabling checkin for 78831. Other folks on the hook (just looking brainlessly at tinderbox and seeing when solid orange started) are waterson, hewitt, danm, yokoyama, ddrinan, and nhotta (cc'ing).
More info: it's dying on the "My Compuserve" page. I'll attach a full stack trace and some variables in a moment. Setting as smoketest blocker.
This happens on windows, too, but it looks an awful lot like a dependency problem. (On one side of the call, the object looks fine; on the other side, it thinks the mRefCnt slot is the mFrame slot.) I'm clobbering now to see if it goes away. dr: did you start with a clean tree?
Maybe we need "yet another" clobber of coffee ... mcafee?
waterson: My debug build has one change in it, in nsCSSFrameConstructor's internal logic. I think that isn't affecting things for me -- I think I'm seeing the same problem as is being seen on coffee. It is a depend build, though, not clobber, if that's useful info... If this is happening on windows, is the only reason why the windows tinderboxen are green instead of orange that they don't run the page loader tests? Platform, OS -> All
They don't run the page load test. In fact, I don't think they (currently) run any tests. So, as long as they compile ... green.
RH7.1 w/gcc2.96-81 slightly different backtrace, the beginning looks like this: #0 0x00000029 in __strtol_internal (nptr=0x918f5b4 "\200pLA", endptr=0x8965eb8, base=-1073746288, group=0) at eval.c:40 #1 0x4139e9fa in nsImageFrame::FrameChanged () from /home/dark/DISK/mozilla/dist/bin/components/libgklayout.so #2 0x413a12b8 in nsImageListener::FrameChanged () from /home/dark/DISK/mozilla/dist/bin/components/libgklayout.so #3 0x41326271 in imgRequestProxy::FrameChanged () from /home/dark/DISK/mozilla/dist/bin/components/libimglib2.so etc.
Looking more closely at: http://bonsai.mozilla.org/cvsquery.cgi?module=MozillaTinderboxAll&date=explicit&mindate=989949120&maxdate=989952659 (which is the cycle when coffee went orange), I can't see how any of these would have caused this problem. Removing some cc's of the obviously innocent. The one checkin at that time which confuses me is yokoyama's. The checkin comment doesn't have what seems to be the right bug number, so I'm just having trouble deciphering this. It probably doesn't have anything to do with this either, though... Maybe the fact that it's dying on the "My Compuserve" page is useful information? Dunno.
I havn't checked anything in that would have just started causing this at all to this. Start looking at the other checkins.
dr: the patches you are talking about have been in the tree for a long time. I really don't think they are coming back to bite us in the ass right now.
1) I did a full clobber build on win2k; still crashing at the same point 2) I looked at the content, and after experimenting a bit, it turns out that it depends on the page that it is leaving, the "static" copy being http://jrgm.mcom.com/perf/loadtime5/base/web.icq.com/index.html 3) But going to the static copy is not enough, it's only when the page loads are changed together that I can duplicate this crash. (Some (perhaps uncancelled) timer issue?) 4) I noticed the the first interval of orange was crashing on page "38", but that consistently this was crashing on page "8" since then. So, making an assumption that maybe the first crash was '**** happens', then maybe the current crash was due to checkins in the second interval. I tried backing out this checkin from waterson, cvs update -j3.431 -j3.430 mozilla/layout/html/base/src/nsBlockFrame.cpp and, for reasons that I cannot explain that fixes this crash. I tried backing out and then reapplying twice each, and I crash everytime with in applied, and never crash with it backed out. Could someone try this on Linux with a current build (do we even know if this is happening on Mac). Removing the "NOTMYBUG" but leaving with pavlov, since he's coming back from Denny's in a bit I think, and perhaps he can see if this fixes the crash for him. To save time in debugging this (and show that it is the page exited that is driving the crash), I set up a simple test at http://jrgm.mcom.com/yapl-crash/index.html that loops through only three pages, including compuserve and icq, but in a different order (msnbc->compuserve->icq->msnbc...). It will crash as msnbc is shown for a second time.
I don't have any tip builds right now (that arn't ripped apart). reassigning to waterson. let me know if i can help
This seems like it's probably the same as bug 80203, which has a lot of gdb output but doesn't seem to be well-owned right now. The exact location of the crash varied a little on that bug, although some of the stacks here are another stack frame up from anything noted there.
Back to pavlov. The problem is that the imgRequestProxy::Cancel() method isn't being called, its mListener pointing to an nsImageListener on a dead frame. This really isn't my bug, pav. Backatcha.
Allright, taking this back. Looks like my checkins for bug 43914 caused this regression. Backing those out fixes the crash.
Backed out; added dependency to 43914.
waterson: is this really fixed? why are we still seeing orange in tinderbox?
I ran jrgms loadtime test on the linux build 2001-05-16-12-trunk and mac 2001-05-16-12-trunk. It worked fine on both...cruised right on through icq.com.
This isn't fixed. I just crashed with this stack trace which includes your fixes: #4 <signal handler called> #5 0x4181455c in nsFrame::Invalidate (this=0x8e81f8c, aPresContext=0x8df70b0, aDamageRect=@0xbffff250, aImmediate=0) at nsFrame.cpp:2197 #6 0x418225b6 in nsImageFrame::FrameChanged (this=0x8e81f8c, aContainer=0x8e83210, aPresContext=0x8df70b0, aNewFrame=0x8e894b8, aDirtyRect=0xbffff3a0) at nsImageFrame.cpp:430 #7 0x41824eec in nsImageListener::FrameChanged (this=0x8e868a8, aContainer=0x8e83210, aContext=0x8df70b0, newframe=0x8e894b8, dirtyRect=0xbffff3a0) at ../../../../dist/include/nsCOMPtr.h:642 #8 0x410b85b1 in imgRequestProxy::FrameChanged (this=0x8e84b38, container=0x8e83210, cx=0x0, newframe=0x8e894b8, dirtyRect=0xbffff3a0) at imgRequestProxy.cpp:260 #9 0x410b6fb0 in imgRequest::FrameChanged (this=0x8e83050, container=0x8e83210, cx=0x0, newframe=0x8e894b8, dirtyRect=0xbffff3a0) at imgRequest.cpp:375
Didn't think so <grumble> ... Pavlov, I know you checked in a week or so ago, but it looks like mstoltz's checkin to nsImageFrame.cpp (rev 1.170) yesterday activated your code. That's my reason for suspecting that your checkin might conceivably be biting us in the ass only now. (OTOH, I haven't looked at the code, sooo...)
I suspect you are correct. OK, what is the real fix here? Who needs to own this and get it fixed?
Urgh. Still looking at this.
over to pavlov, after we put both pavlov and waterson in the ring.
Ok, someone is freeing the frame but not calling its Destroy() method. mFrame gets nulled out in the imageframe's Destroy() method.... investigating checkins to layout.
The image frame that is not being destroyed is a direct child of a <td> frame.
...I see the <td>'s frame get destroyed, but not the img frame.
back over to waterson.
waterson wrote: > The image frame that is not being destroyed is a direct child of a <td> frame. That's an interesting clue. That shouldn't happen. It means that an assumption is broken. The _direct and sole_ child of a <td> frame should be a block frame (not an image frame or anything else). The table code relies on this assumption. Unpredictable disaster may strike if the assumption is broken.
Or did waterson mean that it was the nsBlockFrame that was the inner frame for a TD element? (When I was poking at bug 80203 in the debugger the I think the parent was an nsBlockFrame.)
Is there a public URL which triggers this bug?
I've investigated further. This is indeed due to my changes in bug 43914, where I'll attach a test case that illustrates the problem.
*** Bug 81063 has been marked as a duplicate of this bug. ***
Verified fixed per email with jrgm, thanks for the help John!