Closed Bug 338595 Opened 18 years ago Closed 18 years ago

Microsoft C++ Library Runtime Error (R6025 - pure virtual function call) [@ purecall - ClassifyWrapper]

Categories

(Core :: DOM: Events, defect)

defect
Not set
critical


RESOLVED FIXED

People

(Reporter: stefan.vallaster, Unassigned)

Details

(Keywords: crash)

Attachments

(5 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060517 Minefield/3.0a1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060517 Minefield/3.0a1

Closed Minefield and a warning message from Microsoft C++ Library shows up

Runtime Error 
Programm: C:\...\firefox.exe
R6025
- pure virtual function call
[OK]

after pressing ok - minefield crashes
before pressing ok it works well and you can surf (strange)

Reproducible: Didn't try

Actual Results:  
after ok ff hangs 
c++ runtime error
Can you get a stack trace or provide steps to reproduce?  Otherwise this bug isn't much more useful than "Firefox crashed once".

(Why do multiple compilers treat pure virtual function calls so differently from, say, segmentation faults?  For example, gcc on Mac and Linux makes Firefox "abort" in bug 338312, breaking the OS crash dialog on Mac.  According to timeless in bug 291250, MSVC's behavior breaks Talkback on Windows.)
Severity: major → critical
Keywords: crash
i bet we could override the pure virtual handler and cause a crash, but yeah, the result is currently this dialog, and unless you have a build w/ symbols and a debugger, you can't do anything with it.

i suspect the reason for the behavior is that standard null function pointers (which is the default internal theoretical implementation) when called result in call stacks that look like this:

0x0

-that's all-

your average debugger isn't very happy with that, and developers can't do anything with it.

compare:

_abort
_purevirt
_lame_function_that_called_pure_virt
Why would the call stack have only 0x0 on it?  That's not consistent with what I see in Talkback, etc.
Stroustrup says in "Design & Evolution of C++" 13.2.3 "My implementation places a pointer to a function called __pure_virtual_called in the vtbl; this function can then be defined to give a reasonable run-time error".

Don't you agree that a "pure virtual function called" message is better than a "program crashed" message?
Not when it interferes with getting a stack trace.
if you jump to 0x0 in certain cases, then that's really the entire stack. there are a couple of crashes floating around where that's all you'll see.

the point is that the implementations decided that instead of doing that, they'd provide this purevirtualfunction and have that be the called method.

as i said, i'm pretty sure we could cause our builds to override the default impl with something talkback would catch....
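A rough sketch of what such an override might look like on MSVC, using the CRT's `_set_purecall_handler` hook. This is only an illustration, not what any Mozilla build does: the names `CrashOnPureCall` and `InstallPureCallCrashHandler` are made up here, and the deliberate null write is just one assumed way to turn the purecall into a fault that a crash reporter could see. On other compilers the function does nothing and returns false.

```cpp
#include <stdlib.h>

#if defined(_MSC_VER)
// Hypothetical handler: instead of letting the CRT show the R6025 dialog,
// fault immediately so the purecall caller is still on the stack.
static void __cdecl CrashOnPureCall(void) {
  // Assumption: a write through a null pointer raises an access violation
  // that an installed crash reporter would catch.
  *reinterpret_cast<volatile int*>(0) = 0;
}
#endif

// Returns true if a replacement purecall handler was installed.
bool InstallPureCallCrashHandler() {
#if defined(_MSC_VER)
  _set_purecall_handler(CrashOnPureCall);
  return true;
#else
  // Other runtimes use a different mechanism; nothing is installed here.
  return false;
#endif
}
```

Something like this would have to run early in startup, before any object whose destructor might hit a pure virtual slot gets torn down.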
Confirming as I get this quite often on Firefox shutdown.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Component: General → DOM: Events
Product: Firefox → Core
Version: unspecified → Trunk
jesse: ctho kindly fell into this bug, so my analysis will appear here shortly.
Summary: Microsoft C++ Library Runtime Error (R6025 - pure virtual function call) → Microsoft C++ Library Runtime Error (R6025 - pure virtual function call) [@ purecall - ClassifyWrapper]
IRC transcript:
[stuff about pure virtual calls dying]
(2:27:37 PM) <timeless> unfortunately in gecko
(2:27:44 PM) <timeless> we kinda use the windows event loop
(2:28:05 PM) <timeless> so instead of the app spinning waiting for you to bring the developer to the dialog ...
(2:28:12 PM) <timeless> gecko uses the event loop to kill itself
(2:28:38 PM) <timeless> 03 js3250!JS_DHashTableOperate+0x51
(2:28:59 PM) <timeless> 4c js3250!JS_DHashTableEnumerate+0xce
(2:29:00 PM) <timeless> sorry
(2:29:11 PM) <timeless> 0x03 is asserting that it shouldn't be under 0x4c
(2:29:14 PM) <timeless> and it's right
(2:29:25 PM) <timeless> but, the bug is the purecall in 0x41
(2:29:43 PM) <timeless> anything past that is a waste  of time that you can't easily recognize w/o symbols for the runtime+pos
(2:30:02 PM) <timeless> as for debugging the actual core bug
(2:30:05 PM) <timeless> i suppose we could do that
(2:30:09 PM) <timeless> .frame 42
(2:32:44 PM) <CTho>   WrapperSCCEntry *SCCEntry = NS_STATIC_CAST(WrapperSCCEntry*,
(2:32:44 PM) <CTho>     PL_DHashTableOperate(&sWrapperSCCTable, entry->participant->GetSCCIndex(),
(2:32:44 PM) <CTho>                          PL_DHASH_ADD));
(2:32:52 PM) <timeless> ctho: ok
(2:32:58 PM) <timeless> entry->participant
(2:33:27 PM) <CTho> 0x15b31300 class nsIDOMGCParticipant *
(2:36:03 PM) <timeless> ctho: can you turn on *all* of the buttons in calls?
(2:36:51 PM) <timeless> .frame 60
(2:37:06 PM) <timeless> possibly .frame 61
(2:37:11 PM) <timeless> kinda depends
(2:37:16 PM) <timeless> i think .frame 61
(2:37:31 PM) <timeless> ctho: what branch are you using?
(2:37:56 PM) <timeless> 182     nsContentUtils::RemoveListenerManager(this);
(2:37:58 PM) <timeless> from .frame 61
(2:38:02 PM) <timeless> ok, so the problem is that
(2:38:15 PM) <timeless> this nsHTMLAnchorElement creature
(2:38:30 PM) <timeless> in .frame 6a/.frame 6b
(2:38:34 PM) <timeless> was dying
(2:38:36 PM) <timeless> unfortunately for us
(2:38:43 PM) <timeless> in its process of dying
(2:39:06 PM) <timeless> it kinda let go of something else, looks like maybe an event listener it had
(2:39:28 PM) <timeless> the event listener let go of a js context (probably for the js function that was the event listener)
(2:40:14 PM) <timeless> the release of the js context resulted in a gc
(2:40:24 PM) <timeless> the gc then walked around and found a reference to the anchor
(2:40:30 PM) <timeless> it then asked the anchor to mark itself
(2:40:38 PM) <timeless> unfortunate, the anchor was in the process of being very dead
(2:40:46 PM) <timeless> clear as mud?
(2:40:54 PM) <timeless> s/,/ly,/
(2:40:57 PM) <CTho> yup
(2:41:19 PM) <timeless> stack trace+my commentary into any existing purecall bug that seems to have classifywrapper skid marks
(2:41:20 PM) <timeless> or make your own
(2:41:23 PM) ***timeless doesn't care
(2:42:12 PM) <timeless> cc jst+smaug+bz+dbaron
(2:42:24 PM) <timeless> and everyone involved in nsDOMGCParticipantSH
Looks like it may be a regression from bug 334075.  Moving code from a destructor of a more-derived class to a destructor of a base class can be dangerous if that code relies on the more-derived class still being around, which this code does, since it needs an implementation of GetSCCIndex (and, for certain debugging code, QueryInterface).
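To illustrate the destructor point with a toy example (the classes `Participant` and `Element` here are hypothetical stand-ins, not the Gecko ones): once the more-derived destructor has finished, virtual calls made from the base destructor resolve to the base class's version. If that base version were pure virtual, as GetSCCIndex presumably is on the participant interface, the call would land in the purecall handler instead.

```cpp
#include <string>
#include <vector>

// Records what each destructor observes, in order.
static std::vector<std::string>& Log() {
  static std::vector<std::string> log;
  return log;
}

struct Participant {
  virtual ~Participant() {
    // By now the Element part is gone, so this calls Participant::Describe.
    Log().push_back("~Participant sees " + Describe());
  }
  virtual std::string Describe() const { return "Participant"; }
};

struct Element : Participant {
  ~Element() override {
    // The Element part is still alive, so this calls Element::Describe.
    Log().push_back("~Element sees " + Describe());
  }
  std::string Describe() const override { return "Element"; }
};

// Destroys one Element and returns what the two destructors logged.
std::vector<std::string> DestroyElementAndCollectLog() {
  Log().clear();
  { Element element; }
  return Log();
}
```

The returned log has the Element destructor seeing "Element" and the Participant destructor seeing "Participant", which is why cleanup code that needs the derived implementation cannot safely move down into the base destructor.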
Blocks: 334075
That said, something else seems pretty broken if ReleaseWrapper hasn't been called yet for a node that's being destroyed.
Er, ignore comment 12 -- we need separate ReleaseWrapper calls for each participant, and this one (presuming entry->key != entry->participant in frame 42) won't happen until the event listener manager is destroyed, which is what's happening on the stack.
Except no, that still doesn't make sense -- since the nsJSContext is being destroyed, this must be the last preserved wrapper, and ~nsMarkedJSFunctionHolder_base has the correct ordering:  ReleaseWrapper before NS_IF_RELEASE(obj).  So I don't see how the wrapper is still preserved.
I don't understand how this could have worked before. Even if the GetSCCIndex call would have worked, the QI to nsIDOMNode would not. Though maybe we were simply using the |else| code in this case.

But I do think it looks strange that JS still has pointers to these nodes as late as this.
Assignee: nobody → events
QA Contact: general → ian
So could it be that the node being destroyed is still on the parent chain of some node that JS references?  ClassifyWrapper does walk the parent chain.

I'd _think_ that by the time we're in ~nsINode we won't be the mParent of anything anymore, though...
Where does it walk the parent chain?
It walks it in nsGenericElement::GetSCCIndex if the element is not in a document...

That said, I'm having issues with the posted stacks not actually matching nsDOMClassInfo.cpp from the day they were posted. What version of nsDOMClassInfo.cpp are those line numbers for, Ctho?
Unless there's tail-call optimization happening (which I don't see an opportunity for), it didn't get in to nsGenericElement::GetSCCIndex -- and couldn't have, since it was already in ~nsINode for that element.
So that brings me back to my question in comment 20 -- on what exact line of code is the pure virtual call happening?
I:\smtrunk1>cvs status mozilla/dom/src/base/nsDOMClassInfo.cpp
===================================================================
File: nsDOMClassInfo.cpp        Status: Needs Patch

   Working revision:    1.382
   Repository revision: 1.385   /cvsroot/mozilla/dom/src/base/nsDOMClassInfo.cpp,v
In that revision,

5169 dbaron        1.334     PL_DHashTableOperate(&sWrapperSCCTable, entry->participant->GetSCCIndex(),

5170 dbaron        1.269                          PL_DHASH_ADD));

I assume that you have no local changes, right?

The stack has:

42 0012d95c 100053be 01eee544 125eba80 00000a9a gklayout!ClassifyWrapper(struct PLDHashTable * table = 0x01eee544, struct PLDHashEntryHdr * hdr = 0x125eba80, unsigned int number = 0xa9a, void * arg = 0x0012d9b8)+0x21 (FPO: [Non-Fpo]) (CONV: cdecl) [i:\smtrunk1\mozilla\dom\src\base\nsdomclassinfo.cpp @ 5170]

So entry->participant seems to be dead...

Can you catch this in a debugger again?  What's entry->key?  Is it the JS event listener or marked function holder or something?  That's the only way I can see this happening, I think.  And if that's the case, then we probably need to nsContentUtils::RemoveListenerManager from whatever destructors would still have everything needed for the GC stuff working...

And if so, then we just need to 
> I assume that you have no local changes, right?

configure mailnews/import/Makefile.in toolkit/content/widgets/tabbrowser.xml xpfe/browser/resources/content/navigatorOverlay.xul xpfe/browser/resources/content/nsBrowserStatusHandler.js 
xpfe/communicator/resources/content/contentAreaClick.js 
xpfe/components/bookmarks/resources/bookmarksMenu.js 
xpfe/global/resources/content/bindings/tabbrowser.xml

> Can you catch this in a debugger again?

If it crashes me again, sure.  Otherwise, I did dump some 107MB thing for timeless from WinDbg - maybe it's got what you need in it?
FYI, I got one of these alerts a few days ago using Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060601 Minefield/3.0a1

and since today's nightly automatic update to Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060606 Minefield/3.0a1 I've had about 6 of these crashes.  Not all occurred when closing windows, I think one was while doing a Ctrl-F search.  These crashes have never triggered the Talkback agent.
that's right, and until someone like me writes some evil code, they never will trigger talkback. note that we do have code in cvs.mozilla.org that could enable us to replace these functions, but it's a wee bit of a pain. and I'm currently officially a linux hacker...
*** Bug 340770 has been marked as a duplicate of this bug. ***
Here are a couple crash logs from Mac OS X that appear to be very similar, if not identical, to the WinXP stack.

cl
OS/Platform = All.
OS: Windows XP → All
Hardware: PC → All
I just hit this again; when I had it in the debugger I forgot that it wasn't well understood, and figured it was already known.
Flags: blocking1.9a1+
So one possible solution here is to make us kill the listenermanager in the nsGenericElement dtor. I would be fine with doing that, but it would be good to know if that is a good fix, or just wallpapering another problem.
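A toy model of that ordering question, assuming a much simplified setup: `Registry`, `Collect`, `LateNode` and `EarlyNode` are all invented for illustration, and `Collect` stands in for the GC pass that the listener release provokes. When cleanup runs in a base destructor, the pass still finds the object but only sees its base behaviour; when cleanup runs in the most-derived destructor, the object is already unregistered by the time the pass runs.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Participant {
  virtual ~Participant() = default;
  virtual std::string Kind() const { return "base-only"; }
};

static std::vector<Participant*>& Registry() {
  static std::vector<Participant*> registry;
  return registry;
}
static std::vector<std::string>& Visited() {
  static std::vector<std::string> visited;
  return visited;
}

// Stand-in for a GC pass: asks every registered participant what it is.
static void Collect() {
  for (Participant* p : Registry()) Visited().push_back(p->Kind());
}
static void Unregister(Participant* p) {
  auto& r = Registry();
  r.erase(std::remove(r.begin(), r.end(), p), r.end());
}

// Cleanup in a base destructor: the derived part is already destroyed.
struct LateCleanupBase : Participant {
  ~LateCleanupBase() override { Collect(); Unregister(this); }
};
struct LateNode : LateCleanupBase {
  LateNode() { Registry().push_back(this); }
  std::string Kind() const override { return "node"; }
};

// Cleanup in the most-derived destructor: unregister first, then collect.
struct EarlyNode : Participant {
  EarlyNode() { Registry().push_back(this); }
  ~EarlyNode() override { Unregister(this); Collect(); }
  std::string Kind() const override { return "node"; }
};

std::vector<std::string> RunLateCleanup() {
  Registry().clear(); Visited().clear();
  { LateNode node; }
  return Visited();
}
std::vector<std::string> RunEarlyCleanup() {
  Registry().clear(); Visited().clear();
  { EarlyNode node; }
  return Visited();
}
```

In the late case the pass visits the dying object and gets the base answer; in the early case it visits nothing. That says nothing about whether moving the call is the right fix or just wallpaper, only that the ordering changes what the collector can reach.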
*** Bug 334027 has been marked as a duplicate of this bug. ***
Whenever I get a "Pure virtual function" crash, Talkback never comes up.
Just caught that someone else posted the same thing some time ago as my previous post, but...

When I launch process explorer right click on Firefox, choose debug and tell it to attach the debugger, Firefox crashes without triggering Talkback as well.


It's possible that this was fixed by some patches that went in recently (bug 345660 and bug 330689).  I'd be interested to hear if people still see this in trunk builds from 2006-08-04 or later.
Depends on: 286619
(In reply to comment #39)
> It's possible that this was fixed by some patches that went in recently (bug
> 345660 and bug 330689).  I'd be interested to hear if people still see this in
> trunk builds from 2006-08-04 or later.
> 

Sorry, still seeing this with current trunk builds. :-(
*** Bug 349427 has been marked as a duplicate of this bug. ***
Bug 349069 was fixed yesterday. It may have helped here.
So if anyone could confirm whether or not there are still
'pure virtual function call' crashes.
(In reply to comment #42)
> Bug 349069 was fixed yesterday. It may have helped here.
> So if anyone could confirm whether or not there are still
> 'pure virtual function call' crashes.


I think this helped, yes. I haven't been able to reproduce this problem since that bug was fixed.

(Camino builds).

I just experienced what seems like this bug in the Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060830 Minefield/3.0a1 build.  Right clicked on a link on tigerdirect.com and got the error.  Can't reproduce it though.

Windows XP SP2
Extensions:
  Adblock Plus 0.7.1.2
  PDF Download 0.7.4
  Tab Mix Plus 0.3.0.60819
(In reply to comment #44)
> I just experienced what seems like this bug in the Mozilla/5.0 (Windows; U;
> Windows NT 5.1; en-US; rv:1.9a1) Gecko/20060830 Minefield/3.0a1 build.  Right
> clicked on a link on tigerdirect.com and got the error.  Can't reproduce it
> though.
> 
> Windows XP SP2
> Extensions:
>   Adblock Plus 0.7.1.2
>   PDF Download 0.7.4
>   Tab Mix Plus 0.3.0.60819
> 

That is because the checkin for bug 349069 was backed out.  A new patch has been developed and is waiting for review.  I have been running for awhile with the new patch and have not encountered this problem since.
*** Bug 350705 has been marked as a duplicate of this bug. ***
This seems to have been fixed by the check-in for bug 349069.  If anyone is still seeing this with a build from 2006-09-03 or later please add a comment.
Bug 349069 should have fixed this. Marking FIXED.
Please re-open if you still see crashes.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
I still wonder if we should try to add assertions that would catch the problem in comment 15.
*** Bug 361746 has been marked as a duplicate of this bug. ***
Crash Signature: [@ purecall - ClassifyWrapper]