Closed Bug 669903 Opened 13 years ago Closed 12 years ago

double unlink crashes [@ nsGenericHTMLElement::ClearDataset()] nsDOMStringMap's Unlink

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla10
Tracking Status
firefox9 --- affected
firefox10 --- verified
firefox11 --- verified

People

(Reporter: martijn.martijn, Assigned: wchen)

References

Details

(Keywords: crash, verified-beta, Whiteboard: [qa!][testday-20120203])

Crash Data

Attachments

(1 file, 3 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-66910cea-e075-42a7-919b-98a242110705 .
============================================================= 
0 	xul.dll 	nsGenericHTMLElement::ClearDataset 	content/html/content/src/nsGenericHTMLElement.cpp:394
1 	xul.dll 	nsDOMStringMap::cycleCollection::Unlink 	content/html/content/src/nsDOMStringMap.cpp:54
2 	xul.dll 	nsCycleCollector::CollectWhite 	xpcom/base/nsCycleCollector.cpp:1962
3 	xul.dll 	nsCycleCollector::FinishCollection 	xpcom/base/nsCycleCollector.cpp:2784
4 	xul.dll 	nsCycleCollectorRunner::Collect 	xpcom/base/nsCycleCollector.cpp:3437
5 	xul.dll 	nsCycleCollector_collect 	xpcom/base/nsCycleCollector.cpp:3503
6 	xul.dll 	nsJSContext::CycleCollectNow 	dom/base/nsJSEnvironment.cpp:3271
7 	xul.dll 	CCTimerFired 	dom/base/nsJSEnvironment.cpp:3311
8 	xul.dll 	nsTimerImpl::Fire 	xpcom/threads/nsTimerImpl.cpp:424
9 	xul.dll 	nsTimerEvent::Run 	xpcom/threads/nsTimerImpl.cpp:520
10 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:617
11 	xul.dll 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:134
12 	xul.dll 	xul.dll@0xb79167 	
13 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:202
14 	xul.dll 	xul.dll@0x37125f 	
15 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:176
16 	xul.dll 	nsEvent::nsEvent 	obj-firefox/dist/include/nsGUIEvent.h:569
17 	xul.dll 	nsBaseAppShell::Run 	widget/src/xpwidgets/nsBaseAppShell.cpp:189
18 	xul.dll 	nsAppShell::Run 	widget/src/windows/nsAppShell.cpp:256
19 		@0x7530ffff 

While testing, I'm seeing this crash occuring. Sorry, atm I don't have a testcase for this crash.
Product: Firefox → Core
QA Contact: general → general
Is this a null pointer crash? null slots (maybe even null element) + offset to
mDataset.

Is it possible that something is keeping nsDOMStringMap alive after
first unlink. Then it gets unlinked again, and mElement is already null.
Yeah, looking at the code I couldn't make sense of this.

As long as the dataset is alive, the element should be alive too. *Especially* in the places where ClearDataset is called since we only ever clear the strong mElement reference after calling ClearDataset.

And as long as the dataset is alive, the element should be pointing to it, as that pointer is set up as soon as the dataset is created, and only cleared once the dataset goes away.

And the slots are only deleted when the element is deleted, which again should only happen when the dataset is going away.


Or, at least that was my initial analysis. I just realized that there is one situation when this isn't true. Specifically, if we end up cycle collecting a dataset more than once we'll end up dereferencing a null mElement pointer in the unlink code. I suppose that can happen.

So what we should do is to add a if (mElement) check to the unlink code and see if that helps.


William, can you try that?
Assignee: nobody → wchen
Sorry, I haven't been able to get a minimized or unminimized testcase yet.
I can tell you that in order for this crash to trigger is when I do something like this setTimeout(function(){document.write(document.documentElement.innerHTML)}, 1000);. And the testcase probably keeps a whole lot of dataset references somewhere.
Attached patch Possible fix for crash (obsolete) — Splinter Review
I added a check for the element prior to clearing the dataset. No test case because I was not able to reproduce the crash. I tried a few things but I was not able to get the unlink code to run more than once.
Attachment #548231 - Flags: review?(jonas)
Comment on attachment 548231 [details] [diff] [review]
Possible fix for crash

Add an assertion as well, as there's definitely a bug somewhere if we unlink twice.
Attachment #548231 - Flags: review?(jonas) → review+
Attached patch Possible fix for crash (obsolete) — Splinter Review
Added assertion and comment.
Attachment #548231 - Attachment is obsolete: true
Attachment #548246 - Flags: review?(jonas)
Attached patch Possible fix for crash (obsolete) — Splinter Review
Forgot to add r=sicking
Attachment #548246 - Attachment is obsolete: true
Attachment #548250 - Flags: review?(jonas)
Attachment #548246 - Flags: review?(jonas)
Is this ready to land?
Jesse came up with a test case that looks very similar to this in bug 706281, that only triggers a crash after I landed the weak map cycle collector integration in bug 668855.
In the first CC in Jesse's test case, the CC identifies the nsDOMStringMap as garbage, so it is unlinked.  There's only one reference to the DOMStringMap, an XPCWrappedNative, which is also identified as garbage and has a single reference.  But both of them show up again in the next CC.  I'm not sure why that is.  So you end up with a double unlink.  The attached patch seems like the way to go.  Double unlinks are bad, but they do happen.  I know of a few ways they can be caused.
How bad are double unlinks? Should the assertion be removed from the patch, or should a bug be filed on fixing the assertion?
They are kind of bad, but probably not anything worse than a controlled crash.  Unlinks usually just null out references, so I think the worst thing that will happen is a null deref.  But I don't think anybody has really audited unlinks.
Summary: crash [@ nsGenericHTMLElement::ClearDataset()] → double unlink crashes nsDOMStringMap's Unlink
I've always tried to write unlink code so that it can handle more than one unlinking -
I think most of our unlink handling is like that.
Summary: double unlink crashes nsDOMStringMap's Unlink → double unlink crashes [@ nsGenericHTMLElement::ClearDataset()] nsDOMStringMap's Unlink
Maybe I'm reading the code wrong, but it looks like the wrappee of an XPCWrappedNative isn't unlinked by the XPCWN's unlink, so it will stay alive until the next GC.  Which means that double unlinks are probably fairly common.  I don't know.  One XPCWN I looked at disappeared from one log to another, but a second one didn't.
I just removed the assertion from William's last patch.  Double unlinks will happen whenever a wrapped native is marked as garbage, and the cycle collector runs twice in a row without a GC in between, so I think it shouldn't be an assertion.
Attachment #548250 - Attachment is obsolete: true
Attachment #579901 - Flags: review?(jonas)
Comment on attachment 579901 [details] [diff] [review]
check for double unlink

Low risk fix (add a null check), but also an extremely rare crash, so I am flagging for Aurora but not Beta.  I could only find 9 crashes with this signature in the last week across all versions.
Attachment #579901 - Flags: approval-mozilla-aurora?
Crash Signature: [@ nsGenericHTMLElement::ClearDataset()] → [@ nsGenericHTMLElement::ClearDataset]
https://hg.mozilla.org/mozilla-central/rev/271ba479719f
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Attachment #579901 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/28320cfb9cac
Component: General → DOM
OS: Windows NT → All
QA Contact: general → general
Hardware: x86 → All
Target Milestone: mozilla11 → mozilla10
Is this something QA can verify?
Whiteboard: [qa?]
The steps in bug 706281 comment 0 triggered this crash.
Whiteboard: [qa?] → [qa+]
Verified as fixed with the steps referenced in comment 22 on:
Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0
Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:10.0) Gecko/20100101 Firefox/10.0

I have also verified the crash stats from the Socorro interface and this crash didn't reproduce on Firefox 10 anymore.
Keywords: verified-beta
Whiteboard: [qa+] → [qa+][qa!:10]
Verified as fixed with the steps referenced in comment 22 on:
Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0
Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:11.0) Gecko/20100101 Firefox/11.0
Status: RESOLVED → VERIFIED
Whiteboard: [qa+][qa!:10] → [qa!][testday-20120203]
The crash status from the Socorro interface were also verified and this crash didn't happen on Firefox 10 or newer anymore.
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.