Closed Bug 669903 Opened 13 years ago Closed 13 years ago

double unlink crashes [@ nsGenericHTMLElement::ClearDataset()] nsDOMStringMap's Unlink

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla10
Tracking Status
firefox9 --- affected
firefox10 --- verified
firefox11 --- verified

People

(Reporter: martijn.martijn, Assigned: wchen)

References

Details

(Keywords: crash, verified-beta, Whiteboard: [qa!][testday-20120203])

Crash Data

Attachments

(1 file, 3 obsolete files)

This bug was filed from the Socorro interface and is report bp-66910cea-e075-42a7-919b-98a242110705 . ============================================================= 0 xul.dll nsGenericHTMLElement::ClearDataset content/html/content/src/nsGenericHTMLElement.cpp:394 1 xul.dll nsDOMStringMap::cycleCollection::Unlink content/html/content/src/nsDOMStringMap.cpp:54 2 xul.dll nsCycleCollector::CollectWhite xpcom/base/nsCycleCollector.cpp:1962 3 xul.dll nsCycleCollector::FinishCollection xpcom/base/nsCycleCollector.cpp:2784 4 xul.dll nsCycleCollectorRunner::Collect xpcom/base/nsCycleCollector.cpp:3437 5 xul.dll nsCycleCollector_collect xpcom/base/nsCycleCollector.cpp:3503 6 xul.dll nsJSContext::CycleCollectNow dom/base/nsJSEnvironment.cpp:3271 7 xul.dll CCTimerFired dom/base/nsJSEnvironment.cpp:3311 8 xul.dll nsTimerImpl::Fire xpcom/threads/nsTimerImpl.cpp:424 9 xul.dll nsTimerEvent::Run xpcom/threads/nsTimerImpl.cpp:520 10 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:617 11 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:134 12 xul.dll xul.dll@0xb79167 13 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:202 14 xul.dll xul.dll@0x37125f 15 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:176 16 xul.dll nsEvent::nsEvent obj-firefox/dist/include/nsGUIEvent.h:569 17 xul.dll nsBaseAppShell::Run widget/src/xpwidgets/nsBaseAppShell.cpp:189 18 xul.dll nsAppShell::Run widget/src/windows/nsAppShell.cpp:256 19 @0x7530ffff While testing, I'm seeing this crash occuring. Sorry, atm I don't have a testcase for this crash.
Product: Firefox → Core
QA Contact: general → general
Is this a null pointer crash? null slots (maybe even null element) + offset to mDataset. Is it possible that something is keeping nsDOMStringMap alive after first unlink. Then it gets unlinked again, and mElement is already null.
Yeah, looking at the code I couldn't make sense of this. As long as the dataset is alive, the element should be alive too. *Especially* in the places where ClearDataset is called since we only ever clear the strong mElement reference after calling ClearDataset. And as long as the dataset is alive, the element should be pointing to it, as that pointer is set up as soon as the dataset is created, and only cleared once the dataset goes away. And the slots are only deleted when the element is deleted, which again should only happen when the dataset is going away. Or, at least that was my initial analysis. I just realized that there is one situation when this isn't true. Specifically, if we end up cycle collecting a dataset more than once we'll end up dereferencing a null mElement pointer in the unlink code. I suppose that can happen. So what we should do is to add a if (mElement) check to the unlink code and see if that helps. William, can you try that?
Assignee: nobody → wchen
Sorry, I haven't been able to get a minimized or unminimized testcase yet. I can tell you that in order for this crash to trigger is when I do something like this setTimeout(function(){document.write(document.documentElement.innerHTML)}, 1000);. And the testcase probably keeps a whole lot of dataset references somewhere.
Attached patch Possible fix for crash (obsolete) — Splinter Review
I added a check for the element prior to clearing the dataset. No test case because I was not able to reproduce the crash. I tried a few things but I was not able to get the unlink code to run more than once.
Attachment #548231 - Flags: review?(jonas)
Comment on attachment 548231 [details] [diff] [review] Possible fix for crash Add an assertion as well, as there's definitely a bug somewhere if we unlink twice.
Attachment #548231 - Flags: review?(jonas) → review+
Attached patch Possible fix for crash (obsolete) — Splinter Review
Added assertion and comment.
Attachment #548231 - Attachment is obsolete: true
Attachment #548246 - Flags: review?(jonas)
Attached patch Possible fix for crash (obsolete) — Splinter Review
Forgot to add r=sicking
Attachment #548246 - Attachment is obsolete: true
Attachment #548250 - Flags: review?(jonas)
Attachment #548246 - Flags: review?(jonas)
Is this ready to land?
Jesse came up with a test case that looks very similar to this in bug 706281, that only triggers a crash after I landed the weak map cycle collector integration in bug 668855.
In the first CC in Jesse's test case, the CC identifies the nsDOMStringMap as garbage, so it is unlinked. There's only one reference to the DOMStringMap, an XPCWrappedNative, which is also identified as garbage and has a single reference. But both of them show up again in the next CC. I'm not sure why that is. So you end up with a double unlink. The attached patch seems like the way to go. Double unlinks are bad, but they do happen. I know of a few ways they can be caused.
How bad are double unlinks? Should the assertion be removed from the patch, or should a bug be filed on fixing the assertion?
They are kind of bad, but probably not anything worse than a controlled crash. Unlinks usually just null out references, so I think the worst thing that will happen is a null deref. But I don't think anybody has really audited unlinks.
Summary: crash [@ nsGenericHTMLElement::ClearDataset()] → double unlink crashes nsDOMStringMap's Unlink
I've always tried to write unlink code so that it can handle more than one unlinking - I think most of our unlink handling is like that.
Summary: double unlink crashes nsDOMStringMap's Unlink → double unlink crashes [@ nsGenericHTMLElement::ClearDataset()] nsDOMStringMap's Unlink
Maybe I'm reading the code wrong, but it looks like the wrappee of an XPCWrappedNative isn't unlinked by the XPCWN's unlink, so it will stay alive until the next GC. Which means that double unlinks are probably fairly common. I don't know. One XPCWN I looked at disappeared from one log to another, but a second one didn't.
I just removed the assertion from William's last patch. Double unlinks will happen whenever a wrapped native is marked as garbage, and the cycle collector runs twice in a row without a GC in between, so I think it shouldn't be an assertion.
Attachment #548250 - Attachment is obsolete: true
Attachment #579901 - Flags: review?(jonas)
Comment on attachment 579901 [details] [diff] [review] check for double unlink Low risk fix (add a null check), but also an extremely rare crash, so I am flagging for Aurora but not Beta. I could only find 9 crashes with this signature in the last week across all versions.
Attachment #579901 - Flags: approval-mozilla-aurora?
Crash Signature: [@ nsGenericHTMLElement::ClearDataset()] → [@ nsGenericHTMLElement::ClearDataset]
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #579901 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Component: General → DOM
OS: Windows NT → All
QA Contact: general → general
Hardware: x86 → All
Target Milestone: mozilla11 → mozilla10
Is this something QA can verify?
Whiteboard: [qa?]
The steps in bug 706281 comment 0 triggered this crash.
Whiteboard: [qa?] → [qa+]
Verified as fixed with the steps referenced in comment 22 on: Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0 Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:10.0) Gecko/20100101 Firefox/10.0 I have also verified the crash stats from the Socorro interface and this crash didn't reproduce on Firefox 10 anymore.
Keywords: verified-beta
Whiteboard: [qa+] → [qa+][qa!:10]
Verified as fixed with the steps referenced in comment 22 on: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20100101 Firefox/11.0 Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:11.0) Gecko/20100101 Firefox/11.0
Status: RESOLVED → VERIFIED
Whiteboard: [qa+][qa!:10] → [qa!][testday-20120203]
The crash status from the Socorro interface were also verified and this crash didn't happen on Firefox 10 or newer anymore.
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: