Closed Bug 1266882 Opened 9 years ago Closed 8 years ago

crash in nsCycleCollector::ScanWhiteNodes, on MOZ_RELEASE_ASSERT("Cycle collector found more references to an object than its refcount")

Categories

(Core :: DOM: Editor, defect)

48 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla50
Tracking Status
firefox47 --- unaffected
firefox48 + fixed
firefox49 + fixed
firefox-esr45 --- unaffected
firefox50 + fixed

People

(Reporter: khuey, Assigned: mccr8)

References

Details

(Keywords: crash, regression, topcrash-thunderbird, Whiteboard: [tbird topcrash])

Crash Data

Attachments

(3 files, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-3e0a0bf5-070f-41b7-a43e-285392160422. ============================================================= "Cycle collector found more references to an object than its refcount" sounds bad.
Flags: needinfo?(continuation)
That's interesting, but I'm not sure if we can do anything about it without knowing what the class of that object is. I looked at some other Nightly crashes with this signature and they were all crashing on the same assert.
Flags: needinfo?(continuation)
Summary: crash in nsCycleCollector::ScanWhiteNodes → crash in nsCycleCollector::ScanWhiteNodes, on MOZ_RELEASE_ASSERT(" "Cycle collector found more references to an object than its refcount")
Summary: crash in nsCycleCollector::ScanWhiteNodes, on MOZ_RELEASE_ASSERT(" "Cycle collector found more references to an object than its refcount") → crash in nsCycleCollector::ScanWhiteNodes, on MOZ_RELEASE_ASSERT("Cycle collector found more references to an object than its refcount")
I hadn't ever seen any crashes in this code the entire time I've been working on the cycle collector, except for a single fuzz bug, so I haven't really thought about how we could get some useful information for this. Maybe I could use Traverse() to get a string, and then annotate the crash report with it. You'd have to be careful to ensure that no URLs are in it, though.
This function has been around since 2014-05-08, but crash-stats says the signature first showed up 2015-08-27. Maybe digging up the commit log for that Nightly would reveal something.
This crash tends to happen during development in case people are doing unusual traversing, like the not-yet-exposed-to-the-web Directory API was suffering from this. I guess we should try to figure out if it was possible during compile time to assert that something can't be traversed too many times.
(In reply to Andrew McCreight [:mccr8] from comment #1) > That's interesting, but I'm not sure if we can do anything about it without > knowing what the class of that object is. I looked at some other Nightly > crashes with this signature and they were all crashing on the same assert. Put pi->mParticipant on the stack there when we decide to abort? Then we can grab it from a minidump and see what class it is.
Flags: needinfo?(continuation)
The other thing you could do is switch from MOZ_RELEASE_ASSERT to NS_RUNTIMEABORT, pass an nsPrintfCString().get() as the argument to NS_RUNTIMEABORT with whatever information you want, and then pull the abort message out of the crash reports. Although that doesn't work in the child process, so you'd need a crash with e10s disabled to make that useful...
(although you still might be able to get the resulting string off the stack in the crash report, debugging in Visual Studio)
I'd have assumed that some kind of load-store optimization would clear mParticipant off the stack but I guess we can try it.
Another 8 crashes in Nightly 20160508030214, making this the #9 topcrash.
Assignee: nobody → continuation
Flags: needinfo?(continuation)
Depends on: 1271848
i hit this very frequently when editing a mana page (nightly(49) and devedition(48)). e10s on both, nightly with a clean profile. STR are edit a mana page and make random changes for up to a couple of minutes. there's no need to save the changes. and, as i found out while writing this comment, the crash may happen after closing the mana tab.
(In reply to Byron Jones ‹:glob› from comment #11) > STR are edit a mana page and make random changes for up to a couple of > minutes. That's interesting. What sort of changes? Just randomly inserting and deleting text?
(In reply to Andrew McCreight [:mccr8] from comment #12) > (In reply to Byron Jones ‹:glob› from comment #11) > > STR are edit a mana page and make random changes for up to a couple of > > minutes. > That's interesting. What sort of changes? Just randomly inserting and > deleting text? yes. i couldn't pin down an exact trigger, so my random edits have included email addresses and bullet points. tomcat wasn't able to trigger a crash with mana on nightly. rfkelly reported multiple crashes while editing https://mana.mozilla.org/wiki/display/Identity/Persona+migration+guide+for+internal+sites (i crashed four times when adding three lines to that page).
FWIW, a comment from a Thunderbird developer. I'm currently running a debug build to track down an intermitted problem which occurs about every month. While doing so, TB crashes for me regularly, mostly here: https://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsCycleCollector.cpp#3153 In a synch CC, no nodes should be marked black early on but also here: https://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsCycleCollector.cpp#3161 Live nodes should all have been traversed
yup, that means we have broken traverse implementation somewhere. Hopefully bug 1271848 will give more information where.
Whiteboard: [tbird topcrash]
(In reply to Olli Pettay [:smaug] from comment #16) > Hopefully bug 1271848 will give more information where. If you're interested in catching this, may I suggest you compile Thunderbird. All you need to do is click on some messages in some folders and most importantly, scroll the folder pane with the scroll bar using the mouse pointer (http://kb.mozillazine.org/Window_layout_-_Thunderbird). I'm running a local build and I'm logging the debug console output to a file. Right now I've got 7222 occurrences of CRASH!!! - In a synch CC, no nodes should be marked black early on. Every time I scroll the folder pane, switch to a new folder and click on messages, I get more. In a previous session I had many thousands more.
Flags: needinfo?(bugs)
Any chance you could try to find regression range for this?
Flags: needinfo?(bugs)
And "CRASH!!! - In a synch CC, no nodes should be marked black early on." is different assertion than what this bug is about.
Indeed, it triggers thousands of times ;-( When I complained in comment #14, your answer in comment #16 was: "yup, that means we have broken traverse implementation somewhere." So isn't this related? The internal structures go bust and then it will crash somewhere. The one I'm reporting has a MOZ_ASSERT, so most people using a release version will not be able to see it. Is there any more info I should print out at the time of the disabled MOZ_ASSERT? As for the regression range: We have an expert in Japan doing those: alice0775@yahoo.co.jp. But he/she(?) usually bisects using dailies. He/she also does local builds, but this one here is a little more complicated. Alice, when did the debug builds of TB start crashing when moving between folders, scrolling the folder pane and displaying messages. I've included a debug patch in attachment 8752602 [details] [diff] [review] that converts the crash to print. Is this something you'd be able to do?
Flags: needinfo?(alice0775)
it is possibly related. My comment was about the other assertion though. Sorry if I wasn't clear.
I have attempted to crash TB49 about for 1 minute, but did not succeed....
Flags: needinfo?(alice0775)
But the "no nodes should be marked black early on" assertion also indicated that something went wrong, right? Is there something else I should dump out? I'm now at 15329 occurrences in my TB session. This code lives in XPCOM, so could it be that TB, which is basically a big JS add-on, exercises this more than FF itself?
(In reply to Alice0775 White from comment #23) > I have attempted to crash TB49 about for 1 minute, but did not succeed.... Thanks. It needs to be a debug build. There are three asserts which are triggered: MOZ_ASSERT(!aFullySynchGraphBuild, "In a synch CC, no nodes should be marked black early on."); That triggers a lot. MOZ_ASSERT(!pi->mParticipant, "Live nodes should all have been traversed"); That triggers occasionally. nsPrintfCString msg("More references to an object than its refcount, for class %s", piName); CrashReporter::AnnotateCrashReport(NS_LITERAL_CSTRING("CycleCollector"), msg); MOZ_CRASH(); That triggers occasionally. Only the third one will be noticed in a release build. While the bug is about the third one, I'm guessing that the first one is also related. And that is so frequent that it will crash a debug build within a minute.
I see two reports on Nightly with my annotation. Unfortunately it is FragmentOrElement in both which is not useful. I'll have to try to reproduce locally.
I know that his bug is about chasing "Cycle collector found more references to an object than its refcount" However, in a Thunderbird session of a few hours I got In a synch CC, no nodes should be marked black early on 24000 (24 thousand) times and Live nodes should all have been traversed 300 times. Something is wrong in that logic.
(In reply to Jorg K (PTO during summer, NI me) from comment #27) > However, in a Thunderbird session of a few hours I got > In a synch CC, no nodes should be marked black early on > 24000 (24 thousand) times and Please file a new XPCOM bug for this issue and CC me.
I get pretty regular crashes after sending an e-mail which closes the composition window. Last one was https://crash-stats.mozilla.com/report/index/abad253a-bc0c-4b46-92f6-c56992160521 That version of Daily was compiled 2016-05-17, after bug 1271848 landed, so it should have reported the details you're interested in. How can I tell whether the report contains useful information? I bet I can produce heaps more just sending e-mail.
Ran into this: https://crash-stats.mozilla.com/report/index/cf645109-825f-40a5-8d45-456a62160523. Is Mana a Confluence instance, by any chance? I think my crash happened while editing a Confluence page.
Interestingly, when I went back and submitted my edits to this Confluence page once more, Firefox crashed again, but with a different crash: https://crash-stats.mozilla.com/report/index/f685643b-a2b8-4960-ba4d-697bf2160523 (see bug 791437).
(In reply to Dirkjan Ochtman (:djc) from comment #31) > Is Mana a Confluence instance, by any chance? yes, mana is a confluence instance. > Interestingly, when I went back and submitted my edits to this Confluence > page once more, Firefox crashed again, but with a different crash: i also see that a lot when editing mana.
I tried to reproduce the crash editing mana, under rr, but hit the crash in bug 1275003 instead. I don't think it's related, although I'm not sure.
I couldn't reproduce the crash editing a page I picked. Is there a particular page on mana that you've seen this on?
Flags: needinfo?(glob)
Never mind; I see comment 13 now.
Flags: needinfo?(glob)
I tried editing the page from comment 13 for about 10 minutes, but didn't crash.
If somebody who can reproduce could run with MOZ_CC_LOG_DIRECTORY set to some directory and MOZ_CC_LOG_ALL set to 1, and then email me the last couple of cc-edges logs (the last one might have "incomplete" in the name) that might help.
Sorry, comment #38 looked a little cryptic ;-) However, on Windows I did: set MOZ_CC_LOG_DIRECTORY=c:\TB-CC set MOZ_CC_LOG_ALL=1 thunderbird -p in the directory where the executable is stored. This gives me plenty of files, named named gc-edges.MMMM-NN.log and cc-edges.MMMM-NN.log. I also saw some with the "incomplete" prefix flash past. Now what you want is the "cc" file at the point where it crashed, right? The only problem is that I already collected 1 GB of data in about eight minutes, so I will soon fill all the space left on my drive. Can I delete the "gc" files since you didn't ask for them? Or anything older than file minutes. Please advise.
s/file minutes/five minutes/
(In reply to Jorg K (PTO during summer, NI me) from comment #41) > Now what you want is the "cc" file at the point where it crashed, right? Yes. That will probably be the one with "incomplete" in the name. > The only problem is that I already collected 1 GB of data in about eight > minutes, so I will soon fill all the space left on my drive. Can I delete > the "gc" files since you didn't ask for them? Or anything older than file > minutes. Yes, that sounds reasonable. It might make more sense to have somebody try logging who can reproduce this issue more easily. (Ideally there would be a logging mode where it deletes the log after it is finished rather than rename it, but this isn't a common scenario.)
Is there enough information in the crash reports now? Here's a fresh one: https://crash-stats.mozilla.com/report/index/990847b6-403e-4688-b7aa-db2512160525
Unfortunately, the information turned out to be not very useful. If somebody could reproduce with the CC logging enabled as I described in comment 38, I might be able to make some progress.
Any updates?
Flags: needinfo?(mozilla)
Sadly not. I got lazy and didn't run with the "special setup" described in comment #41. Also Andrew suggested in comment #43 that ... it might make more sense to have somebody try logging who can reproduce this issue more easily. Sadly it crashed twice in a row now within five minutes: https://crash-stats.mozilla.com/report/index/6c9b64b0-df45-44f3-bb2c-e4aa02160601 https://crash-stats.mozilla.com/report/index/2ccb9daa-1f89-4d4f-9fd0-bbfa02160601
Flags: needinfo?(mozilla)
#10 topcrash on Firefox Beta 48. Many comments in the crash report are related to GMail.
James, this is impacting gmail, do you know who could help with this? Thanks
Flags: needinfo?(snorp)
I think the main thing blocking this is we need better steps to reproduce.
I guess this is a bug in the editor. In the crash bp-4e85523d-f3b0-4d70-b13c-d8ac02160623, I can see that it happened when the user was writing an email (this person is a staff). She also told me that it occurs when writing a mail in gmail.
Component: XPCOM → Editor
Hi all, Firefox just crashed again when I was on Gmail writing down an other email. 25 minutes between this crash and the previous one. Do not hesitate should you have questions.
If this really started showing up more often in Firefox 48, then bug 1253734 is a possible regressor.
Peter, could you help here? Merci
Flags: needinfo?(snorp) → needinfo?(peterv)
My two cents of wisdom: The few cases this has crashed for me were in Thunderbird after sending a message. So the compose window is destroyed and with it, the editor that lives in it. I would assume that some garbage collection is triggered at that stage.
You don't happen to use Linux, do you?
Flags: needinfo?(ccarle)
Looking at that bug mccr8 mentioned... I don't know whether it is guaranteed that http://searchfox.org/mozilla-central/rev/970569ad57ac4436ff31aa2ac63f38ed3ee2932d/editor/libeditor/PlaceholderTxn.cpp#41 is always non-null
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #54) > You don't happen to use Linux, do you? No I'm using Mac OS X 10.11 (OS version 10.11.5 15F34)
Flags: needinfo?(ccarle)
Mine was on the Mac as well - https://crash-stats.mozilla.com/report/index/a47e1c1a-e6d0-4679-ab88-89d2a2160624. I believe I was switching to a tab that happened to have a WordPress page being edited. I've only seen it once.
I looked at 10 reports, and they're all either FragmentOrElement or nsGenericDOMDataNode. :(
Yeah, we need somebody to reproduce it with CC logging on, so we can figure out what exactly is pointing to the bad element. Unless somebody wants to audit all editor CC code.
I've only had two crashes within the last four weeks, today, July 3rd, and June 7th. Since running with CC logging is an expensive operation, I can't really do it for weeks in a row. Today's crash was like all the others, it crashed after sending an e-mail with TB which closes the compose window. The version I'm running is from 2016-06-24, so *before* bug 1281873 landed. I'll start running today's version later today. No feedback from Peter van der Beken yet?
[Tracking Requested - why for this release]: We have a lot of these crashes in beta, it is going to be worst when it reaches release... We should do something before the release.
Track this as described in comment #61.
Today this crashed three times within 10 minutes writing one particular e-mail. After the second crash I switched on the logging and was lucky, or unlucky ... ;-) Download the logs here: https://we.tl/Q1E2KvQGS6
Flags: needinfo?(continuation)
Thanks, I'll see if I can figure out what is going wrong. Also just for future reference, you only needed to include the incomplete CC log plus the newest GC log. :)
Flags: needinfo?(continuation)
I added a basic refcount checker, check_refcounts.py, to my heapgraph library, and used it on the incomplete CC edges file that Jorg K sent me. This revealed the bad node (12C399C0), which has a refcount of 2, with 3 known references to it. The node with too many references: 12C399C0 [rc=2] FragmentOrElement (xhtml) br (orphan) about:blank > 12C396A0 GetParent() > 127186C0 mSlots->mChildrenList The objects referring to it: 13F08DF0 [rc=1] EditTxn > 12C396A0 mParent > 12C399C0 mNewNode <-- reference 1 12C396A0 [rc=7] FragmentOrElement (xhtml) tt (orphan) about:blank > 13768790 GetParent() > 12C399C0 mAttrsAndChildren[i] <-- reference 2 > 12708BC0 mSlots->mAttributeMap > 127158C0 mSlots->mChildrenList 127158C0 [rc=1] nsBaseContentList > 12C399C0 mElements[i] <-- reference 3 Now, unfortunately, EditTxn (or rather CreateElementTxn, which is the actual class here) seems very normal. I was hoping there was a traversal problem in some editor class, but it may be that there's an actual bug in the reference counting. This editor code seems to have a whole mishmash of styles for handling refcounting.
about:blank is the URL of the document? That's correct for the TB compose window. And <br> and <tt> are elements in the composition? That's correct, too.
(In reply to Andrew McCreight [:mccr8] from comment #4) > This function has been around since 2014-05-08, but crash-stats says the > signature first showed up 2015-08-27. Maybe digging up the commit log for > that Nightly would reveal something. It became a lot more common around 2016-04-20 or 2016-04-25, though.
The pattern of increased occurrence seems to pretty closely match bug 1156062 parts 4-8. (It matches the landing, backout, and relanding of bug 1156062 and bug 1190172, except that bug 1156062 parts 9-12 and bug 1190172 were backed out again in a way that doesn't match this pattern.)
Blocks: 1156062
And I'm most suspicious of https://hg.mozilla.org/integration/mozilla-inbound/rev/42af5753ad68b7d06f0a8ba85d5b361f9ce37f6d because it's not clear to me what causes the return BR node not to reach reference count 0 and be destroyed while being returned from nsHTMLEditor::CreateBR.
Kyle pointed out in IRC that this callsite for CreateBR fails to properly addref the return value: https://bugzilla.mozilla.org/attachment.cgi?id=8742783&action=diff#a/editor/libeditor/nsHTMLEditor.cpp_sec3 Prior to the change, CreateBR had a return type of already_AddRefed. I'm not sure why it was changed.
Flags: needinfo?(peterv)
Maybe Masayuki or Aryeh (despite his absence) can answer this.
Flags: needinfo?(masayuki)
I should be able to figure it out, thanks though.
Flags: needinfo?(masayuki)
FWIW, Ehsan was right in bug bug 1156062: landing it in chunks did help figure out the regression. :) (In reply to :Ehsan Akhgari (out sick) from comment #15) > Past experience has shown that landing these changes en masse is a mistake, > since they are regression prone. I'd really appreciate if you could split > up the work and try to land one or two of these patches per bug...
This is a minimal fix for backporting. I think assigning to a refptr and then doing forget is less weird than the code prior to bug 1156062 part 5, which did a take(). It is also nice because it works whether CreateBR() returns a raw pointer or an already_Refed<> pointer. I have audited all call sites for nsHTMLEditor::CreateBR(), nsPlaintextEditor::CreateBRImpl(), and nsWSRunObject::InsertBreak() (the three methods changed by bug 1156062 part 5), and this was the only one that does not assign to a refptr. The try run has not finished yet, but this seems very safe to me. try run for just part 1: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc217227d087
Attachment #8768241 - Flags: review?(masayuki)
I think the old return types for these methods are better, because returning already_AddRefed rather than a raw pointer makes it harder to make mistakes with refcounting like the one seen in part 1. (Incidentally, this is also a little more efficient, because it avoids some addrefs and releases.) try run for part 1 and part 2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c93e8d8f32a5
Attachment #8768242 - Flags: review?(masayuki)
This bug is now about a specific regression that caused a large increase in this crash signature, so I'm going to mark older branches as unaffected.
Keywords: regression
Sigh, these patches will break some patches for bug 1260651 but they should be uplift, so, I'll review them soon and struggle with merging by my hand.
I wasn't planning on uplifting part 2, which is the only one that should be hard to rebase over, so I can wait until after you've landed bug 1260651. Part 1 is the more important one.
Comment on attachment 8768241 [details] [diff] [review] part 1 - Ensure the return value of CreateBR is addrefed in CopyLastEditableChildStyles. Okay, fine (according to your comment). Let's land this now.
Attachment #8768241 - Flags: review?(masayuki) → review+
(In reply to Andrew McCreight [:mccr8] from comment #78) > I wasn't planning on uplifting part 2, which is the only one that should be > hard to rebase over, so I can wait until after you've landed bug 1260651. > Part 1 is the more important one. Okay, thank you. I'm asking review to smaug. If he will do that, I'll be back this bug after that. Otherwise, I'll review part.2 and land it first.
Blocks: 1284963
Comment on attachment 8768242 [details] [diff] [review] part 2 - Revert part 5 of bug 1156062. I just split part 2 into a separate bug, bug 1284963, to simplify tracking what branches have what.
Attachment #8768242 - Attachment is obsolete: true
Attachment #8768242 - Flags: review?(masayuki)
Pushed by amccreight@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1b6bb50925cd Ensure the return value of CreateBR is addrefed in CopyLastEditableChildStyles. r=masayuki
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment on attachment 8768241 [details] [diff] [review] part 1 - Ensure the return value of CreateBR is addrefed in CopyLastEditableChildStyles. Approval Request Comment [Feature/regressing bug #]: bug 1156062 [User impact if declined]: bad crashes. This may not fix the crash in the bug, but the existing code is definitely wrong. [Describe test coverage new/current, TreeHerder]: there must be some tests that run this code, but I'm not sure. [Risks and why]: Low. It just adds an obviously missing addref. [String/UUID change made/needed]: none
Attachment #8768241 - Flags: approval-mozilla-beta?
Attachment #8768241 - Flags: approval-mozilla-aurora?
As I said in comment #63: When editing a certain e-mail I can reliably reproduce the crash. The e-mail has this HTML: <ul> <li><tt>text</tt></li> <li><tt>text</tt></li> <li><tt>text</tt></li> <li><tt><br> </tt></li> </ul> The edit operation is adding more items to the list by hitting <enter> at the end of an item/a line. This operation runs nsHTMLEditor::CopyLastEditableChildStyles(). I'm sure you can reproduce it in contenteditable <div> in FF alone. With the patch, the crash has gone. So thanks!! Perhaps I can use the opportunity to draw some attention to bug 1273207. My debug builds still crash on that and I have to compile with a special patch to convert those MOZ_ASSERT into printf in order to work at all.
Status: RESOLVED → VERIFIED
Comment on attachment 8768241 [details] [diff] [review] part 1 - Ensure the return value of CreateBR is addrefed in CopyLastEditableChildStyles. Crash fix for 48/49, verified on m-c, let's try it for beta 7 next week.
Attachment #8768241 - Flags: approval-mozilla-beta?
Attachment #8768241 - Flags: approval-mozilla-beta+
Attachment #8768241 - Flags: approval-mozilla-aurora?
Attachment #8768241 - Flags: approval-mozilla-aurora+
This needs a rebased patch for Beta.
Flags: needinfo?(continuation)
Sorry, ignore my comment 89. The line in question was changed twice: First in bug 1156062 part 5 which landed on mozilla 48: https://hg.mozilla.org/mozilla-central/rev/42af5753ad68#l1.41 Then again in bug 1190172 part 10 which also landed on mozilla49: https://hg.mozilla.org/mozilla-central/rev/6dabbd41ffa2#l3.31
I just attached a fixed up version for beta. Thanks.
Flags: needinfo?(continuation)
Version: unspecified → 48 Branch
I don't see any crashes on Nightly or Aurora after this patch landed, which is good. It was happening regularly before that.
Firefox crashed again after I updated it to the latest version on 07/11/2016 at 16:21 (FR time) as I was typing an email... Crash number is: bp-03892072-487b-475a-8b8e-9525a2160711
(In reply to ccarle from comment #95) > Crash number is: bp-03892072-487b-475a-8b8e-9525a2160711 The buildid on that crash is 20160706215822 (July 6), which is from before this patch landed in beta (July 8). Beta is only released once or twice a week.
Oh ok, got it. Thanks and apologies for the noise!
See Also: → 1287119
It looks like there are some remaining crashes with my patch applied. I filed bug 1287119 for those, in case we can figure anything out.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: