Crash in [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView]
Categories
(Core :: Layout, defect, P2)
Tracking
()
People
(Reporter: aryx, Assigned: tnikkel)
References
(Regression)
Details
(Keywords: crash, csectype-framepoisoning, regression, Whiteboard: [win:stability])
Crash Data
Attachments
(2 files)
|
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-esr115+
|
Details | Review |
|
26.55 KB,
text/plain
|
Details |
192 crashes from Windows and Linux, first one on 2022-12-29, this started to gain traction on 2023-01-05. 25% of crashes in first minute after launch
Crash report: https://crash-stats.mozilla.org/report/index/1dc83a24-d1cd-49d4-b0c8-a8d9f0230110
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 xul.dll nsCOMPtr<nsITreeView>::nsCOMPtr xpcom/base/nsCOMPtr.h:489
0 xul.dll nsTreeBodyFrame::GetExistingView const layout/xul/tree/nsTreeBodyFrame.h:76
0 xul.dll nsTreeBodyFrame::GetSelection const layout/xul/tree/nsTreeBodyFrame.cpp:445
1 xul.dll nsTreeBodyFrame::SetFocused layout/xul/tree/nsTreeBodyFrame.cpp:454
2 xul.dll mozilla::dom::XULTreeElement_Binding::set_focused dom/bindings/XULTreeElementBinding.cpp:338
3 xul.dll mozilla::dom::binding_detail::GenericSetter<mozilla::dom::binding_detail::NormalThisPolicy> dom/bindings/BindingUtils.cpp:3235
4 xul.dll CallJSNative js/src/vm/Interpreter.cpp:459
4 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:547
4 xul.dll InternalCall js/src/vm/Interpreter.cpp:614
4 xul.dll js::Call js/src/vm/Interpreter.cpp:646
| Reporter | ||
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Comment 1•2 years ago
|
||
Ok, this one I can't explain unless there's already a freed or corrupt pointer in nsTreeBodyFrame::mView? The crash address is also not poison, so that's weird...
Do you know if we have a crash annotation for a11y being enabled? I see a bit of suspect code related to nsITreeView in accessibility here and in similar places, but hard to see if it might be at fault.
| Reporter | ||
Comment 2•2 years ago
|
||
Try these reports which have accessibility set as true.
Comment 3•2 years ago
|
||
The bug is marked as tracked for firefox109 (beta) and tracked for firefox110 (nightly). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned.
:fgriffith, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Comment 4•2 years ago
|
||
Set release status flags based on info from the regressing bug 1806408
Hi Emilio, how are we progressing on this bug, do we have next steps?
Comment 6•2 years ago
|
||
I would expect bug 1809635 to fix this, but it was pretty much a patch based on code inspection, that's why I didn't close this right away.
Comment 7•2 years ago
|
||
So something interesting about this crash is that we didn't have any crash before 109b6 (and similarly there's no crash yet as of 110b5). I wonder if we have something enabled in early beta or earlier that is preventing this crash from happening?
In any case the bug mentioned above made it to 110, so let's check in a couple weeks to confirm there are no beta crashes, and if there are then we need to figure out what might be going on there.
Comment 8•2 years ago
|
||
In the past week, around 8 crash reports have a comment mentioning that the crash happened when they were manually clearing history.
The most specific STR-like comment there was:
Selecting all under "older than 6 months" in the history tab and hitting the 'delete' key caused a crash.
Comment 9•2 years ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #6)
I would expect bug 1809635 to fix this, but it was pretty much a patch based on code inspection, that's why I didn't close this right away.
update: emlio's requested beta/release uplift on that patch. We'll see if crash volume reduces once it makes it to those channels (hopefully soon, if approval is granted).
Comment 10•2 years ago
|
||
Unfortunately, the crashes are still happening in 110.0b7 :(
Comment 11•2 years ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #7)
So something interesting about this crash is that we didn't have any crash before 109b6 (and similarly there's no crash yet as of 110b5). I wonder if we have something enabled in early beta or earlier that is preventing this crash from happening?
So this seems true then... But it's unclear what could cause this :/
Updated•2 years ago
|
Updated•2 years ago
|
Comment 12•2 years ago
|
||
anything in particular? This doesn't seem actionable without reproducing...
Comment 13•2 years ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #12)
anything in particular? This doesn't seem actionable without reproducing...
sorry about the ping, I must have made a misclick during the reo weekly triage session.
Updated•2 years ago
|
Comment 14•2 years ago
|
||
There are several more reports mentioning that the crash happened after deleting history. The crash seems to reliably be on 0x7ffffffff0de7fff, too, which is kinda odd.
Comment 15•2 years ago
•
|
||
I have one theory about how-to-repro here... I haven't repro'd the actual crash, but I've managed to trigger something odd that could conceivably crash.
My STR:
- Copy a giant
places.sqliteinto a fresh profile (so as not to corrupt your regular profile). I'm using my regular browsing profile'splaces.squlitefile. - Start Firefox using that fresh profile.
- History | Show All HIstory
- Click "Older than 6 months"
- Select-all in the right window. For me, this shows "27000 results" or so.
- Now press-and-hold your delete key on your keyboard (or alternately: press it rapidly & repeatedly). Keep doing this, for tens of seconds. Watch how Firefox performs, watch its memory-usage in "top".
- If your selection gets lost (I think it occasionally did), then repeat steps 5-6.
ACTUAL RESULTS:
- The reported number of selected history-items holds steady (i.e. nothing obviously being deleted) for an unusual amount of time. Or it might go down a bit but then stop going down.
- Memory usage steadily increases -- Firefox consumes another ~1% of my system memory (64GB) every couple seconds.
- Firefox is often unresponsive; e.g. I can open tabs but sites won't load.
I'm guessing we're queuing up and servicing history-deletion-handlers for each "del" keypress, so if you hold "Del" down, we queue up a zillion handlers, consuming memory & resources with redundant handlers. The handlers also probably run into each other (maybe getting handles on the same history items?) as they're serviced; and my parent process is pegged with these handlers and can't perform other actions on behalf of child processes.
I wouldn't be surprised if the crash reports are just cases where users are doing something like this^ and running into some sort of limit (e.g. addreffing a handle into oblivion as we accumulate tasks, or something along those lines? Or running out of memory; though at first glance it doesn't look like this was an OOM.)
Comment 16•2 years ago
|
||
Here's a profile of Firefox spinning its wheels in response to my comment 15 STR:
https://share.firefox.dev/3TPWWad
Again, not sure if this is precisely what's going on for the users here, but it's definitely a way that Firefox can be brought-to-its-knees with similar delete-massive-amounts-of-history STR.
Comment 17•2 years ago
•
|
||
I tried my STR from comment 15 on a less-powerful-computer (in Firefox release on Windows) and I did in fact trigger an OOM, though it took a few minutes of holding down the Del key. My crash reports there were
bp-135225da-8f77-4d7a-bc7d-1441a0230327
bp-79846f25-f2cf-45f5-95ae-fb2db0230327
...which have signature [@ OOM | small ] and don't look like the crash reports on this bug here. So: comment 15 might be an independent issue from what's going on for the users crashing in this bug here.
Comment 18•2 years ago
|
||
I spun off bug 1824872 for comment 15 - comment 17, since it seems like probably an independent issue.
Updated•2 years ago
|
Comment 20•2 years ago
|
||
So we still see crashes on 113 beta, but only late beta, which is a bit bizarre as noted above. I went through the prefs that we tweak on early-vs-late beta and I don't see anything that could potentially be related off-hand... :/
| Comment hidden (obsolete) |
Comment 22•2 years ago
|
||
But I'm confused, nsTreeBodyFrame::mView is a nsITreeView, not an nsView?
| Comment hidden (obsolete) |
Comment 24•2 years ago
|
||
So that means that mView.mRawPtr is dead, which means that someone (else, probably?) manually decremented the reference count or messed up the reference content.
Comment 25•2 years ago
•
|
||
Sorry, what I wrote was (again) wrong.
The main point is wanted to convey is that 0x7ffffffff0de7fff is not a random value. It is the result of calling mozWritePoison(). This is a crucial hint that we should try to use for solving this crash.
When we free most objects, we will use e5 repeatedly to poison them. This is what arena_dalloc()'s MaybePoison() does. This is how the object behind mView.mRawPtr gets poisoned after it gets freed. Hence the problem is not with that object (contrary to what I wrote before).
In a few locations, we do a similar kind of poisoning, except we do it with pattern 0x7ffffffff0de7fff and not e5. In particular, 0x7ffffffff0de7fff is used to poison nsTreeBodyFrame objects after they are destroyed. This goes through the following path:
nsTreeBodyFrame::DestroyFrom()callsSimpleXULLeafFrame::DestroyFrom()(thensTreeBodyFrame*isthis);SimpleXULLeafFrame::DestroyFrom()is actuallynsIFrame::DestroyFrom()(thensTreeBodyFrame*isthis);nsIFrame::DestroyFrom()callsmozilla::PresShell::FreeFrame()(thensTreeBodyFrame*isaPtr);- this ends up calling
mozWritePoison()on thensTreeBodyFrameobject.
So, with high confidence this time, what we see in these crashes is that the layout of our nsTreeBodyFrame object has already been poisoned after it was freed through nsTreeBodyFrame::DestroyFrom(). Yet, we are now handling a blur event, that ends up calling nsTreeBodyFrame::SetFocused() on that same freed (and poisoned) nsTreeBodyFrame object. So, when we try to access its mView.mRawPtr, the value we get back is (nsITreeView*)0x7ffffffff0de7fff. When we try to use that pointer, we crash.
I think this is the JavaScript handler definition that ends up calling into nsTreeBodyFrame::SetFocused():
this.addEventListener(
"blur",
event => {
this.focused = false;
if (event.target == this.inputField) {
this.stopEditing(true);
}
},
true
);
This calls into XULTreeElement::SetFocused(), which calls nsTreeBodyFrame::SetFocused() on the nsTreeBodyFrame* returned by XULTreeElement::GetTreeBodyFrame().
void XULTreeElement::SetFocused(bool aFocused) {
nsTreeBodyFrame* body = GetTreeBodyFrame();
if (body) {
body->SetFocused(aFocused);
}
}
There, XULTreeElement::GetTreeBodyFrame() will most often return the cached pointer value stored in the mTreeBody field of the XULTreeElement. But this field is a raw pointer and I don't see any ref counting occur when the pointer is cached.
I think that the nsFocusManager could be properly retaining the XULTreeElement, but that then the XULTreeElement would fail to guarantee that the pointer in mTreeBody lives for at least as long as the XULTreeElement itself, which would then result in this crash.
Updated•2 years ago
|
Comment 26•2 years ago
•
|
||
There is some mechanism in place to try to guarantee that mTreeBody outlives the XULTreeElement: nsTreeBodyFrame::DestroyFrom() will call mTree->BodyDestroyed(mTopRowIndex); if mTree is set for that nsTreeBodyFrame. That will in turn set mTreeBody to nullptr for that XULTreeElement. This raises some questions:
- Are we sure that
mTreeis always correctly set for anynsTreeBodyFramewhose pointer is part of the layout of aXULTreeElement? As far as I can tell,mTreeis set only when we call intonsTreeBodyFrame::GetBaseElement(). Do we always call into that function for thesensTreeBodyFrameobjects? Are we sure that this function finds the correctXULTreeElement? - Are we sure that after
mTreeBodyhas been set tonullptrfollowing a call toXULTreeElement::BodyDestroyed, the next call toXULTreeElement::GetTreeBodyFrame()cannot set it back to the samensTreeBodyFrame*?
Updated•2 years ago
|
Comment 27•2 years ago
•
|
||
Another suspicious part in the code is the kungfuDeathGrip in XULTreeElement::GetTreeBodyFrame(). I suppose that this is present to guarantee that the XULTreeElement object lives at least until the end of XULTreeElement::GetTreeBodyFrame(), by preventing the refcount from droping to 0 during the doc->FlushPendingNotifications(aFlushType) call. But what about when we come out of XULTreeElement::GetTreeBodyFrame()?
I wonder if we could be observing the following scenario:
- the
XULTreeElementand itsnsTreeBodyFrameare both alive and well; XULTreeElement::SetFocused()calls intoXULTreeElement::GetTreeBodyFrame();- in
XULTreeElement::GetTreeBodyFrame(), afterdoc->FlushPendingNotifications(aFlushType);, the refcount for theXULTreeElementreaches 1 as thekungfuDeathGripjust saved us from reaching 0; - so far the
XULTreeElementand itsnsTreeBodyFrameare still both alive and well; - so we are able to collect a
nsTreeBodyFrame*(or recycle the cached one), which we will return as a result; - but as we are now exiting
XULTreeElement::GetTreeBodyFrame(), thekungfuDeathGripdestructor makes the refcount for theXULTreeElementfall to 0, so theXULTreeElementis freed, along with itsnsTreeBodyFrame(Is it the case that freeing theXULTreeElementalso frees thensTreeBodyFrame? I'm not sure about that part); - back in
XULTreeElement::SetFocused(), we obtained a pointer to thensTreeBodyFrameobject as a result of callingXULTreeElement::GetTreeBodyFrame(), but this object is now freed and poisoned; - as we try to use the poisoned object, we crash on poison value
0x7ffffffff0de7fff.
Do we perhaps need an extra kungfuDeathGrip in the scope of XULTreeElement::SetFocused(), or a way to detect that the objects were freed as we exited XULTreeElement::GetTreeBodyFrame()?
Comment 28•2 years ago
•
|
||
For all crashes received in the past 6 months, the address on which we crash has never been different from the poison value (0x7ffffffff0de7fff is Windows x64, 0xf0de7fff is Windows x86, 0x7ffffffff0dea7ff is Linux x64):
1 0x7ffffffff0de7fff 5749 86.93 %
2 0xf0de7fff 678 10.25 %
3 0x7ffffffff0dea7ff 186 2.81 %
This is evidence that the time between freeing (and poisoning) the nsTreeBodyFrame and crashing is so short, that never on any of these machines the allocation slot that was occupied by the freed object has been reallocated for another object that would have made use of the bytes where mView.mRawPtr was stored. Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values.
I believe that this evidence supports the scenario from comment 27, in which the crash would immediately follow freeing the two objects.
(Edit: Somehow I wrote comment 8 for comment 27, I'm not sure why. I mean comment 27 indeed here.)
Edit: These hints from support not only the scenario from comment 27 but more generally the possibility that the nsTreeBodyFrame is freed during the call to GetTreeBodyFrame, right before calling SetFocused.
Comment 29•2 years ago
|
||
XULTreeElement is kept alive by the caller of SetFocused.
| Assignee | ||
Comment 30•2 years ago
|
||
(In reply to Yannis Juglaret [:yannis] from comment #26)
- Are we sure that after
mTreeBodyhas been set tonullptrfollowing a call toXULTreeElement::BodyDestroyed, the next call toXULTreeElement::GetTreeBodyFrame()cannot set it back to the samensTreeBodyFrame*?
This could potentially happen. We only clear the primary frame pointer in nsIFrame::DestroyFrom which happens after that code.
(In reply to Yannis Juglaret [:yannis] from comment #28)
This is evidence that the time between freeing (and poisoning) the
nsTreeBodyFrameand crashing is so short, that never on any of these machines the allocation slot that was occupied by the freed object has been reallocated for another object that would have made use of the bytes wheremView.mRawPtrwas stored. Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values.
If the memory got reallocated then the crash signature might be different, we'd get function names from the vtable of the new object. There was another crash where we saw this happen recently with nsMenuPopupFrame functions showing up in an impossible place. Or if it was a new object of the same type the code would potentially just not crash.
| Assignee | ||
Comment 31•2 years ago
|
||
| Assignee | ||
Updated•2 years ago
|
| Assignee | ||
Updated•2 years ago
|
Comment 32•2 years ago
•
|
||
(In reply to Timothy Nikkel (:tnikkel) from comment #30)
If the memory got reallocated then the crash signature might be different, we'd get function names from the vtable of the new object. There was another crash where we saw this happen recently with nsMenuPopupFrame functions showing up in an impossible place. Or if it was a new object of the same type the code would potentially just not crash.
I didn't find other crash signatures of significant volume where the proto signature would contain, say, mozilla::dom::XULTreeElement_Binding::set_focused. But, indeed, we could just be not crashing, also allocating from this PresShell arena is probably much less common than the usual paths for allocation.
Reading your comment made me wonder if we could find additional hints in other crash signatures. For example, where do Nightly and early beta users crash here, since we do not have reports from them like [:emilio] mentioned in comment 7?
I think the following signatures look interesting:
mozilla::PresShell::DoFlushPendingNotifications: This signature is often reached through crashing onMOZ_DIAGNOSTIC_ASSERT(!mForbiddenToFlush) (This is bad!). Nightly and early beta users could be crashing here. One user comment mentions trying to delete a lot of history lines then using Firefox during deletion (e.g. switching tabs), and their crash stack seems very interesting, it goes fromnsNavHistoryResult::OnVisittonsTreeSelection::FireOnSelectHandler,mozilla::dom::XULElement_Binding::focus,mozilla::PresShell::DoFlushPendingNotifications,nsTreeBodyFrame::DestroyFrom,mozilla::dom::XULTreeElement_Binding::endUpdateBatch,mozilla::PresShell::DoFlushPendingNotifications(flushing reentrancy related to navigation history withnsTreeBodyFrameandXULTreeElementinvolved);nsCycleCollectingAutoRefCnt::incr: This signature contains another late beta / release crash where user comments talk about deleting a lot of old navigation history e.g. here and here. It seems like it could be a variation of the other crash with potentially the same root cause. It occurs throughmozilla::dom::XULTreeElement_Binding::get_columnsinstead ofmozilla::dom::XULTreeElement_Binding::set_focused.
Note: Both signatures also contain crashes which do not seem related to this bug, especially for the second signature where there are many different proto signatures. But a proportion seems related in both.
Updated•2 years ago
|
| Assignee | ||
Comment 33•2 years ago
|
||
(In reply to Yannis Juglaret [:yannis] from comment #32)
One user comment mentions trying to delete a lot of history lines then using Firefox during deletion (e.g. switching tabs), and their crash stack seems very interesting,
This seems like it is painting a giant red X over exactly how we can hit the crash in this bug! If we can enter JS and basically do anything after calling BodyDestroyed in nsTreeBodyFrame::DestroyFrom then it is very likely that we can set XULTreeElement::mTreeBody back to the frame that is doomed and we will then hit this crash not too long after. I will come back to this later today and post a patch.
Comment 34•2 years ago
•
|
||
Comment 35•2 years ago
•
|
||
This bug explains 29 of the 45 crashes received in 116 early beta on the mozilla::PresShell::DoFlushPendingNotifications signature, so ~64% of the volume, which can be useful to estimate the potential impact of bug 1845266 (which explains ~25% of the same volume).
Comment 36•2 years ago
•
|
||
I think we can call this mitigated by frame poisoning?
Looking at the last month of crash reports for the nsTreeBodyFrame::GetExistingView signatures, they all seem to be crashing on frame-poisoning addresses (0x7ffffffff0de7fff and similar, which I mentioned above in comment 14 but didn't realize at that point was a frame-poisoning address. :) I refreshed my memory that this is indeed a frame-poisoning address over in a pernosco trace in bug 1845223, though.)
Comment 25 and comment 28 seems to be confirming this as well. Comment 28 mentioned "Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values" -- that's indeed why poison-pointer crashes are typically security-sensitive -- but specifically with the layout frame tree classes (nsTreeBodyFrame & friends), we use arena-allocation and ensure that a given address can only ever be used to allocate instances of the same concrete class. So if the memory gets reallocated and then we use a dangling pointer to the old/deleted nsTreeBodyFrame, we'll potentially get some weirdness but we'll at least be using the pointer with the proper type. https://robert.ocallahan.org/2010/10/mitigating-dangling-pointer-bugs-using_15.html has some notes about this, too.
So I think we've confirmed that this crash is not exploitable and is mitigated by frame poisoning, and hence we don't need to have this bug hidden as security-sensitive. tnikkel, does that make sense to you?
Comment 37•2 years ago
|
||
(I'm replacing the just-added sec-high keyword with csectype-framepoisoning, too; frame poisoning crashes aren't sec-high, fortunately.)
| Assignee | ||
Comment 38•2 years ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #36)
So I think we've confirmed that this crash is not exploitable and is mitigated by frame poisoning, and hence we don't need to have this bug hidden as security-sensitive. tnikkel, does that make sense to you?
Yes, that all agrees with my knowledge.
Comment 39•2 years ago
|
||
Thanks. I'll remove the security-sensitive flag then.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 40•2 years ago
|
||
Comment 41•2 years ago
|
||
| bugherder | ||
| Assignee | ||
Comment 42•2 years ago
|
||
Oops, I had the leave-open keyword in there when I wanted to land a diagnostic patch, but we ended up figuring this out and I pushed a patch that should fix this (at least at the immediate crash, the code is still not as good as I would like). So this bug should have been resolved when this landed. I'll do that now.
Updated•2 years ago
|
Comment 43•2 years ago
|
||
The patch landed in nightly and beta is affected.
:tnikkel, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox117towontfix.
For more information, please visit BugBot documentation.
Comment 45•2 years ago
•
|
||
If we ignore OOMs, bitflips and shutdown hangs then I think this is a top parent process crasher for 116 release. From a stability point of view it should be very interesting to take this patch for 117.0 and 115.2.0esr. Can you elaborate if you think that there could be a risk here [:tnikkel]? Thank you.
| Assignee | ||
Updated•2 years ago
|
Comment 47•2 years ago
|
||
Now that this has had a bit more bake time, can we nominate this for ESR115 uplift? It grafts cleanly.
| Assignee | ||
Comment 48•2 years ago
|
||
Comment on attachment 9345229 [details]
Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: top crash in parent process
- User impact if declined: crashes
- Fix Landed on Version: 118
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): it's already on release
Comment 49•2 years ago
|
||
Comment on attachment 9345229 [details]
Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio
Approved for 115.4esr.
Comment 50•2 years ago
|
||
| uplift | ||
Updated•2 years ago
|
Updated•2 years ago
|
Description
•