Closed Bug 1809492 Opened 2 years ago Closed 11 months ago

Crash in [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView]

Categories

(Core :: Layout, defect, P2)

defect

Tracking

()

RESOLVED FIXED
118 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 119+ fixed
firefox108 - wontfix
firefox109 + wontfix
firefox110 + wontfix
firefox111 --- wontfix
firefox116 --- wontfix
firefox117 --- wontfix
firefox118 --- fixed

People

(Reporter: aryx, Assigned: tnikkel)

References

(Regression)

Details

(Keywords: crash, csectype-framepoisoning, regression, Whiteboard: [win:stability])

Crash Data

Attachments

(2 files)

192 crashes from Windows and Linux, first one on 2022-12-29, this started to gain traction on 2023-01-05. 25% of crashes in first minute after launch

Crash report: https://crash-stats.mozilla.org/report/index/1dc83a24-d1cd-49d4-b0c8-a8d9f0230110

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  xul.dll  nsCOMPtr<nsITreeView>::nsCOMPtr  xpcom/base/nsCOMPtr.h:489
0  xul.dll  nsTreeBodyFrame::GetExistingView const  layout/xul/tree/nsTreeBodyFrame.h:76
0  xul.dll  nsTreeBodyFrame::GetSelection const  layout/xul/tree/nsTreeBodyFrame.cpp:445
1  xul.dll  nsTreeBodyFrame::SetFocused  layout/xul/tree/nsTreeBodyFrame.cpp:454
2  xul.dll  mozilla::dom::XULTreeElement_Binding::set_focused  dom/bindings/XULTreeElementBinding.cpp:338
3  xul.dll  mozilla::dom::binding_detail::GenericSetter<mozilla::dom::binding_detail::NormalThisPolicy>  dom/bindings/BindingUtils.cpp:3235
4  xul.dll  CallJSNative  js/src/vm/Interpreter.cpp:459
4  xul.dll  js::InternalCallOrConstruct  js/src/vm/Interpreter.cpp:547
4  xul.dll  InternalCall  js/src/vm/Interpreter.cpp:614
4  xul.dll  js::Call  js/src/vm/Interpreter.cpp:646

Ok, this one I can't explain unless there's already a freed or corrupt pointer in nsTreeBodyFrame::mView? The crash address is also not poison, so that's weird...

Do you know if we have a crash annotation for a11y being enabled? I see a bit of suspect code related to nsITreeView in accessibility here and in similar places, but hard to see if it might be at fault.

Flags: needinfo?(emilio) → needinfo?(aryx.bugmail)

Try these reports which have accessibility set as true.

Flags: needinfo?(aryx.bugmail)

The bug is marked as tracked for firefox109 (beta) and tracked for firefox110 (nightly). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned.

:fgriffith, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(fgriffith)
Assignee: nobody → emilio
Flags: needinfo?(fgriffith)

Set release status flags based on info from the regressing bug 1806408

Hi Emilio, how are we progressing on this bug, do we have next steps?

Flags: needinfo?(emilio)

I would expect bug 1809635 to fix this, but it was pretty much a patch based on code inspection, that's why I didn't close this right away.

Flags: needinfo?(emilio)

So something interesting about this crash is that we didn't have any crash before 109b6 (and similarly there's no crash yet as of 110b5). I wonder if we have something enabled in early beta or earlier that is preventing this crash from happening?

In any case the bug mentioned above made it to 110, so let's check in a couple weeks to confirm there are no beta crashes, and if there are then we need to figure out what might be going on there.

Severity: -- → S2
Priority: -- → P2

In the past week, around 8 crash reports have a comment mentioning that the crash happened when they were manually clearing history.

The most specific STR-like comment there was:

Selecting all under "older than 6 months" in the history tab and hitting the 'delete' key caused a crash.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #6)

I would expect bug 1809635 to fix this, but it was pretty much a patch based on code inspection, that's why I didn't close this right away.

update: emlio's requested beta/release uplift on that patch. We'll see if crash volume reduces once it makes it to those channels (hopefully soon, if approval is granted).

Unfortunately, the crashes are still happening in 110.0b7 :(

(In reply to Emilio Cobos Álvarez (:emilio) from comment #7)

So something interesting about this crash is that we didn't have any crash before 109b6 (and similarly there's no crash yet as of 110b5). I wonder if we have something enabled in early beta or earlier that is preventing this crash from happening?

So this seems true then... But it's unclear what could cause this :/

Flags: needinfo?(emilio)

anything in particular? This doesn't seem actionable without reproducing...

Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #12)

anything in particular? This doesn't seem actionable without reproducing...

sorry about the ping, I must have made a misclick during the reo weekly triage session.

There are several more reports mentioning that the crash happened after deleting history. The crash seems to reliably be on 0x7ffffffff0de7fff, too, which is kinda odd.

I have one theory about how-to-repro here... I haven't repro'd the actual crash, but I've managed to trigger something odd that could conceivably crash.

My STR:

  1. Copy a giant places.sqlite into a fresh profile (so as not to corrupt your regular profile). I'm using my regular browsing profile's places.squlite file.
  2. Start Firefox using that fresh profile.
  3. History | Show All HIstory
  4. Click "Older than 6 months"
  5. Select-all in the right window. For me, this shows "27000 results" or so.
  6. Now press-and-hold your delete key on your keyboard (or alternately: press it rapidly & repeatedly). Keep doing this, for tens of seconds. Watch how Firefox performs, watch its memory-usage in "top".
  7. If your selection gets lost (I think it occasionally did), then repeat steps 5-6.

ACTUAL RESULTS:

  • The reported number of selected history-items holds steady (i.e. nothing obviously being deleted) for an unusual amount of time. Or it might go down a bit but then stop going down.
  • Memory usage steadily increases -- Firefox consumes another ~1% of my system memory (64GB) every couple seconds.
  • Firefox is often unresponsive; e.g. I can open tabs but sites won't load.

I'm guessing we're queuing up and servicing history-deletion-handlers for each "del" keypress, so if you hold "Del" down, we queue up a zillion handlers, consuming memory & resources with redundant handlers. The handlers also probably run into each other (maybe getting handles on the same history items?) as they're serviced; and my parent process is pegged with these handlers and can't perform other actions on behalf of child processes.

I wouldn't be surprised if the crash reports are just cases where users are doing something like this^ and running into some sort of limit (e.g. addreffing a handle into oblivion as we accumulate tasks, or something along those lines? Or running out of memory; though at first glance it doesn't look like this was an OOM.)

Here's a profile of Firefox spinning its wheels in response to my comment 15 STR:
https://share.firefox.dev/3TPWWad

Again, not sure if this is precisely what's going on for the users here, but it's definitely a way that Firefox can be brought-to-its-knees with similar delete-massive-amounts-of-history STR.

I tried my STR from comment 15 on a less-powerful-computer (in Firefox release on Windows) and I did in fact trigger an OOM, though it took a few minutes of holding down the Del key. My crash reports there were
bp-135225da-8f77-4d7a-bc7d-1441a0230327
bp-79846f25-f2cf-45f5-95ae-fb2db0230327
...which have signature [@ OOM | small ] and don't look like the crash reports on this bug here. So: comment 15 might be an independent issue from what's going on for the users crashing in this bug here.

See Also: → 1824872

I spun off bug 1824872 for comment 15 - comment 17, since it seems like probably an independent issue.

I'm hoping bug 1824957 helps here.

Depends on: 1824957
Flags: needinfo?(emilio)

So we still see crashes on 113 beta, but only late beta, which is a bit bizarre as noted above. I went through the prefs that we tweak on early-vs-late beta and I don't see anything that could potentially be related off-hand... :/

Flags: needinfo?(emilio)

But I'm confused, nsTreeBodyFrame::mView is a nsITreeView, not an nsView?

Flags: needinfo?(yjuglaret)

So that means that mView.mRawPtr is dead, which means that someone (else, probably?) manually decremented the reference count or messed up the reference content.

Sorry, what I wrote was (again) wrong.

The main point is wanted to convey is that 0x7ffffffff0de7fff is not a random value. It is the result of calling mozWritePoison(). This is a crucial hint that we should try to use for solving this crash.

When we free most objects, we will use e5 repeatedly to poison them. This is what arena_dalloc()'s MaybePoison() does. This is how the object behind mView.mRawPtr gets poisoned after it gets freed. Hence the problem is not with that object (contrary to what I wrote before).

In a few locations, we do a similar kind of poisoning, except we do it with pattern 0x7ffffffff0de7fff and not e5. In particular, 0x7ffffffff0de7fff is used to poison nsTreeBodyFrame objects after they are destroyed. This goes through the following path:

So, with high confidence this time, what we see in these crashes is that the layout of our nsTreeBodyFrame object has already been poisoned after it was freed through nsTreeBodyFrame::DestroyFrom(). Yet, we are now handling a blur event, that ends up calling nsTreeBodyFrame::SetFocused() on that same freed (and poisoned) nsTreeBodyFrame object. So, when we try to access its mView.mRawPtr, the value we get back is (nsITreeView*)0x7ffffffff0de7fff. When we try to use that pointer, we crash.

I think this is the JavaScript handler definition that ends up calling into nsTreeBodyFrame::SetFocused():

      this.addEventListener(
        "blur",
        event => {
          this.focused = false;
          if (event.target == this.inputField) {
            this.stopEditing(true);
          }
        },
        true
      );

This calls into XULTreeElement::SetFocused(), which calls nsTreeBodyFrame::SetFocused() on the nsTreeBodyFrame* returned by XULTreeElement::GetTreeBodyFrame().

void XULTreeElement::SetFocused(bool aFocused) {
  nsTreeBodyFrame* body = GetTreeBodyFrame();
  if (body) {
    body->SetFocused(aFocused);
  }
}

There, XULTreeElement::GetTreeBodyFrame() will most often return the cached pointer value stored in the mTreeBody field of the XULTreeElement. But this field is a raw pointer and I don't see any ref counting occur when the pointer is cached.

I think that the nsFocusManager could be properly retaining the XULTreeElement, but that then the XULTreeElement would fail to guarantee that the pointer in mTreeBody lives for at least as long as the XULTreeElement itself, which would then result in this crash.

Group: core-security

There is some mechanism in place to try to guarantee that mTreeBody outlives the XULTreeElement: nsTreeBodyFrame::DestroyFrom() will call mTree->BodyDestroyed(mTopRowIndex); if mTree is set for that nsTreeBodyFrame. That will in turn set mTreeBody to nullptr for that XULTreeElement. This raises some questions:

  • Are we sure that mTree is always correctly set for any nsTreeBodyFrame whose pointer is part of the layout of a XULTreeElement? As far as I can tell, mTree is set only when we call into nsTreeBodyFrame::GetBaseElement(). Do we always call into that function for these nsTreeBodyFrame objects? Are we sure that this function finds the correct XULTreeElement?
  • Are we sure that after mTreeBody has been set to nullptr following a call to XULTreeElement::BodyDestroyed, the next call to XULTreeElement::GetTreeBodyFrame() cannot set it back to the same nsTreeBodyFrame*?
Group: core-security → layout-core-security

Another suspicious part in the code is the kungfuDeathGrip in XULTreeElement::GetTreeBodyFrame(). I suppose that this is present to guarantee that the XULTreeElement object lives at least until the end of XULTreeElement::GetTreeBodyFrame(), by preventing the refcount from droping to 0 during the doc->FlushPendingNotifications(aFlushType) call. But what about when we come out of XULTreeElement::GetTreeBodyFrame()?

I wonder if we could be observing the following scenario:

  • the XULTreeElement and its nsTreeBodyFrame are both alive and well;
  • XULTreeElement::SetFocused() calls into XULTreeElement::GetTreeBodyFrame();
  • in XULTreeElement::GetTreeBodyFrame(), after doc->FlushPendingNotifications(aFlushType);, the refcount for the XULTreeElement reaches 1 as the kungfuDeathGrip just saved us from reaching 0;
  • so far the XULTreeElement and its nsTreeBodyFrame are still both alive and well;
  • so we are able to collect a nsTreeBodyFrame* (or recycle the cached one), which we will return as a result;
  • but as we are now exiting XULTreeElement::GetTreeBodyFrame(), the kungfuDeathGrip destructor makes the refcount for the XULTreeElement fall to 0, so the XULTreeElement is freed, along with its nsTreeBodyFrame (Is it the case that freeing the XULTreeElement also frees the nsTreeBodyFrame? I'm not sure about that part);
  • back in XULTreeElement::SetFocused(), we obtained a pointer to the nsTreeBodyFrame object as a result of calling XULTreeElement::GetTreeBodyFrame(), but this object is now freed and poisoned;
  • as we try to use the poisoned object, we crash on poison value 0x7ffffffff0de7fff.

Do we perhaps need an extra kungfuDeathGrip in the scope of XULTreeElement::SetFocused(), or a way to detect that the objects were freed as we exited XULTreeElement::GetTreeBodyFrame()?

For all crashes received in the past 6 months, the address on which we crash has never been different from the poison value (0x7ffffffff0de7fff is Windows x64, 0xf0de7fff is Windows x86, 0x7ffffffff0dea7ff is Linux x64):

1 	0x7ffffffff0de7fff 	5749 	86.93 %
2 	0xf0de7fff         	678 	10.25 %
3 	0x7ffffffff0dea7ff 	186 	2.81 %

This is evidence that the time between freeing (and poisoning) the nsTreeBodyFrame and crashing is so short, that never on any of these machines the allocation slot that was occupied by the freed object has been reallocated for another object that would have made use of the bytes where mView.mRawPtr was stored. Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values.

I believe that this evidence supports the scenario from comment 27, in which the crash would immediately follow freeing the two objects.

(Edit: Somehow I wrote comment 8 for comment 27, I'm not sure why. I mean comment 27 indeed here.)

Edit: These hints from support not only the scenario from comment 27 but more generally the possibility that the nsTreeBodyFrame is freed during the call to GetTreeBodyFrame, right before calling SetFocused.

XULTreeElement is kept alive by the caller of SetFocused.

(In reply to Yannis Juglaret [:yannis] from comment #26)

  • Are we sure that after mTreeBody has been set to nullptr following a call to XULTreeElement::BodyDestroyed, the next call to XULTreeElement::GetTreeBodyFrame() cannot set it back to the same nsTreeBodyFrame*?

This could potentially happen. We only clear the primary frame pointer in nsIFrame::DestroyFrom which happens after that code.

(In reply to Yannis Juglaret [:yannis] from comment #28)

This is evidence that the time between freeing (and poisoning) the nsTreeBodyFrame and crashing is so short, that never on any of these machines the allocation slot that was occupied by the freed object has been reallocated for another object that would have made use of the bytes where mView.mRawPtr was stored. Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values.

If the memory got reallocated then the crash signature might be different, we'd get function names from the vtable of the new object. There was another crash where we saw this happen recently with nsMenuPopupFrame functions showing up in an impossible place. Or if it was a new object of the same type the code would potentially just not crash.

Keywords: leave-open
Crash Signature: [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] → [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ nsCOMPtr<T>::nsCOMPtr | nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView]

(In reply to Timothy Nikkel (:tnikkel) from comment #30)

If the memory got reallocated then the crash signature might be different, we'd get function names from the vtable of the new object. There was another crash where we saw this happen recently with nsMenuPopupFrame functions showing up in an impossible place. Or if it was a new object of the same type the code would potentially just not crash.

I didn't find other crash signatures of significant volume where the proto signature would contain, say, mozilla::dom::XULTreeElement_Binding::set_focused. But, indeed, we could just be not crashing, also allocating from this PresShell arena is probably much less common than the usual paths for allocation.

Reading your comment made me wonder if we could find additional hints in other crash signatures. For example, where do Nightly and early beta users crash here, since we do not have reports from them like [:emilio] mentioned in comment 7?

I think the following signatures look interesting:

  • mozilla::PresShell::DoFlushPendingNotifications: This signature is often reached through crashing on MOZ_DIAGNOSTIC_ASSERT(!mForbiddenToFlush) (This is bad!). Nightly and early beta users could be crashing here. One user comment mentions trying to delete a lot of history lines then using Firefox during deletion (e.g. switching tabs), and their crash stack seems very interesting, it goes from nsNavHistoryResult::OnVisit to nsTreeSelection::FireOnSelectHandler, mozilla::dom::XULElement_Binding::focus, mozilla::PresShell::DoFlushPendingNotifications, nsTreeBodyFrame::DestroyFrom, mozilla::dom::XULTreeElement_Binding::endUpdateBatch, mozilla::PresShell::DoFlushPendingNotifications (flushing reentrancy related to navigation history with nsTreeBodyFrame and XULTreeElement involved);
  • nsCycleCollectingAutoRefCnt::incr: This signature contains another late beta / release crash where user comments talk about deleting a lot of old navigation history e.g. here and here. It seems like it could be a variation of the other crash with potentially the same root cause. It occurs through mozilla::dom::XULTreeElement_Binding::get_columns instead of mozilla::dom::XULTreeElement_Binding::set_focused.

Note: Both signatures also contain crashes which do not seem related to this bug, especially for the second signature where there are many different proto signatures. But a proportion seems related in both.

Crash Signature: [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ nsCOMPtr<T>::nsCOMPtr | nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] → [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ nsCOMPtr<T>::nsCOMPtr | nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ mozilla::PresShell::DoFlushPendingNotifications] [@ nsCycleCollectingAutoRefCnt::incr]

(In reply to Yannis Juglaret [:yannis] from comment #32)

One user comment mentions trying to delete a lot of history lines then using Firefox during deletion (e.g. switching tabs), and their crash stack seems very interesting,

This seems like it is painting a giant red X over exactly how we can hit the crash in this bug! If we can enter JS and basically do anything after calling BodyDestroyed in nsTreeBodyFrame::DestroyFrom then it is very likely that we can set XULTreeElement::mTreeBody back to the frame that is doomed and we will then hit this crash not too long after. I will come back to this later today and post a patch.

Attached is what the full stack looks like. I think the C++ to JS to C++ sequence here would be `nsTreeBodyFrame::DestroyFrom` calling `view->SetTree(nullptr);`, calling `PlacesTreeView.setTree(null);`, calling `PlacesTreeView.batching(false);`, calling `this._tree.endUpdateBatch();`, calling `mozilla::dom::XULTreeElement_Binding::endUpdateBatch`.

This bug explains 29 of the 45 crashes received in 116 early beta on the mozilla::PresShell::DoFlushPendingNotifications signature, so ~64% of the volume, which can be useful to estimate the potential impact of bug 1845266 (which explains ~25% of the same volume).

Keywords: sec-high

I think we can call this mitigated by frame poisoning?

Looking at the last month of crash reports for the nsTreeBodyFrame::GetExistingView signatures, they all seem to be crashing on frame-poisoning addresses (0x7ffffffff0de7fff and similar, which I mentioned above in comment 14 but didn't realize at that point was a frame-poisoning address. :) I refreshed my memory that this is indeed a frame-poisoning address over in a pernosco trace in bug 1845223, though.)

Comment 25 and comment 28 seems to be confirming this as well. Comment 28 mentioned "Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values" -- that's indeed why poison-pointer crashes are typically security-sensitive -- but specifically with the layout frame tree classes (nsTreeBodyFrame & friends), we use arena-allocation and ensure that a given address can only ever be used to allocate instances of the same concrete class. So if the memory gets reallocated and then we use a dangling pointer to the old/deleted nsTreeBodyFrame, we'll potentially get some weirdness but we'll at least be using the pointer with the proper type. https://robert.ocallahan.org/2010/10/mitigating-dangling-pointer-bugs-using_15.html has some notes about this, too.

So I think we've confirmed that this crash is not exploitable and is mitigated by frame poisoning, and hence we don't need to have this bug hidden as security-sensitive. tnikkel, does that make sense to you?

Flags: needinfo?(tnikkel)

(I'm replacing the just-added sec-high keyword with csectype-framepoisoning, too; frame poisoning crashes aren't sec-high, fortunately.)

(In reply to Daniel Holbert [:dholbert] from comment #36)

So I think we've confirmed that this crash is not exploitable and is mitigated by frame poisoning, and hence we don't need to have this bug hidden as security-sensitive. tnikkel, does that make sense to you?

Yes, that all agrees with my knowledge.

Flags: needinfo?(tnikkel)

Thanks. I'll remove the security-sensitive flag then.

Group: layout-core-security
See Also: → 1845266
Whiteboard: [win:stability]
See Also: → 1845904
Attachment #9345229 - Attachment description: Bug 1809492. Assert tree body frame points to tree element. r?emilio → Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio
Assignee: emilio → tnikkel
Pushed by tnikkel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0744a4057867
Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r=emilio

Oops, I had the leave-open keyword in there when I wanted to land a diagnostic patch, but we ended up figuring this out and I pushed a patch that should fix this (at least at the immediate crash, the code is still not as good as I would like). So this bug should have been resolved when this landed. I'll do that now.

Status: NEW → RESOLVED
Closed: 11 months ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 118 Branch

The patch landed in nightly and beta is affected.
:tnikkel, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox117 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(tnikkel)

I'm happy to let this ride the trains.

Flags: needinfo?(tnikkel)

If we ignore OOMs, bitflips and shutdown hangs then I think this is a top parent process crasher for 116 release. From a stability point of view it should be very interesting to take this patch for 117.0 and 115.2.0esr. Can you elaborate if you think that there could be a risk here [:tnikkel]? Thank you.

Flags: needinfo?(tnikkel)
Flags: needinfo?(tnikkel)
See Also: 1845266

Now that this has had a bit more bake time, can we nominate this for ESR115 uplift? It grafts cleanly.

Flags: needinfo?(tnikkel)

Comment on attachment 9345229 [details]
Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: top crash in parent process
  • User impact if declined: crashes
  • Fix Landed on Version: 118
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): it's already on release
Flags: needinfo?(tnikkel)
Attachment #9345229 - Flags: approval-mozilla-esr115?

Comment on attachment 9345229 [details]
Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio

Approved for 115.4esr.

Attachment #9345229 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: