Bugzilla

Updated

•

2 years ago

status-firefox108: affected → wontfix

status-firefox-esr102: --- → unaffected

tracking-firefox108: ? → -

tracking-firefox109: ? → +

tracking-firefox110: --- → +

Updated

•

2 years ago

Keywords: regression

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 1

•

2 years ago

Ok, this one I can't explain unless there's already a freed or corrupt pointer in nsTreeBodyFrame::mView? The crash address is also not poison, so that's weird...

Do you know if we have a crash annotation for a11y being enabled? I see a bit of suspect code related to nsITreeView in accessibility here and in similar places, but hard to see if it might be at fault.

Flags: needinfo?(emilio) → needinfo?(aryx.bugmail)

Reporter

Comment 2

•

2 years ago

Try these reports which have accessibility set as true.

Flags: needinfo?(aryx.bugmail)

Comment 3

•

2 years ago

The bug is marked as tracked for firefox109 (beta) and tracked for firefox110 (nightly). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned.

:fgriffith, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(fgriffith)

Updated

•

2 years ago

Depends on: 1809635

Frank Griffith Jr

Updated

•

2 years ago

Assignee: nobody → emilio

Flags: needinfo?(fgriffith)

Comment 4

•

2 years ago

Set release status flags based on info from the regressing bug 1806408

status-firefox111: --- → affected

Greg Hess

Updated

•

2 years ago

status-firefox109: affected → wontfix

Greg Hess

Comment 5

•

2 years ago

Hi Emilio, how are we progressing on this bug, do we have next steps?

Flags: needinfo?(emilio)

Comment 6

•

2 years ago

I would expect bug 1809635 to fix this, but it was pretty much a patch based on code inspection, that's why I didn't close this right away.

Flags: needinfo?(emilio)

Comment 7

•

2 years ago

So something interesting about this crash is that we didn't have any crash before 109b6 (and similarly there's no crash yet as of 110b5). I wonder if we have something enabled in early beta or earlier that is preventing this crash from happening?

In any case the bug mentioned above made it to 110, so let's check in a couple weeks to confirm there are no beta crashes, and if there are then we need to figure out what might be going on there.

Severity: -- → S2

Priority: -- → P2

Comment 8

•

1 year ago

In the past week, around 8 crash reports have a comment mentioning that the crash happened when they were manually clearing history.

The most specific STR-like comment there was:

Selecting all under "older than 6 months" in the history tab and hitting the 'delete' key caused a crash.

Comment 9

•

1 year ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #6)

I would expect bug 1809635 to fix this, but it was pretty much a patch based on code inspection, that's why I didn't close this right away.

Update: emilio has requested beta/release uplift on that patch. We'll see if crash volume reduces once it makes it to those channels (hopefully soon, if approval is granted).

Comment 10

•

1 year ago

Unfortunately, the crashes are still happening in 110.0b7 :(

Comment 11

•

1 year ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #7)

So something interesting about this crash is that we didn't have any crash before 109b6 (and similarly there's no crash yet as of 110b5). I wonder if we have something enabled in early beta or earlier that is preventing this crash from happening?

So this seems true then... But it's unclear what could cause this :/

Pascal Chevrel:pascalc

Updated

•

1 year ago

status-firefox110: affected → wontfix

Donal Meehan [:dmeehan]

Updated

•

1 year ago

status-firefox111: affected → fix-optional

Flags: needinfo?(emilio)

Comment 12

•

1 year ago

anything in particular? This doesn't seem actionable without reproducing...

Flags: needinfo?(emilio)

Donal Meehan [:dmeehan]

Comment 13

•

1 year ago

(In reply to Emilio Cobos Álvarez (:emilio) from comment #12)

anything in particular? This doesn't seem actionable without reproducing...

sorry about the ping, I must have made a misclick during the reo weekly triage session.

Donal Meehan [:dmeehan]

Updated

•

1 year ago

status-firefox111: fix-optional → wontfix

Comment 14

•

1 year ago

There are several more reports mentioning that the crash happened after deleting history. The crash seems to reliably be on 0x7ffffffff0de7fff, too, which is kinda odd.

Comment 15

•

1 year ago

•

Edited

I have one theory about how-to-repro here... I haven't repro'd the actual crash, but I've managed to trigger something odd that could conceivably crash.

My STR:

1. Copy a giant places.sqlite into a fresh profile (so as not to corrupt your regular profile). I'm using my regular browsing profile's places.sqlite file.
2. Start Firefox using that fresh profile.
3. History | Show All History
4. Click "Older than 6 months"
5. Select-all in the right window. For me, this shows "27000 results" or so.
6. Now press-and-hold your delete key on your keyboard (or alternately: press it rapidly & repeatedly). Keep doing this, for tens of seconds. Watch how Firefox performs, watch its memory-usage in "top".
7. If your selection gets lost (I think it occasionally did), then repeat steps 5-6.

ACTUAL RESULTS:

The reported number of selected history-items holds steady (i.e. nothing obviously being deleted) for an unusual amount of time. Or it might go down a bit but then stop going down.
Memory usage steadily increases -- Firefox consumes another ~1% of my system memory (64GB) every couple seconds.
Firefox is often unresponsive; e.g. I can open tabs but sites won't load.

I'm guessing we're queuing up and servicing history-deletion-handlers for each "del" keypress, so if you hold "Del" down, we queue up a zillion handlers, consuming memory & resources with redundant handlers. The handlers also probably run into each other (maybe getting handles on the same history items?) as they're serviced; and my parent process is pegged with these handlers and can't perform other actions on behalf of child processes.

I wouldn't be surprised if the crash reports are just cases where users are doing something like this^ and running into some sort of limit (e.g. addreffing a handle into oblivion as we accumulate tasks, or something along those lines? Or running out of memory; though at first glance it doesn't look like this was an OOM.)

Comment 16

•

1 year ago

Here's a profile of Firefox spinning its wheels in response to my comment 15 STR:
https://share.firefox.dev/3TPWWad

Again, not sure if this is precisely what's going on for the users here, but it's definitely a way that Firefox can be brought-to-its-knees with similar delete-massive-amounts-of-history STR.

Comment 17

•

1 year ago

•

Edited

I tried my STR from comment 15 on a less-powerful-computer (in Firefox release on Windows) and I did in fact trigger an OOM, though it took a few minutes of holding down the Del key. My crash reports there were
bp-135225da-8f77-4d7a-bc7d-1441a0230327
bp-79846f25-f2cf-45f5-95ae-fb2db0230327
...which have signature [@ OOM | small ] and don't look like the crash reports on this bug here. So: comment 15 might be an independent issue from what's going on for the users crashing in this bug here.

Comment 18

•

1 year ago

I spun off bug 1824872 for comment 15 - comment 17, since it seems like probably an independent issue.

Comment 19

•

1 year ago

I'm hoping bug 1824957 helps here.

Depends on: 1824957

Updated

•

1 year ago

Flags: needinfo?(emilio)

Comment 20

•

1 year ago

So we still see crashes on 113 beta, but only late beta, which is a bit bizarre as noted above. I went through the prefs that we tweak on early-vs-late beta and I don't see anything that could potentially be related off-hand... :/

Flags: needinfo?(emilio)

Comment hidden (obsolete)

Comment 22

•

1 year ago

But I'm confused, nsTreeBodyFrame::mView is a nsITreeView, not an nsView?

Flags: needinfo?(yjuglaret)

Comment hidden (obsolete)

Yes, sorry, you are correct. I have updated my comment. The call stack looks like this if I wait for an example mView to get poisoned:

 # Child-SP          RetAddr               Call Site
00 00000049`503fdde8 00007ffa`c4b4f40e     VCRUNTIME140!memset+0xbe [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 187] 
01 (Inline Function) --------`--------     mozglue!MaybePoison+0xa [/builds/worker/checkouts/gecko/memory/build/mozjemalloc.cpp @ 1501] 
02 (Inline Function) --------`--------     mozglue!arena_dalloc+0x4a [/builds/worker/checkouts/gecko/memory/build/mozjemalloc.cpp @ 3740] 
03 (Inline Function) --------`--------     mozglue!BaseAllocator::free+0x67 [/builds/worker/checkouts/gecko/memory/build/mozjemalloc.cpp @ 4547] 
04 (Inline Function) --------`--------     mozglue!Allocator<MozJemallocBase>::free+0x67 [/builds/worker/checkouts/gecko/memory/build/malloc_decls.h @ 54] 
05 00000049`503fddf0 00007ffa`353f6d33     mozglue!je_free+0x9e [/builds/worker/checkouts/gecko/memory/build/malloc_decls.h @ 54] 
06 (Inline Function) --------`--------     xul!operator delete+0x6 [/builds/worker/workspace/obj-build/dist/include/mozilla/cxxalloc.h @ 51] 
07 (Inline Function) --------`--------     xul!NS_DestroyXPTCallStub+0x6 [/builds/worker/checkouts/gecko/xpcom/reflect/xptcall/xptcall.cpp @ 46] 
08 (Inline Function) --------`--------     xul!nsAutoXPTCStub::~nsAutoXPTCStub+0x19 [/builds/worker/workspace/obj-build/dist/include/nsXPTCUtils.h @ 30] 
09 00000049`503fdee0 00007ffa`353fb538     xul!nsXPCWrappedJS::~nsXPCWrappedJS+0x113 [/builds/worker/checkouts/gecko/js/xpconnect/src/XPCWrappedJS.cpp @ 445] 
0a (Inline Function) --------`--------     xul!nsXPCWrappedJS::DeleteCycleCollectable+0x8 [/builds/worker/checkouts/gecko/js/xpconnect/src/XPCWrappedJS.cpp @ 314] 
0b 00000049`503fdf30 00007ffa`36478b91     xul!nsXPCWrappedJS::cycleCollection::DeleteCycleCollectable+0x18 [/builds/worker/checkouts/gecko/js/xpconnect/src/xpcprivate.h @ 1571] 
0c (Inline Function) --------`--------     xul!SnowWhiteKiller::MaybeKillObject+0x4d7 [/builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp @ 2486] 
0d (Inline Function) --------`--------     xul!SnowWhiteKiller::Visit+0x9b6 [/builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp @ 2511] 
0e 00000049`503fdf60 00007ffa`35188c77     xul!nsPurpleBuffer::VisitEntries<SnowWhiteKiller>+0xb81 [/builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp @ 969] 
0f 00000049`503fe0b0 00007ffa`353f45d8     xul!nsCycleCollector::FreeSnowWhiteWithBudget+0xa7 [/builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp @ 2680] 
10 (Inline Function) --------`--------     xul!nsCycleCollector_doDeferredDeletionWithBudget+0x4a [/builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp @ 3971] 
11 00000049`503fe160 00007ffa`351c53e4     xul!AsyncFreeSnowWhite::Run+0xf8 [/builds/worker/checkouts/gecko/js/xpconnect/src/XPCJSRuntime.cpp @ 158] 
12 00000049`503fe270 00007ffa`3648dfcc     xul!IdleRunnableWrapper::Run+0x44 [/builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp @ 326] 
13 (Inline Function) --------`--------     xul!mozilla::RunnableTask::Run+0x11 [/builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp @ 555] 
14 00000049`503fe2b0 00007ffa`364909eb     xul!mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal+0x7ac [/builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp @ 879] 
15 (Inline Function) --------`--------     xul!mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal+0x2bc [/builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp @ 744] 
16 (Inline Function) --------`--------     xul!mozilla::TaskController::ProcessPendingMTTask+0x2c8 [/builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp @ 491] 
17 (Inline Function) --------`--------     xul!mozilla::TaskController::TaskController::<lambda_4>::operator()+0x2d4 [/builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp @ 218] 
18 00000049`503fe6c0 00007ffa`36204711     xul!mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:218:7'>::Run+0x2fb [/builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.h @ 549] 
19 (Inline Function) --------`--------     xul!nsThread::ProcessNextEvent+0xb49 [/builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp @ 1240] 
1a 00000049`503fe790 00007ffa`364c86bf     xul!NS_ProcessNextEvent+0xba1 [/builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp @ 479] 
1b 00000049`503feb40 00007ffa`35379e4f     xul!mozilla::ipc::MessagePump::Run+0x25f [/builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp @ 85] 
1c (Inline Function) --------`--------     xul!MessageLoop::RunInternal+0x16 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 368] 
1d 00000049`503fedc0 00007ffa`349cd1de     xul!MessageLoop::RunHandler+0x2f [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 362] 
1e 00000049`503fee10 00007ffa`34b13c58     xul!MessageLoop::Run+0x4e [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 344] 
1f 00000049`503fee70 00007ffa`34b12b6a     xul!nsBaseAppShell::Run+0x28 [/builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp @ 150] 
20 00000049`503feeb0 00007ffa`39308eb1     xul!nsAppShell::Run+0x3a [/builds/worker/checkouts/gecko/widget/windows/nsAppShell.cpp @ 615] 
21 00000049`503ff030 00007ffa`3937fc02     xul!nsAppStartup::Run+0x41 [/builds/worker/checkouts/gecko/toolkit/components/startup/nsAppStartup.cpp @ 296] 
22 00000049`503ff080 00007ffa`39380963     xul!XREMain::XRE_mainRun+0xc12 [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 5659] 
23 00000049`503ff3a0 00007ffa`36b168fb     xul!XREMain::XRE_main+0x323 [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 5859] 
24 00000049`503ff450 00007ff6`1cfaf319     xul!XRE_main+0x6b [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 5915] 
25 (Inline Function) --------`--------     firefox!do_main+0xc6 [/builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp @ 227] 
26 (Inline Function) --------`--------     firefox!NS_internal_main+0x497 [/builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp @ 445] 
27 00000049`503ff530 00007ff6`1cfc03c8     firefox!wmain+0x729 [/builds/worker/checkouts/gecko/toolkit/xre/nsWindowsWMain.cpp @ 167] 
28 (Inline Function) --------`--------     firefox!invoke_main+0x22 [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 90] 
29 00000049`503ff760 00007ffa`e25626ad     firefox!__scrt_common_main_seh+0x10c [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288] 
2a 00000049`503ff7a0 00007ffa`e358aa68     KERNEL32!BaseThreadInitThunk+0x1d
2b 00000049`503ff7d0 00000000`00000000     ntdll!RtlUserThreadStart+0x28

No nsView involved, but still a use-after-poison.

Edit: This comment is misleading, again.

Comment 24

•

1 year ago

So that means that mView.mRawPtr is dead, which means that someone (else, probably?) manually decremented the reference count or messed up the reference count.

Comment 25

•

1 year ago

•

Edited

Sorry, what I wrote was (again) wrong.

The main point I wanted to convey is that 0x7ffffffff0de7fff is not a random value. It is the result of calling mozWritePoison(). This is a crucial hint that we should try to use for solving this crash.

When we free most objects, we will use e5 repeatedly to poison them. This is what arena_dalloc()'s MaybePoison() does. This is how the object behind mView.mRawPtr gets poisoned after it gets freed. Hence the problem is not with that object (contrary to what I wrote before).

In a few locations, we do a similar kind of poisoning, except we do it with pattern 0x7ffffffff0de7fff and not e5. In particular, 0x7ffffffff0de7fff is used to poison nsTreeBodyFrame objects after they are destroyed. This goes through the following path:

nsTreeBodyFrame::DestroyFrom() calls SimpleXULLeafFrame::DestroyFrom() (the nsTreeBodyFrame* is this);
SimpleXULLeafFrame::DestroyFrom() is actually nsIFrame::DestroyFrom() (the nsTreeBodyFrame* is this);
nsIFrame::DestroyFrom() calls mozilla::PresShell::FreeFrame() (the nsTreeBodyFrame* is aPtr);
this ends up calling mozWritePoison() on the nsTreeBodyFrame object.

So, with high confidence this time, what we see in these crashes is that the layout of our nsTreeBodyFrame object has already been poisoned after it was freed through nsTreeBodyFrame::DestroyFrom(). Yet, we are now handling a blur event, that ends up calling nsTreeBodyFrame::SetFocused() on that same freed (and poisoned) nsTreeBodyFrame object. So, when we try to access its mView.mRawPtr, the value we get back is (nsITreeView*)0x7ffffffff0de7fff. When we try to use that pointer, we crash.
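To make the poisoning argument concrete, here is a minimal standalone sketch. It is not Gecko code: FakeFrameLayout, WritePoison and ReadViewSlot are hypothetical names, and the sketch assumes the poison is written as repeated pointer-sized words over the freed storage. It only shows why a stale read of a pointer field inside a poisoned object yields exactly the poison value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Poison value taken from the crash address quoted in this bug (Windows x64).
constexpr std::uint64_t kPoison = 0x7ffffffff0de7fffULL;

// Hypothetical stand-in for a frame's memory layout: two pointer-sized slots.
struct FakeFrameLayout {
  std::uint64_t vtableSlot;
  std::uint64_t viewSlot;  // plays the role of mView.mRawPtr
};

// Overwrite every 8-byte word of the buffer with the poison value.
// Assumption: the real poisoning routine behaves similarly.
void WritePoison(unsigned char* aBuf, std::size_t aSize) {
  assert(aSize % sizeof(kPoison) == 0);
  for (std::size_t i = 0; i < aSize; i += sizeof(kPoison)) {
    std::memcpy(aBuf + i, &kPoison, sizeof(kPoison));
  }
}

// Read the word stored where viewSlot lives, as a stale-pointer user would.
std::uint64_t ReadViewSlot(const unsigned char* aBuf) {
  std::uint64_t value = 0;
  std::memcpy(&value, aBuf + offsetof(FakeFrameLayout, viewSlot),
              sizeof(value));
  return value;
}

// "Destroy" a fake frame by poisoning its storage, then read the view slot.
std::uint64_t PoisonedViewSlot() {
  alignas(FakeFrameLayout) unsigned char storage[sizeof(FakeFrameLayout)] = {};
  WritePoison(storage, sizeof(storage));
  return ReadViewSlot(storage);
}
```

In this model, any field read from the poisoned storage comes back as the poison value, which matches the crash address seen in the reports.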

I think this is the JavaScript handler definition that ends up calling into nsTreeBodyFrame::SetFocused():

      this.addEventListener(
        "blur",
        event => {
          this.focused = false;
          if (event.target == this.inputField) {
            this.stopEditing(true);
          }
        },
        true
      );

This calls into XULTreeElement::SetFocused(), which calls nsTreeBodyFrame::SetFocused() on the nsTreeBodyFrame* returned by XULTreeElement::GetTreeBodyFrame().

void XULTreeElement::SetFocused(bool aFocused) {
  nsTreeBodyFrame* body = GetTreeBodyFrame();
  if (body) {
    body->SetFocused(aFocused);
  }
}

There, XULTreeElement::GetTreeBodyFrame() will most often return the cached pointer value stored in the mTreeBody field of the XULTreeElement. But this field is a raw pointer and I don't see any ref counting occur when the pointer is cached.

I think that the nsFocusManager could be properly retaining the XULTreeElement, but that then the XULTreeElement would fail to guarantee that the pointer in mTreeBody lives for at least as long as the XULTreeElement itself, which would then result in this crash.

Updated

•

1 year ago

Group: core-security

Comment 26

•

1 year ago

•

Edited

There is some mechanism in place to try to guarantee that mTreeBody outlives the XULTreeElement: nsTreeBodyFrame::DestroyFrom() will call mTree->BodyDestroyed(mTopRowIndex); if mTree is set for that nsTreeBodyFrame. That will in turn set mTreeBody to nullptr for that XULTreeElement. This raises some questions:

Are we sure that mTree is always correctly set for any nsTreeBodyFrame whose pointer is part of the layout of a XULTreeElement? As far as I can tell, mTree is set only when we call into nsTreeBodyFrame::GetBaseElement(). Do we always call into that function for these nsTreeBodyFrame objects? Are we sure that this function finds the correct XULTreeElement?
Are we sure that after mTreeBody has been set to nullptr following a call to XULTreeElement::BodyDestroyed, the next call to XULTreeElement::GetTreeBodyFrame() cannot set it back to the same nsTreeBodyFrame*?

Updated

•

1 year ago

Group: core-security → layout-core-security

Comment 27

•

1 year ago

•

Edited

Another suspicious part in the code is the kungfuDeathGrip in XULTreeElement::GetTreeBodyFrame(). I suppose that this is present to guarantee that the XULTreeElement object lives at least until the end of XULTreeElement::GetTreeBodyFrame(), by preventing the refcount from dropping to 0 during the doc->FlushPendingNotifications(aFlushType) call. But what about when we come out of XULTreeElement::GetTreeBodyFrame()?

I wonder if we could be observing the following scenario:

the XULTreeElement and its nsTreeBodyFrame are both alive and well;
XULTreeElement::SetFocused() calls into XULTreeElement::GetTreeBodyFrame();
in XULTreeElement::GetTreeBodyFrame(), after doc->FlushPendingNotifications(aFlushType);, the refcount for the XULTreeElement reaches 1 as the kungfuDeathGrip just saved us from reaching 0;
so far the XULTreeElement and its nsTreeBodyFrame are still both alive and well;
so we are able to collect a nsTreeBodyFrame* (or recycle the cached one), which we will return as a result;
but as we are now exiting XULTreeElement::GetTreeBodyFrame(), the kungfuDeathGrip destructor makes the refcount for the XULTreeElement fall to 0, so the XULTreeElement is freed, along with its nsTreeBodyFrame (Is it the case that freeing the XULTreeElement also frees the nsTreeBodyFrame? I'm not sure about that part);
back in XULTreeElement::SetFocused(), we obtained a pointer to the nsTreeBodyFrame object as a result of calling XULTreeElement::GetTreeBodyFrame(), but this object is now freed and poisoned;
as we try to use the poisoned object, we crash on poison value 0x7ffffffff0de7fff.

Do we perhaps need an extra kungfuDeathGrip in the scope of XULTreeElement::SetFocused(), or a way to detect that the objects were freed as we exited XULTreeElement::GetTreeBodyFrame()?

Comment 28

•

1 year ago

•

Edited

For all crashes received in the past 6 months, the address on which we crash has never been different from the poison value (0x7ffffffff0de7fff is Windows x64, 0xf0de7fff is Windows x86, 0x7ffffffff0dea7ff is Linux x64):

Rank 	Crash address      	Count 	Share
1 	0x7ffffffff0de7fff 	5749 	86.93 %
2 	0xf0de7fff         	678 	10.25 %
3 	0x7ffffffff0dea7ff 	186 	2.81 %

This is evidence that the time between freeing (and poisoning) the nsTreeBodyFrame and crashing is so short, that never on any of these machines the allocation slot that was occupied by the freed object has been reallocated for another object that would have made use of the bytes where mView.mRawPtr was stored. Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values.

I believe that this evidence supports the scenario from comment 27, in which the crash would immediately follow freeing the two objects.

(Edit: Somehow I wrote comment 8 for comment 27, I'm not sure why. I mean comment 27 indeed here.)

Edit: These hints support not only the scenario from comment 27 but more generally the possibility that the nsTreeBodyFrame is freed during the call to GetTreeBodyFrame, right before calling SetFocused.

Comment 29

•

1 year ago

XULTreeElement is kept alive by the caller of SetFocused.

Assignee

Comment 30

•

1 year ago

(In reply to Yannis Juglaret [:yannis] from comment #26)

Are we sure that after mTreeBody has been set to nullptr following a call to XULTreeElement::BodyDestroyed, the next call to XULTreeElement::GetTreeBodyFrame() cannot set it back to the same nsTreeBodyFrame*?

This could potentially happen. We only clear the primary frame pointer in nsIFrame::DestroyFrom which happens after that code.

(In reply to Yannis Juglaret [:yannis] from comment #28)

This is evidence that the time between freeing (and poisoning) the nsTreeBodyFrame and crashing is so short, that never on any of these machines the allocation slot that was occupied by the freed object has been reallocated for another object that would have made use of the bytes where mView.mRawPtr was stored. Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values.

If the memory got reallocated then the crash signature might be different, we'd get function names from the vtable of the new object. There was another crash where we saw this happen recently with nsMenuPopupFrame functions showing up in an impossible place. Or if it was a new object of the same type the code would potentially just not crash.

Assignee

Comment 31

•

1 year ago

Attached file Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio

Assignee

Updated

•

1 year ago

Keywords: leave-open

Assignee

Updated

•

1 year ago

Crash Signature: [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] → [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ nsCOMPtr<T>::nsCOMPtr | nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView]

Comment 32

•

1 year ago

•

Edited

(In reply to Timothy Nikkel (:tnikkel) from comment #30)

If the memory got reallocated then the crash signature might be different, we'd get function names from the vtable of the new object. There was another crash where we saw this happen recently with nsMenuPopupFrame functions showing up in an impossible place. Or if it was a new object of the same type the code would potentially just not crash.

I didn't find other crash signatures of significant volume where the proto signature would contain, say, mozilla::dom::XULTreeElement_Binding::set_focused. But, indeed, we could just be not crashing, also allocating from this PresShell arena is probably much less common than the usual paths for allocation.

Reading your comment made me wonder if we could find additional hints in other crash signatures. For example, where do Nightly and early beta users crash here, since we do not have reports from them like [:emilio] mentioned in comment 7?

I think the following signatures look interesting:

mozilla::PresShell::DoFlushPendingNotifications: This signature is often reached through crashing on MOZ_DIAGNOSTIC_ASSERT(!mForbiddenToFlush) (This is bad!). Nightly and early beta users could be crashing here. One user comment mentions trying to delete a lot of history lines then using Firefox during deletion (e.g. switching tabs), and their crash stack seems very interesting, it goes from nsNavHistoryResult::OnVisit to nsTreeSelection::FireOnSelectHandler, mozilla::dom::XULElement_Binding::focus, mozilla::PresShell::DoFlushPendingNotifications, nsTreeBodyFrame::DestroyFrom, mozilla::dom::XULTreeElement_Binding::endUpdateBatch, mozilla::PresShell::DoFlushPendingNotifications (flushing reentrancy related to navigation history with nsTreeBodyFrame and XULTreeElement involved);
nsCycleCollectingAutoRefCnt::incr: This signature contains another late beta / release crash where user comments talk about deleting a lot of old navigation history e.g. here and here. It seems like it could be a variation of the other crash with potentially the same root cause. It occurs through mozilla::dom::XULTreeElement_Binding::get_columns instead of mozilla::dom::XULTreeElement_Binding::set_focused.

Note: Both signatures also contain crashes which do not seem related to this bug, especially for the second signature where there are many different proto signatures. But a proportion seems related in both.

Updated

•

1 year ago

Crash Signature: [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ nsCOMPtr<T>::nsCOMPtr | nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] → [@ nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ nsCOMPtr<T>::nsCOMPtr | nsCOMPtr<T>::nsCOMPtr | nsTreeBodyFrame::GetExistingView] [@ mozilla::PresShell::DoFlushPendingNotifications] [@ nsCycleCollectingAutoRefCnt::incr]

Assignee

Comment 33

•

1 year ago

(In reply to Yannis Juglaret [:yannis] from comment #32)

One user comment mentions trying to delete a lot of history lines then using Firefox during deletion (e.g. switching tabs), and their crash stack seems very interesting,

This seems like it is painting a giant red X over exactly how we can hit the crash in this bug! If we can enter JS and basically do anything after calling BodyDestroyed in nsTreeBodyFrame::DestroyFrom then it is very likely that we can set XULTreeElement::mTreeBody back to the frame that is doomed and we will then hit this crash not too long after. I will come back to this later today and post a patch.
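The re-entrancy hazard described here can be modeled with a small standalone sketch. This is not the actual patch: Element, Frame, cachedBody and onViewDetached are hypothetical stand-ins for XULTreeElement, nsTreeBodyFrame, mTreeBody and the script that runs during frame destruction. The sketch assumes that a lookup can still find the doomed frame until the very end of destruction, as comment 30 suggests.

```cpp
#include <cassert>
#include <functional>

struct Frame;

// Hypothetical stand-in for XULTreeElement: caches a raw pointer to its frame.
struct Element {
  Frame* primaryFrame = nullptr;  // what a fresh lookup would still find
  Frame* cachedBody = nullptr;    // plays the role of mTreeBody

  Frame* GetBody() {
    if (!cachedBody) {
      cachedBody = primaryFrame;  // re-populate the cache on demand
    }
    return cachedBody;
  }
  void BodyDestroyed() { cachedBody = nullptr; }
};

// Hypothetical stand-in for nsTreeBodyFrame.
struct Frame {
  Element* owner = nullptr;
  std::function<void()> onViewDetached;  // models script running mid-destroy

  // aClearAgain = false: clear the cache only before the callback runs.
  // aClearAgain = true: also clear it after anything that could re-set it.
  void Destroy(bool aClearAgain) {
    assert(owner);
    owner->BodyDestroyed();  // early clear
    if (onViewDetached) {
      onViewDetached();  // may call owner->GetBody() and re-cache this frame
    }
    if (aClearAgain) {
      owner->BodyDestroyed();
    }
    owner->primaryFrame = nullptr;  // lookup pointer is only cleared last
  }
};

// Returns true if the element still caches the doomed frame after Destroy().
bool CacheDanglesAfterDestroy(bool aClearAgain) {
  Element element;
  Frame frame;
  frame.owner = &element;
  element.primaryFrame = &frame;
  frame.onViewDetached = [&element] { element.GetBody(); };
  frame.Destroy(aClearAgain);
  return element.cachedBody != nullptr;
}
```

With only the early clear, the callback re-caches the doomed frame and the cache is left dangling. Clearing again after the callback leaves it null, which is the idea the patch title describes.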

Comment 34

•

1 year ago

•

Edited

Attached file Stack from Nightly crash

Attached is what the full stack looks like. I think the C++ to JS to C++ sequence here would be `nsTreeBodyFrame::DestroyFrom` calling `view->SetTree(nullptr);`, calling `PlacesTreeView.setTree(null);`, calling `PlacesTreeView.batching(false);`, calling `this._tree.endUpdateBatch();`, calling `mozilla::dom::XULTreeElement_Binding::endUpdateBatch`.

Comment 35

•

1 year ago

•

Edited

This bug explains 29 of the 45 crashes received in 116 early beta on the mozilla::PresShell::DoFlushPendingNotifications signature, so ~64% of the volume, which can be useful to estimate the potential impact of bug 1845266 (which explains ~25% of the same volume).

Daniel Veditz [:dveditz]

Updated

•

11 months ago

Keywords: sec-high

Comment 36

•

11 months ago

•

Edited

I think we can call this mitigated by frame poisoning?

Looking at the last month of crash reports for the nsTreeBodyFrame::GetExistingView signatures, they all seem to be crashing on frame-poisoning addresses (0x7ffffffff0de7fff and similar, which I mentioned above in comment 14 but didn't realize at that point was a frame-poisoning address. :) I refreshed my memory that this is indeed a frame-poisoning address over in a pernosco trace in bug 1845223, though.)

Comment 25 and comment 28 seem to be confirming this as well. Comment 28 mentioned "Usually, due to reallocation, use-after-free crash signatures show not only poisoned values, but also partially or completely overwritten poisoned values" -- that's indeed why poison-pointer crashes are typically security-sensitive -- but specifically with the layout frame tree classes (nsTreeBodyFrame & friends), we use arena-allocation and ensure that a given address can only ever be used to allocate instances of the same concrete class. So if the memory gets reallocated and then we use a dangling pointer to the old/deleted nsTreeBodyFrame, we'll potentially get some weirdness but we'll at least be using the pointer with the proper type. https://robert.ocallahan.org/2010/10/mitigating-dangling-pointer-bugs-using_15.html has some notes about this, too.
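As a rough illustration of the per-type recycling described above, here is a minimal sketch. It is not the PresShell arena: TypedArena and the integer type ids are hypothetical, and the sketch assumes every allocation for a given type id has the same size. It shows only the property that matters here: a freed block is handed out again only to the type that previously occupied it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical arena: freed blocks go onto a free list keyed by type id, so
// an address can only be recycled for the same type that occupied it before.
class TypedArena {
 public:
  ~TypedArena() {
    for (void* p : mAllBlocks) {
      std::free(p);
    }
  }

  void* Allocate(int aTypeId, std::size_t aSize) {
    std::vector<void*>& freeList = mFreeLists[aTypeId];
    if (!freeList.empty()) {
      void* recycled = freeList.back();
      freeList.pop_back();
      return recycled;  // same address, same type as before
    }
    void* fresh = std::malloc(aSize);
    assert(fresh);
    mAllBlocks.push_back(fresh);
    return fresh;
  }

  // Blocks are never returned to malloc while the arena is alive.
  void Free(int aTypeId, void* aPtr) { mFreeLists[aTypeId].push_back(aPtr); }

 private:
  std::map<int, std::vector<void*>> mFreeLists;
  std::vector<void*> mAllBlocks;
};

// Free a block of type 1, then allocate type 2 and type 1 again.
bool ReusedOnlyForSameType() {
  TypedArena arena;
  void* tree = arena.Allocate(1, 64);
  arena.Free(1, tree);
  void* other = arena.Allocate(2, 64);  // different type: gets a fresh block
  void* tree2 = arena.Allocate(1, 64);  // same type: recycles the old block
  return other != tree && tree2 == tree;
}
```

In this model a dangling pointer to a freed block can only ever alias another object of the same type, which is the weaker failure mode the comment describes.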

So I think we've confirmed that this crash is not exploitable and is mitigated by frame poisoning, and hence we don't need to have this bug hidden as security-sensitive. tnikkel, does that make sense to you?

Flags: needinfo?(tnikkel)

Comment 37

•

11 months ago

(I'm replacing the just-added sec-high keyword with csectype-framepoisoning, too; frame poisoning crashes aren't sec-high, fortunately.)

Keywords: sec-high → csectype-framepoisoning

Assignee

Comment 38

•

11 months ago

(In reply to Daniel Holbert [:dholbert] from comment #36)

So I think we've confirmed that this crash is not exploitable and is mitigated by frame poisoning, and hence we don't need to have this bug hidden as security-sensitive. tnikkel, does that make sense to you?

Yes, that all agrees with my knowledge.

Flags: needinfo?(tnikkel)

Comment 39

•

11 months ago

Thanks. I'll remove the security-sensitive flag then.

Group: layout-core-security

Updated

•

11 months ago

Attachment #9345229 - Attachment description: Bug 1809492. Assert tree body frame points to tree element. r?emilio → Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio

Natalia Csoregi [:nataliaCs]

Updated

•

11 months ago

Assignee: emilio → tnikkel

Pulsebot

Comment 40

•

11 months ago

Pushed by tnikkel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0744a4057867
Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r=emilio

Comment 41

•

11 months ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0744a4057867

Assignee

Comment 42

•

11 months ago

Oops, I had the leave-open keyword in there when I wanted to land a diagnostic patch, but we ended up figuring this out and I pushed a patch that should fix this (at least the immediate crash; the code is still not as good as I would like). So this bug should have been resolved when this landed. I'll do that now.

Status: NEW → RESOLVED

Closed: 11 months ago

status-firefox118: --- → fixed

Keywords: leave-open

Resolution: --- → FIXED

Target Milestone: --- → 118 Branch

Updated

•

11 months ago

status-firefox116: --- → wontfix

status-firefox117: --- → affected

status-firefox-esr115: --- → affected

Comment 43

•

11 months ago

The patch landed in nightly and beta is affected.
:tnikkel, is this bug important enough to require an uplift?

If yes, please nominate the patch for beta approval.
If no, please set status-firefox117 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(tnikkel)

Assignee

Comment 44

•

11 months ago

I'm happy to let this ride the trains.

status-firefox117: affected → wontfix

Flags: needinfo?(tnikkel)

Comment 45

•

11 months ago

•

Edited

If we ignore OOMs, bitflips and shutdown hangs then I think this is a top parent process crasher for 116 release. From a stability point of view it should be very interesting to take this patch for 117.0 and 115.2.0esr. Can you elaborate if you think that there could be a risk here [:tnikkel]? Thank you.

Flags: needinfo?(tnikkel)

Assignee

Updated

•

11 months ago

Flags: needinfo?(tnikkel)

Comment 47

•

9 months ago

Now that this has had a bit more bake time, can we nominate this for ESR115 uplift? It grafts cleanly.

Flags: needinfo?(tnikkel)

Assignee

Comment 48

•

9 months ago

Comment on attachment 9345229 [details]
Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio

ESR Uplift Approval Request

If this is not a sec:{high,crit} bug, please state case for ESR consideration: top crash in parent process
User impact if declined: crashes
Fix Landed on Version: 118
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): it's already on release

Flags: needinfo?(tnikkel)

Attachment #9345229 - Flags: approval-mozilla-esr115?

Comment 49

•

9 months ago

Comment on attachment 9345229 [details]
Bug 1809492. Clear pointer to nsTreeBodyFrame on XULTreeElement after any possible calls that can set it. r?emilio

Approved for 115.4esr.

Attachment #9345229 - Flags: approval-mozilla-esr115? → approval-mozilla-esr115+

Pulsebot

Comment 50

•

9 months ago

uplift

https://hg.mozilla.org/releases/mozilla-esr115/rev/5d728501a599

Updated

•

9 months ago

status-firefox-esr115: affected → fixed