Bug 1856672 (Open) · Opened 2 years ago · Updated 7 months ago

Crash in [@ mozilla::dom::TypedArray_base<T>::ProcessFixedData<T>]

Categories

(Core :: DOM: Bindings (WebIDL), defect)


Tracking


ASSIGNED
Tracking Status
firefox-esr115 --- unaffected
firefox118 --- unaffected
firefox119 + wontfix
firefox120 + wontfix
firefox121 + wontfix
firefox122 + wontfix
firefox123 + wontfix
firefox124 --- wontfix

People

(Reporter: diannaS, Assigned: sfink)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: crash, regression, stalled)

Crash Data

Attachments

(1 file, 2 obsolete files)

Crash report: https://crash-stats.mozilla.org/report/index/dafaefbe-6ebc-44f4-89d3-d344b0230927

MOZ_CRASH Reason: MOZ_CRASH(small oom when moving inline data out-of-line)

Top 10 frames of crashing thread:

0  xul.dll  mozilla::dom::TypedArray_base<JS::ArrayBufferView>::ProcessFixedData<`lambda at /builds/worker/checkouts/gecko/dom/media/mediasource/SourceBuffer.cpp:732:33'> const  dom/bindings/TypedArray.h:660
1  xul.dll  mozilla::dom::SourceBuffer::PrepareAppend  dom/media/mediasource/SourceBuffer.cpp:732
1  xul.dll  mozilla::dom::SourceBuffer::AppendBuffer  dom/media/mediasource/SourceBuffer.cpp:194
2  xul.dll  mozilla::dom::SourceBuffer_Binding::appendBuffer  dom/bindings/SourceBufferBinding.cpp:891
3  xul.dll  mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>  dom/bindings/BindingUtils.cpp:3327
4  ?  @0x3d329a0a  
5  ?  @0x3d2b8a50  
6  ?  @0x3d321805  
7  ?  @0x3d2b8a50  
8  ?  @0x3d3a3a15  

Looks like a normal 32-bit OOM to me.

Component: DOM: Bindings (WebIDL) → Audio/Video

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

This is happening with plenty of memory available and with amd64 builds.

I wonder why ProcessFixedData() was chosen over ProcessData() here.

Severity: -- → S2
Keywords: regression
OS: Windows 10 → All
Regressed by: 1690111
Hardware: x86 → All

(In reply to Karl Tomlinson (:karlt) from comment #3)

This is happening with plenty of memory available and with amd64 builds.

I wonder why ProcessFixedData() was chosen over ProcessData() here.

Probably because PrepareAppend() calls EvictData() which does all kinds of stuff, including calling TrackBuffersManager::ProcessTasks() that sounds like it might conceivably do a GC. SetReadyState() also does lots of things that smell like they could GC, at least to the analysis and me.

It is possible that the jsapi.Init(mImplObj) is failing here rather than OOMing in JS::EnsureNonInlineArrayBufferOrView(). Though I don't really know how that could be, that seems like it should only happen if the global object is gone and that shouldn't happen afaict.

Thanks. Yes, there's definitely value in picking an option for which the analysis can guarantee we're good.

PrepareAppend() uses only the length, not the data, across EvictData(). Presumably the length won't change during GC and so ProcessData() could be used after EvictData() and some other API used to get the length for EvictData().
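The reordering suggested above can be sketched with stand-in types. This is a minimal illustration, not the real Gecko API: `MockArrayBufferView`, `MockSourceBuffer`, and the member names are hypothetical, and the comments describe the constraint (no raw data pointer held across code that might GC) rather than the actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins for the types involved; names are illustrative.
struct MockArrayBufferView {
  std::vector<uint8_t> data;  // stands in for inline or out-of-line storage
  size_t Length() const { return data.size(); }
  // ProcessData: the callback receives a raw pointer, so it must not run
  // anything that could move or free the underlying buffer.
  template <typename F>
  auto ProcessData(F&& aFunc) const { return aFunc(data.data(), data.size()); }
};

struct MockSourceBuffer {
  size_t evictedBytes = 0;
  // EvictData may run arbitrary code (conceptually, anything that can GC),
  // so no raw pointer into the view's data may be live across this call.
  void EvictData(size_t aLength) { evictedBytes += aLength; }

  // Sketch of the reordering: read the length up front, evict using only
  // that cached length, and only then touch the data inside ProcessData.
  std::optional<std::vector<uint8_t>> PrepareAppend(
      const MockArrayBufferView& aView) {
    size_t length = aView.Length();  // safe: no data pointer held yet
    EvictData(length);               // may "GC"; we only used the length
    return aView.ProcessData([](const uint8_t* aData, size_t aLength) {
      return std::optional<std::vector<uint8_t>>(
          std::vector<uint8_t>(aData, aData + aLength));
    });
  }
};
```

The point of the ordering is that the only state carried across the potentially GC-ing `EvictData()` call is a plain `size_t`, never a pointer into the buffer.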

Just posting so that no-one else goes to the trouble of testing the same thing: this is not reproducing the crash.

This crash is also occurring (mac, linux) with plenty of available memory with Crypto.getRandomValues().

That has only the ArrayBufferView form.

All the reports I've seen have the ArrayBufferView template parameter, even though SourceBuffer.appendBuffer() has the ArrayBuffer overload, but perhaps the compiler has folded the two instantiations into one.

Component: Audio/Video → DOM: Bindings (WebIDL)

(In reply to Karl Tomlinson (:karlt) from comment #8)

PrepareAppend() uses only the length, not the data, across EvictData(). Presumably the length won't change during GC and so ProcessData() could be used after EvictData() and some other API used to get the length for EvictData().

"GC" here is unhelpfully used to stand in for "run JS code", and JS code can change the length. Currently the only way would be to detach the ArrayBuffer, but we'll be adding more ways in bug 1842773 after the dust settles here.

The bug is marked as tracked for firefox119 (beta). We have limited time to fix this, the soft freeze is in 7 days. However, the bug still isn't assigned.

:hsinyi, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(htsai)
Attached file Bug 1856672 - Remove more remnants of old API (obsolete) —
Assignee: nobody → sphink
Status: NEW → ASSIGNED

Comment on attachment 9358146 [details]
Bug 1856672 - Remove more remnants of old API

Revision D190845 was moved to bug 1858533. Setting attachment 9358146 [details] to obsolete.

Attachment #9358146 - Attachment is obsolete: true

Sorry, attached a patch for the wrong bug. I still don't know what's going on with this one.

Assignee: sphink → nobody
Status: ASSIGNED → NEW

(In reply to Steve Fink [:sfink] [:s:] from comment #6)

It is possible that the jsapi.Init(mImplObj) is failing here rather than OOMing in JS::EnsureNonInlineArrayBufferOrView(). Though I don't really know how that could be, that seems like it should only happen if the global object is gone and that shouldn't happen afaict.

We could try to separate the two calls to get better stacks, though it does look unlikely that the Init is failing.
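Separating the two failure points could look roughly like the following. This is a hedged sketch with mock types, not the real binding code: the enum, struct names, and return-code style are all illustrative (the real code would presumably use two distinct MOZ_CRASH reasons rather than return values, so that crash reports distinguish the cases).

```cpp
#include <cassert>

// Hypothetical: split the single crash site into two, so a crash report
// tells us whether jsapi.Init() or the ensure-non-inline step failed.
enum class FixedDataFailure { None, InitFailed, EnsureNonInlineFailed };

struct MockJSAPI {
  bool canInit;
  bool Init() { return canInit; }
};

struct MockView {
  bool canMoveOutOfLine;
  bool EnsureNonInline() { return canMoveOutOfLine; }
};

FixedDataFailure ProcessFixedDataDiagnostic(MockJSAPI& aJsapi,
                                            MockView& aView) {
  if (!aJsapi.Init()) {
    // Real code might MOZ_CRASH("ProcessFixedData: jsapi.Init failed")
    return FixedDataFailure::InitFailed;
  }
  if (!aView.EnsureNonInline()) {
    // Real code crashes with "small oom when moving inline data out-of-line"
    return FixedDataFailure::EnsureNonInlineFailed;
  }
  return FixedDataFailure::None;
}
```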

Flags: needinfo?(peterv)

(In reply to BugBot [:suhaib / :marco/ :calixte] from comment #13)

The bug is marked as tracked for firefox119 (beta). We have limited time to fix this, the soft freeze is in 7 days. However, the bug still isn't assigned.

:hsinyi, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply backout the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

I am following this to ensure it gets the right attention and actions. We are actively investigating the issue. We will assign someone once we know more about what's going on.

Flags: needinfo?(htsai)

:hsinyi, we are in RC week for Fx119 and go-live is on 2023-10-24, but we're monitoring this for a fix that we could uplift in a dot release.

This is a reminder regarding comment #13!

The bug is marked as tracked for firefox119 (beta). We have limited time to fix this, the soft freeze is in a day. However, the bug still isn't assigned.

Peter, I am assigning this to you for now since you've been actively investigating it.

Once we have more ideas about the root cause, we may reassign and ask for help from other teams.

Assignee: nobody → peterv
Flags: needinfo?(htsai)
Assignee: peterv → sphink
Status: NEW → ASSIGNED
Attached file Bug 1856672 - Diagnostics for OOM (?) crash (obsolete) —

When looking at this again, I noticed another way this could fail that doesn't involve OOM: if the supposed ArrayBufferView object passed in is of the wrong type, then it will throw a JS exception and return false.

If that is what is going on, then it seems like we should be handling errors here somehow instead of just (safely) crashing. Unfortunately, ProcessFixedData doesn't have a way to handle errors that originate in the infrastructure code itself, and it forwards the return value through from the given lambda, so the body of ProcessFixedData doesn't know what sort of error value it could return.

We could perhaps add another parameter giving an error value? In this case, the return value is RefPtr<MediaByteBuffer> if I'm looking at the right thing, and it is immediately bool-checked. Although the caller just returns on error here; we'd probably really want to use the ErrorResult& aRv instead.

If necessary, instead of an error return value, we could have a whole additional lambda passed in to handle the error case.

What I mean by "the wrong type" above is anything that doesn't resolve to ArrayBufferObject or ArrayBufferViewObject when passed through CheckedUnwrapStatic.
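The error-value idea above can be sketched as follows. This is a hypothetical illustration, not the real `ProcessFixedData` signature: the function name, the `infraOk` stand-in for "jsapi.Init and the non-inline conversion both succeeded", and the plain `std::vector` argument are all assumptions made so the snippet is self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Sketch: let the caller supply the value to return when the infrastructure
// itself fails, since the real ProcessFixedData only forwards the lambda's
// return value and has no error value of its own.
template <typename Processor, typename Ret>
Ret ProcessFixedDataWithError(bool aInfraOk,
                              const std::vector<uint8_t>& aData,
                              Processor&& aProcessor, Ret aErrorValue) {
  if (!aInfraOk) {
    // e.g. jsapi.Init() failed or moving the data out-of-line failed:
    // hand back the caller-supplied error value instead of crashing.
    return aErrorValue;
  }
  return aProcessor(aData.data(), aData.size());
}
```

The alternative floated above, passing a whole error-handling lambda, would replace `aErrorValue` with a callable invoked on the failure path, which also gives the caller a place to set an ErrorResult.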

Bleh. Automation accidentally reassigned this to me. I put a diagnostic patch up for review, but we might want to deal with the type issue first instead?

Assignee: sphink → peterv

It's not looking likely for 119, but do we expect something for Fx120?
Setting Fx120 to tracking+

Flags: needinfo?(peterv)

peterv - do you want me to land the diagnostics patch, or would you rather come up with a mechanism for error handling first (or instead)?

Crash Signature: [@ mozilla::dom::TypedArray_base<T>::ProcessFixedData<T>] → [@ mozilla::dom::TypedArray_base<T>::ProcessFixedData<T>] [@ mozilla::dom::TypedArray_base<T>::ProcessFixedData]

(In reply to Steve Fink [:sfink] [:s:] from comment #24)

When looking at this again, I noticed another way this could fail that doesn't involve OOM: if the supposed ArrayBufferView object passed in is of the wrong type, then it will throw a JS exception and return false.

I don't understand how that could happen for the crash in comment 0. The argument conversion code in the binding calls ArrayBufferView_base::Init, which eventually makes us end up calling JS::ArrayBufferView::unwrap. JS::ArrayBufferView::unwrap should be doing the necessary checks, how can that end up with an object of the wrong type? The binding code will just immediately throw an exception if JS::ArrayBufferView::unwrap returns false.

Flags: needinfo?(peterv) → needinfo?(sphink)

Also, I've been digging into why this only seems to be happening for ArrayBufferView. Would it be possible that ArrayBufferViewObject::ensureBufferObject is returning nullptr? I'm not sure when that could happen, but it would cause ArrayBufferViewObject::ensureNonInline to return false.

Depends on: 1865987
Depends on: 1866744
Attachment #9360498 - Attachment is obsolete: true

(I think peterv is right in comment 23, but he's also landing a bunch of new diagnostics that should pinpoint this problem precisely.)

Flags: needinfo?(sphink)
Depends on: 1867941
Depends on: 1868654

Ni'ing peterv to provide insights from the debugging added in bug 1867941 once available.

Flags: needinfo?(peterv)

We're waiting for crashes on nightly.

Flags: needinfo?(peterv)

Hi Peter, any updates now that we've got some newer crash reports?

Flags: needinfo?(peterv)

:mccr8 as triage owner, do you have any insight here? See Comment 33

Flags: needinfo?(continuation)

Here is a crash stats search for this signature with a facet on reasons. The most recent ones are "We did run out of memory!" but there are some older ones that are different. I'll message Peter and see if he has any ideas.

Flags: needinfo?(continuation)
Depends on: 1876107

We're still trying to diagnose what exactly is going on. Landed another diagnostics patch.

Flags: needinfo?(peterv)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash

:peterv, the volume seems to have gone down, but this still isn't fixed. Have the diagnostic patches helped uncover the issue here?

Flags: needinfo?(peterv)

Nope, Steve and I are still trying to understand the additional data that we have.

Flags: needinfo?(peterv)

We added diagnostic patches and have been looking into additional data from crash reports for months, but have yet to figure out a clear path forward. I am marking this stalled for now.

Keywords: stalled
Assignee: peterv → sphink
