Closed Bug 1803675 Opened 2 years ago Closed 2 years ago

Large spike in OOM-like crashes on November 30th on YouTube

Categories

(Core :: General, enhancement)

Tracking

RESOLVED INCOMPLETE
Tracking Status
firefox107 - affected
firefox108 - affected
firefox109 - affected

People

(Reporter: mccr8, Unassigned)

There has been a massive spike in OOM-ish crashes across release, beta and Nightly, starting on November 30th, on both Android and desktop. I've marked some of the existing bugs for these signatures in the "see also" field.

Lots of crash comments mention that YouTube isn't working for them any more.

[@ core::option::expect_failed | alloc::alloc::alloc ] is also spiking up, but I'm not sure it is an OOM issue and it isn't quite as frequent as the others on release.

The signature [@ stackoverflow | mozilla::Internals::GetPrefValue<T> ] also looks related. Lots of comments like "After the new update, FF is unusable with YouTube."

Comments on lots of the DOM-ish variants [@ nsGlobalWindowInner::ClearDocumentDependentSlots ], [@ nsGlobalWindowOuter::SetNewDocument ] and [@ mozilla::dom::JSActorManager::ReceiveRawMessage ] also mention YouTube frequently.

Blocks: media-triage

The mozilla::Internals::GetPrefValue<T> crash looks like it involves infinite recursion in FontFaceSet bindings. I wonder if we hit an OOM at some odd point and ended up with a data structure in a weird configuration that causes us to infinitely loop. Here's an example of that: bp-ac3a6a51-2607-4f64-92c9-251570221130

[Tracking Requested - why for this release]: Some mostly rare OOM-ish crashes suddenly became top crashes overnight on multiple channels (see bug 1405521 comment 24), and YouTube is frequently mentioned.

[@ stackoverflow | js::SharedShape::getInitialShape ] looks like it is another manifestation of the fontface set DOM bindings infinite recursion:
bp-6310245c-bfed-450b-b5de-c545d0221201

Crash Signature: [@ nsGlobalWindowInner::ClearDocumentDependentSlots ][@ nsGlobalWindowOuter::SetNewDocument ][@ mozilla::dom::JSActorManager::ReceiveRawMessage ][@ stackoverflow | mozilla::Internals::GetPrefValue<T> ] → [@ nsGlobalWindowInner::ClearDocumentDependentSlots ][@ nsGlobalWindowOuter::SetNewDocument ][@ mozilla::dom::JSActorManager::ReceiveRawMessage ][@ stackoverflow | mozilla::Internals::GetPrefValue<T> ] [@ stackoverflow | js::SharedShape::getInitialShape ]

The URLs also mention YouTube. The crash spike started at ~2022-11-30T17:37 UTC and got more frequent at ~17:50 (the first crash was at 2022-11-29 21:23:46 UTC). Volume dropped ~6h later; it wasn't a sharp drop, so the problematic code may still have been loaded in pages opened earlier.

Because all release channels were affected:

  1. Could this be related to the Widevine update (bug 1801201), e.g. Firefox needed a restart to complete the update?
  2. Have we heard from YouTube about changes on their end, e.g. their metrics identifying an issue followed by a rollback?

The DOM bindings infinite recursion issue seems to be happening when we fail to allocate a JS object due to an OOM, and Emilio found a place where we seem to not be recovering gracefully from that situation, so that at least explains how an OOM might turn into a stack overflow.
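To make the suspected failure mode concrete, here is a minimal standalone C++ sketch (hypothetical names only, not actual Gecko code) of how an allocation failure that a caching accessor does not propagate can turn into unbounded recursion and a stack overflow:

    // Hypothetical sketch, not Gecko code: a cached-slot accessor that retries
    // through itself when allocation fails instead of reporting the error.
    #include <cstdio>

    struct BindingCache {
      void* mCachedWrapper = nullptr;
    };

    // Stands in for a JS wrapper allocation that fails under OOM.
    static void* TryAllocateWrapper() {
      return nullptr;  // simulate the OOM: allocation always fails
    }

    static void* GetWrapper(BindingCache& aCache, int aDepth) {
      if (aCache.mCachedWrapper) {
        return aCache.mCachedWrapper;
      }
      if (void* wrapper = TryAllocateWrapper()) {
        aCache.mCachedWrapper = wrapper;
        return wrapper;
      }
      // Bug pattern: the failure isn't propagated, so we re-enter the same
      // accessor. The cache slot is still null, and in real code the recursion
      // only ends when the stack overflows -- an OOM surfacing as a
      // stackoverflow crash signature.
      if (aDepth > 10000) {  // depth cap so this demo terminates
        fprintf(stderr, "would have overflowed the stack\n");
        return nullptr;
      }
      return GetWrapper(aCache, aDepth + 1);
    }

    int main() {
      BindingCache cache;
      GetWrapper(cache, 0);
      return 0;
    }

In a pattern like this, the fix would be to report the allocation failure back to the caller (throw or return an error) instead of re-entering the accessor, which matches the kind of missing error handling described above.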

Andrew, is there some way we could tell if these crashes were related to bug 1801201?

Flags: needinfo?(aosmond)

I looked at a handful of these crashes, and some had 4.10.2449.0 and some had 4.10.2557.0 for their gmp-widevinecdm value in the telemetry environment, so there doesn't seem to be a strict correlation to either the old or new version.

Emilio filed bug 1803682 for the infinite recursion, but I suspect that if we fix that, those crashes will just turn into some other OOM.

See Also: → 1803682

Comment 4 says this was a 6-hour spike, so maybe it doesn't need to be tracked, but it would be good to understand what happened here.

(In reply to Andrew McCreight [:mccr8] from comment #8)

Emilio filed bug 1803682 for the infinite recursion, but I suspect that if we fix that, those crashes will just turn into some other OOM.

I think that one (and maybe this one too) is a dupe of bug 1746997.

Well, maybe this one is more about the OOM.

This signature also seems correlated; there are plenty of YouTube URLs in there.

I could believe it, but it would be a bit surprising if it were the Widevine update. We updated Nightly on Nov 21st and saw almost every client updated within a few days (verified through telemetry), yet the crash spike on Nightly started at the same time as on every other channel, 9 days later, even ESR.

Flags: needinfo?(aosmond)

Not tracking against a specific release. As mentioned in comment 4, this affected all releases for a period of time. The spike went away, presumably when YouTube rolled out a fix or rolled back a change.
Could we contact them to see what change they rolled out recently?
Would that help the investigation and let us harden against similar problems in the future?

Summary: Large spike in OOM-like crashes on November 30th → Large spike in OOM-like crashes on November 30th on YouTube

Something similar happening on Reddit but in CSS code, bug 1803876. Maybe it's related, maybe not.

Moving out of media. The signatures here fall into various components, so maybe this bug should be broken up? Overall, though, it appears the cause has gone away.

No longer blocks: media-triage
Component: Audio/Video: Playback → General

The crash signatures are a side effect of high memory usage while using YouTube. I'd guess that it is related to video playback, but we don't know for certain. I was hoping that the media team might have a pre-existing relationship with YouTube to figure out if they deployed and reverted some change to their site, so that we can better understand what went wrong in Firefox.

The bug is linked to topcrash signatures, which match the following criteria:

  • Top 10 content process crashes on release
  • Top 10 desktop browser crashes on nightly
  • Top 10 content process crashes on beta

:freddy, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(fbraun)
Keywords: topcrash

Well, if this is a top crash, then yes -- we should increase the severity. Given the variety of clues and hunches, it's just not clear to me what the reasonable next steps should look like.

Severity: -- → S2
Flags: needinfo?(fbraun)

The bug is linked to topcrash signatures, which match the following criteria:

  • Top 10 desktop browser crashes on nightly
  • Top 20 desktop browser crashes on release (startup)
  • Top 10 content process crashes on release

For more information, please visit auto_nag documentation.

There are other bugs on file for these crashes. I don't think keeping this one open serves any purpose, as it looks like nobody is going to look into the YouTube crash spike.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE