Open Bug 1835488 Opened 1 year ago Updated 1 year ago

Frequent WebRender crash in [@ <glean_core::metrics::timing_distribution::TimingDistributionMetric as core::clone::Clone>::clone ]

Categories

(Data Platform and Tools :: Glean: SDK, defect, P5)

x86_64
Linux
defect

Tracking

(firefox114 affected)

Tracking Status
firefox114 --- affected

People

(Reporter: ranmyaku262, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0

Steps to reproduce:

I wish I could give steps, but the timing and steps seem completely random. It might be related to watching streams or videos, but even that isn't clear.

Attached is the crash report:
https://crash-stats.mozilla.org/report/index/a008e509-b4a6-4adb-8702-9ecf30230527#allthreads

Actual results:

With no real consistency or warning, Firefox will suddenly crash, and I will get the submit-error-report popup.
It usually happens without warning, but the most consistent factor is that a stream is playing, either in the current tab or in another one.
This can take anywhere from 10 minutes to an entire day to happen.

Expected results:

Not crashing.

Crash Signature: [@ <glean_core::metrics::timing_distribution::TimingDistributionMetric as core::clone::Clone>::clone ]

The bug has a crash signature, thus the bug will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true

I can confirm this crash when running Linux 6.4-rc3, while I don't have it on Linux 6.3 (smells like kernel regression).

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core
Blocks: wr-linux
Component: Widget: Gtk → Glean: SDK
Keywords: crash
OS: Unspecified → Linux
Product: Core → Data Platform and Tools
Hardware: Unspecified → x86_64
Summary: Frequent WRRende crash → Frequent WebRender crash in [@ <glean_core::metrics::timing_distribution::TimingDistributionMetric as core::clone::Clone>::clone ]
Version: Firefox 113 → unspecified
See Also: → 1838701
See Also: 1838701, 1838323

I have some potentially interesting information:

While I do get this crash, it shows up under multiple crash signatures, all different.

https://crash-stats.mozilla.org/report/index/9df3bcd3-9f18-4e6a-9807-69ad30230630
https://crash-stats.mozilla.org/report/index/554356bc-5347-40db-8aff-884000230629
https://crash-stats.mozilla.org/report/index/0feec8f4-f2b4-423c-b60b-f31990230629

The first two appear to refer to Glean; the last one doesn't. Maybe it's related, maybe it's not.

Aside from that, the symptoms appear to match: unpredictable crashing, often when playing a video, which can occur as often as once an hour or as rarely as once in several weeks.

Something else I noted is that, when symptoms occur, it's easy to retrigger them (should the browser not crash in the process...), but getting to that stage from a fresh restart is frustratingly hard.

What I think is going on is that something is "sensitizing" the browser to such behavior. Causes for this can vary, but in my experience playing video, using PiP, minimizing the browser, and entering a virtual console seem to increase the likelihood of this.

When the browser is sensitized, it's quite easy to trigger symptoms. For example, enter and exit a virtual console while playing a video, which clears VRAM. Depending on how the browser is sensitized, this could cause a hard lockup, corrupt visual elements, or just corrupt the video stream.

Additionally, having other 3d-accelerated applications running seems to activate the bug more often - when running FF as the only 3d-accelerated app, such weirdness happens much more rarely than when also running other apps, in which case it is just a matter of time.

Which symptoms occur seems to depend on whether hardware acceleration is enabled. If it is, crashes can occur; otherwise, effects appear limited to corruption of the framebuffer and graphic elements (glyphs, pictures...), which can be resolved by dragging the tabs to another window, and to the PiP window flickering between two frames.

This makes me think something is going wrong in tracking which section of memory stores decoded video and other 3D acceleration data, and in ensuring that this information is passed correctly to the drivers and tracked within the browser itself.

PS: Symptoms have occurred since I was on kernel 6.3.3. I don't recall hard crashes back then, but that was probably because I had hardware acceleration disabled.

The recent surge in crashes here seems to be correlated with Linux 6.4.

https://bugzilla.kernel.org/show_bug.cgi?id=217624

Here to report that despite disabling force-enabled hardware acceleration, I still got a crash.

https://crash-stats.mozilla.org/report/index/c2884ef4-81bb-4aab-a099-aa7430230704

Despite that, it seemed more stable. That makes sense if that's the cause, as no hardware acceleration probably means fewer ffmpeg forks and thus less chance of corruption.

I had this crash for the first time (along with crashes with many other signatures) after upgrading to Linux 6.4. After downgrading back to 6.1 LTS I'm not experiencing this or any other crashes anymore.

Assignee: nobody → tlong
Priority: -- → P2

...but how are you going to fix this if the cause of the recent burst of crashes is (somehow) a bug in the kernel? code your own fork!?

and good luck solving the other, very similar issues while that's still a problem.

Flags: needinfo?(tlong)

(In reply to FavoritoHJS from comment #8)

...but how are you going to fix this if the cause of the recent burst of crashes is (somehow) a bug in the kernel? code your own fork!?

and good luck solving the other, very similar issues while that's still a problem.

Calm, calm.

The kernel-side fix is already mainlined in v6.5-rc1, and some (if not all) of it is also being backported to the v6.4.y stable series.
Specifically, commit f96c48670319d6 ("mm: disable CONFIG_PER_VMA_LOCK until its fixed") disables the problematic CONFIG_PER_VMA_LOCK
for now while a proper fix is in progress.

Bye!

It appears 6.4.3 does indeed have that change, so that's good. Now to make sure people actually update their kernel...

And sorry for sounding somewhat angry; I assumed this wouldn't be patched for a while, so a workaround would be required.
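
For anyone wanting to check whether their own machine falls in the range discussed above, here is a minimal Python sketch. It rests on assumptions taken only from this thread: that the suspect kernels are 6.4.0 up to (but not including) 6.4.3, and that the relevant option appears as `CONFIG_PER_VMA_LOCK=y` in the kernel config. The function names are made up for illustration, and the config locations (`/proc/config.gz`, `/boot/config-<release>`) are common but not guaranteed on every distribution. Distribution kernels may carry backports, so treat the result as a hint, not a diagnosis.

```python
import gzip
import platform
import re
from pathlib import Path
from typing import Optional, Tuple


def parse_kernel_version(release: str) -> Tuple[int, int, int]:
    """Turn a release string such as '6.4.3-arch1-1' into (6, 4, 3)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_suspect_range(release: str) -> bool:
    """Assumption from this thread: 6.4.0 through 6.4.2 lack the workaround."""
    return (6, 4, 0) <= parse_kernel_version(release) < (6, 4, 3)


def per_vma_lock_enabled(config_text: str) -> bool:
    """True only if the config explicitly sets CONFIG_PER_VMA_LOCK=y."""
    return any(
        line.strip() == "CONFIG_PER_VMA_LOCK=y"
        for line in config_text.splitlines()
    )


def read_kernel_config(release: str) -> Optional[str]:
    """Try the usual config locations; return None if neither is readable."""
    proc_config = Path("/proc/config.gz")
    boot_config = Path(f"/boot/config-{release}")
    try:
        if proc_config.exists():
            with gzip.open(proc_config, "rt") as handle:
                return handle.read()
        if boot_config.exists():
            return boot_config.read_text()
    except OSError:
        return None
    return None


if __name__ == "__main__":
    release = platform.release()
    print(f"kernel release: {release}")
    print(f"in suspect 6.4.0-6.4.2 range: {in_suspect_range(release)}")
    config = read_kernel_config(release)
    if config is None:
        print("kernel config not readable; cannot check CONFIG_PER_VMA_LOCK")
    else:
        print(f"CONFIG_PER_VMA_LOCK=y: {per_vma_lock_enabled(config)}")
```

If the script reports a kernel in the suspect range with the option enabled, updating to a kernel that includes the commit mentioned above would be the first thing to try.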

Assignee: tlong → nobody
Flags: needinfo?(tlong)
Priority: P2 → --
Priority: -- → P5

Here to report that this issue is no longer happening with Firefox 115.0.2... but since the original report I have disabled the NVIDIA driver.
I wonder if this could be a bug with NVIDIA video decoding specifically...
