Crash in std::list<T>::clear() from D2D1::EndDraw()

RESOLVED FIXED

Status

defect
--
critical
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: marco, Unassigned)

Tracking

(6 keywords)

50 Branch
x86_64
Windows 7

Firefox Tracking Flags

(firefox49 unaffected, firefox50+ fixed, firefox51+ fixed, firefox52+ fixed)

Details

(Whiteboard: [gfx-noted], crash signature)

Reporter

Description

3 years ago
[Tracking Requested - why for this release]:

+++ This bug was initially created as a clone of Bug #1291084 +++

Crashes with the "std::list<T>::clear" signature spiked on Nightly 50
around Jul 31 2016. The signature was later modified by bug 1295362,
becoming "std::list<T>::clear | CDeviceChild<T>::~CDeviceChild<T>" (and
others, but this is largely the most occurring one).

See bug 1291084 for more details.

Requesting tracking for 50, since the signature spiked in 50.
Tracking 51+, as this is the #6 top content crash in 51.
From triage with Jet: Bas says the speculative fix from bug 1291084 comment 63 was hoping to fix this. That landed on the 5th, and so should have made beta 5's build... But looking at the crash signatures for [@ std::list<T>::clear | CDeviceChild<T>::~CDeviceChild<T> ] (linked up in this bug's header), I see a lot of reports from 50.0b5. So unless this actually missed beta 5, sounds like the speculative fix didn't work.
Flags: needinfo?(bas)
Tracking 52+ as we need to get to the bottom of what is going on here, especially if the speculative fix did not work.
Can someone link to an exact report they want to use this bug for?
Flags: needinfo?(bas)
Reporter

Comment 6

3 years ago
The first Nightly build ID to expose the regression is 20160731030203.
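Firefox build IDs are timestamps in `YYYYMMDDHHMMSS` form, so the ID above points at a build from 2016-07-31, the same day the signature spiked on Nightly. A minimal sketch of decoding one, where `parse_build_id` is a hypothetical helper name and not part of any Mozilla tool:

```python
from datetime import datetime

def parse_build_id(build_id):
    """Decode a 14-digit Firefox build ID (YYYYMMDDHHMMSS) into a datetime."""
    return datetime.strptime(build_id, "%Y%m%d%H%M%S")

# Build ID quoted in this comment.
first_bad = parse_build_id("20160731030203")
print(first_bad.isoformat())  # 2016-07-31T03:02:03
```

Decoded build IDs can be compared directly, which is handy when checking whether a crash report comes from before or after a given build.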
New crash in Fx50, tracked.
Clearing the priority field, this cannot be a P3.
Priority: P3 → --
Reporter

Comment 10

3 years ago
This is #2 top (both browser and content) crasher on beta, second only to 'OOM | small'.

(100.0% in signature vs 07.65% overall) cpu_arch = amd64
(11.03% in signature vs 77.71% overall) useragent_locale = en-US
(95.97% in signature vs 35.06% overall) "DWrite+" in app_notes = true
(95.97% in signature vs 35.06% overall) "DWrite?" in app_notes = true
(95.97% in signature vs 35.06% overall) "D2D1.1+" in app_notes = true
(99.95% in signature vs 39.88% overall) os_arch = amd64
(99.95% in signature vs 40.23% overall) platform_version = 6.1.7601 Service Pack 1
(100.0% in signature vs 43.26% overall) reason = EXCEPTION_ACCESS_VIOLATION_READ
(04.03% in signature vs 57.33% overall) "D2D1.1-" in app_notes = true
(57.81% in signature vs 13.86% overall) adapter_vendor_id = Advanced Micro Devices, Inc. [AMD/ATI]
(100.0% in signature vs 59.53% overall) platform_pretty_version = Windows 7
(43.34% in signature vs 03.98% overall) useragent_locale = de
(22.90% in signature vs 61.60% overall) adapter_vendor_id = Intel Corporation
(78.93% in signature vs 48.05% overall) "D3D11 Layers+" in app_notes = true
(78.93% in signature vs 49.36% overall) "D3D11 Layers?" in app_notes = true
(51.70% in signature vs 35.15% overall) cpu_microcode_version = null
(36.80% in signature vs 20.34% overall) "EGL+" in app_notes = true
(36.80% in signature vs 20.74% overall) "WebGL+" in app_notes = true
(36.80% in signature vs 21.39% overall) "GL Context+" in app_notes = true
(36.80% in signature vs 21.39% overall) "GL Context?" in app_notes = true
(36.80% in signature vs 21.48% overall) "EGL?" in app_notes = true

It's a 64-bit only crash (Firefox 64-bit on 64-bit OS), only happening on Windows 7 SP1, often with locales != en-US.
Looking at the URLs, it seems related to video playback.
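Each correlation line above compares how often a property shows up in crashes with this signature against how often it shows up in all crashes. A rough sketch, assuming the line format used in this comment, of ranking a few of those lines by the gap between the two percentages. The `parse_line` and `rank` helpers are hypothetical and are not how Socorro computes or orders its correlations:

```python
import re

# Matches lines such as:
# (95.97% in signature vs 35.06% overall) "D2D1.1+" in app_notes = true
LINE_RE = re.compile(
    r"\((?P<sig>[\d.]+)% in signature vs (?P<overall>[\d.]+)% overall\) (?P<prop>.+)"
)

def parse_line(line):
    """Return (property, pct_in_signature, pct_overall) for one line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized correlation line: {line!r}")
    return m["prop"], float(m["sig"]), float(m["overall"])

def rank(lines):
    """Sort parsed lines by absolute percentage gap, largest first."""
    parsed = [parse_line(line) for line in lines]
    return sorted(parsed, key=lambda p: abs(p[1] - p[2]), reverse=True)

# A few lines copied from the list above.
sample = [
    "(100.0% in signature vs 07.65% overall) cpu_arch = amd64",
    "(11.03% in signature vs 77.71% overall) useragent_locale = en-US",
    "(57.81% in signature vs 13.86% overall) adapter_vendor_id = Advanced Micro Devices, Inc. [AMD/ATI]",
    "(22.90% in signature vs 61.60% overall) adapter_vendor_id = Intel Corporation",
]

for prop, sig, overall in rank(sample):
    # A positive gap means the property is over-represented in this signature.
    print(f"{sig - overall:+7.2f}  {prop}")
```

With this sample, the 64-bit CPU architecture and the under-representation of en-US come out on top, which is consistent with the summary given in this comment.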

Comment 11

3 years ago
I haven't seen any crashes with those particular crash signatures produced in Bughunter though there have been crashes with different signatures on a few (~38) urls from Socorro which originally had those signatures.

The most recent of those on Beta were due to 

Assertion failure: state.filterSourceGraphicTainted == isWriteOnly

which were fixed in bug 1307749 on Nightly/52.

I've resubmitted all of the Socorro urls (~3100) that we have tested in Bughunter where the original Socorro signature began with std::list<T>::clear. This corresponds to about 10% of the total urls from Socorro with the corresponding signature. The tests will run on 64 bit Fedora, Ubuntu, Windows 7 and Windows 10, and on 32 bit Windows 7 and Windows 10, for all 3 branches: Beta/50, Aurora/51 and Nightly/52. Each one will be tested with both opt and debug builds, while Linux will also be tested with opt-asan.

It will take until tomorrow to complete the testing but if I get early results I'll let you know. They decided not to uplift bug 1307749 since the fix was for an "assertion failure". Perhaps you can revisit that.
Jeff, should this be moved to graphics? The speculative fix in bug 1291084 didn't fix it.
Flags: needinfo?(jmuizelaar)
Reporter

Comment 13

3 years ago
Virtual is able to reproduce crashes with this signature and with the signatures from bug 1294748, which makes me think that these crashes might be related.

We've also seen an increase of crashes on automation (https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&short_desc_type=allwordssubstr&short_desc=nvwgf2umx.dll&resolution=---&query_format=advanced&list_id=13273839).

Virtual filed bug 1294748 on 2016-08-12, and we've seen an increased rate of crashes on automation since 2016-08-10. This signature spiked on Nightly on 2016-07-31, went down, and spiked again around 2016-08-10: https://crash-stats.mozilla.com/signature/?release_channel=nightly&product=Firefox&signature=std%3A%3Alist%3CT%3E%3A%3Aclear%20%7C%20CDeviceChild%3CT%3E%3A%3A~CDeviceChild%3CT%3E&date=%3E%3D2016-04-20T19%3A47%3A37.000Z&date=%3C2016-10-20T19%3A47%3A37.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#graphs.
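The crash-stats link above is hard to read because the signature and the date range are percent-encoded. A small sketch, using only the filter parameters from that link (the `_columns`, `_sort` and `page` parameters are left out here for brevity), of decoding them with the standard library:

```python
from urllib.parse import urlsplit, parse_qs

# Filter parameters taken from the crash-stats link in this comment.
url = (
    "https://crash-stats.mozilla.com/signature/"
    "?release_channel=nightly&product=Firefox"
    "&signature=std%3A%3Alist%3CT%3E%3A%3Aclear%20%7C%20"
    "CDeviceChild%3CT%3E%3A%3A~CDeviceChild%3CT%3E"
    "&date=%3E%3D2016-04-20T19%3A47%3A37.000Z"
    "&date=%3C2016-10-20T19%3A47%3A37.000Z"
)

# parse_qs decodes percent-escapes and groups repeated keys into lists.
params = parse_qs(urlsplit(url).query)
print(params["signature"][0])  # std::list<T>::clear | CDeviceChild<T>::~CDeviceChild<T>
print(params["date"])          # ['>=2016-04-20T19:47:37.000Z', '<2016-10-20T19:47:37.000Z']
```

The decoded values show that the graph covers Nightly crashes with this exact signature between 2016-04-20 and 2016-10-20.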

Virtual, could you try to pin the regression down with mozregression?

Ryan noticed that bug 1289525 landed a patch on Jul-31 and one on Aug-10. He's going to start a try build with a backout of those patches. Virtual, could you try this build once it's ready?
Flags: needinfo?(virtual)
Component: Audio/Video: Playback → Graphics
Flags: needinfo?(jmuizelaar)
(In reply to Marco Castelluccio [:marco] from comment #13)
> Virtual, could you try to pin the regression down with mozregression?
> 
> Ryan noticed that bug 1289525 landed a patch on Jul-31 and one on Aug-10.
> He's going to start a try build with a backout of those patches. Virtual,
> could you try this build once it's ready?

I'm very sorry, but I don't have much time right now to spend on finding a regression range, as the steps to reproduce are very time consuming and I'm in the middle of looking for a new job.
Flags: needinfo?(virtual)
If there's any chance you can try out the builds from comment 14, it would be immensely helpful. We're getting short on time in the 50 cycle and are having a lot of problems reproducing on our end.
Flags: needinfo?(virtual)
For now, the only thing I can do is advise and recommend backing out the suspected patches. There are 3 branches of Firefox affected by these issues that are available to test with (Beta [50], Aurora [51] and Nightly [52]); testing at least 3 reverted sets of patches in each branch should help diagnose which patches were the cause.
Flags: needinfo?(virtual)
Hardware: Unspecified → x86_64
Version: Trunk → 50 Branch

Comment 18

3 years ago
this signature has totally ceased on 50.0b10 and 51.0a2 builds after 20161020004015.

changelog beta: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_50_0b9_RELEASE&tochange=FIREFOX_50_0b10_RELEASE
changelog aurora: https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchane=95567c6551ea7dfef517e7b3f50fe0be209a3a2a&tochange=10be9d40fa865be7c3c203b9cd042722ab3069ca

so maybe we can tentatively change the state of this bug to fixed...

Comment 19

3 years ago
the crash is also consistently gone in 52.0a1 after build 20161019030208 - changelog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=90d8afaddf9150853b0b68b35b30c1e54a8683e7&tochange=99a239e1866a57f987b08dad796528e4ea30e622

the one patch that landed in all of those timeframes on beta/aurora/nightly, and the one most likely to have fixed this per a conversation on irc with RyanVM and marco, is bug 1308418.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Depends on: 1308418
Resolution: --- → FIXED
Reporter

Comment 20

3 years ago
I still find it quite strange that bug 1308418 fixed this, but the other common bugs are even less likely (bug 1310061, bug 1309980, bug 1211270, bug 1149162, bug 1294442, bug 1311588, bug 1308259).
Not that surprising. Bug 1308418 had a use-after-free (UAF) of mutexes. What kind of damage that could do is anyone's guess.
If true, this is the best surprise/unexpected win of the Beta 50 cycle! Great job everyone :)

Updated

3 years ago
See Also: → 1310600