Open Bug 1639258 Opened 4 years ago Updated 1 year ago

Crash in [@ GetCoeffsFast]

Categories

(Core :: Graphics: ImageLib, defect)

76 Branch
Unspecified
Windows 10
defect

Tracking

()

Tracking Status
firefox76 --- wontfix

People

(Reporter: yoasif, Unassigned)

Details

(Keywords: crash)

Crash Data

This bug is for crash report bp-ce5c3ccd-8744-4f1f-a6bb-2f8f40200518.

Top 10 frames of crashing thread:

0 xul.dll GetCoeffsFast media/libwebp/src/dec/vp8_dec.c:443
1 xul.dll VP8DecodeMB media/libwebp/src/dec/vp8_dec.c:614
2 xul.dll IDecode media/libwebp/src/dec/idec_dec.c:590
3 xul.dll mozilla::image::nsWebPDecoder::ReadSingle image/decoders/nsWebPDecoder.cpp:444
4 xul.dll mozilla::image::nsWebPDecoder::ReadPayload image/decoders/nsWebPDecoder.cpp:418
5 xul.dll mozilla::image::nsWebPDecoder::ReadHeader image/decoders/nsWebPDecoder.cpp:412
6 xul.dll mozilla::image::nsWebPDecoder::ReadData image/decoders/nsWebPDecoder.cpp:85
7 xul.dll mozilla::image::nsWebPDecoder::UpdateBuffer image/decoders/nsWebPDecoder.cpp
8 xul.dll mozilla::image::nsWebPDecoder::DoDecode image/decoders/nsWebPDecoder.cpp:109
9 xul.dll mozilla::image::Decoder::Decode image/Decoder.cpp:133

Seems like a new crash in 76.0.1. 82 crashes in the last week on 72 installs.

Got a user report at: https://www.reddit.com/r/firefox/comments/gm9nb1/firefox_crashing_tabs_randomly_looking_for_some/

Moving to ImageLib based on the stack. I had a quick look to see if any of the files involved changed in 76 and I don't see anything (there are changes before and after). Andrew, anything standing out here?

Component: Audio/Video: Playback → ImageLib
Flags: needinfo?(aosmond)

Just adding Timothy to this in case he has ideas/suggestions.

Flags: needinfo?(tnikkel)

Looks like something we can and should fix quickly.

Severity: -- → S2

The crash has existed with very low volume at least as far back as ESR 68.

The crashing lines (some of the time) were last touched when we updated libwebp in bug 1618288, which is in 74. But only 76.0.1 has a spike in crashes, not 76, not anything in 75 or 74.

There are 143 crashes in the last 3 months. Of those, 130 are on a cpu with "family 23 model 1 stepping 1", which seems to correspond to AMD Ryzen cpus. Of the 13 crashes that aren't on that specific cpu, only 2 are on 76.0.1, so those 13 seem to correspond to the previous low-volume crash.

Is this enough circumstantial evidence to look closer at that specific cpu?
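
A minimal sketch of the tally above, for anyone who wants to redo the split on fresh numbers. The helper names (parse_cpu_info, summarize) are hypothetical and are not part of any crash-stats tooling; the only figures taken from this bug are 143, 130, 13 and 2.

```python
import re


def parse_cpu_info(cpu_info):
    """Extract family/model/stepping integers from a crash-report CPU string.

    Returns None when the string does not contain the expected fields.
    """
    match = re.search(r"family (\d+) model (\d+) stepping (\d+)", cpu_info)
    if match is None:
        return None
    family, model, stepping = (int(group) for group in match.groups())
    return {"family": family, "model": model, "stepping": stepping}


def summarize(total, on_target_cpu, off_target_on_new_release):
    """Split a crash total into target-CPU and other-CPU buckets."""
    off_target = total - on_target_cpu
    return {
        "off_target": off_target,
        "target_share": on_target_cpu / total,
        # Crashes on other CPUs that predate the release with the spike.
        "off_target_older_releases": off_target - off_target_on_new_release,
    }


cpu = parse_cpu_info("family 23 model 1 stepping 1")
print(hex(cpu["family"]))  # family 23 decimal is AMD family 17h

summary = summarize(total=143, on_target_cpu=130, off_target_on_new_release=2)
print(summary["off_target"], summary["off_target_older_releases"])
print(round(summary["target_share"] * 100, 1))
```

With the counts quoted here, roughly nine in ten crashes land on the one cpu signature, and 11 of the 13 remaining crashes are on releases other than 76.0.1.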

Flags: needinfo?(tnikkel) → needinfo?(dmajor)

Oh and from what I can tell there are only two small changes in 76.0.1 over 76.

https://hg.mozilla.org/releases/mozilla-release/rev/2aa3ab8e2feeb3f8b67684ace2f3db7d8126b460
https://hg.mozilla.org/releases/mozilla-release/rev/e2de5f11bc0afd9a3024d32b83cb9f0ada95717a

Neither seems like it would cause this. One of them is about some nvidia dll, but this crash seems to happen whether the gpu is nvidia or amd (there are a few intel in there too, but given the cpu is quite powerful, this cohort likely also has a powerful gpu).

The function_offsets are all over GetCoeffsFast; it's not a single particular line that's crashing. Many of the crashing instructions are benign, like cmp or reg-to-reg movs. There's even a subset of crashes that aren't on a proper instruction boundary, so the bytes there get misread as a privileged "in" (port I/O) instruction and crash with EXCEPTION_PRIV_INSTRUCTION.

A cpu issue is certainly a possibility at this point. Given the wide range of crash addresses, maybe a jump accidentally went to a garbage offset. (EDIT: On second thought, in light of the benign instructions, more likely the cpu is decoding something other than what we're seeing.)
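
One way to check the "not on a proper instruction boundary" observation against a batch of reports is sketched below. It assumes you already have a sorted list of instruction start offsets for the function from a disassembler; the offsets in the example are made up and are not taken from GetCoeffsFast, and classify_offset is a hypothetical helper.

```python
from bisect import bisect_right


def classify_offset(instruction_starts, crash_offset):
    """Classify a function-relative crash offset.

    instruction_starts must be a sorted list of offsets at which the
    disassembler says an instruction begins. Returns a (kind, start) tuple
    where kind is "boundary", "mid-instruction" or "before-function" and
    start is the offset of the instruction containing the crash address
    (None for "before-function").

    Note: an offset past the end of the function is reported as being
    inside the last instruction, since instruction lengths are not known.
    """
    index = bisect_right(instruction_starts, crash_offset) - 1
    if index < 0:
        return ("before-function", None)
    start = instruction_starts[index]
    kind = "boundary" if start == crash_offset else "mid-instruction"
    return (kind, start)


# Hypothetical instruction start offsets, for illustration only.
starts = [0x00, 0x03, 0x07, 0x0C, 0x0E]
print(classify_offset(starts, 0x07))
print(classify_offset(starts, 0x09))
```

A crash address that classifies as mid-instruction means the cpu was executing bytes from the middle of an encoded instruction, which is how an otherwise harmless byte sequence can end up decoded as something privileged.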

Flags: needinfo?(dmajor)
Flags: needinfo?(aosmond)

S1 or S2 bugs need an assignee - could you find someone for this bug?

Flags: needinfo?(aosmond)

Okay, then I think the severity should be reduced. The volume is relatively small given this is already in release. I confirmed there is no funny business with PGO going on in the method by comparing the produced assembly (they are the same). We certainly didn't uplift anything imagelib for 76.0.1. Reviewing the code, I don't have an explanation, given that it accesses that structure several times prior to the "typical" instruction it fails at. It is restricted to a particular CPU for the most part, which suggests there might be a processor bug interacting and we are getting bitten in this particular place. I think the best we can do at this point is wait and see if the next release changes the volume.

Severity: S2 → S3
Flags: needinfo?(aosmond)