Closed Bug 1520760 Opened 11 months ago Closed 10 months ago

crash on jsimd_idct_islow_avx2

Categories

(Core :: ImageLib, defect, P1)

64 Branch
x86
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
firefox-esr60 --- unaffected
firefox64 --- wontfix
firefox65 + wontfix
firefox66 + fixed
firefox67 --- verified

People

(Reporter: alexander, Assigned: aosmond)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0

Steps to reproduce:

Please, take a look at https://crash-stats.mozilla.com/signature/?signature=jsimd_idct_islow_avx2&date=%3E%3D2019-01-10T13%3A57%3A52.000Z&date=%3C2019-01-17T13%3A57%3A52.000Z&_sort=-date

A lot of crashes on Windows on Windows platform, for FF 64 and 65

Trace:
This bug is for crash report bp-d843d57e-1bb1-493f-9dcb-ac3e00190117.

Top 10 frames of crashing thread:

0 xul.dll jsimd_idct_islow_avx2 
1  @0x1010100f 
2 xul.dll jsimd_idct_islow media/libjpeg/simd/i386/jsimd.c:1137
3 xul.dll static int decompress_onepass media/libjpeg/jdcoefct.c:142
4 xul.dll static void process_data_context_main media/libjpeg/jdmainct.c:338
5 xul.dll void mozilla::image::nsJPEGDecoder::OutputScanlines image/decoders/nsJPEGDecoder.cpp:670
6 xul.dll class mozilla::image::LexerTransition<mozilla::image::nsJPEGDecoder::State> mozilla::image::nsJPEGDecoder::ReadJPEGData image/decoders/nsJPEGDecoder.cpp:466
7 xul.dll static class mozilla::image::LexerTransition<mozilla::image::nsJPEGDecoder::State> mozilla::image::nsJPEGDecoder::DoDecode::<unnamed-tag>::operator image/decoders/nsJPEGDecoder.cpp:213
8 xul.dll static class mozilla::Maybe<mozilla::Variant<mozilla::image::TerminalState, mozilla::image::Yield> > mozilla::image::StreamingLexer<mozilla::image::nsJPEGDecoder::State, 16>::ContinueUnbufferedRead<`lambda at z:/build/build/src/image/decoders/nsJPEGDecoder.cpp:210:21'> image/StreamingLexer.h:598
9 xul.dll static class mozilla::Maybe<mozilla::Variant<mozilla::image::TerminalState, mozilla::image::Yield> > mozilla::image::StreamingLexer<mozilla::image::nsJPEGDecoder::State, 16>::UnbufferedRead<`lambda at z:/build/build/src/image/decoders/nsJPEGDecoder.cpp:210:21'> image/StreamingLexer.h:543

Crash Signature: [@ jsimd_idct_islow_avx2]
Keywords: crash
Summary: jsimd_idct_islow_avx2 → crash on jsimd_idct_islow_avx2

New regression with 64 it seems

Moving to a better component based on the source which points back to Bug 1425835, which landed in the 64 timeframe. ni on RyanVM - do we think this could have caused this regression?

Status: UNCONFIRMED → NEW
Component: Untriaged → ImageLib
Ever confirmed: true
Flags: needinfo?(ryanvm)
Product: Firefox → Core

That was the update which landed the crashing function in question, so yeah, makes sense.

Blocks: 1425835
Flags: needinfo?(ryanvm)

This appears related to https://github.com/libjpeg-turbo/libjpeg-turbo/issues/288 although it appears we have included the fix for this in https://github.com/libjpeg-turbo/libjpeg-turbo/commit/d5f281b734425fc1d930ff2c3f8441aad731343e#diff-37ff9a9aa9b533b05baf4137c47092a6. I note this only happens on x86 so it is possible the detection is still flawed for that architecture.

OS: Unspecified → Windows
Hardware: Unspecified → x86

The libjpeg-turbo github issue was predominantly for Windows 7 pre-SP1. We see this problem on Windows 7 SP 1 and Windows 8, specifically versions 6.1.7601 Service Pack 1, 6.2.9200 and 6.3.9600.

Looks like our AVX2 support check requires XSAVE in addition to OSXSAVE, and libjpeg-turbo does not:

https://searchfox.org/mozilla-central/rev/dac799c9f4e9f5f05c1071cba94f2522aa31f7eb/mozglue/build/SSE.cpp#169

Assignee: nobody → aosmond
Priority: -- → P3

Changing the priority to p1 as the bug is tracked by a release manager for the current beta.
See How Do You Triage for more information

Priority: P3 → P1
Blocks: 1515792

So, to confirm, this is not something that needs to be fixed upstream?

(In reply to DRC from comment #12)

So, to confirm, this is not something that needs to be fixed upstream?

I think it does, but I wanted to make sure we actually fixed the crash for Firefox users before filing an upstream request.

Why have none of my users observed the problem, then? Please help me understand how it can be reproduced using only libjpeg-turbo.

Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1e76cea31571
Fix AVX2 detection to ensure we have all required CPU parameters. r=jrmuizel

(In reply to DRC from comment #14)

Why have none of my users observed the problem, then? Please help me understand how it can be reproduced using only libjpeg-turbo.

If it is the same as bug #1515792 you can set CPUID max = Enabled in BIOS and view a page with a JPG after resuming from hibernate to crash the Tab. No crash logs are generated for me though so I'm not sure it is the same.

I'll repeat, with emphasis:

Please help me understand how the bug can be reproduced using only libjpeg-turbo, i.e. outside of the context of Firefox.

Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66

Please nominate this for release approval when you get a chance. Looks like we're not going to get any useful confirmation from the Nightly channel that the patch works.

Flags: needinfo?(aosmond)

FYI, no reports so far from 66.0b1 released to the Aurora channel on Tuesday.

Recusing myself from this bug, as no one has answered my question regarding how to repro it upstream. You know where to reach me.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)

Please nominate this for release approval when you get a chance. Looks like we're not going to get any useful confirmation from the Nightly channel that the patch works.

I can only comment on #1515792 and I don't know if this bug #1520760 is related. But the new patch didn't fix my crash for bug #1515792.

Yeah, this is still happening :(

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla66 → ---

This is happening on "GenuineIntel family 6 model 42 stepping 7" and "GenuineIntel family 6 model 58 stepping 9"

Those are the i7-2600K and Xeon E3-1265L V2 processors, respectively. Neither supports AVX2, so that makes sense at least.

https://ark.intel.com/products/52214/Intel-Core-i7-2600K-Processor-8M-Cache-up-to-3-80-GHz-
https://ark.intel.com/products/65728/Intel-Xeon-Processor-E3-1265L-v2-8M-Cache-2-50-GHz-

The code contains a note at the top that it's not multi-thread safe, and it is fail deadly in the sense that if the detection (init_simd) is not called before the routines run, they assume all CPU extensions are present. Might be worth reversing the logic here, and perhaps making that flags integer an atomic?

It has been racy for a long time. Looking at the telemetry environment, the Mozilla written detection appears to succeed as the reports are all tagged as hasAVX, but not hasAVX2. So there clearly is still a difference in how the detection works. For libwebp, I made it use our feature detection instead of relying upon the library's original implementation, maybe something similar should be done for libjpeg. I'm not sure if the crash volume justifies P1 however.

Flags: needinfo?(aosmond)

It has been racy for a long time.

Yes, and it started crashing when AVX2 use was added in the libjpeg-turbo update.

I see only one caller of this code:
https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#1137

So the previous version would work as long as the system had at least SSE2. We dropped support for non-SSE2 systems around Firefox 49 due to Rust. That's a pretty big window where you wouldn't have had this failure even if the detection didn't work correctly?

So if we're doing that AVX2 call on systems that we know have no AVX2 whatsoever, that means simd_support has too much bits set. It starts out as: https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#35

It seems unlikely at this point that the upstream and upstream+our patch detection code is still broken. Given that, making the initialization fail-safe and race-free seems like it's worth a shot? I have a patch, let me know what you think?

Actually nevermind. There can't really be a catastrophic race if everything calls init_simd before attempting to use AVX2.

Which begs the question:
https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#1137

Who guarantees that init_simd() will precede that in the calling sequence? Maybe we should just throw an init_simd() on top of that function? This isn't very obvious without being familiar with the code.

Okay, guess that's jsimd_can_fdct_islow which puts the function pointer in place. I'm out of ideas then.

Hey Nils, could you take a look at this?

Flags: needinfo?(drno)

Which begs the question:
https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#1137

Who guarantees that init_simd() will precede that in the calling sequence? Maybe we should just throw an init_simd() on top of that function? This isn't very obvious without being familiar with the code.

jsimd.c isn't a standalone API, so nothing other than the libjpeg API library code will be calling those functions, and in the libjpeg API library, a jsimd_*() function is never called prior to its associated jsimd_can_*() function.

Also note: at no point does the code say that it isn't "safe" in a multi-threaded environment. It simply says that it's "racy." The worst case is that multiple threads may call the guts of init_simd() simultaneously, because simd_support hasn't been set by one before the other reaches the check for that variable. However, there is no real harm to that AFAICT-- at least, it has never been a problem in the 10 years of libjpeg-turbo's existence.

(In reply to Jared Wein [:jaws] (Regression Engineering Owner for 65) (please needinfo? me) from comment #31)

Hey Nils, could you take a look at this?

As much as I would love to help, but I have now idea about JPEG or assembler, and I think neither in my team does.

Let me know if I can be of any help with this.

Flags: needinfo?(drno)

According to the crash reports, our (Mozilla) feature detection seems to be working correctly. As an experiment, I've put together a patch grafting a replacement for jpeg_simd_cpu_support which uses that instead.

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f62d41a7af8eca40f74a16e071a128dda859675e

Terrox, could I ask for your help once more to confirm whether or not this build works for you?

https://queue.taskcluster.net/v1/task/bmByEpPgSae3rJ9610l7Zw/runs/0/artifacts/public/build/target.zip

Flags: needinfo?(disposable08)

@aosmond Looks good. No crash with CPUID max = Enabled in BIOS after resuming from Sleep. Firefox 65.01 still crashes tab after resume from Sleep.

Flags: needinfo?(disposable08)

Reading more up on that CPUID max BIOS option, I see that it will typically limit the supported value in the eax register to 0-3 or 0-4. We set it to 7 when checking for AVX2 support. We do check the maximum supported level in the Mozilla code [1] and I don't see an equivalent check in the jsimdcpu.asm code. It will typically be disabled, so that could explain why only a small group of users see the problem...

[1] https://searchfox.org/mozilla-central/rev/b36e97fc776635655e84f2048ff59f38fa8a4626/mozglue/build/SSE.cpp#33-34

Reviewing https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2a-manual.html, page 292-293:

If a value entered for CPUID.EAX is higher than the maximum input value for basic or extended function for that processor then the data for the highest basic information leaf is returned. For example, using the Intel Core i7 processor, the following is true:
CPUID.EAX = 05H (* Returns MONITOR/MWAIT leaf. )
CPUID.EAX = 0AH (
Returns Architectural Performance Monitoring leaf. )
CPUID.EAX = 0BH (
Returns Extended Topology Enumeration leaf. )
CPUID.EAX = 0CH (
INVALID: Returns the same information as CPUID.EAX = 0BH. )
CPUID.EAX = 80000008H (
Returns linear/physical address size data. )
CPUID.EAX = 8000000AH (
INVALID: Returns same information as CPUID.EAX = 0BH. *)

So we can expect a maximum of 3 or 4 to yield:

03H EBX:
Reserved

04H EBX:
Bits 11 - 00: L = System Coherency Line Size**.
Bits 21 - 12: P = Physical Line partitions**.
Bits 31 - 22: W = Ways of associativity**

Which is useless :). Supposedly we can do version detection via:

0H EAX:
Maximum Input Value for Basic CPUID Information.

Based on comment 39, this might be the complete fix:

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=663d61afaa3ba7d328536f065e1cd4e6e2a0cadb

I'll see if I can confirm myself if my Ivy/Sandy Bridge machines have the necessary BIOS option.

Whoops, I think I got the 64-bit patch slightly wrong (it will still work, but might disable SSE/SSE2 if you don't have a high enough CPUID max support): https://treeherder.mozilla.org/#/jobs?repo=try&revision=5669703a4ab132cecf2fc560a938a73e82377e6b

(In reply to Andrew Osmond [:aosmond] from comment #40)

Based on comment 39, this might be the complete fix:

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=663d61afaa3ba7d328536f065e1cd4e6e2a0cadb

I'll see if I can confirm myself if my Ivy/Sandy Bridge machines have the necessary BIOS option.

While I have the option on the Ivy Bridge machine, sadly it doesn't seem to work as advertised. The CPUID remained at 0xD which is not sufficient to trigger the issue outlined in comment 39. Possibly because this only seems to affect particular versions of Windows. Terrox, would you mind testing it for me?

https://queue.taskcluster.net/v1/task/ZqZyxOnqQCSVFmu2hELDHg/runs/0/artifacts/public/build/target.zip

This should be the last build since I'm 99% confident this is the problem :).

Flags: needinfo?(disposable08)

(In reply to Andrew Osmond [:aosmond] from comment #43)

(In reply to Andrew Osmond [:aosmond] from comment #40)

Based on comment 39, this might be the complete fix:

try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=663d61afaa3ba7d328536f065e1cd4e6e2a0cadb

I'll see if I can confirm myself if my Ivy/Sandy Bridge machines have the necessary BIOS option.

While I have the option on the Ivy Bridge machine, sadly it doesn't seem to work as advertised. The CPUID remained at 0xD which is not sufficient to trigger the issue outlined in comment 39. Possibly because this only seems to affect particular versions of Windows. Terrox, would you mind testing it for me?

https://queue.taskcluster.net/v1/task/ZqZyxOnqQCSVFmu2hELDHg/runs/0/artifacts/public/build/target.zip

This should be the last build since I'm 99% confident this is the problem :).

Also, just for the sake of curiosity, do you see this crash in Chrome? I would have expected so given they also use libjpeg-turbo.

Pushed by aosmond@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/afbcec667ac4
Fix AVX2 detection to ensure we support the required CPUID version. r=jrmuizel
Status: REOPENED → RESOLVED
Closed: 11 months ago10 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

Can this stay in 67 till release or do you think it is safe for uplift to 66 beta? Keeping in mind we only have 2 weeks left before the 66 release.

Flags: needinfo?(aosmond)

(In reply to Andrew Osmond [:aosmond] from comment #43)

Terrox, would you mind testing it for me?
https://queue.taskcluster.net/v1/task/ZqZyxOnqQCSVFmu2hELDHg/runs/0/artifacts/public/build/target.zip

No crash with CPUID enabled in BIOS on my Windows 7 64 bit on the above test build. Still crashes in live Firefox 65.0.2 with/without addons enabled.

No crash in Chrome.

Flags: needinfo?(disposable08)

(In reply to Terrox from comment #48)

(In reply to Andrew Osmond [:aosmond] from comment #43)

Terrox, would you mind testing it for me?
https://queue.taskcluster.net/v1/task/ZqZyxOnqQCSVFmu2hELDHg/runs/0/artifacts/public/build/target.zip

No crash with CPUID enabled in BIOS on my Windows 7 64 bit on the above test build. Still crashes in live Firefox 65.0.2 with/without addons enabled.

Rejoice! Thank you very much for your invaluable help in testing this -- I don't think we would have gotten to the bottom of this as soon as we did without you :).

No crash in Chrome.

That is puzzling.

Flags: needinfo?(aosmond)

Comment on attachment 9045959 [details]
Bug 1520760 - Fix AVX2 detection to ensure we support the required CPUID version.

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: Bug 1425835
  • User impact if declined: May experience crash when viewing JPEG files on older computers and/or ones using a BIOS option to limit the CPUID (most likely users who migrated from Windows XP to a later version of Windows without buying new hardware), including at startup.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Small/isolated change in an area that is now well understood and received independent verification from a user experiencing the crash. Automated testing exercises this code, although not the failure case experienced in the field as that is quite limited to certain hardware configurations.
  • String changes made/needed: N/A
Attachment #9045959 - Flags: approval-mozilla-beta?

Comment on attachment 9045959 [details]
Bug 1520760 - Fix AVX2 detection to ensure we support the required CPUID version.

Crash fix, verified in nightly. Let's take this for beta 13.

Attachment #9045959 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Duplicate of this bug: 1515792

(In reply to Andrew Osmond [:aosmond] from comment #49)

That is puzzling.

Last I knew, Chrome was still on 1.6 or something, which would predate these optimizations.

@DRC: Looks like we were able to narrow down the issue in comments 38 and 39. Required specific BIOS settings on specific CPUs on specific OSes :(. That said, we haven't seen any crash reports on the Beta channel since this fix shipped there and we were able to get verification from one person who could consistently reproduce the issue in this bug that the fix works. Is that enough for you to consider upstreaming this fix or is there more you'd want to see still?

Flags: needinfo?(dcommander)

Last I knew, Chrome was still on 1.6 or something, which would predate these optimizations.

If you're referring to libjpeg-turbo, there isn't a 1.6. It went from 1.5.3 to 2.0.0. 2.0.0 introduced the AVX2 code.

@DRC: Looks like we were able to narrow down the issue in comments 38 and 39. Required specific BIOS settings on specific CPUs on specific OSes :(. That said, we haven't seen any crash reports on the Beta channel since this fix shipped there and we were able to get verification from one person who could consistently reproduce the issue in this bug that the fix works. Is that enough for you to consider upstreaming this fix or is there more you'd want to see still?

The fix looks straightforward and innocuous, and it is consistent with the documentation for the CPUID instruction. However, since I have no way to reproduce the issue (my only AVX2-equipped machine is a Mac), I need someone to help me fully describe it.

Is the issue limited only to certain CPUs?
Is the issue limited only to certain O/S's or O/S revisions?
Has anyone reproduced it reliably enough to know whether it affects libjpeg-turbo outside of Firefox?
Why isn't Chrome affected?

If there aren't answers to those questions, then upstreaming it is still a possibility, but it would have to go in as an undocumented fix. I can't document the fix unless there is a procedure for reproducing it using only libjpeg-turbo.

Flags: needinfo?(dcommander)

(In reply to DRC from comment #55)

Last I knew, Chrome was still on 1.6 or something, which would predate these optimizations.

If you're referring to libjpeg-turbo, there isn't a 1.6. It went from 1.5.3
to 2.0.0. 2.0.0 introduced the AVX2 code.

Sorry, was going off memory on the version number there. I just recalled it predating 2.0 and the AVX2 code as you noted :). Looking upstream, it looks like they updated from 1.4.90 to 2.0.1 a few days ago, actually.
https://chromium.googlesource.com/chromium/deps/libjpeg_turbo/+/cca8c4dec783a048da6933c86028556622d7c355

Two new reports, our detection says we don't have AVX2, libjpeg does though:

https://crash-stats.mozilla.com/report/index/fd21e3df-4458-4e71-a84e-5ca7e0190320
https://crash-stats.mozilla.com/report/index/f132bca6-dc9c-4702-8dc0-587170190320

As for the hardware, the former should actually have AVX2 support, but the latter definitely does not.

Can someone please answer my questions above and thus help me reproduce and fix this upstream?

(In reply to Ryan VanderMeulen [:RyanVM] from comment #56)

Looking upstream, it looks like they updated from 1.4.90 to 2.0.1 a few days ago, actually.
https://chromium.googlesource.com/chromium/deps/libjpeg_turbo/+/cca8c4dec783a048da6933c86028556622d7c355

That's just Chromium's copy of libjpeg-turbo. We still need to land a DEPS roll for Chromium to actually be using it (https://chromium-review.googlesource.com/c/chromium/src/+/1496758).

Hello everyone,
I believe you are using libjpeg-turbo v2.0.0, which exhibited this problem.
I suspect v2.0.1 fixed it with this patch: https://github.com/libjpeg-turbo/libjpeg-turbo/commit/d5f281b734425fc1d930ff2c3f8441aad731343e

The local patch here checks the max CPUID function supported before making the call to function 7. Both of these corrections should be applied. But only one is needed to stop the bad behavior, I believe.

A similar CPUID maxval check was just added upstream:
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/aa9db616774e24af7ab2fbcddd5711057b8a901e
It should show up in v2.0.3

(In reply to Chris Blume from comment #60)

Hello everyone,
I believe you are using libjpeg-turbo v2.0.0, which exhibited this problem.
I suspect v2.0.1 fixed it with this patch: https://github.com/libjpeg-turbo/libjpeg-turbo/commit/d5f281b734425fc1d930ff2c3f8441aad731343e

We included that patch when we landed the 2.0.0 update.
https://bugzilla.mozilla.org/show_bug.cgi?id=1425835#c8

Glad to hear that 2.0.3 should resolve it. I'm thinking we could try cherry-picking that upstream commit into the upcoming Fx67 release too to see if it helps at all.

The patch that just landed upstream (https://github.com/libjpeg-turbo/libjpeg-turbo/commit/aa9db616774e24af7ab2fbcddd5711057b8a901e) brings the upstream CPU detection code in line with Mozilla's, with one exception: the upstream code seems to be checking only OSXSAVE, not XSAVE. Is checking XSAVE necessary? That seems like a technicality to me, since (to the best of my understanding) OSXSAVE won't work without XSAVE.

You need to log in before you can comment on or make changes to this bug.