crash on jsimd_idct_islow_avx2
Categories
(Core :: Graphics: ImageLib, defect, P1)
Tracking
()
People
(Reporter: alexander, Assigned: aosmond)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files)
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
lizzard
:
approval-mozilla-beta+
|
Details | Review |
User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0
Steps to reproduce:
Please, take a look at https://crash-stats.mozilla.com/signature/?signature=jsimd_idct_islow_avx2&date=%3E%3D2019-01-10T13%3A57%3A52.000Z&date=%3C2019-01-17T13%3A57%3A52.000Z&_sort=-date
A lot of crashes on Windows on Windows platform, for FF 64 and 65
Comment 1•6 years ago
|
||
Trace:
This bug is for crash report bp-d843d57e-1bb1-493f-9dcb-ac3e00190117.
Top 10 frames of crashing thread:
0 xul.dll jsimd_idct_islow_avx2
1 @0x1010100f
2 xul.dll jsimd_idct_islow media/libjpeg/simd/i386/jsimd.c:1137
3 xul.dll static int decompress_onepass media/libjpeg/jdcoefct.c:142
4 xul.dll static void process_data_context_main media/libjpeg/jdmainct.c:338
5 xul.dll void mozilla::image::nsJPEGDecoder::OutputScanlines image/decoders/nsJPEGDecoder.cpp:670
6 xul.dll class mozilla::image::LexerTransition<mozilla::image::nsJPEGDecoder::State> mozilla::image::nsJPEGDecoder::ReadJPEGData image/decoders/nsJPEGDecoder.cpp:466
7 xul.dll static class mozilla::image::LexerTransition<mozilla::image::nsJPEGDecoder::State> mozilla::image::nsJPEGDecoder::DoDecode::<unnamed-tag>::operator image/decoders/nsJPEGDecoder.cpp:213
8 xul.dll static class mozilla::Maybe<mozilla::Variant<mozilla::image::TerminalState, mozilla::image::Yield> > mozilla::image::StreamingLexer<mozilla::image::nsJPEGDecoder::State, 16>::ContinueUnbufferedRead<`lambda at z:/build/build/src/image/decoders/nsJPEGDecoder.cpp:210:21'> image/StreamingLexer.h:598
9 xul.dll static class mozilla::Maybe<mozilla::Variant<mozilla::image::TerminalState, mozilla::image::Yield> > mozilla::image::StreamingLexer<mozilla::image::nsJPEGDecoder::State, 16>::UnbufferedRead<`lambda at z:/build/build/src/image/decoders/nsJPEGDecoder.cpp:210:21'> image/StreamingLexer.h:543
Comment 2•6 years ago
|
||
New regression with 64 it seems
Comment 3•6 years ago
|
||
Moving to a better component based on the source which points back to Bug 1425835, which landed in the 64 timeframe. ni on RyanVM - do we think this could have caused this regression?
Comment 4•6 years ago
|
||
That was the update which landed the crashing function in question, so yeah, makes sense.
Assignee | ||
Comment 5•6 years ago
|
||
This appears related to https://github.com/libjpeg-turbo/libjpeg-turbo/issues/288 although it appears we have included the fix for this in https://github.com/libjpeg-turbo/libjpeg-turbo/commit/d5f281b734425fc1d930ff2c3f8441aad731343e#diff-37ff9a9aa9b533b05baf4137c47092a6. I note this only happens on x86 so it is possible the detection is still flawed for that architecture.
Assignee | ||
Comment 6•6 years ago
|
||
The libjpeg-turbo github issue was predominantly for Windows 7 pre-SP1. We see this problem on Windows 7 SP 1 and Windows 8, specifically versions 6.1.7601 Service Pack 1, 6.2.9200 and 6.3.9600.
Assignee | ||
Comment 7•6 years ago
|
||
Looks like our AVX2 support check requires XSAVE in addition to OSXSAVE, and libjpeg-turbo does not:
Assignee | ||
Comment 8•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 9•6 years ago
|
||
Assignee | ||
Comment 10•6 years ago
|
||
Updated•6 years ago
|
Comment 11•6 years ago
|
||
Changing the priority to p1 as the bug is tracked by a release manager for the current beta.
See How Do You Triage for more information
Comment 12•6 years ago
|
||
So, to confirm, this is not something that needs to be fixed upstream?
Assignee | ||
Comment 13•6 years ago
|
||
(In reply to DRC from comment #12)
So, to confirm, this is not something that needs to be fixed upstream?
I think it does, but I wanted to make sure we actually fixed the crash for Firefox users before filing an upstream request.
Comment 14•6 years ago
|
||
Why have none of my users observed the problem, then? Please help me understand how it can be reproduced using only libjpeg-turbo.
Comment 15•6 years ago
|
||
Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1e76cea31571 Fix AVX2 detection to ensure we have all required CPU parameters. r=jrmuizel
Comment 16•6 years ago
|
||
(In reply to DRC from comment #14)
Why have none of my users observed the problem, then? Please help me understand how it can be reproduced using only libjpeg-turbo.
If it is the same as bug #1515792 you can set CPUID max = Enabled in BIOS and view a page with a JPG after resuming from hibernate to crash the Tab. No crash logs are generated for me though so I'm not sure it is the same.
Comment 17•6 years ago
|
||
I'll repeat, with emphasis:
Please help me understand how the bug can be reproduced using only libjpeg-turbo, i.e. outside of the context of Firefox.
Comment 18•6 years ago
|
||
bugherder |
Comment 19•6 years ago
|
||
Please nominate this for release approval when you get a chance. Looks like we're not going to get any useful confirmation from the Nightly channel that the patch works.
Comment 20•6 years ago
|
||
FYI, no reports so far from 66.0b1 released to the Aurora channel on Tuesday.
Comment 21•6 years ago
|
||
Recusing myself from this bug, as no one has answered my question regarding how to repro it upstream. You know where to reach me.
Comment 22•6 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)
Please nominate this for release approval when you get a chance. Looks like we're not going to get any useful confirmation from the Nightly channel that the patch works.
I can only comment on #1515792 and I don't know if this bug #1520760 is related. But the new patch didn't fix my crash for bug #1515792.
Comment 23•6 years ago
|
||
Yeah, this is still happening :(
Comment 24•6 years ago
|
||
This is happening on "GenuineIntel family 6 model 42 stepping 7" and "GenuineIntel family 6 model 58 stepping 9"
Comment 25•6 years ago
|
||
Those are the i7-2600K and Xeon E3-1265L V2 processors, respectively. Neither supports AVX2, so that makes sense at least.
https://ark.intel.com/products/52214/Intel-Core-i7-2600K-Processor-8M-Cache-up-to-3-80-GHz-
https://ark.intel.com/products/65728/Intel-Xeon-Processor-E3-1265L-v2-8M-Cache-2-50-GHz-
Comment 26•6 years ago
|
||
The code contains a note at the top that it's not multi-thread safe, and it is fail deadly in the sense that if the detection (init_simd) is not called before the routines run, they assume all CPU extensions are present. Might be worth reversing the logic here, and perhaps making that flags integer an atomic?
Assignee | ||
Comment 27•6 years ago
|
||
It has been racy for a long time. Looking at the telemetry environment, the Mozilla written detection appears to succeed as the reports are all tagged as hasAVX, but not hasAVX2. So there clearly is still a difference in how the detection works. For libwebp, I made it use our feature detection instead of relying upon the library's original implementation, maybe something similar should be done for libjpeg. I'm not sure if the crash volume justifies P1 however.
Comment 28•6 years ago
|
||
It has been racy for a long time.
Yes, and it started crashing when AVX2 use was added in the libjpeg-turbo update.
I see only one caller of this code:
https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#1137
So the previous version would work as long as the system had at least SSE2. We dropped support for non-SSE2 systems around Firefox 49 due to Rust. That's a pretty big window where you wouldn't have had this failure even if the detection didn't work correctly?
So if we're doing that AVX2 call on systems that we know have no AVX2 whatsoever, that means simd_support has too much bits set. It starts out as: https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#35
It seems unlikely at this point that the upstream and upstream+our patch detection code is still broken. Given that, making the initialization fail-safe and race-free seems like it's worth a shot? I have a patch, let me know what you think?
Comment 29•6 years ago
|
||
Actually nevermind. There can't really be a catastrophic race if everything calls init_simd before attempting to use AVX2.
Which begs the question:
https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#1137
Who guarantees that init_simd() will precede that in the calling sequence? Maybe we should just throw an init_simd() on top of that function? This isn't very obvious without being familiar with the code.
Comment 30•6 years ago
|
||
Okay, guess that's jsimd_can_fdct_islow which puts the function pointer in place. I'm out of ideas then.
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 32•5 years ago
•
|
||
Which begs the question:
https://searchfox.org/mozilla-central/source/media/libjpeg/simd/i386/jsimd.c#1137Who guarantees that init_simd() will precede that in the calling sequence? Maybe we should just throw an init_simd() on top of that function? This isn't very obvious without being familiar with the code.
jsimd.c isn't a standalone API, so nothing other than the libjpeg API library code will be calling those functions, and in the libjpeg API library, a jsimd_*() function is never called prior to its associated jsimd_can_*() function.
Comment 33•5 years ago
|
||
Also note: at no point does the code say that it isn't "safe" in a multi-threaded environment. It simply says that it's "racy." The worst case is that multiple threads may call the guts of init_simd() simultaneously, because simd_support hasn't been set by one before the other reaches the check for that variable. However, there is no real harm to that AFAICT-- at least, it has never been a problem in the 10 years of libjpeg-turbo's existence.
Comment 34•5 years ago
|
||
(In reply to Jared Wein [:jaws] (Regression Engineering Owner for 65) (please needinfo? me) from comment #31)
Hey Nils, could you take a look at this?
As much as I would love to help, but I have now idea about JPEG or assembler, and I think neither in my team does.
Let me know if I can be of any help with this.
Assignee | ||
Comment 35•5 years ago
|
||
According to the crash reports, our (Mozilla) feature detection seems to be working correctly. As an experiment, I've put together a patch grafting a replacement for jpeg_simd_cpu_support which uses that instead.
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f62d41a7af8eca40f74a16e071a128dda859675e
Assignee | ||
Comment 36•5 years ago
|
||
Terrox, could I ask for your help once more to confirm whether or not this build works for you?
Comment 37•5 years ago
|
||
@aosmond Looks good. No crash with CPUID max = Enabled in BIOS after resuming from Sleep. Firefox 65.01 still crashes tab after resume from Sleep.
Assignee | ||
Comment 38•5 years ago
|
||
Reading more up on that CPUID max BIOS option, I see that it will typically limit the supported value in the eax register to 0-3 or 0-4. We set it to 7 when checking for AVX2 support. We do check the maximum supported level in the Mozilla code [1] and I don't see an equivalent check in the jsimdcpu.asm code. It will typically be disabled, so that could explain why only a small group of users see the problem...
Assignee | ||
Comment 39•5 years ago
|
||
Reviewing https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2a-manual.html, page 292-293:
If a value entered for CPUID.EAX is higher than the maximum input value for basic or extended function for that processor then the data for the highest basic information leaf is returned. For example, using the Intel Core i7 processor, the following is true:
CPUID.EAX = 05H (* Returns MONITOR/MWAIT leaf. )
CPUID.EAX = 0AH ( Returns Architectural Performance Monitoring leaf. )
CPUID.EAX = 0BH ( Returns Extended Topology Enumeration leaf. )
CPUID.EAX = 0CH ( INVALID: Returns the same information as CPUID.EAX = 0BH. )
CPUID.EAX = 80000008H ( Returns linear/physical address size data. )
CPUID.EAX = 8000000AH ( INVALID: Returns same information as CPUID.EAX = 0BH. *)
So we can expect a maximum of 3 or 4 to yield:
03H EBX:
Reserved
04H EBX:
Bits 11 - 00: L = System Coherency Line Size**.
Bits 21 - 12: P = Physical Line partitions**.
Bits 31 - 22: W = Ways of associativity**
Which is useless :). Supposedly we can do version detection via:
0H EAX:
Maximum Input Value for Basic CPUID Information.
Assignee | ||
Comment 40•5 years ago
|
||
Based on comment 39, this might be the complete fix:
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=663d61afaa3ba7d328536f065e1cd4e6e2a0cadb
I'll see if I can confirm myself if my Ivy/Sandy Bridge machines have the necessary BIOS option.
Assignee | ||
Comment 41•5 years ago
|
||
Whoops, I think I got the 64-bit patch slightly wrong (it will still work, but might disable SSE/SSE2 if you don't have a high enough CPUID max support): https://treeherder.mozilla.org/#/jobs?repo=try&revision=5669703a4ab132cecf2fc560a938a73e82377e6b
Assignee | ||
Comment 42•5 years ago
|
||
Assignee | ||
Comment 43•5 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #40)
Based on comment 39, this might be the complete fix:
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=663d61afaa3ba7d328536f065e1cd4e6e2a0cadb
I'll see if I can confirm myself if my Ivy/Sandy Bridge machines have the necessary BIOS option.
While I have the option on the Ivy Bridge machine, sadly it doesn't seem to work as advertised. The CPUID remained at 0xD which is not sufficient to trigger the issue outlined in comment 39. Possibly because this only seems to affect particular versions of Windows. Terrox, would you mind testing it for me?
This should be the last build since I'm 99% confident this is the problem :).
Assignee | ||
Comment 44•5 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #43)
(In reply to Andrew Osmond [:aosmond] from comment #40)
Based on comment 39, this might be the complete fix:
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=663d61afaa3ba7d328536f065e1cd4e6e2a0cadb
I'll see if I can confirm myself if my Ivy/Sandy Bridge machines have the necessary BIOS option.
While I have the option on the Ivy Bridge machine, sadly it doesn't seem to work as advertised. The CPUID remained at 0xD which is not sufficient to trigger the issue outlined in comment 39. Possibly because this only seems to affect particular versions of Windows. Terrox, would you mind testing it for me?
This should be the last build since I'm 99% confident this is the problem :).
Also, just for the sake of curiosity, do you see this crash in Chrome? I would have expected so given they also use libjpeg-turbo.
Comment 45•5 years ago
|
||
Pushed by aosmond@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/afbcec667ac4 Fix AVX2 detection to ensure we support the required CPUID version. r=jrmuizel
Comment 46•5 years ago
|
||
bugherder |
Comment 47•5 years ago
|
||
Can this stay in 67 till release or do you think it is safe for uplift to 66 beta? Keeping in mind we only have 2 weeks left before the 66 release.
Comment 48•5 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #43)
Terrox, would you mind testing it for me?
https://queue.taskcluster.net/v1/task/ZqZyxOnqQCSVFmu2hELDHg/runs/0/artifacts/public/build/target.zip
No crash with CPUID enabled in BIOS on my Windows 7 64 bit on the above test build. Still crashes in live Firefox 65.0.2 with/without addons enabled.
No crash in Chrome.
Assignee | ||
Comment 49•5 years ago
|
||
(In reply to Terrox from comment #48)
(In reply to Andrew Osmond [:aosmond] from comment #43)
Terrox, would you mind testing it for me?
https://queue.taskcluster.net/v1/task/ZqZyxOnqQCSVFmu2hELDHg/runs/0/artifacts/public/build/target.zipNo crash with CPUID enabled in BIOS on my Windows 7 64 bit on the above test build. Still crashes in live Firefox 65.0.2 with/without addons enabled.
Rejoice! Thank you very much for your invaluable help in testing this -- I don't think we would have gotten to the bottom of this as soon as we did without you :).
No crash in Chrome.
That is puzzling.
Assignee | ||
Comment 50•5 years ago
|
||
Comment on attachment 9045959 [details]
Bug 1520760 - Fix AVX2 detection to ensure we support the required CPUID version.
Beta/Release Uplift Approval Request
- Feature/Bug causing the regression: Bug 1425835
- User impact if declined: May experience crash when viewing JPEG files on older computers and/or ones using a BIOS option to limit the CPUID (most likely users who migrated from Windows XP to a later version of Windows without buying new hardware), including at startup.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Small/isolated change in an area that is now well understood and received independent verification from a user experiencing the crash. Automated testing exercises this code, although not the failure case experienced in the field as that is quite limited to certain hardware configurations.
- String changes made/needed: N/A
Comment 51•5 years ago
|
||
Comment on attachment 9045959 [details]
Bug 1520760 - Fix AVX2 detection to ensure we support the required CPUID version.
Crash fix, verified in nightly. Let's take this for beta 13.
Comment 52•5 years ago
|
||
bugherder uplift |
Comment 54•5 years ago
|
||
(In reply to Andrew Osmond [:aosmond] from comment #49)
That is puzzling.
Last I knew, Chrome was still on 1.6 or something, which would predate these optimizations.
@DRC: Looks like we were able to narrow down the issue in comments 38 and 39. Required specific BIOS settings on specific CPUs on specific OSes :(. That said, we haven't seen any crash reports on the Beta channel since this fix shipped there and we were able to get verification from one person who could consistently reproduce the issue in this bug that the fix works. Is that enough for you to consider upstreaming this fix or is there more you'd want to see still?
Comment 55•5 years ago
|
||
Last I knew, Chrome was still on 1.6 or something, which would predate these optimizations.
If you're referring to libjpeg-turbo, there isn't a 1.6. It went from 1.5.3 to 2.0.0. 2.0.0 introduced the AVX2 code.
@DRC: Looks like we were able to narrow down the issue in comments 38 and 39. Required specific BIOS settings on specific CPUs on specific OSes :(. That said, we haven't seen any crash reports on the Beta channel since this fix shipped there and we were able to get verification from one person who could consistently reproduce the issue in this bug that the fix works. Is that enough for you to consider upstreaming this fix or is there more you'd want to see still?
The fix looks straightforward and innocuous, and it is consistent with the documentation for the CPUID instruction. However, since I have no way to reproduce the issue (my only AVX2-equipped machine is a Mac), I need someone to help me fully describe it.
Is the issue limited only to certain CPUs?
Is the issue limited only to certain O/S's or O/S revisions?
Has anyone reproduced it reliably enough to know whether it affects libjpeg-turbo outside of Firefox?
Why isn't Chrome affected?
If there aren't answers to those questions, then upstreaming it is still a possibility, but it would have to go in as an undocumented fix. I can't document the fix unless there is a procedure for reproducing it using only libjpeg-turbo.
Comment 56•5 years ago
|
||
(In reply to DRC from comment #55)
Last I knew, Chrome was still on 1.6 or something, which would predate these optimizations.
If you're referring to libjpeg-turbo, there isn't a 1.6. It went from 1.5.3
to 2.0.0. 2.0.0 introduced the AVX2 code.
Sorry, was going off memory on the version number there. I just recalled it predating 2.0 and the AVX2 code as you noted :). Looking upstream, it looks like they updated from 1.4.90 to 2.0.1 a few days ago, actually.
https://chromium.googlesource.com/chromium/deps/libjpeg_turbo/+/cca8c4dec783a048da6933c86028556622d7c355
Assignee | ||
Comment 57•5 years ago
|
||
Two new reports, our detection says we don't have AVX2, libjpeg does though:
https://crash-stats.mozilla.com/report/index/fd21e3df-4458-4e71-a84e-5ca7e0190320
https://crash-stats.mozilla.com/report/index/f132bca6-dc9c-4702-8dc0-587170190320
As for the hardware, the former should actually have AVX2 support, but the latter definitely does not.
Comment 58•5 years ago
|
||
Can someone please answer my questions above and thus help me reproduce and fix this upstream?
Comment 59•5 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #56)
Looking upstream, it looks like they updated from 1.4.90 to 2.0.1 a few days ago, actually.
https://chromium.googlesource.com/chromium/deps/libjpeg_turbo/+/cca8c4dec783a048da6933c86028556622d7c355
That's just Chromium's copy of libjpeg-turbo. We still need to land a DEPS roll for Chromium to actually be using it (https://chromium-review.googlesource.com/c/chromium/src/+/1496758).
Comment 60•5 years ago
|
||
Hello everyone,
I believe you are using libjpeg-turbo v2.0.0, which exhibited this problem.
I suspect v2.0.1 fixed it with this patch: https://github.com/libjpeg-turbo/libjpeg-turbo/commit/d5f281b734425fc1d930ff2c3f8441aad731343e
The local patch here checks the max CPUID function supported before making the call to function 7. Both of these corrections should be applied. But only one is needed to stop the bad behavior, I believe.
A similar CPUID maxval check was just added upstream:
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/aa9db616774e24af7ab2fbcddd5711057b8a901e
It should show up in v2.0.3
Comment 61•5 years ago
•
|
||
(In reply to Chris Blume from comment #60)
Hello everyone,
I believe you are using libjpeg-turbo v2.0.0, which exhibited this problem.
I suspect v2.0.1 fixed it with this patch: https://github.com/libjpeg-turbo/libjpeg-turbo/commit/d5f281b734425fc1d930ff2c3f8441aad731343e
We included that patch when we landed the 2.0.0 update.
https://bugzilla.mozilla.org/show_bug.cgi?id=1425835#c8
Glad to hear that 2.0.3 should resolve it. I'm thinking we could try cherry-picking that upstream commit into the upcoming Fx67 release too to see if it helps at all.
Comment 62•5 years ago
|
||
The patch that just landed upstream (https://github.com/libjpeg-turbo/libjpeg-turbo/commit/aa9db616774e24af7ab2fbcddd5711057b8a901e) brings the upstream CPU detection code in line with Mozilla's, with one exception: the upstream code seems to be checking only OSXSAVE, not XSAVE. Is checking XSAVE necessary? That seems like a technicality to me, since (to the best of my understanding) OSXSAVE won't work without XSAVE.
Description
•