crash in qcms_transform_data_rgb_out_lut_sse1

RESOLVED FIXED in Firefox 39

Status

()

--
critical
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: lizzard, Assigned: seth)

Tracking

(Blocks: 1 bug, {crash, topcrash})

40 Branch
mozilla41
x86
Windows NT
crash, topcrash
Points:
---

Firefox Tracking Flags

(firefox39+ fixed, firefox40+ fixed, firefox41 fixed)

Details

(crash signature)

Attachments

(1 attachment, 4 obsolete attachments)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-3d197472-d5a5-46dd-9ec7-3d5db2150511.
=============================================================
This is the #7 topcrash for 40 in the last 3 days. It's showing up in the explosiveness report for 40. Most of the crashes are on the 2015050803 build. This is often a startup crash.  

Crashing thread:

0 	xul.dll 	qcms_transform_data_rgb_out_lut_sse1 	gfx/qcms/transform-sse1.c
1 	xul.dll 	mozilla::image::nsJPEGDecoder::OutputScanlines(bool*) 	image/decoders/nsJPEGDecoder.cpp
2 	xul.dll 	mozilla::image::nsJPEGDecoder::WriteInternal(char const*, unsigned int) 	image/decoders/nsJPEGDecoder.cpp
3 	xul.dll 	mozilla::image::Decoder::Write(char const*, unsigned int) 	image/src/Decoder.cpp
4 	xul.dll 	mozilla::image::Decoder::Decode() 	image/src/Decoder.cpp
5 	xul.dll 	mozilla::image::DecodePool::Decode(mozilla::image::Decoder*) 	image/src/DecodePool.cpp
6 	xul.dll 	mozilla::image::DecodeWorker::Run() 	image/src/DecodePool.cpp
7 	xul.dll 	nsThreadPool::Run() 	xpcom/threads/nsThreadPool.cpp
8 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
9 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp
10 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp
11 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc
12 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc
13 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp
14 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
15 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
16 	msvcr120.dll 	_callthreadstartex 	f:\dd\vctools\crt\crtw32\startup\threadex.c:376
17 	msvcr120.dll 	msvcr120.dll@0x2c000 	
18 	kernel32.dll 	BaseThreadStart
Together these signatures spiked in build 20150508030204, and then subsided to lower volume. At this point I can't say whether it's because the crash drove away most of the affected users, or whether the spike was just a bad luck data point with a few very persistent reporters. We might have a better picture in a few days.

If the spike is real, the regression range would be: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=617dbce26726&tochange=86203ac87a08

The crash is only on Windows XP, and only on AMD family 6 processors (note, this is not "the AMD bug"). Family 6 is pretty ancient, like Athlon era. Normally my first instinct would be that these don't support SSE -- but -- we're not crashing on an SSE instruction. We're crashing later on trying to read "otdata_r[output[0]]". One of the components that pointer math (so either otdata_r, or output[0], not sure which) is 0x80000000.

Image experts, any idea?
Crash Signature: [@ qcms_transform_data_rgb_out_lut_sse1] → [@ qcms_transform_data_rgb_out_lut_sse1] [@ qcms_transform_data_rgba_out_lut_sse1]
Flags: needinfo?(tnikkel)
Flags: needinfo?(seth)
(Assignee)

Comment 2

3 years ago
The only image-related change of any substance I see in that regression range is bug 1124084, which flipped on downscale-during-decode. That *does* mean that we'd be taking a different code path in nsJPEGDecoder::OutputScanlines. It looks to me like this crash is happening before we ever get in the vicinity of the code that does the downscaling, though. My best guess is that either:

(1) The buffer allocated in the downscaler (where we're writing the image data into on the new code path) somehow displeases the qcms code. Possibly it's not aligned correctly?

(2) The compiler is generating some bad code here that trips up those processors.

(In reply to David Major [:dmajor] from comment #1)
> we're not crashing on an SSE instruction. We're crashing later on trying to
> read "otdata_r[output[0]]". One of the components that pointer math (so
> either otdata_r, or output[0], not sure which) is 0x80000000.

I don't have any explanation for this part. The data that is being manipulated here shouldn't have changed at all, so I'd assume that something meta like the alignment is somehow triggering this, but I don't really know...
Flags: needinfo?(seth)
(In reply to David Major [:dmajor] from comment #1)
> The crash is only on Windows XP, and only on AMD family 6 processors (note,
> this is not "the AMD bug"). Family 6 is pretty ancient, like Athlon era.
> Normally my first instinct would be that these don't support SSE -- but --
> we're not crashing on an SSE instruction. We're crashing later on trying to
> read "otdata_r[output[0]]". One of the components that pointer math (so
> either otdata_r, or output[0], not sure which) is 0x80000000.

The processors don't support sse and we're crashing in a file and function with sse in the name. Isn't it more likely that the crashing instruction is wrong somehow? Even if it is correct and we fix this crash aren't we then going to crash when actually trying to execute the sse code?
(In reply to Timothy Nikkel (:tn) from comment #3)

In the crash dump disassembly, we've already executed SSE instructions prior to the current line, and I do trust the EIP, for now.

Are you sure these processors don't support SSE? If they are original Athlons then I'll agree, and I'm surprised we didn't crash before this. Or maybe they're Athlon XP? I can't tell.

If these cpus don't support SSE then the next question is why we're executing this function. It makes me suspect an issue with the cpuid detection. I notice that the JS code is paranoid and zeroes out its inputs: https://dxr.mozilla.org/mozilla-central/source/js/src/jit/x86-shared/Assembler-x86-shared.cpp#249 but not the qcms code: https://dxr.mozilla.org/mozilla-central/source/gfx/qcms/transform.c#1013
Although, SSE1 comes from EDX, so it wouldn't be subject to that cpuid bug. If we were getting random data in ECX then we'd be detecting SSE3 or SSE4, so we'd be using the SSE2 functions. So nevermind that theory.
> maybe they're Athlon XP? I can't tell.

Apparently they are. Crash-stats says the most frequent chip is:
AuthenticAMD family 6 model 8 stepping 1 | 1

Which appear to be Athlon XP or better: http://www.cpu-world.com/cgi-bin/CPUID.pl?MANUF=AMD&SIGNATURE=1665

So it seems they do have SSE: http://en.wikipedia.org/wiki/Athlon#Athlon_XP.2FMP
Oddly enough that are even a few reports on Linux from the same processors that crash on the same line. So I guess it's not a highly specific compiler-related thing.
(Assignee)

Comment 8

3 years ago
Hmmm. I had somehow missed the fact that you mentioned that these processors don't support SSE, though I guess I don't need to worry about that concern now.

It sounds like we have eliminated the compiler as a concern, as well.

Given that nothing has changed in the qcms code, and nothing has changed in the ImageLib code aside from the origin of the buffer that the qcms code is working in, my best guess is still that there's an alignment issue.
The history on this one is weird. Here are the crashes since 1 January:

3 	20150318055750 	150 	17.22 %
1 	20150319030202 	195 	22.39 %
4 	20150320030211 	74 	8.50 %
2 	20150508030204 	194 	22.27 %
11 	20150509030210 	10 	1.15 %
10 	20150509081231 	13 	1.49 %
8 	20150510030207 	37 	4.25 %
7 	20150511030203 	48 	5.51 %
6 	20150511122605 	55 	6.31 %
5 	20150512030215 	63 	7.23 %
9 	20150513030209 	32 	3.67 %

It came and went in March. Then flared up again on 0508 with a steady lower volume since.
(Assignee)

Comment 10

3 years ago
(In reply to David Major [:dmajor] from comment #9)
> It came and went in March. Then flared up again on 0508 with a steady lower
> volume since.

That'd point even more strongly at bug 1124084, which landed in March and was backed out a short time later.
(Assignee)

Updated

3 years ago
Keywords: leave-open
(Assignee)

Comment 11

3 years ago
Created attachment 8606053 [details] [diff] [review]
Use the correct alignment for SIMD for the row buffer in image::Downscaler

OK, let's go ahead and fix the alignment thing. It's the only issue that jumps
out at me. This is a speculative fix, so I marked this bug 'leave-open'. In the
worst scenario, at least we'll have eliminated this as a possible cause.
Attachment #8606053 - Flags: review?(tnikkel)
(Assignee)

Updated

3 years ago
Assignee: nobody → seth
Status: NEW → ASSIGNED
Flags: needinfo?(tnikkel)
Attachment #8606053 - Flags: review?(tnikkel) → review+
Adding a tracking flag for firefox40 as this crash is spiking at times into the top-crash bucket.
tracking-firefox40: ? → +
(Assignee)

Comment 14

3 years ago
Thanks for the review!

Unfortunately I'm going to need to fix AlignedStorage properly rather than use the hack I did in that patch, as the hack doesn't work on Windows.
(Assignee)

Comment 15

3 years ago
Bummer, it looks like this is failing silently on Linux as well:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15795
After staring at the disassembly, I can see that output[0] is 0x80000000, i.e. -0.0f. And that in turn must have come from |result| a few lines up. Perhaps these processors don't clamp it correctly against |min| (which is +0.0f)?
Another possibility, it could be coming from |_mm_cvtps_pi32(result)|

The documentation for that intrinsic says: "If an overflow occurs, 0x80000000 is placed in the appropriate elements of the return value."
Created attachment 8608416 [details] [diff] [review]
Diagnostic patch
Attachment #8608416 - Flags: review?(seth)
Created attachment 8608418 [details] [diff] [review]
Diagnostic patch
Attachment #8608416 - Attachment is obsolete: true
Attachment #8608416 - Flags: review?(seth)
Attachment #8608418 - Flags: review?(seth)
(Assignee)

Comment 20

3 years ago
Comment on attachment 8608418 [details] [diff] [review]
Diagnostic patch

Review of attachment 8608418 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Hopefully this will give us a clue as to where the problem lies.
Attachment #8608418 - Flags: review?(seth) → review+
Created attachment 8609595 [details] [diff] [review]
Diagnostic patch 2

Ugh, I overlooked this in the previous patch, and xmm0/xmm1 were still getting clobbered.
Attachment #8609595 - Flags: review?(seth)
(Assignee)

Comment 24

3 years ago
Comment on attachment 8609595 [details] [diff] [review]
Diagnostic patch 2

Review of attachment 8609595 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for all your work on this, David.
Attachment #8609595 - Flags: review?(seth) → review+
Finally something resembling a clue:

        /* crunch, crunch, crunch */
        vec_r  = _mm_add_ps(vec_r, _mm_add_ps(vec_g, vec_b));
        vec_r  = _mm_max_ps(min, vec_r);
        vec_r  = _mm_min_ps(max, vec_r);
        result = _mm_mul_ps(vec_r, scale);

|result| has indeterminate values in three of its four lanes:
xmm1=           0      -1.#IND      -1.#IND      -1.#IND

The values for |min| and |max| and |scale| are all as expected. So the problem must already be present in the first value for |vec_r| in the code above.
Any chance we might have some bogus math due to bad parameters?

I guess |src| and |length| aren't really suspects, because presumably the tables can handle any byte. But what about |transform->matrix| or |transform->input_gamma_table_{r,g,b}|?

Perhaps did the downscale patch change some ordering so that something here is not initialized yet? Just a random guess.

If you don't have any ideas then I guess the next layer of yak is to get |vec_r| or its predecessors into registers.
Flags: needinfo?(seth)
(In reply to David Major [:dmajor] from comment #28)
> Any chance we might have some bogus math due to bad parameters?

This has been staring me in the face the whole time and I didn't realize it:

xmm5=           0      -1.#IND      -1.#IND      -1.#IND
xmm6=           0      -1.#IND      -1.#IND      -1.#IND
xmm7=           0      -1.#IND      -1.#IND      -1.#IND

Those are the |mat| values that come from |matrix->transform|.
(Reporter)

Comment 31

3 years ago
Looks like this was accidentally marked as tracking- instead of tracking+ for 39.
tracking-firefox39: - → +
Going on the assumption that these matrix values came from: https://hg.mozilla.org/mozilla-central/file/e537a1ba501b/gfx/qcms/transform.c#l1330

One possible way to get #IND is 0*infinity. Maybe we had one of those in the matrix_multiply call there? (Although, the code does check out_matrix.invalid)

I guess we could test all those values right there, and crash if any are #IND. It might get us to the the next question at least. I'm hoping Seth has a better idea :)
(Assignee)

Comment 33

3 years ago
(In reply to David Major [:dmajor] from comment #32)
> One possible way to get #IND is 0*infinity. Maybe we had one of those in the
> matrix_multiply call there? (Although, the code does check
> out_matrix.invalid)

If the qcms profile is corrupt I think we could still potentially end up with garbage. After all, build_colorant_matrix always sets |invalid| to false:

https://dxr.mozilla.org/mozilla-central/source/gfx/qcms/transform_util.c?from=build_colorant_matrix&case=true#226

matrix_multiply just checks if either of the input matrices are invalid:

https://dxr.mozilla.org/mozilla-central/source/gfx/qcms/matrix.c?from=matrix_multiply&case=true#118

And matrix_invert just checks if the determinant is strictly equal to 0:

https://dxr.mozilla.org/mozilla-central/source/gfx/qcms/matrix.c?from=matrix_multiply&case=true#65

So I don't think checking |out_matrix.invalid| will necessarily detect issues earlier in the pipeline.
Flags: needinfo?(seth)
(Assignee)

Comment 34

3 years ago
Created attachment 8610903 [details] [diff] [review]
Paper over qcms crashes due to NaN values in qcms_transform::matrix

If the problem is indeed due to NaN values in qcms_transform::matrix, this patch
should cause the crash volume to greatly drop. Let's find out.

If this patch *does* reduce the volume of crashes, we still need to dig deeper
and find the real problem - there's an input validation issue somewhere earlier
in the pipeline.
Attachment #8610903 - Flags: review?(dmajor)

Updated

3 years ago
Attachment #8610903 - Flags: review?(dmajor) → review+
Could you uplift this to 40 as well?
Flags: needinfo?(seth)
(Assignee)

Updated

3 years ago
Attachment #8606053 - Attachment is obsolete: true
Flags: needinfo?(seth)
(Assignee)

Comment 38

3 years ago
Comment on attachment 8610903 [details] [diff] [review]
Paper over qcms crashes due to NaN values in qcms_transform::matrix

Approval Request Comment
[Feature/regressing bug #]: It *seems* to be an ancient bug that bug 1124084 made more visible.
[User impact if declined]: Crashes. This is a top crash.
[Describe test coverage new/current, TreeHerder]: None, because we don't know any STR yet. As we continue to debug this we should be able to construct a testcase in the future.
[Risks and why]: Very low risk. Just makes us reject ICC profiles that contain invalid matrices.
[String/UUID change made/needed]: None.
Attachment #8610903 - Flags: approval-mozilla-aurora?
Crash-stats confirms no hits on the 0528 nightly.
(Assignee)

Comment 40

3 years ago
(In reply to David Major [:dmajor] from comment #39)
> Crash-stats confirms no hits on the 0528 nightly.

That's great news!

We should probably get this on 39 ASAP, as well.

Since this is tracked, I think we should make life easier for release managers by going ahead and resolving this bug. I'll file a followup in which we can fix the input validation issue (the root cause).
Yeah, that sounds good.
status-firefox41: --- → fixed
Target Milestone: --- → mozilla41
(Assignee)

Updated

3 years ago
Attachment #8608418 - Attachment is obsolete: true
(Assignee)

Updated

3 years ago
Attachment #8609595 - Attachment is obsolete: true
(Assignee)

Comment 42

3 years ago
Comment on attachment 8610903 [details] [diff] [review]
Paper over qcms crashes due to NaN values in qcms_transform::matrix

Requesting beta uplift as well:

Approval Request Comment
[Feature/regressing bug #]: It *seems* to be an ancient bug that bug 1124084 made more visible.
[User impact if declined]: Crashes. This is a top crash.
[Describe test coverage new/current, TreeHerder]: None, because we don't know any STR yet. As we continue to debug this we should be able to construct a testcase in the future.
[Risks and why]: Very low risk. Just makes us reject ICC profiles that contain invalid matrices.
[String/UUID change made/needed]: None.
Attachment #8610903 - Flags: approval-mozilla-beta?
(Assignee)

Updated

3 years ago
Blocks: 1170316
(Assignee)

Comment 43

3 years ago
I filed bug 1170316 about the input validation issue.

Resolving this since the crash is fixed on central.
(Assignee)

Updated

3 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Keywords: leave-open
Resolution: --- → FIXED
Comment on attachment 8610903 [details] [diff] [review]
Paper over qcms crashes due to NaN values in qcms_transform::matrix

Fix a top crash, taking it.
Attachment #8610903 - Flags: approval-mozilla-beta?
Attachment #8610903 - Flags: approval-mozilla-beta+
Attachment #8610903 - Flags: approval-mozilla-aurora?
Attachment #8610903 - Flags: approval-mozilla-aurora+
FWIW, I would've preferred to review this instead of dmajor.
Oh sorry, I didn't know this was in a different module from imagelib.
You need to log in before you can comment on or make changes to this bug.