Closed Bug 1265615 Opened 8 years ago Closed 7 years ago

crash in SkPath::isRectContour

Categories

(Firefox Build System :: General, defect)

x86
Windows NT
defect
Not set
critical

Tracking

(firefox48 affected)

RESOLVED WONTFIX
Tracking Status
firefox48 --- affected

People

(Reporter: n.nethercote, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

This bug was filed from the Socorro interface and is 
report bp-90dd96d0-d626-4a4e-8c9b-98fb92160418.
=============================================================

There have been 15 crashes with this signature in the past 7 days. I think it first showed up in Firefox 48.

It's a crash on an illegal instruction. I opened the minidump in Visual Studio and it claims the bad instruction is "movss xmm0,dword ptr [ebp-20h]".
lsalzman, can you please take a look?
Flags: needinfo?(lsalzman)
Whiteboard: [gfx-noted]
Could this have been caused by the compiler update? Are we still using /arch:IA32 everywhere? Does /arch:IA32 still work in VS2105?
It looks like VS2015 is potentially generating SSE movss instructions even though we are compiling SkPath.cpp with /arch:IA32. What to do?
Flags: needinfo?(lsalzman) → needinfo?(gps)
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win32-pgo/1461065403/mozilla-central-win32-pgo-bm70-build1-build4.txt.gz confirms we're still passing /arch:IA32 to the compiler.

We may want to pour over the release notes and see if something changed in VS2015 w.r.t. SSE instructions.
Depends on: vs2015
Flags: needinfo?(gps)
Oh, we are seeing things like:

06:01:00     INFO -  c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/_virtualenv/Scripts/python.exe -m mozbuild.action.cl  c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/vs2015u1/VC/bin/amd64_x86/cl.EXE -FoSkPathOpsDebug.obj -c -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/stl_wrappers  -DNDEBUG=1 -DTRIMMED=1 -DUNICODE -D_UNICODE -DSKIA_IMPLEMENTATION=1 -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/gfx/skia -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/c -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/config -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/core -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/effects -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/gpu -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/images -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/pathops -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/ports -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/private -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/utils -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/utils/mac -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/utils/win -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/views -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/core -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/gpu -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/gpu/effects -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/gpu/gl -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/image -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/lazy -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/opts -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/sfnt -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/utils -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/utils/mac -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/utils/win -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/include  -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/include/nspr -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/include/nss        -MD -FI c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/mozilla-config.h -DMOZILLA_CLIENT   -TP -nologo -wd5026 -wd5027 -Zc:sizedDealloc- -Zc:threadSafeInit- -wd4091 -D_HAS_EXCEPTIONS=0 -W3 -Gy -arch:IA32 -FS -wd4251 -wd4244 -wd4267 -wd4345 -wd4351 -wd4800 -wd4819 -wd4595 -we4553 -GR-  -Zi -GL -O1 -Oi -Oy-  -Fdgenerated.pdb   c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/pathops/SkPathOpsDebug.cpp
06:01:01     INFO -  cl : Command line warning D9025 : overriding '/arch:IA32' with '/arch:SSE2'

Not sure exactly why we're getting that D9025 warning. My guess is one of the other compiler flags implies /arch:SSE2.
On my machine, SkPath.cpp gets merged into Unified_cpp_gfx_skia5.cpp. From the linked build log:

06:03:17     INFO -  c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/_virtualenv/Scripts/python.exe -m mozbuild.action.cl  c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/vs2015u1/VC/bin/amd64_x86/cl.EXE -FoUnified_cpp_gfx_skia5.obj -c -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/stl_wrappers  -DNDEBUG=1 -DTRIMMED=1 -DUNICODE -D_UNICODE -DSKIA_IMPLEMENTATION=1 -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/gfx/skia -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/c -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/config -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/core -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/effects -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/gpu -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/images -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/pathops -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/ports -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/private -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/utils -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/utils/mac -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/utils/win -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/include/views -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/core -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/gpu -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/gpu/effects -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/gpu/gl -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/image -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/lazy -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/opts -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/sfnt -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/utils -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/utils/mac -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/gfx/skia/skia/src/utils/win -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/include  -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/include/nspr -Ic:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/dist/include/nss        -MD -FI c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/mozilla-config.h -DMOZILLA_CLIENT   -TP -nologo -wd5026 -wd5027 -Zc:sizedDealloc- -Zc:threadSafeInit- -wd4091 -D_HAS_EXCEPTIONS=0 -W3 -Gy -arch:IA32 -FS -wd4251 -wd4244 -wd4267 -wd4345 -wd4351 -wd4800 -wd4819 -wd4595 -we4553 -GR-  -Zi -GL -O1 -Oi -Oy-  -Fdgenerated.pdb   c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/src/obj-firefox/gfx/skia/Unified_cpp_gfx_skia5.cpp

That has /arch:IA32 and I don't see a D9025 warning for it.
Unfortunately, build system output is interleaved from multiple processes. While I don't think we're getting that D9025 warning for Unified_cpp_gfx_skia5.cpp, I'm not 100% certain. Only way to know for sure is to build locally.

Unfortunately I don't have access to a Windows machine right this moment. But anyone should be able to test this...
It's a straight undocumented case: during PGO, what is actually compiling is the linker. And it turns out the linker does take -arch:ia32 too, which it expects passed both during profile gen and use passes. To make matters funnier, link.exe does reject -arch:ia32 when *not* doing PGO.
Assignee: nobody → mh+mozilla
Component: Graphics → Build Config
Erf no, in fact, it just takes time to complain about the option being unknown...
It does seem like a MSVC bug, where it happily compiles parts that are -arch:ia32 as if they had -arch:sse2. Interestingly, I can't reproduce with a small testcase, and it doesn't seem to be happening in smaller binaries of ours, but it's hard to tell because the CRT comes with its own functions that use SSE (but I'm assuming they are only used when SSE is supported, considering most of them have a function name starting with Xmm), so "simple" grep on disassembled sources find those...

That's about as far I can take it for today.
Assignee: mh+mozilla → nobody
Something that I didn't try but that would be worth trying, is whether this happens when building with LTCG but not PGO, which would tell us if it's a LTCG problem or a profile-induced problem.
Also, I should mention that one reason I wrote comment 8 is that the first thing I tried was to rebuild xul.dll per the profile-use pass, adding -arch:ia32, and the linker barfed that the flag was not passed during the profile-gen pass.
Note we could try solving this for the time being by surrounding the code where this occurs with a #pragma to disable optimizations for that function. It shouldn't have a critical effect on performance and may solve the issue while we figure out what's going on.
(In reply to Bas Schouten (:bas.schouten) from comment #13)
> Note we could try solving this for the time being by surrounding the code
> where this occurs with a #pragma to disable optimizations for that function.
> It shouldn't have a critical effect on performance and may solve the issue
> while we figure out what's going on.

How do you ensure it's the only place where this is happening? Are you suggesting we chase the crashes until there is none left?
(In reply to Mike Hommey [:glandium] from comment #14)
> (In reply to Bas Schouten (:bas.schouten) from comment #13)
> > Note we could try solving this for the time being by surrounding the code
> > where this occurs with a #pragma to disable optimizations for that function.
> > It shouldn't have a critical effect on performance and may solve the issue
> > while we figure out what's going on.
> 
> How do you ensure it's the only place where this is happening? Are you
> suggesting we chase the crashes until there is none left?

I'm suggesting there's no reason to believe this is a common occurrence. If a crash due to this reaches a significant volume, I think preventing that crash is a better idea than waiting until we understand exactly what's going wrong. It's obviously not a long-term sustainable solution, and I never suggested it was :-).
Should we see if vs2015u2 helps here?
See Also: → 1208644
It wouldn't be the first time we've worked around a compiler bug. It should be feasible to get a sense of the scope of the problem by writing a little script to grep the full disassembly of xul.dll, something like (incomplete Python):
```
import subprocess
function = None
sse2_funcs = []
for line in subprocess.Popen(['dumpbin.exe', '-disasm', path_to_xul_dll], stdout=subprocess.PIPE).stdout:
  if line.endswith(':\r\n'):
    function = line[:-3]
    print function
  elif function and line_contains_sse2_instruction(line):
    sse2_funcs.append(function)
print '\n'.join(sse2_funcs)
```
How feasible is it to write an automation check that verifies certain instructions don't accidentally sneak into our code? Could we maintain a whitelist of functions that are allowed to have e.g. SSE instructions?
(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)
> Should we see if vs2015u2 helps here?

I did my tests with a fresh install, so I was using update 2.

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #17)
> It wouldn't be the first time we've worked around a compiler bug. It should
> be feasible to get a sense of the scope of the problem by writing a little
> script to grep the full disassembly of xul.dll, something like (incomplete
> Python):

The problem is... there are a lot of legitimate uses of SSE instructions in xul.dll, both from our code and from the CRT. In those cases, the SSE instructions are behind a CPU check. But they still are in the binary. So you will always find SSE instructions in xul.dll. And as I wrote, I found some in smaller binaries too (like firefox.exe), from the CRT (but then that has nothing to do with LTCG).
(In reply to Gregory Szorc [:gps] from comment #18)
> How feasible is it to write an automation check that verifies certain
> instructions don't accidentally sneak into our code? Could we maintain a
> whitelist of functions that are allowed to have e.g. SSE instructions?

Depends on how accurate the symbol information is for xul.dll after PGO gets done with it.  It's also possible that PGO might suddenly decide that an SSE-but-check-first function on the whitelist should be inlined into some other function, and then somebody would have to go verify that yes, it's OK that *this* function now contains SSE instructions...but then that doesn't prevent the compiler from using SSE instructions in the newly-whitelisted function.  And so forth.

Actually, putting things that way, it sounds pretty tedious to do with high probability of things going wrong.  This scenario is why we have things like compiler switches. :)
Did someone report this to Microsoft?
(In reply to Mike Hommey [:glandium] from comment #21)
> Did someone report this to Microsoft?

If they have, I would have expected it to be reported in the bug. I think you should report it.
Depends on: 1270664
(In reply to Ted Mielczarek [:ted.mielczarek] (Vacation May 5th-11th) from comment #17)
> It wouldn't be the first time we've worked around a compiler bug. It should
> be feasible to get a sense of the scope of the problem by writing a little
> script to grep the full disassembly of xul.dll, something like (incomplete
> Python):
> ```
> import subprocess
> function = None
> sse2_funcs = []
> for line in subprocess.Popen(['dumpbin.exe', '-disasm', path_to_xul_dll],
> stdout=subprocess.PIPE).stdout:
>   if line.endswith(':\r\n'):
>     function = line[:-3]
>     print function
>   elif function and line_contains_sse2_instruction(line):
>     sse2_funcs.append(function)
> print '\n'.join(sse2_funcs)
> ```

So, some interesting things to note here:
- xul.dll contains legitimate SSE, so that wouldn't work.
- while it previously didn't happen on non-PGO builds, after I confirmed it happened on LTCG-non-PGO builds, I now can reproduce on non-LTCG builds... weird, but at least it makes things easier to reproduce
- I've run dumpbin.exe on all .objs on a non-PGO build, and got a large list of .obj files, including prtime.obj.
- Since that one looked like it would be a nicer candidate for a bug report (preprocessed, it's 10 times smaller than SkPath.cpp preprocessed), I compared the obj with the assembly (from -FA), and the assembly didn't contain SSE instructions (!)
- It turns out that (what I suspect is) offset tables for switch statements are not discriminated by the the disassembler, and thanks to x86 machine code being variable length with a very large set of opcode, some of the data in those tables disassembles as SSE instructions.
- That is not, however, what happens in SkPath.obj, which does have real SSE instructions, that are visible with the .asm output from MSVC.
- So on non-PGO builds, to detect if SSE instructions are used, you need to look at the .asm output from MSVC, which means adding -FA to the build flags.
- Obviously, that doesn't help with PGO builds... although we could cross fingers and hope that nothing more than what is detected in the .asm files on non-PGO is being SSE'd.

FWIW, the list of files that I found affected by cross checking .obj and .asm (eliminating those that contain SSE or AVX in their name ; note skia was de-unified on this build):
./gfx/cairo/cairo/src/cairo-d2d-surface.obj
./gfx/graphite2/src/Unified_cpp_gfx_graphite2_src0.obj
./gfx/graphite2/src/Unified_cpp_gfx_graphite2_src1.obj
./gfx/skia/GrAAHairLinePathRenderer.obj
./gfx/skia/GrPathUtils.obj
./gfx/skia/GrPLSPathRenderer.obj
./gfx/skia/SkDistanceFieldGen.obj
./gfx/skia/SkParsePath.obj
./gfx/skia/SkPath.obj
./gfx/skia/SkScan_Hairline.obj
./js/src/Unified_cpp_js_src1.obj
./js/src/Unified_cpp_js_src11.obj
./js/src/Unified_cpp_js_src16.obj
./js/src/Unified_cpp_js_src17.obj
./js/src/Unified_cpp_js_src23.obj
./media/libvpx/rtcd.obj
./media/libyuv/libyuv_libyuv/Unified_cpp_media_libyuv0.obj
./media/libyuv/libyuv_libyuv/Unified_cpp_media_libyuv1.obj
./media/webrtc/trunk/webrtc/common_audio/common_audio_common_audio/Unified_cpp_webrtc_common_audio1.obj
(In reply to Mike Hommey [:glandium] from comment #23)
> ./media/libyuv/libyuv_libyuv/Unified_cpp_media_libyuv0.obj
> ./media/libyuv/libyuv_libyuv/Unified_cpp_media_libyuv1.obj

Those come from compare_win.cc, rotate.cc, row_win.cc and scale_win.cc all containing direct SSE assembly, and it's fine. False alarm for them.

> ./gfx/graphite2/src/Unified_cpp_gfx_graphite2_src0.obj
> ./gfx/graphite2/src/Unified_cpp_gfx_graphite2_src1.obj

These two come from Collider.cpp and Slot.cpp.

> ./media/webrtc/trunk/webrtc/common_audio/common_audio_common_audio/Unified_cpp_webrtc_common_audio1.obj

This one comes from window_generator.cc, and it's small enough that I can derive a very small test case out of it. Yay \o/ (like, as small as 16 lines of C++ + a simple command line)
Blocks: 1186064
Depends on: 1270714
Since we'll be requiring support for SSE2, does that make this bug WONTFIX?
(In reply to Gregory Szorc [:gps] from comment #26)
> Since we'll be requiring support for SSE2, does that make this bug WONTFIX?

I agree.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.