Closed Bug 929834 Opened 11 years ago Closed 10 years ago

Unbalanced floating point stack with VS2013, leads to divide by zero in mochitest-chrome

Categories

(Firefox Build System :: General, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: away, Unassigned)

References

Details

Attachments

(2 files)

STR:
Build opt with VS2013
mach mochitest-chrome toolkit/components

Result:
Test crashes with divide-by-zero at mozjs!double_conversion::DigitGen

I've attached a heavily-reduced version of test_jsctypes.js that demonstrates the minimal repro. Originally this test was designed to verify that you get a TypeError if you try to stick non-integral values (Infinity, NaN, floating point) into integral types like ctypes.unsigned_long.

For now let's look at the first call to run_wrapped_integer_tests. The test takes the four values in [Infinity, -Infinity, NaN, 0.1] and tries to put each one into a ctypes.unsigned_long. All of these operations fail with a TypeError (as expected). For the fourth element, the code calls mozjs!double_conversion::FastDtoa to convert 0.1 into a string so that we can include it in the message of the TypeError. All of this is fine so far.

But on the third call to run_wrapped_integer_tests, we hit a divide by zero in the FastDtoa for the 0.1 value. The zero divisor results from an incorrect lookup in  mozjs!double_conversion::PowersOfTenCache::GetCachedPowerForBinaryExponentRange. That function uses a "fld; fstp" sequence essentially as a substitute for memcpy. On the third FastDtoa, the floating point stack has become full, so the fld fails to load the desired value.

It seems that over time, all these Infinity, -Infinity, and NaN values build up on the floating point stack. Here's what the stack looks like on the first FastDtoa(0.1):

mozjs!double_conversion::PowersOfTenCache::GetCachedPowerForBinaryExponentRange:
5f4beea0 83ec10          sub     esp,10h
0:050> rF
fpcw=027F: rn 53 puozdi  fpsw=2801: top=5 cc=0000 -------i  fptw=ABFF
fopcode=0000  fpip=0023:66300120  fpdp=002b:10d2e84c
st0= 1.#QNAN0000000000000000e+0000  st1=-1.#INF00000000000000000e+0000
st2= 1.#INF00000000000000000e+0000  st3= 0.000000000000000000000e+0000
st4= 0.000000000000000000000e+0000  st5= 0.000000000000000000000e+0000
st6= 0.000000000000000000000e+0000  st7= 0.000000000000000000000e+0000

And on the second FastDtoa(0.1):

mozjs!double_conversion::PowersOfTenCache::GetCachedPowerForBinaryExponentRange:
5f4beea0 83ec10          sub     esp,10h
0:050> rF
fpcw=027F: rn 53 puozdi  fpsw=1001: top=2 cc=0000 -------i  fptw=AAAF
fopcode=0000  fpip=0023:66300120  fpdp=002b:10d2e84c
st0= 1.#QNAN0000000000000000e+0000  st1=-1.#INF00000000000000000e+0000
st2= 1.#INF00000000000000000e+0000  st3= 1.#QNAN0000000000000000e+0000
st4=-1.#INF00000000000000000e+0000  st5= 1.#INF00000000000000000e+0000
st6= 0.000000000000000000000e+0000  st7= 0.000000000000000000000e+0000

And on the third FastDtoa(0.1):

mozjs!double_conversion::PowersOfTenCache::GetCachedPowerForBinaryExponentRange:
5f4beea0 83ec10          sub     esp,10h
0:050> rF
fpcw=027F: rn 53 puozdi  fpsw=3841: top=7 cc=0000 -s-----i  fptw=AAAA
fopcode=0000  fpip=0023:66300120  fpdp=002b:10d2e62c
st0= 1.#QNAN0000000000000000e+0000  st1=-1.#INF00000000000000000e+0000
st2= 1.#INF00000000000000000e+0000  st3= 1.#QNAN0000000000000000e+0000
st4=-1.#INF00000000000000000e+0000  st5= 1.#INF00000000000000000e+0000
st6= 1.#QNAN0000000000000000e+0000  st7=-1.#INF00000000000000000e+0000

I think VS2013 is producing unbalanced code for these NAN/INF values. We push these values onto the floating point stack while we are handling them, but we never clean up the stack. Normally you have pushes and pops in close proximity so stuff never accumulates long-term. I wonder if it's relevant that this test deliberately causes exceptions -- maybe we push, hit the exception, and fail to pop. I will continue investigating tomorrow.
Here is the fld with no corresponding pop. It is inside a CRT error handler.

MSVCR120!_except1+0xb1:
67f00120 dd4318          fld     qword ptr [ebx+18h] ds:002b:0c44ee9c=000000000000f0ff

This error handler runs when we try to convert the NaN/Infinity values to integral types. I've added a few frames that are lost due to jmp's.

MSVCR120!_except1+0xb1
mozjs!_ftol3_except+0x3e
(mozjs!_ftol3_arg_error)
(mozjs!_dtol3_NaN)
(mozjs!_dtoui3)
mozjs!js::ctypes::jsvalToInteger<unsigned long>+0x2a
mozjs!js::ctypes::ImplicitConvert+0x5bb
mozjs!js::ctypes::CData::ValueSetter+0x70
mozjs!js::Shape::set+0x15d
mozjs!js::baseops::SetPropertyHelper+0x2d2
mozjs!SetPropertyOperation+0xc1
mozjs!Interpret+0x3ecb
mozjs!js::RunScript+0x147

That fld looks really strange, since it is near the end of _except1, right before it starts restoring registers and doing epilogue stuff. Either _except1 was supposed to consume this before it returned, or it deliberately left something on the FP stack for the caller to consume, but neither is happening. I wonder what the older MSVCRTs did.
FWIW, this piece of _except1 appears the same in MSVCR100, MSVCR110, and MSVCR120D. It's likely that the problem is on the _ftol3 side.
In VC10, mozjs!js::ctypes::jsvalToInteger<unsigned long> just has a bunch of floating point instructions inline, and does not call out to any CRT functions.

_ftol3 must be new for VC12, since older builds (and web searches) only show _ftol2.
Attached file Minimal repro
This is a minimal cpp file that demonstrates the problem. I included this in the Connect bug.
Casting non-finite float values to integral types invokes undefined behavior, so MSVC is "correct" here.  Whether they want to make the compiler not fail so dramatically, however, is up to them.  :-)  We should fix our code not to do these things, because we're already going to get back a completely bogus result.
FWIW, this code uses "ConvertExact" (http://hg.mozilla.org/mozilla-central/file/518f5bff0ae4/js/src/ctypes/CTypes.cpp#l1537) which notices that the conversion was inexact and notifies the caller, who then doesn't touch the garbage value.

However, in the process of realizing that the conversion is inexact, as a side effect the FP stack gets messed up :-) Maybe that code should try to prevent the bad conversions in the first place, rather than letting it happen and checking it after the fact.
(In reply to comment #7)
> Maybe that code should try to prevent the bad conversions in
> the first place, rather than letting it happen and checking
> it after the fact.

Yes!  You can't retroactively check that undefined behavior was invoked, that's just wrong.  You have to check in advance and guard against the bad possibilities.

This is the language for how float->integer conversions work in C++11:

"""
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.  [ Note: If the destination type is bool, see 4.12. — end note ]
"""

The ConvertExact method appears to exist for basically every possible one of those conversions, which I'm sure makes for fun here in getting it all right.  Possibly excluding all !mozilla::IsFinite(d) and all cases where the truncated float aren't in the min/max range of the output type will address this.  Careful patching and reviewing required!
Didn't realize this was still assigned to me -- I was just doing the initial diagnosis.
Assignee: dmajor → nobody
I think the same problem is the cause when Firefox is left idle for a while. It then crashes with a div-by-zero (somewhere in mozglue.dll) as well.

Is there any progress at Mozilla or Microsoft side regarding this problem? I saw that the Microsoft Connect Report has been closed as duplicate but the duplicate bug is not visible.
> I think the same problem is the cause when Firefox is left idle for a while.
> It then crashes with a div-by-zero (somewhere in mozglue.dll) as well.

Did you build FF yourself using VS2013? The official releases are still built with previous compilers, so they wouldn't have this problem. I would still be interested in looking though. Could you paste some of your crash report IDs from about:crashes either here (if VS2013) or into a new bug?
(In reply to David Major [:dmajor] from comment #11)
> Did you build FF yourself using VS2013? 
Yes I do and beside of this bug, it runs pretty well.

> Could you paste some of your crash report IDs from about:crashes either here (if VS2013) or into a new bug?
I have none because I use "--disable-crashreporter" but I can make a new build having it enabled.
If you're building with VC2013 locally then you could just attach a debugger and get a stack. We have a nice wiki page on doing that with WinDBG if you haven't had to do that in the past:
https://developer.mozilla.org/en-US/docs/How_to_get_a_stacktrace_with_WinDbg
In particular, the rF command (show floating-point registers) would be useful here:

0:050> rF
fpcw=027F: rn 53 puozdi  fpsw=3841: top=7 cc=0000 -s-----i  fptw=AAAA
fopcode=0000  fpip=0023:66300120  fpdp=002b:10d2e62c
st0= 1.#QNAN0000000000000000e+0000  st1=-1.#INF00000000000000000e+0000
st2= 1.#INF00000000000000000e+0000  st3= 1.#QNAN0000000000000000e+0000
st4=-1.#INF00000000000000000e+0000  st5= 1.#INF00000000000000000e+0000
st6= 1.#QNAN0000000000000000e+0000  st7=-1.#INF00000000000000000e+0000

If you see the "s" bit in the flags ("-s-----i") and all eight slots are full of INF/QNAN, then it's likely this bug.
MS responded on the Connect bug and said this will be fixed in the spring VS update.
(In reply to David Major [:dmajor] from comment #14)
> In particular, the rF command (show floating-point registers) would be
> useful here:
> 
> 0:050> rF
> fpcw=027F: rn 53 puozdi  fpsw=3841: top=7 cc=0000 -s-----i  fptw=AAAA
> fopcode=0000  fpip=0023:66300120  fpdp=002b:10d2e62c
> st0= 1.#QNAN0000000000000000e+0000  st1=-1.#INF00000000000000000e+0000
> st2= 1.#INF00000000000000000e+0000  st3= 1.#QNAN0000000000000000e+0000
> st4=-1.#INF00000000000000000e+0000  st5= 1.#INF00000000000000000e+0000
> st6= 1.#QNAN0000000000000000e+0000  st7=-1.#INF00000000000000000e+0000
> 
> If you see the "s" bit in the flags ("-s-----i") and all eight slots are
> full of INF/QNAN, then it's likely this bug.

I had no luck reproducing my original problem. But a new one occurred: the Flashplayer-plugin (FlashPlayerPlugin_12_0_0_43.exe) crashes very often since the last 2 weeks or so. It does not crash in my build from 1/11 but in later builds from 1/23 (I have  not tried earlier builds).

Here a some hang reports
https://crash-analysis.mozilla.com/hang-reports/2014/02-02/hr-20140202-86d292ba-c819-455f-944a-49bb02cafb69.html
https://crash-analysis.mozilla.com/hang-reports/2014/02-02/hr-20140202-bc1b4c3b-5439-45e3-a259-8e10347e5f93.html

and I also have a crash dump, if you are interested.
(In reply to Ronny Perinke from comment #16)
Please use a new bug for the Flash hangs. This bug is about the floating point issues in VS.
BUGCHECK_STR:  APPLICATION_FAULT_STATUS_INTEGER_DIVIDE_BY_ZERO_BEFORE_WRITE

PRIMARY_PROBLEM_CLASS:  STATUS_INTEGER_DIVIDE_BY_ZERO_BEFORE_WRITE

DEFAULT_BUCKET_ID:  STATUS_INTEGER_DIVIDE_BY_ZERO_BEFORE_WRITE

0:000> rF
fpcw=027F: rn 53 puozdi  fpsw=1061: top=2 cc=0000 -sp----i  fptw=222E
fopcode=0000  fpip=0000:62349eaf  fpdp=0000:00eadaec
st0=-1.#IND00000000000000000e+0000  st1= 6.149611482392309760000e+0019
st2=-1.#IND00000000000000000e+0000  st3= 6.149611482392309760000e+0019
st4=-1.#IND00000000000000000e+0000  st5= 6.149611482392309760000e+0019
st6=-1.#IND00000000000000000e+0000  st7=-1.#IND00000000000000000e+0000
mozglue!double_conversion::DoubleToStringConverter::ToShortestIeeeNumber+0x318:
6234be38 f7f6            div     eax,esi
This issue is fixed in the second preview build (CTP 2) of Visual Studio Studio 2013 Update 2.
Resolving based on VS 2013 Update 2 CTP 2.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: