Closed Bug 461808 Opened 16 years ago Closed 9 years ago

Build shell with icc on mac and linux and compare perf

Categories: Core :: JavaScript Engine, enhancement
Platform: x86, macOS
Type: enhancement
Priority: Not set
Severity: normal
Status: RESOLVED WONTFIX
People: Reporter: gal, Assigned: gal
Attachments: 1 file, 4 obsolete files

Attached patch: patch (obsolete)
The attached patch makes us build and pass SunSpider with DEBUG=1, but we iloop (loop forever) with BUILD_OPT=1. Also, performance lags behind gcc, at least on Mac.
Attached patch: the right patch this time (obsolete)
Attachment #344956 - Attachment is obsolete: true
It seems -Os caused the iloop/crash. With -O2 we get a working OPT build with excellent results (28% speedup over the gcc build using icc+PGO).

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.28x as slow*   1019.2ms +/- 0.9%   1303.4ms +/- 0.9%     significant

=============================================================================

  3d:                  *1.166x as slow*   104.7ms +/- 0.9%    122.1ms +/- 1.3%     significant
    cube:              *1.183x as slow*    37.1ms +/- 2.6%     43.9ms +/- 3.1%     significant
    morph:             *1.076x as slow*    27.5ms +/- 1.4%     29.6ms +/- 2.3%     significant
    raytrace:          *1.21x as slow*     40.1ms +/- 1.3%     48.6ms +/- 1.2%     significant

  access:              *1.49x as slow*    141.7ms +/- 0.9%    211.6ms +/- 1.5%     significant
    binary-trees:      *1.080x as slow*    35.1ms +/- 1.8%     37.9ms +/- 1.7%     significant
    fannkuch:          *1.62x as slow*     69.1ms +/- 1.2%    111.9ms +/- 1.5%     significant
    nbody:             *1.95x as slow*     25.7ms +/- 1.3%     50.0ms +/- 1.5%     significant
    nsieve:            -                   11.8ms +/- 2.6%     11.8ms +/- 3.8% 

  bitops:              *2.19x as slow*     37.8ms +/- 2.3%     82.7ms +/- 1.3%     significant
    3bit-bits-in-byte: -                    1.6ms +/- 23.1%      1.6ms +/- 23.1% 
    bits-in-byte:      ??                   7.8ms +/- 3.9%      8.1ms +/- 2.8%     not conclusive: might be *1.038x as slow*
    bitwise-and:       -                    2.7ms +/- 12.8%      2.5ms +/- 15.1% 
    nsieve-bits:       *2.74x as slow*     25.7ms +/- 1.3%     70.5ms +/- 1.3%     significant

  controlflow:         1.019x as fast      32.5ms +/- 1.2%     31.9ms +/- 1.3%     significant
    recursive:         1.019x as fast      32.5ms +/- 1.2%     31.9ms +/- 1.3%     significant

  crypto:              *1.27x as slow*     47.1ms +/- 1.5%     59.7ms +/- 1.3%     significant
    aes:               *1.23x as slow*     26.6ms +/- 1.4%     32.8ms +/- 1.4%     significant
    md5:               *1.37x as slow*     14.6ms +/- 2.5%     20.0ms +/- 1.7%     significant
    sha1:              *1.169x as slow*     5.9ms +/- 6.9%      6.9ms +/- 3.3%     significant

  date:                *1.21x as slow*    176.4ms +/- 1.3%    214.1ms +/- 1.3%     significant
    format-tofte:      *1.27x as slow*     87.5ms +/- 1.0%    111.0ms +/- 1.2%     significant
    format-xparb:      *1.160x as slow*    88.9ms +/- 1.6%    103.1ms +/- 1.4%     significant

  math:                1.032x as fast      41.5ms +/- 1.7%     40.2ms +/- 2.2%     significant
    cordic:            1.27x as fast       24.1ms +/- 1.7%     19.0ms +/- 1.8%     significant
    partial-sums:      *1.38x as slow*      9.9ms +/- 2.3%     13.7ms +/- 2.5%     significant
    spectral-norm:     -                    7.5ms +/- 5.0%      7.5ms +/- 5.0% 

  regexp:              *1.30x as slow*    154.0ms +/- 0.9%    199.6ms +/- 1.1%     significant
    dna:               *1.30x as slow*    154.0ms +/- 0.9%    199.6ms +/- 1.1%     significant

  string:              *1.20x as slow*    283.5ms +/- 1.0%    341.5ms +/- 1.0%     significant
    base64:            *1.23x as slow*     12.5ms +/- 3.0%     15.4ms +/- 2.4%     significant
    fasta:             *1.159x as slow*    61.6ms +/- 1.6%     71.4ms +/- 1.0%     significant
    tagcloud:          *1.170x as slow*    89.4ms +/- 0.9%    104.6ms +/- 1.6%     significant
    unpack-code:       *1.28x as slow*     94.0ms +/- 1.1%    119.9ms +/- 1.0%     significant
    validate-input:    *1.162x as slow*    26.0ms +/- 1.3%     30.2ms +/- 1.0%     significant
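The "28% speedup" quoted above and the "1.28x as slow" TOTAL row appear to describe the same ratio, assuming FROM is the icc+PGO build and TO is the gcc build (the thread does not label the columns, so that reading is an assumption). A minimal sketch of the arithmetic follows; it uses `console.log` so that it runs under Node.js, and `print` would serve in the js shell.

```javascript
// Totals copied from the ** TOTAL ** row of the table above.
// Assumption: FROM is the icc+PGO build and TO is the gcc build.
const iccTotalMs = 1019.2;
const gccTotalMs = 1303.4;

// The "N x as slow" figure in the comparison output matches TO / FROM.
function slowdownRatio(fromMs, toMs) {
  return toMs / fromMs;
}

// Fraction of run time saved by the faster build relative to the slower one.
function timeSaved(fasterMs, slowerMs) {
  return 1 - fasterMs / slowerMs;
}

const ratio = slowdownRatio(iccTotalMs, gccTotalMs);
const saved = timeSaved(iccTotalMs, gccTotalMs);

console.log(`gcc / icc ratio: ${ratio.toFixed(2)}x`);
console.log(`run time saved by icc: ${(saved * 100).toFixed(1)}%`);
```

Note that the two figures are not interchangeable: a ratio of about 1.28 corresponds to roughly a 22% reduction in total run time.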
The icc build causes a regression in 3d-morph. Both interpreter and jit produce an incorrect result. This is probably a builtin issue (potentially rounding error).
The part of the patch that applies to jstracer.cpp seems to have been landed on the TraceMonkey branch.  The part that applies to Makefile.ref hasn't, but it doesn't seem appropriate as it has hardwired paths.

As well as SunSpider failures, there are also some failures in trace-test.js which I see on my Mac when I build 'js' with ICC (see below;  I grepped for "FAILURE").  Similar to the SunSpider case, these occur with an optimised build (--disable-debug --enable-optimize) but not with a debug build (--enable-debug --disable-optimize).  All the errors occur both with and without tracing (-j) so it does look like a built-in issue.


Infinity/Math.asin(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.atan(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.atan2(0,-0) : FAILED: expected number ( 3.141592653589793 )  != actual number ( 0 ) 
Infinity/Math.atan2(-0,1) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.atan2(-0, -0) : FAILED: expected number ( -3.141592653589793 )  != actual number ( 0 ) 
Math.atan2(-0, -1) : FAILED: expected number ( -3.141592653589793 )  != actual number ( 3.141592653589793 ) 
Math.atan2(-1,Number.POSITIVE_INFINITY) : FAILED: expected number ( 0 )  != actual number ( -0 ) 
Infinity/Math.ceil('-0') : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.ceil(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.ceil(-Number.MIN_VALUE) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.ceil(-0.9) : FAILED: expected number ( 0 )  != actual number ( -0 ) 
Math.ceil(-0.9) : FAILED: expected number ( 0 )  != actual number ( -0 ) 
Infinity/Math.floor(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.max(-0,-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.min(0,-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.min(-0,-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.pow(Number.NEGATIVE_INFINITY, -1) : FAILED: expected number ( 0 )  != actual number ( -0 ) 
Math.pow(Number.NEGATIVE_INFINITY, -3) : FAILED: expected number ( 0 )  != actual number ( -0 ) 
Infinity/Math.pow(-0, 1) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.pow(-0,3) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.pow(-0, -1) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.pow(-0, -10001) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.round(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.round(-0.49) : FAILED: expected number ( 0 )  != actual number ( -0 ) 
Math.round(-0.5) : FAILED: expected number ( 0 )  != actual number ( -0 ) 
Infinity/Math.sqrt(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Infinity/Math.tan(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
FAILED: Infinity/Math.asin(-0),Infinity/Math.atan(-0),Math.atan2(0,-0),Infinity/Math.atan2(-0,1),Math.atan2(-0, -0),Math.atan2(-0, -1),Math.atan2(-1,Number.POSITIVE_INFINITY),Infinity/Math.ceil('-0'),Infinity/Math.ceil(-0),Infinity/Math.ceil(-Number.MIN_VALUE),Math.ceil(-0.9),Math.ceil(-0.9),Infinity/Math.floor(-0),Infinity/Math.max(-0,-0),Infinity/Math.min(0,-0),Infinity/Math.min(-0,-0),Math.pow(Number.NEGATIVE_INFINITY, -1),Math.pow(Number.NEGATIVE_INFINITY, -3),Infinity/Math.pow(-0, 1),Infinity/Math.pow(-0,3),Math.pow(-0, -1),Math.pow(-0, -10001),Infinity/Math.round(-0),Math.round(-0.49),Math.round(-0.5),Infinity/Math.sqrt(-0),Infinity/Math.tan(-0)
The failures all seem related to the handling of special values (NaN, Inf, etc.). I'll investigate.
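Many of the test names above have the form `Infinity/Math.f(-0)`. A minimal sketch of why that form is useful: the two zeros compare equal, so the tests divide by the result to expose its sign. The helper name `isNegativeZero` is made up for illustration.

```javascript
// +0 and -0 compare equal, so === alone cannot tell them apart.
// Dividing by the value exposes the sign: the quotient is +Infinity or -Infinity.
function isNegativeZero(x) {
  return x === 0 && 1 / x === -Infinity;
}

console.log(-0 === 0);                  // the two zeros are "equal"
console.log(isNegativeZero(-0));
console.log(isNegativeZero(0));
console.log(Infinity / Math.asin(-0));  // the form used by the test names above
```

On a conforming engine this prints true, true, false and -Infinity. A build that turns -0 into 0 would print Infinity on the last line, which is the "actual" value reported in the failures.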
The following one-liner demonstrates the problem:

  print(Math.atan2(0,-0));

The answer is supposed to be 3.141592653589793; with ICC-opt the answer is zero.
By the time math_atan() is reached, the problem is already manifest -- the 2nd argument has somehow become the (jsval) integer 0, rather than the (jsval) double -0.  I suspect this one problem (-0 becoming 0) is causing all the above failures.

Working out where the bad conversion took place is beyond me at the moment...
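As a rough check of the "-0 becoming 0" hypothesis, the sketch below replaces every negative zero with +0 before calling Math.atan2 and compares the result with the correct one. `dropZeroSign` and `fmt` are hypothetical helpers that only mimic the suspected bad conversion; they are not code from the engine.

```javascript
// Replace a negative zero with +0, mimicking the suspected bad conversion.
function dropZeroSign(x) {
  return x === 0 ? 0 : x;
}

// Print -0 as "-0" (String(-0) would give "0").
function fmt(v) {
  return Object.is(v, -0) ? "-0" : String(v);
}

// (y, x) argument pairs taken from the failure list above.
const cases = [
  [0, -0],
  [-0, -0],
  [-0, -1],
];

for (const [y, x] of cases) {
  const expected = Math.atan2(y, x);
  const withLostSign = Math.atan2(dropZeroSign(y), dropZeroSign(x));
  console.log(
    `atan2(${fmt(y)}, ${fmt(x)}): expected ${fmt(expected)}, ` +
      `with sign lost ${fmt(withLostSign)}`
  );
}
```

With the sign dropped, the three cases give 0, 0 and 3.141592653589793, which matches the "actual" values in the failure list and is consistent with a single root cause.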
(In reply to comment #7)
> Working out where the bad conversion took place is beyond me at the moment...

Since you did all of the hard work, I figured I'd swoop in with the easy stuff:

It appears that the code at http://hg.mozilla.org/mozilla-central/file/23aa9ede6535/js/src/jsinterp.cpp#l3901 should be guarded by a configure test (JS_NEG_ZERO_BUG?) rather than by HPUX specifically.
Changing all three places that contain the HPUX-specific code reduces the number of failures from 28 to 10:
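The linked source is not quoted in this thread, so the following is only a sketch under the assumption that the HPUX-specific path negates a double by flipping its sign bit instead of using unary minus. It illustrates why that matters: a negation implemented as a subtraction from zero cannot produce -0 from +0. Both function names are made up for illustration.

```javascript
// A negation rewritten as "0 - d" loses the sign for d = +0:
// 0 - 0 is +0 under the default rounding mode, not -0.
function negateBySubtraction(d) {
  return 0 - d;
}

// Flip only the IEEE-754 sign bit, leaving exponent and mantissa untouched.
function negateBySignBit(d) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, d);                      // big-endian by default
  view.setUint8(0, view.getUint8(0) ^ 0x80);  // byte 0 carries the sign bit
  return view.getFloat64(0);
}

console.log(Object.is(negateBySubtraction(0), -0));  // sign is lost
console.log(Object.is(negateBySignBit(0), -0));      // sign is kept
console.log(negateBySignBit(1.5));
```

Since `-0` in script source is a negation applied to the constant 0, any path that evaluates or folds that negation incorrectly would hand +0 to the Math built-ins.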

Math.ceil('-0') : FAILED: expected number ( -0 )  != actual number ( 0 ) 
Infinity/Math.ceil('-0') : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.ceil(-0) : FAILED: expected number ( -0 )  != actual number ( 0 ) 
Infinity/Math.ceil(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.ceil(-Number.MIN_VALUE) : FAILED: expected number ( -0 )  != actual number ( 0 ) 
Infinity/Math.ceil(-Number.MIN_VALUE) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 
Math.ceil(-Number.MIN_VALUE) : FAILED: expected number ( -0 )  != actual number ( 0 ) 
Math.floor(-0) : FAILED: expected number ( -0 )  != actual number ( 0 ) 
Infinity/Math.floor(-0) : FAILED: expected number ( -Infinity )  != actual number ( Infinity ) 

I used diagnostic printf's to confirm that all three places are executed.

Attached is a very dirty patch that fixes those tests if you're using ICC (it breaks with all other compilers).  This will be a good start for a cleaner patch; I wasn't sure how the #defines should be done.

For the remaining failures, ICC's implementations of ceil() and floor() give the wrong answer when the argument is -0, returning 0 instead of -0!  I'm also not sure how best to address that.
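One possible shape for a workaround is a thin wrapper that handles the zero-producing inputs itself and defers everything else to the library routine. This is a sketch only: `signLosingCeil` and `signLosingFloor` are hypothetical stand-ins that imitate the reported misbehaviour, and the wrapper names are made up.

```javascript
// Hypothetical stand-ins for library routines that lose the sign of zero:
// adding +0 turns a -0 result into +0 and leaves every other value unchanged.
const signLosingCeil = (x) => Math.ceil(x) + 0;
const signLosingFloor = (x) => Math.floor(x) + 0;

function ceilKeepingZeroSign(x, rawCeil = signLosingCeil) {
  if (x === 0) return x;           // +0 and -0 both pass through unchanged
  if (x > -1 && x < 0) return -0;  // ceil of anything in (-1, 0) is -0
  return rawCeil(x);
}

function floorKeepingZeroSign(x, rawFloor = signLosingFloor) {
  if (x === 0) return x;           // floor can only yield -0 from a -0 input
  return rawFloor(x);
}

console.log(Object.is(signLosingCeil(-0), -0));                       // broken
console.log(Object.is(ceilKeepingZeroSign(-0), -0));                  // repaired
console.log(Object.is(ceilKeepingZeroSign(-Number.MIN_VALUE), -0));   // repaired
console.log(Object.is(floorKeepingZeroSign(-0), -0));                 // repaired
```

Returning the argument itself for a zero input, rather than a constant, keeps +0 as +0, because a comparison against zero cannot distinguish the two signs.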
Without -O2 ICC seems to generate correct code, so this definitely looks like a compiler bug in ICC (if you configure SM without --enable-optimize, we pass all tests). Maybe Moh can get us a non-buggy version. In parallel I think we should hack up a workaround. The ICC misoptimization seems localized enough to work around.
I submitted a bug report against ICC and will let you know when the fix is ready.

Meanwhile, we may try a quick and dirty fix such as:

#ifdef ICC
/* x == -0.0 is also true for +0.0, so return x itself to keep its sign. */
#define ceil(x)  (((x) == 0.0) ? (x) : ceil(x))
#define floor(x) (((x) == 0.0) ? (x) : floor(x))
#endif
Yeah, I added macros just like that. Unfortunately 3d-morph is still wrong. There must be some other regression somewhere. Probably also -0 related. Patch to follow.
Attached patch: v3 (obsolete)
Passes trace-tests. 3d-morph still incorrect.
Attachment #344958 - Attachment is obsolete: true
Attachment #364474 - Attachment is obsolete: true
3d-morph is a fairly simple loop involving sin(), and icc shows some serious rounding error in the result. I will try to verify whether sin() is the culprit.

expected: 6.394884621840902e-14
icc: 6.750155989720952e-14

I think I found the reason.

It is a one-liner: printf ("%lf", neg0*0.0);
ICC at -O2 computes (-0.0)*0.0 as 0.0, while at -Od it computes it as -0.0, which I guess is the intended result.

In 3d-morph, we have the expression sin()* -f30, which turns into 0.0 * (-0.0).

I don't have access to your ICC build of TM. If 3d-morph is changed appropriately, this can quickly be tested.
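For reference, a small sketch of the IEEE-754 rule at play: the sign of a product is the exclusive-or of the operand signs, even when the magnitude is zero. `describe` is a hypothetical helper for printing, needed because String(-0) yields "0".

```javascript
// Print a zero together with its sign.
function describe(x) {
  return Object.is(x, -0) ? "-0" : String(x);
}

const negZero = -0;

console.log(describe(negZero * 0));        // signs differ, so the product is -0
console.log(describe(0 * negZero));        // same result with the operands swapped
console.log(describe(negZero * negZero));  // signs agree, so the product is +0
console.log(1 / (negZero * 0));            // the sign becomes visible after a division
```

An optimizer that folds `x * 0.0` to the constant `0.0` would therefore change observable results whenever x is negative or -0, which is the kind of transformation a strict floating-point mode is meant to rule out.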
I posted these in the wrong bug:

So it seems icc's sin() implementation is a bit off. This affects ICC -O2 and
no optimization.

whale:src gal$ ./Darwin_OPT.OBJ/js -e "print(Math.sin(10))"
-0.5440211108893699
whale:src gal$ ./Darwin_ICC_OPT.OBJ/js -e "print(Math.sin(10))"
-0.5440211108893698
whale:src gal$

MSVC: -0.5440211108893698 (tested using a Firefox Windows build).

Maybe GCC is off here. This should be investigated further in a separate bug.
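The two printed results differ only in the last digit. A minimal sketch that measures the gap in units in the last place (ULPs) by comparing the raw bit patterns is shown below; `orderedBits` and `ulpDistance` are illustrative names, and the code needs a runtime with BigInt support, such as a current Node.js.

```javascript
// Map a double onto a signed integer line so that adjacent doubles map to
// adjacent integers (+0 and -0 both map to 0).
function orderedBits(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  const signBit = 1n << 63n;
  return bits >= signBit ? -(bits - signBit) : bits;
}

// Number of representable doubles between a and b (0n when they are equal).
function ulpDistance(a, b) {
  const d = orderedBits(a) - orderedBits(b);
  return d < 0n ? -d : d;
}

const gccSin = -0.5440211108893699;  // Darwin_OPT.OBJ output above
const iccSin = -0.5440211108893698;  // Darwin_ICC_OPT.OBJ output above

console.log(ulpDistance(gccSin, iccSin));
```

The distance is 1, so the two values are adjacent doubles. Either library could be the one that is off by a rounding step; deciding which would need a higher-precision reference value for sin(10).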
Latest numbers: we now see around a 20% speedup with icc. It probably went down because we trace more code.

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.197x as fast    1030.6ms +/- 0.1%   860.8ms +/- 0.1%     significant

=============================================================================

  3d:                  1.24x as fast      154.1ms +/- 0.2%   124.4ms +/- 0.3%     significant
    cube:              1.182x as fast      40.4ms +/- 0.7%    34.2ms +/- 0.8%     significant
    morph:             1.23x as fast       29.1ms +/- 0.3%    23.7ms +/- 0.6%     significant
    raytrace:          1.27x as fast       84.6ms +/- 0.2%    66.5ms +/- 0.3%     significant

  access:              1.078x as fast     132.2ms +/- 0.2%   122.6ms +/- 0.2%     significant
    binary-trees:      1.074x as fast      40.0ms +/- 0.4%    37.3ms +/- 0.3%     significant
    fannkuch:          1.058x as fast      57.0ms +/- 0.2%    53.9ms +/- 0.2%     significant
    nbody:             1.170x as fast      23.9ms +/- 0.5%    20.4ms +/- 0.7%     significant
    nsieve:            1.025x as fast      11.3ms +/- 1.1%    11.0ms +/- 0.5%     significant

  bitops:              1.060x as fast      35.5ms +/- 0.5%    33.5ms +/- 0.5%     significant
    3bit-bits-in-byte: ??                   1.6ms +/- 8.8%     1.6ms +/- 8.4%     not conclusive: might be *1.025x as slow*
    bits-in-byte:      1.049x as fast       8.1ms +/- 1.0%     7.7ms +/- 1.7%     significant
    bitwise-and:       *1.108x as slow*     2.0ms +/- 2.8%     2.3ms +/- 5.6%     significant
    nsieve-bits:       1.086x as fast      23.8ms +/- 0.5%    21.9ms +/- 0.7%     significant

  controlflow:         1.035x as fast      32.5ms +/- 0.4%    31.4ms +/- 0.4%     significant
    recursive:         1.035x as fast      32.5ms +/- 0.4%    31.4ms +/- 0.4%     significant

  crypto:              1.26x as fast       60.9ms +/- 0.5%    48.4ms +/- 0.4%     significant
    aes:               1.20x as fast       34.6ms +/- 0.4%    28.8ms +/- 0.4%     significant
    md5:               1.41x as fast       19.7ms +/- 0.7%    14.0ms +/- 0.7%     significant
    sha1:              1.183x as fast       6.6ms +/- 2.2%     5.6ms +/- 2.6%     significant

  date:                1.24x as fast      170.1ms +/- 0.1%   137.5ms +/- 0.2%     significant
    format-tofte:      1.32x as fast       67.5ms +/- 0.2%    51.2ms +/- 0.3%     significant
    format-xparb:      1.188x as fast     102.5ms +/- 0.2%    86.3ms +/- 0.2%     significant

  math:                1.061x as fast      38.8ms +/- 0.5%    36.6ms +/- 0.6%     significant
    cordic:            *1.149x as slow*    19.0ms +/- 0.5%    21.8ms +/- 0.5%     significant
    partial-sums:      1.51x as fast       13.8ms +/- 0.9%     9.1ms +/- 1.1%     significant
    spectral-norm:     1.071x as fast       6.1ms +/- 1.5%     5.7ms +/- 2.4%     significant

  regexp:              1.024x as fast      44.1ms +/- 0.3%    43.0ms +/- 0.4%     significant
    dna:               1.024x as fast      44.1ms +/- 0.3%    43.0ms +/- 0.4%     significant

  string:              1.28x as fast      362.6ms +/- 0.1%   283.5ms +/- 0.1%     significant
    base64:            1.30x as fast       16.2ms +/- 0.7%    12.5ms +/- 1.1%     significant
    fasta:             1.23x as fast       75.3ms +/- 0.2%    61.3ms +/- 0.2%     significant
    tagcloud:          1.190x as fast      99.4ms +/- 0.2%    83.5ms +/- 0.3%     significant
    unpack-code:       1.41x as fast      140.5ms +/- 0.1%    99.7ms +/- 0.2%     significant
    validate-input:    1.176x as fast      31.2ms +/- 0.4%    26.5ms +/- 0.5%     significant
whale:v8 gal$ ../Darwin_ICC_OPT.OBJ/js -j run.js 
Richards: 358
DeltaBlue: 142
Crypto: 867
RayTrace: 292
EarleyBoyer: 367
----
Score: 342
whale:v8 gal$ ../Darwin_OPT.OBJ/js -j run.js 
Richards: 298
DeltaBlue: 109
Crypto: 832
RayTrace: 243
EarleyBoyer: 350
----
Score: 297
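To put the two runs side by side, the sketch below assumes the overall score is the geometric mean of the individual scores (my understanding of how the V8 suite aggregates; the thread does not say) and that higher is better. The helper name is made up for illustration.

```javascript
// Per-benchmark scores copied from the two runs above
// (Richards, DeltaBlue, Crypto, RayTrace, EarleyBoyer).
const iccScores = [358, 142, 867, 292, 367];
const gccScores = [298, 109, 832, 243, 350];

// Geometric mean, computed in log space.
function geometricMean(values) {
  const logSum = values.reduce((sum, v) => sum + Math.log(v), 0);
  return Math.exp(logSum / values.length);
}

const iccMean = geometricMean(iccScores);
const gccMean = geometricMean(gccScores);

console.log(`icc: ${iccMean.toFixed(1)}, gcc: ${gccMean.toFixed(1)}`);
console.log(`icc / gcc: ${(iccMean / gccMean).toFixed(2)}`);
```

The recomputed means land within one point of the reported 342 and 297, and the small gap is plausibly due to the per-benchmark scores already being rounded. The ratio comes out at roughly 1.15 in favour of the icc build.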
Attached patch: v4
Latest patch.
Attachment #364478 - Attachment is obsolete: true
CCing jim. icc gets confused by the shell and editline static libraries. The PGO data is placed with the executable, and the linker can't find it when linking the libraries. I also currently can't use -ipo for a similar reason (not compatible with static linking). I don't see a huge win from the static libraries. Let's just integrate them into the regular build/link.
Back to 27% speedup using ipo:

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.27x as fast     1030.6ms +/- 0.1%   813.3ms +/- 0.2%     significant

=============================================================================

  3d:                  1.30x as fast      154.1ms +/- 0.2%   118.3ms +/- 0.3%     significant
    cube:              1.25x as fast       40.4ms +/- 0.7%    32.3ms +/- 0.7%     significant
    morph:             1.29x as fast       29.1ms +/- 0.3%    22.5ms +/- 0.6%     significant
    raytrace:          1.33x as fast       84.6ms +/- 0.2%    63.5ms +/- 0.2%     significant

  access:              1.147x as fast     132.2ms +/- 0.2%   115.3ms +/- 0.2%     significant
    binary-trees:      1.154x as fast      40.0ms +/- 0.4%    34.7ms +/- 0.4%     significant
    fannkuch:          1.110x as fast      57.0ms +/- 0.2%    51.4ms +/- 0.3%     significant
    nbody:             1.32x as fast       23.9ms +/- 0.5%    18.2ms +/- 0.6%     significant
    nsieve:            1.020x as fast      11.3ms +/- 1.1%    11.1ms +/- 1.0%     significant

  bitops:              1.095x as fast      35.5ms +/- 0.5%    32.4ms +/- 0.7%     significant
    3bit-bits-in-byte: -                    1.6ms +/- 8.8%     1.5ms +/- 9.4% 
    bits-in-byte:      1.047x as fast       8.1ms +/- 1.0%     7.7ms +/- 1.7%     significant
    bitwise-and:       -                    2.0ms +/- 2.8%     2.0ms +/- 3.5% 
    nsieve-bits:       1.122x as fast      23.8ms +/- 0.5%    21.2ms +/- 0.6%     significant

  controlflow:         *1.007x as slow*    32.5ms +/- 0.4%    32.7ms +/- 0.4%     significant
    recursive:         *1.007x as slow*    32.5ms +/- 0.4%    32.7ms +/- 0.4%     significant

  crypto:              1.30x as fast       60.9ms +/- 0.5%    46.7ms +/- 0.4%     significant
    aes:               1.24x as fast       34.6ms +/- 0.4%    27.9ms +/- 0.5%     significant
    md5:               1.46x as fast       19.7ms +/- 0.7%    13.5ms +/- 1.1%     significant
    sha1:              1.23x as fast        6.6ms +/- 2.2%     5.3ms +/- 2.5%     significant

  date:                1.31x as fast      170.1ms +/- 0.1%   129.3ms +/- 0.2%     significant
    format-tofte:      1.37x as fast       67.5ms +/- 0.2%    49.4ms +/- 0.3%     significant
    format-xparb:      1.28x as fast      102.5ms +/- 0.2%    80.0ms +/- 0.2%     significant

  math:                1.067x as fast      38.8ms +/- 0.5%    36.4ms +/- 0.7%     significant
    cordic:            *1.153x as slow*    19.0ms +/- 0.5%    21.9ms +/- 0.4%     significant
    partial-sums:      1.54x as fast       13.8ms +/- 0.9%     8.9ms +/- 0.9%     significant
    spectral-norm:     1.086x as fast       6.1ms +/- 1.5%     5.6ms +/- 2.5%     significant

  regexp:              1.039x as fast      44.1ms +/- 0.3%    42.4ms +/- 0.3%     significant
    dna:               1.039x as fast      44.1ms +/- 0.3%    42.4ms +/- 0.3%     significant

  string:              1.40x as fast      362.6ms +/- 0.1%   259.8ms +/- 0.3%     significant
    base64:            1.40x as fast       16.2ms +/- 0.7%    11.6ms +/- 1.2%     significant
    fasta:             1.45x as fast       75.3ms +/- 0.2%    52.1ms +/- 0.3%     significant
    tagcloud:          1.30x as fast       99.4ms +/- 0.2%    76.3ms +/- 0.6%     significant
    unpack-code:       1.49x as fast      140.5ms +/- 0.1%    94.5ms +/- 0.2%     significant
    validate-input:    1.23x as fast       31.2ms +/- 0.4%    25.4ms +/- 0.5%     significant

Build instructions: 

mkdir icc-build
cd icc-build
AR="/opt/intel/Compiler/11.0/056/bin/ia32/xiar" CXX="icpc -m32 -ipo" CC="icc -m32 -ipo" ../configure --enable-optimize --disable-debug
MOZ_PROFILE_GENERATE=1 make
cd ..
./bench-icc.sh
cd icc-build
cp pgopti.dpi editline
cp pgopti.dpi shell
MOZ_PROFILE_USE=1 make
gal: it's been a while since I fiddled with icc, but I believe I had it working for PGO if you ran the binary out of dist/bin. Can you try running your bench script with ./dist/bin/js and see if the build system can find the PGO files properly?

What's the status here, are you able to build correctly with a shipping version of ICC?
Ted, I did run js out of the bin directory. The PGO data is always placed in the same directory as the executable. It's the subsequent re-compilation/linking that fails. I essentially need the two cp lines in the makefile (see #23).

As for icc, yes, with v4 applied we build correctly with icc and pass all our JIT regression tests. I will run the full JS regression tests today.
Andreas,

Would you please try the ICC command-line option -fp-model precise on the original source? I.e., with the original ceil, floor, etc.

Linux:   icc -fp-model precise ...
Windows: icl /fp:precise ...

This should solve the problem, but we need to know the performance impact of enforcing the precise floating-point model. We'd need a SunSpider run.
Hi Moh. I am running the benchmarks with -fp-model precise right now.
I apparently attempted to handle this using the -prof-dir option:
http://mxr.mozilla.org/mozilla-central/source/js/src/configure.in#4661

Did that change (or stop working right?)
New test run with precise fp and without the modifications to the ceil/floor code and the various other workarounds (NEGZERO_BUG). This passes trace tests. We lose about 3ms performance. Still a very good result.

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.26x as fast     1030.6ms +/- 0.1%   816.2ms +/- 0.1%     significant

=============================================================================

  3d:                  1.29x as fast      154.1ms +/- 0.2%   119.5ms +/- 0.3%     significant
    cube:              1.23x as fast       40.4ms +/- 0.7%    32.8ms +/- 0.8%     significant
    morph:             1.27x as fast       29.1ms +/- 0.3%    22.9ms +/- 0.6%     significant
    raytrace:          1.32x as fast       84.6ms +/- 0.2%    63.9ms +/- 0.3%     significant

  access:              1.132x as fast     132.2ms +/- 0.2%   116.8ms +/- 0.2%     significant
    binary-trees:      1.151x as fast      40.0ms +/- 0.4%    34.7ms +/- 0.4%     significant
    fannkuch:          1.097x as fast      57.0ms +/- 0.2%    52.0ms +/- 0.3%     significant
    nbody:             1.25x as fast       23.9ms +/- 0.5%    19.1ms +/- 0.5%     significant
    nsieve:            1.029x as fast      11.3ms +/- 1.1%    11.0ms +/- 0.7%     significant

  bitops:              1.103x as fast      35.5ms +/- 0.5%    32.2ms +/- 0.7%     significant
    3bit-bits-in-byte: -                    1.6ms +/- 8.8%     1.5ms +/- 9.6% 
    bits-in-byte:      1.047x as fast       8.1ms +/- 1.0%     7.7ms +/- 1.7%     significant
    bitwise-and:       1.074x as fast       2.0ms +/- 2.8%     1.9ms +/- 4.5%     significant
    nsieve-bits:       1.129x as fast      23.8ms +/- 0.5%    21.1ms +/- 0.5%     significant

  controlflow:         *1.031x as slow*    32.5ms +/- 0.4%    33.5ms +/- 0.4%     significant
    recursive:         *1.031x as slow*    32.5ms +/- 0.4%    33.5ms +/- 0.4%     significant

  crypto:              1.31x as fast       60.9ms +/- 0.5%    46.6ms +/- 0.3%     significant
    aes:               1.24x as fast       34.6ms +/- 0.4%    27.9ms +/- 0.3%     significant
    md5:               1.47x as fast       19.7ms +/- 0.7%    13.5ms +/- 1.1%     significant
    sha1:              1.27x as fast        6.6ms +/- 2.2%     5.2ms +/- 2.2%     significant

  date:                1.31x as fast      170.1ms +/- 0.1%   130.2ms +/- 0.2%     significant
    format-tofte:      1.36x as fast       67.5ms +/- 0.2%    49.7ms +/- 0.3%     significant
    format-xparb:      1.27x as fast      102.5ms +/- 0.2%    80.4ms +/- 0.2%     significant

  math:                1.129x as fast      38.8ms +/- 0.5%    34.4ms +/- 0.7%     significant
    cordic:            1.109x as fast      19.0ms +/- 0.5%    17.1ms +/- 0.5%     significant
    partial-sums:      1.168x as fast      13.8ms +/- 0.9%    11.8ms +/- 1.0%     significant
    spectral-norm:     1.110x as fast       6.1ms +/- 1.5%     5.5ms +/- 2.6%     significant

  regexp:              1.095x as fast      44.1ms +/- 0.3%    40.2ms +/- 0.5%     significant
    dna:               1.095x as fast      44.1ms +/- 0.3%    40.2ms +/- 0.5%     significant

  string:              1.38x as fast      362.6ms +/- 0.1%   262.9ms +/- 0.1%     significant
    base64:            1.35x as fast       16.2ms +/- 0.7%    12.0ms +/- 0.3%     significant
    fasta:             1.42x as fast       75.3ms +/- 0.2%    53.1ms +/- 0.2%     significant
    tagcloud:          1.26x as fast       99.4ms +/- 0.2%    78.7ms +/- 0.3%     significant
    unpack-code:       1.49x as fast      140.5ms +/- 0.1%    94.2ms +/- 0.2%     significant
    validate-input:    1.25x as fast       31.2ms +/- 0.4%    24.9ms +/- 0.5%     significant
Ted, I am not very good at parsing configure scripts. Could you spell out what exactly I should try?
(In reply to comment #29)

Great. The ICC team analyzed the test I submitted, and they suggest using -fp-model precise or -fp-model source. The performance loss due to -fp-model source is higher than with -fp-model precise, but it seems -fp-model precise is sufficiently precise for us here. I'll post a link to a reference on the details for possible future use.
This is the current configure setting to build with ICC:

AR="/opt/intel/Compiler/11.0/056/bin/ia32/xiar" CXX="icpc -m32 -ipo -fp-model precise" CC="icc -m32 -ipo -fp-model precise" ../configure --enable-optimize --disable-debug

Also note that you must delete config.* from the build directory; otherwise configure does not pick up changes to CXX/CC.
(In reply to comment #30)
> Ted, I am not very good at parsing configure scripts. Could you spell out what
> exactly I should try?

I don't have any suggestions, I'm just wondering whether "-prof-dir" no longer works. moh: do you happen to know?
-prof-dir should work. It provides a convenient way of pointing profmerge and the compiler to the intended profile directory.

The prof-gen build should also remove the old .dyn/.dpi files from the profile directory. Otherwise, if a change in the source changes the control-flow graph, we'll end up with profile data (.dyn/.dpi files) that make different assumptions about the structure of the source. Later, in the profmerge or prof-use phase, profmerge will pick up the profile files in a given order and throw away inconsistent profile data. A warning is issued about the mismatched profile data, and it's good to always check for it, because the discarded data can have a major negative performance impact.

I hope this is also taken care of in the build scripts.
From the profile data, one can get nice coverage reports very easily. I opened a separate item for tracking.

https://bugzilla.mozilla.org/show_bug.cgi?id=480603
The shell makefile is still just as broken as ever, but here are some updated numbers for an icc pgo build:

============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total:                  760.4ms +/- 0.2%
--------------------------------------------

  3d:                   111.4ms +/- 0.3%
    cube:                33.9ms +/- 0.9%
    morph:               20.7ms +/- 0.7%
    raytrace:            56.8ms +/- 0.3%

  access:               113.1ms +/- 0.2%
    binary-trees:        36.1ms +/- 0.2%
    fannkuch:            47.0ms +/- 0.3%
    nbody:               18.8ms +/- 0.6%
    nsieve:              11.3ms +/- 1.1%

  bitops:                31.5ms +/- 0.8%
    3bit-bits-in-byte:    1.4ms +/- 10.0%
    bits-in-byte:         7.7ms +/- 1.7%
    bitwise-and:          2.6ms +/- 5.6%
    nsieve-bits:         19.8ms +/- 0.5%

  controlflow:           32.0ms +/- 0.3%
    recursive:           32.0ms +/- 0.3%

  crypto:                43.8ms +/- 0.9%
    aes:                 25.2ms +/- 1.2%
    md5:                 11.9ms +/- 0.8%
    sha1:                 6.8ms +/- 1.8%

  date:                 110.6ms +/- 0.2%
    format-tofte:        53.7ms +/- 0.3%
    format-xparb:        56.9ms +/- 0.2%

  math:                  22.8ms +/- 0.9%
    cordic:               8.1ms +/- 1.8%
    partial-sums:         9.0ms +/- 0.6%
    spectral-norm:        5.7ms +/- 2.4%

  regexp:                40.2ms +/- 0.3%
    dna:                 40.2ms +/- 0.3%

  string:               255.1ms +/- 0.3%
    base64:              12.3ms +/- 1.1%
    fasta:               54.2ms +/- 0.4%
    tagcloud:            78.4ms +/- 0.6%
    unpack-code:         84.9ms +/- 0.3%
    validate-input:      25.3ms +/- 0.6%
v8-bleeding-edge vs icc build

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.46x as slow*   518.7ms +/- 0.6%   754.9ms +/- 0.5%     significant

=============================================================================

  3d:                  *1.32x as slow*    83.0ms +/- 0.6%   109.6ms +/- 0.5%     significant
    cube:              *1.39x as slow*    23.8ms +/- 1.9%    33.1ms +/- 1.2%     significant
    morph:             1.70x as fast      34.6ms +/- 1.1%    20.4ms +/- 1.8%     significant
    raytrace:          *2.28x as slow*    24.6ms +/- 1.5%    56.1ms +/- 0.4%     significant

  access:              *2.99x as slow*    37.5ms +/- 1.3%   112.1ms +/- 0.6%     significant
    binary-trees:      *11.6x as slow*     3.1ms +/- 7.3%    35.9ms +/- 0.6%     significant
    fannkuch:          *3.47x as slow*    13.4ms +/- 2.8%    46.5ms +/- 0.8%     significant
    nbody:             *1.108x as slow*   16.7ms +/- 2.9%    18.5ms +/- 2.0%     significant
    nsieve:            *2.60x as slow*     4.3ms +/- 8.0%    11.2ms +/- 2.7%     significant

  bitops:              1.145x as fast     36.4ms +/- 1.4%    31.8ms +/- 1.4%     significant
    3bit-bits-in-byte: 2.07x as fast       3.1ms +/- 7.3%     1.5ms +/- 25.1%     significant
    bits-in-byte:      -                   8.0ms +/- 0.0%     8.0ms +/- 0.0% 
    bitwise-and:       3.70x as fast      10.0ms +/- 0.0%     2.7ms +/- 12.8%     significant
    nsieve-bits:       *1.28x as slow*    15.3ms +/- 2.3%    19.6ms +/- 1.9%     significant

  controlflow:         *11.0x as slow*     2.9ms +/- 7.8%    31.8ms +/- 0.9%     significant
    recursive:         *11.0x as slow*     2.9ms +/- 7.8%    31.8ms +/- 0.9%     significant

  crypto:              *1.191x as slow*   36.6ms +/- 1.4%    43.6ms +/- 2.5%     significant
    aes:               *1.47x as slow*    17.0ms +/- 0.0%    25.0ms +/- 3.8%     significant
    md5:               *1.175x as slow*   10.3ms +/- 3.4%    12.1ms +/- 1.9%     significant
    sha1:              1.43x as fast       9.3ms +/- 3.7%     6.5ms +/- 5.8%     significant

  date:                *1.81x as slow*    60.9ms +/- 1.3%   110.0ms +/- 0.6%     significant
    format-tofte:      *1.52x as slow*    35.1ms +/- 1.2%    53.3ms +/- 0.6%     significant
    format-xparb:      *2.20x as slow*    25.8ms +/- 1.8%    56.7ms +/- 0.9%     significant

  math:                1.99x as fast      44.9ms +/- 0.5%    22.6ms +/- 3.1%     significant
    cordic:            2.21x as fast      17.7ms +/- 2.0%     8.0ms +/- 6.0%     significant
    partial-sums:      2.17x as fast      19.5ms +/- 1.9%     9.0ms +/- 0.0%     significant
    spectral-norm:     1.38x as fast       7.7ms +/- 4.5%     5.6ms +/- 6.6%     significant

  regexp:              *1.57x as slow*    25.5ms +/- 1.5%    40.1ms +/- 0.6%     significant
    dna:               *1.57x as slow*    25.5ms +/- 1.5%    40.1ms +/- 0.6%     significant

  string:              *1.33x as slow*   191.0ms +/- 0.7%   253.3ms +/- 0.6%     significant
    base64:            1.59x as fast      19.4ms +/- 1.9%    12.2ms +/- 2.5%     significant
    fasta:             *1.94x as slow*    27.7ms +/- 1.2%    53.6ms +/- 0.7%     significant
    tagcloud:          *1.61x as slow*    48.4ms +/- 1.0%    77.9ms +/- 1.2%     significant
    unpack-code:       *1.27x as slow*    66.5ms +/- 0.8%    84.5ms +/- 0.4%     significant
    validate-input:    1.155x as fast     29.0ms +/- 1.2%    25.1ms +/- 0.9%     significant
Summary: TM: Build shell with icc on mac and linux and compare perf → Build shell with icc on mac and linux and compare perf
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX