4.8% regression on octane-zlib on March 3rd, 2015

Status: RESOLVED WONTFIX (reported 4 years ago; resolved 2 years ago)

People

(Reporter: h4writer, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

(Reporter)

Description

4 years ago
AWFY is reporting a 4.8% regression on octane-zlib:
http://arewefastyet.com/#machine=28&view=single&suite=octane&subtest=zlib&start=1425044595&end=1425080473

(This was just before AWFY went haywire. After it started acting normal again we still had this regression, so this is not a fluke.)

The regression range is:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=b94bcbc389e8&tochange=20d0b5cb2071

I guess one of the following might have caused this regression:
4c9ebb3591c4	Dan Gohman — Bug 1137573 - OdinMonkey: Generalize alignment analysis to handle adds with multiple uses r=luke a=ryanvm
5def1d193a0c	Dan Gohman — Bug 1137573 - OdinMonkey: Alignment Mask Analysis r=luke
(Reporter)

Comment 1

4 years ago
@Sunfish can you take a look?
Flags: needinfo?(sunfish)
(Reporter)

Updated

4 years ago
Depends on: 1137573
Comment 2

4 years ago
Looks like it does not play well with the later dehoisting of the constant index offsets. The dehoisting is just not working now. Perhaps the pattern matching broke.
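As a rough sketch of the 'dehoisting' being discussed (made-up variable names, not code from the benchmark; comment 7 below spells it out in full):

// Hoisted form, as Emscripten typically emits it: the constant offset is
// added once and the sum is reused by the heap access.
var i = (j + 8) | 0;
var v = HEAP32[i >> 2] | 0;

// "Dehoisted" form: the + 8 is pushed back into the access itself, so the
// backend can fold it into the load's addressing mode (base + displacement)
// instead of materializing j + 8 in a register.
var w = HEAP32[(j + 8) >> 2] | 0;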
Comment 3

4 years ago
(In reply to Douglas Crosher [:dougc] from comment #2)
> Looks like it does not play well with the later dehoisting of the constant
> index offsets. The dehoisting is just not working now. Perhaps the pattern
> matching broke.

Sorry, disregard the above - I got my logic reversed. I see that the dehoisting logic assumes the low bit mask has already been moved, so it is understandable that it does not work if the movement of the alignment mask is disabled.

It does seem to be very fast here, but tested on more recently translated versions of zlib - some of the fastest times I have seen. It might make a difference if Emscripten dehoists the offsets; could the benchmarks be updated to use this style now?

It would help if the exact Octane zlib file emitted by the benchmark could be attached.
(Reporter)

Comment 4

4 years ago
(In reply to Douglas Crosher [:dougc] from comment #3)
> It would help if the exact Octane zlib file emitted by the benchmark could
> be attached.

This is the official Octane benchmark:
http://octane-benchmark.googlecode.com/svn/latest/index.html

We currently don't have the option to update that benchmark. We have already requested some other changes, and they were put in the queue for whenever a new version eventually gets released. We will need to fix the regression in our engine.
Comment 5

4 years ago
Perhaps a clue. Looks like a case in which the offset could have been dehoisted.

HEAP8[$3 + ($$09 + 1) >> 0]

7fb3397ef08e 19 zlib.js:14660:211: Function _longest_match - Block 14
   0x00007fb3397ef08e:	mov    %edx,%ebx
   0x00007fb3397ef090:	add    $0x1,%ebx
   0x00007fb3397ef093:	add    %ebp,%ebx
   0x00007fb3397ef095:	movsbl (%r15,%rbx,1),%ebx
   0x00007fb3397ef09a:	movsbl (%r15,%r9,1),%esi
   0x00007fb3397ef09f:	cmp    %esi,%ebx
   0x00007fb3397ef0a1:	jne    0x7fb3397ef20b

Another:
$34 = HEAP32[$4 >> 2] | 0;
HEAPU8[(HEAP32[$5 >> 2] | 0) + ($34 + 2) >> 0] 

7fb3397ea366 90 zlib.js:12436:28: Function _deflate_slow - Block 15
   0x00007fb3397ea366:	mov    0x6c(%r15,%r8,1),%eax
   0x00007fb3397ea36b:	mov    0x38(%r15,%r8,1),%edx
   0x00007fb3397ea370:	mov    %eax,%ebx
   0x00007fb3397ea372:	add    $0x2,%ebx
   0x00007fb3397ea375:	add    %ebx,%edx
   0x00007fb3397ea377:	movzbl (%r15,%rdx,1),%edx

The code is generally looking a lot better. The low bit masking appears to be hoisted to good advantage, and the index constant offsets are often being dehoisted.

Just speculation, but running GVN before dehoisting might be frustrating the dehoisting. For some WIP the dehoisting happens early, before GVN - this was necessary for the ARM where the remainder could still be hoisted by GVN. My current opinion is that it might be best to have two GVN passes, and to do the dehoisting after range analysis and then rerun GVN - but this might significantly increase compilation time.
Comment 6

4 years ago
(In reply to Douglas Crosher [:dougc] from comment #3)
> It might make a
> difference if Emscripten dehoists the offsets

Could you explain what you mean by this?

> HEAP8[$3 + ($$09 + 1) >> 0]

Good find; reassociating addition could indeed expose more opportunities for constant-offset folding.
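To spell out that reassociation on the expression above (a sketch, not actual compiler output):

// As written, the + 1 is buried in the inner add, so it is not visible as a
// constant operand of the index that feeds the load:
HEAP8[$3 + ($$09 + 1) >> 0]

// Reassociated, the + 1 sits at the top of the index expression, where the
// effective-address folding could turn it into the load's displacement
// (e.g. movsbl 0x1(%r15,%rbx,1),%ebx) instead of a separate add $0x1:
HEAP8[(($3 + $$09) + 1) >> 0]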

> 7fb3397ef08e 19 zlib.js:14660:211: Function _longest_match - Block 14
>    0x00007fb3397ef08e:	mov    %edx,%ebx
>    0x00007fb3397ef090:	add    $0x1,%ebx
>    0x00007fb3397ef093:	add    %ebp,%ebx
>    0x00007fb3397ef095:	movsbl (%r15,%rbx,1),%ebx
>    0x00007fb3397ef09a:	movsbl (%r15,%r9,1),%esi
>    0x00007fb3397ef09f:	cmp    %esi,%ebx
>    0x00007fb3397ef0a1:	jne    0x7fb3397ef20b

This is 64-bit code; the reassociation opportunity is interesting. However, the regression in this bug appears to affect 32-bit machines only; 64-bit actually saw a speedup on zlib with the offending patch.

> Just speculation, but running GVN before dehoisting might be frustrating the
> dehoisting. For some WIP the dehoisting happens early, before GVN - this was
> necessary for the ARM where the remainder could still be hoisted by GVN. My
> current opinion is that it might be best to have two GVN passes, and to do
> the dehoisting after range analysis and then rerun GVN - but this might
> significantly increase compilation time.

We're either not actually running GVN before dehoisting right now, or I'm misunderstanding what you mean by dehoisting. Could you explain what you mean here, if possible without using the word "dehoisting" :)?
Flags: needinfo?(sunfish)
Comment 7

4 years ago
(In reply to Dan Gohman [:sunfish] from comment #6)
> (In reply to Douglas Crosher [:dougc] from comment #3)
> > It might make a
> > difference if Emscripten dehoists the offsets
> 
> Could you explain what you mean by this?

Example below.

> > HEAP8[$3 + ($$09 + 1) >> 0]
> 
> Good find; reassociating addition could indeed expose more opportunities for
> constant-offset folding.

The patch might have frustrated this, possible example below.
 
> > 7fb3397ef08e 19 zlib.js:14660:211: Function _longest_match - Block 14
> >    0x00007fb3397ef08e:	mov    %edx,%ebx
> >    0x00007fb3397ef090:	add    $0x1,%ebx
> >    0x00007fb3397ef093:	add    %ebp,%ebx
> >    0x00007fb3397ef095:	movsbl (%r15,%rbx,1),%ebx
> >    0x00007fb3397ef09a:	movsbl (%r15,%r9,1),%esi
> >    0x00007fb3397ef09f:	cmp    %esi,%ebx
> >    0x00007fb3397ef0a1:	jne    0x7fb3397ef20b
> 
> This is 64-bit code; the reassociation opportunity is interesting, however
> the regression in this bug appears to affect 32-bit machines only. 64-bit
> actually saw a speedup on zlib with the offending patch.

Yes, the x64 code runs very fast here, so I was surprised. FWIW, on quick inspection I see no obvious issues on x86, although there is a suspected missed optimization opportunity. Here's a hot block with some low bit masking:

f735918b f7 zlib.js:12567:32: Function _deflate_slow - Block 68
   0xf735918b: cmp    $0x8ffffc4,%edi
   0xf7359191: ja     0xf7359fff
   0xf7359197: mov    -0x16afffc8(%edi),%eax
   0xf735919d: mov    0x20(%esp),%ecx
   0xf73591a1: add    $0x3,%ecx  <<<< missed optimization??
   0xf73591a4: add    %ecx,%eax
   0xf73591a6: cmp    $0x8ffffff,%eax
   0xf73591ab: ja     0xf735a009
   0xf73591b1: movzbl -0x16b00000(%eax),%eax
   0xf73591b8: cmp    $0x8ffffb4,%edi
   0xf73591be: ja     0xf735a017
   0xf73591c4: mov    -0x16afffb8(%edi),%edx
   0xf73591ca: cmp    $0x8ffffa4,%edi
   0xf73591d0: ja     0xf735a028
   0xf73591d6: mov    -0x16afffa8(%edi),%ecx
   0xf73591dc: shl    %cl,%edx
   0xf73591de: xor    %edx,%eax
   0xf73591e0: cmp    $0x8ffffa8,%edi
   0xf73591e6: ja     0xf735a039
   0xf73591ec: mov    -0x16afffac(%edi),%ecx
   0xf73591f2: and    %ecx,%eax
   0xf73591f4: cmp    $0x8ffffb4,%edi
   0xf73591fa: ja     0xf735a043
   0xf7359200: mov    %eax,-0x16afffb8(%edi)
   0xf7359206: cmp    $0x8ffffb8,%edi
   0xf735920c: ja     0xf735a058
   0xf7359212: mov    -0x16afffbc(%edi),%ebx
   0xf7359218: lea    (%ebx,%eax,2),%edx
   0xf735921b: cmp    $0x8ffffbc,%edi
   0xf7359221: ja     0xf735a069
   0xf7359227: mov    -0x16afffc0(%edi),%ebx
   0xf735922d: cmp    $0x8ffffc8,%edi
   0xf7359233: ja     0xf735a07a
   0xf7359239: mov    -0x16afffcc(%edi),%ebp
   0xf735923f: and    %esi,%ebp
   0xf7359241: lea    (%ebx,%ebp,2),%eax
   0xf7359244: and    $0xfffffffe,%eax
   0xf7359247: and    $0xfffffffe,%edx
   0xf735924a: cmp    $0x8fffffe,%edx
   0xf7359250: ja     0xf735a084
   0xf7359256: movswl -0x16b00000(%edx),%ecx
   0xf735925d: cmp    $0x8fffffe,%eax
   0xf7359262: ja     0xf735926f
   0xf7359268: mov    %cx,-0x16b00000(%eax)
   0xf735926f: cmp    $0x8fffffe,%edx
   0xf7359275: ja     0xf7359282
   0xf735927b: mov    %si,-0x16b00000(%edx)

The bounds checks stand out, but that is not a change from this patch. FWIW, here is the same block using index masking ;) Not just a shameless plug, as it does show one missed optimization above.

f76e56ce 6e zlibmac.js:12186:32: Function _deflate_slow - Block 68
   0xf76e56ce: mov    -0x167fffc8(%edi),%eax
   0xf76e56d4: add    0x1c(%esp),%eax
   0xf76e56d8: and    $0x7ffffff,%eax
   0xf76e56dd: movzbl -0x167ffffd(%eax),%eax  <<< + 3
   0xf76e56e4: mov    -0x167fffb8(%edi),%edx
   0xf76e56ea: mov    -0x167fffa8(%edi),%ecx
   0xf76e56f0: shl    %cl,%edx
   0xf76e56f2: xor    %edx,%eax
   0xf76e56f4: mov    -0x167fffac(%edi),%ecx
   0xf76e56fa: and    %ecx,%eax
   0xf76e56fc: mov    %eax,-0x167fffb8(%edi)
   0xf76e5702: mov    -0x167fffbc(%edi),%ecx
   0xf76e5708: lea    (%ecx,%eax,2),%edx
   0xf76e570b: mov    -0x167fffc0(%edi),%ebx
   0xf76e5711: mov    -0x167fffcc(%edi),%ebp
   0xf76e5717: and    %esi,%ebp
   0xf76e5719: lea    (%ebx,%ebp,2),%eax
   0xf76e571c: and    $0x7fffffe,%eax
   0xf76e5721: and    $0x7fffffe,%edx
   0xf76e5727: movswl -0x16800000(%edx),%ecx
   0xf76e572e: mov    %cx,-0x16800000(%eax)
   0xf76e5735: mov    %si,-0x16800000(%edx)
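For readers not following the WIP, 'index masking' here means replacing the per-access bounds check with an AND that clamps the index into a power-of-two heap; roughly (a sketch with a made-up HEAP_LENGTH, not necessarily the exact scheme in the WIP):

// Bounds-checked form: pseudocode for what the first dump does - compare the
// index against the heap length and take an out-of-line path when it is out
// of range (the cmp/ja pairs).
if ((i >>> 0) < (HEAP_LENGTH >>> 0)) v = HEAPU8[i] | 0;

// Masked form: with a power-of-two heap, AND the index with HEAP_LENGTH - 1
// so the access can never leave the heap and the branch disappears (the
// and $0x7ffffff in the second dump; 16-bit accesses also clear bit 0, hence
// and $0x7fffffe).
v = HEAPU8[i & (HEAP_LENGTH - 1)] | 0;

Note that the two forms behave differently for out-of-range indices (wrap into the heap vs. the out-of-line path), which is part of why this is only a comparison against a WIP.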

> > Just speculation, but running GVN before dehoisting might be frustrating the
> > dehoisting. For some WIP the dehoisting happens early, before GVN - this was
> > necessary for the ARM where the remainder could still be hoisted by GVN. My
> > current opinion is that it might be best to have two GVN passes, and to do
> > the dehoisting after range analysis and then rerun GVN - but this might
> > significantly increase compilation time.
> 
> We're either not actually running GVN before dehoisting right now, or I'm
> misunderstanding what you mean by dehoisting. Could you explain what you
> mean here, if possible without using the word "dehoisting" :)?

By 'dehoisting' I mean the movement of the constant index offset to the leaf of the index computation, and into the instruction's addressing-mode calculation, even if that duplicates computation across addressing modes. Yes, Ion currently does this after GVN. There might be good reason to do this before GVN, especially for the ARM.

For example, I understand clang hoists 'j + 8' in the following, and that by default Emscripten emits this:
var i = j + 8;
u32[i>>2] = v;
k = u32[i>>2] + 10;

Perhaps it would be useful for Emscripten to now 'dehoist' these offsets, to simplify pattern matching in the compiler:
u32[(j + 8)>>2] = v;
k = u32[(j + 8)>>2] + 10;

GVN is run, reverting this, and it's back to:
i = j + 8;
u32[i>>2] = v;
k = u32[i>>2] + 10;

Now, might this have frustrated the later pattern matching that wants to 'dehoist' it back to:
u32[(j + 8)>>2] = v;
k = u32[(j + 8)>>2] + 10;

Another example, common in some asm.js code, and with the ARM in mind:
u32[(j + 0x10004)>>2] = v;
u32[(j + 0x10008)>>2] = v;

With the limited immediate offset range of load/store instructions on the ARM, this becomes:
u32[(j + 0x10000 + 4)>>2] = v;
u32[(j + 0x10000 + 8)>>2] = v;

it's useful to run GVN, giving:
i = (j + 0x10000);
u32[(i + 4)>>2] = v;
u32[(i + 8)>>2] = v;

Thus my current direction is to explore 'dehoisting' the constant offsets early, before a GVN pass. If Emscripten has done much of the work, then the pattern matching is simplified - and this is the reason for requesting an update of the benchmark. For the ARM this would 'dehoist' small offsets, leaving the remainder. This optimization depends on type information, so it is most productive after range analysis (before de-beta). Then run a GVN pass to hoist the remainders, for the benefit of the ARM.
Hardware: x86_64 → x86
(Reporter)

Comment 8

4 years ago
Any update on this regression? It has now been in our tree for 6 days. That can be a bad thing, since it could hide other potential regressions, so it would be good to have a solution quickly. Would it be possible to back out the original patch to give more time to investigate and find a fix?
Comment 9

4 years ago
The patches were overall large (8-11%) wins across the board, so I don't think we should back this out; it seems more like an unfortunate coincidence for x86 regalloc/codegen.  It would be good to investigate what's happening, though, since it could help point out further improvements.  Also, Dan and I planned out a 32-bit bounds-check-hoisting optimization that should be significant and put us back to winning on 32-bit Octane-zlib.
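For context, a rough sketch of what bounds-check hoisting means here (illustrative pseudocode with a made-up HEAP_LENGTH, not the planned implementation): instead of checking every heap access inside a loop, a single check derived from the loop bounds is done up front.

// Today (pseudocode for the emitted code): every iteration re-checks the index.
for (i = start | 0; (i | 0) < (end | 0); i = (i + 1) | 0) {
  if ((i >>> 0) < (HEAP_LENGTH >>> 0)) sum = (sum + (HEAPU8[i] | 0)) | 0;
}

// Hoisted: one range check on the loop bounds, none inside the loop. A real
// implementation would still need a fallback path when the up-front check
// fails, so that behaviour is unchanged.
if ((end >>> 0) <= (HEAP_LENGTH >>> 0)) {
  for (i = start | 0; (i | 0) < (end | 0); i = (i + 1) | 0)
    sum = (sum + (HEAPU8[i] | 0)) | 0;
}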
(Reporter)

Comment 10

4 years ago
(In reply to Luke Wagner [:luke] from comment #9)
> The patches were overall large (8-11%) wins across the board, so I don't
> think we should back this out; it seems more like an unfortunate coincidence
> for x86 regalloc/codegen.  It would be good to investigate what's happening,
> though, since it could help point out further improvements.  Also, Dan and I
> planned out a 32-bit bounds-check-hoisting optimization that should be
> significant and put us back to winning on 32-bit Octane-zlib.

I agree this is awesome! And I want this improvement as much as you do.

But:
* The longer it takes for a fix to get committed, the harder it is to confirm the regression is fully fixed.
* The longer a regression stays in the tree, the more chance it hides another regression.
* The fix seems to be a new optimization, which might not land in FF39, might not fully fix the regression, might get backed out due to errors ... So waiting means a higher chance that we ship FF39 with the asm.js improvement and the zlib regression.
* This zlib regression puts us behind v8 on Octane again.
* ...

So given the above, I was just thinking it would be better to have FF40 with the asm.js improvements and no zlib regression than FF39 with the asm.js improvements and the zlib regression, and only FF40 with the zlib fix.

That is my reasoning for backing out this patch until the '32-bit bounds-check-hoisting optimization' lands and we can confirm it fixes the zlib regression.

(IMHO)
Comment 11

4 years ago
Perhaps just restore the support in EffectiveAddressAnalysis.cpp, so that the new alignment mask optimization can be toggled on or off easily without also destroying the prior dehoisting optimization. Then it could easily be disabled on non-x64, and it would be much easier to do a differential analysis on the emitted code to try to spot the issue.
Comment 12

4 years ago
I looked at code dumps before and after the patch; the only significant difference I'm seeing among the hot functions is in register allocation. The patch significantly reduces the total number of instructions emitted, but there are some MoveGroups in some hot code which look like they might be worse. I haven't been able to find anything which suggests a problem with the patch itself.

Also, octane-zlib is quite obsolete at this point. Emscripten hasn't been generating code like the kind of code in that benchmark for quite some time now. A more relevant measure of zlib performance today is the asmjs-apps version of zlib, and it happens that both x86 and x64 saw speedups on asmjs-apps zlib with this patch.

So we should back out this patch if the x86 Octane score is more important than the x64 Octane score (which improved with this patch) and than x86 and x64 performance on actual real-world code (which saw significant and broad speedups).
Comment 13

4 years ago
Also, the regression is 32-bit OSX-only, not Linux or Windows, so it is really pretty insignificant. For a while I thought I saw a regression on Windows too, but looking back now it appears to have been a machine-wide anomaly that affected v8 as well, and it went away. Lastly, the pending 32-bit bounds-check-hoisting optimization is mostly unrelated to this patch (not a fix), so its landing should be independent.
Comment 14

2 years ago
2 years later, I don't think this is going anywhere.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Flags: needinfo?(sunfish)
Resolution: --- → WONTFIX