Bug 472791 Opened 11 years ago Closed 11 years ago

TM: ebp is not aligned

Categories

(Core :: JavaScript Engine, defect)

x86
macOS

Tracking

RESOLVED FIXED

People

(Reporter: gal, Assigned: gal)

References

Details

(Keywords: fixed1.9.1)

Attachments

(5 files)

We currently push 3 non-volatile registers onto the stack before establishing our own frame. This leaves ebp misaligned (not 8-byte aligned), which can hurt performance (e.g. when storing quad words).
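A minimal sketch of the effect, assuming esp is 8-byte aligned just before the call (register order is hypothetical; the real trace-entry prologue differs in detail):

  ; call pushes the 4-byte return address       -> esp = base - 4
  push ebx                                      ; esp = base - 8
  push esi                                      ; esp = base - 12
  push edi                                      ; esp = base - 16
  push ebp                                      ; esp = base - 20
  mov  ebp, esp                                 ; ebp = base - 20, i.e. 4 mod 8
  ; pushing one extra word before the mov (the hack in attachment 356218)
  ; makes ebp = base - 24, which is 8-byte aligned again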
Blocks: 471822
Assignee: general → gal
No longer blocks: 471822
Blocks: 471822
Turns out edwsmith had the same hack in place before but it got lost with the nj2 merge business.
This seems to fix the problem in my current test environment. I will now do some testing with trunk TM to confirm.
For future reference, David identified the root cause by inserting profiling calls around individual ops on the trace and bisecting down to a handful of bytecodes (or rather their native-code equivalents) executing 5-6x slower. The biggest slowdown was an fmul, and by individually nop-ing out the spills and re-loads (two re-loads and one spill for the fmul operands and result) we identified that the spills were slow, which eventually brought up the topic of cache or page faults or alignment. By varying SP by large offsets David excluded page faults, and since we couldn't think of a way this would cause cache thrashing, we eventually homed in on the misalignment of ebp. The part that remains unanswered is why SP = X was slow, but SP = X + 32 was fast (both against an unaligned EBP). x86 is a mystery.
Looks like we got it! See below for trunk SS, with this patch vs. without. The huge win on nbody depends on shell environment as I showed earlier, so don't pay too much attention to the total. But we also get some nice wins on other math-heavy stuff. Let's get this in!

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.177x as fast    1472.1ms +/- 0.8%   1250.8ms +/- 0.1%     significant

=============================================================================

  3d:                  1.124x as fast     200.8ms +/- 0.3%    178.6ms +/- 0.2%     significant
    cube:              1.034x as fast      51.6ms +/- 0.4%     49.9ms +/- 0.2%     significant
    morph:             1.111x as fast      38.1ms +/- 0.3%     34.3ms +/- 0.4%     significant
    raytrace:          1.176x as fast     111.1ms +/- 0.5%     94.4ms +/- 0.3%     significant

  access:              2.19x as fast      333.3ms +/- 3.3%    152.1ms +/- 0.2%     significant
    binary-trees:      ??                  45.4ms +/- 1.2%     45.7ms +/- 0.4%     not conclusive: might be *1.005x as slow*
    fannkuch:          1.045x as fast      66.0ms +/- 0.3%     63.1ms +/- 0.3%     significant
    nbody:             7.18x as fast      208.1ms +/- 5.4%     29.0ms +/- 0.4%     significant
    nsieve:            *1.029x as slow*    13.9ms +/- 0.8%     14.3ms +/- 0.9%     significant

  bitops:              1.029x as fast      46.8ms +/- 0.5%     45.5ms +/- 0.5%     significant
    3bit-bits-in-byte: ??                   1.9ms +/- 5.0%      1.9ms +/- 4.6%     not conclusive: might be *1.032x as slow*
    bits-in-byte:      *1.043x as slow*    11.6ms +/- 1.2%     12.1ms +/- 0.9%     significant
    bitwise-and:       ??                   2.3ms +/- 6.2%      2.3ms +/- 5.8%     not conclusive: might be *1.017x as slow*
    nsieve-bits:       1.067x as fast      31.0ms +/- 0.5%     29.0ms +/- 0.4%     significant

  controlflow:         ??                  37.0ms +/- 0.4%     37.1ms +/- 0.2%     not conclusive: might be *1.002x as slow*
    recursive:         ??                  37.0ms +/- 0.4%     37.1ms +/- 0.2%     not conclusive: might be *1.002x as slow*

  crypto:              1.039x as fast      73.9ms +/- 0.4%     71.1ms +/- 0.5%     significant
    aes:               1.028x as fast      42.1ms +/- 0.4%     40.9ms +/- 0.7%     significant
    md5:               1.046x as fast      23.6ms +/- 0.7%     22.6ms +/- 0.7%     significant
    sha1:              1.076x as fast       8.2ms +/- 1.3%      7.6ms +/- 1.9%     significant

  date:                1.006x as fast     247.2ms +/- 0.2%    245.8ms +/- 0.2%     significant
    format-tofte:      1.006x as fast     131.8ms +/- 0.2%    131.0ms +/- 0.3%     significant
    format-xparb:      1.006x as fast     115.4ms +/- 0.2%    114.8ms +/- 0.2%     significant

  math:                1.132x as fast      55.1ms +/- 0.5%     48.7ms +/- 0.5%     significant
    cordic:            1.177x as fast      26.3ms +/- 0.6%     22.4ms +/- 0.6%     significant
    partial-sums:      1.073x as fast      19.2ms +/- 0.7%     17.9ms +/- 0.4%     significant
    spectral-norm:     1.138x as fast       9.6ms +/- 1.6%      8.4ms +/- 1.8%     significant

  regexp:              1.007x as fast      63.4ms +/- 0.5%     62.9ms +/- 0.3%     significant
    dna:               1.007x as fast      63.4ms +/- 0.5%     62.9ms +/- 0.3%     significant

  string:              1.013x as fast     414.6ms +/- 0.8%    409.1ms +/- 0.2%     significant
    base64:            1.010x as fast      18.3ms +/- 0.7%     18.1ms +/- 0.6%     significant
    fasta:             1.023x as fast      83.6ms +/- 0.3%     81.7ms +/- 0.4%     significant
    tagcloud:          -                  128.0ms +/- 2.6%    126.8ms +/- 0.6% 
    unpack-code:       1.011x as fast     148.2ms +/- 0.3%    146.6ms +/- 0.3%     significant
    validate-input:    1.018x as fast      36.5ms +/- 0.4%     35.8ms +/- 0.4%     significant
This seems to be totally machine dependent. I do not get any wins on my MacBook Pro, but it does somewhat reduce the wild swings I used to get (+/- 20% down to +/- 5% on runs of bench.sh).
Attachment #356218 - Flags: review?(graydon)
Comment on attachment 356218 [details] [diff] [review]
Push an extra word to make sure ebp is aligned again. This is more of a hack. Will provide a better patch next.

Picking likely TM nanojit peer to review.

/be
(In reply to comment #4)
> The part that remains unanswered
> is why SP = X was slow, but SP = X + 32 was fast (both against an unaligned
> EBP). x86 is a mystery.

Cc'ing Moh.

Kudos to dmandelin, he never gives up and never surrenders (Galaxy Quest, time for another showing!).

/be
Comment on attachment 356218 [details] [diff] [review]
Push an extra word to make sure ebp is aligned again. This is more of a hack. Will provide a better patch next.

Thanks for tracking this down; sorry for introducing it.
Attachment #356218 - Flags: review?(graydon) → review+
We could turn on alignment checking on x86 by setting bit 18 (AC) of EFLAGS before calling into jitted code in js_ExecuteTree and clearing it afterwards, using pushf and popf, if we wanted.
Requesting blocking since it's a 17% perf gain and already fixed.
Flags: blocking1.9.1?
Whiteboard: fixed-in-tracemonkey
(In reply to comment #12)
> Requesting blocking since its a 17% perf gain and already fixed.

Yes please. Great work dmandelin and andreas.
Flags: blocking1.9.1? → blocking1.9.1+
Moh, regarding comment 11, Andreas tried this on OS X with the following code:

/*********************/
/* Set the AC (alignment-check) bit, bit 18 of EFLAGS. */
static void EnableAlignmentCheck() {
  asm volatile ("pushf; orl $(1<<18), (%esp); popf");
}

/* Clear the AC bit again. */
static void DisableAlignmentCheck() {
  asm volatile ("pushf; andl $(~(1<<18)), (%esp); popf");
}

int main() {
  EnableAlignmentCheck();
  char x[16];
  *(double*)(&x[1]) *= 5.5;  /* deliberately misaligned 8-byte access */
  DisableAlignmentCheck();
  return 0;
}
/*********************/

The program executed without error.  When I tried the same thing on Linux in a VM, I got an error.  Alignment checks only occur when both AC in EFLAGS and AM in CR0 are set, so I suspect OS X isn't setting AM in CR0; I'd like to verify this is why the program runs without error on OS X.

As far as I know, I can't examine CR0 from userspace and would have to write a kernel extension to examine it with mov, but SMSW will reveal the lower 16 bits of CR0 and, according to the docs, leaves the upper 16 bits undefined.  However, in my testing, it *appears* the actual value of CR0 is copied over: on Linux I get 0x80050033, while on OS X I get 0x80010033, on an Intel Core 2 Duo T7500 @ 2.20GHz.  (Linux is inside a VMware VM, although I don't think this matters.)  Both values are consistent with the documentation for the bits of CR0 and with the values I think each OS is setting.  For that specific CPU, is it the case that the upper 16 bits of the SMSW destination happen to be the actual upper 16 bits of CR0?  If so, this would confirm my hypothesis; if not, do you happen to know whether OS X swallows these exceptions and prevents applications from handling them via signals or similar mechanisms, or do you have any other ideas about what might be happening?
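For reference, a minimal sketch of the SMSW probe described above (an illustration, not code from the bug; the upper 16 bits read this way are architecturally undefined, so the AM bit it reports is only a hint):

/*********************/
#include <stdio.h>

int main() {
  unsigned int msw = 0;
  /* SMSW stores the machine status word (the low word of CR0); with a
     32-bit destination the upper 16 bits are architecturally undefined,
     though some CPUs appear to copy the real upper half of CR0. */
  asm volatile ("smsw %0" : "=r" (msw));
  printf("smsw = 0x%08x, AM (bit 18) apparently %s\n",
         msw, (msw & (1u << 18)) ? "set" : "clear");
  return 0;
}
/*********************/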
Sorry for the delay. I've been out of town since the middle of the week with limited network access. It seems I missed an opportunity and a lot of fun. Great job by David and Andreas, and congratulations to the team.

The problem is micro-architectural and is what we call "split loads/stores." If an unaligned access crosses a cache-line boundary, the overhead is particularly high. In the attachment, I wrote a simple test that measures the time for a computation involving a double multiply, with a wrapper loop that steps through addresses. Here's part of the output:

offset%64  time
  0   94
  1   94
  2   94
...
 54   94
 55   93
 56   94
 57  250
 58  266
 59  234
 60  266
 61  250
 62  234
 63  250
  0   94
  1   94
  2   93
  3   94
...

So, you can see that as soon as we cross the cache line, the execution time increases by a factor of about 2.5x. On a related point, note that if two different processors write to an address whose data crosses a cache line, the operations are not necessarily atomic; that is, the result may be something different from both of the values the processors wanted to write.

I'll get back regarding Jeff's question. This might have something to do with the ABI. But for now, note that x[0] is not necessarily aligned on a 16-byte boundary, as it is declared with char type.
BTW, Intel VTune has these two events: "Retired Split Loads" and "Retired Split Stores". You can run an application and configure VTune to show the instances of these events (you can also control the number of times after which the event is triggered).

- moh
One can infer from these results that the cache line of the machine that ran this experiment is 64 bytes.
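A minimal reconstruction of the kind of timing loop Moh describes (a sketch only, assuming a 64-byte cache line and rdtsc-based timing; the actual test is in the attachment):

/*********************/
#include <stdio.h>

/* Read the CPU timestamp counter. */
static inline unsigned long long rdtsc(void) {
  unsigned int lo, hi;
  asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((unsigned long long)hi << 32) | lo;
}

int main() {
  /* 64-byte-aligned buffer; offsets 57..63 make an 8-byte double
     straddle the cache-line boundary. */
  static char buf[256] __attribute__((aligned(64)));
  for (int off = 0; off < 64; off++) {
    volatile double *p = (volatile double *)(buf + off);
    *p = 1.0;
    unsigned long long start = rdtsc();
    for (int i = 0; i < 1000000; i++)
      *p *= 1.000001;   /* double multiply at a varying offset */
    printf("%3d %12llu\n", off, rdtsc() - start);
  }
  return 0;
}
/*********************/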
Attachment #356378 - Attachment mime type: application/octet-stream → text/plain
Attachment #356379 - Attachment mime type: application/octet-stream → text/plain
Thanks for the explanation of the splits. I had suspected that splitting cache lines would cost something, but I assumed it would be relatively small.

But I'm not convinced that explains everything I measured. In particular, in your experiment the slowdown should be periodic, recurring every 64 bytes of offset. And, suspecting cache effects of *some* kind, I looked to see whether the slowdown had a period of 64 in the stack offset. In fact, I couldn't find a period at all; specifically, adding another 64 to the bump made the slowdown go away.

Revisiting this just now, I noticed that ebp in the bad case is 0xbfff01c. And the load/store that seemed to be causing the worst problem (estimated to be taking a 100 cycle penalty for either load or store) was for ebp-0x20. So it actually crosses a page boundary.

For some reason, although I noticed that before, I didn't properly check for a 4096-byte period--probably because the only effect I knew about with a 4096-byte period was page faults, which take much longer than 100 cycles. I modified your program to look for page boundary splits and I did see a second effect:

     iters mod-64 mod-4096       clocks clk/iter       ms  us/iter
  10000000      0        0    122943183     12.3    56.02    0.006
  10000000     52       52    113016651     11.3    51.50    0.005
  10000000     60       60    445824247     44.6   203.16    0.020
  10000000     60     1020    423288173     42.3   192.89    0.019
  10000000     60     4092   2624075102    262.4  1195.78    0.120
  20000000      0        0    223726294     11.2   101.95    0.005
  20000000     52       52    225649545     11.3   102.83    0.005
  20000000     60       60    874256284     43.7   398.39    0.020
  20000000     60     1020    856285309     42.8   390.21    0.020
  20000000     60     4092   5205228468    260.3  2371.97    0.119
  40000000      0        0    434348398     10.9   197.93    0.005
  40000000     52       52    434259166     10.9   197.89    0.005
  40000000     60       60   1665305345     41.6   758.87    0.019
  40000000     60     1020   1684272293     42.1   767.51    0.019
  40000000     60     4092  10416002795    260.4  4746.48    0.119
  80000000      0        0    885687583     11.1   403.60    0.005
  80000000     52       52    909311172     11.4   414.36    0.005
  80000000     60       60   3367102244     42.1  1534.36    0.019
  80000000     60     1020   3343114093     41.8  1523.42    0.019
  80000000     60     4092  20906631328    261.3  9526.95    0.119

Do you know what this effect is?
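For reference, a sketch of how such a page-straddling operand can be constructed (my own reconstruction, assuming 4 KB pages and 32-bit pointers; the actual modified test is in the attachment):

/*********************/
#include <stdio.h>

int main() {
  static char buf[3 * 4096];
  /* Round up to the next 4096-byte page boundary, then back off 4
     bytes so an 8-byte double sits at offset 4092 mod 4096 and
     straddles the page, as in the table above. */
  char *page = (char *)(((unsigned int)buf + 4096) & 0xfffff000);
  volatile double *split = (volatile double *)(page + 4092);
  *split = 1.0;
  for (int i = 0; i < 10000000; i++)
    *split *= 1.000001;   /* every access crosses the page boundary */
  printf("%f\n", *split);
  return 0;
}
/*********************/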
A TLB miss may have a fast handler -- MIPS and similar RISC chips from my salad days did: much faster than a full page fault or even a validity fault, but slower than a cache penalty.

/be
Right. Unaligned accesses that cross cache-line boundaries are expensive; the ones that cross page boundaries are far more expensive. With an access that crosses a page boundary, a TLB miss may occur on the first access; on repeated accesses to the same address, the TLB miss may not recur (both pages are now in memory), but a HW exception in the micro-code that handles this case may still occur repeatedly. Such HW exceptions are micro-architectural and not specified in the architecture specification, as they depend on the particular implementation.

The IA specification, however, says that:

http://download.intel.com/design/processor/manuals/253668.pdf

17.18.11 Operands Split Across Segments and/or Pages
On the P6 family, Pentium, and Intel486 processor FPUs, when the first half of an operand to be written is inside a page or segment and the second half is outside, a memory fault can cause the first half to be stored but not the second half. In this situation,
the Intel 387 math coprocessor stores nothing.

BTW, there is a bug in my test that propagated to Dave's. Here's the correction:
ptr = (char*) (((unsigned int) ptr + 4096) & 0xfffff000);  /* round up to the next 4096-byte page boundary */
http://hg.mozilla.org/mozilla-central/rev/a9556623dcf4
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Back in June, Jungwoo Ha (one of our interns) looked at the EBP alignment in Tamarin. Instead of using "push/pop ebp" to adjust ESP, Jungwoo used add/sub instructions on ESP (note that push/pop are memory operations while add/sub execute in the ALU), and we got some slight improvements. It might be worth trying in TM as well. Here's the speedup we got on TT at the time:

test                                                   avm    avm2     %sp
 
sunspider/access-binary-trees.as                      84.0    84.0     0.0
sunspider/access-fannkuch.as                         138.0   136.0     1.4
sunspider/access-nbody.as                            160.0   160.0     0.0
sunspider/access-nsieve.as                            60.0    60.0     0.0
sunspider/bitops-3bit-bits-in-byte.as                 14.0    14.0     0.0
sunspider/bitops-bits-in-byte.as                      40.0    40.0     0.0
sunspider/bitops-bitwise-and.as                      206.0   201.0     2.4
sunspider/bitops-nsieve-bits.as                       52.0    52.0     0.0
sunspider/controlflow-recursive.as                    30.0    29.0     3.3
sunspider/crypto-aes.as                              169.0   169.0     0.0
sunspider/crypto-sha1.as                              39.0    39.0     0.0
sunspider/math-cordic.as                              52.0    52.0     0.0
sunspider/math-partial-sums.as                       196.0   194.0     1.0
sunspider/math-spectral-norm.as                       33.0    33.0     0.0
sunspider/s3d-cube.as                                155.0   154.0     0.6
sunspider/s3d-morph.as                                77.0    75.0     2.6
sunspider/string-fasta.as                            159.0   155.0     2.5
 
Of course, freeing up EBP and going to ESP-based addressing has more potential, as you gain an extra register for general use.

- moh
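A sketch of the idea (an illustration, not Jungwoo's actual patch): reserve the frame with a single sub and save registers with plain stores, so the ESP adjustment itself stays in the ALU:

  ; push-based prologue           ; add/sub-based prologue
  push ebx                        sub  esp, 12
  push esi                        mov  [esp+8], ebx
  push edi                        mov  [esp+4], esi
                                  mov  [esp+0], edi

The register saves become ordinary stores, and ESP is updated once by an ALU op instead of implicitly by each push.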
Whiteboard: fixed-in-tracemonkey → [needs 1.9.1 landing]
Flags: in-testsuite-
Flags: in-litmus-
I figured out how to munge the code given in comment 14 to compile and run on Windows, so here it is for future reference:

/********************************/
/* Set the AC (alignment-check) bit, bit 18 of EFLAGS. */
static void EnableAlignmentCheck() {
  __asm
  {
    pushf
    or [ESP], (1<<18)
    popf
  }
}

/* Clear the AC bit again. */
static void DisableAlignmentCheck() {
  __asm
  {
    pushf
    and [ESP], ~(1<<18)
    popf
  }
}

int main() {
  EnableAlignmentCheck();
  char x[16];
  *(double*)(&x[1]) *= 5.5;  /* deliberately misaligned 8-byte access */
  DisableAlignmentCheck();
  return 0;
}
/********************************/

I filed bug 476037 to look into setting AC when executing traced code so as not to clutter up this bug with that orthogonal task.