Closed Bug 472791 Opened 16 years ago Closed 16 years ago

TM: ebp is not aligned

Categories

(Core :: JavaScript Engine, defect)

Platform: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED

People

(Reporter: gal, Assigned: gal)

References

Details

(Keywords: fixed1.9.1)

Attachments

(5 files)

We currently push 3 non-volatile registers onto the stack before establishing our own frame. This causes ebp to not be 8-byte aligned, which can be bad for performance (e.g. when storing quad words).
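For illustration, here is the stack arithmetic, assuming esp is 16-byte aligned at the call site per the OS X ABI (the listing is a sketch, not the actual generated prologue):

/*********************/
call  ...          ; esp % 16 == 12 (return address pushed)
push  ebx
push  esi
push  edi          ; esp % 16 == 0
push  ebp          ; esp % 16 == 12
mov   ebp, esp     ; ebp % 16 == 12, i.e. ebp % 8 == 4 -- misaligned
/*********************/

Every 8-byte spill slot at an offset from ebp that is a multiple of 8 is then only 4-byte aligned; pushing one extra word before ebp makes ebp 8-byte aligned again.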
Blocks: 471822
Assignee: general → gal
No longer blocks: 471822
Blocks: 471822
Turns out edwsmith had the same hack in place before, but it got lost with the nj2 merge business.
This seems to fix the problem in my current test environment. I will now do some testing with trunk TM to confirm.
For future reference, David identified the root cause by inserting profiling calls around individual ops on the trace and bisecting down to a bunch of bytecodes (or their native-code equivalents) executing 5-6x slower. The biggest slowdown was an fmul, and by individually nop-ing out the spills and re-loads (2 re-loads and one spill for the fmul operands and result) we identified that the spills are slow, which eventually brought up the topic of cache or page faults or alignment. By varying SP by large offsets David excluded page faults, and we couldn't think of a way this would cause cache thrashing, so we eventually homed in on mis-alignment of ebp. The part that remains unanswered is why SP = X was slow, but SP = X + 32 was fast (both against an unaligned EBP). x86 is a mystery.
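(For anyone repeating the experiment: the SP-perturbation trick can be approximated from C with a hypothetical helper like the one below. The actual instrumentation was applied to the generated code, so this is only a sketch.)

/*********************/
#include <alloca.h>

/* Shift the stack pointer down by `bump' bytes before running the code
   under test. A slowdown with a 4096-byte period suggests page effects;
   a 64-byte period suggests cache-line effects. */
static void run_with_sp_bump(unsigned bump, void (*fn)(void))
{
    volatile char *pad = (volatile char *) alloca(bump ? bump : 1);
    pad[0] = 0;    /* keep the allocation live */
    fn();
}
/*********************/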
Looks like we got it! See below for trunk SS, with this patch vs. without. The huge win on nbody depends on shell environment as I showed earlier, so don't pay too much attention to the total. But we also get some nice wins on other math-heavy stuff. Let's get this in!

TEST                   COMPARISON            FROM                  TO              DETAILS
=============================================================================
** TOTAL **:           1.177x as fast     1472.1ms +/- 0.8%   1250.8ms +/- 0.1%   significant
=============================================================================
  3d:                  1.124x as fast      200.8ms +/- 0.3%    178.6ms +/- 0.2%   significant
    cube:              1.034x as fast       51.6ms +/- 0.4%     49.9ms +/- 0.2%   significant
    morph:             1.111x as fast       38.1ms +/- 0.3%     34.3ms +/- 0.4%   significant
    raytrace:          1.176x as fast      111.1ms +/- 0.5%     94.4ms +/- 0.3%   significant
  access:              2.19x as fast       333.3ms +/- 3.3%    152.1ms +/- 0.2%   significant
    binary-trees:      ??                   45.4ms +/- 1.2%     45.7ms +/- 0.4%   not conclusive: might be *1.005x as slow*
    fannkuch:          1.045x as fast       66.0ms +/- 0.3%     63.1ms +/- 0.3%   significant
    nbody:             7.18x as fast       208.1ms +/- 5.4%     29.0ms +/- 0.4%   significant
    nsieve:            *1.029x as slow*     13.9ms +/- 0.8%     14.3ms +/- 0.9%   significant
  bitops:              1.029x as fast       46.8ms +/- 0.5%     45.5ms +/- 0.5%   significant
    3bit-bits-in-byte: ??                    1.9ms +/- 5.0%      1.9ms +/- 4.6%   not conclusive: might be *1.032x as slow*
    bits-in-byte:      *1.043x as slow*     11.6ms +/- 1.2%     12.1ms +/- 0.9%   significant
    bitwise-and:       ??                    2.3ms +/- 6.2%      2.3ms +/- 5.8%   not conclusive: might be *1.017x as slow*
    nsieve-bits:       1.067x as fast       31.0ms +/- 0.5%     29.0ms +/- 0.4%   significant
  controlflow:         ??                   37.0ms +/- 0.4%     37.1ms +/- 0.2%   not conclusive: might be *1.002x as slow*
    recursive:         ??                   37.0ms +/- 0.4%     37.1ms +/- 0.2%   not conclusive: might be *1.002x as slow*
  crypto:              1.039x as fast       73.9ms +/- 0.4%     71.1ms +/- 0.5%   significant
    aes:               1.028x as fast       42.1ms +/- 0.4%     40.9ms +/- 0.7%   significant
    md5:               1.046x as fast       23.6ms +/- 0.7%     22.6ms +/- 0.7%   significant
    sha1:              1.076x as fast        8.2ms +/- 1.3%      7.6ms +/- 1.9%   significant
  date:                1.006x as fast      247.2ms +/- 0.2%    245.8ms +/- 0.2%   significant
    format-tofte:      1.006x as fast      131.8ms +/- 0.2%    131.0ms +/- 0.3%   significant
    format-xparb:      1.006x as fast      115.4ms +/- 0.2%    114.8ms +/- 0.2%   significant
  math:                1.132x as fast       55.1ms +/- 0.5%     48.7ms +/- 0.5%   significant
    cordic:            1.177x as fast       26.3ms +/- 0.6%     22.4ms +/- 0.6%   significant
    partial-sums:      1.073x as fast       19.2ms +/- 0.7%     17.9ms +/- 0.4%   significant
    spectral-norm:     1.138x as fast        9.6ms +/- 1.6%      8.4ms +/- 1.8%   significant
  regexp:              1.007x as fast       63.4ms +/- 0.5%     62.9ms +/- 0.3%   significant
    dna:               1.007x as fast       63.4ms +/- 0.5%     62.9ms +/- 0.3%   significant
  string:              1.013x as fast      414.6ms +/- 0.8%    409.1ms +/- 0.2%   significant
    base64:            1.010x as fast       18.3ms +/- 0.7%     18.1ms +/- 0.6%   significant
    fasta:             1.023x as fast       83.6ms +/- 0.3%     81.7ms +/- 0.4%   significant
    tagcloud:          -                   128.0ms +/- 2.6%    126.8ms +/- 0.6%
    unpack-code:       1.011x as fast      148.2ms +/- 0.3%    146.6ms +/- 0.3%   significant
    validate-input:    1.018x as fast       36.5ms +/- 0.4%     35.8ms +/- 0.4%   significant
This seems to be totally machine dependent. I do not get any wins on my MacBook Pro, but it somewhat reduces the wild swings I used to get (+/- 20% down to +/- 5% on runs of bench.sh).
Attachment #356218 - Flags: review?(graydon)
Comment on attachment 356218 [details] [diff] [review]
Push an extra word to make sure ebp is aligned again. This is more of a hack. Will provide a better patch next.

Picking likely TM nanojit peer to review. /be
(In reply to comment #4)
> The part that remains unanswered is why SP = X was slow, but SP = X + 32
> was fast (both against an unaligned EBP). x86 is a mystery.

Cc'ing Moh. Kudos to dmandelin, he never gives up and never surrenders (Galaxy Quest, time for another showing!). /be
Comment on attachment 356218 [details] [diff] [review]
Push an extra word to make sure ebp is aligned again. This is more of a hack. Will provide a better patch next.

Thanks for tracking this down; sorry for introducing it.
Attachment #356218 - Flags: review?(graydon) → review+
We could turn on alignment checking in x86 by modifying bit 18 (AC) of EFLAGS before and after calling into jitted code in js_ExecuteTree if we wanted, using pushf and popf.
Requesting blocking since it's a 17% perf gain and already fixed.
Flags: blocking1.9.1?
Whiteboard: fixed-in-tracemonkey
(In reply to comment #12)
> Requesting blocking since it's a 17% perf gain and already fixed.

Yes please. Great work dmandelin and andreas.
Flags: blocking1.9.1? → blocking1.9.1+
Moh, regarding comment 11, Andreas tried this on OS X with the following code:

/*********************/
static void
EnableAlignmentCheck()
{
    asm volatile ("pushf; orl $(1<<18), (%esp); popf");
}

static void
DisableAlignmentCheck()
{
    asm volatile ("pushf; andl $(~(1<<18)), (%esp); popf");
}

int
main()
{
    EnableAlignmentCheck();
    char x[16];
    *(double*)(&x[1]) *= 5.5;
    DisableAlignmentCheck();
    return 0;
}
/*********************/

The program executed without error. When I tried the same thing on Linux in a VM, I got an error. Alignment checks only occur when both AC in EFLAGS and AM in CR0 are set, so I suspect OS X isn't setting AM in CR0; I'd like to verify this is why the program runs without error on OS X.

As far as I know, I can't examine CR0 from userspace and would have to write a kernel extension to be able to use mov to examine it, but SMSW will reveal the lower 16 bits of CR0 and, according to docs, leaves the upper 16 bits undefined. However, in my testing, it *appears* the actual value in CR0 is copied over: on Linux I get 0x80050033, while on OS X I get 0x80010033, on an Intel Core 2 Duo T7500 @ 2.20GHz. (Linux is inside a VMware VM, although I don't think this matters.) Both values are consistent with the documentation for the bits of CR0 and with the values I think each OS is setting.

For that specific CPU, is it the case that the upper 16 bits stored to the destination operand happen to be the actual upper 16 bits of CR0? If so, this would confirm my hypothesis; if not, do you happen to know if OS X swallows these exceptions and prevents applications from handling them via signals or similar mechanisms, or do you have any other ideas what might be happening?
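For reference, the SMSW probe described above boils down to something like this (a sketch; note the upper 16 bits of a 32-bit destination are architecturally undefined, which is exactly the point in question):

/*********************/
#include <stdio.h>

int main()
{
    unsigned int msw = 0;
    /* smsw stores the machine status word (the low word of CR0);
       it is not a privileged instruction, so this runs from user space. */
    asm volatile ("smsw %0" : "=r" (msw));
    printf("smsw: 0x%08x\n", msw);
    return 0;
}
/*********************/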
Sorry for the delay. I was out of town since the middle of the week with limited network access. It seems I missed an opportunity and a lot of fun. Great job by David and Andreas, and congratulations to the team.

The problem is micro-architectural and is what we call "split loads/stores." If an unaligned access crosses cache-line boundaries, the overhead is particularly high. In the attachment, I wrote a simple test that measures the time for a computation involving a double multiply, with a wrapper loop that iterates over the address. Here's part of the output:

%64  time
  0    94
  1    94
  2    94
...
 54    94
 55    93
 56    94
 57   250
 58   266
 59   234
 60   266
 61   250
 62   234
 63   250
  0    94
  1    94
  2    93
  3    94
...

So, you can see that as soon as we cross the cache line, the execution time increases by a 2.5x factor.

On a related point, note that if two different processors want to write to an address whose corresponding data crosses a cache line, the operations are not necessarily atomic; that is, the result may be something that is different from both values that the processors wanted to write.

I'll get back regarding Jeff's question. This might have something to do with the ABI. But for now, note that x[0] is not necessarily aligned at a 16-byte boundary, as it is defined with char type.
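(The attachment isn't inlined above; the following is a minimal sketch of the same kind of test, with rdtsc timing as an assumption -- absolute numbers will differ from the attachment's.)

/*********************/
#include <stdio.h>

static unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long) hi << 32) | lo;
}

int main()
{
    static char buf[3 * 64];
    /* Start from a 64-byte-aligned base, then slide an 8-byte double
       across every offset within a cache line. */
    char *base = (char *) (((unsigned long) (buf + 63)) & ~63UL);
    int off, i;
    for (off = 0; off < 64; off++) {
        volatile double *p = (volatile double *) (base + off);
        unsigned long long t0, t1;
        *p = 1.0;
        t0 = rdtsc();
        for (i = 0; i < 1000000; i++)
            *p *= 1.000001;    /* load + multiply + store */
        t1 = rdtsc();
        printf("%2d %llu\n", off, t1 - t0);
    }
    return 0;
}
/*********************/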
BTW, Intel VTune has these two events: "Retired Split Loads" and "Retired Split Stores". You can run an application and configure VTune to show the instances of these events (you can also control the number of times after which the event is triggered). - moh
One can infer from the results that the cache line size of the machine that ran this experiment is 64 bytes.
Attachment #356378 - Attachment mime type: application/octet-stream → text/plain
Attachment #356379 - Attachment mime type: application/octet-stream → text/plain
Thanks for the explanation of the splits. I had suspected that splitting cache lines would cost something, but I assumed it would be relatively small. But I'm not convinced that explains everything that I measured. In particular, in your experiment I'm sure the slowdown would be periodic, happening for every 64 bytes of offset. And, suspecting cache effects of *some* kind, I looked to see if the slowdown had a period of 64 in the stack offset. In fact, I couldn't find a period at all, and specifically, adding another 64 to the bump made the slowdown go away.

Revisiting this just now, I noticed that ebp in the bad case is 0xbfff01c. And the load/store that seemed to be causing the worst problem (estimated to be taking a 100 cycle penalty for either load or store) was for ebp-0x20. So it actually crosses a page boundary. For some reason, although I noticed that before, I didn't properly check for a 4096-byte period--probably because the only effect I knew about with a 4096-byte period was page faults, which take much longer than 100 cycles.

I modified your program to look for page boundary splits and I did see a second effect:

   iters  mod-64  mod-4096       clocks  clk/iter       ms  us/iter
10000000       0         0    122943183      12.3    56.02    0.006
10000000      52        52    113016651      11.3    51.50    0.005
10000000      60        60    445824247      44.6   203.16    0.020
10000000      60      1020    423288173      42.3   192.89    0.019
10000000      60      4092   2624075102     262.4  1195.78    0.120
20000000       0         0    223726294      11.2   101.95    0.005
20000000      52        52    225649545      11.3   102.83    0.005
20000000      60        60    874256284      43.7   398.39    0.020
20000000      60      1020    856285309      42.8   390.21    0.020
20000000      60      4092   5205228468     260.3  2371.97    0.119
40000000       0         0    434348398      10.9   197.93    0.005
40000000      52        52    434259166      10.9   197.89    0.005
40000000      60        60   1665305345      41.6   758.87    0.019
40000000      60      1020   1684272293      42.1   767.51    0.019
40000000      60      4092  10416002795     260.4  4746.48    0.119
80000000       0         0    885687583      11.1   403.60    0.005
80000000      52        52    909311172      11.4   414.36    0.005
80000000      60        60   3367102244      42.1  1534.36    0.019
80000000      60      1020   3343114093      41.8  1523.42    0.019
80000000      60      4092  20906631328     261.3  9526.95    0.119

Do you know what this effect is?
TLB miss may have a fast handler -- MIPS and such RISC chips from my salad days did, much faster than a full page fault or even validity fault but slower than a cache penalty. /be
Right. Unaligned accesses that cross cache-line boundaries are expensive; the ones that cross page boundaries are way more expensive. With an access that crosses page boundaries, on the first access a TLB miss may occur; on repeated accesses to the same address, a TLB miss may not occur (both pages are now in memory), but a HW exception in the micro-code that takes care of the case may still occur repeatedly. Such HW exceptions are micro-architectural and not specified in the architecture specification, as they depend on the particular implementation. The IA specification, however, says:

http://download.intel.com/design/processor/manuals/253668.pdf

  17.18.11 Operands Split Across Segments and/or Pages
  On the P6 family, Pentium, and Intel486 processor FPUs, when the first
  half of an operand to be written is inside a page or segment and the
  second half is outside, a memory fault can cause the first half to be
  stored but not the second half. In this situation, the Intel 387 math
  coprocessor stores nothing.

BTW, there is a bug in my test that propagated to Dave's. Here's the correction:

ptr = (char*) (((unsigned int) ptr + 4096) & 0xfffff000);
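(Equivalently, as a helper -- a sketch with uintptr_t substituted for unsigned int so it also works on 64-bit builds:)

/*********************/
#include <stdint.h>

/* Advance ptr past the current page to the next 4096-byte boundary. */
static char *next_page(char *ptr)
{
    return (char *) (((uintptr_t) ptr + 4096) & ~(uintptr_t) 0xfff);
}
/*********************/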
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Back in June, Jungwoo Ha (one of our interns) looked at the EBP alignment in Tamarin. Instead of using "push/pop ebp" to adjust ESP, Jungwoo used add/sub instructions on ESP (note that push/pop are memory operations while add/sub are ALU operations), and we got some slight improvements. It might be worth trying it in TM as well. Here's the speedup we got on TT at the time:

test                                     avm    avm2   %sp
sunspider/access-binary-trees.as         84.0   84.0   0.0
sunspider/access-fannkuch.as            138.0  136.0   1.4
sunspider/access-nbody.as               160.0  160.0   0.0
sunspider/access-nsieve.as               60.0   60.0   0.0
sunspider/bitops-3bit-bits-in-byte.as    14.0   14.0   0.0
sunspider/bitops-bits-in-byte.as         40.0   40.0   0.0
sunspider/bitops-bitwise-and.as         206.0  201.0   2.4
sunspider/bitops-nsieve-bits.as          52.0   52.0   0.0
sunspider/controlflow-recursive.as       30.0   29.0   3.3
sunspider/crypto-aes.as                 169.0  169.0   0.0
sunspider/crypto-sha1.as                 39.0   39.0   0.0
sunspider/math-cordic.as                 52.0   52.0   0.0
sunspider/math-partial-sums.as          196.0  194.0   1.0
sunspider/math-spectral-norm.as          33.0   33.0   0.0
sunspider/s3d-cube.as                   155.0  154.0   0.6
sunspider/s3d-morph.as                   77.0   75.0   2.6
sunspider/string-fasta.as               159.0  155.0   2.5

Of course, freeing up EBP and going to ESP-based addressing has more potential, as you have an extra register for general use. - moh
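(Illustrative contrast of the two prologue styles, not the actual Tamarin code -- FRAME_SIZE is a placeholder:)

/*********************/
; push/pop form: each push and pop is a memory operation
push ebp
mov  ebp, esp
...
pop  ebp
ret

; add/sub form: esp is adjusted in the ALU and ebp stays free
sub  esp, FRAME_SIZE
...
add  esp, FRAME_SIZE
ret
/*********************/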
Whiteboard: fixed-in-tracemonkey → [needs 1.9.1 landing]
Flags: in-testsuite-
Flags: in-litmus-
I figured out how to munge the code given in comment 14 to compile and run on Windows, so here it is for future reference:

/********************************/
static void EnableAlignmentCheck()
{
    __asm {
        pushf
        or [ESP], (1<<18)
        popf
    }
}

static void DisableAlignmentCheck()
{
    __asm {
        pushf
        and [ESP], ~(1<<18)
        popf
    }
}

int main()
{
    EnableAlignmentCheck();
    char x[16];
    *(double*)(&x[1]) *= 5.5;
    DisableAlignmentCheck();
    return 0;
}
/********************************/

I filed bug 476037 to look into setting AC when executing traced code so as not to clutter up this bug with that orthogonal task.
