Closed Bug 471822 Opened 16 years ago Closed 13 years ago

TM: We have grown some performance hair

Categories: Core :: JavaScript Engine, defect
Platform: x86, macOS
Severity: normal (priority not set)

Status: RESOLVED WONTFIX

People: Reporter: gal; Assignee: dmandelin

Attachments: 4 files

We are more than 70 ms slower than our previous best SunSpider score. We should try to identify the patches that slowed us down.
Depends on: 470375
I'll start in on this, just running the tests to find those patches. But anyone should feel free to steal the bug back from me.
Assignee: general → dmandelin
On V8 benchmark v2 (http://v8.googlecode.com/svn/data/benchmarks/v2/run.html) I get ~150 points on trunk and ~200 points on 3.0.5. I don't know whether that means anything to TM; do you guys just follow the SunSpider scores?
We care about the v8 benchmarks too, but AFAIK we've been using mostly SunSpider to drive our internal TM optimization efforts.

In testing my data collection script, I discovered that 23249:7b79b55a6ea6 (Fri Jan 02 12:37:55 2009 -0800, Merge m-c to tracemonkey) seems to have caused a 90 ms perf regression on access-nbody. I was surprised to see that, because it landed after this bug was filed, so presumably there is also an older regressing changeset.
This is the shell script that runs a SunSpider test from a given revision in the TM repo. It is to be run in the js/src directory. You would have to update the way it runs autoconf213 and the path to SunSpider to use it yourself.
This Python program drives the shell script over a range of revisions. You need a 'logs' directory in order to run it. It saves the stdout and stderr for each rev in a log file and prints the SunSpider times as Python tuples (for later reading with eval and, e.g., transforming into spreadsheet format).

It takes about a minute to do a run with 10 SunSpider runs/test. Based on that, it can run every rev since Dec 1 in about 20 hours. So I'll just run it over the weekend. If you have any more precise guidance about when the regression might have occurred, let me know.
I skipped the script and instead bisected my way to the regressing changeset. It is:

changeset:   23223:6bbf10f75a88
parent:      23204:4ac387253c8f
user:        Andreas Gal <gal@mozilla.com>
date:        Thu Jan 01 17:55:43 2009 -0800
summary:     Store frame state information in the code cache and merely put a pointer to it onto the native call stack (470375, r=danderson).

This changeset causes a 90 ms regression to SunSpider in access-nbody.js. Strangely, running access-nbody.js by itself causes no regression.
Strange. We see exactly the same effect with David's latest patch in fannkuch. David, any progress on that?
This appears to be due to a bug in bash. On my system, this command does not exhibit the regression (I have created copies of sunspider-standalone-driver.js with 'aaa' names):

~/bashsrc/bash-3.2/bash -c '/Users/dmandelin/sources/tracemonkey/js/src/opt/dist/bin/js -j -f tmp/sunspider-test-prefix.js -f resources/aaaaaaaaaaaaaaaa.js'

This exhibits a 20 ms regression on access-nbody:

~/bashsrc/bash-3.2/bash -c '/Users/dmandelin/sources/tracemonkey/js/src/opt/dist/bin/js -j -f tmp/sunspider-test-prefix.js -f resources/aaaaaaaaaaaaaaaaaaaaaaaa.js'

And this exhibits the full regression:

~/bashsrc/bash-3.2/bash -c '/Users/dmandelin/sources/tracemonkey/js/src/opt/dist/bin/js -j -f tmp/sunspider-test-prefix.js -f resources/aaaaaaaaaaaaaaaaaaaaaaaaaaaa.js'

But I can make the regression go away by adding the -i flag. I have proved that the pertinent effect of -i is to read and execute ~/.bashrc. It turns out that the regression doesn't go away for an empty .bashrc, but does go away if it contains this:

  export z=~

But this won't do it:

  export z=a
I know it's only January 2, but I would like to nominate this for weirdest bug of the year.
Latest freaky update. I have now proved that the value of esp on entry to main determines the presence and magnitude of the perf hit:

- I added code at the start of main to print esp and ebp. Then I ran identical runs except for the length of the last command-line argument. I saw that a perf change between two runs was always accompanied by a change in esp and ebp.

- I can make a regression happen or not simply by calling 'alloca(32)' or not at the beginning of main. AFAIK its only effect is to move esp. (A sketch of this instrumentation follows below.)
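
For illustration, a minimal sketch of that instrumentation (not the actual patch; x86-32 and GCC-style inline asm assumed):

  #include <alloca.h>
  #include <stdio.h>

  /* Sketch of the experiment described above: print esp and ebp on entry
   * to main, then optionally perturb the stack pointer with alloca(32). */
  int main(int argc, char **argv)
  {
      unsigned int esp, ebp;
      __asm__ __volatile__("movl %%esp, %0\n\t"
                           "movl %%ebp, %1"
                           : "=r"(esp), "=r"(ebp));
      fprintf(stderr, "esp=%08x ebp=%08x\n", esp, ebp);

      (void)alloca(32);   /* moves esp down by at least 32 bytes */

      /* ... the rest of the shell's main() would run here ... */
      return 0;
  }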
Freaky Deaky!

So something perf-critical is sensitive to stack page faulting, maybe? Can you try values from 1 to 32 to see when the slowdown happens?

/be
We use alloca for the native stack. Maybe we should align? Or use a fixed per-thread zone?
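
For reference, a minimal sketch of the alignment option (the function and constant names are placeholders, not TraceMonkey's actual ones): over-allocate with alloca and round the pointer up to a 16-byte boundary.

  #include <alloca.h>
  #include <stdint.h>

  enum { NATIVE_STACK_BYTES = 4096, STACK_ALIGN = 16 };

  /* Hand the trace a 16-byte-aligned native stack regardless of where
   * alloca happens to put the raw allocation. */
  static void call_trace_with_aligned_stack(void (*trace_entry)(void *native_sp))
  {
      void *raw = alloca(NATIVE_STACK_BYTES + STACK_ALIGN - 1);
      uintptr_t p = ((uintptr_t)raw + STACK_ALIGN - 1) & ~(uintptr_t)(STACK_ALIGN - 1);
      trace_entry((void *)p);
  }

A fixed per-thread zone would avoid the rounding entirely, at the cost of managing that zone's lifetime per thread.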
Things that are the same in normal vs. slow runs:

  number of traces compiled
  native code size of each trace
  number of traces executed
  jsops executed in interpreter
  jsops executed natively
  number of calls to js_BoxDouble
  time spent in js_BoxDouble
  number of pages zeroed
  time spent in kernel

Things that are not the same:

  execution time of last trace in "loop body" group in nbody
     - appears to run for 2x, 4x, or 8x of baseline time
  number of x86 instructions executed
  number of call, branch, and indirect jump instructions executed
     - the increase factor is different for each instruction kind

I tried to look for differences in page faults and system calls in dtrace, but the regression went away entirely when running under dtrace.
David, can you try this patch? Does it make a difference?
Latest ridiculous update:

- I've managed to pin all of the blame for the performance hit on the call to one particular trace in js_CallTree. This trace is for the last loop in the advance function of access-nbody, a loop with 5 iterations.

- I can make the perf hit appear simply by bumping the stack pointer. E.g., by adding asm("addl $64, %esp") right before the trace call.

- In this particular test, when the perf hit appears, it causes the trace to run 10x as long. Without the perf hit, the trace usually runs in 1700 cycles, but sometimes adds 50,000 cycles or so (presumably some kind of interrupt). With the hit, the trace usually runs in 18,000 cycles, but sometimes adds 50,000 or 100,000 cycles. 

- The only builtins that I can see being called on the trace are js_BoxDouble and js_UnboxDouble. AFAICT, these functions are not producing the slowdown, but it's possible that the caller sequence for them is. My ad hoc timers (see the sketch after this comment) and Shark agree that the trace just runs longer.

- I did a bunch of hardware performance counter (HPC) measurements today and didn't come up with much. It appears that the slow version executes (or whatever verb applies):

    - 100M more clock cycles (about 40 ms)
    -  10M more x86 instructions (expected cost: 10-20M cycles)
    -  40M more uops dispatched
    -  20M more uops retired
    -   1M more blocked loads    (expected cost: 5-20M cycles)
    - 500k more blocked stores   (expected cost: 3M cycles)
    - 100k more FPU ops
    - 1.5M more L2 loads         (expected cost: 2M cycles)
    -  600 more L2 load misses   (trivial cost)
    - 1.5M more branches         (expected cost: 2-5M cycles)
    -  65k more branch misses    (expected cost: 1.3M cycles)

The stats above are bizarre on many levels. First, I still haven't been able to actually observe *what* instructions the slow version runs that the fast one doesn't, and I don't know how it could end up doing that. Second, the cost of the additional events listed above only adds up to 20-50% of the extra cycles seen. Assuming that any of these stats are real, it seems that the extra/different code has less parallelism. 

One guess just occurred to me. I know that the same number of traces are compiled and the same number of traces are called. But there are 2 traces for the bad loop. Maybe one is 10x as fast as the other. And maybe one of the guards goes the wrong way and sends it through the wrong trace on iterations 2-5 in the slow case. But in debug mode, both of those traces produce 904-914 lines of native spew, so it's hard to see how they could be that different. Could the trace be jumping over to some irrelevant trace, that can somehow run without crashing but without changing observed variables? Or maybe I'm just wrong about the traces being the same and they're really not?

Andreas, I think I'll be asking you for hints on non-intrusive trace instrumentation and native code snapshotting tomorrow.
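
For reference, the kind of ad hoc cycle timer referred to above might look like this (a sketch; trace_entry is a placeholder, not the real js_CallTree entry point). Because rdtsc counts wall-clock cycles, anything the OS interleaves shows up in the count, which is consistent with the occasional 50,000-100,000 cycle spikes described above.

  #include <stdint.h>
  #include <stdio.h>

  static inline uint64_t rdtsc(void)
  {
      uint32_t lo, hi;
      __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
      return ((uint64_t)hi << 32) | lo;
  }

  static uint64_t time_trace_call(void (*trace_entry)(void))
  {
      uint64_t t0 = rdtsc();
      trace_entry();                               /* call into the native trace */
      uint64_t cycles = rdtsc() - t0;
      fprintf(stderr, "trace ran for %llu cycles\n", (unsigned long long)cycles);
      return cycles;
  }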
I think when we were looking at this yesterday we saw two traces being compiled, and the difference was a boolean value vs. an object. If it were a double on one trace and integers on the other, I could see one trace taking longer. Can you look at the type map differences again? (The debug build shows them.)
I verified with the attached patch that we don't do unaligned reads or writes of xmm registers.
We identified a potential source of this problem. We spill the 3 non-volatile registers at function entry before setting up ebp, which yields 5 words pushed (return address, original ebp, 3 registers) relative to an aligned sp. Hence, the new ebp is not 8-byte aligned. Why this only creates a problem for certain values for sp is unclear. I will file a blocking bug that fixes the ebp alignment and we will see whether this issue goes away.
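
For concreteness, a tiny arithmetic sketch of that claim, assuming esp is 16-byte aligned at the call site (that assumption is the open question, not something this snippet establishes):

  #include <stdio.h>

  int main(void)
  {
      unsigned int esp = 0x1000;   /* a 16-byte-aligned esp just before the call */
      esp -= 5 * 4;                /* return address + original ebp + 3 spilled regs */
      unsigned int ebp = esp;      /* ebp is set only after all five words are pushed */
      printf("ebp %% 8 = %u\n", ebp % 8);   /* prints 4: ebp is not 8-byte aligned */
      return 0;
  }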
Depends on: 472791
Question for moh: if ebp is unaligned, why would quad-word loads/stores be slow only for certain esp values? For example, esp = X is fast, but esp = X + 32 is slow (very slow, 5x slower for certain loads/stores). Any architecture hints? Is this possible/plausible?
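
For what it's worth, a standalone microbenchmark along these lines (not from this bug) can probe the question: it times 8-byte stores at several offsets into a 64-byte-aligned buffer. On typical x86 parts, a store that merely breaks 8-byte alignment is cheap, while one that straddles a 64-byte cache line is much slower, and which case you hit depends on the absolute address, i.e. on esp. The unaligned-pointer cast is x86-specific.

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  /* Time repeated 8-byte stores at byte offset 'off' into buf. The volatile
   * pointer keeps the compiler from eliminating or combining the stores. */
  static double time_stores(char *buf, size_t off, long iters)
  {
      volatile double *p = (volatile double *)(buf + off);
      clock_t t0 = clock();
      for (long i = 0; i < iters; i++)
          *p = (double)i;                   /* 8-byte store at buf + off */
      return (double)(clock() - t0) / CLOCKS_PER_SEC;
  }

  int main(void)
  {
      static char buf[256] __attribute__((aligned(64)));
      size_t offsets[] = { 0, 4, 12, 60 };  /* 60 makes the store straddle a cache line */
      for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++)
          printf("offset %2zu: %.3f s\n", offsets[i],
                 time_stores(buf, offsets[i], 200000000L));
      return 0;
  }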
dmandelin, the blocking bug has a patch that aligns ebp. Could you test it?
Obsolete with the removal of tracejit.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX