Closed Bug 471822 Opened 16 years ago Closed 13 years ago

TM: We have grown some performance hair

Categories: Core :: JavaScript Engine, defect
Platform: x86, macOS
Severity: normal (priority not set)

Status: RESOLVED WONTFIX

People: Reporter: gal; Assignee: dmandelin

Attachments: 4 files

We are more than 70 ms slower than our previous best SunSpider score. We should try to identify the patches that slowed us down.
Depends on: 470375
I'll start in on this, just running the tests to find those patches. But anyone should feel free to steal the bug back from me.
Assignee: general → dmandelin
On V8 benchmark v2 (http://v8.googlecode.com/svn/data/benchmarks/v2/run.html) I get ~150 points on trunk and ~200 points on 3.0.5. I don't know whether that means anything to TM; do you guys just follow the SunSpider scores?
We care about the v8 benchmarks too, but AFAIK we've been using mostly SunSpider to drive our internal TM optimization efforts.

In testing my data collection script, I discovered that 23249:7b79b55a6ea6 (Fri Jan 02 12:37:55 2009 -0800, Merge m-c to tracemonkey) seems to have caused a 90 ms perf regression on access-nbody. I was surprised to see that, because it landed after this bug was filed, so presumably there is also an older regressing changeset.
This is the shell script that runs a SunSpider test from a given revision in the TM repo. It is to be run in the js/src directory. You would have to update the way it runs autoconf213 and the path to SunSpider to use it yourself.
This Python program drives the shell script over a range of revisions. You need a 'logs' directory in order to run it. It saves the stdout and stderr for each rev in a log file and prints the SunSpider times as Python tuples (for later reading with eval and, e.g., transforming into spreadsheet format).

It takes about a minute to do a run with 10 SunSpider runs/test. Based on that, it can run every rev since Dec 1 in about 20 hours. So I'll just run it over the weekend. If you have any more precise guidance about when the regression might have occurred, let me know.
I skipped the script and instead bisected my way to the regressing changeset. It is:

changeset:   23223:6bbf10f75a88
parent:      23204:4ac387253c8f
user:        Andreas Gal <gal@mozilla.com>
date:        Thu Jan 01 17:55:43 2009 -0800
summary:     Store frame state information in the code cache and merely put a pointer to it onto the native call stack (470375, r=danderson).

This changeset causes a 90 ms regression to SunSpider in access-nbody.js. Strangely, running access-nbody.js by itself causes no regression.
Strange. We see exactly the same effect with David's latest patch in fannkuch. David, any progress on that?
This appears to be due to a bug in bash. On my system, this command does not exhibit the regression (I have created copies of sunspider-standalone-driver.js with 'aaa' names):

~/bashsrc/bash-3.2/bash -c '/Users/dmandelin/sources/tracemonkey/js/src/opt/dist/bin/js -j -f tmp/sunspider-test-prefix.js -f resources/aaaaaaaaaaaaaaaa.js'

This exhibits a 20 ms regression on access-nbody:

~/bashsrc/bash-3.2/bash -c '/Users/dmandelin/sources/tracemonkey/js/src/opt/dist/bin/js -j -f tmp/sunspider-test-prefix.js -f resources/aaaaaaaaaaaaaaaaaaaaaaaa.js'

And this exhibits the full regression:

~/bashsrc/bash-3.2/bash -c '/Users/dmandelin/sources/tracemonkey/js/src/opt/dist/bin/js -j -f tmp/sunspider-test-prefix.js -f resources/aaaaaaaaaaaaaaaaaaaaaaaaaaaa.js'

But I can make the regression go away by adding the -i flag. I have proved that the pertinent effect of -i is to read and execute ~/.bashrc. It turns out that the regression doesn't go away for an empty .bashrc, but does go away if it contains this:

  export z=~

But this won't do it:

  export z=a
I know it's only January 2, but I would like to nominate this for weirdest bug of the year.
Latest freaky update. I have now proved that the value of esp on entry to main determines the presence and magnitude of the perf hit:

- I added code at the start of main to print esp and ebp. Then I ran identical runs except for the length of the last command-line argument. I saw that a perf change between two runs was always accompanied by a change in esp and ebp.

- I can make a regression happen or not simply by calling 'alloca(32)' or not at the beginning of main. AFAIK its only effect is to move esp. (A sketch of this instrumentation follows below.)
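
For illustration, a minimal sketch of that instrumentation (not the actual patch; x86-32 and GCC-style inline asm assumed):

  #include <alloca.h>
  #include <stdio.h>

  /* Sketch of the experiment described above: print esp and ebp on entry
   * to main, then optionally perturb the stack pointer with alloca(32). */
  int main(int argc, char **argv)
  {
      unsigned int esp, ebp;
      __asm__ __volatile__("movl %%esp, %0\n\t"
                           "movl %%ebp, %1"
                           : "=r"(esp), "=r"(ebp));
      fprintf(stderr, "esp=%08x ebp=%08x\n", esp, ebp);

      (void)alloca(32);   /* moves esp down by at least 32 bytes */

      /* ... the rest of the shell's main() would run here ... */
      return 0;
  }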
Freaky Deaky!

So something perf-critical is sensitive to stack page faulting, maybe? Can you try values from 1 to 32 to see when the slowdown happens?

/be
We use alloca for the native stack. Maybe we should align? Or use a fixed per-thread zone?
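
For reference, a minimal sketch of the alignment option (the function and constant names are placeholders, not TraceMonkey's actual ones): over-allocate with alloca and round the pointer up to a 16-byte boundary.

  #include <alloca.h>
  #include <stdint.h>

  enum { NATIVE_STACK_BYTES = 4096, STACK_ALIGN = 16 };

  /* Hand the trace a 16-byte-aligned native stack regardless of where
   * alloca happens to put the raw allocation. */
  static void call_trace_with_aligned_stack(void (*trace_entry)(void *native_sp))
  {
      void *raw = alloca(NATIVE_STACK_BYTES + STACK_ALIGN - 1);
      uintptr_t p = ((uintptr_t)raw + STACK_ALIGN - 1) & ~(uintptr_t)(STACK_ALIGN - 1);
      trace_entry((void *)p);
  }

A fixed per-thread zone would avoid the rounding entirely, at the cost of managing that zone's lifetime per thread.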
Things that are the same in normal vs. slow runs:

  number of traces compiled
  native code size of each trace
  number of traces executed
  jsops executed in interpreter
  jsops executed natively
  number of calls to js_BoxDouble
  time spent in js_BoxDouble
  number of pages zeroed
  time spent in kernel

Things that are not the same:

  execution time of last trace in "loop body" group in nbody
     - appears to run for 2x, 4x, or 8x of baseline time
  number of x86 instructions executed
  number of call, branch, and indirect jump instructions executed
     - the increase factor is different for each instruction kind

I tried to look for differences in page faults and system calls in dtrace, but the regression went away entirely when running under dtrace.
David, can you try this patch? Does it make a difference?
Latest ridiculous update:

- I've managed to pin all of the blame for the performance hit on the call to one particular trace in js_CallTree. This trace is for the last loop in the advance function of access-nbody, a loop with 5 iterations.

- I can make the perf hit appear simply by bumping the stack pointer. E.g., by adding asm("addl $64, %esp") right before the trace call.

- In this particular test, when the perf hit appears, it causes the trace to run 10x as long. Without the perf hit, the trace usually runs in 1700 cycles, but sometimes adds 50,000 cycles or so (presumably some kind of interrupt). With the hit, the trace usually runs in 18,000 cycles, but sometimes adds 50,000 or 100,000 cycles. 

- The only builtins that I can see being called on the trace are js_BoxDouble and js_UnboxDouble. AFAICT, these functions are not producing the slowdown, but it's possible that the caller sequence for them is. My ad hoc timers (see the sketch after this comment) and Shark agree that the trace just runs longer.

- I did a bunch of hardware performance counter (HPC) measurements today and didn't come up with much. It appears that the slow version executes (or whatever verb applies):

    - 100M more clock cycles (about 40 ms)
    -  10M more x86 instructions (expected cost: 10-20M cycles)
    -  40M more uops dispatched
    -  20M more uops retired
    -   1M more blocked loads    (expected cost: 5-20M cycles)
    - 500k more blocked stores   (expected cost: 3M cycles)
    - 100k more FPU ops
    - 1.5M more L2 loads         (expected cost: 2M cycles)
    -  600 more L2 load misses   (trivial cost)
    - 1.5M more branches         (expected cost: 2-5M cycles)
    -  65k more branch misses    (expected cost: 1.3M cycles)

The stats above are bizarre on many levels. First, I still haven't been able to actually observe *what* instructions the slow version runs that the fast one doesn't, and I don't know how it could end up doing that. Second, the cost of the additional events listed above only adds up to 20-50% of the extra cycles seen. Assuming that any of these stats are real, it seems that the extra/different code has less parallelism. 

One guess just occurred to me. I know that the same number of traces are compiled and the same number of traces are called. But there are 2 traces for the bad loop. Maybe one is 10x as fast as the other. And maybe one of the guards goes the wrong way and sends it through the wrong trace on iterations 2-5 in the slow case. But in debug mode, both of those traces produce 904-914 lines of native spew, so it's hard to see how they could be that different. Could the trace be jumping over to some irrelevant trace, that can somehow run without crashing but without changing observed variables? Or maybe I'm just wrong about the traces being the same and they're really not?

Andreas, I think I'll be asking you for hints on non-intrusive trace instrumentation and native code snapshotting tomorrow.
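
For reference, the kind of ad hoc cycle timer referred to above might look like this (a sketch; trace_entry is a placeholder, not the real js_CallTree entry point). Because rdtsc counts wall-clock cycles, anything the OS interleaves shows up in the count, which is consistent with the occasional 50,000-100,000 cycle spikes described above.

  #include <stdint.h>
  #include <stdio.h>

  static inline uint64_t rdtsc(void)
  {
      uint32_t lo, hi;
      __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
      return ((uint64_t)hi << 32) | lo;
  }

  static uint64_t time_trace_call(void (*trace_entry)(void))
  {
      uint64_t t0 = rdtsc();
      trace_entry();                               /* call into the native trace */
      uint64_t cycles = rdtsc() - t0;
      fprintf(stderr, "trace ran for %llu cycles\n", (unsigned long long)cycles);
      return cycles;
  }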
I think when we were looking at this yesterday we saw two traces being compiled, and the difference was a boolean value vs. an object. If it were a double on one trace and integers on the other, I could see one trace taking longer. Can you look at the type map differences again? (The debug build shows them.)
I verified with the attached patch that we don't do unaligned reads or writes of xmm registers.
We identified a potential source of this problem. We spill the 3 non-volatile registers at function entry before setting up ebp, which yields 5 words pushed (return address, original ebp, 3 registers) relative to an aligned sp. Hence, the new ebp is not 8-byte aligned. Why this only creates a problem for certain values for sp is unclear. I will file a blocking bug that fixes the ebp alignment and we will see whether this issue goes away.
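
For concreteness, a tiny arithmetic sketch of that claim, assuming esp is 16-byte aligned at the call site (that assumption is the open question, not something this snippet establishes):

  #include <stdio.h>

  int main(void)
  {
      unsigned int esp = 0x1000;   /* a 16-byte-aligned esp just before the call */
      esp -= 5 * 4;                /* return address + original ebp + 3 spilled regs */
      unsigned int ebp = esp;      /* ebp is set only after all five words are pushed */
      printf("ebp %% 8 = %u\n", ebp % 8);   /* prints 4: ebp is not 8-byte aligned */
      return 0;
  }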
Depends on: 472791
Question for moh: if ebp is unaligned, why would quad-word loads/stores be slow only for certain esp values? For example, esp = X is fast, but esp = X + 32 is slow (very slow, 5x slower for certain loads/stores). Any architecture hints? Is this possible/plausible?
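
For what it's worth, a standalone microbenchmark along these lines (not from this bug) can probe the question: it times 8-byte stores at several offsets into a 64-byte-aligned buffer. On typical x86 parts, a store that merely breaks 8-byte alignment is cheap, while one that straddles a 64-byte cache line is much slower, and which case you hit depends on the absolute address, i.e. on esp. The unaligned-pointer cast is x86-specific.

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  /* Time repeated 8-byte stores at byte offset 'off' into buf. The volatile
   * pointer keeps the compiler from eliminating or combining the stores. */
  static double time_stores(char *buf, size_t off, long iters)
  {
      volatile double *p = (volatile double *)(buf + off);
      clock_t t0 = clock();
      for (long i = 0; i < iters; i++)
          *p = (double)i;                   /* 8-byte store at buf + off */
      return (double)(clock() - t0) / CLOCKS_PER_SEC;
  }

  int main(void)
  {
      static char buf[256] __attribute__((aligned(64)));
      size_t offsets[] = { 0, 4, 12, 60 };  /* 60 makes the store straddle a cache line */
      for (size_t i = 0; i < sizeof offsets / sizeof offsets[0]; i++)
          printf("offset %2zu: %.3f s\n", offsets[i],
                 time_stores(buf, offsets[i], 200000000L));
      return 0;
  }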
dmandelin, the blocking bug has a patch that aligns ebp. Could you test it?
Obsolete with the removal of tracejit.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX