Closed Bug 652535 Opened 13 years ago Closed 11 years ago

JM: JIT code performance counters

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: bhackett1024, Unassigned)

References

Details

Attachments

(1 file)

For tests where we spend a lot of time in JIT code, shark is not helpful in identifying where we are actually spending our time, and just reading the script is not a good approach as benchmarks increase in complexity.  It should be possible to insert performance counters into JIT code so that we can see which loops are hot both in terms of times hit and CPU clocks.  This wouldn't identify places where we spend the most time in the VM, but JIT code which does lots of VM stuff would still show up as expensive.

I experimented a little with the x86 RDTSC instruction, and it seems up to the task.  This instruction gets the elapsed clocks from the CPU, and takes about 11 ns pretty consistently to execute on my machine (not sure if it needs to talk to the OS kernel in there).  This bias can be accounted for when reporting the results, and can be mitigated by only timing every 1/N times we hit a loop head or script entry.
> but JIT code which does lots of VM stuff would still show up as expensive.

And shark will show the VM time fine.

Is the idea here to do the timing always, or just after startProfiling() has been called in a profiling-enabled build?  The former would be nice for people who're trying to profile JS in release builds...
I'm not sure if you've seen it, but Dave did some work on this in bug 642054.
One thing to be aware of about RDTSC is that it is not a serializing instruction; it doesn't wait for all prior instructions to execute before reading the counter. This may not be a big deal.
Depends on: 642054
(In reply to comment #1)
> > but JIT code which does lots of VM stuff would still show up as expensive.
> 
> And shark will show the VM time fine.
> 
> Is the idea here to do the timing always, or just after startProfiling() has
> been called in a profiling-enabled build?  The former would be nice for people
> who're trying to profile JS in release builds...

I think we should have a way to profile JS time in plain release builds (not sure what enabling profiling for a build already accomplishes), it's good information not just for engine developers but downstream extensions and web developers. (Though right now I'm mainly interested in the engine developer and benchmark tuner perspectives).

I'm not sure whether this or the approach in bug 642054 is better (did not know about that bug), will play around with this some more.  Using timers I worry about a few things --- repeatability, what to do about compartment switching in the browser and all that time spent in stubs called from JIT code.  What really matters though is how useful the results are and the overhead involved.
> not sure what enabling profiling for a build already accomplishes

Adds some functions on the global to start/stop shark and other profilers and forces compilation with -fno-omit-frame-pointer.
Attached patch WIPSplinter Review
Patch using RDTSC instruction to track time under the JIT code for each script.  Tracks the number of clocks when entering or exiting a script; entering a script adds the number elapsed since the last track to the parent's JIT, and exiting a script adds the number elapsed to the current JIT.  Doesn't do anything at loop heads, and is set up so that only about .6% of enters/exits are actually measured (a decrement/branch is taken at others), for a pretty tiny effect on perf. (This is only really usable on benchmarks that make a good number of calls).

Still, this gives fairly consistent and repeatable results that are by and large in sync with what we know on the v8 benchmarks (bug 642003).  One interesting thing is that on the v8-crypto benchmark these point to several functions outside am3 as making a big contribution (bnpFromString, bnpSquareTo and montReduce, each about 10% of total time).
Let me see if I understand the patch correctly:

You insert a test into all mjitted prologues and epilogues that decides whether to accumulate cycles in a per-jit cycle counter. The algorithm boils down to:

Step 1: use the first sample as a baseline cycle count
Step 2: sample the next 100 events, updating jit->totalClocks
Step 3: flip a coin
  heads: skip the next 0x8000 events, goto step 3
  tails: goto step 1

The result is 100 samples followed by a varying number of skipped samples, repeating forever. The expected number of skipped samples is 2*0x8000. So the percentage of events sampled is 100 / (100 + 2 * 0x8000), or 0.15%, but it's easily tunable. Pretty slick!

(The 0.6% in your comment appears to be based making 0x8000 * 2 = 16000 decimal.)

Assuming I'm understanding things correctly:

 - this should do ok with threads as long as they stick to a CPU, because mjitted code is thread-specific (?).

 - if a thread migrates to a different CPU, it'll glitch the accounting, though if the TSCs on the CPUs are sufficiently close or sufficiently different, it'll be ok

 - mjit only
(In reply to comment #7)
> Step 1: use the first sample as a baseline cycle count
> Step 2: sample the next 100 events, updating jit->totalClocks
> Step 3: flip a coin
>   heads: skip the next 0x8000 events, goto step 3
>   tails: goto step 1
> 
> The result is 100 samples followed by a varying number of skipped samples,
> repeating forever. The expected number of skipped samples is 2*0x8000. So
> the percentage of events sampled is 100 / (100 + 2 * 0x8000), or 0.15%, but
> it's easily tunable. Pretty slick!
> 
> (The 0.6% in your comment appears to be based making 0x8000 * 2 = 16000
> decimal.)

Oops, the .6% is wrong due to a bug in the code itself.  Meant to take the bitwise-and with 0x7fff rather than 0x8000, so that instead of flipping a coin we end up skipping between 1 and 32000 events, averaging out to 16000.  The more randomness in the sampling the better, as is the sampling in this patch is vulnerable to bias should there be cyclic patterns in the calls the script makes.
(In reply to comment #3)
> One thing to be aware of about RDTSC is that it is not a serializing
> instruction; it doesn't wait for all prior instructions to execute before
> reading the counter. This may not be a big deal.

When this isn't what you want, it sounds like the usual fix is to add in a cpuid instruction, which is serializing. (Though this seems odd to me, because I don't see any way to avoid trashing its several output registers. It's the least invasive unprivileged serializing instruction, though. I wonder why there isn't a serializing NOP?)
(In reply to comment #4)
> I'm not sure whether this or the approach in bug 642054 is better (did not
> know about that bug), will play around with this some more.  Using timers I
> worry about a few things --- repeatability, what to do about compartment
> switching in the browser and all that time spent in stubs called from JIT
> code.  What really matters though is how useful the results are and the
> overhead involved.

Another advantage of the approach in this bug is that you can usefully gather other performance metrics like cache misses, mispredictions, etc. A sampling-based approach can't really do that. (I'm calling your algorithm here "interval sampling", though I'm afraid to Google it to see what meanings it might already have.)

What do you mean about repeatability, though? Unless you remove the random skipping from your algorithm (simple enough), the results aren't repeatable there either.

For now, I'm trying to push forward with the bug 642054 approach, since it seems more generally applicable. And I can steal code for other platforms from v8.
JM was removed, we have pccounts now + integration with some profilers.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: