Closed Bug 637393 Opened 13 years ago Closed 13 years ago

Profile execution of all ops, broken down by run mode

Categories

(Core :: JavaScript Engine, enhancement)

x86_64
Linux
enhancement
Not set
normal

Tracking


RESOLVED FIXED

People

(Reporter: sfink, Assigned: sfink)

References

Details

(Whiteboard: [fixed-in-tracemonkey])

Attachments

(4 files, 5 obsolete files)

Extension of bug 606629.

Add a facility to count up the number of times every instruction gets executed, broken down by "run mode" (currently: interpreting, tracejitting, methodjitting).

Sample output, slightly trimmed:

./js -j -m -p -e 'var y = 0; function f(x) { for(var i = 0; i < x; i++) { y += x; } }; f(1000);'
--- PC PROFILE -e:1 ---
counts    x off   line  op
---------   ----- ----  --
main:
1/0/0     x 00000:   1  zero
1/0/0     x 00001:   1  setlocal 0
0/0/0     x 00004:   1  pop
1/0/0     x 00005:   1  goto 29 (+24)
18/0/0    x 00008:   1  notrace 0
17/0/983  x 00011:   1  bindgname "y"
17/0/983  x 00014:   1  getglobal "y"
17/0/983  x 00017:   1  getarg 0
17/0/983  x 00020:   1  add
17/0/983  x 00021:   1  setgname "y"
0/0/983   x 00024:   1  pop
17/0/983  x 00025:   1  localinc 0
0/0/0     x 00028:   1  pop
18/0/983  x 00029:   1  getlocal 0
18/0/983  x 00032:   1  getarg 0
18/0/983  x 00035:   1  lt
0/0/0     x 00036:   1  ifne 8 (-28)
0/0/1     x 00039:   1  stop

It's a little mysterious, but that first column gives counts i/t/m, where i=times this instruction was executed in the interpreter, t=times while tracejitted, m=times while methodjitted. (I was looking at the effect of the -p option; here it just causes one more iteration to happen in the interpreter. Not sure if that's expected?)

Ideas for potential uses:
 - first glance at performance problems

 - test coverage tool: the dump includes scripts and line numbers, so a coverage tool would be straightforward. (It'd probably be better to base it off of the internally collected info rather than parsing the disassembly, though.) In particular, you could do per-engine coverage metrics.

 - could do ANOVA/linear regression to determine relative expense of different ops in the different engines (ignoring a huge number of variables)
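The coverage-tool idea above could be as simple as summing the per-runmode counts for each pc and marking its source line covered. A minimal sketch, assuming a flattened per-op record; the `RunMode`/`OpRecord` names and layout here are invented for illustration, not the actual JSScript representation:

```cpp
#include <cassert>   // for the usage check below
#include <cstddef>
#include <map>

// Run modes in the patch's terminology; layout is hypothetical.
enum RunMode { Interp = 0, TraceJIT, MethodJIT, RunModeCount };

struct OpRecord {
    unsigned line;                 // source line of this bytecode
    long counts[RunModeCount];     // execution counts per run mode
};

// A line counts as covered if any engine executed any op on it.
std::map<unsigned, bool> lineCoverage(const OpRecord *ops, size_t n) {
    std::map<unsigned, bool> covered;
    for (size_t i = 0; i < n; i++) {
        long total = 0;
        for (int m = 0; m < RunModeCount; m++)
            total += ops[i].counts[m];
        covered[ops[i].line] = covered[ops[i].line] || total > 0;
    }
    return covered;
}
```

With per-runmode totals instead of a combined sum, the same loop gives the per-engine coverage metrics mentioned above.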
Summary: Profile execution of all ops, broken down my run mode → Profile execution of all ops, broken down by run mode
At some point, I thought I needed a JSOP_RUNMODE that gives the current run mode. I can't remember why, but the real patch is (trivially) based on this, so for now I'm sticking it in here.

Apply this on top of bug 606629.
Create a set of execution counters on JSScript that count up the number of times every JS opcode is executed. Actual counting is controlled by an option bit on each JSContext. For now, hardwire the option on in the JS shell.

Applies after the JSOP_RUNMODE patch (which I should fix; this patch does not use the new JSOp.)
Assignee: general → sphink
Attachment #515687 - Attachment is obsolete: true
Attachment #515689 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #535459 - Flags: review?(dmandelin)
Oops. 'make check' tells me that I wasn't checking for the existence of profile data properly.
Attachment #535459 - Attachment is obsolete: true
Attachment #535459 - Flags: review?(dmandelin)
Attachment #535509 - Flags: review?(dmandelin)
Comment on attachment 535509 [details] [diff] [review]
Part 1: add profiling option to record run counts of each pc

Review of attachment 535509 [details] [diff] [review]:
-----------------------------------------------------------------

I have a few comments about public API and C++iness. The rest looks good. And I have one more question: do we want this to be debug builds only? The only reason to want this outside of a debug build that I can think of off hand is that someone might want to run this on a program that is painfully slow in debug mode. So maybe we want an --enable-counters build option that turns it on in non-debug builds as well.

::: js/src/jsapi.h
@@ +91,5 @@
> +/*
> + * Different "run modes" used for execution accounting (JSOPTION_PCPROFILE)
> + */
> +
> +#define JSRUNMODE_INTERP 0

Do these need to be in the public API? If so, OK, otherwise they should be an enum or some other typed thing.

::: js/src/jsscript.h
@@ +500,5 @@
>      uint32          *closedSlots; /* vector of closed slots; args first, then vars. */
>  
> +#ifdef DEBUG
> +    /* array of execution counters for every JSOp in the script, by runmode */
> +    int             *pcProfiles[JSRUNMODE_COUNT];

Let's make this a C++ class and encapsulate memory management and representation. Also, to me it seems preferable to allocate it as one array with accessors like 'get(RunMode, int)'.

::: js/src/methodjit/Compiler.cpp
@@ +892,5 @@
> +            RegisterID r1 = frame.allocReg();
> +            RegisterID r2 = frame.allocReg();
> +
> +            masm.move(ImmPtr(script->pcProfiles[JSRUNMODE_METHODJIT]), r1);
> +            Address pcCounter(r1, sizeof(int) * (PC - script->code));

I think you can bake in the address of the PC profile entry itself here. (That also applies to the tracer.)

One thing to think about is that we're going to want to do moving GC eventually. But I think that the script profile can and should be allocated in a nonmoving tenured generation, so it should be OK to assume that every counter element has a fixed address during the lifetime of the script.
Attachment #535509 - Flags: review?(dmandelin) → review-
Attachment #535461 - Flags: review?(dmandelin) → review+
(In reply to comment #7)
> Comment on attachment 535509 [details] [diff] [review] [review]
> Part 1: add profiling option to record run counts of each pc
> 
> Review of attachment 535509 [details] [diff] [review] [review]:
> -----------------------------------------------------------------
> 
> I have a few comments about public API and C++iness. The rest looks good.
> And I have one more question: do we want this to be debug builds only? The
> only reason to want this outside of a debug build that I can think of off
> hand is that someone might want to run this on a program that is painfully
> slow in debug mode. So maybe we want an --enable-counters build option that
> turns it on in non-debug builds as well.

tl;dr: you're right. Why did I make it DEBUG-only in the first place?

What are the tradeoffs? As I see it, the only benefit of making it DEBUG-only is performance overhead. Specifically, it adds this code to every interpreter DO_OP():

  if (pcProfile && !regs.fp()->hasImacropc())
      ++pcProfile[regs.pc - script->code];

where pcProfile is NULL, so it's an easily predicted branch. That cost should be low, but the feature would be so rarely used that perhaps it's still not worth it.

It also slows down mjit and tjit compilation very slightly.

If JSOPTION_PCPROFILE is turned on, it has significant costs in time and memory, of course.

I can think of two benefits of making it independent of DEBUG. First, there's no differing behavior between builds and no need for another configure switch, so you don't have to document it or worry about people being confused about why it's unavailable. Similarly, there are no future concerns about different formats when saving out scripts and their metadata if you later decide you want to include this information. I'm not sure why you would, but somebody might come up with a use case for either XDR or jsdbg2's remote debug protocol.

Second, I am thinking of uses for this where people would want it in their "normal" browser instance. Specifically, I've been hearing of some game developers who are starting to use JS and are really interested in figuring out how to write JS code so that it will stay on the JS engine's hot paths. The data this feature provides is a good first cut at answering the basic question "is my code getting JITted?" It doesn't answer *why* your code might be falling off trace or whatever, but it answers what is happening.

Btw, the uses I envision for this feature are (1) the "am I getting JITted? Which JIT?" question I mentioned above; and (2) code coverage ("what percentage of the lines are executed in this JS test case?")

Those are the relevant issues as I see them, and I'd lean towards making it independent of DEBUG. Not sure why I did that in the first place, other than having an extreme aversion to adding any overhead to the production browser.

> ::: js/src/jsapi.h
> @@ +91,5 @@
> > +/*
> > + * Different "run modes" used for execution accounting (JSOPTION_PCPROFILE)
> > + */
> > +
> > +#define JSRUNMODE_INTERP 0
> 
> Do these need to be in the public API? If so, OK, otherwise they should be
> an enum or some other typed thing.

Just considering the patches in this bug, no. But I'd actually like to expose this in a nicer way than disassembly dumps (eg for use in a code coverage tool), and for that I'll want them in the public API.

Do you have a better descriptor than "runmode"? "engine" seems too broad.

I could envision breaking it down further into "slowpath mjitted jsop" vs "fastpath mjitted jsop", if such things had any reasonable meaning.

> ::: js/src/jsscript.h
> @@ +500,5 @@
> >      uint32          *closedSlots; /* vector of closed slots; args first, then vars. */
> >  
> > +#ifdef DEBUG
> > +    /* array of execution counters for every JSOp in the script, by runmode */
> > +    int             *pcProfiles[JSRUNMODE_COUNT];
> 
> Let's make this a C++ class and encapsulate memory management and
> representation. Also, to me it seems preferable to allocate it as one array
> with accessors like 'get(RunMode, int)'.

Fair enough. I started out with a single array, but I duplicated the accessor math everywhere, which was messy and dumb. I'll try encapsulating it in a class.
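The encapsulation being discussed might look roughly like this: one flat calloc'd array indexed [runmode][bytecode offset], with a typed accessor so the index math lives in exactly one place. A sketch only; the class and member names are hypothetical, not the landed code:

```cpp
#include <cassert>   // for the usage check below
#include <cstddef>
#include <cstdlib>

static const int JSRUNMODE_COUNT = 3;  // interp, tracejit, methodjit

class PCCounters {
    int *counts;     // single allocation, JSRUNMODE_COUNT * length ints
    size_t length;   // number of bytecode bytes in the script

  public:
    PCCounters() : counts(NULL), length(0) {}
    ~PCCounters() { free(counts); }

    bool init(size_t len) {
        length = len;
        counts = static_cast<int *>(calloc(JSRUNMODE_COUNT * len, sizeof(int)));
        return counts != NULL;
    }

    // The accessor math is duplicated nowhere else.
    int &get(int runmode, size_t offset) {
        return counts[runmode * length + offset];
    }
};
```

Returning an `int &` also keeps the JIT-side use easy: the address of a single counter element can be taken and baked into generated code, per the review comment.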

Oh, and as suggested by your naming of --enable-counters, I think I'll s/profile/counter/ everywhere. 'Counter' is a more accurate term for this, and it avoids conflicting with JSOPTION_PROFILING.

> 
> ::: js/src/methodjit/Compiler.cpp
> @@ +892,5 @@
> > +            RegisterID r1 = frame.allocReg();
> > +            RegisterID r2 = frame.allocReg();
> > +
> > +            masm.move(ImmPtr(script->pcProfiles[JSRUNMODE_METHODJIT]), r1);
> > +            Address pcCounter(r1, sizeof(int) * (PC - script->code));
> 
> I think you can bake in the address of the PC profile entry itself here.
> (That also applies to the tracer.)

I just wrote a long explanation of how I did it this way because there's no Address constructor that takes a single register. Then I tried it, and it compiled fine for both TM and JM. Doh!

Ok, done, thanks!

> One thing to think about is that we're going to want to do moving GC
> eventually. But I think that the script profile can and should be allocated
> in a nonmoving tenured generation, so it should be OK to assume that every
> counter element has a fixed address during the lifetime of the script.

That was in the back of my mind, but I figured I could worry about it later.
Attachment #535509 - Attachment is obsolete: true
Attachment #537301 - Flags: review?(dmandelin)
Attachment #535460 - Attachment is obsolete: true
Attachment #535460 - Flags: review?(dmandelin)
Attachment #537303 - Flags: review?(dmandelin)
(In reply to comment #8)
> > Do these need to be in the public API? If so, OK, otherwise they should be
> > an enum or some other typed thing.
> 
> Just considering the patches in this bug, no. But I'd actually like to
> expose this in a nicer way than disassembly dumps (eg for use in code
> coverage tool), and for that I'll want them in the public API.

OK.

> Do you have a better descriptor than "runmode"? "engine" seems too broad.

Not really. I tend to think of "execution engine" as the technical term, but it doesn't lend itself to nice names, and "runmode" seems at least as understandable.
Attachment #537301 - Flags: review?(dmandelin) → review+
Attachment #537303 - Flags: review?(dmandelin) → review+
When I pushed this to try, it complained because I added a new UnwantedForeground free. I don't have any strong justification for it, since it's really just another field on JSScript so its lifetime may as well be managed like everything else.

On a separate note, eval-cached scripts' counters will accumulate until they are finally destroyed. It would be easy to make counters reset if their owning script is reused (I have a patch for it), but on the other hand it seems kind of nice to aggregate repeated use of trivial, cached scripts. Back to that first hand, it may allow some observability into when GC happens if JS-level accessors are added, which is a little bit of a no-no though probably ok for something like this. Do you think one way or the other would be better?

Thanks!
Attachment #537741 - Flags: review?(dmandelin)
Attachment #537741 - Flags: review?(dmandelin) → review+
(In reply to comment #11)
> On a separate note, eval-cached scripts' counters will accumulate until they
> are finally destroyed. It would be easy to make counters reset if their
> owning script is reused (I have a patch for it), but on the other hand it
> seems kind of nice to aggregate repeated use of trivial, cached scripts.
> Back to that first hand, it may allow some observability into when GC
> happens if JS-level accessors are added, 

Where would the JS-level accessors be callable from? I would tend to think they could only be used from chrome code, debug APIs, or something like that, in which case merging the counters seems OK.

> which is a little bit of a no-no
> though probably ok for something like this. Do you think one way or the
> other would be better?
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 671602
Blocks: 676515
Did the branch on if (pcCounts ...) on every opcode dispatch not ding performance noticeably? See bug 679136.

/be
(In reply to Brendan Eich [:brendan] from comment #15)
> Did the branch on if (pcCounts ...) on every opcode dispatch not ding
> performance noticeably? See bug 679136.
> 
> /be

I've spent a bunch of time trying to answer that question. I have a patch that adds various configure options to compile out counters completely, compile them out of the interpreter but leave them on for both JITs, or leave them in everywhere. Here's my latest timing output, where it ran the v8 suite 27 times on each configuration. I threw out the debug non-jit runs this time because they take too long.

CONFIG                   RUNS    MEAN    MEDIAN
opt jit fullcounts         27  4499.3  4504.0
opt jit nointerp           27  4482.0  4490.0
opt jit none               27  4491.3  4489.0
opt jit dtrace             27  4489.1  4489.0
opt nojit fullcounts       27   673.3   674.0
opt nojit nointerp         27   719.8   720.0
opt nojit none             27   686.3   689.0
opt nojit dtrace           27   684.3   686.0
debug jit fullcounts       27  1209.7  1215.0
debug jit nointerp         27  1211.9  1215.0
debug jit none             27  1217.9  1225.0
debug jit dtrace           27  1220.6  1227.0
debug nojit fullcounts      0     0.0     0.0
debug nojit nointerp        0     0.0     0.0
debug nojit none            0     0.0     0.0
debug nojit dtrace          0     0.0     0.0

(ignore the 'dtrace' runs.)

'jit' means running with -j -m -p. 'nojit' means running with no flags (so interpreter only.)

So with the optimized jit runs, it oddly ran faster with the extra branch. That's what's been driving me crazy. The nojit runs are confusing too -- there, the 'fullcounts' runs are slowest, as expected, but the 'nointerp' runs (where pccounts were compiled in for the JIT but not the interpreter) were strangely fast.

In debug builds, the results are in exactly the order I would expect (though I would have expected a larger hit from the counters.)

I hadn't intended for this to be compiled in for optimized builds to begin with. When jimb noticed that it was, sometime last week, I made the patch to compile it out, and then tried to measure the timings to see whether I could leave it in the 'nointerp' state even in release builds. But I kept consistently seeing faster numbers from the optimized JITted case.

Plotting the individual runs shows that the ranges are heavily overlapped, so it's possible that I'm just getting unlucky by getting the final weird result every time I aggregate a batch of test runs. But I suppose it's vaguely possible that the effect is real, and the hit from 200-odd well-predicted branches is offset by mystical alignment adjustments on my test box.
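The configure-time gating described above (compiling the check out of the interpreter entirely, versus leaving a runtime-disabled branch in) can be sketched as an #ifdef'd macro around the DO_OP() increment. The macro name JS_PCCOUNTERS is made up here; the actual configure option names may differ:

```cpp
#include <cassert>   // for the usage check below
#include <cstddef>

// Hypothetical configure switch standing in for --enable-counters.
#define JS_PCCOUNTERS 1

#ifdef JS_PCCOUNTERS
// Compiled in: each dispatch pays one well-predicted NULL-check branch
// when profiling is off at runtime.
# define COUNT_PC(pcProfile, offset) \
    do { if (pcProfile) ++(pcProfile)[offset]; } while (0)
#else
// Compiled out: no branch, no code at all in the dispatch loop.
# define COUNT_PC(pcProfile, offset) ((void)0)
#endif

// Toy stand-in for the interpreter loop, exercising the macro.
int runLoop(int *pcProfile, size_t nOps) {
    int executed = 0;
    for (size_t pc = 0; pc < nOps; pc++) {
        COUNT_PC(pcProfile, pc);
        executed++;
    }
    return executed;
}
```

The 'nointerp' configuration corresponds to defining the switch for the JIT compilers but not for the interpreter's translation unit, so only compiled code carries the increments.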
Oh, and not that it matters, all those test runs are done in a separate cpu+memory cgroup. I was hoping that might be enough to make real time meaningful, but I guess not.
I've had a lot of problems like this myself. One thing that helps a lot on V8 is to disable ASLR. I get much more consistent results that way.

Also, I've found that I get more consistent results if I run V8 for a fixed number of iterations rather than having it be time-based. However, it's important that the number of iterations be about the same as what you would have run normally. Otherwise the GC behavior can be quite different.