Open Bug 639085 Opened 9 years ago Updated 7 years ago

tool for simple coarse-granularity profiling

Categories

(Core :: JavaScript Engine, defect)

defect
Not set

Tracking

()

ASSIGNED

People

(Reporter: luke, Assigned: luke)

Details

Attachments

(4 files, 1 obsolete file)

Attached patch initial patchSplinter Review
This tool has the initial goal of getting some understanding of the potential speedup from DOM-bindings, but I also think it is useful for getting getting a quick and easy coarse answer to "why is this slow?" or "where is all the time going?"  The tool is just a hacked-up browser that spits out rdtsc times to a log whenever control flow enters/leaves various components of the JS engine and XPConnect.

Other than running fast, the reason for making a tool vs. just using automated profiling like normal is that it allows ad hoc customization of what component is blamed for what time.  For example: with the exception of the NS_InvokeByIndex/JS_CallFunctionValue at the heart of CallMethod, one would want to "charge" any JS engine time under CallMethod to CallMethod.

The tool is very much incomplete (only works on x86 linux) and immature (need to do more work verifying results) and primitive (it just spits out results for the entire lifetime of the browser main thread), but initial results (posted shortly) are interesting.

The code is in a user repo:
  http://hg.mozilla.org/users/lwagner_mozilla.com/measure
Its diff vs. TM is attached for reference.

A readme for building and interpreting the results:
  http://hg.mozilla.org/users/lwagner_mozilla.com/measure/file/53b547dfe953/js/timelog/README
Attached patch initial results (obsolete) — Splinter Review
Initial results are attached.  (See README for interpreting rows/columns.)  Here's a summary though (that is, only for the "groups", not individual components):

Dromaeo:
JS          39% ( 59169ms)
XPC         16% ( 24737ms)
GC           4% (  6440ms)
Other       40% ( 60847ms)

Fluid sim 
JS          56% ( 25240ms)
XPC          0% (   251ms)
GC           0% (   126ms)
Other       42% ( 18725ms)

http://mootools.net/slickspeed/
JS          25% (  3614ms)
XPC         16% (  2327ms)
GC           3% (   457ms)
Other       54% (  7564ms)

http://blog.frontendforce.com/2010/05/jquery-vs-mootools-nightly-benchmark/
JS          27% (  4475ms)
XPC         12% (  2095ms)
GC           4% (   728ms)
Other       55% (  9224ms)

http://andrewdupont.net/test/double-dollar/
JS          11% (   962ms)
XPC          4% (   334ms)
GC           1% (   143ms)
Other       82% (  6724ms)

Peacekeeper
JS          40% ( 70272ms)
XPC          4% (  7791ms)
GC           1% (  3314ms)
Other       53% ( 92839ms)

Facebook session
JS           5% (  3038ms)
XPC          2% (  1336ms)
GC           2% (  1442ms)
Other       89% ( 49872ms)

FF, mostly using chrome
JS           5% (  1454ms)
XPC          7% (  1864ms)
GC           2% (   739ms)
Other       84% ( 21705ms)

So (and again, I need to do a bit more work to verify the measurements) it seems like XPC generally consumes a small total percentage of browser time.  Its worse case is, not surprisingly, in DOM/JS benchmarks, but even then, it tops out at 16%.

Let me know if there are any more interesting benchmarks.  dvander suggested bug 638328 (which means I'll need to get this working on Windows since WebGL doesn't seem to initialize on my Linux machine).
You probably need to instrument the code generated by writaTraceableStub in qsgen.py too, though I wonder how much we're on trace nowadays... :(

The number for XPCWN_Ops on http://mootools.net/slickspeed/ is disturbing, though that may be all the DOM0 crap the libraries love.  Could we differentiate between the various stuff going through nsIXPCScriptable and the actual overhead of XPConnect on paths from JS into idl-defined C++ methods?

In general, 10-15% XPC+quickstub stuff on things that are touching the DOM a lot sounds about right.
(In reply to comment #3)
> You probably need to instrument the code generated by writaTraceableStub in
> qsgen.py too

I originally did just that.  However the average number of cycles under a traceable native (minus the time under the C++ target of course) was low enough that the rdtsc overhead (~30 cycles from my measurements x 4 (enter,pause,resume,leave)) was substantial.

Also, at an intuitive level, if we are on trace, life is generally good, so I was happy to lump that all together.

> though I wonder how much we're on trace nowadays... :(

You can see the answer in the detailed output in the attachment to comment 1: 9% of Dromaeo, 44% of V8, 7% SS, 49% fluid sim, 3% double dollar, 19% peacekeeper.

> The number for XPCWN_Ops on http://mootools.net/slickspeed/ is disturbing,
> though that may be all the DOM0 crap the libraries love.  Could we
> differentiate between the various stuff going through nsIXPCScriptable and the
> actual overhead of XPConnect on paths from JS into idl-defined C++ methods?

You bet.  Would that be simply time in XPC.*Helper.*Ops or is it something else?
Ah.  So the thing is, the quickstub cost is pretty similar both on and off trace.  So if we're measuring one but not the other we're under-estimating the cost of the quickstubs; how much depends on how much time we spend on trace.  We're underestimating it by no more than the time spent on trace in the data here.

30*4 is definitely comparable to the cost of an entire quickstub in many cases, for both the traceable and fastnative cases.

> Would that be simply time in XPC.*Helper.*Ops or is it something else?

For the classinfo stuff, I _think_ so.  

I guess you already differentiate that from the CallMethod stuff; that might be all we really need.  I admit I no longer remember that much about the non-quickstub stuff here.  ;)
(In reply to comment #5)
> Ah.  So the thing is, the quickstub cost is pretty similar both on and off
> trace.  So if we're measuring one but not the other we're under-estimating the
> cost of the quickstubs; how much depends on how much time we spend on trace. 
> We're underestimating it by no more than the time spent on trace in the data
> here.

Oh duh, you're right.  My mind was falsely dichtomizing trace+traceable-natives and not-trace+quickstubs.  I'll re-run with that and post with the next batch including dmandelin's benchmarks.

> 30*4 is definitely comparable to the cost of an entire quickstub in many cases,
> for both the traceable and fastnative cases.

I measured the average number of cycles in a quickstub and it varies (surprisingly much) between 300 and 1000.  So 120 cycles could really be noticeable.  Actually, assuming that roughly 1/2 of those cycles should be attributed to the "other" guy (i.e., code calling the quick stub and code called by the quick stubb), its more like 60.
 
> > Would that be simply time in XPC.*Helper.*Ops or is it something else?
> 
> For the classinfo stuff, I _think_ so.  

Cool, will split those from the rest.
> it varies (surprisingly much) between 300 and 1000.

Right; the exact number would depend on the number of arguments, on whether |this| and the arguments implement fast unwrapping paths, on whether there's a return value and the type of the return value.  A quickstub with slow this and argument conversions and several arguments would take a lot longer than a quickstub for nodeType.
(In reply to comment #7)
What I meant was that the *average* (of all quickstub calls in a given benchmark) varied between benchmarks.  But now I see (in attachment 517068 [details] [diff] [review]) that I understated the difference: Dromaeo had an average of 200 cycles; a Facebook session had an average of 1900, and just messing around with FF UI had an average of 2100!
Oh, that is _very_ interesting.  Ignoring the Fx UI for the moment, I wonder whether the difference is that Dromaeo calls "simpler" quickstubs or that Facebook does stuff we don't have optimized very well.

As long as you're in there, want to log data on _which_ quickstubs are being called how many times in the two cases (and probably file a separate bug with that data)?
I gave trace a lower priority than XPC code so XPC-from-trace now counts as such, but it mostly didn't change things (the biggest change was from Dromaeo spending 13% in QS to 14%).  I also added a category XPCWN_SH for the XPC WN ops that call scriptable helpers.  I ran the non-WebGL-requiring demos dmandelin suggested, there summaries are below (full results in the attachment, summaries below).  Next I need to get this working on Windows so I can run WebGL stuff (its also interesting to see if accelerated rending has a big effect on the "other" category).

Galactic Plunder (http://dougx.net/plunder/plunder.html)
JS           4% (  1098ms)
XPC          2% (   771ms)
GC           0% (   169ms)
Other       92% ( 24332ms)

Marble Run (http://marblerun.at/tracks/new)
JS          11% (  3689ms)
XPC          1% (   401ms)
GC           1% (   332ms)
Other       86% ( 28150ms)

http://alteredqualia.com/visualization/evolve
JS          80% ( 22592ms)
XPC          0% (   178ms)
GC           0% (    88ms)
Other       18% (  5150ms)

http://beadsjs.chrismorgan.info/bugzilla-clock.html
JS          72% ( 26218ms)
XPC          0% (   333ms)
GC           0% (   284ms)
Other       25% (  9398ms)
Attached file GMail session
Here is an interesting session we just measured (when considering how much time GMail actually spent in JS).

XPC time shows up higher than the other non-benchmark webapps at 10% of which half is quickstubs.  Bytecode compile time is a full half second.  Most interesting, though, is that for the one 839ms interval (that is, time between pumping the event queue), quickstub accounted for 84% of the time!  The average number for cycles per quickstub here is 27K cycles.  Note that the measurement includes all time in a quickstub minus the time in the target DOM call, so perhaps we are hitting some crazy object->native conversion path.
Hmm.  Or native->object, triggering a GC?  Or do the measurements exclude that sort of thing?
GC has the highest priority (http://bit.ly/gwT2lr) so that wouldn't count against XPC/Quickstubs.
Hrm.  In that case I can't think of a way this can possibly happen.  27k is just an insane number of cycles...

Is this at all reproducible?  Could we have quickstubs log the name of the quickstub any time the cycle count for the quickstub is > 10,000 or something?

For that matter, logging when it's > 1000 might be nice too.  Give us an idea of which stubs are slow.
Attached file Windows resuts
I got the measurement working on Windows last week, but I just got around to actually running it today.  As predicted, some of the graphics tests spend proportionately more time in JS and less in FF-other, but no sea change in the results.  For the other tests, there was actually a surprising parity in the relative percentages between the very different machines.

Some overall results concerning time spent in XPConnect:
- The DOM/JS benchmarks have the highest XPC times, at around 14-17%.
- Other tests that use the DOM, but not as a benchmark, seem to top out at 6% (gmail).
- The "FF chrome session" test (which was me just clicking around FF chrome, no addons installed) spends 4%/3% of the total time in XPCWrappedNative::CallMethod/XPCWrappedJSClass::CallMethod, respectively, which are our two mega-slow XPC paths.  Nothing else spent much time in these.
Attachment #517068 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.