Bug 642054 (Closed): Opened 13 years ago, Closed 11 years ago

Implement a profiler for JavaScript

Categories

(Core :: JavaScript Engine, defect)
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED INVALID

People

(Reporter: dmandelin, Assigned: sfink)

References

Details

Attachments

(5 files, 11 obsolete files)

We should have a profiler that charges time against scripts/code-lines/JSOps.
The hottest "code" is:

    3786    0.740 unknown jitcode/stub called via jitcode 0

This is probably either (a) IC jitcode or (b) C++ functions called from jitcode. For (a), we can just add IC jitcode chunks to the map, although this may require a more scalable data structure (currently I scan a vector). (b) is harder, but it seems reasonable to try to walk the C stack back until we reach jitcode. For long-running C++ functions (e.g., js_GC), we will probably want to record that we are in that function and put it in the profile explicitly.
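One option for the more scalable structure, just as a sketch: keep the jitcode chunks sorted by start address and binary-search on lookup. The JitcodeRange/JitcodeMap names below are made up for illustration, not from the patch.

  // Sketch of an address-range map for jitcode chunks, kept sorted by
  // start address so lookup is a binary search rather than a linear scan.
  #include <algorithm>
  #include <vector>
  #include <cstdint>

  struct JitcodeRange {
      uintptr_t start;
      uintptr_t end;       // one past the last byte of the chunk
      const char *label;   // e.g. "script.js:123" or "IC stub"
  };

  class JitcodeMap {
      std::vector<JitcodeRange> ranges;  // sorted by start address
  public:
      void add(const JitcodeRange &r) {
          ranges.insert(std::lower_bound(ranges.begin(), ranges.end(), r,
                            [](const JitcodeRange &a, const JitcodeRange &b) {
                                return a.start < b.start;
                            }),
                        r);
      }
      const JitcodeRange *lookup(uintptr_t pc) const {
          // Find the first range starting after pc, then step back one.
          auto it = std::upper_bound(ranges.begin(), ranges.end(), pc,
                        [](uintptr_t p, const JitcodeRange &r) {
                            return p < r.start;
                        });
          if (it == ranges.begin())
              return nullptr;
          --it;
          return pc < it->end ? &*it : nullptr;
      }
  };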
I'm starting to get some possibly useful output. Here is 'earley', 100 iters, with coarse-grained classification on an opt build. Lots of time in lambda/callobj/arguments.
Attached file 'earley' by script
Output by script. Most of our time is in the top 7, which are sc_list, a couple of local lambdas (loop2, loop3), sc_Pair, another pair of local lambdas (deriv_trees, sc_loop1_98), and sc_append.
Same as the previous, but broken down by activity and line number within script.
Attached file Postprocessing script
This postprocesses profiler output into summaries by activity and script.
The v4 patch seems to be missing jsprofiler.*

What is the plan for this going forward?  Is there a plan to make it usable with the browser and convey profiling info without needing postprocessing?  Not sure whether to look more into this timer-based approach or to use performance counters as in bug 652535.
Blocks: 652535
Attached patch with jsprofiler.cpp (obsolete) — Splinter Review
(In reply to comment #10)
> The v4 patch seems to be missing jsprofiler.*
> 
> What is the plan for this going forward?  Is there a plan to make it usable
> with the browser and convey profiling info without needing postprocessing?  

I would like to see that happen.

> Not sure whether to look more into this timer-based approach or to use 
> performance counters as in bug 652535.

Both are potentially interesting and seem worthwhile. Performance counters seem to have the advantage of being more customizable, while sampling has the advantage of (with a bunch more work) being able to charge time to C++ slow calls by name.
Attached patch SIGPROF-based JS profiler (obsolete) — Splinter Review
I took a look at this today. I'm attaching an updated version that works on x86_64 (and attempts to work on ARM, though I haven't tried compiling there) and also works around some of the threading issues.

I restricted the SIGPROF signal to just the main thread as much as possible. It should be enough for now. Otherwise, I get way too many collisions between multiple threads running the signal handler at the same time.

However, this still leaves some pretty big signal-related issues. As written, the signal handler has to do async-signal-unsafe operations -- allocating memory, in particular, is very bad. I get a deadlock within a second or so because SIGPROF fires while the thread is inside malloc, and the handler's own allocations then deadlock on malloc's lock.

To get this to work, it seems like we'd need to queue up the gathered information and send it off to another thread to be processed. But that'll also require keeping anything the consumer needs alive in memory (so, e.g., script->filename and JIT code don't go away). I'm not sure how doable that is. I also don't know whether that would be enough to allow collection on all threads.

Hmm... thinking more, perhaps the thing to do is to send everything over to another thread for processing: every JIT code creation event would report the generated address range, along with the script filename and whatever else was needed. It adds overhead even for things that never get a sample hit, but the signal handler would only need to write() (or msgsnd() or tack onto a lockless finite queue or whatever). It would report the IP, the profiler's current activity value, and perhaps a backtrace of IPs.
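To make that concrete, the "lockless finite queue" part could look something like this sketch -- assuming the sampled thread's handler is the only producer and a single processing thread is the only consumer. SampleRecord/SampleQueue are illustrative names, not from any patch.

  // Sketch: async-signal-safe sample queue. The SIGPROF handler only
  // writes fixed-size records; a separate thread drains and resolves
  // them. Single producer (the handler) / single consumer assumed.
  #include <atomic>
  #include <cstddef>
  #include <cstdint>

  struct SampleRecord {
      uintptr_t ip;        // sampled program counter
      uint32_t activity;   // profiler's current activity tag
  };

  class SampleQueue {
      static const size_t kSize = 4096;    // power of two
      SampleRecord buf[kSize];
      std::atomic<size_t> head{0};         // written only by the handler
      std::atomic<size_t> tail{0};         // written only by the consumer
  public:
      // Called from the signal handler: no locks, no allocation.
      bool push(const SampleRecord &r) {
          size_t h = head.load(std::memory_order_relaxed);
          size_t t = tail.load(std::memory_order_acquire);
          if (h - t >= kSize)
              return false;                // full: drop the sample
          buf[h & (kSize - 1)] = r;
          head.store(h + 1, std::memory_order_release);
          return true;
      }
      // Called from the processing thread.
      bool pop(SampleRecord *out) {
          size_t t = tail.load(std::memory_order_relaxed);
          size_t h = head.load(std::memory_order_acquire);
          if (t == h)
              return false;                // empty
          *out = buf[t & (kSize - 1)];
          tail.store(t + 1, std::memory_order_release);
          return true;
      }
  };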
(In reply to comment #13)
> However, this still leaves some pretty big signal-related issues. As
> written, the signal handler has to do unsafe operations -- specifically,
> allocating memory is very bad. I get a deadlock within a second or so
> because SIGPROF fires within malloc and so it deadlocks with its own
> allocations.

You might want to look at the approach in bug 652535, which doesn't use signals and profiles a randomized sample of call entries/exits at a pretty small runtime cost.  Doesn't have the fidelity of this approach (just reports time per script), but signal handlers might be even more problematic going forward with IonMonkey's plan to put the JS stack on the C stack.

Is it possible to avoid signals and just have another thread periodically poll the main thread to see what it's doing?  Vaguely recall that's how V8's profiler works, but I never looked at it in detail.
(In reply to comment #14)
> You might want to look at the approach in bug 652535, which doesn't use
> signals and profiles a randomized sample of call entries/exits at a pretty
> small runtime cost.  Doesn't have the fidelity of this approach (just
> reports time per script), but signal handlers might be even more problematic
> going forward with IonMonkey's plan to put the JS stack on the C stack.

Cool, I've read that bug but hadn't noticed your sampling trick.

> Is it possible to avoid signals and just have another thread periodically
> poll the main thread to see what it's doing?  Vaguely recall that's how V8's
> profiler works, but I never looked at it in detail.

I was trying to think of a way to do that. On Linux, I could use ptrace, which is a little gross. I guess I should look at V8.

systemtap would let me grab an IP and stack at every context switch.
I think it would be really cool if we could make use of the new Linux perf infrastructure. It's incredibly powerful, and it's undergoing a ton of development right now.

You can hook into it in various ways. At the lowest level, you can make a few syscalls and it will start filling a circular buffer with IP/stack trace info. But you can also get lots more info about cache misses, IPC, etc. It would be awesome if we could connect this back to the JS source.

Here's some info:
  https://perf.wiki.kernel.org/index.php/Main_Page
  http://www.linux-kongress.org/2010/slides/lk2010-perf-acme.pdf
Unfortunately, the best docs are in the kernel tree itself, under tools/perf. However, on most distributions you can try this stuff out with a command like |perf stat ls|. At most, you'll have to install a package that's specific to the kernel you're running.
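For the syscall-level route, a minimal counter setup looks roughly like the example in the perf_event_open(2) man page. This is just a sketch -- it counts retired instructions for the current process and isn't wired into the JS engine at all.

  // Count retired instructions for this process via perf_event_open(2).
  #include <linux/perf_event.h>
  #include <asm/unistd.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <cstring>
  #include <cstdio>
  #include <cstdint>

  static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                  int cpu, int group_fd, unsigned long flags) {
      return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  int main() {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.type = PERF_TYPE_HARDWARE;
      attr.size = sizeof(attr);
      attr.config = PERF_COUNT_HW_INSTRUCTIONS;
      attr.disabled = 1;
      attr.exclude_kernel = 1;
      attr.exclude_hv = 1;

      // Measure this process (pid 0) on any CPU (-1).
      int fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
      if (fd == -1) {
          perror("perf_event_open");
          return 1;
      }

      ioctl(fd, PERF_EVENT_IOC_RESET, 0);
      ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

      for (volatile int i = 0; i < 1000000; i++) {}  // stand-in for real work

      ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
      uint64_t count = 0;
      if (read(fd, &count, sizeof(count)) == sizeof(count))
          printf("instructions: %llu\n", (unsigned long long)count);
      close(fd);
      return 0;
  }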
(In reply to comment #14)
> Is it possible to avoid signals and just have another thread periodically
> poll the main thread to see what it's doing?  Vaguely recall that's how V8's
> profiler works, but I never looked at it in detail.

I've just skimmed through v8's profiler. On Linux, they do pretty much what I said in the last paragraph of comment 13 (though it's just a thread, so no write() etc silliness needed). They use SIGPROF, but the signal handler just adds some fixed-size info onto the head of a circular queue. A separate profiler thread reads the records off and processes them. Interestingly, they don't use an interval timer to trigger the SIGPROFs. Instead, the profiler thread sends them. A code comment says they get much better accuracy from

  while (1) {
    send signal;
    usleep();
  }

than from an interval timer. Interesting.

On Windows, the scheduling is the same (a while loop with a sleep for the profiler thread), but the profiler can use GetThreadContext(profiled_thread) to reach into the other thread's state.

OSX is similar to Windows, but the profiler thread suspends the profiled thread first, then uses thread_get_state() to peek into it.

v8 has two types of profiling: CPU profiling and JS profiling. (I think the latter might be called runtime profiling in the source?) CPU profiling grabs the ip, sp, fp, and I think the stack and some global "VM state" context marker.

JS profiling notes whether or not JS code is running at that sample instant, and computes a windowed average that answers "what percent of the time are we currently spending running JS code?" They use it to control how much optimization to do: <20% JS means never optimize, <75% means only optimize really hot stuff (they triple the threshold), and >=75% means normal behavior: check which JS functions exceed the optimization threshold and queue them up for optimization.

Funny optimization hack: if only JS profiling is enabled, it'll notice if no JS has run for a while, and suspend the profiler thread until JS code is reentered.

v8 has to track executable code creation/destruction (as do we), and handle code movement triggered by their moving GC. If something in the sampling window moves, they update the pointer within the window.
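For reference, the Linux-side scheduling they describe amounts to something like the sketch below. It assumes a single profiled thread; profiled_thread and interval_us are placeholders, not names from v8 or from any patch here.

  // A profiler thread drives the sampling by signalling the profiled
  // thread directly, instead of using an interval timer.
  #include <pthread.h>
  #include <signal.h>
  #include <unistd.h>
  #include <atomic>

  static std::atomic<bool> profiling(true);

  void *SamplerThread(void *arg) {
      pthread_t profiled_thread = *static_cast<pthread_t *>(arg);
      const useconds_t interval_us = 1000;   // ~1 kHz sampling
      while (profiling.load(std::memory_order_relaxed)) {
          // The SIGPROF handler on the target thread records one sample.
          pthread_kill(profiled_thread, SIGPROF);
          usleep(interval_us);
      }
      return NULL;
  }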
Here's a WIP update that resolves some of the concurrency issues and might conceivably work across multiple threads, though I haven't tried it yet.

It's big and ugly, Linux-only, and has numerous remaining concurrency problems. I don't think it handles context shutdown or even script discard yet.

But it gives output similar to dmandelin's original output, and now survives my test cases that hung with the older version.
Assignee: general → sphink
Status: NEW → ASSIGNED
Did a bunch of cleanup. Starting to resemble something real. The biggest concurrency issues are now resolved, though several remain. (The asynchronous signal handler is a surprisingly large pain.)

For correctness, this patch depends on the one I'm about to post, which makes it possible to pass a NULL context to JS_PCToLineNumber; it'll still run without that patch, though.
Even though JS_PCToLineNumber just scans through the src notes attached to a JSScript, it requires a JSContext*, and, if you're really unlucky, it will temporarily grab the debugger lock. (Specifically, if you give it the initial PC of a function definition -- JSOP_DEFFUN -- and you've set a trap on that PC, it needs to take the debugger lock so it can access the trap table.)

That makes it useless for the sampling profiler, because you might get a SIGPROF while the debugger lock is held, and you'd deadlock. (And NSPR doesn't provide a trylock.)
That last one tends to seg fault pretty quickly when it accesses a partially-built stack frame.

As a (perhaps) temporary workaround, I've updated the patch to mark the stack as off-limits while the interpreter is pushing a frame, and while in JIT code or in unlabeled code called from JIT code. (The latter should be possible to eliminate.) The tracejit is not handled specially; I haven't really thought through what it'll need.

This survives earley-boyer, at least. Next I want to validate that the output is sane, before I attempt to resolve the remaining concurrency issues.
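The off-limits marking amounts to something like this sketch, assuming the signal is delivered on the same thread that's pushing the frame, so a plain sig_atomic_t flag suffices; the names are illustrative, not from the patch.

  // Mark the JS stack as off-limits around frame pushes so the SIGPROF
  // handler skips samples taken mid-construction.
  #include <csignal>

  static volatile sig_atomic_t stackBusy = 0;

  struct AutoStackBusy {
      AutoStackBusy() { stackBusy = 1; }
      ~AutoStackBusy() { stackBusy = 0; }
  };

  // In the interpreter, around pushing and initializing a frame:
  //   { AutoStackBusy busy; /* push the frame */ }

  static void SigprofHandler(int) {
      if (stackBusy)
          return;  // drop the sample rather than walk a half-built frame
      // ...otherwise record ip/activity into the sample queue...
  }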
I still haven't marked all of the places where the stack is off-limits, and I haven't implemented the script lifetime management, but I wanted to stare at the data for a while as a sanity check and to see whether it's collecting any useful info yet. For a visual display of what it's producing, please check out http://10.250.2.131/~sfink/v8/ui.html (from within the moz network).

For earley-boyer, the main test case I've been staring at for this bug, it very visibly continues to point the finger at STUB_CREATETHIS (stubs::CreateThis).

Note that the above URL gives data for js -m run-<test>.js, so it's not representative of cases where we would actually trace a significant portion. There's nothing stopping me from running it with -m -t or whatever; it's just that I only have hooks for providing rich data when running under the mjit.
Updated a bit and moved to a public URL: http://people.mozilla.org/~sfink/v8/ui.html

This has the bulk of the "unknown" bits categorized, though it'd be really nice to get symbol information to fill in the rest.
Blocks: 675102
I don't want to deal with juggling the patches here, but I'd like to keep a more or less working copy available, so I'm going to obsolete the patches and use http://hg.mozilla.org/users/sfink_mozilla.com/patches-jsprofiler/ as a public patch queue instead.
SPS covers this now.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID