Bug 1109326 (open) - opened 10 years ago, updated 2 years ago

Marking baseline JIT stubs is slow

Categories: Core :: JavaScript: GC, defect
Platform: x86_64 Linux
People: Reporter: sfink; Assignee: unassigned
References: blocks 1 open bug
Attachments: 3 files

Attached file roots.log
I was looking at the GC timings while running Octane, and I noticed a surprising amount of time spent in some of the major GCs. I traced it to the exact root marking, then to marking JSScripts, and finally to marking baseline jitcode. On my machine, I was seeing 1.3ms spent in root marking in markRuntime, the majority of it in BaselineScript::trace.

For one run, marking a set of 11 JSScripts took 1446 usec (1.446ms), 750 usec of which was in BaselineScript::trace. That covered a total of 33465 stubs on 23858 IC entries.

That's a lot of stuff to mark. I don't know how common cases like this are, but I'd like to check with djvj to see if he has any easy ideas for speeding this up that just haven't been done because they weren't needed or something. Like, I dunno, keeping a separate table of stable ICs somewhere that can be linearly scanned to find all the pointers to all the pointers. (I have an uneasy feeling in my gut that we'll end up with a JSObject*** from this somehow...)
The attachment is the output of a set of very messy patches to gather more timing information. Probably the most interesting bit is the stuff leading up to

  Marked 47 exact roots in 1.514000ms (major)

Info for that GC starts after the preceding

  Marked 47 exact roots in 0.074000ms (minor)

ni? djvj for his quick take on the situation.
Flags: needinfo?(kvijayan)
Good find.  I suppose a large part of the overhead comes from the fact that we have to scan all ICs, even though a good chunk of the stub memory doesn't contain gcthing pointers.

A table listing all traceable (i.e. non-leaf) stubs would be one approach, as you suggest.  So: an array hanging off of the BaselineScript, to which new traceable stubs append themselves.

Alternatively, we could just hang another singly linked list of traceable stubs off of each BaselineScript.  Every traceable stub would get an extra ICStub *nextTraceable field, and would add itself to the per-script list on creation.
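
Roughly like this - a minimal sketch, with all field and method names made up rather than taken from the actual ICStub/BaselineScript layout:

  // Hypothetical sketch: each traceable stub carries one extra link, and
  // the BaselineScript keeps the list head.
  struct JSTracer;  // stand-in for the real tracer type

  class ICStub {
      ICStub* next_ = nullptr;           // existing per-IC chain link
      ICStub* nextTraceable_ = nullptr;  // new: per-script traceable-list link
    public:
      ICStub* next() const { return next_; }
      ICStub* nextTraceable() const { return nextTraceable_; }
      void setNextTraceable(ICStub* stub) { nextTraceable_ = stub; }
      void trace(JSTracer* trc) { /* mark this stub's gcthing fields */ }
  };

  class BaselineScript {
      ICStub* firstTraceableStub_ = nullptr;  // head of the traceable list
    public:
      // Called once, when a traceable stub is created and linked into an IC.
      void registerTraceableStub(ICStub* stub) {
          stub->setNextTraceable(firstTraceableStub_);
          firstTraceableStub_ = stub;
      }

      // GC then walks only the stubs that actually hold gcthing pointers,
      // instead of every stub on every IC chain.
      void trace(JSTracer* trc) {
          for (ICStub* s = firstTraceableStub_; s; s = s->nextTraceable())
              s->trace(trc);
      }
  };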

Do you know the distribution of stubs (all stubs, as well as only traceable stubs) across all baseline scripts?
Flags: needinfo?(kvijayan) → needinfo?(sphink)
Attached file single-script.txt
Finally got around to dumping this out. I didn't set it up to correspond very well to my earlier report, but I'm attaching a dump of stub counts for a single JSScript. (It seems to show 4 baseline scripts -- can one JSScript have more than one baseline script, or is this just from the recursive marking encountering other baseline scripts along the way?)

In brief, it shows lots and lots of ICs with either 1 or 2 stubs, very few with zero, and none with more than that. (Though other baseline scripts *do* show more. 6, in particular, seems to be a somewhat frequent number.)
Flags: needinfo?(sphink)
Or better, here's a dump for a full Octane run, with "-- collect --" markers between different collections.

Sorry, I should process this down into a histogram or something. But I'm not 100% sure what's useful to you here.
(In reply to Steve Fink [:sfink, :s:] from comment #4)
> Created attachment 8539563 [details]
> Stub counts for all GCs during an Octane run
> 
> Or better, here's a dump for a full Octane run, with "-- collect --" markers
> between different collections.
> 
> Sorry, I should process this down into a histogram or something. But I'm not
> 100% sure what's useful to you here.

Specifically, to figure out whether it would be helpful to keep a separate list of only the traceable stubs (excluding the ones which contain no gcthing pointers), what would be most useful is the overall runtime ratio of traceable stubs to stubs with no gcthing pointers.

If that ratio is low (e.g. 20%), then keeping a separate list of traceable stubs should bring this time down a lot.  If that ratio is high (e.g. we find that 80% of stubs at runtime need to be scanned for heapptrs), then we need to look at other approaches.
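
For concreteness, the kind of instrumentation I mean (counter and function names are made up; this would hang off whatever loop BaselineScript::trace uses to walk the IC chains):

  // Hypothetical counters: bump them as the tracer visits each stub, then
  // report the fraction of stubs that actually needed tracing.
  #include <cstddef>
  #include <cstdio>

  static size_t totalStubs = 0;
  static size_t traceableStubs = 0;

  void countStub(bool traceable) {
      totalStubs++;
      if (traceable)  // i.e. the stub holds gcthing pointers
          traceableStubs++;
  }

  void reportStubRatio() {
      double pct = totalStubs ? 100.0 * traceableStubs / totalStubs : 0.0;
      fprintf(stderr, "traceable stubs: %zu / %zu (%.0f%%)\n",
              traceableStubs, totalStubs, pct);
  }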
We're also doing this for minor GCs? That seems unnecessary, AFAIK all objects should be in the store buffer...
(In reply to Jan de Mooij [:jandem] from comment #6)
> We're also doing this for minor GCs? That seems unnecessary, AFAIK all
> objects should be in the store buffer...

No, we are not doing this for minor GCs. For those, JSScripts get marked with js::Nursery::MinorGCCallback, which doesn't recurse through children.

If I look at what actually gets marked, I see that dynamically, 68% of all stubs in Octane are untraceable. Which is sadly in the middle.
That's still pretty good, no?  If we segregate the traceable and non-traceable stubs, this indicates we should see ~70% improvement in trace time for stubs.  Also, if we do the segregation in memory instead of using a separate linked list, we should get better memory locality for the stub tracing as well.

Currently, the tracer simply walks the IC chains one by one.  But if you consider how stubs get added, this is going to step around a bit - all the fallback stubs are in their own memory space.  Then, in each iteration, a new stub will be added to the chain.  Considering, for example, three ICs, we end up with this sort of configuration:

   IC1 -> OptStub[1] -> FallbackStub[1]
   IC2 -> OptStub[2] -> FallbackStub[2]
   IC3 -> OptStub[3] -> FallbackStub[3]

We'd expect all the fallback stubs to be contiguous in memory region A, and the opt stubs to be contiguous in memory region B.  The scan will end up hitting B, A, B, A, B, A successively.  Throw in pointer sparsity caused by non-traceable stubs, and the cache behaviour is even more affected.

Segregating the stubs will introduce more memory non-locality to the actual runtime execution of baseline (since traceable and non-traceable stubs won't be allocated from the same memory region anymore), but that should be OK since baseline is really poor for memory locality anyway - the vast majority of the time is spent jumping from globally shared stubcode to other globally shared stubcode, and little prediction ability on the part of the processor to figure out where the next jump is.  So the impact of doing this should be minimal on actual runtime, and should show a 70% improvement in tracing stubcode.
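
To make the segregation concrete, here's a rough sketch of the allocation side - the region and method names are invented, and the real stub allocator is chunked rather than a single fixed buffer:

  // Hypothetical sketch: traceable stubs bump-allocate from one contiguous
  // region and leaf stubs from another, so the GC scan walks a single
  // dense block instead of hopping between interleaved allocations.
  #include <cstddef>
  #include <cstdint>

  class BumpRegion {
      uint8_t* cur_;
      uint8_t* end_;
    public:
      BumpRegion(uint8_t* start, size_t size) : cur_(start), end_(start + size) {}
      void* alloc(size_t bytes) {
          if (size_t(end_ - cur_) < bytes)
              return nullptr;  // the real allocator would chain a new chunk
          void* p = cur_;
          cur_ += bytes;
          return p;
      }
  };

  class StubSpace {
      BumpRegion traceable_;     // region A: stubs holding gcthing pointers
      BumpRegion nonTraceable_;  // region B: leaf stubs, never scanned by GC
    public:
      StubSpace(BumpRegion a, BumpRegion b)
        : traceable_(a), nonTraceable_(b) {}
      void* allocStub(size_t bytes, bool traceable) {
          return (traceable ? traceable_ : nonTraceable_).alloc(bytes);
      }
  };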
(In reply to Kannan Vijayan [:djvj] from comment #8)
> That's still pretty good, no?  If we segregate the traceable and
> non-traceable stubs, this indicates we should see ~70% improvement in trace
> time for stubs.

Sure, I don't want to argue you out of it! I just don't entirely trust the representativeness (is that even a word?) of this benchmark result. The majority of Octane spends essentially no time doing this marking; it's only a couple of benchmarks near the end that would see the big win, and I have no idea what that means for actual web code. It could be quite significant, though, and even if not, this whacks down one of our worst-case moles (as in, one of the things that in some cases will bloat out the initial slice of an incremental GC, which is currently a big problem).

> Also, if we do the segregation in memory instead of using a
> separate linked list, we should get better memory locality for the stub
> tracing as well.

Yeah, seems good for prefetching.

> Currently, the tracer simply walks the IC chains one by one.  But if you
> consider how stubs get added, this is going to step around a bit - all the
> fallback stubs are in their own memory space.  Then, in each iteration, a
> new stub will be added to the chain.  Considering for example, three ICs, we
> end up with this sort of configuration:
> 
>    IC1 -> OptStub[1] -> FallbackStub[1]
>    IC2 -> OptStub[2] -> FallbackStub[2]
>    IC3 -> OptStub[3] -> FallbackStub[3]
> 
> We'd expect all the fallback stubs to be contiguous in memory region A, and
> the opt stubs to be contiguous in memory region B.  The scan will end up
> hitting B, A, B, A, B, A successively.

I'm just quibbling, but bouncing between two separate areas doesn't seem to me like it'd be that awful. Even if they share the same lower-order bits, it shouldn't matter as long as you have at least 2-way associativity.

> Throw in pointer sparsity caused by
> non-traceable stubs, and the cache behaviour is even more affected.

I'll buy that. Sparse cache utilization is gonna hurt.

> Segregating the stubs will introduce more memory non-locality to the actual
> runtime execution of baseline (since traceable and non-traceable stubs won't
> be allocated from the same memory region anymore), but that should be OK
> since baseline is really poor for memory locality anyway - the vast majority
> of the time is spent jumping from globally shared stubcode to other globally
> shared stubcode, and little prediction ability on the part of the processor
> to figure out where the next jump is.  So the impact of doing this should be
> minimal on actual runtime, and should show a 70% improvement in tracing
> stubcode.

Irrelevant to this bug, but it makes me wonder what you'd see if you experimented with cloning some of the stubcode, just to provide some well-predicted hot paths through them. (It would burn memory, so probably wouldn't be worth it most of the time, I suppose.)
(In reply to Steve Fink [:sfink, :s:] from comment #9)
> The
> majority of Octane spends essentially no time doing this marking. It's only
> a couple of benchmarks near the end that would see the big win,

Just curious, which tests? How much time do we spend in GC (or root marking) and how much of this is Baseline stubs?

> and I have
> no idea what that means for actual Web code. It could be quite significant,
> though,

Rumor has it most GCs in the browser don't have JS code on the stack. I don't know if that's still true, but in that case there shouldn't be many script roots, and we should be discarding JIT code quite aggressively.

> and even if not, this whacks down one of our worst-case moles (as
> in, the set of things that in some cases will bloat out the initial slice of
> an incremental GC, which is currently a big problem.)

Isn't the root of the problem (no pun intended), or at least a big part of it, that PushMarkStack tries to be smart by marking scripts directly? If it just pushed them onto the mark stack, marking JIT scripts would automatically be deferred to other slices, right? Can we do that, maybe only for largish scripts?
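
Roughly what I have in mind - every name in this sketch is a stand-in, not the real marker API:

  // Hypothetical sketch: defer tracing the children of large scripts to a
  // later incremental slice instead of doing it eagerly at root marking.
  struct JSScript;
  struct GCMarker;
  bool MarkIfUnmarked(JSScript* script);              // set mark bit; false if already set
  bool HasLargeJitData(JSScript* script);             // e.g. lots of IC stubs
  void PushOnMarkStack(GCMarker* m, JSScript* s);     // children traced later
  void MarkScriptChildren(GCMarker* m, JSScript* s);  // eager child tracing

  void PushScript(GCMarker* marker, JSScript* script) {
      if (!MarkIfUnmarked(script))
          return;  // already marked; nothing to do

      if (HasLargeJitData(script)) {
          // A later slice pops the script and traces its children there,
          // including the BaselineScript IC stubs.
          PushOnMarkStack(marker, script);
      } else {
          // Small scripts: the eager cost is bounded, mark children now.
          MarkScriptChildren(marker, script);
      }
  }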
Severity: normal → S3