Closed Bug 958492 Opened 6 years ago Closed 5 years ago

Splay Latency browser numbers are much lower than in shell

Categories

(Core :: JavaScript Engine, defect)


Tracking


RESOLVED FIXED
mozilla36

People

(Reporter: h4writer, Assigned: bhackett1024)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 1 obsolete file)

In most cases the numbers we get from a shell build are "more or less" the same as the numbers we get with browser builds. That isn't the case with Splay Latency: we get 2x higher numbers in the shell than in the browser. We should figure out what causes this and bring the browser numbers in line with the shell numbers.
Normal GCs in the shell are not incremental because that would be non-deterministic. I don't think we want to change this, since it would really mess up a lot of things. We might be able to allow you to opt into incremental GC in the shell. That would be a good amount of work though; right now all the scheduling code lives in nsGlobalWindow.cpp because it needs to coordinate with the cycle collector.

Rather than hacking around this, I think we should just work harder on getting awfy to run in the browser.
I thought incremental GC was supposed to be better? Here the shell is 2x better than the browser. Why would we be slower due to incremental GCs? Isn't incremental GC something that should make us faster?
Oh, I misunderstood. That is weird.
My current guess is that GC in the browser takes more time than GC in the shell.

In the browser I see pauses of:
32.39872399999999
2.156260000000003
79.69760099999996
91.00974599999995
91.595686
101.00587399999995
2.0042000000000826
117.44391900000028
136.0692449999999

=> 660 milliseconds

In the shell:
33
38
40
42
44
41
40
42
42
41

=> 440 milliseconds

Now the reason for the big Splay Latency score difference is not that the shell is 200ms faster. BUT every pause^2 is taken and averaged. So that means the score is based on "1600ms" in the shell, but "66249ms" in the browser!
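To make the squaring effect concrete, here is a small sketch (not the actual Octane scoring code) that computes the mean of squared pauses for the rounded pause lists above:

```javascript
// Sketch of the scoring effect described above: the score tracks the mean of
// squared pauses, so a few long pauses dominate even when the raw totals are
// only ~200ms apart.
function meanSquaredPause(pauses) {
  return pauses.reduce((sum, p) => sum + p * p, 0) / pauses.length;
}

// Rounded versions of the pause lists quoted above.
const browserPauses = [32.4, 2.2, 79.7, 91.0, 91.6, 101.0, 2.0, 117.4, 136.1];
const shellPauses = [33, 38, 40, 42, 44, 41, 40, 42, 42, 41];

console.log(meanSquaredPause(shellPauses));   // ~1632
console.log(meanSquaredPause(browserPauses)); // ~7399
```

With these rounded inputs the browser's mean squared pause comes out several times the shell's, which is the effect the comment describes.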

In the shell the slow pauses are indeed caused by "Collect". I haven't tested it for the browser, but I would be surprised if it isn't.

Setting "javascript.options.mem.gc_incremental" to false didn't improve the score in the browser.
> BUT every pause^2 is taken and averaged. So that
> means the score is based on "1600ms" in the shell, but "66249ms" in the
> browser!

I hate benchmarks that use weird scoring systems, and Google is a repeat offender. What's wrong with just measuring time? Too easy to understand?
They want to penalize us for long pauses.

I think the next step is to get a GC log of the browser using memchaser or something. That will tell us whether the GCs are incremental or not, and where the time is being spent. A pause of 136ms is fairly high. We try to keep pauses at 40ms or below, although lots of tabs can make things worse.
> They want to penalize us for long pauses.

Sure. My point is that squaring pauses isn't particularly meaningful. Anyway, we're stuck with it.
Attached file Memchaser log
Btw, all testing was done on the newest Nightly with an essentially clean profile: no add-ons except Memchaser.
Bill, the memchaser log Hannes attached has:

"reason":"ALLOC_TRIGGER" and "nonincremental_reason":"allocation trigger"

Does this mean we don't use incremental GC for ALLOC_TRIGGER GCs? Is this something we can easily fix?
Flags: needinfo?(wmccloskey)
Right, we're not using incremental GC. I just looked at the code, and the problem with this benchmark is that it doesn't return to the event loop very often. It uses the usual Octane model of running the benchmark loop for one second, and then maybe running it again if needed. There's no point in doing incremental GCs if the benchmark is blocking the main thread for one second; the browser will jank even with incremental GC.

This benchmark really ought to use setTimeout or setInterval to run the benchmark in lots of short segments (50ms or less), returning to the event loop in between. I think Firefox would perform better in that case. We typically run JS_MaybeGC when returning to the event loop and that should trigger an incremental GC. There still might be problems, but at least we'd have a chance.
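A hypothetical restructuring along those lines (illustrative only, not the actual Octane harness) could chunk the workload with setTimeout:

```javascript
// Run the workload in short segments and return to the event loop between
// them, giving the engine a chance to run JS_MaybeGC / incremental GC slices.
// All names here are illustrative.
function runInSegments(workStep, totalMs, segmentMs, done) {
  const deadline = Date.now() + totalMs;
  function segment() {
    const segmentEnd = Math.min(Date.now() + segmentMs, deadline);
    while (Date.now() < segmentEnd) workStep();     // short blocking burst
    if (Date.now() < deadline) setTimeout(segment, 0); // yield to event loop
    else done();
  }
  segment();
}
```

The trade-off discussed later in this bug applies: time spent back in the event loop either counts against the benchmark (adding noise) or doesn't (hiding GC work).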

I'm not sure what to do about this problem. We could file a bug against the benchmark. The alternative is to change our heuristics to run incremental GCs in the middle of JS execution. The latter would be a lot of work and it wouldn't actually benefit any users.

Jan, do you think it would be feasible to get them to change or eliminate this benchmark? If we do ask them to change it, we should probably propose an alternate version. We would have to change the Octane base.js to allow some benchmarks to run in shorter segments I guess.
Flags: needinfo?(wmccloskey)
I played with this a little bit today. At first I tried to modify the benchmark, but then I realized that it's kind of hard to do. We can make splay return to the event loop more often, but then we have to decide whether the time spent in the event loop should count against us. If it doesn't count, then it's very easy to hide all the GCs there. That kind of goes against the point of the benchmark. If we do time the event loop, then we'll end up with very irregular scores due to all the random things that can happen during event loop processing.

So I decided to try running incremental GC in the middle of the benchmark. It's hard to decide when to do it though. Eventually I decided that calls to window.performance.now could be a good time. If we do it there, we're likely to end up doing one GC slice per measurement interval, which is ideal.

With these changes, our splay score goes down from ~7000 to ~6500. Our splay-latency score goes from ~7000 to ~12000. Over four runs, my total Octane score went from an average of 11260 to an average of 11772.

Boris, I'm asking for feedback on whether you think this would be a terrible idea. Adding some code into performance.now that occasionally makes it take thousands of times longer is scary. However, GCs can already happen at almost any time. Putting them right next to the timing code at least makes them a little more regular in some sense.

Also, I tried to isolate things so that we only do the GCs when we're also doing a lot of allocation. One remaining issue is the cost of JS_MaybeGC. If need be, I think I can find a way to inline it. I'm not sure whether it's worth it though.

Finally, I added a shell option to run with incremental GC. When you enable this option, the shell still gets somewhat better numbers than the browser for splay-latency. I'm not sure if there's much we can do about that, since the browser has to do more work.
Attachment #8368996 - Flags: feedback?(bzbarsky)
So there are two things to consider here, as you point out.

First, the forced cost on every call of invoking JS_MaybeGC at all.  I just tested, and I see numbers like so, without this patch, on Mac, on my hardware:

* "performance" getter on global: 40-45ns (will drop to ~12-13 once we have Window on
  WebIDL bindings, I hope).
* now() call: 40ns.

With this patch applied, the now() call takes about 110ns.  So the time for a typical "performance.now()" call goes from 80ns to 150ns with this patch for me (and from 55ns to 125ns once we get Window converted).  That's slightly worrisome, but presumably we can reduce that cost by adding some sort of inline fast path for the common no-op case.

If the page is doing "window.performance.now()" instead of "performance.now()", then right now it's taking more like 200ns.  With this patch it'll take 270ns, modulo whatever optimizations we add.  We do want to make that "window.performance" get faster too...

Date.now() numbers are similar: it takes 40ns without this patch, 110ns with.

"new Date" and subtracting another Date object takes about 260ns both with and without this patch.  Presumably it's not calling date_now?
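For reference, a minimal sketch of the kind of call-cost microbenchmark behind numbers like these (the approach and iteration count are assumptions, not the actual harness used):

```javascript
// Time many calls of fn and report approximate nanoseconds per call.
// Coarse by design: per-call costs in the tens of nanoseconds need many
// iterations to rise above timer resolution.
function nsPerCall(fn, iterations = 1e6) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return (Date.now() - start) * 1e6 / iterations; // ms -> ns per call
}

console.log(nsPerCall(Date.now)); // ~40ns unpatched on the hardware above
```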

Second, the cost of sometimes doing a GC on this call.  This is hard to evaluate without having some idea of how often we'd end up GCing and how long each GC would end up taking.  Any idea on those two numbers?
Comment on attachment 8368996 [details] [diff] [review]
more-incremental-gc

Seems plausible in general if we make the no-op case faster and have some ideas on the other numbers....
Attachment #8368996 - Flags: feedback?(bzbarsky) → feedback+
I want to get back to this, but I don't have time just now.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #11)
> Right, we're not using incremental GC. I just looked at the code, and the
> problem with this benchmark is that it doesn't return to the event loop very
> often. It uses the usual Octane model of running the benchmark loop for one
> second, and then maybe running it again if needed. There's no point in doing
> incremental GCs if the benchmark is blocking the main thread for one second;
> the browser will jank even with incremental GC.

I've talked with someone on the V8 team about this. Their view is that latency and predictability are important in a VM to provide "soft real-time" guarantees, which I agree with. He also agreed that latency tests could be based on requestAnimationFrame/setInterval. But they want to stress the system, i.e. a computationally heavy workload that leaves only short idle times, so short they are almost zero. So they view the current measuring system as a way to approach zero idle time, and argue that if a VM is good on such a workload, like Octane splay-latency, it will also be good when using requestAnimationFrame/setInterval.

So we're kinda stuck with this :(.
Meaning we need some sort of way to do iGC during execution (without the current pauses).
So it might make sense to check during "Date.now()". But that feels too specific: it doesn't make sense to me that code that calls "Date.now()" gets iGC, while code that doesn't measure time gets none. I think we need a more global way of running iGC. I was wondering whether it would be viable to have a check (like the one for "script running too long") on loop headers or something. That might be a more constructive way to deal with this than special-casing "Date.now()".
Maybe we can convince them to change the benchmark once engines implement their own event loops and we can do zero-delay event turns? It might be a while until that's the case, but it will happen, e.g. for promises.
Depends on: 1029648
Realistically I'm not going to get to this.
Flags: needinfo?(wmccloskey)
(In reply to Hannes Verschore [:h4writer] from comment #16)
> (In reply to Bill McCloskey (:billm) from comment #11)
> > Right, we're not using incremental GC. I just looked at the code, and the
> > problem with this benchmark is that it doesn't return to the event loop very
> > often. It uses the usual Octane model of running the benchmark loop for one
> > second, and then maybe running it again if needed. There's no point in doing
> > incremental GCs if the benchmark is blocking the main thread for one second;
> > the browser will jank even with incremental GC.
> 
> I've talked with somebody of the v8 team about this. The view on it is that
> latency and predictability is important in a VM to provide "soft real-time"
> guarantees. Which I agree on. He also agree that latency tests could be
> based on requestAnimiationFrame/setInterval. But they want to stress the
> system, i.e. when doing a computationally heavy workload which results in
> just short idle times. Idle times so low it is almost 0. So they view the
> current measuring system as a way to approach the zero idle time. And argue
> that if a VM is good on such a workload, like octane-splay latency, It will
> also be good when using requestAnimationFrame/setInterval.

I don't follow this at all.

Latency and predictability are good qualities. So is throughput. In the limit, if you never return to the event loop during a benchmark run, any amount of iGC is *always* worse than just doing a nonincremental GC when you need it. If you do return to the event loop a few times, there is no answer that is guaranteed to be correct. It depends on what happens between turns, how long the gap is between turns, and how full your heap is.

If you're going to be idle for a second between turns, and one turn doesn't exhaust memory, then the optimum is to do no GCs during the turn, return to the event loop and process any input, then do a full nonincremental GC. If you are going to go way over memory (and most of it is garbage), then you should do as many full nonincremental GCs as you need and the final GC doesn't matter that much. iGC during the turn is a nice intermediate point, where you'll pay some throughput in order to speed up an eventual memory usage-triggered finishing GC, but it's not the "right" answer. And if you don't have much of any idle time between turns, then it really depends on the application for whether it's better to maximize throughput or minimize jitter.

Arguing that being good on this sort of workload means you'll be good on rAF is backwards. Yes, low latency on octane-splay does imply low latency with rAF fairly well. But the converse is not true, and there is always a latency vs throughput tradeoff with these things, so to some extent low latency on octane-splay also implies low throughput on octane-splay. Do we really want a benchmark rewarding low throughput, just because a workload exists for which that is the right choice?

If we want browsers to be good at rAF, I think we should have a benchmark that uses rAF. If we want a proxy benchmark that runs in the shell, I would suggest that rather than hacking something based on Date.now or performance.now or whatever, we make up new things like benchmark.startFrame and benchmark.endFrame and tell JS engine makers that it's totally legit to optimize behavior by assuming those are called at the beginning and end of a frame. If the benchmark code needs to have something awful like

  if (!this.hasOwnProperty("benchmark") || !benchmark.hasOwnProperty("startFrame")) {
    benchmark = { startFrame: function () {}, endFrame: function () {} };
  }

to be able to run on all engines, then so be it. If we go with something like this, then I'd be a lot more comfortable with some hacky pattern-match-splay-latency type of solution to "backport" the startFrame notification to the current benchmark.

As till said, ES7(?) task loops or something would be a better place for this, but I'm thinking shorter-term.

> 
> So we kinda stuck with this :(.
> Meaning we need some sort of way to do iGC, during execution (without the
> current pauses).
> So it might make sense to check it during "Date.now()". But that feels like
> too specific. It doesn't make sense for me that code that has "Date.now()"
> only get iGC, but if we don't measure time, we don't get iGC. I think we
> need a more global way of running iGC. I was wondering if it was viable to
> have a check (like we have for "script running too long"), on loop headers
> or something. I think that might be a more constructive way to deal with
> this, than special casing "Date.now()"

Isn't a long-running script precisely where you normally want to maximize throughput?
One thing we could perhaps do is demonstrate that the total time v8 spends under the test function decreases if they don't do iGCs in this case, either. Modifying the benchmark to that effect should be trivial, but I don't know about v8. It might be enough to just increase the max slice duration to, oh, infinity.

If their score on this "benchmark" improves, then clearly for the web it's better not to do iGCs in this situation, as sfink argues. (And I don't see how it could not improve, at least slightly.)
> If we want browsers to be good at rAF, I think we should have a benchmark that uses rAF.

Yes, yes, a million times this. (And feel free to substitute any other value for "rAF".)
I like Steve's idea of adding a proxy for rAF. We couldn't just do it in the shell, though, since most people run Octane in the browser. (As mentioned above, we probably don't want to write the benchmark using actual rAF because the results would be really inconsistent.) However, if we want to add this special proxy-rAF to the browser, we'd have to add it as a property on the window or something. I'm guessing there would be a ton of resistance to that.

I'm coming around more and more to the view that we should just make our GC work like V8's.
(In reply to Bill McCloskey (:billm) from comment #22)
> I like Steve's idea of adding a proxy for rAF. We couldn't just do it in the
> shell, though, since most people run Octane in the browser. (As mentioned
> above, we probably don't want to write the benchmark using actual rAF
> because the results would be really inconsistent.) However, if we want to
> add this special proxy-rAF to the browser, we'd have to add it as a property
> on the window or something. I'm guessing there would be a ton of resistance
> to that.

Why wouldn't Promise chains be a good proxy for rAF performance? Doing each iteration of the splay benchmark in a promise and measuring the average delay between runs (so excluding the benchmark's workload itself) should give a pretty good idea of how the browser deals with the GC. Something like the following should hopefully get the idea across:


var iterations = 0;
var workloadDuration;
var startTime;
function resolver(resolve, reject) {
  var time = performance.now();
  // Dummy-workload, obviously.
  while (performance.now() < time + 100) {}
  workloadDuration = performance.now() - time;
  resolve();
}

function then() {
  var currentTime = performance.now();
  var delay = (currentTime - startTime - workloadDuration).toFixed(2);
  console.log('iteration ' + (++iterations) + ' completed in ' + delay + 'ms');
  if (iterations < 10) {
    startTime = performance.now();
    p = new Promise(resolver);
    p.then(then);
  }
}
startTime = performance.now();
var p = new Promise(resolver);
p.then(then);


Now obviously this only works where promises are available, so we'd need to implement ES6 Job Queues[1]. In the meantime, having a browser benchmark that only works in the browser, not the shell, isn't entirely absurd.


[1] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-jobs-and-job-queues
> Why wouldn't Promise chains be a good proxy for rAF performance?

I didn't see anything in the spec about it, but I assume we'd run the event loop in between each promise resolution. If that's true, then promises would have the same problem as rAF. Besides GC, we would be measuring everything that can run in the event loop: session store collection, page thumbnail collection, etc. Each run would give very different numbers because of this. It's actually not a bad thing to benchmark, since we want all of those things to be fast. But I think the Octane developers would object to it. Google people also created the bouncing balls benchmark, which was very similar to splay latency except that it did run the event loop. It had the problem of giving very inconsistent results, and I'm guessing that's why Octane included splay latency instead.
> but I assume we'd run the event loop in between each promise resolution.

With our current implementation, yes.

Per spec, no: Promises take priority over normal events.

Whether this is actually something we're willing to implement is unclear to me, since it would allow promises to block user interaction.
(In reply to Bill McCloskey (:billm) from comment #24)
> > Why wouldn't Promise chains be a good proxy for rAF performance?
> 
> I didn't see anything in the spec about it, but I assume we'd run the event
> loop in between each promise resolution.

As Boris said, it's not supposed to as things stand now. (But read on.)

> If that's true, then promises would
> have the same problem as rAF. Besides GC, we would be measuring everything
> that can run in the event loop: session store collection, page thumbnail
> collection, etc. Each run would give very different numbers because of this.

In a clean profile with just the one window/tab open, would that really introduce that much variability?

> It's actually not a bad thing to benchmark, since we want all of those
> things to be fast. 

This. Incidentally, we're not doing well at all on this benchmark (as posted in comment 23) by that measure: my main browsing profile has pauses between 3ms and 100ms, whereas in a new profile I get 3ms to ~6ms.

> But I think the Octane developers would object to it.

Because they want to isolate GC pause time testing, or because they're doing really well at the current, less user-relevant benchmark?

> Google people also created the bouncing balls benchmark, which was very
> similar to splay latency except that it did run the event loop. It had the
> problem of giving very inconsistent results, and I'm guessing that's why
> Octane included splay latency instead.

I guess to some extent that's fair, yes.


(Also, ignore the part about subtracting the workload duration. That obviously only works if the implementation doesn't GC at all during the workload, so doesn't make much sense here.)
We discussed this extensively at today's GC meeting and came to the conclusion that Bill is correct. Explaining why is a bit hard, but stick with me.

In a general purpose programming language, it is really, really hard to tell what effect latency will have because of all the ways it is possible to capture or update user-visible state. Things like writes to the front buffer, network I/O, file I/O, etc. can all have dramatic effects on a program's behavior in the presence of latency, uniform or otherwise. There are enough APIs that modify external state that it's quite hard to even draw a line in a program that tells you what /might/ end up being important (modulo Haskell's IO monad, of course).

Naturally, in JavaScript, things are a bit different. With JS, the state a user's script updates is simply not visible until the script is done, so only the time it takes to return to the event loop (throughput) matters. Of course, this is a lie: nothing is ever that simple. A script can observe latency by calling Date.now(), as it does for this benchmark. Or it could be communicating with a web worker where the uneven task delivery could cause pipeline stalls. Or it could be running chrome code where a long GC could time out a socket. Or block the user from changing desktops for several seconds because X is waiting on us. Or. Or. Or.

Flip the problem on its head. The only reason we've been thinking of throughput as the default is because we live in a run-to-completion world. This is wrong. Latency has to be the default: the effect of failing at latency can be catastrophic, whereas the effect of being slightly slower is (generally) just being slightly slower.

I'm taking this, although I have no idea what the right approach is yet.
Assignee: nobody → terrence
Status: NEW → ASSIGNED
Your comment left me initially confused, though it makes sense to me after thinking about it for a bit. So to be clear: although it won't directly make the active tab more responsive to the user and might slow run-to-completion, moving from a small number of full GCs to a multitude of incremental GCs will make the script perform more predictably, which may help its interaction with other parts of the engine?

Aside from the examples you mentioned, could a script reasonably check its own runtime to decide whether to stop and yield to other things - e.g. run for 30ms, then queue itself up to continue running a bit later? In the case of a script going out of its way to keep things responsive like that, it would be pretty annoying to have infrequent but (relatively) long GCs cause jank for the user.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #28)
> Aside from the examples you mentioned, could a script reasonably check its
> own runtime to decide whether to stop and yield to other things - e.g. run
> for 30ms, then queue itself up to continue running a bit later?

Within the engine? No, that's forbidden by run-to-completion semantics. We're not talking about yielding to other things, we're talking only about yielding to incremental GCs.

If you want your script to do something like that, you would need to do it manually. Promises might be useful.

> moving from a small amount of full GCs to a multitude of incremental GCs
> will make the script perform more predictably, which may help its

I don't think that's quite the right model. An incremental GC slice is not the same thing as a fraction of a full GC. It's more like a full GC is broken down into a series of incremental mark slices followed by a sweep slice. Or even closer, it's an initial root marking slice, a series of incremental mark slices, and a final sweep slice. You don't free any memory until the sweep slice, and you free about the same amount of memory as a full GC would.

If you let the numbers represent some arbitrary unit of time, say nanofortnights, you could compare them as:

full GCs:                    7           7          7 DONE
incremental slices:  2 1 1 1 3    2 1 1 1 3    2 1 1 1 3 DONE

Time goes to the right.

This is assuming that you're only doing a full GC when you need it to free up memory. Note that the sum of each group of incremental GCs is slightly higher than the cost of a full GC, and thus the incremental approach takes 3 nanofortnights longer to finish. But if you pick pairs of points in time (between GC slices) and measure the elapsed time between them, the variance is much lower in the incremental GC case.

If you're returning to the event loop at interior points, the lower latency incremental approach is clearly superior: imagine you got unlucky and returned to the event loop just before the first full GC. Then you resume processing briefly. In the full GC case, you then spend processing time + 7 nanofortnights. In the iGC case, you spend processing time + 3 nanofortnights, which is much less even though peak memory usage is the same (the full GC and final iGC slice free the same amount of memory, after the same amount of processing.)

If you don't return to the event loop, then the full GC approach looks better, because you finish faster. But the argument in comment 27 is that (1) there are many ways that the uneven spacing could matter, other than returning to the event loop; and (2) the consequences of such jitter are more important than the loss in throughput. I'm a little uncertain about (1), since they sound a little theoretical (real, but uncommon). But (2) is convincing to me -- time and time again, we find that avoiding pathological cases is more important than making the good (fast) cases better. So it makes sense to push hard on latency, avoiding huge long stalls at inopportune times, and only optimize throughput in cases where it doesn't hurt latency (much). Latency-first design, I guess. (They're not diametrically opposed -- many improvements will help with both.)
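The nanofortnight comparison above can be checked numerically; this toy sketch (a model of the argument, not an engine measurement) just totals the two schedules and their worst single pause:

```javascript
// Same GC work, scheduled two ways: three full GCs of 7 nanofortnights, or
// three groups of incremental slices (2+1+1+1+3 = 8 nanofortnights each).
const fullGCs = [7, 7, 7];
const slices = [2, 1, 1, 1, 3, 2, 1, 1, 1, 3, 2, 1, 1, 1, 3];
const total = a => a.reduce((s, x) => s + x, 0);

console.log(total(fullGCs));        // 21
console.log(total(slices));         // 24: 3 nanofortnights slower overall
console.log(Math.max(...fullGCs));  // 7: worst-case pause with full GCs
console.log(Math.max(...slices));   // 3: worst-case pause with slices
```

So the incremental schedule trades ~14% throughput for cutting the worst pause by more than half, which is the latency-first argument in a nutshell.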

This benchmark still sticks in my craw, because it's a case where it's relatively straightforward to automatically detect that we could improve throughput at no cost to latency. It really does reward incorrect scheduling behavior, unless you reinterpret Date.now to represent a visible side effect. But given that coming up with an improved benchmark would be a PITA (at least one that is representative when run in the shell), it makes sense to postpone that for later and tune for latency first.
Attached patch patch (obsolete) — Splinter Review
This patch improves my splay-latency score a lot without a big effect on other Octane tests and without relying on performance.now or anything. I've only tested on an x86 shell so far (which I modified in the patch so it can do IGCs). Score before:

Richards: 29788
DeltaBlue: 47255
Crypto: 28048
RayTrace: 96791
EarleyBoyer: 34444
RegExp: 3169
Splay: 23372
SplayLatency: 14856
NavierStokes: 29473
PdfJS: 17711
Mandreel: 29909
MandreelLatency: 29674
Gameboy: 62778
CodeLoad: 17405
Box2D: 47030
zlib: 58820
Typescript: 33492
----
Score (version 9): 28888

And after:

Richards: 29982
DeltaBlue: 46653
Crypto: 27875
RayTrace: 92943
EarleyBoyer: 35681
RegExp: 3228
Splay: 21885
SplayLatency: 24957
NavierStokes: 29502
PdfJS: 15728
Mandreel: 30001
MandreelLatency: 37767
Gameboy: 62841
CodeLoad: 17341
Box2D: 47429
zlib: 58640
Typescript: 27037
----
Score (version 9): 29521

So, splay-latency improves by 2/3 and splay decreases by 7% or so, and other octane tests don't seem to move (I get a big improvement in mandreel-latency when running the tests together, but this doesn't repro running mandreel individually.)

Currently, when we trigger GCs during splay it is due to GC thing allocation in the zone.  This causes an immediate non-incremental GC of the zone.  The patch changes this so that when we are 90% of the way to the zone allocation trigger, we start an incremental GC and execute a slice after every MB of allocation (splay allocates 300-500K of data between points when it measures pause times.)  If we hit the allocation threshold before the IGC finishes, we finish the GC non-incrementally, so our memory usage with this patch should not be higher than it is currently, unless the IGC collects less than the non-incremental GC.
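The heuristic Brian describes can be sketched roughly like this (all zone fields and helper names are hypothetical, not the actual SpiderMonkey code):

```javascript
// Sketch of the scheduling described above: start an incremental GC at 90% of
// the zone's allocation trigger, run one slice per extra MB allocated, and
// fall back to a non-incremental finish at the trigger itself.
const MB = 1 << 20;

// Stubs that just record what happened, standing in for the real GC entry points.
function startIncrementalGC(zone) { zone.incrementalGCRunning = true; zone.slices = 0; }
function runIncrementalSlice(zone) { zone.slices++; }
function finishGCNonIncrementally(zone) { zone.incrementalGCRunning = false; zone.gcBytes = 0; }

function maybeGCOnAllocation(zone) {
  if (zone.gcBytes >= zone.gcTriggerBytes) {
    finishGCNonIncrementally(zone);        // hard limit: finish non-incrementally
  } else if (zone.gcBytes >= 0.9 * zone.gcTriggerBytes) {
    if (!zone.incrementalGCRunning) {
      startIncrementalGC(zone);
      zone.lastSliceBytes = zone.gcBytes;
    } else if (zone.gcBytes - zone.lastSliceBytes >= MB) {
      runIncrementalSlice(zone);           // one slice per MB allocated
      zone.lastSliceBytes = zone.gcBytes;
    }
  }
}
```

Because the non-incremental path still fires at the original trigger, peak memory usage should match the old behavior, as the comment notes.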
Assignee: terrence → bhackett1024
Comment on attachment 8506466 [details] [diff] [review]
patch

I tested this in an x64 browser and get a bigger improvement than in the x86 shell --- maybe 4% overall, though it's noisy.  I also tested the old realtime raytracer demo that Bill pointed me towards, and get similar FPS with and without the patch, as expected.
Attachment #8506466 - Attachment description: WIP → patch
Attachment #8506466 - Flags: review?(wmccloskey)
Comment on attachment 8506466 [details] [diff] [review]
patch

Review of attachment 8506466 [details] [diff] [review]:
-----------------------------------------------------------------

I forgot one of my main objections to this approach: it will make the shell really non-deterministic. I know we already have a lot of non-determinism in the shell, but it's usually pretty self-contained (parsing, Ion compiling). But the GC touches a lot more stuff. I'm just afraid that a change like this will make it much more difficult to get reproducible shell test cases.

However, I think there's a solution. We already have a mechanism to do work-based GC scheduling (as opposed to time-based). It's only used for debugging right now, but I think it would be good for this. We'll need a few things:
1. An extra parameter for GCSchedulingTunables that determines how much work we do per slice, if we're using work-based scheduling.
2. Code to select work-based scheduling for ALLOC_TRIGGER GCs.

You'll also need to look for an amount of work that makes splay-latency work well. Hopefully this won't be too hard. If it is, I guess we could reconsider the time-based approach.

::: js/src/gc/Heap.h
@@ +1014,5 @@
> +    /*
> +     * Amount of data to allocate before a new incremental slice can be
> +     * triggered.
> +     */
> +    size_t delayBytes_;

This doesn't really seem to belong here. I think the idea of this class is to track stats that we keep for both the runtime and the zone. In this case we're only concerned with the zone, so I think it would make more sense for this to be a field of the Zone.

::: js/src/jsgc.cpp
@@ +5897,5 @@
>      int64_t budget;
>      if (millis)
>          budget = SliceBudget::TimeBudget(millis);
> +    else if (reason == JS::gcreason::ALLOC_TRIGGER)
> +        budget = sliceBudget;

This is where we'd done something like |budget = SliceBudget::WorkBudget(workBudget)|, where workBudget is the GC tunable.

::: js/src/shell/js.cpp
@@ +6132,5 @@
>      JS_SetGCParameter(rt, JSGC_MODE, JSGC_MODE_INCREMENTAL);
>      JS_SetGCParameterForThread(cx, JSGC_MAX_CODE_CACHE_BYTES, 16 * 1024 * 1024);
> +    JS_SetGCParameter(rt, JSGC_DYNAMIC_HEAP_GROWTH, 1);
> +    JS_SetGCParameter(rt, JSGC_DYNAMIC_MARK_SLICE, 1);
> +    JS_SetGCParameter(rt, JSGC_SLICE_TIME_BUDGET, 10);

And we'd change this to JSGC_SLICE_WORK_BUDGET = whatever, leaving the time budget at 0.
Attachment #8506466 - Flags: review?(wmccloskey)
Attached patch "patch" — Splinter Review
We already have a mechanism for avoiding non-determinism in the shell, and I think it would be better to use that than to work around this issue by changing the incremental mechanism.  A work-based strategy seems like it would be pretty hard to tune well, since the amount of time taken to do a certain amount of work varies widely between machines.  This patch doesn't set the GC settings that allow incremental GC in --enable-more-deterministic builds, which are what the fuzzers usually use.
Attachment #8506466 - Attachment is obsolete: true
Attachment #8506902 - Flags: review?(wmccloskey)
(In reply to Brian Hackett (:bhackett) from comment #33)
> This patch doesn't
> set the GC settings that allow incremental GC in --enable-more-deterministic
> builds, which are what the fuzzers usually use.

The fuzzers only use that flag for differential testing, not for the "normal" fuzzing.
(In reply to Jan de Mooij [:jandem] from comment #34)
> (In reply to Brian Hackett (:bhackett) from comment #33)
> > This patch doesn't
> > set the GC settings that allow incremental GC in --enable-more-deterministic
> > builds, which are what the fuzzers usually use.
> 
> The fuzzers only use that flag for differential testing, not for the
> "normal" fuzzing.

The configure flags that Gary posts seem to all use --enable-more-deterministic.  I don't know what configure flags decoder uses.  I don't see any real disadvantages to using --enable-more-deterministic all or most of the time though.
I admit it would be nice if the slices lasted a consistent amount of time across machines. Let's see what Gary and Christian think. Guys, when do you test with the --enable-more-deterministic flag? Would you be opposed to always turning it on?
Flags: needinfo?(gary)
Flags: needinfo?(choller)
(In reply to Bill McCloskey (:billm) from comment #36)
> Guys, when do you test with the --enable-more-deterministic flag? Would you be opposed to always turning it on?

No, I'm not opposed as long as we don't break the web.

(In reply to Jan de Mooij [:jandem] from comment #34)
> (In reply to Brian Hackett (:bhackett) from comment #33)
> > This patch doesn't
> > set the GC settings that allow incremental GC in --enable-more-deterministic
> > builds, which are what the fuzzers usually use.
> 
> The fuzzers only use that flag for differential testing, not for the
> "normal" fuzzing.

Not necessarily true; there are some combinations in which we use that flag for "normal" fuzzing too, just so we don't miss anything. Sometimes these combinations are randomly chosen.
Flags: needinfo?(gary)
(In reply to Bill McCloskey (:billm) from comment #36)
> I admit it would be nice if the slices lasted a consistent amount of time
> across machines. Let's see what Gary and Christian think. Guys, when do you
> test with the --enable-more-deterministic flag? Would you be opposed to
> always turning it on?

I don't think it's really a requirement that all fuzzing be done with --enable-more-deterministic; it's just that with this option we'll be able to easily isolate the effect of the incremental GCs and use that to more easily debug an affected program.  This is similar to what we do with off-thread Ion compilation, which also affects all programs but can be turned off with --no-threads to isolate its effect.
I don't fuzz with --enable-more-deterministic because it differs from the code that we ship in production. Furthermore, I'm not sure if we have builds for that, and I'd like to keep fuzzing working with our builds.

If there is something about incremental GC that would make stuff more deterministic, then I'd be happy about that, but can we make that just a runtime option for the shell? That's also a lot easier to integrate for testing.
Flags: needinfo?(choller)
Comment on attachment 8506902 [details] [diff] [review]
patch

Review of attachment 8506902 [details] [diff] [review]:
-----------------------------------------------------------------

OK, let's do this, but with a command-line switch as well as the configure flag.
Attachment #8506902 - Flags: review?(wmccloskey) → review+
(In reply to Brian Hackett (:bhackett) from comment #30)
> So, splay-latency improves by 2/3 and splay decreases by 7% or so, and other
> octane tests don't seem to move (I get a big improvement in mandreel-latency
> when running the tests together, but this doesn't repro running mandreel
> individually.)

I'm probably missing something or these numbers vary wildly, but to me it looks like some other tests are substantially affected: pdf.js by 11%, and TypeScript by 19%. RayTrace is down by close to 4%, but that's within the normal range of variation, so probably not relevant.
(In reply to Till Schneidereit [:till] from comment #41)
> (In reply to Brian Hackett (:bhackett) from comment #30)
> > So, splay-latency improves by 2/3 and splay decreases by 7% or so, and other
> > octane tests don't seem to move (I get a big improvement in mandreel-latency
> > when running the tests together, but this doesn't repro running mandreel
> > individually.)
> 
> I'm probably missing something or these numbers vary wildly, but to me it
> looks like some other tests are substantially affected: pdf.js by 11%, and
> TypeScript by 19%. RayTrace is down by close to 4%, but that's within the
> normal range of variation, so probably not relevant.

Numbers on octane tests can vary widely between runs.  After doing several runs and comparing the results I didn't notice big swings in other tests, but we'll need to wait for awfy for more certainty.
https://hg.mozilla.org/mozilla-central/rev/35025fd9e99b
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Depends on: 1088298