Closed Bug 958492 Opened 6 years ago Closed 5 years ago

Splay Latency browser numbers are much lower than in shell

Categories

(Core :: JavaScript Engine, defect)


Tracking


RESOLVED FIXED
mozilla36

People

(Reporter: h4writer, Assigned: bhackett1024)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 1 obsolete file)

In most cases the numbers we get from a shell build are "more or less" the same as the numbers we get with browser builds. That isn't the case with Splay Latency: we get 2x higher numbers in the shell than in the browser. We should figure out what causes this and bring the browser numbers in line with the shell numbers.
Normal GCs in the shell are not incremental because that would be non-deterministic. I don't think we want to change this, since it would really mess up a lot of things. We might be able to allow you to opt into incremental GC in the shell. That would be a good amount of work though; right now all the scheduling code lives in nsGlobalWindow.cpp because it needs to coordinate with the cycle collector.

Rather than hacking around this, I think we should just work harder on getting awfy to run in the browser.
I thought incremental GC was supposed to be better? Here the shell is 2x better than the browser. Why would we be slower due to incremental GCs? Isn't incremental GC something that should make us faster?
Oh, I misunderstood. That is weird.
My current guess is that GC in the browser takes more time than GC in the shell.

In the browser I see pauses of:
32.39872399999999
2.156260000000003
79.69760099999996
91.00974599999995
91.595686
101.00587399999995
2.0042000000000826
117.44391900000028
136.0692449999999

=> 660 milliseconds

In the shell:
33
38
40
42
44
41
40
42
42
41

=> 440 milliseconds

Now the reason for the big Splay Latency score difference is not that the shell is 200ms faster. BUT every pause^2 is taken and averaged. So that means the score is based on "1600ms" in the shell, but "66249ms" in the browser!
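To make the squaring effect concrete, here is a small sketch (not the actual Octane scoring code) that computes the mean of squared pauses for the rounded pause lists above:

```javascript
// Sketch of the scoring effect described above: the score tracks the mean of
// squared pauses, so a few long pauses dominate even when the raw totals are
// only ~200ms apart.
function meanSquaredPause(pauses) {
  return pauses.reduce((sum, p) => sum + p * p, 0) / pauses.length;
}

// Rounded versions of the pause lists quoted above.
const browserPauses = [32.4, 2.2, 79.7, 91.0, 91.6, 101.0, 2.0, 117.4, 136.1];
const shellPauses = [33, 38, 40, 42, 44, 41, 40, 42, 42, 41];

console.log(meanSquaredPause(shellPauses));   // ~1632
console.log(meanSquaredPause(browserPauses)); // ~7399
```

With these rounded inputs the browser's mean squared pause comes out several times the shell's, which is the effect the comment describes.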

In the shell the slow pauses are indeed caused by "Collect". I haven't tested it for the browser, but I would be surprised if it isn't.

Setting "javascript.options.mem.gc_incremental" to false didn't improve the score in the browser.
> BUT every pause^2 is taken and averaged. So that
> means the score is based on "1600ms" in the shell, but "66249ms" in the
> browser!

I hate benchmarks that use weird scoring systems, and Google is a repeat offender. What's wrong with just measuring time? Too easy to understand?
They want to penalize us for long pauses.

I think the next step is to get a GC log of the browser using memchaser or something. That will tell us whether the GCs are incremental or not, and where the time is being spent. A pause of 136ms is fairly high. We try to keep pauses at 40ms or below, although lots of tabs can make things worse.
> They want to penalize us for long pauses.

Sure. My point is that squaring pauses isn't particularly meaningful. Anyway, we're stuck with it.
Attached file Memchaser log
Btw, all testing was done on the newest Nightly with an essentially clean profile: no add-ons except Memchaser.
Bill, the memchaser log Hannes attached has:

"reason":"ALLOC_TRIGGER" and "nonincremental_reason":"allocation trigger"

Does this mean we don't use incremental GC for ALLOC_TRIGGER GCs? Is this something we can easily fix?
Flags: needinfo?(wmccloskey)
Right, we're not using incremental GC. I just looked at the code, and the problem with this benchmark is that it doesn't return to the event loop very often. It uses the usual Octane model of running the benchmark loop for one second, and then maybe running it again if needed. There's no point in doing incremental GCs if the benchmark is blocking the main thread for one second; the browser will jank even with incremental GC.

This benchmark really ought to use setTimeout or setInterval to run the benchmark in lots of short segments (50ms or less), returning to the event loop in between. I think Firefox would perform better in that case. We typically run JS_MaybeGC when returning to the event loop and that should trigger an incremental GC. There still might be problems, but at least we'd have a chance.
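A hypothetical restructuring along those lines (illustrative only, not the actual Octane harness) could chunk the workload with setTimeout:

```javascript
// Run the workload in short segments and return to the event loop between
// them, giving the engine a chance to run JS_MaybeGC / incremental GC slices.
// All names here are illustrative.
function runInSegments(workStep, totalMs, segmentMs, done) {
  const deadline = Date.now() + totalMs;
  function segment() {
    const segmentEnd = Math.min(Date.now() + segmentMs, deadline);
    while (Date.now() < segmentEnd) workStep();     // short blocking burst
    if (Date.now() < deadline) setTimeout(segment, 0); // yield to event loop
    else done();
  }
  segment();
}
```

The trade-off discussed later in this bug applies: time spent back in the event loop either counts against the benchmark (adding noise) or doesn't (hiding GC work).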

I'm not sure what to do about this problem. We could file a bug against the benchmark. The alternative is to change our heuristics to run incremental GCs in the middle of JS execution. The latter would be a lot of work and it wouldn't actually benefit any users.

Jan, do you think it would be feasible to get them to change or eliminate this benchmark? If we do ask them to change it, we should probably propose an alternate version. We would have to change the Octane base.js to allow some benchmarks to run in shorter segments I guess.
Flags: needinfo?(wmccloskey)
I played with this a little bit today. At first I tried to modify the benchmark, but then I realized that it's kind of hard to do. We can make splay return to the event loop more often, but then we have to decide whether the time spent in the event loop should count against us. If it doesn't count, then it's very easy to hide all the GCs there. That kind of goes against the point of the benchmark. If we do time the event loop, then we'll end up with very irregular scores due to all the random things that can happen during event loop processing.

So I decided to try running incremental GC in the middle of the benchmark. It's hard to decide when to do it though. Eventually I decided that calls to window.performance.now could be a good time. If we do it there, we're likely to end up doing one GC slice per measurement interval, which is ideal.

With these changes, our splay score goes down from ~7000 to ~6500. Our splay-latency score goes from ~7000 to ~12000. Over four runs, my total Octane score went from an average of 11260 to an average of 11772.

Boris, I'm asking for feedback on whether you think this would be a terrible idea. Adding some code into performance.now that occasionally makes it take thousands of times longer is scary. However, GCs can already happen at almost any time. Putting them right next to the timing code at least makes them a little more regular in some sense.

Also, I tried to isolate things so that we only do the GCs when we're also doing a lot of allocation. One remaining issue is the cost of JS_MaybeGC. If need be, I think I can find a way to inline it. I'm not sure whether it's worth it though.

Finally, I added a shell option to run with incremental GC. When you enable this option, the shell still gets somewhat better numbers than the browser for splay-latency. I'm not sure if there's much we can do about that, since the browser has to do more work.
Attachment #8368996 - Flags: feedback?(bzbarsky)
So there are two things to consider here, as you point out.

First, the forced cost on every call of invoking JS_MaybeGC at all.  I just tested, and I see numbers like so, without this patch, on Mac, on my hardware:

* "performance" getter on global: 40-45ns (will drop to ~12-13 once we have Window on
  WebIDL bindings, I hope).
* now() call: 40ns.

With this patch applied, the now() call takes about 110ns.  So the time for a typical "performance.now()" call goes from 80ns to 150ns with this patch for me (and from 55ns to 125ns once we get Window converted).  That's slightly worrisome, but presumably we can reduce that cost by adding some sort of inline fast path for the common no-op case.

If the page is doing "window.performance.now()" instead of "performance.now()", then right now it's taking more like 200ns.  With this patch it'll take 270ns, modulo whatever optimizations we add.  We do want to make that "window.performance" get faster too...

Date.now() numbers are similar: it takes 40ns without this patch, 110ns with.

"new Date" and subtracting another Date object takes about 260ns both with and without this patch.  Presumably it's not calling date_now?
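For reference, a minimal sketch of the kind of call-cost microbenchmark behind numbers like these (the approach and iteration count are assumptions, not the actual harness used):

```javascript
// Time many calls of fn and report approximate nanoseconds per call.
// Coarse by design: per-call costs in the tens of nanoseconds need many
// iterations to rise above timer resolution.
function nsPerCall(fn, iterations = 1e6) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return (Date.now() - start) * 1e6 / iterations; // ms -> ns per call
}

console.log(nsPerCall(Date.now)); // ~40ns unpatched on the hardware above
```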

Second, the cost of sometimes doing a GC on this call.  This is hard to evaluate without having some idea of how often we'd end up GCing and how long each GC would end up taking.  Any idea on those two numbers?
Comment on attachment 8368996 [details] [diff] [review]
more-incremental-gc

Seems plausible in general if we make the no-op case faster and have some ideas on the other numbers....
Attachment #8368996 - Flags: feedback?(bzbarsky) → feedback+
I want to get back to this, but I don't have time just now.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #11)
> Right, we're not using incremental GC. I just looked at the code, and the
> problem with this benchmark is that it doesn't return to the event loop very
> often. It uses the usual Octane model of running the benchmark loop for one
> second, and then maybe running it again if needed. There's no point in doing
> incremental GCs if the benchmark is blocking the main thread for one second;
> the browser will jank even with incremental GC.

I've talked with someone on the V8 team about this. Their view is that latency and predictability are important in a VM to provide "soft real-time" guarantees, which I agree with. He also agreed that latency tests could be based on requestAnimationFrame/setInterval. But they want to stress the system, i.e. a computationally heavy workload that leaves only short idle times, so short they are almost zero. So they view the current measuring system as a way to approach zero idle time, and argue that if a VM is good on such a workload, like Octane splay-latency, it will also be good when using requestAnimationFrame/setInterval.

So we're kinda stuck with this :(.
Meaning we need some sort of way to do iGC during execution (without the current pauses).
So it might make sense to check during "Date.now()". But that feels too specific: it doesn't make sense to me that code that calls "Date.now()" gets iGC, while code that doesn't measure time gets none. I think we need a more global way of running iGC. I was wondering whether it would be viable to have a check (like the one for "script running too long") on loop headers or something. That might be a more constructive way to deal with this than special-casing "Date.now()".
Maybe we can convince them to change the benchmark once engines implement their own event loops and we can do zero-delay event turns? It might be a while until that's the case, but it will happen, e.g. for promises.
Depends on: 1029648
Realistically I'm not going to get to this.
Flags: needinfo?(wmccloskey)
(In reply to Hannes Verschore [:h4writer] from comment #16)
> (In reply to Bill McCloskey (:billm) from comment #11)
> > Right, we're not using incremental GC. I just looked at the code, and the
> > problem with this benchmark is that it doesn't return to the event loop very
> > often. It uses the usual Octane model of running the benchmark loop for one
> > second, and then maybe running it again if needed. There's no point in doing
> > incremental GCs if the benchmark is blocking the main thread for one second;
> > the browser will jank even with incremental GC.
> 
> I've talked with somebody of the v8 team about this. The view on it is that
> latency and predictability is important in a VM to provide "soft real-time"
> guarantees. Which I agree on. He also agree that latency tests could be
> based on requestAnimiationFrame/setInterval. But they want to stress the
> system, i.e. when doing a computationally heavy workload which results in
> just short idle times. Idle times so low it is almost 0. So they view the
> current measuring system as a way to approach the zero idle time. And argue
> that if a VM is good on such a workload, like octane-splay latency, It will
> also be good when using requestAnimationFrame/setInterval.

I don't follow this at all.

Latency and predictability are good qualities. So is throughput. In the limit, if you never return to the event loop during a benchmark run, any amount of iGC is *always* worse than just doing a nonincremental GC when you need it. If you do return to the event loop a few times, there is no answer that is guaranteed to be correct. It depends on what happens between turns, how long the gap is between turns, and how full your heap is.

If you're going to be idle for a second between turns, and one turn doesn't exhaust memory, then the optimum is to do no GCs during the turn, return to the event loop and process any input, then do a full nonincremental GC. If you are going to go way over memory (and most of it is garbage), then you should do as many full nonincremental GCs as you need and the final GC doesn't matter that much. iGC during the turn is a nice intermediate point, where you'll pay some throughput in order to speed up an eventual memory usage-triggered finishing GC, but it's not the "right" answer. And if you don't have much of any idle time between turns, then it really depends on the application for whether it's better to maximize throughput or minimize jitter.

Arguing that being good on this sort of workload means you'll be good on rAF is backwards. Yes, low latency on octane-splay does imply low latency with rAF fairly well. But the converse is not true, and there is always a latency vs throughput tradeoff with these things, so to some extent low latency on octane-splay also implies low throughput on octane-splay. Do we really want a benchmark rewarding low throughput, just because a workload exists for which that is the right choice?

If we want browsers to be good at rAF, I think we should have a benchmark that uses rAF. If we want a proxy benchmark that runs in the shell, I would suggest that rather than hacking something based on Date.now or performance.now or whatever, we make up new things like benchmark.startFrame and benchmark.endFrame and tell JS engine makers that it's totally legit to optimize behavior by assuming those are called at the beginning and end of a frame. If the benchmark code needs to have something awful like

  if (!this.hasOwnProperty("benchmark") || !benchmark.hasOwnProperty("startFrame")) {
    benchmark = { startFrame: function () {}, endFrame: function () {} };
  }

to be able to run on all engines, then so be it. If we go with something like this, then I'd be a lot more comfortable with some hacky pattern-match-splay-latency type of solution to "backport" the startFrame notification to the current benchmark.

As till said, ES7(?) task loops or something would be a better place for this, but I'm thinking shorter-term.

> 
> So we kinda stuck with this :(.
> Meaning we need some sort of way to do iGC, during execution (without the
> current pauses).
> So it might make sense to check it during "Date.now()". But that feels like
> too specific. It doesn't make sense for me that code that has "Date.now()"
> only get iGC, but if we don't measure time, we don't get iGC. I think we
> need a more global way of running iGC. I was wondering if it was viable to
> have a check (like we have for "script running too long"), on loop headers
> or something. I think that might be a more constructive way to deal with
> this, than special casing "Date.now()"

Isn't a long-running script precisely where you normally want to maximize throughput?
One thing we could perhaps do is demonstrate that the total time v8 spends under the test function decreases if they don't do iGCs in this case, either. Modifying the benchmark to that effect should be trivial, but I don't know about v8. It might be enough to just increase the max slice duration to, oh, infinity.

If their score on this "benchmark" improves, then clearly for the web it's better not to do iGCs in this situation, as sfink argues. (And I don't see how it could not improve, at least slightly.)
> If we want browsers to be good at rAF, I think we should have a benchmark that uses rAF.

Yes, yes, a million times this. (And feel free to substitute any other value for "rAF".)
I like Steve's idea of adding a proxy for rAF. We couldn't just do it in the shell, though, since most people run Octane in the browser. (As mentioned above, we probably don't want to write the benchmark using actual rAF because the results would be really inconsistent.) However, if we want to add this special proxy-rAF to the browser, we'd have to add it as a property on the window or something. I'm guessing there would be a ton of resistance to that.

I'm coming around more and more to the view that we should just make our GC work like V8's.
(In reply to Bill McCloskey (:billm) from comment #22)
> I like Steve's idea of adding a proxy for rAF. We couldn't just do it in the
> shell, though, since most people run Octane in the browser. (As mentioned
> above, we probably don't want to write the benchmark using actual rAF
> because the results would be really inconsistent.) However, if we want to
> add this special proxy-rAF to the browser, we'd have to add it as a property
> on the window or something. I'm guessing there would be a ton of resistance
> to that.

Why wouldn't Promise chains be a good proxy for rAF performance? Doing each iteration of the splay benchmark in a promise and measuring the average delay between runs (so excluding the benchmark's workload itself) should give a pretty good idea of how the browser deals with the GC. Something like the following should hopefully get the idea across:


var iterations = 0;
var workloadDuration;
var startTime;
function resolver(resolve, reject) {
  var time = performance.now();
  // Dummy-workload, obviously.
  while (performance.now() < time + 100) {}
  workloadDuration = performance.now() - time;
  resolve();
}

function then() {
  var currentTime = performance.now();
  var delay = (currentTime - startTime - workloadDuration).toFixed(2);
  console.log('iteration ' + (++iterations) + ' completed in ' + delay + 'ms');
  if (iterations < 10) {
    startTime = performance.now();
    p = new Promise(resolver);
    p.then(then);
  }
}
startTime = performance.now();
var p = new Promise(resolver);
p.then(then);


Now obviously this only works where promises are available, so we'd need to implement ES6 Job Queues[1]. In the meantime, having a browser benchmark that only works in the browser, not the shell, isn't entirely absurd.


[1] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-jobs-and-job-queues
> Why wouldn't Promise chains be a good proxy for rAF performance?

I didn't see anything in the spec about it, but I assume we'd run the event loop in between each promise resolution. If that's true, then promises would have the same problem as rAF. Besides GC, we would be measuring everything that can run in the event loop: session store collection, page thumbnail collection, etc. Each run would give very different numbers because of this. It's actually not a bad thing to benchmark, since we want all of those things to be fast. But I think the Octane developers would object to it. Google people also created the bouncing balls benchmark, which was very similar to splay latency except that it did run the event loop. It had the problem of giving very inconsistent results, and I'm guessing that's why Octane included splay latency instead.
> but I assume we'd run the event loop in between each promise resolution.

With our current implementation, yes.

Per spec, no: Promises take priority over normal events.

Whether this is actually something we're willing to implement is unclear to me, since it would allow promises to block user interaction.
(In reply to Bill McCloskey (:billm) from comment #24)
> > Why wouldn't Promise chains be a good proxy for rAF performance?
> 
> I didn't see anything in the spec about it, but I assume we'd run the event
> loop in between each promise resolution.

As Boris said, it's not supposed to as things stand now. (But read on.)

> If that's true, then promises would
> have the same problem as rAF. Besides GC, we would be measuring everything
> that can run in the event loop: session store collection, page thumbnail
> collection, etc. Each run would give very different numbers because of this.

In a clean profile with just the one window/tab open, would that really introduce that much variability?

> It's actually not a bad thing to benchmark, since we want all of those
> things to be fast. 

This. Incidentally, we're not doing well at all on this benchmark (as posted in comment 23) by that measure: my main browsing profile has pauses between 3ms and 100ms, whereas in a new profile I get 3ms to ~6ms.

> But I think the Octane developers would object to it.

Because they want to isolate GC pause time testing, or because they're doing really well at the current, less user-relevant benchmark?

> Google people also created the bouncing balls benchmark, which was very
> similar to splay latency except that it did run the event loop. It had the
> problem of giving very inconsistent results, and I'm guessing that's why
> Octane included splay latency instead.

I guess to some extent that's fair, yes.


(Also, ignore the part about subtracting the workload duration. That obviously only works if the implementation doesn't GC at all during the workload, so doesn't make much sense here.)
We discussed this extensively at today's GC meeting and came to the conclusion that Bill is correct. Explaining why is a bit hard, but stick with me.

In a general purpose programming language, it is really, really hard to tell what effect latency will have because of all the ways it is possible to capture or update user-visible state. Things like writes to the front buffer, network I/O, file I/O, etc. can all have dramatic effects on a program's behavior in the presence of latency, uniform or otherwise. There are enough APIs that modify external state that it's quite hard to even draw a line in a program that tells you what /might/ end up being important (modulo Haskell's IO monad, of course).

Naturally, in JavaScript, things are a bit different. With JS, the state a user's script updates is simply not visible until the script is done, so only the time it takes to return to the event loop (throughput) matters. Of course, this is a lie: nothing is ever that simple. A script can observe latency by calling Date.now(), as it does for this benchmark. Or it could be communicating with a web worker where the uneven task delivery could cause pipeline stalls. Or it could be running chrome code where a long GC could time out a socket. Or block the user from changing desktops for several seconds because X is waiting on us. Or. Or. Or.

Flip the problem on its head. The only reason we've been thinking of throughput as the default is because we live in a run-to-completion world. This is wrong. Latency has to be the default: the effect of failing at latency can be catastrophic, whereas the effect of being slightly slower is (generally) just being slightly slower.

I'm taking this, although I have no idea what the right approach is yet.
Assignee: nobody → terrence
Status: NEW → ASSIGNED
Your comment left me initially confused, though it makes sense to me after thinking about it for a bit. So to be clear: although it won't directly make the active tab more responsive to the user and might slow run-to-completion, moving from a small number of full GCs to a multitude of incremental GCs will make the script perform more predictably, which may help its interaction with other parts of the engine?

Aside from the examples you mentioned, could a script reasonably check its own runtime to decide whether to stop and yield to other things - e.g. run for 30ms, then queue itself up to continue running a bit later? In the case of a script going out of its way to keep things responsive like that, it would be pretty annoying to have infrequent but (relatively) long GCs cause jank for the user.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #28)
> Aside from the examples you mentioned, could a script reasonably check its
> own runtime to decide whether to stop and yield to other things - e.g. run
> for 30ms, then queue itself up to continue running a bit later?

Within the engine? No, that's forbidden by run-to-completion semantics. We're not talking about yielding to other things, we're talking only about yielding to incremental GCs.

If you want your script to do something like that, you would need to do it manually. Promises might be useful.

> moving from a small amount of full GCs to a multitude of incremental GCs
> will make the script perform more predictably, which may help its

I don't think that's quite the right model. An incremental GC slice is not the same thing as a fraction of a full GC. It's more like a full GC is broken down into a series of incremental mark slices followed by a sweep slice. Or even closer, it's an initial root marking slice, a series of incremental mark slices, and a final sweep slice. You don't free any memory until the sweep slice, and you free about the same amount of memory as a full GC would.

If you let the numbers represent some arbitrary unit of time, say nanofortnights, you could compare them as:

full GCs:                    7           7          7 DONE
incremental slices:  2 1 1 1 3    2 1 1 1 3    2 1 1 1 3 DONE

Time goes to the right.

This is assuming that you're only doing a full GC when you need it to free up memory. Note that the sum of each group of incremental GCs is slightly higher than the cost of a full GC, and thus the incremental approach takes 3 nanofortnights longer to finish. But if you pick pairs of points in time (between GC slices) and measure the elapsed time between them, the variance is much lower in the incremental GC case.

If you're returning to the event loop at interior points, the lower latency incremental approach is clearly superior: imagine you got unlucky and returned to the event loop just before the first full GC. Then you resume processing briefly. In the full GC case, you then spend processing time + 7 nanofortnights. In the iGC case, you spend processing time + 3 nanofortnights, which is much less even though peak memory usage is the same (the full GC and final iGC slice free the same amount of memory, after the same amount of processing.)

If you don't return to the event loop, then the full GC approach looks better, because you finish faster. But the argument in comment 27 is that (1) there are many ways that the uneven spacing could matter, other than returning to the event loop; and (2) the consequences of such jitter are more important than the loss in throughput. I'm a little uncertain about (1), since they sound a little theoretical (real, but uncommon). But (2) is convincing to me -- time and time again, we find that avoiding pathological cases is more important than making the good (fast) cases better. So it makes sense to push hard on latency, avoiding huge long stalls at inopportune times, and only optimize throughput in cases where it doesn't hurt latency (much). Latency-first design, I guess. (They're not diametrically opposed -- many improvements will help with both.)
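The nanofortnight comparison above can be checked numerically; this toy sketch (a model of the argument, not an engine measurement) just totals the two schedules and their worst single pause:

```javascript
// Same GC work, scheduled two ways: three full GCs of 7 nanofortnights, or
// three groups of incremental slices (2+1+1+1+3 = 8 nanofortnights each).
const fullGCs = [7, 7, 7];
const slices = [2, 1, 1, 1, 3, 2, 1, 1, 1, 3, 2, 1, 1, 1, 3];
const total = a => a.reduce((s, x) => s + x, 0);

console.log(total(fullGCs));        // 21
console.log(total(slices));         // 24: 3 nanofortnights slower overall
console.log(Math.max(...fullGCs));  // 7: worst-case pause with full GCs
console.log(Math.max(...slices));   // 3: worst-case pause with slices
```

So the incremental schedule trades ~14% throughput for cutting the worst pause by more than half, which is the latency-first argument in a nutshell.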

This benchmark still sticks in my craw, because it's a case where it's relatively straightforward to automatically detect that we could improve throughput at no cost to latency. It really does reward incorrect scheduling behavior, unless you reinterpret Date.now to represent a visible side effect. But given that coming up with an improved benchmark would be a PITA (at least one that is representative when run in the shell), it makes sense to postpone that for later and tune for latency first.
Attached patch patch (obsolete) — Splinter Review
This patch improves my splay-latency score a lot without a big effect on other Octane tests and without relying on performance.now or anything. I've only tested on an x86 shell so far (which I modified in the patch so it can do IGCs). Score before:

Richards: 29788
DeltaBlue: 47255
Crypto: 28048
RayTrace: 96791
EarleyBoyer: 34444
RegExp: 3169
Splay: 23372
SplayLatency: 14856
NavierStokes: 29473
PdfJS: 17711
Mandreel: 29909
MandreelLatency: 29674
Gameboy: 62778
CodeLoad: 17405
Box2D: 47030
zlib: 58820
Typescript: 33492
----
Score (version 9): 28888

And after:

Richards: 29982
DeltaBlue: 46653
Crypto: 27875
RayTrace: 92943
EarleyBoyer: 35681
RegExp: 3228
Splay: 21885
SplayLatency: 24957
NavierStokes: 29502
PdfJS: 15728
Mandreel: 30001
MandreelLatency: 37767
Gameboy: 62841
CodeLoad: 17341
Box2D: 47429
zlib: 58640
Typescript: 27037
----
Score (version 9): 29521

So, splay-latency improves by 2/3 and splay decreases by 7% or so, and other octane tests don't seem to move (I get a big improvement in mandreel-latency when running the tests together, but this doesn't repro running mandreel individually.)

Currently, when we trigger GCs during splay it is due to GC thing allocation in the zone.  This causes an immediate non-incremental GC of the zone.  The patch changes this so that when we are 90% of the way to the zone allocation trigger, we start an incremental GC and execute a slice after every MB of allocation (splay allocates 300-500K of data between points when it measures pause times.)  If we hit the allocation threshold before the IGC finishes, we finish the GC non-incrementally, so our memory usage with this patch should not be higher than it is currently, unless the IGC collects less than the non-incremental GC.
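The heuristic Brian describes can be sketched roughly like this (all zone fields and helper names are hypothetical, not the actual SpiderMonkey code):

```javascript
// Sketch of the scheduling described above: start an incremental GC at 90% of
// the zone's allocation trigger, run one slice per extra MB allocated, and
// fall back to a non-incremental finish at the trigger itself.
const MB = 1 << 20;

// Stubs that just record what happened, standing in for the real GC entry points.
function startIncrementalGC(zone) { zone.incrementalGCRunning = true; zone.slices = 0; }
function runIncrementalSlice(zone) { zone.slices++; }
function finishGCNonIncrementally(zone) { zone.incrementalGCRunning = false; zone.gcBytes = 0; }

function maybeGCOnAllocation(zone) {
  if (zone.gcBytes >= zone.gcTriggerBytes) {
    finishGCNonIncrementally(zone);        // hard limit: finish non-incrementally
  } else if (zone.gcBytes >= 0.9 * zone.gcTriggerBytes) {
    if (!zone.incrementalGCRunning) {
      startIncrementalGC(zone);
      zone.lastSliceBytes = zone.gcBytes;
    } else if (zone.gcBytes - zone.lastSliceBytes >= MB) {
      runIncrementalSlice(zone);           // one slice per MB allocated
      zone.lastSliceBytes = zone.gcBytes;
    }
  }
}
```

Because the non-incremental path still fires at the original trigger, peak memory usage should match the old behavior, as the comment notes.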
Assignee: terrence → bhackett1024
Comment on attachment 8506466 [details] [diff] [review]
patch

I tested this in an x64 browser and get a bigger improvement than in the x86 shell --- maybe 4% overall, though it's noisy.  I also tested the old realtime raytracer demo that Bill pointed me towards, and get similar FPS with and without the patch, as expected.
Attachment #8506466 - Attachment description: WIP → patch
Attachment #8506466 - Flags: review?(wmccloskey)
Comment on attachment 8506466 [details] [diff] [review]
patch

Review of attachment 8506466 [details] [diff] [review]:
-----------------------------------------------------------------

I forgot one of my main objections to this approach: it will make the shell really non-deterministic. I know we already have a lot of non-determinism in the shell, but it's usually pretty self-contained (parsing, Ion compiling). But the GC touches a lot more stuff. I'm just afraid that a change like this will make it much more difficult to get reproducible shell test cases.

However, I think there's a solution. We already have a mechanism to do work-based GC scheduling (as opposed to time-based). It's only used for debugging right now, but I think it would be good for this. We'll need a few things:
1. An extra parameter for GCSchedulingTunables that determines how much work we do per slice, if we're using work-based scheduling.
2. Code to select work-based scheduling for ALLOC_TRIGGER GCs.

You'll also need to look for an amount of work that makes splay-latency work well. Hopefully this won't be too hard. If it is, I guess we could reconsider the time-based approach.

::: js/src/gc/Heap.h
@@ +1014,5 @@
> +    /*
> +     * Amount of data to allocate before a new incremental slice can be
> +     * triggered.
> +     */
> +    size_t delayBytes_;

This doesn't really seem to belong here. I think the idea of this class is to track stats that we keep for both the runtime and the zone. In this case we're only concerned with the zone, so I think it would make more sense for this to be a field of the Zone.

::: js/src/jsgc.cpp
@@ +5897,5 @@
>      int64_t budget;
>      if (millis)
>          budget = SliceBudget::TimeBudget(millis);
> +    else if (reason == JS::gcreason::ALLOC_TRIGGER)
> +        budget = sliceBudget;

This is where we'd done something like |budget = SliceBudget::WorkBudget(workBudget)|, where workBudget is the GC tunable.

::: js/src/shell/js.cpp
@@ +6132,5 @@
>      JS_SetGCParameter(rt, JSGC_MODE, JSGC_MODE_INCREMENTAL);
>      JS_SetGCParameterForThread(cx, JSGC_MAX_CODE_CACHE_BYTES, 16 * 1024 * 1024);
> +    JS_SetGCParameter(rt, JSGC_DYNAMIC_HEAP_GROWTH, 1);
> +    JS_SetGCParameter(rt, JSGC_DYNAMIC_MARK_SLICE, 1);
> +    JS_SetGCParameter(rt, JSGC_SLICE_TIME_BUDGET, 10);

And we'd change this to JSGC_SLICE_WORK_BUDGET = whatever, leaving the time budget at 0.
Attachment #8506466 - Flags: review?(wmccloskey)
Attached patch "patch" — Splinter Review
We already have a mechanism for avoiding non-determinism in the shell, and I think it would be better to use that than to work around this issue by changing the incremental mechanism.  A work-based strategy seems like it would be pretty hard to tune well, since the amount of time taken to do a certain amount of work varies widely between machines.  This patch doesn't set the GC settings that allow incremental GC in --enable-more-deterministic builds, which are what the fuzzers usually use.
Attachment #8506466 - Attachment is obsolete: true
Attachment #8506902 - Flags: review?(wmccloskey)
(In reply to Brian Hackett (:bhackett) from comment #33)
> This patch doesn't
> set the GC settings that allow incremental GC in --enable-more-deterministic
> builds, which are what the fuzzers usually use.

The fuzzers only use that flag for differential testing, not for the "normal" fuzzing.
(In reply to Jan de Mooij [:jandem] from comment #34)
> (In reply to Brian Hackett (:bhackett) from comment #33)
> > This patch doesn't
> > set the GC settings that allow incremental GC in --enable-more-deterministic
> > builds, which are what the fuzzers usually use.
> 
> The fuzzers only use that flag for differential testing, not for the
> "normal" fuzzing.

The configure flags that Gary posts seem to all use --enable-more-deterministic.  I don't know what configure flags decoder uses.  I don't see any real disadvantages to using --enable-more-deterministic all or most of the time though.
I admit it would be nice if the slices lasted a consistent amount of time across machines. Let's see what Gary and Christian think. Guys, when do you test with the --enable-more-deterministic flag? Would you be opposed to always turning it on?
Flags: needinfo?(gary)
Flags: needinfo?(choller)
(In reply to Bill McCloskey (:billm) from comment #36)
> Guys, when do you test with the --enable-more-deterministic flag? Would you be opposed to always turning it on?

No, I'm not opposed as long as we don't break the web.

(In reply to Jan de Mooij [:jandem] from comment #34)
> (In reply to Brian Hackett (:bhackett) from comment #33)
> > This patch doesn't
> > set the GC settings that allow incremental GC in --enable-more-deterministic
> > builds, which are what the fuzzers usually use.
> 
> The fuzzers only use that flag for differential testing, not for the
> "normal" fuzzing.

Not necessarily true; there are some combinations in which we use that flag for "normal" fuzzing too, just so we don't miss anything. Sometimes these combinations are randomly chosen.
Flags: needinfo?(gary)
(In reply to Bill McCloskey (:billm) from comment #36)
> I admit it would be nice if the slices lasted a consistent amount of time
> across machines. Let's see what Gary and Christian think. Guys, when do you
> test with the --enable-more-deterministic flag? Would you be opposed to
> always turning it on?

I don't think it's really a requirement that all fuzzing be done with --enable-more-deterministic; it's just that with this option we'll be able to easily isolate the effect of the incremental GCs and use that to more easily debug an affected program.  This is similar to what we do with off-thread Ion compilation, which also affects all programs but can be turned off with --no-threads to isolate its effect.
I don't fuzz with --enable-more-deterministic because it differs from the code that we ship in production. Furthermore, I'm not sure if we have builds for that, and I'd like to keep fuzzing working with our builds.

If there is something about incremental GC that would make stuff more deterministic, then I'd be happy about that, but can we make that just a runtime option for the shell? That's also a lot easier to integrate for testing.
Flags: needinfo?(choller)
Comment on attachment 8506902 [details] [diff] [review]
patch

Review of attachment 8506902 [details] [diff] [review]:
-----------------------------------------------------------------

OK, let's do this, but with a command-line switch as well as the configure flag.
Attachment #8506902 - Flags: review?(wmccloskey) → review+
(In reply to Brian Hackett (:bhackett) from comment #30)
> So, splay-latency improves by 2/3 and splay decreases by 7% or so, and other
> octane tests don't seem to move (I get a big improvement in mandreel-latency
> when running the tests together, but this doesn't repro running mandreel
> individually.)

I'm probably missing something or these numbers vary wildly, but to me it looks like some other tests are substantially affected: pdf.js by 11%, and TypeScript by 19%. RayTrace is down by close to 4%, but that's within the normal range of variation, so probably not relevant.
(In reply to Till Schneidereit [:till] from comment #41)
> (In reply to Brian Hackett (:bhackett) from comment #30)
> > So, splay-latency improves by 2/3 and splay decreases by 7% or so, and other
> > octane tests don't seem to move (I get a big improvement in mandreel-latency
> > when running the tests together, but this doesn't repro running mandreel
> > individually.)
> 
> I'm probably missing something or these numbers vary wildly, but to me it
> looks like some other tests are substantially affected: pdf.js by 11%, and
> TypeScript by 19%. RayTrace is down by close to 4%, but that's within the
> normal range of variation, so probably not relevant.

Numbers on octane tests can vary widely between runs.  After doing several runs and comparing the results I didn't notice big swings in other tests, but we'll need to wait for awfy for more certainty.
https://hg.mozilla.org/mozilla-central/rev/35025fd9e99b
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Depends on: 1088298