Dropped frames in Canvas Animations; presumed caused by GC pauses

RESOLVED FIXED

Status

()

defect
RESOLVED FIXED
5 years ago
8 months ago

People

(Reporter: jwmerrill, Assigned: terrence)

Tracking

(Blocks 1 bug)

36 Branch
x86
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0) Gecko/20100101 Firefox/36.0
Build ID: 20141212004002

Steps to reproduce:

I'm trying to make a 60 fps canvas backed animation in Elm, a compile-to-js functional language: http://elm-lang.org/

I made a simple program with bubbles that track the mouse: http://jsbin.com/cusahisozo

Running this program in Firefox (35 & 36.0a2) and Chrome (39), I observe that Firefox periodically drops frames, and Chrome generally doesn't.

Here is a sample from Firefox dev tools new Timeline: http://imgur.com/Ry7S4ab

In a typical frame, there are a few short bursts of js execution totaling < 4 ms. However, in the middle of that screenshot, there is a large 100 ms pause that I suspect must be garbage collection.

I also made an artificial canvas drawing and garbage collection benchmark to try to get rid of some of the complexity of Elm's compiled code and runtime:

http://jsbin.com/puzojazapi/1/edit?js,output

(click "run with js" in the top right if it doesn't run automatically).

This code runs a requestAnimationFrame loop that would ideally fire once per frame. Every time through the loop, it draws a circle with a position determined by the current time, and the lightness of the circle alternates between successive frames.

When things are running smoothly, the result looks like a stack of pennies, with evenly spaced bands of dark and light. When there are dropped frames, it looks like someone grabbed a few pennies out of the stack.

To stress the GC a bit, I've added some allocation busywork to the loop, controlled by 2 constants at the top of the source. Each time through the loop, I fill a new array with ENTRY_LENGTH random numbers, and then put a reference to it in a 2nd array to make sure it can't be immediately collected. After RETAIN_ENTRIES times through the draw loop, I return back to the beginning of the outer array, and start overwriting previous entries, allowing them to be collected by the GC. So ENTRY_LENGTH determines how much allocation happens on each frame, and the product of ENTRY_LENGTH*RETAIN_ENTRIES determines how much uncollectable memory we allocate before we start letting parts of it get collected.

When ENTRY_LENGTH is set to 0, no allocation occurs, and both Firefox and Chrome animate pretty smoothly. With RETAIN_ENTRIES=100, ENTRY_LENGTH=1e5, I see small pauses in Chrome, and noticeably longer pauses in Firefox. With RETAIN_ENTRIES=100, ENTRY_LENGTH=1e6, things get really interesting:

http://imgur.com/a/Tu1Ix

Firefox (top) runs at 60 fps in bursts, interspersed with long periods of dropped frames. Chrome (bottom) decreases the frame rate until it is able to draw consistently.

My overall conclusion is that Chrome seems to have significantly lower maximum GC pauses for this (artificial) workload.

System Details:

MacBook Pro (Retina, 15-inch, Late 2013)
Processor: 2.3 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3
Graphics: Integrated Intel Iris Pro 1536 MB + Discrete NVIDIA GeForce GT 750M


Actual results:

Dropped frames.


Expected results:

No dropped frames.
(In reply to jwmerrill from comment #0)
> User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0)
> Gecko/20100101 Firefox/36.0
> Build ID: 20141212004002
> 
> Steps to reproduce:
> 
> I'm trying to make a 60 fps canvas backed animation in Elm, a compile-to-js
> functional language: http://elm-lang.org/
> 
> I made a simple program with bubbles that track the mouse:
> http://jsbin.com/cusahisozo

Just from that description, I suspect that bug 1085597 is involved: we currently can't nursery-allocate things that cross between C++ and JS, eg mouse events. That turns out to be the only significant source of tenured garbage in things like asm.js games (that use movement events), and it may very well be what you're running into here.

> I also made an artificial canvas drawing and garbage collection benchmark to
> try to get rid of some of the complexity of Elm's compiled code and runtime:
> 
> http://jsbin.com/puzojazapi/1/edit?js,output
> 
> (click "run with js" in the top right if it doesn't run automatically).
> 
> This code runs a requestAnimationFrame loop that would ideally fire once per
> frame. Every time through the loop, it draws a circle with a position
> determined by the current time, and the lightness of the circle alternates
> between successive frames.
> 
> When things are running smoothly, the result looks like a stack of pennies,
> with evenly spaced bands of dark and light. When there are dropped frames,
> it looks like someone grabbed a few pennies out of the stack.
> 
> To stress the GC a bit, I've added some allocation busywork to the loop,
> controlled by 2 constants at the top of the source. Each time through the
> loop, I fill a new array with ENTRY_LENGTH random numbers, and then put a
> reference to it in a 2nd array to make sure it can't be immediately
> collected. After RETAIN_ENTRIES times through the draw loop, I return back
> to the beginning of the outer array, and start overwriting previous entries,
> allowing them to be collected by the GC. So ENTRY_LENGTH determines how much
> allocation happens on each frame, and the product of
> ENTRY_LENGTH*RETAIN_ENTRIES determines how much uncollectable memory we
> allocate before we start letting parts of it get collected.

Lifespan is pretty critical to the GC. How well does this mimic your actual application? You mentioned that you have a per-frame data structure that you construct. If this is live at the end of a frame, it will get tenured when we do a per-frame GC. If you can prevent from anything like that staying alive after a frame, it might make a big difference.

Basically, if you can keep things in the nursery, then you can allocate quite a bit of garbage and it won't cost you much at all. Minor (nursery) GCs are usually very cheap. Most of the time, if you can stay within the nursery then GC is essentially free. The main reasons I'm aware of for falling out of the nursery are (1) keeping things alive across minor GCs (eg past the end of an animation frame), and (2) because of reference edges crossing between C++ and JS. We're working on reducing the cases where #2 happens, by allowing nursery allocation of simple structures that don't escape.

This is speculation, though -- I haven't looked closely at your benchmarks. (But thank you very much for filing this! We need some good examples to work from.)

> http://imgur.com/a/Tu1Ix

Yeah, that's a very nice depiction of behavior we don't want. :-)

> Firefox (top) runs at 60 fps in bursts, interspersed with long periods of
> dropped frames. Chrome (bottom) decreases the frame rate until it is able to
> draw consistently.

Interesting. I don't think we're doing anything cleverly adaptive like that.

> My overall conclusion is that Chrome seems to have significantly lower
> maximum GC pauses for this (artificial) workload.

I can't promise that this is definitely a representative enough workload for us to care about, but I know I personally don't have anything better to work from, so when I get some time I'd like to look into this.
Blocks: GC.60fps
Depends on: 1085597
(In reply to Steve Fink [:sfink, :s:] (PTO til Dec 1) from comment #1)
> (In reply to jwmerrill from comment #0)
> > User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:36.0)
> > Gecko/20100101 Firefox/36.0
> > Build ID: 20141212004002
> > 
> > Steps to reproduce:
> > 
> > I'm trying to make a 60 fps canvas backed animation in Elm, a compile-to-js
> > functional language: http://elm-lang.org/
> > 
> > I made a simple program with bubbles that track the mouse:
> > http://jsbin.com/cusahisozo
> 
> Just from that description, I suspect that bug 1085597 is involved: we
> currently can't nursery-allocate things that cross between C++ and JS, eg
> mouse events. That turns out to be the only significant source of tenured
> garbage in things like asm.js games (that use movement events), and it may
> very well be what you're running into here.

That's pretty interesting. Elm is in a very different part of design space from asm.js, though.

> > I also made an artificial canvas drawing and garbage collection benchmark to
> > try to get rid of some of the complexity of Elm's compiled code and runtime:
> > 
> > http://jsbin.com/puzojazapi/1/edit?js,output
> > 
> > (click "run with js" in the top right if it doesn't run automatically).
> > 
> > This code runs a requestAnimationFrame loop that would ideally fire once per
> > frame. Every time through the loop, it draws a circle with a position
> > determined by the current time, and the lightness of the circle alternates
> > between successive frames.
> > 
> > When things are running smoothly, the result looks like a stack of pennies,
> > with evenly spaced bands of dark and light. When there are dropped frames,
> > it looks like someone grabbed a few pennies out of the stack.
> > 
> > To stress the GC a bit, I've added some allocation busywork to the loop,
> > controlled by 2 constants at the top of the source. Each time through the
> > loop, I fill a new array with ENTRY_LENGTH random numbers, and then put a
> > reference to it in a 2nd array to make sure it can't be immediately
> > collected. After RETAIN_ENTRIES times through the draw loop, I return back
> > to the beginning of the outer array, and start overwriting previous entries,
> > allowing them to be collected by the GC. So ENTRY_LENGTH determines how much
> > allocation happens on each frame, and the product of
> > ENTRY_LENGTH*RETAIN_ENTRIES determines how much uncollectable memory we
> > allocate before we start letting parts of it get collected.
> 
> Lifespan is pretty critical to the GC. How well does this mimic your actual
> application? You mentioned that you have a per-frame data structure that you
> construct. If this is live at the end of a frame, it will get tenured when
> we do a per-frame GC. If you can prevent from anything like that staying
> alive after a frame, it might make a big difference.
> 
> Basically, if you can keep things in the nursery, then you can allocate
> quite a bit of garbage and it won't cost you much at all. Minor (nursery)
> GCs are usually very cheap. Most of the time, if you can stay within the
> nursery then GC is essentially free. The main reasons I'm aware of for
> falling out of the nursery are (1) keeping things alive across minor GCs (eg
> past the end of an animation frame), and (2) because of reference edges
> crossing between C++ and JS. We're working on reducing the cases where #2
> happens, by allowing nursery allocation of simple structures that don't
> escape.
> 
> This is speculation, though -- I haven't looked closely at your benchmarks.
> (But thank you very much for filing this! We need some good examples to work
> from.)

This is really interesting. I didn't know that there was a per-frame minor GC, but that does make sense.

As far as the fidelity of the artificial benchmark--I don't think it is a very faithful model at all. It was the simplest thing I could think of that involved drawing and GC that allows tuning how much garbage stays around for how long. Originally, I wasn't sure if the stutter in Firefox was due to worse canvas drawing performance, slower JS execution overall, or GC pauses. I made the artificial model to help me see what happens when you abuse the GC.

I do think that the first program I posted, the orange bubbles following the mouse, is pretty representative of problems we've seen show up in many other Elm programs. If that can be made to run smoothly, I think many other examples will also run more smoothly.

To understand real allocation patterns in Elm, it's important to know that the language places strong emphasis on immutability. It's not clear to me, personally, that this will be worth the tradeoffs for animated games, but I do think it's a very worthwhile experiment. And for simple examples, at least, it seems to be working out well in Chrome/v8 already.

In the bubble program, each bubble is modeled by an object that stores its size, position, velocity, and acceleration from the previous step. This collection of objects needs to be stored from one frame to the next so that the equations of motion can be integrated over the elapsed time. But the result of this integration is a new list of objects--it doesn't mutate the old list. At this point, the old list should eligible for GC. I suspect this is a common pattern in Elm: there will be many objects that live exactly for the duration from one frame to the next frame, and then they become garbage.

> > http://imgur.com/a/Tu1Ix
> 
> Yeah, that's a very nice depiction of behavior we don't want. :-)
> 
> > Firefox (top) runs at 60 fps in bursts, interspersed with long periods of
> > dropped frames. Chrome (bottom) decreases the frame rate until it is able to
> > draw consistently.
> 
> Interesting. I don't think we're doing anything cleverly adaptive like that.
> 
> > My overall conclusion is that Chrome seems to have significantly lower
> > maximum GC pauses for this (artificial) workload.
> 
> I can't promise that this is definitely a representative enough workload for
> us to care about, but I know I personally don't have anything better to work
> from, so when I get some time I'd like to look into this.
Apologies, it looks like I made those jsbins in a temporary way. These links should be permanent:

I made a simple program with bubbles that track the mouse: http://jsbin.com/piyeye/1/

I also made an artificial canvas drawing and garbage collection benchmark to try to get rid of some of the complexity of Elm's compiled code and runtime:

http://jsbin.com/hucoya/1/edit?js,output
Here is sample of console output with javascript.options.mem.log enabled in about:config from the demo file at http://jsbin.com/piyeye/1/

https://gist.github.com/jwmerrill/a40a0330e3094d473245

It looks like TOO_MUCH_MALLOC shows up frequently
I cannot speak for the entire GC team, but I, for one, <3 Elm and want to see it work as well (or better) in Firefox than it does in Chrome. We have a goal of getting Firefox to a reliable 60fps this year on gaming workloads, both asm.js and normal javascript. Given our limited resources, we obviously cannot promise that we'll succeed for Elm too, but we'll certainly add it to our benchmarks and do our best to see that typical Elm workloads rise with the tide.

(In reply to jwmerrill from comment #4)
> Here is sample of console output with javascript.options.mem.log enabled in
> about:config from the demo file at http://jsbin.com/piyeye/1/
> 
> https://gist.github.com/jwmerrill/a40a0330e3094d473245
> 
> It looks like TOO_MUCH_MALLOC shows up frequently

Our existing GC triggers are a hodge-podge of highly focused and frequently quirky solutions to specific GC/OOM issues that have cropped up in the past; sometimes a new workload turns up some really horrible behavior. Let's dig into the simple benchmark to see if we can figure out what's going off the rails.

Unlike in V8, Number is not heap allocated and is stored directly in the object's slot or element memeory. Also unlike V8, we store large slot data out-of-line in the malloc heap. Thus, the array insertion loop turns the runtime component into a test of realloc and rand. Moreover, it only creates one object per frame, although this object has a large elements vector. So, first question to answer is: why are we GCing so frequently if we're not creating any GC objects?

For efficiency, our heap-size GC triggers are edge triggers, not level triggers -- they count upwards at each malloc until we pass a threshold, then reset to 0 after the GC. (We're actively trying to change these to be precise, in part to better address workloads like this.) For the same not-so-great reasons, when we realloc a large elements vector, we don't un-count the original size, so each doubling effectively gets you 50% closer to the GC triggers than it should.

When the script has malloced (or realloced) 30MiB of data, we trigger an ALLOC_TRIGGER GC. This type of GC sets a flag to do a GC later, when it is safe to do so (this is so that sites that call malloc do not need to all be GC safe). ALLOC_TRIGGERed GC's are incremental, and the heap size we allow will scale if we start GCing frequently. However, if we have not actually gotten to a GC safe region (e.g. an object allocation) and done a GC before the user has crossed about:config's javascript.options.mem.high_water_mark (128MiB by default), then the script is considered to be hogging the heap and we instead schedule a TOO_MUCH_MALLOC GC. This type of GC runs non-incrementally because we want to reign in the heavy-allocator asap. Of course, this heuristic tends to be much less helpful on artificial workloads than on real workloads and is probably just too low these days to be helpful at all. Problem #1 identified: raise the high_water_mark on desktop. Ideally we'd make this scale according to actual memory utilization, but that's actually quite hard to do without causing excessive OOMs.

So, once we've upped high_water_mark a bit, we start getting ALLOC_TRIGGER and MAYBEGC GC's and everything is good, right? Nope! Sadly, we're still getting perceptible lag. Looking at the MOZ_GCTIMER output, we can see that the GC is choosing 20ms slice times. Why? After a bit of digging it turns out that even after raising the GC's malloc triggers, we're still triggering GC just more than javascript.options.mem.gc_high_frequency_time_limit_ms (once per second by default on desktop). This puts the script into "high frequency GC" mode. This heuristic is trying to capture the case where the script is generating so much garbage that we're just GCing constantly and we aren't running the mutator as much as we should. To address this, we double the slice length in the hopes that lowering the GC's incrementalization overhead will allow the script to get more work done. Problem #2 identified: make the GC observe MMU instead of frequency when adjusting the slice length. This should be fixable, but will need to be done with care.

So once we've upped gc_high_frequency_time_limit_ms a bit to get around this, we're good, right? Again no. All of our GC's still have a max-pause of 20ms. It looks like this is happening in an interruptable phase, so this is either a bug in our incrementalization or a bug in how we compute high-frequency mode. Let's work around this for now by setting gc_dynamic_mark_slice to false. Problem #3 or symptom of something else: more investigation needed.

So now that dynamic mark slices are disabled, we're good right? Well... no. We're now splitting the work into more slices, as expected and most slices are coming in at gc_incremental_slice_ms. Yay. However, we still have one slice in every GC that is taking too long: >20ms. What gives? Well, after more digging, I found a bug in processMarkStackTop: we only check our slice budget at the top of scan_obj. When we scan the large array of Number, we never hit an object, so we never interrupt the scan_value_array loop. Since this array is enormous, it just takes ~20ms to scan and so we blow out our budget. Whoops! Bug #2: check the slice budget at loop edges.

So once we're checking our slice budget again, we're good right? We're hitting gc_incremental_slice_ms reliably and everything does looks smoother, but the framerate is still noticeably lower when we're in the GC. The default value of 10ms was chosen for a 30fps refresh rate and now that we're running at 60fps, that's only leaving 6ms for drawing. Let's try lowering it and see what blows up first!

8ms: reliably hitting our mark, no not noticeable change in the drawing. Maybe it's just not enough to matter? Down to 6ms: still reliably hitting our mark, but the drawing looks notably worse now. We're now splitting into 6+ slices, and our max pause is dead-on 6ms, but I /still/ see 2 large gaps in the drawing. What gives? It appears that we're scheduling our GC slices correctly at the refresh driver for our first 3-4 frames, but after that, we've allocated enough after the last GC that suddenly we're getting inter-frame ALLOC_TRIGGER slices. Multiple of these are landing within a single frame, so even though we're maintaining low latency, we're missing draw. Problem #4: our alloc trigger GC's should observe that we are in an animation and delay accordingly.

So with that fixed, we're good right? Well, not really. 6ms slices are looking much, much smoother, but still not as smooth as chrome. Lets push it down to 4ms. Uh, oh, we're now collecting slowly enough that we don't finish a GC before we hit the ALLOC_TRIGGER. Since we've disabled the incremental ALLOC_TRIGGER frame for interSliceGC, The non-incremental ALLOC_TRIGGER kicks in and just finishes off the GC. This takes 20ms and takes us back to square one again. Problem #5: our GC scheduling heuristics have been pushed out of alignment, the steams have crossed, chaos ensues. I'm going to have to give this some thought, but for now let's focus on getting to 5ms.

====

As to the other workload, it looks like if we overcome the same problems as above, we get to a point where we're spending 30ms in grey marking: e.g. marking browser objects. In the past we haven't incrementalized this because we haven't had barriers in all of gecko. Now that GGC has made everything smart pointers, it should now be possible to incrementalize this part of the GC too.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to Terrence Cole [:terrence] from comment #5)
> we're good right?

This is a fascinating analysis. Thanks for taking the time to explain it.
As per my exploration, here are the non-configuration changes I needed to improve our latency on this benchmark. I was worried that the new check in processMarkStackTop would regress throughput, but I'm not seeing any significant hit on octane.
Assignee: nobody → terrence
Status: NEW → ASSIGNED
Attachment #8549829 - Flags: review?(jcoppeard)
Comment on attachment 8549829 [details] [diff] [review]
elm_ubench_latency-v0.diff

Review of attachment 8549829 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8549829 - Flags: review?(jcoppeard) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/cafc08288982

We still need to figure out what heuristics to tweak and how, so marking leave-open.
Keywords: leave-open
Depends on: 1123260
And backed out for regressing splay.
https://hg.mozilla.org/integration/mozilla-inbound/rev/8e736f848d7c

The extra budget check appears to be too much overhead when done naively. I should be able to reproduce locally and investigate some different approaches.
No longer depends on: 1123260
Depends on: 1123260
No longer depends on: 1123260
Looks like this was not the regressor, relanding:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e9e78f9c8c40
This is almost certainly the GC heuristic change, as that can delay GC activity. Let's land the marking change separately and think about our triggers in a more comprehensive manner elsewhere.

https://hg.mozilla.org/integration/mozilla-inbound/rev/8977d94c9d9a
The bubbles example:

http://output.jsbin.com/piyeye/1/

is running great in Firefox Developer Edition 40.0a2 with Electrolysis enabled. Almost no dropped frames. Really nice improvements!
\o/

Thanks, Jason! The tracing rewrite was an absurd amount of work, but it looks like it's paying massive dividends.
The leave-open keyword is there and there is no activity for 6 months.
:jonco, maybe it's time to close this bug?
Flags: needinfo?(jcoppeard)
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Flags: needinfo?(jcoppeard)
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.