SPS profile: http://people.mozilla.com/~bgirard/cleopatra/#report=cc6842a7ecf8fef4605f31575d1c954c1a9c38d6

Something has changed such that when running Escape Goat, a GC is triggered by onTooMuchMalloc every handful of frames. Interestingly enough, the SPS profile claims that all these GCs are being triggered by GraphicsDevice$UpdateViewport, despite the fact that I cannot identify any garbage allocations in that function, let alone enough allocations to cause it to be the bottleneck in an SPS profile (51.7% worth).

However, I have a guess: this function happens to call setTransform, translate and scale on a Canvas 2D context. I suspect something changed in the Canvas backend that results in lots of malloc() calls, thus triggering GCs.

I tried fiddling with the $UpdateViewport function to see if I could make it faster based on early profile data, and this actually *moved* the GCs out of the function and into another one, perhaps because I had changed inlining characteristics or something. Either way, I'm pretty sure this function itself isn't creating garbage.

Aurora 21.0a2 (2013-03-12) is busted.
Release (!) 19.0.2 is busted.
Nightly not tested because 848803 isn't fixed yet.

I don't actually remember seeing this when I was running the Aurora build of 19.x, so it's possible something else changed... maybe my video drivers? I'll try narrowing down a regression range with mozregression, but it'll be a pain since I can only observe this through SPS profiles.
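For reference, a hypothetical reconstruction of what an UpdateViewport-style function might look like. The setTransform/translate/scale calls are the real Canvas 2D API methods named above; everything else (the function shape, parameters, and the recording stub that stands in for a browser context so this runs outside a browser) is my guess, not the game's actual code:

```javascript
// Trivial recording stub standing in for a real CanvasRenderingContext2D.
function makeStubContext() {
  const calls = [];
  return {
    calls,
    setTransform(...args) { calls.push(["setTransform", args]); },
    translate(...args) { calls.push(["translate", args]); },
    scale(...args) { calls.push(["scale", args]); },
  };
}

// Hypothetical sketch of a viewport update. None of these calls allocate
// GC-visible garbage in JS, which is why seeing onTooMuchMalloc GCs
// attributed to a function like this was so confusing.
function updateViewport(ctx, offsetX, offsetY, zoom) {
  ctx.setTransform(1, 0, 0, 1, 0, 0); // reset to the identity transform
  ctx.translate(offsetX, offsetY);
  ctx.scale(zoom, zoom);
}

const ctx = makeStubContext();
updateViewport(ctx, 10, 20, 2);
console.log(ctx.calls.length); // 3
```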
It turns out there's actually a way to observe this bug without SPS. Because many of the GCs occur during frames, they get picked up by the frame timing logic built into the replay tool. You can view it by clicking Show Log after the replay ends; it'll be right above '// begin JSON-formatted data //'.

If the GC isn't triggering super frequently, the data will look like this (Chrome Canary):

Framerate: Mean 00206.9fps, Median 00250.0fps, 2nd percentile 01000.0fps, 98th percentile 00125.0fps, Standard Deviation 00.87ms

Low standard deviation, mean and median pretty close to each other. (The standard deviation is in ms, the others are in fps; if you convert the fps values back to ms you'll see the absolute difference is quite small.)

In a build of FF with the GC triggering frequently, it looks like this:

Framerate: Mean 00097.1fps, Median 00230.3fps, 2nd percentile 00601.4fps, 98th percentile 00008.4fps, Standard Deviation 22.83ms

The incredibly large standard deviation indicates that it's recording lots of long frames (because the GC interrupted them), but when the GC isn't interrupting, performance is *awesome*. Got mozregression running right now based on this.
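Converting the fps values back to frame time makes the comparison concrete. A quick sketch (the conversion is just 1000/fps; the numbers are from the Chrome Canary log above):

```javascript
// Convert a fps figure to frame time in ms, so it can be compared directly
// with the standard deviation, which the log reports in ms.
function fpsToMs(fps) {
  return 1000 / fps;
}

// Chrome Canary: mean 206.9fps vs. median 250.0fps sounds like a big gap,
// but in absolute frame time it's under a millisecond.
const meanMs = fpsToMs(206.9);   // ~4.83ms per frame
const medianMs = fpsToMs(250.0); // 4.00ms per frame
console.log((meanMs - medianMs).toFixed(2)); // prints "0.83"
```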
A bunch of the mozregression builds either crashed or had really severe secondary performance problems (that could have been hiding the GC issue), but this is what mozregression claims to be the regression range:

Last good nightly: 2012-09-11
First bad nightly: 2012-09-14
Pushlog: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=96287ad60bef&tochange=e5af3d785252
I dug through the code for the canvas methods that function calls, and it doesn't seem like they can malloc. So I tried removing those calls entirely, and it simply moved the GCs to the next function in the call path that called into Canvas. I think the reason I'm seeing GCs in this function is simply that the Canvas API calls are a JS -> native transition, so maybe that's the earliest a GC can happen when it's pending due to malloc?

It looks like the ionmonkey merge is in the regression window, so I think this probably should have been a JS Engine bug after all. Gonna update the bug description a bit.
Assignee: nobody → general
Summary: Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs possibly triggered by Canvas → Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs introduced after Ion merge
Version: 21 Branch → 19 Branch
After figuring out that the Firefox symbol server only has symbols for *release* builds, not debug builds, I did some digging with breakpoints to figure out who was triggering onTooMuchMalloc, since this bug still didn't make sense to me (and the regression window was a mess). I figured out the cause of these GCs, though I don't know how it regressed during that window.

Some non-rendering code in the game was requesting the dimensions of the screen, and the accessor it used had a bug that meant it would do the equivalent of this:

canvas.width = canvas.width

Normally, this is an adequate way to clear a canvas and doesn't have a severe performance impact - you'll even see people recommend it as a way to clear a canvas: http://stackoverflow.com/questions/2142535/how-to-clear-the-canvas-for-redrawing

In this case, it was totally accidental, but it was enough to trigger these GCs through this convoluted route: setting canvas.width (to anything, even the current size) triggers CanvasRenderingContext2D::Reset, which frees CanvasRenderingContext2D::mTarget. The next time you call any CanvasRenderingContext2D method, CanvasRenderingContext2D::EnsureTarget sees that mTarget was freed and allocates a new rendering target, calling JS_updateMallocCounter(width * height * 4) in the process. This is likely to bring the malloc counter down to zero, triggering onTooMuchMalloc and setting off a GC.

Confusingly, despite the fact that this seems to suggest a clear route to the GC (directly through canvas functions, no less), these GCs never appeared on the stack in SPS profiles. I don't know if this is an SPS bug, an artifact of inlining/optimization, or what.

I was able to verify my findings by tracking down the function that was doing 'canvas.width = canvas.width' and changing it to instead do 'if (canvas.width !== width) canvas.width = width' so that in the common case (where it doesn't need to change anything), the canvas is not resized.
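The buggy pattern and the fix can be sketched with a stub canvas that counts reallocations. The stub is mine (written so this runs outside a browser); only the assignment pattern and the guarded-write fix come from the actual change:

```javascript
// Stub canvas that counts how many times its backing store would be
// reallocated. In a real browser, ANY assignment to canvas.width frees and
// reallocates the rendering target, even when the value is unchanged.
function makeStubCanvas(width) {
  let reallocations = 0;
  return {
    get width() { return width; },
    set width(w) { width = w; reallocations++; },
    get reallocations() { return reallocations; },
  };
}

const canvas = makeStubCanvas(800);

// Buggy accessor: "reading" the size accidentally reallocates every call.
canvas.width = canvas.width;
canvas.width = canvas.width;
console.log(canvas.reallocations); // 2

// Fixed accessor: only touch canvas.width when the size actually changes.
function setWidth(c, w) {
  if (c.width !== w) c.width = w;
}
setWidth(canvas, 800); // no-op: size unchanged, no reallocation
console.log(canvas.reallocations); // still 2
```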
This eliminated all the spurious GCs, and performance is awesome with the change in place.

So, based on this, I see two core problems on the Firefox side:

1. I don't think canvas allocations of any kind should update the JS malloc counter. Having every byte of a canvas surface count against the JS heap and trigger constant JS GCs is super useless, because unless there are leaked JS objects holding canvases alive, the GC will not free any memory. Furthermore, in this particular use case - a Reset due to a size change - the GC CANNOT free any canvas memory, so there's no point in triggering it.

2. Setting a canvas's width or height to the current value of that property should not free and reallocate the canvas. It seems extremely wasteful and probably comes with an unnecessary performance penalty; I don't know what the spec describes as the desired behavior, but I expect you could achieve it by simply clearing the render target and resetting any state that is supposed to be reset (like the transform stack).

I'm going to refrain from recategorizing this to Canvas since I'm still not really sure whether a Canvas change introduced this regression, or whether the bad Canvas behavior was already there and JS changes made it cause these GCs. I've updated the URL associated with the bug to point to an old (still broken) build of the game so I can push a version that is fixed.
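On the application side, a clearing approach that avoids reallocating the backing store entirely is to reset the transform and use clearRect instead of reassigning canvas.width. The setTransform/clearRect calls are the standard Canvas 2D API; the recording stub is just mine, so the sketch can run outside a browser:

```javascript
// Clear a canvas without touching canvas.width/height, so the rendering
// target is never freed and reallocated (and no malloc-counter GC fires).
function clearCanvas(canvas, ctx) {
  ctx.setTransform(1, 0, 0, 1, 0, 0); // undo any in-effect transform first
  ctx.clearRect(0, 0, canvas.width, canvas.height);
}

// Recording stub standing in for a real 2D context.
const calls = [];
const stubCtx = {
  setTransform(...args) { calls.push("setTransform"); },
  clearRect(...args) { calls.push("clearRect"); },
};

clearCanvas({ width: 800, height: 600 }, stubCtx);
console.log(calls.join(",")); // prints "setTransform,clearRect"
```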
If the problem is caused by a JS_updateMallocCounter call, then it's a canvas problem.
Assignee: general → nobody