Bug 850993 (Open, NEW)
Opened 11 years ago · Updated 2 years ago

Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs introduced after Ion merge

Categories: Core :: Graphics: Canvas2D, defect
People: Reporter: kael, Unassigned

Attachments (1 file): 58.31 KB, text/plain
SPS profile: http://people.mozilla.com/~bgirard/cleopatra/#report=cc6842a7ecf8fef4605f31575d1c954c1a9c38d6

Something has changed such that when running Escape Goat, a GC is triggered by onTooMuchMalloc every handful of frames. Interestingly enough, the SPS profile claims that all these GCs are being triggered by GraphicsDevice$UpdateViewport, despite the fact that I cannot identify any garbage allocations in that function, let alone enough allocations to cause it to be the bottleneck in an SPS profile (51.7% worth).

However, I have a guess: this function happens to call setTransform, translate, and scale on a Canvas 2D context. I suspect something changed in the Canvas backend that results in lots of malloc() calls, thus triggering GCs.

I tried fiddling with the $UpdateViewport function to see if I could make it faster based on early profile data, and this actually *moved* the GCs out of the function and into another one, perhaps because I had changed inlining characteristics or something. Either way, I'm pretty sure this function itself isn't creating garbage.

Aurora 21.0a2 (2013-03-12) is busted. Release (!) 19.0.2 is busted. Nightly not tested because bug 848803 isn't fixed yet. I don't actually remember seeing this when I was running the Aurora build of 19.x, so it's possible something else changed... maybe my video drivers?

I'll try narrowing down a regression range with mozregression, but it'll be a pain since I can only observe this through SPS profiles.
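For context, a minimal sketch of what a function like GraphicsDevice$UpdateViewport presumably does each frame (the names and parameters here are assumptions for illustration, not taken from the game's source): it only issues Canvas 2D transform calls, none of which should create JS garbage on their own.

```javascript
// Hypothetical reconstruction of an UpdateViewport-style function:
// pure transform setup on a CanvasRenderingContext2D, no allocations.
function updateViewport(ctx, viewport, scale) {
  ctx.setTransform(1, 0, 0, 1, 0, 0);     // reset to the identity transform
  ctx.translate(viewport.x, viewport.y);  // position the camera
  ctx.scale(scale, scale);                // apply zoom
}
```

Since setTransform/translate/scale merely mutate the context's current transform, a function like this being blamed for onTooMuchMalloc GCs is what makes the profile so surprising.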
Comment 1 (Reporter) • 11 years ago
The GC pauses don't go away if I disable hardware acceleration, so I don't think it could have been my video drivers. Though, amusingly, if I profile it with hw accel off, something like 20-25% of time is spent doing solid color fills to clear the canvas. Anyway, here's my about:support just in case.

Application Basics
  Name: Firefox
  Version: 21.0a2
  User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130313 Firefox/21.0
  Build Configuration: about:buildconfig

Extensions (Name / Version / Enabled / ID)
  Adblock Plus / 2.2.3 / true / {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}
  Cookies Manager+ / 1.5.1.1 / true / {bb6bc1bb-f824-4702-90cd-35e2fb24f25d}
  Download Statusbar / 0.9.10 / true / {D4DD63FA-01E4-46a7-B6B1-EDAB7D6AD389}
  NoScript / 2.6.5.8 / true / {73a6fe31-595d-460b-a920-fcc0f8843232}
  Rikaichan / 2.07 / true / {0AA9101C-D3C1-4129-A9B7-D778C6A17F82}
  Rikaichan Japanese-English Dictionary File / 2.01.120729 / true / rikaichan-jpen@polarcloud.com
  Suspend background tabs / 1.0.1 / true / suspendbackgroundtabs@adblockplus.org
  about:telemetry / 0.14 / false / ping.telemetry@mozilla.com
  Firebug / 1.11.2 / false / firebug@software.joehewitt.com
  Firefox OS Simulator / 2.0 / false / r2d2b2g@mozilla.org
  Force CORS / 1.4 / false / forcecors@ocact.us
  Gecko Profiler / 1.11.13 / false / jid0-edalmuivkozlouyij0lpdx548bc@jetpack
  JIT Inspector / 0.4.0.0 / false / bhackett@mozilla.com
  Rapportive / 1.4.0 / false / rapportive@rapportive.com
  Test Pilot / 1.2.2 / false / testpilot@labs.mozilla.com

Important Modified Preferences
  accessibility.typeaheadfind.flashBar: 0
  browser.cache.disk.capacity: 358400
  browser.cache.disk.smart_size.first_run: false
  browser.cache.disk.smart_size.use_old_max: false
  browser.cache.disk.smart_size_cached_value: 358400
  browser.places.smartBookmarksVersion: 4
  browser.startup.homepage_override.buildID: 20130313042013
  browser.startup.homepage_override.mstone: 21.0a2
  browser.tabs.warnOnClose: false
  dom.max_chrome_script_run_time: 40
  dom.max_script_run_time: 0
  dom.mozApps.used: true
  dom.w3c_touch_events.expose: false
  extensions.lastAppVersion: 21.0a2
  font.internaluseonly.changed: false
  gfx.direct3d.prefer_10_1: true
  network.cookie.prefsMigrated: true
  places.database.lastMaintenance: 1363231447
  places.history.expiration.transient_current_max_pages: 104858
  plugin.disable_full_page_plugin_for_types: application/pdf
  privacy.cpd.cookies: false
  privacy.cpd.downloads: false
  privacy.cpd.formdata: false
  privacy.cpd.history: false
  privacy.cpd.sessions: false
  privacy.sanitize.migrateFx3Prefs: true
  privacy.sanitize.timeSpan: 0
  security.warn_viewing_mixed: false
  webgl.verbose: true

Graphics
  Adapter Description: NVIDIA GeForce GTX 670
  Adapter Drivers: nvd3dumx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um
  Adapter RAM: 2047
  Device ID: 0x1189
  Direct2D Enabled: true
  DirectWrite Enabled: true (6.2.9200.16492)
  Driver Date: 2-9-2013
  Driver Version: 9.18.13.1407
  GPU #2 Active: false
  GPU Accelerated Windows: 1/1 Direct3D 10
  Vendor ID: 0x10de
  WebGL Renderer: Google Inc. -- ANGLE (NVIDIA GeForce GTX 670)
  AzureCanvasBackend: direct2d
  AzureContentBackend: direct2d
  AzureFallbackCanvasBackend: cairo

JavaScript
  Incremental GC: true

Accessibility
  Activated: false
  Prevent Accessibility: 0

Library Versions (Expected minimum version / Version in use)
  NSPR: 4.9.6 Beta / 4.9.6 Beta
  NSS: 3.14.3.0 Basic ECC Beta / 3.14.3.0 Basic ECC Beta
  NSSSMIME: 3.14.3.0 Basic ECC Beta / 3.14.3.0 Basic ECC Beta
  NSSSSL: 3.14.3.0 Basic ECC Beta / 3.14.3.0 Basic ECC Beta
  NSSUTIL: 3.14.3.0 Beta / 3.14.3.0 Beta
Comment 2 (Reporter) • 11 years ago
It turns out there's actually a way to observe this bug without SPS. Because many of the GCs occur during frames, they get picked up by the frame timing logic built into the replay tool. You can view it by clicking Show Log after the replay ends; it'll be right above '// begin JSON-formatted data //'.

If the GC isn't triggering super frequently, the data will look like this (Chrome Canary):

Framerate: Mean 00206.9fps, Median 00250.0fps, 2nd percentile 01000.0fps, 98th percentile 00125.0fps, Standard Deviation 00.87ms

Low standard deviation, mean and median pretty close to each other. (The standard deviation is in ms, the others are in fps; if you convert the fps values back to ms you'll see the absolute difference is quite small.)

In a build of FF with the GC triggering frequently, it looks like this:

Framerate: Mean 00097.1fps, Median 00230.3fps, 2nd percentile 00601.4fps, 98th percentile 00008.4fps, Standard Deviation 22.83ms

The incredibly large standard deviation indicates that it's recording lots of long frames (because the GC interrupted them), but when the GC isn't interrupting, performance is *awesome*.

Got mozregression running right now based on this.
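The statistics the replay tool logs can be computed from a list of per-frame durations roughly like this (a hypothetical sketch of the metric, not the tool's actual code; the field names are made up). Note how the percentiles of frame *time* invert into fps, so the 2nd percentile is the fastest frames and the 98th the slowest, GC-interrupted ones:

```javascript
// Summarize per-frame durations (in ms) into mean/median/percentile fps
// plus standard deviation in ms, mirroring the replay tool's log line.
function frameStats(frameTimesMs) {
  const n = frameTimesMs.length;
  const mean = frameTimesMs.reduce((a, b) => a + b, 0) / n;
  const sorted = frameTimesMs.slice().sort((a, b) => a - b);
  // Value at the p-th percentile of frame time (simple index method).
  const pick = (p) => sorted[Math.min(n - 1, Math.floor((p / 100) * n))];
  const variance = frameTimesMs.reduce((a, t) => a + (t - mean) ** 2, 0) / n;
  return {
    meanFps: 1000 / mean,
    medianFps: 1000 / pick(50),
    p2Fps: 1000 / pick(2),    // fastest frames
    p98Fps: 1000 / pick(98),  // slowest (GC-interrupted) frames
    stdDevMs: Math.sqrt(variance),
  };
}
```

A steady 10 ms/frame trace yields mean = median = 100 fps with zero standard deviation; sprinkling in occasional 100+ ms GC-interrupted frames barely moves the median but blows up the standard deviation, which is exactly the signature described above.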
Comment 3 (Reporter) • 11 years ago
A bunch of the mozregression builds either crashed or had really severe secondary performance problems (which could have been hiding the GC issue), but this is what mozregression claims to be the regression range:

Last good nightly: 2012-09-11
First bad nightly: 2012-09-14
Pushlog: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=96287ad60bef&tochange=e5af3d785252
Comment 4 (Reporter) • 11 years ago
I dug through the code for the canvas methods that function calls, and it doesn't seem like they can malloc. So, I tried removing those calls entirely, and it simply moved the GCs to the next function in the call path that called into Canvas.

I think the reason I'm seeing GCs in this function is simply that the Canvas API calls are a JS -> native transition, so maybe that's the earliest a GC can happen when it's pending due to malloc?

It looks like the IonMonkey merge is in the regression window, so I think this probably should have been a JS Engine bug after all. Gonna update the bug description a bit.
Assignee: nobody → general
Component: Canvas: 2D → JavaScript Engine
Summary: Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs possibly triggered by Canvas → Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs introduced after Ion merge
Version: 21 Branch → 19 Branch
Comment 5 (Reporter) • 11 years ago
After figuring out that the Firefox symbol server only has symbols for *release* builds, not debug builds, I did some digging with breakpoints to figure out who was triggering onTooMuchMalloc, since this bug still didn't make sense to me (and the regression window was a mess).

I figured out the cause of these GCs, though I don't know how it regressed during that window: some non-rendering code in the game was requesting the dimensions of the screen, and the accessor it used had a bug that meant it would do the equivalent of this:

canvas.width = canvas.width

Normally, this is an adequate way to clear a canvas, and doesn't have a severe performance impact - you'll even see people recommend it as a way to clear a canvas: http://stackoverflow.com/questions/2142535/how-to-clear-the-canvas-for-redrawing

In this case, it was totally accidental, but it was enough to trigger these GCs through this convoluted route: setting canvas.width (to anything, even the current size) triggers CanvasRenderingContext2D::Reset, which frees CanvasRenderingContext2D::mTarget. The next time you call any CanvasRenderingContext2D method, CanvasRenderingContext2D::EnsureTarget sees that mTarget was freed and allocates a new rendering target, calling JS_updateMallocCounter(width * height * 4) in the process. This is likely to bring the malloc counter down to zero, triggering onTooMuchMalloc and setting off a GC.

Confusingly, despite the fact that this seems to suggest a clear route to the GC (directly through canvas functions, no less), they never appeared on the stack in SPS profiles. I don't know if this is an SPS bug, an artifact of inlining/optimization, or what.

I was able to verify my findings by tracking down the function that was doing 'canvas.width = canvas.width' and changing it to instead do 'if (canvas.width !== width) canvas.width = width', so that in the common case (where it doesn't need to change anything) the canvas is not resized. This eliminated all the spurious GCs, and performance is awesome with the change in place.

SO, based on this, I see two core problems on the Firefox side:

1. I don't think canvas allocations of any kind should update the JS malloc counter. Having every byte of a canvas surface count against the JS heap and trigger constant JS GCs is super useless, because unless there are leaked JS objects holding canvases alive, the GC will not free any memory. Furthermore, in this particular use case - a Reset due to a size change - the GC CANNOT free any canvas memory, so there's no point in triggering it.

2. Setting a canvas's width or height to the current value of that property should not free and reallocate the canvas. It seems extremely wasteful and probably comes with an unnecessary performance penalty; I don't know what the spec describes as the desired behavior, but I expect you could achieve it by simply clearing the render target and resetting any state that is supposed to be reset (like the transform stack).

I'm going to refrain from recategorizing this to Canvas since I'm still not really sure if a Canvas change introduced this regression, or if the bad Canvas behavior was already there and JS changes made it cause these GCs. I've updated the URL associated with the bug to point to an old (still broken) build of the game so I can push a version that is fixed.
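The workaround described above can be sketched as a small guard helper (a hypothetical sketch; the function name is made up, and in the game the object would be a real HTMLCanvasElement, where every width/height assignment resets the 2D context and frees its backing store):

```javascript
// Only assign canvas.width/height when the size actually changes,
// so the rendering target is not freed and reallocated every frame.
function resizeIfNeeded(canvas, width, height) {
  if (canvas.width !== width) canvas.width = width;
  if (canvas.height !== height) canvas.height = height;
}
```

In the common case where the dimensions are unchanged, no assignment happens at all, so CanvasRenderingContext2D::Reset is never triggered and the malloc counter stays untouched.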
Comment 6

If the problem is caused by a JS_updateMallocCounter call, then it's a canvas problem.
Assignee: general → nobody
Component: JavaScript Engine → Canvas: 2D
Comment 7 • 10 years ago
I've been able to reproduce very similar behavior using the following steps, on both the stable version 31 and nightly 34.0a1 (2014-08-20):

1. Start Firefox.
2. Install Firebug (https://getfirebug.com/), restarting as needed.
3. Go to a JIRA issue: https://jira.atlassian.com/browse/WBS-181
4. Refresh the page (F5); it should reload fairly fast, < 1s.
5. Open up the Firebug window (F12) and close it.
6. Refresh the JIRA page again; now it will take 10-15 seconds to render.

I've reproduced this with an internal JIRA server as well, and I suspect large JavaScript files have something to do with it. The only way to get out of the long render times seems to be to restart Firefox. This also seems to happen without Firebug, after some indeterminate activity, but it is easily triggered using Firebug.

I've enabled GC logging with javascript.options.mem.log=true and see many of these entries:

GC(T+32.1) Reason: TOO_MUCH_MALLOC, Total Time: 192.3ms, Zones Collected: 21, Total Zones: 21, Total Compartments: 324, Minor GCs: 4, MMU (20ms): 0%, MMU (50ms): 0%, SCC Sweep Total: 19.5ms, SCC Sweep Max Pause: 19.5ms, Nonincremental Reason: requested, Allocated: 50MB, +Chunks: 0, -Chunks: 0
Totals: Wait Background Thread: 28.9ms, Purge: 0.6ms, Mark: 72.4ms, Mark Roots: 8.5ms, Sweep: 86.7ms, Mark During Sweeping: 49.2ms, Mark Weak: 6.6ms, Mark Gray: 37.0ms, Mark Gray and Weak: 5.6ms, Finalize Start Callback: 0.1ms, Sweep Atoms: 5.1ms, Sweep Compartments: 20.7ms, Sweep Discard Code: 3.0ms, Sweep Tables: 10.9ms, Sweep Cross Compartment Wrappers: 3.3ms, Sweep Base Shapes: 2.6ms, Sweep Initial Shapes: 1.8ms, Sweep Type Objects: 0.7ms, Sweep Breakpoints: 0.8ms, Discard Analysis: 6.7ms, Discard TI: 2.9ms, Sweep Types: 3.8ms, Sweep Object: 1.8ms, Sweep String: 1.8ms, Sweep Script: 1.1ms, Sweep Shape: 2.6ms, Sweep JIT code: 0.5ms, Finalize End Callback: 1.6ms, Deallocate: 0.9ms, End Callback: 0.8ms

I'll attach the full logs from the console. If you believe this is a completely separate issue, I'll open a new bug.
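When sifting through many of these log entries, a tiny parser helps isolate the TOO_MUCH_MALLOC ones. This is a hypothetical helper (not part of Firefox or any tool mentioned here) that pulls the reason and total pause time out of a log line in the format shown above:

```javascript
// Extract the GC reason and total pause time from one line of the
// javascript.options.mem.log console output; returns null on non-GC lines.
function parseGcLine(line) {
  const m = line.match(/Reason:\s*(\w+),\s*Total Time:\s*([\d.]+)ms/);
  return m ? { reason: m[1], totalTimeMs: parseFloat(m[2]) } : null;
}
```

Filtering a captured log with this makes it easy to count how many collections were malloc-triggered versus scheduled, and how much total pause time they account for.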
Comment 8 • 10 years ago
Updated • 2 years ago
Severity: normal → S3