Open Bug 850993 Opened 11 years ago Updated 2 years ago

Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs introduced after Ion merge

Categories

(Core :: Graphics: Canvas2D, defect)

19 Branch
x86_64
Windows 7
defect

Tracking

()

People

(Reporter: kael, Unassigned)

References

()

Details

Attachments

(1 file)

SPS profile:
http://people.mozilla.com/~bgirard/cleopatra/#report=cc6842a7ecf8fef4605f31575d1c954c1a9c38d6

Something has changed such that when running Escape Goat, a GC is triggered by onTooMuchMalloc every handful of frames. Interestingly enough, the SPS profile claims that all these GCs are being triggered by GraphicsDevice$UpdateViewport, despite the fact that I cannot identify any garbage allocations in that function, let alone enough allocations to cause it to be the bottleneck in an SPS profile (51.7% worth).

However, I have a guess: This function happens to call setTransform, translate and scale on a Canvas 2D context. I suspect something changed in the Canvas backend that results in lots of malloc() calls, thus triggering GCs.

I tried fiddling with the $UpdateViewport function to see if I could make it faster based on early profile data, and this actually *moved* the GCs out of the function and into another one, perhaps because I had changed inlining characteristics or something. Either way, I'm pretty sure this function itself isn't creating garbage.

Aurora 21.0a2 (2013-03-12) is busted.
Release (!) 19.0.2 is busted.
Nightly not tested because 848803 isn't fixed yet.

I don't actually remember seeing this when I was running the Aurora build of 19.x, so it's possible something else changed... maybe my video drivers? I'll try narrowing down a regression range with mozregression, but it'll be a pain since I can only observe this through SPS profiles.
The GC pauses don't go away if I disable hardware acceleration, so I don't think it could have been my video drivers. Though, amusingly, if I profile it with hw accel off, something like 20-25% of time is spent doing solid color fills to clear the canvas. Anyway, here's my about:support just in case.



  Application Basics

        Name
        Firefox

        Version
        21.0a2

        User Agent
        Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130313 Firefox/21.0

        Build Configuration

          about:buildconfig

  Extensions

        Name

        Version

        Enabled

        ID

        Adblock Plus
        2.2.3
        true
        {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}

        Cookies Manager+
        1.5.1.1
        true
        {bb6bc1bb-f824-4702-90cd-35e2fb24f25d}

        Download Statusbar
        0.9.10
        true
        {D4DD63FA-01E4-46a7-B6B1-EDAB7D6AD389}

        NoScript
        2.6.5.8
        true
        {73a6fe31-595d-460b-a920-fcc0f8843232}

        Rikaichan
        2.07
        true
        {0AA9101C-D3C1-4129-A9B7-D778C6A17F82}

        Rikaichan Japanese-English Dictionary File
        2.01.120729
        true
        rikaichan-jpen@polarcloud.com

        Suspend background tabs
        1.0.1
        true
        suspendbackgroundtabs@adblockplus.org

        about:telemetry
        0.14
        false
        ping.telemetry@mozilla.com

        Firebug
        1.11.2
        false
        firebug@software.joehewitt.com

        Firefox OS Simulator
        2.0
        false
        r2d2b2g@mozilla.org

        Force CORS
        1.4
        false
        forcecors@ocact.us

        Gecko Profiler
        1.11.13
        false
        jid0-edalmuivkozlouyij0lpdx548bc@jetpack

        JIT Inspector
        0.4.0.0
        false
        bhackett@mozilla.com

        Rapportive
        1.4.0
        false
        rapportive@rapportive.com

        Test Pilot
        1.2.2
        false
        testpilot@labs.mozilla.com

  Important Modified Preferences

      Name

      Value

        accessibility.typeaheadfind.flashBar
        0

        browser.cache.disk.capacity
        358400

        browser.cache.disk.smart_size.first_run
        false

        browser.cache.disk.smart_size.use_old_max
        false

        browser.cache.disk.smart_size_cached_value
        358400

        browser.places.smartBookmarksVersion
        4

        browser.startup.homepage_override.buildID
        20130313042013

        browser.startup.homepage_override.mstone
        21.0a2

        browser.tabs.warnOnClose
        false

        dom.max_chrome_script_run_time
        40

        dom.max_script_run_time
        0

        dom.mozApps.used
        true

        dom.w3c_touch_events.expose
        false

        extensions.lastAppVersion
        21.0a2

        font.internaluseonly.changed
        false

        gfx.direct3d.prefer_10_1
        true

        network.cookie.prefsMigrated
        true

        places.database.lastMaintenance
        1363231447

        places.history.expiration.transient_current_max_pages
        104858

        plugin.disable_full_page_plugin_for_types
        application/pdf

        privacy.cpd.cookies
        false

        privacy.cpd.downloads
        false

        privacy.cpd.formdata
        false

        privacy.cpd.history
        false

        privacy.cpd.sessions
        false

        privacy.sanitize.migrateFx3Prefs
        true

        privacy.sanitize.timeSpan
        0

        security.warn_viewing_mixed
        false

        webgl.verbose
        true

  Graphics

        Adapter Description
        NVIDIA GeForce GTX 670

        Adapter Drivers
        nvd3dumx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um

        Adapter RAM
        2047

        Device ID
        0x1189

        Direct2D Enabled
        true

        DirectWrite Enabled
        true (6.2.9200.16492)

        Driver Date
        2-9-2013

        Driver Version
        9.18.13.1407

        GPU #2 Active
        false

        GPU Accelerated Windows
        1/1 Direct3D 10

        Vendor ID
        0x10de

        WebGL Renderer
        Google Inc. -- ANGLE (NVIDIA GeForce GTX 670)

        AzureCanvasBackend
        direct2d

        AzureContentBackend
        direct2d

        AzureFallbackCanvasBackend
        cairo

  JavaScript

        Incremental GC
        true

  Accessibility

        Activated
        false

        Prevent Accessibility
        0

  Library Versions

        Expected minimum version

        Version in use

        NSPR
        4.9.6 Beta
        4.9.6 Beta

        NSS
        3.14.3.0 Basic ECC Beta
        3.14.3.0 Basic ECC Beta

        NSSSMIME
        3.14.3.0 Basic ECC Beta
        3.14.3.0 Basic ECC Beta

        NSSSSL
        3.14.3.0 Basic ECC Beta
        3.14.3.0 Basic ECC Beta

        NSSUTIL
        3.14.3.0 Beta
        3.14.3.0 Beta
It turns out there's actually a way to observe this bug without SPS. Because many of the GCs occur during frames, they get picked up by the frame timing logic built into the replay tool. You can view it by clicking Show Log after the replay ends, it'll be right above '// begin JSON-formatted data //'.

If the GC isn't triggering super frequently, the data will look like this (Chrome Canary):
Framerate: Mean 00206.9fps, Median 00250.0fps, 2nd percentile 01000.0fps, 98th percentile 00125.0fps, Standard Deviation 00.87ms

Low standard deviation, mean and median pretty close to each other. (The standard deviation is in ms, the others are in fps; if you convert the fps values back to ms you'll see the absolute difference is quite small).

In a build of FF with the GC triggering frequently, it looks like this:
Framerate: Mean 00097.1fps, Median 00230.3fps, 2nd percentile 00601.4fps, 98th percentile 00008.4fps, Standard Deviation 22.83ms

The incredibly large standard deviation indicates that it's recording lots of long frames (because the GC interrupted them), but when the GC isn't interrupting, performance is *awesome*.

Got mozregression running right now based on this.
A bunch of the mozregression builds either crashed or had really severe secondary performance problems (that could have been hiding the GC issue), but this is what mozregression claims to be the regression range:

Last good nightly: 2012-09-11
First bad nightly: 2012-09-14

Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=96287ad60bef&tochange=e5af3d785252
I dug through the code for the canvas methods that function calls, and it doesn't seem like they can malloc. So, I tried removing those calls entirely and it simply moved the GCs to the next function in the call path that called into Canvas. I think the reason I'm seeing GCs in this function is simply that the Canvas API calls are a JS -> native transition, so maybe that's the earliest a GC can happen when it's pending due to malloc?

It looks like the ionmonkey merge is in the regression window, so I think this probably should have been a JS Engine bug after all. Gonna update the bug description a bit.
Assignee: nobody → general
Component: Canvas: 2D → JavaScript Engine
Summary: Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs possibly triggered by Canvas → Extremely frequent, inexplicable onTooMuchMalloc/TOO_MUCH_MALLOC GCs introduced after Ion merge
Version: 21 Branch → 19 Branch
After figuring out that the Firefox symbol server only has symbols for *release* builds, not debug builds, I did some digging with breakpoints to figure out who was triggering onMuchMalloc, since this bug still didn't make sense to me (and the regression window was a mess). I figured out the cause of these GCs, though I don't know how it regressed during that window:

Some non-rendering code in the game was requesting the dimensions of the screen, and the accessor it used had a bug that meant that it would do the equivalent of this:

canvas.width = canvas.width

Normally, this is an adequate way to clear a canvas, and doesn't have a severe performance impact - you'll even see people recommend it as a way to clear a canvas:
http://stackoverflow.com/questions/2142535/how-to-clear-the-canvas-for-redrawing
In this case, it was totally accidental, but it was enough to trigger these GCs through this convoluted route:

setting canvas.width (to anything, even the current size), triggers CanvasRenderingContext2D::Reset, which frees CanvasRenderingContext2D::mTarget.
The next time you call any CanvasRenderingContext2D method, CanvasRenderingTarget2D::EnsureTarget sees that mTarget was freed and allocates a new rendering target, calling JS_updateMallocCounter(width * height * 4) in the process. This is likely to bring the malloc counter down to zero, triggering onTooMuchMalloc, and setting off a GC.

Confusingly, despite the fact that this seems to suggest a clear route to the GC (directly through canvas functions, no less) they never appeared on the stack in SPS profiles. I don't know if this is a SPS bug, an artifact of inlining/optimization, or what.

I was able to verify my findings by tracking down the function that was doing 'canvas.width = canvas.width' and changing it to instead do 'if (canvas.width !== width) canvas.width = width' so that in the common case (where it doesn't need to change anything), the canvas is not resized. This eliminated all the spurious GCs and performance is awesome with the change in place.

SO, based on this, I see two core problems on the Firefox side:

I don't think canvas allocations of any kind should update the JS malloc counter. Having every byte of a canvas surface count against the JS heap and trigger constant JS GCs is super useless because unless there are leaked JS objects holding canvases alive, the GC will not free any memory. Furthermore, in this particular use case - a Reset due to a size change - the GC CANNOT free any canvas memory, so there's no point in triggering it.

Setting a canvas's width or height to the current value of that property should not free and reallocate the canvas. It seems extremely wasteful and probably comes with an unnecessary performance penalty; I don't know that the spec describes as the desired behavior, but I expect you could achieve it by simply clearing the render target and resetting any states that are supposed to be reset (like the transform stack).

I'm going to refrain from recategorizing this to Canvas since I'm still not really sure if a Canvas changed introduced this regression or if the bad Canvas behavior was already there and JS changes made it cause these GCs.

I've updated the URL associated with the bug to point to an old (still broken) build of the game so I can push a version that is fixed.
If the problem is caused by a JS_updateMallocCounter call, then it's a canvas problem.
Assignee: general → nobody
Component: JavaScript Engine → Canvas: 2D
I've been able to reproduce very similar behavior using the following steps, using the stable version 31 and nightly 34.0a1 (2014-08-20)

Start Firefox
Install firebug (https://getfirebug.com/), restart as needed.
Go to a JIRA https://jira.atlassian.com/browse/WBS-181
Refresh the page (F5), it should reload fairly fast < 1s.
Open up the firebug window (F12) and close it
Refresh the JIRA page again, now it will take 10-15 seconds to render.

I've reproduce this with an internal JIRA server as well, and I suspect large javascript files have something to do with it.  The only way to get out of the long render times seems to be to restart Firefox.  This also seems to happen without firebug, after some indeterminate activity, but it is easily triggered using firebug.

I've enabled GC collection with javascript.options.mem.log=true and see many of these entries:

GC(T+32.1) Reason: TOO_MUCH_MALLOC, Total Time: 192.3ms, Zones Collected: 21, Total Zones: 21, Total Compartments: 324, Minor GCs: 4, MMU (20ms): 0%, MMU (50ms): 0%, SCC Sweep Total: 19.5ms, SCC Sweep Max Pause: 19.5ms, Nonincremental Reason: requested, Allocated: 50MB, +Chunks: 0, -Chunks: 0
    Totals: Wait Background Thread: 28.9ms, Purge: 0.6ms, Mark: 72.4ms, Mark Roots: 8.5ms, Sweep: 86.7ms, Mark During Sweeping: 49.2ms, Mark Weak: 6.6ms, Mark Gray: 37.0ms, Mark Gray and Weak: 5.6ms, Finalize Start Callback: 0.1ms, Sweep Atoms: 5.1ms, Sweep Compartments: 20.7ms, Sweep Discard Code: 3.0ms, Sweep Tables: 10.9ms, Sweep Cross Compartment Wrappers: 3.3ms, Sweep Base Shapes: 2.6ms, Sweep Initial Shapes: 1.8ms, Sweep Type Objects: 0.7ms, Sweep Breakpoints: 0.8ms, Discard Analysis: 6.7ms, Discard TI: 2.9ms, Sweep Types: 3.8ms, Sweep Object: 1.8ms, Sweep String: 1.8ms, Sweep Script: 1.1ms, Sweep Shape: 2.6ms, Sweep JIT code: 0.5ms, Finalize End Callback: 1.6ms, Deallocate: 0.9ms, End Callback: 0.8ms

I'll attach the full logs from the console.  If you believe this is a completely separate issue, I'll open a new bug.
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: