Closed Bug 1815140 Opened 1 year ago Closed 1 year ago

grid devtools overlay causes extreme content-process jank (for`stringifyGridFragments` and resulting garbage collection), for testcase with many columns

Categories

(DevTools :: Inspector: Layout, defect, P3)

Desktop
Windows
defect

Tracking

(firefox111 wontfix, firefox112 verified, firefox113 verified)

VERIFIED FIXED
112 Branch
Tracking Status
firefox111 --- wontfix
firefox112 --- verified
firefox113 --- verified

People

(Reporter: hyperfekt, Assigned: bradwerth)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

Attached file grid_device-reset.html

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0

Steps to reproduce:

  1. Open the attached file.
  2. Open DevTools.
  3. Toggle the Grid Overlay.

Actual results:

Firefox becomes unresponsive, then all its windows turn black.

Expected results:

A grid overlay is displayed or the display of the grid overlay fails gracefully.

The Bugbug bot thinks this bug should belong to the 'Core::Layout: Grid' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Layout: Grid
Product: Firefox → Core

I can definitely reproduce issues with your testcase -- thanks for the bug report! For me, Firefox as-a-whole doesn't become unresponsive or blacked out, and I didn't get to a point with CanvasTranslator detected a device reset (maybe I just needed to wait longer). But the content process for the tab in question does start consuming quite a bit of CPU and does become periodically unresponsive for ~seconds, and that in and of itself is quite bad. I'm narrowing the bug title to focus on that; I assume the other symptoms that you described were downstream from that.

ANALYSIS:
tl;dr: It looks like most of that is associated with generating strings for all of the line names, and then garbage-collecting those strings; and doing that same work (or incremental portions of that work?) in a repeated requestAnimationFrame callback loop.

Here's a profile of me activating the grid overlay at around t=3s, and then leaving Firefox untouched from that point on for ~30 seconds (during which time the content process is pegged with periods of unresponsiveness):
https://share.firefox.dev/3losS8B

Here's a snippet of that profile, zoomed to a particularly-bad 3.1-second period of unresponsiveness, for a requestAnimationFrame callback:
https://share.firefox.dev/3x8Kt79

In that snippet, 95% of our time is spent doing JS garbage collection! If I ignore the garbage collection samples, it looks like nearly all our time is spent in stringifyGridFragments, and that sounds like something that could believably be generating tons of garbage (for this 10000-column grid testcase), depending on what we do with the generated strings...

I'm CC'ing three folks who worked on this grid-inspector code in the past (Gabe, Micah, bradwerth). I'm not sure if any of them have been working on devtools-related stuff recently, but they might be able to offer insights to other devtools folks who have cycles to take a look here.

Status: UNCONFIRMED → NEW
Component: Layout: Grid → Inspector: Layout
Ever confirmed: true
Product: Core → DevTools
Version: Firefox 109 → Trunk
Summary: blackout with 'CanvasTranslator detected a device reset' due to large number of grid tracks → grid devtools overlay causes extreme content-process jank (for`stringifyGridFragments` and resulting garbage collection), for testcase with many columns

It does seem pretty trivial to add an early exit when hasMoved is true but I don't yet know if that would activate in in this case. I'll put a patch together that does this, and try it out. In the more general case, the stringify-comparison is pretty hardcore. It would be good to find another way to do it, or to segment the stringification to add more opportunities for early exits. I'll attempt to do that also.

Assignee: nobody → bwerth

Segmenting the stringify-comparison is insufficent, since the common case is that there is no change, which would require stringifying all segments anyway. So the real solution might be to make the calculation of hasMoved into an async calculation that updates state when complete.

Truly the problem may just be that hasMoved is called continuously. I'll try to fix that.

Thanks, Brad!

I don't want to try to re-think the update loop of AutoRefreshHighlighter, so instead I propose we change hasGridFragments to take an optional bool parameter aReflow = true. This will only do the recompute if true, and could then be used in CssGridHighlighter._hasMoved to determine if the already-confirmed grid element has dumped its grid info data structure -- which is an indicator that it has been reflowed. I guess then the challenge is that any call to isGrid would force it to recompute that structure, so we'd have fragility in the call order. I'll find the right way to do this and make the change in the definition and implementation of Element.

The timestamp will be updated whenever the grid is reflowed, which is
equivalent to the old check of comparing the contents of the grid
structure. This is much much faster.

Depends on D169724

I think there is more to do here. Specifically, devtools should have a better strategy for drawing too-complex a highlighter. The proposed patches will prevent the memory bloat of the original report, but it won't do anything about the long delay in seeing the updated highlighter. I'll file a follow-up Bug.

Blocks: 1816542
Attachment #9317479 - Attachment description: Bug 1815140 Part 1: Make Grid info structure timestamp its creation. → Bug 1815140 Part 1: Make Grid dom object an idempotent property of its frame.
Attachment #9317480 - Attachment description: Bug 1815140 Part 2: Make grid highlighter hasMoved check the timestamp of the Grid structure. → Bug 1815140 Part 2: Make grid highlighter hasMoved directly compare the grid fragment objects.
Severity: -- → S3
Priority: -- → P3
Pushed by bwerth@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7f50487ea10c
Part 1: Make Grid dom object an idempotent property of its frame. r=emilio
https://hg.mozilla.org/integration/autoland/rev/dd7288d15da0
Part 2: Make grid highlighter hasMoved directly compare the grid fragment objects. r=devtools-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/0380de95a301
Part 3: Add a test of grid fragment equality across reflows. r=jdescottes

I think I understand what is going on. The failure is that the Grid object no longer has valid member variables, specifically it no longer has its GridDimension objects. This happens as DevTools reloads, the cycle collector destructs the GridDimension objects but not the Grid object itself. It must be that adding the Grid object as a property of the frame is enough to keep it from being destructed but not enough to keep it from being marked as destroyable by the cycle collector, thus freeing its members for destruction.

Possible solutions:

  1. There may be some NS_IMPL_CYCLE_COLLECTING-style macro that could inform the correct behavior here. I don't understand the cycle collector well enough to identify this.
  2. Manually addref'ing the Grid object when it is set as a property could work, as long as we ensured that a frame released the object when the frame was destructed or any other time the property was cleared. I don't know how to manipulate the refcount directly, safely, but it's probably possible.

Emilio, do you have any advice on how to fix this?

Flags: needinfo?(bwerth) → needinfo?(emilio)

What's happening is that the new getter might return an unlinked object, which is not good. The easiest fix is to not only clear the property on the grid dtor but also on the Grid cycle collection unlinking.

Flags: needinfo?(emilio)
Pushed by bwerth@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2e8e5b4fcf3b
Part 1: Make Grid dom object an idempotent property of its frame. r=emilio
https://hg.mozilla.org/integration/autoland/rev/813ab8067abb
Part 2: Make grid highlighter hasMoved directly compare the grid fragment objects. r=devtools-reviewers,jdescottes
https://hg.mozilla.org/integration/autoland/rev/c64b326f3b9a
Part 3: Add a test of grid fragment equality across reflows. r=jdescottes
Flags: qe-verify+

Steps to reproduce:

  1. Load the attached file (https://bug1815140.bmoattachments.org/attachment.cgi?id=9316029)
  2. Open Dev Tools / Inspector
  3. Click the "grid" element inside the body.
  4. Open Task Manager / Processes and observe the Firefox CPU and Memory usage.
    Actual: The CPU rises fast and the Memorry rises slowly to above 2000 MB; the browser usage might show signs of slowness or even freezes.
    Expected: The resources are not spiking aggressively.

With these steps, I have reproduced the issue in Firefox Release v111.0.1 and verified the fix in Beta v112.0 (RC) and Nightly v113.0a1.
This appears to be windows specific as it did not reproduce on Ubuntu or MacOS.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
OS: Unspecified → Windows
Hardware: Unspecified → Desktop
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: