Open Bug 1870765 Opened 10 months ago Updated 6 months ago

cloning and inserting canvas nodes causes shutdown hang

Categories

(Core :: Cycle Collector, defect, P3)

Tracking

Tracking Status
firefox123 --- affected

People

(Reporter: tsmith, Unassigned, NeedInfo)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: hang, testcase, Whiteboard: [fuzzblocker][bugmon:bisected,confirmed])

Attachments

(1 file)

Attached file testcase.html

Found while fuzzing m-c 20231028-0be08aa0812f (--enable-debug --enable-fuzzing)

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html

This shutdown hang seems to be fairly easy for fuzzers to hit.

While reducing the test case I noticed that it can easily hang the browser while running. I think the runtime hang is actually happening much more often than the shutdown hang, and they likely have the same root cause. Marking as a fuzzblocker because of how easily a fuzzer can "accidentally" generate a test case that just hangs the browser (and consumes tons of memory).

Flags: in-testsuite?

To trigger the browser hang, just increase the iteration count in the test case.

Testcase crashes using the initial build (mozilla-central 20231028092407-0be08aa0812f) but not with tip (mozilla-central 20231219050609-ad0cefd58a2a).

Unable to bisect testcase (End build crashes!):

Start: 0be08aa0812f81d5eb9f2235165d8478ebaf825b (20231028092407)
End: ad0cefd58a2a10eb9a8b1c4c8cb04b734311d175 (20231219050609)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)

Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Keywords: bugmon
Whiteboard: [fuzzblocker] → [fuzzblocker][bugmon:bisected,confirmed]
Severity: -- → S2

Hello Jan, could you please take this? Thanks!

Flags: needinfo?(jjaschke)

Hm... here are some findings so far: I could reproduce a hang once by simply opening the attached page and then navigating back. Here's a profile, which shows Cycle Collection taking around 10 seconds, mostly doing lookups in a hashtable.
However, when I tried to reproduce the hang again, I couldn't -- the page loaded quickly and closed quickly, with no visible hang. Same in Private Mode. I'm not sure whether some caching is involved, or how that could interfere with Cycle Collection.

:smaug, before digging deeper, I'd be happy to hear your opinion. Are there known problems with Cycle Collection for the canvas element? Or is it just a huge amount of data that's being created and needs to be traversed, since .cloneNode(true) performs a deep copy of the subtree (twice per loop iteration), which therefore grows dramatically?

FWIW, the attached page makes Chrome crash.

Flags: needinfo?(jjaschke) → needinfo?(smaug)

I'm not aware of <canvas> having any special issues with memory management.

How many elements do we end up creating there? The loop does create more and more elements each time.

Flags: needinfo?(smaug)

If the math doesn't fail me, the number of <canvas> elements can be calculated as 2^23 = 8 388 608 (for 23 iterations), plus the additional c.cloneNode(true) in the loop whose result isn't being used.

So it is quite a few tightly coupled nodes. Does that justify a 10 second hang in Cycle Collection?
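For reference, a doubling loop of roughly this shape matches those numbers. This is a hypothetical reconstruction, not the attached testcase.html itself; the element id and variable names are made up:

<div id="root"><canvas></canvas></div>
<script>
  const root = document.getElementById("root");
  for (let i = 0; i < 23; i++) {
    const c = root.cloneNode(true); // deep copy of everything built so far
    root.appendChild(c);            // doubles the <canvas> count: 2^23 after 23 iterations
    c.cloneNode(true);              // second deep clone per iteration, result discarded
  }
</script>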

Flags: needinfo?(smaug)

Well, it certainly is not a realistic case. No website has that many canvas elements around, and the testcase is trying to create garbage very fast.
We could improve the scheduling of CC so that it happens more often if the purple buffer gets very large, but I'm not sure that's worth it given that the testcase is totally artificial.
And btw, there isn't one 10 s hang, but two somewhat shorter hangs :)

Flags: needinfo?(smaug)

When I tried to reproduce this with other elements like <p> or <li>, there was no visible jank in CC. I assume that <canvas> is simply a bigger object that produces a larger CC graph?

I'm tempted to close this as this scenario is pretty unrealistic.

Tyson, Olli, do you agree with that?

Flags: needinfo?(twsmith)
Flags: needinfo?(smaug)

From the fuzzing perspective this will just continue to be triggered and hog resources; it'd be nice to have a fuzzing workaround if possible.

Flags: needinfo?(twsmith)
Summary: cloning and inserting nodes causes shutdown hang → cloning and inserting canvas nodes causes shutdown hang

Hmm, canvas isn't a particularly large element, and the size of the element itself shouldn't matter for CC. What could matter is the outgoing edges:
https://searchfox.org/mozilla-central/rev/961a9e56a0b5fa96ceef22c61c5e75fb6ba53395/dom/html/HTMLCanvasElement.cpp#515-520
but those seem rather simple too.

It is a bit confusing that HTMLCanvasElement has a Destroy method when nsIContent also has one. I think that still works fine, but it's worth verifying.

But if <canvas> behaves differently from other elements, something is going wrong.

Flags: needinfo?(smaug)

It might be worth trying a much smaller testcase and checking what the CC logs look like when canvas is used vs. other elements.
CC logs are rather human-readable; I mean the ones you can create, for example, from about:memory.

Severity: S2 → --
Component: DOM: Core & HTML → Graphics: Canvas2D

S3 because it's good to fix to help out fuzzers, but unlikely to affect web content.

Severity: -- → S3
Priority: -- → P3

And I can't reproduce.

(In reply to Brad Werth [:bradwerth] from comment #13)

And I can't reproduce.

Hmm, interesting. I can repro with m-c 20240116-02d1a62ad90a on Ubuntu 22.04.

Does this work?

pip install fuzzfetch grizzly-framework --upgrade
python -m fuzzfetch -d --fuzzing -n firefox
python -m grizzly.replay.bugzilla ./firefox/firefox 1870765

If that doesn't work, ni? me and I can try to get a Pernosco session.

Flags: needinfo?(bwerth)

Thank you for the grizzly instructions. On macOS, I can get those instructions to work (slightly modified for the macOS app bundle); the app's memory usage gets to about 6 GB and it takes about 50 seconds for the process to exit. We'll discuss in triage.

Blocks: gfx-triage
Flags: needinfo?(bwerth)

When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.

I can try to repro.

Flags: needinfo?(ahale)

(In reply to Brad Werth [:bradwerth] from comment #16)

When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.

I noticed something similar. I could also repro once, but when I tried again the hang was gone. So ideally, start a profiler session before trying to repro.

This bug prevents fuzzing from making progress; however, it has low severity. It is important for fuzz blocker bugs to be addressed in a timely manner (see here why?).
:lsalzman, could you consider increasing the severity?

For more information, please visit BugBot documentation.

Flags: needinfo?(lsalzman)
Flags: needinfo?(lsalzman)

Tyson, can you post a pernosco session?

Flags: needinfo?(twsmith)

I've tried the steps in comment #14 on Ubuntu 23.04 on real hardware and it does not repro there (No results detected).

Tried the same steps on Ubuntu 22.04 in VMware and that says:

Result: [@ isLiveHash] (d1c7ad36:7679ca7c)
Results successfully reproduced

I am not sure what to do with this successful repro - what are the next steps? I have no idea what I am doing with grizzly :)

Flags: needinfo?(ahale)

A Pernosco session is available here: https://pernos.co/debug/oHdWkp3u_Aed2xb5oStBgw/index.html

ahale: Use -h to get the required args for the installed version of Grizzly.

Flags: needinfo?(twsmith)
Flags: needinfo?(ahale)

@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?

I assume we do believe that this is a benign complexity issue?
The only way around it is limiting the number of CC-able objects that can be put in the CC graph. Any other changes would just increase the acceptable range of the exponent in the combinatorics here.
If we can keep the fuzzer's testcases from producing more than e.g. 1M nodes, that would prevent this class of issues.

Flags: needinfo?(twsmith)
See Also: → 1883399

(In reply to Kelsey Gilbert [:jgilbert] from comment #23)

@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?

Kelsey, are you suggesting to patch the prototype of cloneNode as a means of exiting early when complexity hits a particular level?

Flags: needinfo?(twsmith) → needinfo?(jkratzer)

(In reply to Jason Kratzer [:jkratzer] from comment #24)

(In reply to Kelsey Gilbert [:jgilbert] from comment #23)

@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?

Kelsey, are you suggesting to patch the prototype of cloneNode as a means of exiting early when complexity hits a particular level?

Yeah, exactly.
The only real alternative I see is to do the same in Gecko, but I think that would be harder and riskier.
The fuzzer has discovered the cursed power of exponential growth, and we need some way to pull it back from approximating infinity. :)
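A rough sketch of what such a preamble could look like, assuming we wrap Node.prototype.cloneNode; the 1M budget and the choice to throw are illustrative, not an agreed-upon design:

// Hypothetical fuzzing preamble: cap runaway DOM growth by wrapping cloneNode.
(() => {
  const MAX_NODES = 1000000; // illustrative budget, e.g. the 1M node limit suggested above
  const origCloneNode = Node.prototype.cloneNode;
  Node.prototype.cloneNode = function (deep) {
    // Element count is a cheap proxy for how large the eventual CC graph can get.
    if (deep && document.getElementsByTagName("*").length > MAX_NODES) {
      throw new Error("fuzzing preamble: node budget exceeded, refusing deep clone");
    }
    return origCloneNode.call(this, deep);
  };
})();

Whether to throw, silently return a shallow clone, or abort the testcase would be a harness policy decision.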

The profile from comment 4 shows that this is really just in CC code. Either jkratzer or tsmith tried to repro the testcase with a div instead of a canvas, and it still repro'd, so this is likely a CC issue, though I believe that it's just a large-graphs-are-slow problem that, even if we solve it for 2^23, is unsolvable for 2^33. :)

Component: Graphics: Canvas2D → Cycle Collector
Flags: needinfo?(ahale)

Interesting. Usually hangs this long involve JS, but I suppose you can spam out a bunch of regular old C++ objects, too. It sounds like there are around 8 million nodes? There could potentially be exponential CC traverse behavior with Canvas nodes, but if it happens with divs instead then that's not what is going on here.

The initial test case talks about shutdown hangs, and comment 4 talks about navigating back, so I assume this is a problem with page teardown where our CC optimizations can't do anything.

This explains why a shutdown hang won't reproduce in a non-debug, non-fuzzing build, like Nightly: we don't run the CC when shutting down a process, we just shut it down. This means that bug 1850021 would solve the problem here, by making content process shutdown behave like Nightly and not a debug build. Note that this won't help the scenario where you are doing a same-origin navigation from the page, because the process isn't being killed.

Depends on: 1850021
No longer blocks: gfx-triage