Open Bug 1870765 Opened 10 months ago Updated 6 months ago

cloning and inserting canvas nodes causes shutdown hang

Categories

(Core :: Cycle Collector, defect, P3)

Tracking

Tracking Status
firefox123 --- affected

People

(Reporter: tsmith, Unassigned, NeedInfo)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: hang, testcase, Whiteboard: [fuzzblocker][bugmon:bisected,confirmed])

Attachments

(1 file)

Attached file testcase.html

Found while fuzzing m-c 20231028-0be08aa0812f (--enable-debug --enable-fuzzing)

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html

This shutdown hang seems to be fairly easy for fuzzers to hit.

While reducing the test case I noticed that it can easily hang the browser while running. I think the runtime hang is actually happening much more often than the shutdown hang, and they likely have the same root cause. Marking as a fuzzblocker because of how easily a fuzzer can "accidentally" generate a test case that just hangs the browser (and consumes tons of memory).

Flags: in-testsuite?

To trigger the browser hang, just increase the iteration count in the test case.

Testcase crashes using the initial build (mozilla-central 20231028092407-0be08aa0812f) but not with tip (mozilla-central 20231219050609-ad0cefd58a2a).

Unable to bisect testcase (End build crashes!):

Start: 0be08aa0812f81d5eb9f2235165d8478ebaf825b (20231028092407)
End: ad0cefd58a2a10eb9a8b1c4c8cb04b734311d175 (20231219050609)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)

Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Keywords: bugmon
Whiteboard: [fuzzblocker] → [fuzzblocker][bugmon:bisected,confirmed]
Severity: -- → S2

Hello Jan, could you please take this? Thanks!

Flags: needinfo?(jjaschke)

Hm... here are some findings so far: I could reproduce a hang once by simply opening the attached page and then navigating back. Here's a profile, which shows Cycle Collection taking around 10 seconds, mostly doing lookups in a hashtable.
However, when I tried to reproduce the hang again, I couldn't -- the page loaded quickly and closed quickly, with no visible hang. Same in Private Mode. I'm not sure whether some caching is involved, or how that could interfere with Cycle Collection.

:smaug, before digging deeper, I'd be happy to hear your opinion. Are there known problems with Cycle Collection for the canvas element? Or is it just a huge amount of data that's being created and needs to be traversed, since .cloneNode(true) performs a deep copy of the subtree (twice per loop iteration), which therefore grows dramatically?

FWIW, the attached page makes Chrome crash.

Flags: needinfo?(jjaschke) → needinfo?(smaug)

I'm not aware of <canvas> having any special issues with memory management.

How many elements do we end up creating there? The loop does create more and more elements each time.

Flags: needinfo?(smaug)

If the math doesn't fail me, the number of <canvas> elements can be calculated as 2^23 = 8 388 608 (for 23 iterations), plus the additional c.cloneNode(true) in the loop whose result isn't being used.

So it is quite a few tightly coupled nodes. Does that justify a 10 second hang in Cycle Collection?
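For reference, a doubling loop of roughly this shape matches those numbers. This is a hypothetical reconstruction, not the attached testcase.html itself; the element id and variable names are made up:

<div id="root"><canvas></canvas></div>
<script>
  const root = document.getElementById("root");
  for (let i = 0; i < 23; i++) {
    const c = root.cloneNode(true); // deep copy of everything built so far
    root.appendChild(c);            // doubles the <canvas> count: 2^23 after 23 iterations
    c.cloneNode(true);              // second deep clone per iteration, result discarded
  }
</script>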

Flags: needinfo?(smaug)

Well, it certainly is not a realistic case. No website has that many canvas elements around, and the testcase is trying to create garbage very fast.
We could improve the scheduling of CC so that it happens more often if the purple buffer gets very large, but I'm not sure that's worth it given that the testcase is totally artificial.
And btw, there isn't one 10 s hang, but two somewhat shorter hangs :)

Flags: needinfo?(smaug)

When I tried to reproduce this with other elements like <p> or <li>, there was no visible jank in CC. I assume that <canvas> is simply a bigger object that produces a larger CC graph?

I'm tempted to close this as this scenario is pretty unrealistic.

Tyson, Olli, do you agree with that?

Flags: needinfo?(twsmith)
Flags: needinfo?(smaug)

From the fuzzing perspective this will just continue to be triggered and hog resources; it'd be nice to have a fuzzing workaround if possible.

Flags: needinfo?(twsmith)
Summary: cloning and inserting nodes causes shutdown hang → cloning and inserting canvas nodes causes shutdown hang

Hmm, canvas isn't a particularly large element, and the size of the element itself shouldn't matter for CC. What could matter is the outgoing edges:
https://searchfox.org/mozilla-central/rev/961a9e56a0b5fa96ceef22c61c5e75fb6ba53395/dom/html/HTMLCanvasElement.cpp#515-520
but those seem rather simple too.

It is a bit confusing that HTMLCanvasElement has a Destroy method when nsIContent also has one. I think that still works fine, but it's worth verifying.

But if <canvas> behaves differently from other elements, something is going wrong.

Flags: needinfo?(smaug)

It might be worth trying a much smaller testcase and checking what the CC logs look like when canvas is used vs. other elements.
CC logs are rather human-readable; I mean the ones you can create, for example, from about:memory.

Severity: S2 → --
Component: DOM: Core & HTML → Graphics: Canvas2D

S3 because it's good to fix to help out fuzzers, but unlikely to affect web content.

Severity: -- → S3
Priority: -- → P3

And I can't reproduce.

(In reply to Brad Werth [:bradwerth] from comment #13)

And I can't reproduce.

Hmm, interesting. I can repro with m-c 20240116-02d1a62ad90a on Ubuntu 22.04.

Does this work?

pip install fuzzfetch grizzly-framework --upgrade
python -m fuzzfetch -d --fuzzing -n firefox
python -m grizzly.replay.bugzilla ./firefox/firefox 1870765

If that doesn't work, ni? me and I can try to get a Pernosco session.

Flags: needinfo?(bwerth)

Thank you for the grizzly instructions. On macOS, I can get those instructions to work (slightly modified for the macOS app bundle); the app's memory usage gets to about 6 GB and it takes about 50 seconds for the process to exit. We'll discuss in triage.

Blocks: gfx-triage
Flags: needinfo?(bwerth)

When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.

I can try to repro.

Flags: needinfo?(ahale)

(In reply to Brad Werth [:bradwerth] from comment #16)

When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.

I noticed something similar. I could also repro once, but when I tried again the hang was gone. So ideally, start a profiler session before trying to repro.

This bug prevents fuzzing from making progress; however, it has low severity. It is important for fuzz blocker bugs to be addressed in a timely manner (see here why?).
:lsalzman, could you consider increasing the severity?

For more information, please visit BugBot documentation.

Flags: needinfo?(lsalzman)
Flags: needinfo?(lsalzman)

Tyson, can you post a pernosco session?

Flags: needinfo?(twsmith)

I've tried the steps in comment #14 on Ubuntu 23.04 on real hardware and it does not repro there (No results detected).

Tried the same steps on Ubuntu 22.04 in VMware and that says:

Result: [@ isLiveHash] (d1c7ad36:7679ca7c)
Results successfully reproduced

I am not sure what to do with this successful repro - what are the next steps? I have no idea what I am doing with grizzly :)

Flags: needinfo?(ahale)

A Pernosco session is available here: https://pernos.co/debug/oHdWkp3u_Aed2xb5oStBgw/index.html

ahale: Use -h to get the required args for the installed version of Grizzly.

Flags: needinfo?(twsmith)
Flags: needinfo?(ahale)

@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?

I assume we do believe that this is a benign complexity issue?
The only way around it is limiting the number of CC-able objects that can be put in the CC graph. Any other changes would just increase the acceptable range of the exponent in the combinatorics here.
If we can keep the fuzzer's testcases from producing more than e.g. 1M nodes, that would prevent this class of issues.

Flags: needinfo?(twsmith)
See Also: → 1883399

(In reply to Kelsey Gilbert [:jgilbert] from comment #23)

@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?

Kelsey, are you suggesting to patch the prototype of cloneNode as a means of exiting early when complexity hits a particular level?

Flags: needinfo?(twsmith) → needinfo?(jkratzer)

(In reply to Jason Kratzer [:jkratzer] from comment #24)

(In reply to Kelsey Gilbert [:jgilbert] from comment #23)

@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?

Kelsey, are you suggesting to patch the prototype of cloneNode as a means of exiting early when complexity hits a particular level?

Yeah, exactly.
The only real alternative I see is to do the same in Gecko, but I think that would be harder and riskier.
The fuzzer has discovered the cursed power of exponential growth, and we need some way to pull it back from approximating infinity. :)
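A rough sketch of what such a preamble could look like, assuming we wrap Node.prototype.cloneNode; the 1M budget and the choice to throw are illustrative, not an agreed-upon design:

// Hypothetical fuzzing preamble: cap runaway DOM growth by wrapping cloneNode.
(() => {
  const MAX_NODES = 1000000; // illustrative budget, e.g. the 1M node limit suggested above
  const origCloneNode = Node.prototype.cloneNode;
  Node.prototype.cloneNode = function (deep) {
    // Element count is a cheap proxy for how large the eventual CC graph can get.
    if (deep && document.getElementsByTagName("*").length > MAX_NODES) {
      throw new Error("fuzzing preamble: node budget exceeded, refusing deep clone");
    }
    return origCloneNode.call(this, deep);
  };
})();

Whether to throw, silently return a shallow clone, or abort the testcase would be a harness policy decision.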

The profile from comment 4 shows that this is really just in CC code. Either jkratzer or tsmith tried to repro the testcase with a div instead of a canvas, and it still repro'd, so this is likely a CC issue, though I believe that it's just a large-graphs-are-slow problem that, even if we solve it for 2^23, is unsolvable for 2^33. :)

Component: Graphics: Canvas2D → Cycle Collector
Flags: needinfo?(ahale)

Interesting. Usually hangs this long involve JS, but I suppose you can spam out a bunch of regular old C++ objects, too. It sounds like there are around 8 million nodes? There could potentially be exponential CC traverse behavior with Canvas nodes, but if it happens with divs instead then that's not what is going on here.

The initial test case talks about shutdown hangs, and comment 4 talks about navigating back, so I assume this is a problem with page teardown where our CC optimizations can't do anything.

This explains why a shutdown hang won't reproduce in a non-debug, non-fuzzing build, like Nightly: we don't run the CC when shutting down a process, we just shut it down. This means that bug 1850021 would solve the problem here, by making content process shutdown behave like Nightly and not a debug build. Note that this won't help the scenario where you are doing a same-origin navigation from the page, because the process isn't being killed.

Depends on: 1850021
No longer blocks: gfx-triage