cloning and inserting canvas nodes causes shutdown hang
Categories
(Core :: Cycle Collector, defect, P3)
People
(Reporter: tsmith, Unassigned)
References
(Depends on 1 open bug, Blocks 2 open bugs)
Details
(Keywords: hang, pernosco, testcase, Whiteboard: [bugmon:bisected,confirmed])
Attachments
(1 file)
372 bytes, text/html
Found while fuzzing m-c 20231028-0be08aa0812f (--enable-debug --enable-fuzzing)
To reproduce via Grizzly Replay:
$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html
This shutdown hang seems to be fairly easy for fuzzers to hit.
While reducing the test case I noticed that it can easily hang the browser while running. I think the in-page hang is actually happening much more often than the shutdown hang, and the two likely share the same root cause. Marking as a fuzzblocker because of how easily a fuzzer can "accidentally" generate a test case that just hangs the browser (and consumes tons of memory).
Reporter
Comment 1•2 years ago
To trigger the browser hang just increase the iterations number in the test case.
Comment 2•2 years ago
Testcase crashes using the initial build (mozilla-central 20231028092407-0be08aa0812f) but not with tip (mozilla-central 20231219050609-ad0cefd58a2a).
Unable to bisect testcase (End build crashes!):
Start: 0be08aa0812f81d5eb9f2235165d8478ebaf825b (20231028092407)
End: ad0cefd58a2a10eb9a8b1c4c8cb04b734311d175 (20231219050609)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Comment 4•2 years ago
Hm... here are some findings so far: I could reproduce a hang once by simply opening the attached page and then navigating back. Here's a profile, which shows Cycle Collection taking around 10 seconds, mostly doing some lookups in a hashtable.
However, when I tried to reproduce this hang again, it didn't work -- the page loaded quickly and closed quickly, with no visible hang. Same in Private Mode. I'm not sure whether there might be some caching involved, or how that could interfere with Cycle Collection.
:smaug, before digging deeper, I'd be happy to hear your opinion. Are there known problems with Cycle Collection for the Canvas Element? Or is it just a huge amount of data that's being created and needs to be traversed, since .cloneNode(true) performs a deep copy of a subtree (twice per loop iteration), which therefore is growing dramatically?
FWIW, the attached page makes Chrome crash.
Comment 5•2 years ago
I'm not aware of <canvas> having any special issues with memory management.
How many elements do we end up creating there? The loop does create more and more elements each time.
Comment 6•2 years ago
If the math doesn't fail me, the number of <canvas> elements can be calculated as 2^23 = 8,388,608 (for 23 iterations), plus the additional c.cloneNode(true) in the loop whose result isn't used.
So it is quite a few tightly coupled nodes. Does that justify a 10-second hang in Cycle Collection?
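The arithmetic can be sanity-checked with a tiny model (an illustrative sketch, not the attached testcase itself; it only counts nodes, assuming each iteration's deep clone and re-insert doubles the subtree, as the comments above describe):

```javascript
// Illustrative model only: each loop iteration deep-clones the accumulated
// subtree and re-inserts it, so the number of <canvas> nodes doubles.
function canvasCount(iterations) {
  let nodes = 1; // start with a single <canvas>
  for (let i = 0; i < iterations; i++) {
    nodes *= 2; // cloneNode(true) copies the whole subtree before insertion
  }
  return nodes;
}

console.log(canvasCount(23)); // 8388608 nodes, i.e. 2^23
```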
Comment 7•2 years ago
Well, it certainly is not a realistic case. No website has that many canvas elements around, and the testcase is trying to create garbage very fast.
We could improve the scheduling of CC so that it happens more often if the purple buffer gets very large, but I'm not sure that is worthwhile given that the testcase is totally artificial.
And by the way, there isn't one 10 s hang, but two somewhat shorter hangs :)
Comment 8•2 years ago
When I tried to reproduce this with other elements like <p> or <li>, there was no visible jank in CC. I assume that <canvas> is simply a bigger object that produces a larger CC graph?
I'm tempted to close this as this scenario is pretty unrealistic.
Tyson, Olli, do you agree with that?
Reporter
Comment 9•2 years ago
From the fuzzing perspective this will just continue to be triggered and hog resources; it'd be nice to have a fuzzing workaround if possible.
Comment 10•2 years ago
Hmm, canvas isn't a particularly large element, and the size of the element itself shouldn't matter for CC. What could matter is the outgoing edges:
https://searchfox.org/mozilla-central/rev/961a9e56a0b5fa96ceef22c61c5e75fb6ba53395/dom/html/HTMLCanvasElement.cpp#515-520
but that seems rather simple too.
It is a bit confusing that HTMLCanvasElement has a Destroy method when nsIContent also has one. I think that still works fine, but it's worth verifying.
But if <canvas> behaves differently from other elements, something is going wrong.
Comment 11•2 years ago
Might be worth trying a much smaller testcase and checking what the CC logs look like when canvas is used vs. other elements.
CC logs are fairly human-readable -- I mean the ones you can create, for example, from about:memory.
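As a rough illustration of such a comparison, here is a sketch that tallies object classes in a saved CC/garbage log (assumptions: entry lines start with an address, a bracketed refcount or GC marker, and a class name, e.g. `0x7f... [rc=1] FragmentOrElement (xhtml) canvas`; the helper name is made up):

```javascript
// Hypothetical helper: count how many CC-graph entries each class contributes.
// Log-format assumption: "0x<addr> [rc=N] ClassName ..." or "0x<addr> [gc.marked] ClassName ...".
function countCcClasses(logText) {
  const counts = new Map();
  const entry = /^0x[0-9a-fA-F]+ \[[^\]]+\] (\S+)/;
  for (const line of logText.split("\n")) {
    const m = entry.exec(line);
    if (m) counts.set(m[1], (counts.get(m[1]) || 0) + 1);
  }
  return counts;
}
```

Comparing the tallies for a canvas-based and a div-based run of the same reduced testcase would show whether <canvas> really contributes disproportionately many graph entries.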
Comment 12•2 years ago
S3 because it would be good to fix to help out fuzzers, but it is unlikely to affect web content.
Comment 13•2 years ago
And I can't reproduce.
Reporter
Comment 14•2 years ago
(In reply to Brad Werth [:bradwerth] from comment #13)
> And I can't reproduce.
Hmm, interesting. I can repro with m-c 20240116-02d1a62ad90a on Ubuntu 22.04.
Does this work?
pip install fuzzfetch grizzly-framework --upgrade
python -m fuzzfetch -d --fuzzing -n firefox
python -m grizzly.replay.bugzilla ./firefox/firefox 1870765
If that doesn't work ni? me and I can try to get a Pernosco session.
Comment 15•2 years ago
Thank you for the grizzly instructions. On macOS, I can get those instructions to work (slightly modified for the macOS app bundle) and the app memory usage gets to about 6 GB and it takes about 50 seconds to exit the process. We'll discuss in triage.
Comment 16•2 years ago
When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.
Comment 18•2 years ago
(In reply to Brad Werth [:bradwerth] from comment #16)
> When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.
I noticed something similar. I also could repro once, but when I tried again the hang was gone. So, ideally start a profiler session before trying to repro.
Comment 19•2 years ago
This bug prevents fuzzing from making progress; however, it has low severity. It is important for fuzz blocker bugs to be addressed in a timely manner (see here why?).
:lsalzman, could you consider increasing the severity?
For more information, please visit BugBot documentation.
Comment 21•2 years ago
I've tried the steps in comment #14 on Ubuntu 23.04 on real hardware and it does not repro there (No results detected).
Tried the same steps on Ubuntu 22.04 in VMware and that says:
Result: [@ isLiveHash] (d1c7ad36:7679ca7c)
Results successfully reproduced
I am not sure what to do with this successful repro - what are the next steps? I have no idea what I am doing with grizzly :)
Reporter
Comment 22•2 years ago
A Pernosco session is available here: https://pernos.co/debug/oHdWkp3u_Aed2xb5oStBgw/index.html
ahale: Use -h to get the required args for the installed version of Grizzly.
Comment 23•2 years ago
@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?
I assume that we do believe that this is a benign complexity issue?
The only way around this is limiting the number of cc-ables that can be put in the cc-graph. Any other changes would just increase the acceptable value range of the exponent of the combinatorics here.
If we can limit the fuzzer's testcase from producing more than e.g. 1M nodes, that would prevent this class of issues.
Comment 24•2 years ago
(In reply to Kelsey Gilbert [:jgilbert] from comment #23)
> @tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?
Kelsey, are you suggesting to patch the prototype of cloneNode as a means of exiting early when complexity hits a particular level?
Comment 25•2 years ago
(In reply to Jason Kratzer [:jkratzer] from comment #24)
> (In reply to Kelsey Gilbert [:jgilbert] from comment #23)
> > @tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?
> Kelsey, are you suggesting to patch the prototype of cloneNode as a means of exiting early when complexity hits a particular level?
Yeah exactly.
The only real alternative I see is to do the same in Gecko, but I think that would be harder and riskier.
The fuzzer has discovered the cursed power of exponential growth, and we need some way to pull it back from approximating infinity. :)
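One way such a preamble could be structured is sketched below. This is an assumption-laden illustration, not an existing fuzzer feature: the budget logic is shown DOM-free so it runs anywhere; in a real browser preamble, `budgetedClone` would be installed over `Node.prototype.cloneNode` and `subtreeSize` derived from the node being cloned (all names here are made up):

```javascript
// Hypothetical clone budget for a fuzzer preamble. Charges each clone by the
// size of the subtree it copies and aborts once a global cap is exceeded.
function makeBudgetedClone(cloneFn, maxNodes) {
  let produced = 0;
  return function budgetedClone(node, subtreeSize) {
    produced += subtreeSize;
    if (produced > maxNodes) {
      // Aborting here keeps a testcase from approximating infinity.
      throw new Error("fuzzing node budget exceeded");
    }
    return cloneFn(node);
  };
}
```

With maxNodes set to e.g. 1,000,000 (the limit floated in comment 23), the exponential testcase would abort after roughly 20 doublings instead of hanging the browser.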
Comment 26•1 year ago
The profile from comment 4 shows that this is really just in CC code. Either jkratzer or tsmith tried the testcase with a div instead of a canvas, and it still repro'd, so this is likely a CC issue -- though I believe it's just a large-graphs-are-slow problem that, even if we solve it for 2^23, is unsolvable for 2^33. :)
Comment 27•1 year ago
Interesting. Usually hangs this long involve JS, but I suppose you can spam out a bunch of regular old C++ objects, too. It sounds like there are around 8 million nodes? There could potentially be exponential CC traverse behavior with Canvas nodes, but if it happens with divs instead then that's not what is going on here.
The initial test case talks about shutdown hangs, and comment 4 talks about navigating back, so I assume this is a problem with page teardown where our CC optimizations can't do anything.
This explains why a shutdown hang won't reproduce in a non-debug, non-fuzzing build, like Nightly: we don't run the CC when shutting down a process, we just shut it down. This means that bug 1850021 would solve the problem here, by making content process shutdown behave like Nightly and not a debug build. Note that this won't help the scenario where you are doing a same-origin navigation from the page, because the process isn't being killed.
Reporter
Comment 28•1 year ago
This has been detected by live site testing.
Comment 29•1 year ago
I got a release warning because this bug is marked as affecting a release and pending info from one of our engineers. However, the info that was originally requested from Jason was about optimizing our fuzzers so we hit this case less and aren't as frequently blocked by it.
That answer is not going to help anyone resolve the issue though. It is now marked as affecting release because the demo page at claude.ai is hitting the same code path. We should figure out if this is worth optimizing without any fuzzer-specific fixes and that answer should likely come from folks working on Canvas or the Cycle Collector :)
Reporter
Comment 30•1 year ago
Removing fuzzblocker from the whiteboard since the fuzzers are no longer reporting this at a high rate; it is still reported by live site testing daily.