cloning and inserting canvas nodes causes shutdown hang
Categories
(Core :: Cycle Collector, defect, P3)
Tracking
()
 | Tracking | Status |
---|---|---|
firefox123 | --- | affected |
People
(Reporter: tsmith, Unassigned, NeedInfo)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: hang, testcase, Whiteboard: [fuzzblocker][bugmon:bisected,confirmed])
Attachments
(1 file)
372 bytes, text/html
Found while fuzzing m-c 20231028-0be08aa0812f (--enable-debug --enable-fuzzing)
To reproduce via Grizzly Replay:
$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay ./firefox/firefox testcase.html
This shutdown hang seems to be fairly easy for fuzzers to hit.
While reducing the test case I noticed that it can easily hang the browser while running. I think the hang scenario actually happens much more often than the shutdown hang, and the two likely have the same root cause. Marking as a fuzzblocker because of how easily a fuzzer can "accidentally" generate a test case that just hangs the browser (and consumes tons of memory).
Reporter
Comment 1•10 months ago
To trigger the browser hang, just increase the iteration count in the test case.
Comment 2•10 months ago
Testcase crashes using the initial build (mozilla-central 20231028092407-0be08aa0812f) but not with tip (mozilla-central 20231219050609-ad0cefd58a2a).
Unable to bisect testcase (End build crashes!):
Start: 0be08aa0812f81d5eb9f2235165d8478ebaf825b (20231028092407)
End: ad0cefd58a2a10eb9a8b1c4c8cb04b734311d175 (20231219050609)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Updated•10 months ago
Comment 4•9 months ago
Hm... here are some findings so far: I could reproduce a hang once by simply opening the attached page and then navigating back. Here's a profile, which shows Cycle Collection taking around 10 seconds, mostly doing some lookups in a hashtable.
However, when I tried to reproduce the hang again, it didn't work -- the page loaded quickly and closed quickly, with no visible hang. Same in Private Mode. I'm not sure whether some caching is involved and how that could interfere with Cycle Collection.
:smaug, before digging deeper, I'd be happy to hear your opinion. Are there known problems with Cycle Collection for the canvas element? Or is it just a huge amount of data being created that needs to be traversed, since .cloneNode(true) performs a deep copy of a subtree (twice per loop iteration), which therefore grows dramatically?
FWIW, the attached page makes Chrome crash.
Comment 5•9 months ago
I'm not aware of <canvas> having any special issues with memory management.
How many elements do we end up creating there? The loop does create more and more elements each time.
Comment 6•9 months ago
If the math doesn't fail me, the number of <canvas> elements can be calculated as 2^23 = 8,388,608 (for 23 iterations), plus the additional c.cloneNode(true) in the loop whose result isn't being used.
So that is quite a few tightly coupled nodes. Does that justify a 10-second hang in Cycle Collection?
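For reference, here is a minimal sketch of what the attached testcase presumably does (a hedged reconstruction from comments 1, 4, and 6 -- the actual 372-byte attachment isn't quoted here, and the element id and loop shape are illustrative). Each iteration deep-clones the accumulated subtree and re-inserts it, so the <canvas> count doubles every iteration and 23 iterations yield about 2^23 nodes:

<canvas id="c"></canvas>
<script>
  const iterations = 23;               // raise this to hang the browser outright (comment 1)
  const c = document.getElementById("c");
  for (let i = 0; i < iterations; i++) {
    c.appendChild(c.cloneNode(true));  // deep copy doubles the <canvas> count
    c.cloneNode(true);                 // extra deep clone whose result is unused (comment 6)
  }
</script>

With 23 iterations this matches the 8,388,608 figure above; each extra iteration doubles both the node count and the size of the CC graph that has to be traversed.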
Comment 7•9 months ago
Well, it certainly is not a realistic case. No website has that many canvas elements around, and the testcase is trying to create garbage very fast.
We could improve the scheduling of CC so that it happens more often if the purple buffer gets very large, but I'm not sure that is worth it given that the testcase is totally artificial.
And by the way, there isn't a single 10s hang, but two somewhat shorter ones :)
Comment 8•9 months ago
When I tried to reproduce this with other elements like <p> or <li>, there was no visible jank in CC. I assume that <canvas> is simply a bigger object that produces a larger CC graph?
I'm tempted to close this as this scenario is pretty unrealistic.
Tyson, Olli, do you agree with that?
Reporter
Comment 9•9 months ago
From the fuzzing perspective, this will just continue to be triggered and hog resources; it'd be nice to have a fuzzing workaround if possible.
Updated•9 months ago
Comment 10•9 months ago
Hmm, canvas isn't a particularly large element. And the size of the element itself shouldn't matter for CC. What could matter is the outgoing edges:
https://searchfox.org/mozilla-central/rev/961a9e56a0b5fa96ceef22c61c5e75fb6ba53395/dom/html/HTMLCanvasElement.cpp#515-520
but that seems rather simple too.
It is a bit confusing that HTMLCanvasElement has a Destroy method when nsIContent also has one. I think that still works fine, but it's worth verifying.
But if <canvas> behaves differently from other elements, something is going wrong.
Comment 11•9 months ago
It might be worth trying a much smaller testcase and checking what the CC logs look like when canvas is used vs. other elements.
CC logs are fairly human readable; I mean the ones one can create, for example, from about:memory.
Updated•9 months ago
Comment 12•9 months ago
S3 because this would be good to fix to help out fuzzers, but it's unlikely to affect web content.
Comment 13•9 months ago
And I can't reproduce.
Reporter
Comment 14•9 months ago
(In reply to Brad Werth [:bradwerth] from comment #13)
And I can't reproduce.
Hmm, interesting. I can repro with m-c 20240116-02d1a62ad90a on Ubuntu 22.04.
Does this work?
pip install fuzzfetch grizzly-framework --upgrade
python -m fuzzfetch -d --fuzzing -n firefox
python -m grizzly.replay.bugzilla ./firefox/firefox 1870765
If that doesn't work, ni? me and I can try to get a Pernosco session.
Comment 15•9 months ago
Thank you for the grizzly instructions. On macOS, I can get those instructions to work (slightly modified for the macOS app bundle); the app's memory usage gets to about 6 GB and it takes about 50 seconds for the process to exit. We'll discuss in triage.
Comment 16•9 months ago
When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.
Comment 18•9 months ago
(In reply to Brad Werth [:bradwerth] from comment #16)
When I open the testcase in today's Nightly, there's no lasting problem. The tab opens, memory climbs to 5 or so GB, then memory drops and the browser remains usable.
I noticed something similar. I could also repro once, but when I tried again the hang was gone. So, ideally start a profiler session before trying to repro.
Comment 19•9 months ago
This bug prevents fuzzing from making progress; however, it has low severity. It is important for fuzz blocker bugs to be addressed in a timely manner (see here why?).
:lsalzman, could you consider increasing the severity?
For more information, please visit BugBot documentation.
Updated•9 months ago
Comment 21•8 months ago
I've tried the steps in comment #14 on Ubuntu 23.04 on real hardware and it does not repro there (No results detected).
Tried the same steps on Ubuntu 22.04 in VMware and that says:
Result: [@ isLiveHash] (d1c7ad36:7679ca7c)
Results successfully reproduced
I am not sure what to do with this successful repro - what are the next steps? I have no idea what I am doing with grizzly :)
Reporter | ||
Comment 22•8 months ago
A Pernosco session is available here: https://pernos.co/debug/oHdWkp3u_Aed2xb5oStBgw/index.html
ahale: Use -h to get the required args for the installed version of Grizzly.
Updated•8 months ago
Comment 23•7 months ago
@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?
I assume that we do believe that this is a benign complexity issue?
The only way around this is limiting the number of cc-ables that can be put in the cc-graph. Any other changes would just increase the acceptable value range of the exponent of the combinatorics here.
If we can keep the fuzzer's testcases from producing more than, e.g., 1M nodes, that would prevent this class of issues.
Comment 24•7 months ago
(In reply to Kelsey Gilbert [:jgilbert] from comment #23)
@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?
Kelsey, are you suggesting patching the prototype of cloneNode as a means of exiting early when complexity hits a particular level?
Updated•7 months ago
Comment 25•7 months ago
(In reply to Jason Kratzer [:jkratzer] from comment #24)
(In reply to Kelsey Gilbert [:jgilbert] from comment #23)
@tsmith: Can we give the fuzzer a mandatory preamble that patches prototypes to prevent this?
Kelsey, are you suggesting to patch the prototype of cloneNode as a means of exiting early when complexity hits a particular level?
Yeah exactly.
The only real alternative I see is to do the same in Gecko, but I think that would be harder and riskier.
The fuzzer has discovered the cursed power of exponential growth, and we need some way to pull it back from approximating infinity. :)
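As a rough illustration of that preamble idea, here is a minimal sketch, assuming a budget of about 1M nodes as floated in comment 23 (the wrapper, counter, and error message are hypothetical, not an existing Grizzly or fuzzer API):

(() => {
  // Hypothetical fuzzing preamble: abort a testcase once cloneNode has
  // materialized more nodes than a fixed budget.
  const MAX_CLONED_NODES = 1000000;  // ~1M cap suggested in comment 23
  let clonedSoFar = 0;
  const originalCloneNode = Node.prototype.cloneNode;
  Node.prototype.cloneNode = function (deep) {
    let size = 1;
    if (deep && this.querySelectorAll) {
      size += this.querySelectorAll("*").length;  // element descendants included in a deep clone
    }
    clonedSoFar += size;
    if (clonedSoFar > MAX_CLONED_NODES) {
      throw new Error("cloneNode budget exceeded (fuzzing preamble)");
    }
    return originalCloneNode.call(this, deep);
  };
})();

Throwing rather than silently returning keeps the testcase behavior deterministic for reduction; a similar budget could presumably be enforced on other growth paths (e.g. appendChild) if fuzzers find equivalent blow-ups there.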
Comment 26•6 months ago
The profile from comment 4 shows that this is really just in CC code. Either jkratzer or tsmith tried to repro the testcase with a div instead of a canvas, and it still repro'd, so this is likely a CC issue, though I believe it's just a large-graphs-are-slow problem that, even if we solve it for 2^23, is unsolvable for 2^33. :)
Comment 27•6 months ago
Interesting. Usually hangs this long involve JS, but I suppose you can spam out a bunch of regular old C++ objects, too. It sounds like there are around 8 million nodes? There could potentially be exponential CC traverse behavior with Canvas nodes, but if it happens with divs instead then that's not what is going on here.
The initial test case talks about shutdown hangs, and comment 4 talks about navigating back, so I assume this is a problem with page teardown where our CC optimizations can't do anything.
This explains why a shutdown hang won't reproduce in a non-debug, non-fuzzing build like Nightly: we don't run the CC when shutting down a process; we just shut it down. This means that bug 1850021 would solve the problem here by making content process shutdown behave like Nightly rather than a debug build. Note that this won't help the scenario where you are doing a same-origin navigation from the page, because the process isn't being killed.
Updated•6 months ago