Repeatedly loading a page leaks due to discard notification during incremental cycle collection
Categories
(Core :: DOM: Core & HTML, defect, P2)
Tracking
Tracking | Status
---|---
firefox80 | verified
People
(Reporter: jonco, Assigned: edgar)
References
(Blocks 2 open bugs)
Details
Attachments
(6 files, 2 obsolete files)
Full details are here: https://gist.github.com/flotwig/202b179adbe57a3903aeb71334f6b77a
AIUI this just repeatedly loads https://dashboard.cypress.io.
Reporter
Comment 1•5 years ago
I expect this is the same issue as bug 1576829.
Comment 2•5 years ago
Here is a verbose memory measurement from a Firefox process using almost all of the system's RAM. Most of the memory usage seems to be in Web Content (pid 3496).
Jon, does the attached memory report help to verify your assumption?
Reporter
Comment 4•5 years ago
TBH it looks like the problem is that the page allocates a ton of memory. There are only three instances of dashboard.cypress.io windows, so it's not like we're not cleaning them up. The biggest one has 1.5GB allocated.
Any chance we can track down where exactly in this page the high memory allocation is triggered? Which tools could be used for that?
This might be related to the crash we see on bug 1604680.
Reporter
Updated•5 years ago
Reporter
Comment 7•5 years ago
I haven't been able to run the test script itself but I made a page that repeatedly reloads https://dashboard.cypress.io in an iframe. It does seem that memory use can build up but I didn't see runaway growth. It may be that my laptop is powerful enough to keep up with the allocations though. I did see many zones for this site hanging around after I'd closed the page.
Reporter
Comment 8•5 years ago
I tried disabling the part of weakmap marking that marks keys based on the color of their delegate and which seems to be causing the problem in bug 1610193. This did not fix the problem, so it seems this is an unrelated issue with a similar effect.
While testing I see memory build up. Both GC and CC run periodically. The GC usually collects ~10 zones out of a total that continually increases. However sometimes all zones are collected, but few end up being removed.
Comment 9•5 years ago
Just to add more context to this bug: Cypress 4 is out now (https://cypress.io/blog/2020/02/06/introducing-firefox-and-edge-support-in-cypress-4-0/) with Firefox support in beta.
This bug is the biggest blocker to getting Cypress's Firefox support out of beta.
Zach, today I noticed that the Gecko profiler has a new option to track and visualize memory consumption, including allocation and garbage collection. Could you run a recent Firefox Nightly build and create a profile while running your tests? Here are the steps:
- Download Firefox from https://www.mozilla.org/en-US/firefox/channel/desktop/#nightly
- Go to "Tools -> Web Developer" and enable "Enable Profiler Toolbar Icon" (it will enable the new profiler ui which is required)
- Configure the profiler to record memory (javascript, and native) and maybe even some other data (having more is always better)
- Start the profiler
- Start running your script
- Wait some seconds (how long does it actually take for each iteration?) and make sure the profiler buffer is large enough to hold all the data
- Select "Capture the Recording"
- Publish the profile and share it with us
Thanks
Comment 11•5 years ago
Henrik, https://gist.github.com/flotwig/202b179adbe57a3903aeb71334f6b77a contains some profiles with native allocations that Zach recorded a while ago.
Harald, from the profiling session last week I understood that a lot of the memory recording features are new. So I wonder whether these profiles, which were recorded a while ago, already contain that information or not.
Comment 13•5 years ago
We celebrated Allocations as new, but it has been available for 2+ months. I can confirm that Zach used the new Native Allocation recording – check out the recordings.
(btw, just use Native Allocations, not JS Allocations which will just report on a small subset)
Comment 14•5 years ago
Examples from the profiles in the gist:
With forced GC:
Forced GC mostly collects memory from Layout code paths: https://perfht.ml/388xpQN
Reflow also causes a lot of allocated memory (25%): https://perfht.ml/31xIicE
Without forced GC:
Layout code generates 100+ MB of memory: https://perfht.ml/31BZKN9
Ok, thanks! Besides the memory problem, I can see that for the non-forced-GC profile we spend a huge amount of time in nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) due to constant reflows, which look to be related there.
Emilio, would you mind having a look at the profiles? Thanks.
Zach, is there a way to run the integration tests on our own, or does that only work internally? I assume that would help developers a lot to replicate the situation, and to prove a fix or different fixes.
Maybe the time spent in reflow is expected given that the script constantly reloads the page...
Also it might be helpful to have a profile which includes a run with more than just 4s of test execution, so that multiple iterations of the page load and possible GCs are included. Looking at https://perfht.ml/37h0ZCy it shows about 12MB of data that is retained after a GC. It's all coming from https://beacon-v2.helpscout.net/static/js/main.2.1.228c6719.js. I wonder if/how this accumulates over time.
Comment 18•5 years ago
Yeah, reflow time is sorta expected.
The "Forced GC collects memory from layout code-paths" is not expected though... What does the "Forced GC" really do? May this be evicting the BFCache, which could be somehow keeping around all the layout data structures for the page?
(In reply to Emilio Cobos Álvarez (:emilio) from comment #18)
The "Forced GC collects memory from layout code-paths" is not expected though... What does the "Forced GC" really do? May this be evicting the BFCache, which could be somehow keeping around all the layout data structures for the page?
Andrew, could you help us here please?
Comment 20•5 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #15)
Zach, is there a way to run the integration tests on our own, or does that only work internally? I assume that would help developers a lot to replicate the situation, and to prove a fix or different fixes.
Yeah, by creating a new Cypress project and adding the test code from the OP, you should be able to reproduce. We've been able to reproduce on macOS, Fedora, and Ubuntu so far. Rough instructions, assuming you have a fairly recent Node and npm:
- Open a new folder.
- Run npm init and npm install --save cypress.
- Create cypress/integration/spec.js and add the code from https://gist.github.com/flotwig/202b179adbe57a3903aeb71334f6b77a
- Run $(npm bin)/cypress open and begin running spec.js in Firefox.
You should be able to see memory usage steadily climb. If you set a low memory limit on the process by using a VM, you will see it crash from OOM eventually.
Comment 21•5 years ago
I forgot to mention - the instructions I just provided will run Firefox without forced GC+CC (we disable it when users run tests locally, so it doesn't constantly lock up for GC while they're trying to interact with the browser).
To run Firefox with forced GC+CC between every test, pass the environment variable CYPRESS_firefoxGcInterval=1
(case-sensitive) while launching Cypress. You should see it lock up between tests while Firefox does its thang, and the GC indicator in the bottom-left corner will show you when GC is running, plus a measurement of total time spent forcing GC.
Comment 22•5 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #19)
Andrew, could you help us here please?
I'm not really sure. Maybe it makes us do a full GC? That would catch some weird cases with weak maps and multiple compartments, like sfink has been looking at.
Reporter
Comment 23•5 years ago
I spent a few days last week investigating this.
I couldn't get Cypress to run against a local build of Firefox, so I made a page that loads https://dashboard.cypress.io/ into a single tab 100 times. Running this I see memory fluctuating but generally increasing from 200MB to over 1GB (at which point I kill the process).
I'm testing on MacOS and I'm running with --disable-e10s to rule out any effects from multiple processes (and make logging easier to understand). I'm looking at memory with the system Activity Monitor.
What I've noticed is that the number of GC zones starts at ~10 and continues to increase. In other words, zones are not dying (or only sometimes dying), so they build up.
This is not a problem with triggering GCs. We are running plenty (we are also running CCs too).
I investigated whether this is because we were not selecting the right zones for collection, but have ruled that out. From a GC point of view the zones are not dead; they have live roots. For most of the zones these are gray roots coming from outside the JS engine.
I investigated what CC roots are causing these JS GC things to be marked, and mostly these are: Promise, CallbackObject, JSStackFrame and JSGCThingParticipant (i.e. outgoing edges from the GC heap). I don't really know what to make of this.
I investigated whether we weren't running the snow white killer enough when the browser was busy, or weren't running deferred deletion enough, but that doesn't appear to be the problem.
I did find one thing that made the problem go away: running with CC logging turned on. I used the command:
MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_THREAD=main MOZ_CC_LOG_DIRECTORY=/Users/jon/cclogs/ mach -o run --disable-e10s
Now memory use stays around ~270MB and rarely breaks 300MB.
I spent a bunch of time trying to understand why this made a difference but wasn't able to work it out. I checked that these options don't set mWantAllTraces, which would definitely change things (e.g. by forcing full GCs).
My conclusion is that this is not a GC issue but a system wide or CC issue.
I've run out of ideas for how to attack this. Can one of the CC folks take a look?
Reporter
Comment 24•5 years ago
Test file I'm using to reproduce this.
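The attachment itself is not reproduced in this report. A minimal stand-in for such a page, based on the description in comment 23, might look like the sketch below (MAX_LOADS and the shouldReload helper are invented for illustration; this is not the actual attached test file):

```javascript
// Hypothetical reproduction page: load a heavy site into an iframe
// repeatedly so memory growth across navigations can be watched in
// Activity Monitor or about:memory.
const MAX_LOADS = 100; // "100 times", per comment 23
const TARGET = "https://dashboard.cypress.io/";

// Pure helper: decide whether another reload should be scheduled.
function shouldReload(count) {
  return count < MAX_LOADS;
}

if (typeof document !== "undefined") {
  // Browser-only portion: drive the iframe through repeated loads.
  let count = 0;
  const frame = document.createElement("iframe");
  document.body.appendChild(frame);
  frame.addEventListener("load", () => {
    count += 1;
    if (shouldReload(count)) frame.src = TARGET;
  });
  frame.src = TARGET;
}
```

Watching the zone counts in about:memory while such a page runs is what surfaced the "zones not dying" symptom described above.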
Comment 25•5 years ago
That's peculiar. CC logging shouldn't affect CC behavior. Maybe the logging slows it down enough that the CC isn't running as incrementally or something.
Comment 26•5 years ago
I'll try to take a look at this at some point, but I haven't had any time yet.
Comment 27•5 years ago
Hi everyone, thank you for your time spent investigating this issue. This is the last major blocker before the Cypress team can take Firefox support out of "beta", so please let me know if there's anything else we can provide from our side to expedite investigations here.
Comment 28•5 years ago
Hi, sorry I've been ignoring your bug so long. I'll take a look starting today or tomorrow.
Comment 29•5 years ago
Ok, I started looking at this for real now. I haven't figured anything out yet, but I'll note that if you do "minimize memory usage" in about:memory, it also frees everything immediately. Between this and Jon's observation that the leak goes away with logging, I'm guessing there's some bad interaction with incremental CC. I'll try to confirm my guess that logging somehow keeps us from running the CC as incrementally due to the slow down. Then I'll have to figure out how to log while still running things incrementally. Maybe I can tweak our heuristics so that it doesn't give up as quickly on being incremental.
The basic issue which I've seen before is that after a window closes, you can end up with some bit of code running periodically and Addref/Releasing something in the closed window. With a sync CC, that's no big deal because the extra reference is gone before the CC runs. However, with incremental CC, if that happens in between slices, then the CC will note that an object has been AddRef/Released, so out of caution it will not collect it. The idea is that we can probably free it in the next CC, but if it keeps happening then we'll never free it. I've seen this happen before with various media code, which had some helper code that didn't shut down correctly.
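The dynamic described above can be reduced to a toy model (illustrative JavaScript, not Gecko code; the function and parameter names are invented): an object whose refcount changes between ICC slices is treated as suspect and conservatively kept, so a notifier that fires between every slice starves collection forever, while a synchronous CC with a single slice frees the object.

```javascript
// Toy model of the incremental cycle collection hazard: any object that is
// AddRef'd/Released between slices is marked "suspected" and not collected.
function runCycleCollection({ slices, touchedBetweenSlices }) {
  let suspected = false;
  // Gaps between slices only exist when there is more than one slice.
  for (let gap = 1; gap < slices; gap++) {
    if (touchedBetweenSlices(gap)) suspected = true; // refcount changed mid-CC
  }
  return suspected ? "kept" : "freed";
}

// A synchronous CC has a single slice: nothing can be touched "between"
// slices, so the dead object is freed.
const syncResult = runCycleCollection({ slices: 1, touchedBetweenSlices: () => true }); // "freed"

// An incremental CC interleaves with a periodic notification that keeps
// touching the object, so every CC conservatively keeps it.
const iccResult = runCycleCollection({ slices: 5, touchedBetweenSlices: () => true }); // "kept"
```

This also suggests why CC logging hid the leak: if logging slows the CC enough that it stops running incrementally, the model degenerates to the single-slice case and the objects get freed.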
Comment 30•5 years ago
When I'm running this in a debug build, the console is being spammed with warnings like this:
WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /Users/andrewmccreight/mc/dom/events/DOMEventTargetHelper.cpp, line 321
which is from DOMEventTargetHelper::CheckCurrentGlobalCorrectness(). So maybe we're trying to run a DETH for a window that has gone away. Maybe I can just crash if we fail that warning when we're in the middle of an incremental CC...
Comment 31•5 years ago
I poked at recording what objects get touched during a CC, but there was too much noise.
I did manage to reproduce the issue with CC logging enabled. I didn't try it without modification, but I did try it with the GC logging part of it commented out, in the hope that would avoid whatever Jonco was seeing, and I could reproduce.
For at least some of the leaking windows, the thing holding them alive is an image element, with the CC description "FragmentOrElement (xhtml) img https://dashboard.cypress.io/login". As I guessed, it is an incremental CC root for some reason. I haven't been able to figure out why yet.
Comment 32•5 years ago
I did refcount logging on image elements, and for at least one of them it looks like image notification is what is AddRefing the element:
<HTMLImageElement> 0x14f312580 19 AddRef 3 [thread 0x101958080]
#01: imgRequestProxy::Notify(int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (/Users/andrewmccreight/mc/obj-dbg.noindex/dist/include/nsCOMPtr.h:513)
#02: void mozilla::image::ImageObserverNotifier<mozilla::image::ObserverTable const*>::operator()<void mozilla::image::SyncNotifyInternal<mozilla::image::ObserverTable const*>(mozilla::image::ObserverTable const* const&, bool, unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&)::{lambda(mozilla::image::IProgressObserver*)#2}>(void mozilla::image::SyncNotifyInternal<mozilla::image::ObserverTable const*>(mozilla::image::ObserverTable const* const&, bool, unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&)::{lambda(mozilla::image::IProgressObserver*)#2}) (/Users/andrewmccreight/mc/image/ProgressTracker.cpp:283)
#03: void mozilla::image::SyncNotifyInternal<mozilla::image::ObserverTable const*>(mozilla::image::ObserverTable const* const&, bool, unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&) (/Users/andrewmccreight/mc/image/ProgressTracker.cpp:330)
#04: mozilla::image::ProgressTracker::SyncNotifyProgress(unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&) (/Users/andrewmccreight/mc/image/CopyOnWrite.h:46)
#05: mozilla::image::RasterImage::NotifyProgress(unsigned int, mozilla::gfx::IntRectTyped<mozilla::UnorientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) (/Users/andrewmccreight/mc/image/RasterImage.cpp:1684)
#06: mozilla::image::RasterImage::NotifyDecodeComplete(mozilla::image::DecoderFinalStatus const&, mozilla::image::ImageMetadata const&, mozilla::image::DecoderTelemetry const&, unsigned int, mozilla::gfx::IntRectTyped<mozilla::UnorientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) (/Users/andrewmccreight/mc/image/RasterImage.cpp:1721)
#07: mozilla::detail::RunnableFunction<mozilla::image::IDecodingTask::NotifyDecodeComplete(mozilla::NotNull<mozilla::image::RasterImage*>, mozilla::NotNull<mozilla::image::Decoder*>)::$_6>::Run() (/Users/andrewmccreight/mc/image/IDecodingTask.cpp:123)
I'm not sure if this is an image element in a window that has closed or not.
Updated•5 years ago
Comment 33•4 years ago
I think what is happening is that because we're spam reloading the page we keep discarding the image on the web page, which causes us to call into imgRequestProxy::Notify(), which notifies all of the copies of the page that are in memory. The element is getting notified even when its owner doc is not active, which seems weird. I'm not sure when image elements are supposed to stop being notified, or exactly what the notification is doing.
Anyways, the basic problem is that the page is completely dead, but the notification addrefs/releases the element in the middle of an ICC, so we never end up freeing it. I'm not sure if this is more of a DOM issue or an image lib issue. Timothy, do you have any thoughts on what could be done here to stop us from leaking? Would it be okay to stop getting these notifications when the document is inactive?
Updated•4 years ago
Comment 34•4 years ago
This is an interesting one!
Comment 35•4 years ago
(In reply to Andrew McCreight [:mccr8] from comment #33)
Would it be okay to stop getting these notifications when the document is inactive?
For this concept of "inactive" is it possible for the document to go back to being active? We do currently rely on all notifications getting through. If the document is certainly doomed then that doesn't matter.
Comment 36•4 years ago
One thing I've been thinking about is not discarding images as aggressively when their document goes away because this is not the first thing I've come across lately where we uselessly discard and redecode because a document goes away. Something like waiting 10 seconds after the document dies before discarding the images if they are still unused at that time. That would have the effect of not sending out these notifications.
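That idea can be sketched as follows (a toy model only; DelayedDiscardTracker and its method names are invented and are not imagelib API): the discard is queued with a grace period when the document dies and canceled if a reload picks the image up again.

```javascript
// Illustrative model of delayed image discard: instead of discarding decoded
// image data the moment its document goes away, schedule the discard and
// cancel it if the image is reused within the grace period.
class DelayedDiscardTracker {
  constructor(gracePeriodMs = 10000) { // "waiting 10 seconds", per the comment
    this.gracePeriodMs = gracePeriodMs;
    this.pending = new Map(); // image -> time at which discard becomes due
  }
  onDocumentDestroyed(image, now) {
    this.pending.set(image, now + this.gracePeriodMs);
  }
  onImageReused(image) {
    this.pending.delete(image); // a reload wants the image: keep the decode
  }
  collectDue(now) {
    const discarded = [];
    for (const [image, due] of this.pending) {
      if (now >= due) {
        discarded.push(image);
        this.pending.delete(image);
      }
    }
    return discarded;
  }
}
```

Under this scheme a spam-reloaded page would keep reusing the still-decoded image, so the discard, and with it the dead-page notifications described in comment 33, would never fire.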
Comment 37•4 years ago
You might be able to check Document::mIsGoingAway. If that is the case, the document shouldn't come back from bfcache anymore.
But one needs to remember that some other window might still have a pointer to the relevant img element and adopt the element to its own document.
(One could detect that by overriding https://searchfox.org/mozilla-central/rev/446160560bf32ebf4cb7c4e25d7386ee22667255/dom/base/nsINode.h#706, which HTMLImageElement for example already overrides: https://searchfox.org/mozilla-central/source/dom/html/HTMLImageElement.cpp#654)
Comment 38•4 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #35)
For this concept of "inactive" is it possible for the document to go back to being active? We do currently rely on all notifications getting through. If the document is certainly doomed then that doesn't matter.
It looks like mIsGoingAway never goes back to false once it is true.
I tried adding a call to DestroyImageLoadingContent() in HTMLImageElement::DestroyContent(), which gets called at the same time mIsGoingAway is set to true. It looks like this stops the list of image elements being notified from growing unboundedly, but it doesn't seem to fix the leak.
This also doesn't handle the adoption case Olli mentioned. Can a node really get adopted out of a destroyed document?
Comment 39•4 years ago
On further testing, it seems like maybe my change did fix the leak. Maybe I was getting something weird like about:memory causing a leak by running during an ICC.
Comment 40•4 years ago
Here's a quick patch to delay image discard. This should hopefully avoid discarding and redecoding the image.
Comment 41•4 years ago
(In reply to Andrew McCreight [:mccr8] from comment #38)
This also doesn't handle the adoption case Olli mentioned. Can a node really get adopted out of a destroyed document?
Yes. Make the parent window point to an element in an iframe, then unload the iframe, and adopt the element into window.document.
Something like http://mozilla.pettay.fi/moztests/imgadopt.html
Comment 42•4 years ago
With this patch we still end up discarding and redecoding every 3-5 document loads, because the discard event happens when the image isn't "locked" (i.e. visible in a document), which is the case while the document is in between loads.
Memory use still goes up higher than we might want.
Comment 43•4 years ago
We could probably get away with just replaying all of the image status as notifications again if an image gets "revived"; it'll be a little inefficient, but probably okay. There are a couple of notification types that aren't stored in the status so won't get replayed (discard and unlocked draw), but discard is only used by tests, and unlocked draw is probably fine given that these nodes are "dead". We'd just need a place to replay the image status bits.
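A sketch of that replay idea (illustrative only; STORED_STATUS and replayStatus are invented names, not the real imagelib status flags): stored status bits are re-delivered to a revived observer, while one-shot notifications such as discard and unlocked draw are simply lost.

```javascript
// Toy model of replaying an image's stored status to a "revived" observer.
const STORED_STATUS = ["SIZE_AVAILABLE", "FRAME_COMPLETE", "DECODE_COMPLETE"];

function replayStatus(image, observer) {
  const delivered = [];
  for (const bit of STORED_STATUS) {
    if (image.status.has(bit)) {
      observer.notify(bit); // re-send the notification as if it just happened
      delivered.push(bit);
    }
  }
  // Discard and unlocked-draw are not part of the stored status, so they are
  // never replayed; per the comment that is acceptable for "dead" nodes.
  return delivered;
}
```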
Comment 44•4 years ago
Thanks for taking a look. Is this something you could work on, Timothy? I'd prefer to not have to try and learn how all of this image notification stuff works if possible. (Sorry I ignored this bug for the last week.)
Comment 45•4 years ago
I'm going to move this to imagelib, though it is kind of at the border with DOM.
Comment 46•4 years ago
Well I mean if the fix is in image notifications sure, but we don't even have an idea for how to fix this yet.
Comment 47•4 years ago
:tnikkel, could other peers help unblock this?
Comment 48•4 years ago
Feel free to needinfo anyone you think could help.
Comment 49•4 years ago
ni? Edgar, who offered to take a look (next week). Thank you!
Assignee
Comment 50•4 years ago
(In reply to Andrew McCreight [:mccr8] from comment #38)
I tried adding a call to DestroyImageLoadingContent() in HTMLImageElement::DestroyContent(), which gets called at the same time mIsGoingAway is set to true.
This also doesn't handle the adoption case Olli mentioned.
I think it makes sense for us to clear the imageRequest in HTMLImageElement::DestroyContent(), which would fix the leak. And we don't need any special handling for the adoption case, given that HTMLImageElement would reload the image when adopted into another document (this behavior is defined in the spec). The same leak could happen on other elements which might load an image, like HTMLInputElement, HTMLEmbedElement, etc.; I think we need to do similar things for them too, but I need to check the adoption case for them first.
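The invariant being relied on here can be sketched as a toy model (illustrative JavaScript, not Gecko; all names are invented): dropping the request at teardown loses nothing, because adoption into a live document starts a fresh load anyway.

```javascript
// Minimal model: an image element whose request is cleared at DestroyContent
// time, and re-established when the element is adopted into a new document.
function makeImageElement(src) {
  return { src, request: "loading:" + src, ownerDoc: "doc1" };
}
function destroyContent(el) {
  el.request = null; // document torn down: stop listening for notifications
}
function adoptInto(el, doc) {
  el.ownerDoc = doc;
  el.request = "loading:" + el.src; // per spec, <img> reloads on adoption
  return el;
}

const img = makeImageElement("hero.png");
destroyContent(img);    // no more dead-page notifications -> no ICC leak
adoptInto(img, "doc2"); // the adopted element simply loads again
```

In this model the dead document's element can no longer be AddRef'd by image notifications, which is exactly the mid-ICC refcount traffic that was keeping the windows alive.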
Assignee
Comment 51•4 years ago
(In reply to Edgar Chen [:edgar] from comment #50)
And the same leak could happen on other elements which might load an image, like HTMLInputElement, HTMLEmbedElement, etc.; I think we need to do similar things for them too. But I need to check the adoption case for them first.
I checked all nsImageLoadingContent elements; most of them would reload the image after adoption when the imageRequest is cleared, except HTMLObjectElement and HTMLEmbedElement, which inherit from nsObjectLoadingContent and only reload after attribute mutation.
Assignee
Comment 52•4 years ago
Assignee
Comment 53•4 years ago
It would be called in the destructor of nsObjectLoadingContent.
Assignee
Comment 54•4 years ago
Assignee
Comment 55•4 years ago
Assignee
Comment 56•4 years ago
Updated•4 years ago
Updated•4 years ago
Assignee
Comment 57•4 years ago
Assignee
Comment 58•4 years ago
(In reply to Edgar Chen [:edgar] from comment #57)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6683f0cca3f34699062036ebda2abfafb2cc5d23
[task 2020-07-10T14:57:06.288Z] 14:57:06 INFO - TEST-UNEXPECTED-TIMEOUT | /content-security-policy/inheritance/iframe-all-local-schemes-inherit-self.sub.html | <iframe>'s about:blank inherits policy. - Test timed out
[task 2020-07-10T14:57:06.289Z] 14:57:06 INFO - TEST-UNEXPECTED-TIMEOUT | /content-security-policy/inheritance/iframe-all-local-schemes-inherit-self.sub.html | expected OK
I got an unexpected timeout on /content-security-policy/inheritance/iframe-all-local-schemes-inherit-self.sub.html. I think it hits bug 543435: the test creates an iframe and appends an image to it, but the image is appended to the initial about:blank, which is soon replaced by a new about:blank, and the onload event is also suppressed when the initial about:blank is going to be destroyed.
Updated•4 years ago
Assignee
Comment 59•4 years ago
Depends on D82729
Updated•4 years ago
Assignee
Comment 60•4 years ago
Comment 61•4 years ago
Comment 62•4 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/58c8aa9d3a93
https://hg.mozilla.org/mozilla-central/rev/cbe8f7a3adae
https://hg.mozilla.org/mozilla-central/rev/2a381254701d
Hi Zach, with this bug fixed we would be interested if your problem is gone now, or if it still exists. Would you mind testing again with the latest Firefox Nightly? Thanks!
Comment 64•4 years ago
== Change summary for alert #26607 (as of Mon, 27 Jul 2020 06:29:40 GMT) ==
Improvements:
10% Images windows10-64-shippable-qr opt 9,001,649.68 -> 8,131,449.62
9% Images windows10-64-shippable-qr opt 8,984,308.08 -> 8,142,803.93
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=26607
Updated•4 years ago
Comment 65•4 years ago
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #63)
Hi Zach, with this bug fixed we would be interested if your problem is gone now, or if it still exists. Would you mind testing again with the latest Firefox Nightly? Thanks!
Seems to be working great with Firefox Nightly! A huge "thank you" from the Cypress team to everyone involved with fixing this bug. :)
I see Target Milestone: mozilla80, does that mean it's safe to assume that this fix will be in Firefox in any release channel with version 80 or above?
(In reply to Zach Bloomquist from comment #65)
I see Target Milestone: mozilla80, does that mean it's safe to assume that this fix will be in Firefox in any release channel with version 80 or above?
This is correct. And great to hear that it is fixed for you.
Comment 67•4 years ago
Thank you, Zach!
Closing as verified per comment 65.
Comment 68•4 years ago
I happened to be looking at our memory usage benchmark just now, specifically at the memory usage for image data. It looks like the memory usage for images went from extremely noisy to extremely non-noisy in a change set that included this bug. I don't know for sure if that is the cause, but it seems plausible. Pretty cool.
Comment 69•4 years ago
That is great!
Assignee
Comment 70•4 years ago
\o/
Comment 71•4 years ago
Wow. That makes a difference, indeed!