Closed Bug 1608501 Opened 5 years ago Closed 4 years ago

Repeatedly loading a page leaks due to discard notification during incremental cycle collection

Categories

(Core :: DOM: Core & HTML, defect, P2)


Tracking


VERIFIED FIXED
mozilla80
Tracking Status
firefox80 --- verified

People

(Reporter: jonco, Assigned: edgar)

References

(Blocks 2 open bugs)


Attachments

(6 files, 2 obsolete files)

I expect this is the same issue as bug 1576829.

Attached file memory-measurement.txt

Here is a verbose memory measurement from a Firefox process using almost all of the system's RAM. Most of the memory usage seems to be in Web Content (pid 3496).

Jon, does the attached memory report help to verify your assumption?

Flags: needinfo?(jcoppeard)

TBH it looks like the problem is that the page allocates a ton of memory. There are only three instances of dashboard.cypress.io windows, so it's not like we're not cleaning them up. The biggest one has 1.5GB allocated.

Flags: needinfo?(jcoppeard)

Any chance we can narrow down where exactly in this page the high memory allocation is triggered? Which tools could be used for that?

This might be related to the crash we see on bug 1604680.

Blocks: 1604723
See Also: → 1604680
Flags: needinfo?(jcoppeard)
Flags: needinfo?(jcoppeard)
Priority: -- → P2

I haven't been able to run the test script itself but I made a page that repeatedly reloads https://dashboard.cypress.io in an iframe. It does seem that memory use can build up but I didn't see runaway growth. It may be that my laptop is powerful enough to keep up with the allocations though. I did see many zones for this site hanging around after I'd closed the page.

I tried disabling the part of weakmap marking that marks keys based on the color of their delegate and which seems to be causing the problem in bug 1610193. This did not fix the problem, so it seems this is an unrelated issue with a similar effect.

While testing I see memory build up. Both GC and CC run periodically. The GC usually collects ~10 zones out of a total that continually increases. However sometimes all zones are collected, but few end up being removed.

Just to add more context to this bug: Cypress 4 is out now https://cypress.io/blog/2020/02/06/introducing-firefox-and-edge-support-in-cypress-4-0/ with Firefox support in beta.

This bug is the biggest blocker to getting Cypress's Firefox support out of beta.

Zach, today I saw that the Gecko Profiler has a new option to track and visualize memory consumption, including allocations and garbage collection. Could you run a recent Firefox Nightly build and create a profile while running your tests? Here are the steps:

  1. Download Firefox from https://www.mozilla.org/en-US/firefox/channel/desktop/#nightly
  2. Go to "Tools -> Web Developer" and enable "Enable Profiler Toolbar Icon" (it will enable the new profiler UI, which is required)
  3. Configure the profiler to record memory (JavaScript and native) and maybe even some other data (having more is always better)
  4. Start the profiler
  5. Start running your script
  6. Wait some seconds (how long does it actually take for each iteration?) and make sure the profiler buffer is large enough to hold all the data
  7. Select "Capture the Recording"
  8. Publish the profile and share it with us

Thanks

Flags: needinfo?(z)

Henrik, https://gist.github.com/flotwig/202b179adbe57a3903aeb71334f6b77a contains some profiles with native allocations that Zach recorded a while ago.

Flags: needinfo?(z) → needinfo?(hskupin)

Harald, from the profiling session last week I understood that a lot of the memory recording features are new. So I wonder whether these profiles, which were recorded a while ago, already contain that information or not.

We celebrated Allocations as new, but it has been available for 2+ months. I can confirm that Zach used the new Native Allocation recording – check out the recordings.

(btw, just use Native Allocations, not JS Allocations which will just report on a small subset)

Examples from the profiles in the gist:

With forced GC:
Forced GC mostly collects memory from Layout code paths: https://perfht.ml/388xpQN
Reflow also causes a lot of allocated memory (25%): https://perfht.ml/31xIicE

Without forced GC:
Layout code generates 100+ Mb of memory: https://perfht.ml/31BZKN9

Ok, thanks! Besides the memory problem, I can see that for the non-forced GC profile we spend a huge amount of time in nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) due to constant reflows, which look to be related.

Emilio, would you mind having a look at the profiles? Thanks.

Zach, is there a way to run the integration tests on our own, or does that only work internally? I assume that would help developers a lot to replicate the situation and to verify potential fixes.

Flags: needinfo?(z)
Flags: needinfo?(emilio)

Maybe the time spent in reflow is expected given that the script constantly reloads the page...

Also it might be helpful to have a profile which includes a run with more than just 4s of test execution, so that multiple iterations of the page load and possible GCs are included. Looking at https://perfht.ml/37h0ZCy it shows about 12MB of data that is retained after a GC. It's all coming from https://beacon-v2.helpscout.net/static/js/main.2.1.228c6719.js. I wonder if/how this accumulates over time.

Yeah, reflow time is sorta expected.

The "Forced GC collects memory from layout code-paths" is not expected though... What does the "Forced GC" really do? May this be evicting the BFCache, which could be somehow keeping around all the layout data structures for the page?

Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #18)

The "Forced GC collects memory from layout code-paths" is not expected though... What does the "Forced GC" really do? May this be evicting the BFCache, which could be somehow keeping around all the layout data structures for the page?

Andrew, could you help us here please?

Flags: needinfo?(continuation)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #15)

Zach, is there a way to run the integration tests on our own, or does that only work internally? I assume that would help developers a lot to replicate the situation, and to prove a fix or different fixes.

Yeah, by creating a new Cypress project and adding the test code from the OP, you should be able to reproduce. We've been able to reproduce on macOS, Fedora, and Ubuntu so far. Rough instructions, assuming you have a fairly recent Node and npm:

  1. Open a new folder.
  2. npm init
  3. npm install --save cypress
  4. create cypress/integration/spec.js, add the code from https://gist.github.com/flotwig/202b179adbe57a3903aeb71334f6b77a
  5. run $(npm bin)/cypress open and begin running spec.js in Firefox. You should be able to see memory usage steadily climb. If you set a low memory limit on the process by using a VM, you will see it crash from OOM eventually.
Flags: needinfo?(z)

I forgot to mention - the instructions I just provided will run Firefox without forced GC+CC (we disable it when users run tests locally, so it doesn't constantly lock up for GC while they're trying to interact with the browser).

To run Firefox with forced GC+CC between every test, pass the environment variable CYPRESS_firefoxGcInterval=1 (case-sensitive) while launching Cypress. You should see it lock up between tests while Firefox does its thang, and the GC indicator in the bottom-left corner will show you when GC is running + a measurement of total time spent forcing GC.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #19)

Andrew, could you help us here please?

I'm not really sure. Maybe it makes us do a full GC? That would catch some weird cases with weak maps and multiple compartments, like sfink has been looking at.

Flags: needinfo?(continuation)

I spent a few days last week investigating this.

I couldn't get Cypress to run against a local build of Firefox, so I made a page that loads https://dashboard.cypress.io/ into a single tab 100 times. Running this I see memory fluctuating but generally increasing from 200MB to over 1GB (at which point I kill the process).

I'm testing on MacOS and I'm running with --disable-e10s to rule out any effects from multiple processes (and make logging easier to understand). I'm looking at memory with the system Activity Monitor.

What I've noticed is that the number of GC zones starts at ~10 and continues to increase. In other words, zones are not dying (or only sometimes dying) so they build up.

This is not a problem with triggering GCs. We are running plenty of them (and CCs too).

I investigated whether this is because we were not selecting the right zones for collection, but have ruled that out. From a GC point of view the zones are not dead; they have live roots. For most of the zones these are gray roots coming from outside the JS engine.

I investigated what CC roots are causing these JS GC things to be marked, and mostly these are: Promise, CallbackObject, JSStackFrame and JSGCThingParticipant (i.e. outgoing edges from the GC heap). I don't really know what to make of this.

I investigated whether we weren't running the snow white killer enough when the browser was busy or were not running the deferred deletion enough, but that doesn't appear to be the problem.

I did find one thing that made the problem go away: running with CC logging turned on. I used the command:

MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_THREAD=main MOZ_CC_LOG_DIRECTORY=/Users/jon/cclogs/ mach -o run --disable-e10s

Now memory use stays around ~270MB and rarely breaks 300MB.

I spent a bunch of time trying to understand why this made a difference but wasn't able to work it out. I checked that these options don't set mWantAllTraces, which would definitely change things (e.g. by forcing full GCs).

My conclusion is that this is not a GC issue but a system wide or CC issue.

I've run out of ideas for how to attack this. Can one of the CC folks take a look?

Flags: needinfo?(continuation)
Flags: needinfo?(bugs)
Attached file index.html

Test file I'm using to reproduce this.

That's peculiar. CC logging shouldn't affect CC behavior. Maybe the logging slows it down enough that the CC isn't running as incrementally or something.

See Also: → 1612705

I'll try to take a look at this at some point, but I haven't had any time yet.

Hi everyone, thank you for your time spent investigating this issue. This is the last major blocker before the Cypress team can take Firefox support out of "beta", so please let me know if there's anything else we can provide from our side to expedite investigations here.

Hi, sorry I've been ignoring your bug so long. I'll take a look starting today or tomorrow.

Ok, I started looking at this for real now. I haven't figured anything out yet, but I'll note that if you do "minimize memory usage" in about:memory, it also frees everything immediately. Between this and Jon's observation that the leak goes away with logging, I'm guessing there's some bad interaction with incremental CC. I'll try to confirm my guess that logging somehow keeps us from running the CC as incrementally due to the slow down. Then I'll have to figure out how to log while still running things incrementally. Maybe I can tweak our heuristics so that it doesn't give up as quickly on being incremental.

The basic issue which I've seen before is that after a window closes, you can end up with some bit of code running periodically and Addref/Releasing something in the closed window. With a sync CC, that's no big deal because the extra reference is gone before the CC runs. However, with incremental CC, if that happens in between slices, then the CC will note that an object has been AddRef/Released, so out of caution it will not collect it. The idea is that we can probably free it in the next CC, but if it keeps happening then we'll never free it. I've seen this happen before with various media code, which had some helper code that didn't shut down correctly.
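
To make that concrete, here is a minimal standalone sketch (toy C++, not the actual cycle collector; the snapshot and purple-buffer machinery are simplified down to a "touched between slices" set):

  // Toy model (not Gecko code): the CC is reduced to "remember which objects
  // had their refcount touched between slices, and refuse to collect those
  // this cycle".
  #include <cstdio>
  #include <unordered_set>

  struct Object;
  static std::unordered_set<Object*> gTouchedDuringICC;

  struct Object {
    int mRefCnt = 1;
    void AddRef() { ++mRefCnt; gTouchedDuringICC.insert(this); }
    void Release() { --mRefCnt; gTouchedDuringICC.insert(this); }
  };

  // Final CC slice: anything touched since the snapshot is kept out of
  // caution, so if the touching repeats every cycle, it is never freed.
  static bool CanCollectThisCycle(Object* aObj) {
    return gTouchedDuringICC.count(aObj) == 0;
  }

  int main() {
    Object imgElement;          // element belonging to an already-closed page
    gTouchedDuringICC.clear();  // incremental CC starts, snapshot taken

    imgElement.AddRef();        // e.g. an image notification fires...
    imgElement.Release();       // ...in between CC slices

    std::printf("collected this cycle: %s\n",
                CanCollectThisCycle(&imgElement) ? "yes" : "no");  // "no"
    return 0;
  }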

Assignee: nobody → continuation
Component: JavaScript: GC → XPCOM
Flags: needinfo?(continuation)
Flags: needinfo?(bugs)

When I'm running this in a debug build, the console is being spammed with warnings like this:
WARNING: NS_ENSURE_TRUE(!mHasOrHasHadOwnerWindow || mOwnerWindow) failed: file /Users/andrewmccreight/mc/dom/events/DOMEventTargetHelper.cpp, line 321
which is from DOMEventTargetHelper::CheckCurrentGlobalCorrectness(). So maybe we're trying to run a DETH for a window that has gone away. Maybe I can just crash if we fail that warning when we're in the middle of an incremental CC...

I poked at recording what objects get touched during a CC, but there was too much noise.

I did manage to reproduce the issue with CC logging enabled. I didn't try it without modification, but I did try it with the GC logging part of it commented out, in the hope that would avoid whatever Jonco was seeing, and I could reproduce.

For at least some of the leaking windows, the thing holding them alive is an image element, with the CC description "FragmentOrElement (xhtml) img https://dashboard.cypress.io/login". As I guessed, it is an incremental CC root for some reason. I haven't been able to figure out why yet.

I did refcount logging on image elements, and for at least one of them it looks like image notification is what is AddRefing the element:

<HTMLImageElement> 0x14f312580 19 AddRef 3 [thread 0x101958080]
#01: imgRequestProxy::Notify(int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const*) (/Users/andrewmccreight/mc/obj-dbg.noindex/dist/include/nsCOMPtr.h:513)
#02: void mozilla::image::ImageObserverNotifier<mozilla::image::ObserverTable const*>::operator()<void mozilla::image::SyncNotifyInternal<mozilla::image::ObserverTable const*>(mozilla::image::ObserverTable const* const&, bool, unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&)::{lambda(mozilla::image::IProgressObserver*)#2}>(void mozilla::image::SyncNotifyInternal<mozilla::image::ObserverTable const*>(mozilla::image::ObserverTable const* const&, bool, unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&)::{lambda(mozilla::image::IProgressObserver*)#2}) (/Users/andrewmccreight/mc/image/ProgressTracker.cpp:283)
#03: void mozilla::image::SyncNotifyInternal<mozilla::image::ObserverTable const*>(mozilla::image::ObserverTable const* const&, bool, unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&) (/Users/andrewmccreight/mc/image/ProgressTracker.cpp:330)
#04: mozilla::image::ProgressTracker::SyncNotifyProgress(unsigned int, mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const&) (/Users/andrewmccreight/mc/image/CopyOnWrite.h:46)
#05: mozilla::image::RasterImage::NotifyProgress(unsigned int, mozilla::gfx::IntRectTyped<mozilla::UnorientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) (/Users/andrewmccreight/mc/image/RasterImage.cpp:1684)
#06: mozilla::image::RasterImage::NotifyDecodeComplete(mozilla::image::DecoderFinalStatus const&, mozilla::image::ImageMetadata const&, mozilla::image::DecoderTelemetry const&, unsigned int, mozilla::gfx::IntRectTyped<mozilla::UnorientedPixel> const&, mozilla::Maybe<unsigned int> const&, mozilla::image::DecoderFlags, mozilla::image::SurfaceFlags) (/Users/andrewmccreight/mc/image/RasterImage.cpp:1721)
#07: mozilla::detail::RunnableFunction<mozilla::image::IDecodingTask::NotifyDecodeComplete(mozilla::NotNull<mozilla::image::RasterImage*>, mozilla::NotNull<mozilla::image::Decoder*>)::$_6>::Run() (/Users/andrewmccreight/mc/image/IDecodingTask.cpp:123)

I'm not sure if this is an image element in a window that has closed or not.

Component: XPCOM → DOM: Core & HTML

I think what is happening is that because we're spam reloading the page we keep discarding the image on the web page, which causes us to call into imgRequestProxy::Notify(), which notifies all of the copies of the page that are in memory. The element is getting notified even when its owner doc is not active, which seems weird. I'm not sure when image elements are supposed to stop being notified, or exactly what the notification is doing.

Anyways, the basic problem is that the page is completely dead, but the notification addrefs/releases the element in the middle of an ICC, so we never end up freeing it. I'm not sure if this is more of a DOM issue or an image lib issue. Timothy, do you have any thoughts on what could be done here to stop us from leaking? Would it be okay to stop getting these notifications when the document is inactive?

Flags: needinfo?(tnikkel)
Summary: Test script causes Firefox to OOM repeatedly loading a page → Repeatedly loading a page leaks due to discard notification during incremental cycle collection

This is an interesting one!

(In reply to Andrew McCreight [:mccr8] from comment #33)

Would it be okay to stop getting these notifications when the document is inactive?

For this concept of "inactive" is it possible for the document to go back to being active? We do currently rely on all notifications getting through. If the document is certainly doomed then that doesn't matter.

Flags: needinfo?(continuation)

One thing I've been thinking about is not discarding images as aggressively when their document goes away; this is not the first case I've come across lately where we uselessly discard and redecode because a document went away. Something like waiting 10 seconds after the document dies before discarding the images, if they are still unused at that time. That would have the effect of not sending out these notifications.
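
Roughly what I have in mind, as a standalone toy (plain C++, not the actual imagelib patch; the method names here are made up for illustration):

  #include <chrono>
  #include <optional>

  using Clock = std::chrono::steady_clock;

  struct Image {
    bool locked = false;  // still visible in some live document
    std::optional<Clock::time_point> discardDeadline;

    void OnOwnerDocumentDestroyed() {
      locked = false;
      // Instead of discarding immediately, wait ~10 seconds in case the page
      // is about to be reloaded and the decoded data can be reused.
      discardDeadline = Clock::now() + std::chrono::seconds(10);
    }

    void OnLockedByNewDocument() {
      locked = true;
      discardDeadline.reset();  // a reload arrived in time; keep the surfaces
    }

    // Called periodically by whatever drives discard housekeeping.
    void MaybeDiscard(Clock::time_point aNow) {
      if (!locked && discardDeadline && aNow >= *discardDeadline) {
        Discard();
        discardDeadline.reset();
      }
    }

    void Discard() {
      // Drop decoded surfaces; in Gecko this is where the discard
      // notification that AddRefs/Releases observers would go out.
    }
  };

  int main() {
    Image img;
    img.OnOwnerDocumentDestroyed();
    img.MaybeDiscard(Clock::now());  // too early: decoded data is kept
    img.MaybeDiscard(Clock::now() + std::chrono::seconds(11));  // now discarded
  }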

You might be able to check Document::mIsGoingAway. If that is the case, the document shouldn't come back from bfcache anymore.

But one needs to remember that some other window might still have a pointer to the relevant img element and adopt the element into its own document.
(One could detect that by overriding https://searchfox.org/mozilla-central/rev/446160560bf32ebf4cb7c4e25d7386ee22667255/dom/base/nsINode.h#706,
which HTMLImageElement for example already overrides: https://searchfox.org/mozilla-central/source/dom/html/HTMLImageElement.cpp#654)

(In reply to Timothy Nikkel (:tnikkel) from comment #35)

For this concept of "inactive" is it possible for the document to go back to being active? We do currently rely on all notifications getting through. If the document is certainly doomed then that doesn't matter.

It looks like mIsGoingAway never goes back to false once it is set to true.

I tried adding a call to DestroyImageLoadingContent() in HTMLImageElement::DestroyContent(), which gets called at the same time mIsGoingAway is set to true. It looks like this stops the list of image elements being notified from growing unboundedly, but it doesn't seem to fix the leak.

This also doesn't handle the adoption case Olli mentioned. Can a node really get adopted out of a destroyed document?

Flags: needinfo?(continuation)

On further testing, it seems like maybe my change did fix the leak. Maybe I was getting something weird like about:memory causing a leak by running during an ICC.

Attached file delay image discard (obsolete)

Here's a quick patch to delay image discard. This should hopefully avoid discarding and redecoding the image.

(In reply to Andrew McCreight [:mccr8] from comment #38)

This also doesn't handle the adoption case Olli mentioned. Can a node really get adopted out of a destroyed document?

Yes. Make the parent window point to an element in an iframe, then unload the iframe, and adopt the element into window.document.
Something like http://mozilla.pettay.fi/moztests/imgadopt.html

Attached patch imgdiscdelay (Splinter Review)

With this patch we still end up discarding and redecoding every 3-5 document loads, because the discard event happens when the image isn't "locked" (i.e. visible in a document), while the document is in between loads.

Memory use still goes up higher than we might want.

Attachment #9148293 - Attachment is obsolete: true

We could probably get away with just replaying all of the image status as notifications again if an image gets "revived"; it'll be a little inefficient, but probably okay. There are a couple of notification types that aren't stored in the status so won't get replayed (discard and unlocked draw), but discard is only used by tests, and unlocked draw is probably fine given that these nodes are "dead". We'd just need a place from which to replay the image status bits.
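
As a standalone toy (not the real ProgressTracker code; the flag names are invented for illustration), the replay idea is roughly:

  #include <cstdint>
  #include <cstdio>

  // Invented flag names; in imagelib the accumulated state lives in the
  // progress tracker's status field.
  enum StatusFlags : uint32_t {
    STATUS_SIZE_AVAILABLE = 1u << 0,
    STATUS_FRAME_COMPLETE = 1u << 1,
    STATUS_DECODE_COMPLETE = 1u << 2,
    STATUS_LOAD_COMPLETE = 1u << 3,
    // Discard and unlocked-draw are events rather than status bits, so they
    // can't be replayed this way; as noted above, that's probably fine.
  };

  struct Observer {
    void Notify(uint32_t aEvent) {
      std::printf("notified: 0x%x\n", (unsigned)aEvent);
    }
  };

  struct ImageStatusTracker {
    uint32_t status = 0;

    void RecordProgress(uint32_t aFlags) { status |= aFlags; }

    // Called when an image is "revived" by a new document: replay everything
    // that has already happened so the new observer catches up.
    void ReplayTo(Observer& aObs) {
      for (uint32_t bit = 1; bit != 0 && bit <= status; bit <<= 1) {
        if (status & bit) {
          aObs.Notify(bit);
        }
      }
    }
  };

  int main() {
    ImageStatusTracker tracker;
    tracker.RecordProgress(STATUS_SIZE_AVAILABLE | STATUS_DECODE_COMPLETE |
                           STATUS_LOAD_COMPLETE);
    Observer revivedObserver;
    tracker.ReplayTo(revivedObserver);  // sees the three accumulated bits
  }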

Thanks for taking a look. Is this something you could work on, Timothy? I'd prefer to not have to try and learn how all of this image notification stuff works if possible. (Sorry I ignored this bug for the last week.)

I'm going to move this to imagelib, though it is kind of at the border with DOM.

Component: DOM: Core & HTML → ImageLib

Well I mean if the fix is in image notifications sure, but we don't even have an idea for how to fix this yet.

:tnikkel, could other peers help unblock this?

Feel free to needinfo anyone you think could help.

ni? Edgar, who offered to take a look (next week). Thank you!

Flags: needinfo?(echen)

(In reply to Andrew McCreight [:mccr8] from comment #38)

I tried adding a call to DestroyImageLoadingContent() in HTMLImageElement::DestroyContent(), which gets called at the same time mIsGoingAway is set to true.
This also doesn't handle the adoption case Olli mentioned.

I think it makes sense that we clear the imageRequest in HTMLImageElement::DestroyContent(), which would fix the leak. And we don't need to do any special handling for the adoption case, given that HTMLImageElement would reload the image when it is adopted into another document (this behavior is defined in the spec). The same leak would happen on other elements which might load images, like HTMLInputElement, HTMLEmbedElement, etc., so I think we need to do similar things for them, too. But I need to check the adoption case for them first.

Assignee: continuation → echen
Component: ImageLib → DOM: Core & HTML
Flags: needinfo?(tnikkel)
Flags: needinfo?(echen)

(In reply to Edgar Chen [:edgar] from comment #50)

The same leak would happen on other elements which might load images, like HTMLInputElement, HTMLEmbedElement, etc., so I think we need to do similar things for them, too. But I need to check the adoption case for them first.

I checked all nsImageLoadingContent elements; most of them would reload the image after adoption when the imageRequest is cleared, except HTMLObjectElement and HTMLEmbedElement, which inherit from nsObjectLoadingContent and only reload after attribute mutation.
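
For illustration, here is a simplified standalone model of the shape of the fix (toy C++; the real changes are the attached patches, and the class and method names below are placeholders, not Gecko's):

  #include <algorithm>
  #include <vector>

  struct ImageObserver {
    virtual void Notify() = 0;
    virtual ~ImageObserver() = default;
  };

  struct ImageRequest {
    std::vector<ImageObserver*> observers;
    void AddObserver(ImageObserver* aObs) { observers.push_back(aObs); }
    void RemoveObserver(ImageObserver* aObs) {
      observers.erase(std::remove(observers.begin(), observers.end(), aObs),
                      observers.end());
    }
    void NotifyAll() {  // e.g. a discard or decode-complete event
      for (ImageObserver* obs : observers) {
        obs->Notify();
      }
    }
  };

  struct ImageElement : ImageObserver {
    ImageRequest* request = nullptr;

    void LoadImage(ImageRequest* aReq) {
      request = aReq;
      request->AddObserver(this);
    }

    void Notify() override {
      // In Gecko this is where the element gets AddRef/Released, which is
      // what keeps resurrecting it during incremental CC.
    }

    // The essence of the fix: called when the owner document is torn down.
    void DestroyContent() {
      if (request) {
        request->RemoveObserver(this);  // no more notifications for a dead page
        request = nullptr;              // drop the image request
      }
    }
  };

  int main() {
    ImageRequest request;
    ImageElement img;
    img.LoadImage(&request);
    img.DestroyContent();  // owner document destroyed
    request.NotifyAll();   // a later discard event no longer touches img
  }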

Attachment #9162433 - Attachment description: Bug 1608501 - Part 3: Clean up image request in DestroyContent; → Bug 1608501 - Part 2: Clean up image request in DestroyContent;
Attachment #9162183 - Attachment is obsolete: true

(In reply to Edgar Chen [:edgar] from comment #57)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6683f0cca3f34699062036ebda2abfafb2cc5d23

[task 2020-07-10T14:57:06.288Z] 14:57:06 INFO - TEST-UNEXPECTED-TIMEOUT | /content-security-policy/inheritance/iframe-all-local-schemes-inherit-self.sub.html | <iframe>'s about:blank inherits policy. - Test timed out
[task 2020-07-10T14:57:06.289Z] 14:57:06 INFO - TEST-UNEXPECTED-TIMEOUT | /content-security-policy/inheritance/iframe-all-local-schemes-inherit-self.sub.html | expected OK

I got an unexpected timeout on /content-security-policy/inheritance/iframe-all-local-schemes-inherit-self.sub.html. I think it hits bug 543435: the test creates an iframe and appends an image to it, but the image is appended to the initial about:blank, which is soon replaced by a new about:blank, and the onload event is also suppressed when the initial about:blank is going to be destroyed.

Attachment #9162182 - Attachment description: Bug 1608501 - Part 1: Rename nsImageLoadingContent::DestroyImageLoadingContent to nsImageLoadingContent::DestroyContent; → Bug 1608501 - Part 1: Rename nsImageLoadingContent::DestroyImageLoadingContent to ::Destroy;
Attachment #9162433 - Attachment description: Bug 1608501 - Part 2: Clean up image request in DestroyContent; → Bug 1608501 - Part 3: Clean up image request in DestroyContent;
Pushed by echen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/58c8aa9d3a93 Part 1: Rename nsImageLoadingContent::DestroyImageLoadingContent to ::Destroy; r=smaug
https://hg.mozilla.org/integration/autoland/rev/cbe8f7a3adae Part 2: Rename nsObjectLoadingContent::DestroyContent to ::Destroy; r=smaug
https://hg.mozilla.org/integration/autoland/rev/2a381254701d Part 3: Clean up image request in DestroyContent; r=smaug

Hi Zach, with this bug fixed we would be interested if your problem is gone now, or if it still exists. Would you mind testing again with the latest Firefox Nightly? Thanks!

Flags: needinfo?(z)

== Change summary for alert #26607 (as of Mon, 27 Jul 2020 06:29:40 GMT) ==

Improvements:

10% Images windows10-64-shippable-qr opt 9,001,649.68 -> 8,131,449.62
9% Images windows10-64-shippable-qr opt 8,984,308.08 -> 8,142,803.93

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=26607

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #63)

Hi Zach, with this bug fixed we would be interested if your problem is gone now, or if it still exists. Would you mind testing again with the latest Firefox Nightly? Thanks!

Seems to be working great with Firefox Nightly! A huge "thank you" from the Cypress team to everyone involved with fixing this bug. :)

I see Target Milestone: mozilla80, does that mean it's safe to assume that this fix will be in Firefox in any release channel with version 80 or above?

Flags: needinfo?(z)

(In reply to Zach Bloomquist from comment #65)

I see Target Milestone: mozilla80, does that mean it's safe to assume that this fix will be in Firefox in any release channel with version 80 or above?

This is correct. And great to hear that it is fixed for you.

Thank you, Zach!
Closing as verified per comment 65.

Status: RESOLVED → VERIFIED
Flags: qe-verify+

I happened to be looking at our memory usage benchmark just now, specifically at the memory usage for image data. It looks like the memory usage for images went from extremely noisy to extremely non-noisy in a change set that included this bug. I don't know for sure if that is the cause, but it seems plausible. Pretty cool.

https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=mozilla-central,1961123,1,4&series=mozilla-central,2242853,1,4&series=mozilla-central,2242903,1,4&timerange=7776000

That is great!

Wow. That makes a difference, indeed!
