Bug 1608501 (open) - Test script causes Firefox to OOM repeatedly loading a page
Opened 3 months ago, updated 2 days ago

Component: Core :: JavaScript: GC
Type: defect
Priority: P2
Reporter: jonco
Assignee: Unassigned
Flags: NeedInfo
Blocks: 2 open bugs
Attachments: 2 files

I expect this is the same issue as bug 1576829.

Attached file memory-measurement.txt

Here is a verbose memory measurement from a Firefox process using almost all of the system's RAM. Most of the memory usage seems to be in Web Content (pid 3496).

Jon, does the attached memory report help to verify your assumption?

Flags: needinfo?(jcoppeard)

TBH it looks like the problem is that the page allocates a ton of memory. There are only three instances of dashboard.cypress.io windows, so it's not as if we're failing to clean them up. The biggest one has 1.5GB allocated.

Flags: needinfo?(jcoppeard)

Any chance we can narrow down where exactly in this page the high memory allocation is triggered? Which tools could be used for that?

This might be related to the crash we see on bug 1604680.

Blocks: 1604723
See Also: → 1604680
Flags: needinfo?(jcoppeard)
Priority: -- → P2

I haven't been able to run the test script itself but I made a page that repeatedly reloads https://dashboard.cypress.io in an iframe. It does seem that memory use can build up but I didn't see runaway growth. It may be that my laptop is powerful enough to keep up with the allocations though. I did see many zones for this site hanging around after I'd closed the page.

I tried disabling the part of weakmap marking that marks keys based on the color of their delegate and which seems to be causing the problem in bug 1610193. This did not fix the problem, so it seems this is an unrelated issue with a similar effect.

While testing I see memory build up. Both GC and CC run periodically. The GC usually collects ~10 zones out of a total that continually increases. Sometimes all zones are collected, but even then few of them actually end up being removed.

Just to add more context to this bug: Cypress 4 is out now https://cypress.io/blog/2020/02/06/introducing-firefox-and-edge-support-in-cypress-4-0/ with Firefox support in beta.

This bug is the biggest blocker to getting Cypress's Firefox support out of beta.

Zach, I noticed today that the Gecko Profiler has a new option to track and visualize memory consumption, including allocation and garbage collection. Could you run a recent Firefox Nightly build and create a profile while running your tests? Here are the steps:

  1. Download Firefox from https://www.mozilla.org/en-US/firefox/channel/desktop/#nightly
  2. Go to "Tools -> Web Developer" and enable "Enable Profiler Toolbar Icon" (this enables the new profiler UI, which is required)
  3. Configure the profiler to record memory (JavaScript and native) and maybe some other data too (having more is always better)
  4. Start the profiler
  5. Start running your script
  6. Wait a few seconds (how long does each iteration actually take?) and make sure the profiler buffer is large enough to hold all the data
  7. Select "Capture the Recording"
  8. Publish the profile and share it with us

Thanks

Flags: needinfo?(z)

Henrik, https://gist.github.com/flotwig/202b179adbe57a3903aeb71334f6b77a contains some profiles with native allocations that Zach recorded a while ago.

Flags: needinfo?(z) → needinfo?(hskupin)

Harald, from the profiling session last week I gathered that a lot of the memory recording features are new. So I wonder whether the profiles recorded a while ago already contain that information or not.

Flags: needinfo?(hskupin)

We celebrated Allocations as new, but it has been available for 2+ months. I can confirm that Zach used the new Native Allocation recording – check out the recordings.

(btw, use Native Allocations rather than JS Allocations, which only reports on a small subset)

Examples from the profiles in the gist:

With forced GC:
Forced GC mostly collects memory from Layout code paths: https://perfht.ml/388xpQN
Reflow also causes a lot of allocated memory (25%): https://perfht.ml/31xIicE

Without forced GC:
Layout code generates 100+ MB of memory: https://perfht.ml/31BZKN9

Ok, thanks! Besides the memory problem, I can see that for the non-forced-GC profile we spend a huge amount of time in nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp) due to constant reflows, which look to be related.

Emilio, would you mind having a look at the profiles? Thanks.

Zach, is there a way to run the integration tests on our own, or does that only work internally? I assume that would help developers a lot in replicating the situation and in verifying a fix or different fixes.

Flags: needinfo?(z)
Flags: needinfo?(emilio)

Maybe the time spent in reflow is expected, given that the script constantly reloads the page...

Also, it might be helpful to have a profile that covers more than just 4s of test execution, so that multiple iterations of the page load and possible GCs are included. Looking at https://perfht.ml/37h0ZCy, it shows about 12MB of data that is retained after a GC. It's all coming from https://beacon-v2.helpscout.net/static/js/main.2.1.228c6719.js. I wonder if/how this accumulates over time.

Yeah, reflow time is sorta expected.

The "Forced GC collects memory from layout code-paths" is not expected though... What does the "Forced GC" really do? May this be evicting the BFCache, which could be somehow keeping around all the layout data structures for the page?

Flags: needinfo?(emilio)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #18)

The "Forced GC collects memory from layout code-paths" is not expected though... What does the "Forced GC" really do? May this be evicting the BFCache, which could be somehow keeping around all the layout data structures for the page?

Andrew, could you help us here please?

Flags: needinfo?(continuation)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #15)

Zach, is there a way to run the integration tests on our own, or does that only work internally? I assume that would help developers a lot to replicate the situation, and to prove a fix or different fixes.

Yeah, by creating a new Cypress project and adding the test code from the OP, you should be able to reproduce. We've been able to reproduce on macOS, Fedora, and Ubuntu so far. Rough instructions, assuming you have a fairly recent Node and npm:

  1. Create a new folder.
  2. npm init
  3. npm install --save cypress
  4. Create cypress/integration/spec.js and add the code from https://gist.github.com/flotwig/202b179adbe57a3903aeb71334f6b77a (a rough sketch of the general shape is shown below).
  5. Run $(npm bin)/cypress open and begin running spec.js in Firefox. You should see memory usage steadily climb. If you set a low memory limit on the process by using a VM, you will eventually see it crash from OOM.
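
For illustration, a hypothetical minimal spec.js along these lines (my own sketch, not the exact contents of the gist) just loads the page over and over:

  // Hypothetical minimal spec.js -- the real test code is in the gist linked above.
  // Each generated test simply visits the page so memory growth can be observed.
  describe('repeated page load', () => {
    for (let i = 0; i < 100; i++) {
      it(`loads the dashboard, iteration ${i}`, () => {
        cy.visit('https://dashboard.cypress.io/');
      });
    }
  });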
Flags: needinfo?(z)

I forgot to mention - the instructions I just provided will run Firefox without forced GC+CC (we disable it when users run tests locally, so it doesn't constantly lock up for GC while they're trying to interact with the browser).

To run Firefox with forced GC+CC between every test, pass the environment variable CYPRESS_firefoxGcInterval=1 (case-sensitive) while launching Cypress. You should see it lock up between tests while Firefox does its thing, and the GC indicator in the bottom-left corner will show when GC is running, along with a measurement of the total time spent forcing GC.
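
For example, combining this with the launch command from the steps above:

  CYPRESS_firefoxGcInterval=1 $(npm bin)/cypress open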

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #19)

Andrew, could you help us here please?

I'm not really sure. Maybe it makes us do a full GC? That would catch some weird cases with weak maps and multiple compartments, like sfink has been looking at.

Flags: needinfo?(continuation)

I spent a few days last week investigating this.

I couldn't get Cypress to run against a local build of Firefox, so I made a page that loads https://dashboard.cypress.io/ into a single tab 100 times. Running this I see memory fluctuating but generally increasing from 200MB to over 1GB (at which point I kill the process).

I'm testing on macOS and I'm running with --disable-e10s to rule out any effects from multiple processes (and to make logging easier to understand). I'm looking at memory with the system Activity Monitor.

What I've noticed is that the number of GC zones starts at ~10 and continues to increase. In other words, zones are not dying (or only sometimes dying), so they build up.

This is not a problem with triggering GCs: we are running plenty of them (and CCs too).

I investigated whether this is because we were not selecting the right zones for collection, but I have ruled that out. From a GC point of view the zones are not dead; they have live roots. For most of the zones these are gray roots coming from outside the JS engine.

I investigated what CC roots are causing these JS GC things to be marked, and mostly these are: Promise, CallbackObject, JSStackFrame and JSGCThingParticipant (i.e. outgoing edges from the GC heap). I don't really know what to make of this.
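
For illustration only, here is one hypothetical content pattern (not something identified in this page) that produces exactly these kinds of roots: a callback registered on an object that outlives the frame keeps a CallbackObject alive on the C++ side, and the callback's closure keeps the frame's global, and therefore its zone, reachable from outside the JS engine.

  // Hypothetical illustration only -- not taken from the page under test.
  // Assuming the frame is same-origin with its parent, the parent's event
  // listener list holds a CallbackObject wrapping this function. The closure
  // captures this frame's `document`, so the frame's zone stays reachable
  // via a CC/gray root even after the frame itself is reloaded or removed.
  window.parent.addEventListener('message', function onMessage(event) {
    console.log(document.readyState, event.data);
  });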

I investigated whether we weren't running the snow white killer enough when the browser was busy, or were not running the deferred deletion enough, but that doesn't appear to be the problem.

I did find one thing that made the problem go away: running with CC logging turned on. I used the command:

MOZ_CC_LOG_ALL=1 MOZ_CC_LOG_THREAD=main MOZ_CC_LOG_DIRECTORY=/Users/jon/cclogs/ mach -o run --disable-e10s

Now memory use stays around ~270MB and rarely breaks 300MB.

I spent a bunch of time trying to understand why this made a difference but wasn't able to work it out. I checked that these options don't set mWantAllTraces, which would definitely change things (e.g. by forcing full GCs).

My conclusion is that this is not a GC issue but a system-wide or CC issue.

I've run out of ideas for how to attack this. Can one of the CC folks take a look?

Flags: needinfo?(continuation)
Flags: needinfo?(bugs)
Attached file index.html

Test file I'm using to reproduce this.
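
Roughly, such a page could look like the following (a minimal sketch; the iteration count, delay, and cache-busting query parameter are my assumptions, not necessarily what the attachment does):

  <!-- Minimal sketch of a reload-loop page; details are assumptions. -->
  <!DOCTYPE html>
  <meta charset="utf-8">
  <iframe id="frame" style="width: 100%; height: 600px"></iframe>
  <script>
    const frame = document.getElementById('frame');
    let count = 0;
    function loadOnce() {
      if (count++ >= 100) {
        return;                                      // stop after 100 loads
      }
      // A query parameter makes each assignment a distinct navigation.
      frame.src = 'https://dashboard.cypress.io/?i=' + count;
    }
    frame.addEventListener('load', () => setTimeout(loadOnce, 1000));
    loadOnce();
  </script>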

That's peculiar. CC logging shouldn't affect CC behavior. Maybe the logging slows it down enough that the CC isn't running as incrementally or something.

See Also: → 1612705

I'll try to take a look at this at some point, but I haven't had any time yet.

Hi everyone, thank you for your time spent investigating this issue. This is the last major blocker before the Cypress team can take Firefox support out of "beta", so please let me know if there's anything else we can provide from our side to expedite investigations here.
