Open Bug 1531088, Opened 5 years ago, Updated 2 years ago

Memory allocated by a content web app (5 GB+) is not released soon enough after the page reloads, eventually causing OOM

Categories

(Core :: JavaScript: GC, defect, P3)

Tracking

Tracking Status
firefox67 --- affected

People

(Reporter: mayhemer, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Nightly x64 (official builds), Windows 10 with 32 GB of RAM.

I have a web app that processes ~1 GB of logs and converts them to an in-memory representation. Memory consumption may reach several gigabytes, which is expected.

The problem: a later reload of the app, which immediately triggers another round of parsing of the same set of log files (from domcontentloaded), causes more and more memory to be consumed. Doing such a reload two or three times leads to:

  1. a huge slowdown of the parsing process (apparently GC interruptions)
  2. the parsing script eventually failing with an OUT OF MEMORY exception
  3. the memory not being released even after a longer period without activity (I didn't try closing the tab or doing a hard reload, to avoid restarting the parsing automatically - sometimes that's not even possible because of the huge load)

Private bytes allocated for the content process handling the tab go up to 10 GB. I have to kill that content process to be able to work again.
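
For context, the app's loading path has roughly the following shape (a hypothetical sketch, not the actual backtrack-analyzer source; loadBaselineLogs and parseLogLine are stand-ins for the real parsing code):

async function loadBaselineLogs() {
  // stand-in for reading the ~1 GB of raw log text selected as the baseline
  return ["sample line 1\nsample line 2"];
}

function parseLogLine(line) {
  // stand-in: each line becomes a small object, so millions of lines become
  // millions of GC-heap objects, arrays and strings
  return { raw: line, fields: line.split(" ") };
}

document.addEventListener("DOMContentLoaded", async () => {
  const model = [];
  for (const text of await loadBaselineLogs()) {
    for (const line of text.split("\n")) {
      model.push(parseLogLine(line));
    }
  }
  // every reload rebuilds this graph from scratch, while the previous
  // window's copy may still be waiting for the GC
  window.analyzerModel = model;
});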

STR:

  1. go to: https://mayhemer.github.io/backtrack-analyzer/
  2. download, unpack and load (as baseline) all logs at: https://janbambas.cz/dwnlds/backtrack-logs-1551291722.tgz
  3. wait for the message to go from "Loading..." to "Select objective" (will take some time)
  4. then reload the page, the "loading..." message will appear again
  5. repeat until the time to finish loading is either unbearable (minutes) or you get an "out of memory" exception

ER:
We should run GC more promptly, so the content process doesn't pile up JS objects and eventually make JS run out of memory.

This is quite an edge case, but I think worth reporting.

I don't know how actionable this is, but it could also be a recent leak regression: I have been working with these large log files and the app for quite some time, and this reloads+out-of-memory problem appeared only recently. Or it may just be that my logs have grown recently :)

Yeah, this sort of thing has been an issue for a while. The main example I'm aware of is asm.js applications, which allocate a large block of memory. There was a fix for those: if the attempt to allocate the big block fails, we immediately do a GC. It is harder to fix for something like this, where presumably the memory is scattered across a lot of different allocations. I'm not sure what a good fix is.
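
The closest content-side analogue of that pattern would be something like the following (a rough sketch of the allocate-fail-collect-retry idea, not the engine fix itself; the retry loop, count, and delay are my own assumptions):

async function allocateLargeBuffer(bytes, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return new ArrayBuffer(bytes);  // one big block, like an asm.js heap
    } catch (e) {
      // likely an out-of-memory error; back off so an incremental GC slice
      // has a chance to run, then try again
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }
  throw new Error("could not allocate " + bytes + " bytes");
}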

This is the relevant part of a diff of two memory reports, one taken before loading any files and the other after loading the files. We can see that most of the memory is indeed in the JS heap, and quite a lot of it is strings.

├──2,949.03 MB (94.77%) -- top(https://mayhemer.github.io/backtrack-analyzer/, id=NNN)
│ │ ├──1,627.79 MB (52.31%) -- active/window(https://mayhemer.github.io/backtrack-analyzer/)
│ │ │ ├──1,610.80 MB (51.77%) -- js-realm(https://mayhemer.github.io/backtrack-analyzer/)
│ │ │ │ ├──1,610.73 MB (51.76%) -- classes
│ │ │ │ │ ├──1,022.50 MB (32.86%) -- class(Object)/objects
│ │ │ │ │ │ ├────675.63 MB (21.71%) ── gc-heap
│ │ │ │ │ │ └────346.87 MB (11.15%) ── malloc-heap/slots
│ │ │ │ │ ├────588.15 MB (18.90%) -- class(Array)/objects
│ │ │ │ │ │ ├──520.94 MB (16.74%) ── gc-heap [+]
│ │ │ │ │ │ └───67.21 MB (02.16%) -- malloc-heap
│ │ │ │ │ │ ├──67.21 MB (02.16%) ── elements/normal
│ │ │ │ │ │ └───0.00 MB (00.00%) ── slots [+]
│ │ │ │ │ └──────0.08 MB (00.00%) ++ (4 tiny)
│ │ │ │ └──────0.07 MB (00.00%) ++ (4 tiny)
│ │ │ └─────16.99 MB (00.55%) ++ (3 tiny)
│ │ └──1,321.24 MB (42.46%) -- js-zone(0xNNN)
│ │ ├──1,288.19 MB (41.40%) -- strings
│ │ │ ├──1,255.41 MB (40.34%) ++ (2905 tiny)
│ │ │ └─────32.79 MB (01.05%) ++ string(length=118, copies=122791, "RunScript [https://www.redditstatic.com/desktop2x/vendorsChatClientGiftsGovernance~Reddit.31686eb897e3abe26481.js]")
│ │ └─────33.05 MB (01.06%) ++ (10 tiny)

(In reply to Paul Bone [:pbone] from comment #2)

│ │ │ └─────32.79 MB (01.05%) ++ string(length=118, copies=122791, "RunScript [https://www.redditstatic.com/desktop2x/vendorsChatClientGiftsGovernance~Reddit.31686eb897e3abe26481.js]")

Curious, I don't think I had a reddit tab open when I captured this. Maybe a service worker...

I recorded a profile of 3 reloads of the data:

http://bit.ly/2UeaFrO

Between the 1st and 2nd loads a GC starts; it finishes after the 2nd load begins. But because the GC uses snapshot-at-the-beginning marking, it can't free the memory that I assume was released when the 2nd load started.

Compare this to the gap between the 2nd and 3rd loads, where the GC starts after the 3rd load begins; that GC is able to free a fair amount of memory (maybe I got lucky).

Regardless, the 2nd GC will have been able to free what was released after the 1st GC started, so this shouldn't build up memory and OOM indefinitely (a more aggressive test is needed). Bug 1519298 will help us understand this better.
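
To illustrate why snapshot-at-the-beginning behaves that way, here is a toy sketch (purely illustrative, nothing to do with the actual SpiderMonkey implementation): marking works from the set of roots captured when the GC started, so anything reachable at that moment survives even if it becomes garbage before the sweep.

class ToyHeap {
  constructor() { this.objects = new Set(); this.roots = new Set(); }
  alloc(obj) { this.objects.add(obj); return obj; }
  startGC() { this.snapshot = new Set(this.roots); }     // snapshot-at-the-beginning
  finishGC() {
    const live = new Set();
    const mark = (o) => { if (!live.has(o)) { live.add(o); (o.refs || []).forEach(mark); } };
    this.snapshot.forEach(mark);
    this.objects = live;                                  // sweep everything unmarked
  }
}

const heap = new ToyHeap();
const firstLoad = heap.alloc({ refs: [heap.alloc({}), heap.alloc({})] });  // data from the 1st load
heap.roots.add(firstLoad);

heap.startGC();               // GC begins: firstLoad is in the snapshot
heap.roots.delete(firstLoad); // the 2nd load starts, the old data becomes unreachable...
heap.finishGC();              // ...but it still survives this GC
console.log(heap.objects.size); // 3 - only the *next* collection can free it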

I also noticed that the GC collected only a single zone, at least in my profile; that could limit what can be freed, but it didn't seem to for the 2nd GC.

Depends on: 1519298
Priority: -- → P3

I wonder if we should have more aggressive GC/CC scheduling for this reloading case. People love to spam reload, but it causes real problems for our usual heuristics, which assume that a page is going to live for more than a few seconds.

(In reply to Andrew McCreight [:mccr8] from comment #5)

I wonder if we should have more aggressive GC/CC scheduling for this reloading case. People love to spam reload, but it causes real problems for our usual heuristics, which assume that a page is going to live for more than a few seconds.

I think that would be great. We're also seeing this in the devtools tests. Reftests work around it with WindowUtils.runNextCollectorTimer, but it would be nice if we didn't need that kind of thing.
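
For reference, that reftest-style workaround looks roughly like this from chrome-privileged code (a sketch; runNextCollectorTimer is the nsIDOMWindowUtils method mentioned above, but the helper name and the iteration count are assumptions):

function forcePendingCollections(win) {
  const utils = win.windowUtils;  // nsIDOMWindowUtils, chrome-only
  // each call runs whichever GC/CC timer is due next; a few iterations
  // are usually enough to drain the queued collector work after a reload
  for (let i = 0; i < 10; i++) {
    utils.runNextCollectorTimer();
  }
}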

One small nit: when I load large logs into the web app, work with them for a while, and then load a different set of logs that is orders of magnitude smaller (a few MB), the content process later starts to hang periodically. I can see a full CPU core load and movement in memory (it seems to go up and later down - but I have to look more closely to confirm!). Note that all the expected initial allocations of the new web app reload (and of the new set of logs) had already finished at least 30+ seconds earlier.

(In reply to Honza Bambas (:mayhemer) from comment #7)

One small nit: when I load large logs into the web app, work with them for a while, and then load a different set of logs that is orders of magnitude smaller (a few MB), the content process later starts to hang periodically. I can see a full CPU core load and movement in memory (it seems to go up and later down - but I have to look more closely to confirm!). Note that all the expected initial allocations of the new web app reload (and of the new set of logs) had already finished at least 30+ seconds earlier.

Er... I'm not sure whether I reloaded the page before loading the new (smaller) set of logs. I will observe more closely next time.

Hi Honza, it'd also be great to get a profile of that particular behaviour - it sounds odd to me.

(In reply to Paul Bone [:pbone] from comment #9)

Hi Honza, it'd also be great to get a profile of that particular behaviour - it sounds odd to me.

ni->me to not forget.

Flags: needinfo?(honzab.moz)
Blocks: 1533449
No longer blocks: GCScheduling

stalled on my side.

Flags: needinfo?(honzab.moz)
Severity: normal → S3