Open Bug 1475518 Opened 6 years ago Updated 2 years ago

Commit-space usage investigation

Categories

(Core :: General, enhancement, P3)

People

(Reporter: gsvelto, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [MemShrink:P2][overhead:noted])

While investigating Windows OOM crashes I discovered that the vast majority of OOMs in 64-bit Windows builds are caused by commit-space exhaustion [1]. Windows separates memory usage into three categories: reserved virtual address space, committed memory and used physical memory. While the first and last are self-explanatory, commit-space is not: it's memory that a process has requested to be available at all times, so even if it's never used (i.e. not backed by physical memory) it can still cause an OOM when there's not enough of it (its size is usually physical memory + swap space, i.e. the page file).
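To make the distinction concrete, here is a toy model of system-wide commit accounting. It is not Windows code: the class, its fields and the sizes are all hypothetical, and it assumes the commit limit is simply physical memory plus page file as described above. It only sketches why a commit request can fail while most physical memory is still free.

```python
class CommitModel:
    """Toy model of system-wide commit accounting; all names are hypothetical."""

    def __init__(self, physical_mib, pagefile_mib):
        # Assumption: the commit limit is physical memory plus page file.
        self.physical_mib = physical_mib
        self.commit_limit = physical_mib + pagefile_mib
        self.committed = 0  # memory promised to processes (commit charge)
        self.touched = 0    # committed memory that has actually been used

    def commit(self, size_mib):
        """Promise size_mib to a process; fail if the limit would be exceeded."""
        if self.committed + size_mib > self.commit_limit:
            return False  # the OOM case discussed in this bug
        self.committed += size_mib
        return True

    def touch(self, size_mib):
        """Actually use part of the memory that was committed earlier."""
        self.touched = min(self.committed, self.touched + size_mib)


system = CommitModel(physical_mib=8192, pagefile_mib=2048)
system.commit(10000)  # committed but mostly never written to
system.touch(500)     # only a small part is really used
can_commit_more = system.commit(512)
physical_free = system.physical_mib - system.touched
print(can_commit_more, physical_free)
```

In this made-up scenario the last commit request is refused even though the bulk of physical memory is untouched, which matches the pattern seen in the crash reports.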

Inspecting OOM and about:memory reports I noticed a pattern I still cannot explain: content processes usually have a significant amount of committed memory that is never used. Tens of MiB per process is normal and I've seen extreme cases where committed memory was ten times the amount of used memory.

This committed memory doesn't seem to be accounted for within Gecko. The only hint I have as to where it's coming from is that it's quite small when a content process is created but grows rather fast when scrolling pages. My guess is that it's memory that the GPU driver (or some other part of the pipeline?) is committing for its internal purposes.

If we want to dramatically grow the number of content processes I believe we should identify where this memory is being used and why.

[1] https://sql.telemetry.mozilla.org/queries/52834#146985
Assuming this is mostly a GPU driver issue, we're probably going to have to solve it with WebGL remoting and a separate GPU process. We already want both of those for other reasons, and a GPU process will help enormously with other memory issues, like the separate glyph caches we currently have in all content processes.
Whiteboard: [MemShrink] → [MemShrink][overhead:noted]
Another hint that this is due to the GPU driver is that the compositor/GPU process shows the largest amount of unexplained commit-space. One possible explanation is that the driver (or something else?) commits memory that it might need for swapping out textures stored in GPU RAM. In the past GPU drivers kept an entire copy of every texture in system RAM, but this is no longer the case. Still, they may keep enough space around for "swapping out" textures if they run out of GPU RAM.

Finally, I've seen similar behavior on NVIDIA, Intel and AMD GPUs. AMD drivers fortunately report how much memory they have committed, and this amount is quite close to the measured overhead. This information is gathered via an undocumented Windows call [1], which is probably why not all vendors seem to implement it.

[1] https://searchfox.org/mozilla-central/rev/b0275bc977ad7fda615ef34b822bba938f2b16fd/gfx/thebes/gfxWindowsPlatform.cpp#192
Whiteboard: [MemShrink][overhead:noted] → [MemShrink:P2][overhead:noted]

After discussing this with my team today I had another quick look at this, and here are a few data points and pointers on how to conduct this investigation.

First of all the measurements: in about:memory what you're looking for is a (significant) discrepancy between the memory we've explicitly allocated (and accounted for) and the memory that Windows considers committed. The former is the explicit entry under Explicit Allocations and the latter is address-space > commit > private. Here's an example from my main process:

Explicit Allocations

462.85 MB (100.0%) ++ explicit

Other Measurements

134,217,727.94 MB (100.0%) -- address-space
└────────1,478.58 MB (00.00%) -- commit
         ├────692.89 MB (00.00%) ++ mapped
         ├────513.42 MB (00.00%) -- private
         │    ├──506.80 MB (00.00%) ── readwrite(segments=2564)
         │    ├────2.82 MB (00.00%) ── execute-read(segments=23)
         │    ├────2.40 MB (00.00%) ── readwrite+stack(segments=107)
         │    ├────1.37 MB (00.00%) ── readwrite+guard(segments=107)
         │    ├────0.02 MB (00.00%) ── noaccess(segments=5)
         │    └────0.02 MB (00.00%) ── readonly(segments=3)
         └────272.27 MB (00.00%) ++ image

This is not so bad: there's only around 50MiB of private committed memory that's not accounted for. Note that under the private entry you will find several sub-entries: readwrite is usually regular allocations, execute-read is probably buffers for JIT'd code, readwrite+stack is obviously the stack, and readwrite+guard is guard pages.
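For anyone who wants to do this comparison on saved reports rather than by eye, here is a minimal sketch that pulls the sizes out of text copied from about:memory. The regular expression and the function name are my own assumptions about the text layout shown above, not anything Gecko provides, and the sample is an abridged copy of the main-process excerpt.

```python
import re

# Matches e.g. "├────513.42 MB (00.00%) -- private". The separator before the
# name is "--", "++" or "──" depending on the kind of entry.
ENTRY_RE = re.compile(r"([\d,]+\.\d+) MB \([\d.]+%\) (?:--|\+\+|──) (\S+)")


def parse_entries(report_text):
    """Return {entry name: size in MB} for every line that looks like an entry."""
    entries = {}
    for line in report_text.splitlines():
        match = ENTRY_RE.search(line)
        if match is None:
            continue
        size_mb = float(match.group(1).replace(",", ""))
        # Drop the "(segments=N)" suffix so "readwrite(segments=2564)"
        # is stored as "readwrite".
        name = re.sub(r"\(segments=\d+\)$", "", match.group(2))
        entries[name] = size_mb
    return entries


SAMPLE = """\
462.85 MB (100.0%) ++ explicit
134,217,727.94 MB (100.0%) -- address-space
└────────1,478.58 MB (00.00%) -- commit
         ├────692.89 MB (00.00%) ++ mapped
         ├────513.42 MB (00.00%) -- private
         │    ├──506.80 MB (00.00%) ── readwrite(segments=2564)
         └────272.27 MB (00.00%) ++ image
"""

entries = parse_entries(SAMPLE)
unaccounted_mb = round(entries["private"] - entries["explicit"], 2)
print(unaccounted_mb)
```

The result is a flat dictionary, so this only works on the excerpt of a single process where entry names do not repeat; a full report would need the tree structure preserved.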

Here's another example, this time it's the gpu process:

Explicit Allocations

121.29 MB (100.0%) ++ explicit

Other Measurements

134,217,727.94 MB (100.0%) -- address-space
└────────1,080.50 MB (00.00%) -- commit
         ├────582.56 MB (00.00%) -- private
         │    ├──299.79 MB (00.00%) ── readwrite+writecombine(segments=215)
         │    ├──280.58 MB (00.00%) ── readwrite(segments=439)
         │    ├────1.34 MB (00.00%) ── readwrite+stack(segments=38)
         │    ├────0.74 MB (00.00%) ── readwrite+guard(segments=38)
         │    ├────0.07 MB (00.00%) ── execute-read(segments=2)
         │    ├────0.03 MB (00.00%) ── readonly(segments=6)
         │    └────0.02 MB (00.00%) ── noaccess(segments=6)
         ├────255.39 MB (00.00%) ++ mapped
         └────242.55 MB (00.00%) ++ image

It's a different story here: we've explicitly allocated ~120MiB of memory but there's over 580MiB of private committed memory! The readwrite entry is over twice the size of our explicitly allocated memory, so something else is allocating memory; my guess is that it's the graphics driver or the DirectX runtime. Then there's the readwrite+writecombine entry, which is the most suspicious of all. This is uncacheable memory with write-combining enabled, which is the hallmark of a buffer that must have been allocated by the graphics driver for use with the GPU. As you can see, it is very large.

Last but not least this is a content process:
web (pid 14288)

Explicit Allocations

493.66 MB (100.0%) ++ explicit

Other Measurements

134,217,727.94 MB (100.0%) -- address-space
└────────1,466.72 MB (00.00%) -- commit
         ├────643.57 MB (00.00%) ++ mapped
         ├────590.52 MB (00.00%) -- private
         │    ├──524.88 MB (00.00%) ── readwrite(segments=769)
         │    ├───59.04 MB (00.00%) ── readwrite+writecombine(segments=24)
         │    ├────3.63 MB (00.00%) ── execute-read(segments=12)
         │    ├────2.08 MB (00.00%) ── readwrite+stack(segments=65)
         │    ├────0.86 MB (00.00%) ── readwrite+guard(segments=65)
         │    ├────0.03 MB (00.00%) ── readonly(segments=6)
         │    └────0.01 MB (00.00%) ── noaccess(segments=2)
         └────232.64 MB (00.00%) ++ image

This is a mixed bag: the readwrite chunk is only a bit larger than our explicit allocations, but we've got a fairly hefty readwrite+writecombine chunk. I have no idea what that could be for: textures to back canvas elements maybe? Or buffers for video decoding?
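Putting the three excerpts side by side, the discrepancy can be worked out as below. The figures are copied from the reports above; the dictionary layout and function name are just for illustration, and the main process is given a write-combined size of zero only because no such entry appears in its excerpt.

```python
# Sizes in MB, copied from the three about:memory excerpts above.
PROCESSES = {
    "main":    {"explicit": 462.85, "private": 513.42, "writecombine": 0.0},
    "gpu":     {"explicit": 121.29, "private": 582.56, "writecombine": 299.79},
    "content": {"explicit": 493.66, "private": 590.52, "writecombine": 59.04},
}


def unaccounted(stats):
    """Private committed memory that explicit allocations do not explain."""
    return round(stats["private"] - stats["explicit"], 2)


summary = {name: unaccounted(stats) for name, stats in PROCESSES.items()}

for name, gap in summary.items():
    write_combined = PROCESSES[name]["writecombine"]
    print(f"{name:8s} unaccounted={gap:7.2f} MB  writecombine={write_combined:7.2f} MB")
```

This is only a rough comparison: it treats everything under explicit as if it were private committed memory, which is not necessarily true.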

The best way to figure out where that memory is going would be to attach a debugger to one of the affected processes and use it to get stack traces out of VirtualAlloc() and VirtualAllocEx() calls. In particular we're interested in calls that have the MEM_COMMIT flag set in the flAllocationType parameter, since those are the ones that actually commit the memory they're requesting. Additionally one could look for calls that have the PAGE_WRITECOMBINE flag set in the flProtect parameter. See this page for more info about that flag.
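Once the calls have been logged, the filtering itself is just bit tests on the two parameters. Here is a sketch of that step, assuming the debugger output has already been turned into (size, flAllocationType, flProtect) tuples; the records below are invented for illustration, while the constant values are the ones from the Windows SDK headers.

```python
# Win32 constants, with the values defined in the Windows SDK headers.
MEM_COMMIT = 0x1000
MEM_RESERVE = 0x2000
PAGE_READWRITE = 0x04
PAGE_WRITECOMBINE = 0x400


def is_commit(fl_allocation_type):
    """True if the call commits memory rather than only reserving it."""
    return bool(fl_allocation_type & MEM_COMMIT)


def is_write_combined(fl_protect):
    """True if the call asks for write-combined pages."""
    return bool(fl_protect & PAGE_WRITECOMBINE)


# Hypothetical logged calls: (size in bytes, flAllocationType, flProtect).
calls = [
    (1 << 20, MEM_RESERVE, PAGE_READWRITE),               # reserve only
    (4 << 20, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE),  # reserve and commit
    (8 << 20, MEM_COMMIT, PAGE_READWRITE | PAGE_WRITECOMBINE),
]

committed_bytes = sum(size for size, alloc, _ in calls if is_commit(alloc))
write_combined_bytes = sum(
    size
    for size, alloc, protect in calls
    if is_commit(alloc) and is_write_combined(protect)
)
print(committed_bytes, write_combined_bytes)
```

Grouping the matching calls by stack trace would then show which module is responsible for the bulk of the committed bytes.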

As discussed on Matrix it might also be worth checking some of the file mappings we have under the mapped and image entries (though the latter should be largely made up of xul.dll). In that case we'd have to get stacks for MapViewOfFile() and friends.

I updated the query in comment 0 [1]. It still shows the low-commit-space-situation is the highest. Looking at per-process view [2], the main process shows the biggest number.

[1] https://sql.telemetry.mozilla.org/queries/79144#196656
[2] https://sql.telemetry.mozilla.org/queries/79145#196658

I should also point out that Windows does not overcommit (ie, does not have an OOM killer). Reserved memory may not be committed, but it may not necessarily be excluded from commit space, either.

(In reply to Toshihito Kikuchi [:toshi] from comment #6)

> I updated the query in comment 0 [1]. It still shows the low-commit-space-situation is the highest. Looking at per-process view [2], the main process shows the biggest number.
>
> [1] https://sql.telemetry.mozilla.org/queries/79144#196656
> [2] https://sql.telemetry.mozilla.org/queries/79145#196658

Thanks Toshihito!

(In reply to Aaron Klotz [:aklotz] from comment #7)

> I should also point out that Windows does not overcommit (ie, does not have an OOM killer). Reserved memory may not be committed, but it may not necessarily be excluded from commit space, either.

Yes, what got me into this was the realization that the majority of our users experiencing OOM crashes on Windows had plenty of physical memory available at the time of the crash.

(In reply to Gabriele Svelto [:gsvelto] from comment #0)

> [1] https://sql.telemetry.mozilla.org/queries/52834#146985

Query #79144 appears to have been replaced with a different query(?), so here's an up-to-date version:

https://sql.telemetry.mozilla.org/queries/86429#214072

What are the units, number of crashes? MB free?

Flags: needinfo?(rkraesig)

Number of crashes in the nightly channel

^ Yes, that.

(In reply to Gabriele Svelto [:gsvelto] from comment #3)

> First of all the measurements: in about:memory what you're looking for is a (significant) discrepancy between the memory we've explicitly allocated (and accounted for) and the memory that Windows considers committed. The former is the explicit entry under Explicit Allocations and the latter is address-space > commit > private.

It appears that (nowadays?) one should also subtract, from the explicit entry, any decoded-nonheap entries found thereunder. (I infer from context that these are image-data stored in temporary-file-backed shared memory, which wouldn't count towards the commit charge.)
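As a sketch of that adjusted comparison: the function name is made up, and the 40 MB decoded-nonheap figure is purely hypothetical since the excerpts earlier in this bug don't break out that entry. The other two numbers are from the content process in comment 3.

```python
def unaccounted_commit(private_mb, explicit_mb, decoded_nonheap_mb=0.0):
    """Private commit not explained by explicit allocations.

    decoded-nonheap entries are subtracted from explicit first, on the
    assumption that they are file-backed and so not part of the private
    commit charge.
    """
    accounted_mb = explicit_mb - decoded_nonheap_mb
    return round(private_mb - accounted_mb, 2)


# Content process figures from comment 3, without and with a hypothetical
# 40 MB of decoded-nonheap image data under explicit.
gap_plain = unaccounted_commit(590.52, 493.66)
gap_adjusted = unaccounted_commit(590.52, 493.66, decoded_nonheap_mb=40.0)
print(gap_plain, gap_adjusted)
```

The adjustment can only make the unexplained amount larger, so the earlier numbers would be lower bounds if this inference is right.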

Flags: needinfo?(rkraesig)
Severity: normal → S3