Open Bug 680358 Opened 13 years ago Updated 2 years ago

Fuse frame queue allocation to avoid wasted address space and per-frame malloc calls

Categories

(Core :: Audio/Video: Playback, defect)


REOPENED

People

(Reporter: kinetik, Unassigned)

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P2][see comment 15 and 16])

Running DMD on a local test video revealed larger than expected internal fragmentation on our video frame allocations:

==5965== Unreported: 7,708,672 (cumulative: 7,708,672) bytes in 11 heap block(s) in record 1 of 11426:
==5965==  Requested bytes unreported: 1,536,000 / 16,896,000
==5965==  Slop      bytes unreported: 6,172,672 / 6,172,672
==5965==    at 0x4A06065: malloc (vg_replace_malloc.c:261)
==5965==    by 0x5678F26: moz_xmalloc (mozalloc.cpp:110)
==5965==    by 0x7A49C75: mozilla::layers::BasicPlanarYCbCrImage::SetData(mozilla::layers::PlanarYCbCrImage::Data const&) (mozalloc.h:241)
==5965==    by 0x73BCEEB: VideoData::Create(nsVideoInfo&, mozilla::layers::ImageContainer*, long, long, long, VideoData::YCbCrBuffer const&, int, long, nsIntRect) (nsBuiltinDecoderReader.cpp:186)

The test video is 800x480, so each frame allocation is 1536000 bytes.  The decoded frame queue holds 10 frames.  The allocator appears to be allocating 2097152 bytes per frame, resulting in 25% slop.

(Note that one entire frame is classified as unreported because it has been allocated but is not currently present in the decode queue used to calculate the allocated memory size.)

Looking at jemalloc's implementation, these allocations will be taking the huge_malloc path, which rounds allocations up to the next chunksize (~1MB).

Nice find.  I'm surprised but glad that someone other than me is running DMD :)

The jemalloc size requests you have to watch out for are in the ranges 512--8192 bytes and 1--2MB, because in those ranges all the sizes are powers-of-two so you can end up with up to 50% slop if you're unlucky.
Whiteboard: [MemShrink] → [MemShrink:P2]
Taking bug to do investigation per memshrink meeting
Assignee: nobody → rjesup
Here's how this works in jemalloc:

In huge_malloc, let csize be size rounded up to the nearest megabyte, and let psize be size rounded up to the nearest page.

Allocate csize bytes.  Then, if MALLOC_DECOMMIT is defined, do

  if (csize - psize > 0)
    pages_decommit((void *)((uintptr_t)ret + psize), csize - psize);

So on Windows, where DECOMMIT is defined, we're not wasting anything.  On Linux, I think the assumption is that the mmapped memory beyond psize is never touched, so the OS never gives it a physical page.
IOW, I don't think the round-up-to-nearest-mb behavior is actually causing us to use more memory.
I think I agree with Justin - it will take memory space in the map, but won't take actual physical memory.  Taking space in the map still matters, since 32-bit processes will start to run out of usable VM space (and start having major perf problems) around 1.7-2.1GB VSS.  And in Win32 (though the logic is obtuse) MALLOC_DECOMMIT is set.

It may be useful to use MALLOC_DECOMMIT on Linux if it frees up VM space, but it's not a huge issue.  It would apply to all >1MB allocs (large images mostly).  It's easy to test (jlebar?).

There may be a use in grouping allocs of the video buffers when they're *smaller* than 1MB to reduce total slop, and above 1MB it may save address space if we don't do MALLOC_DECOMMIT.  I'll look into that.
(In reply to Randell Jesup [:jesup] from comment #5)

> It may be useful to use MALLOC_DECOMMIT on Linux if it frees up VM space,
> but it's not a huge issue.  It would apply to all >1MB allocs (large images
> mostly).  It's easy to test (jlebar?).

This would be worthwhile anyway.  I tried enabling it on Mac and it reduced memory footprint by 43%(!), but totally killed browser performance.
I was under the impression that the DECOMMIT wins on mac were due to madvise bugs -- if madvise FREE/DONT_NEED doesn't actually release memory, then of course decommit would help.

I did an informal experiment with DECOMMIT in bug 636220 comment 16 and 17 -- compare anonymous maps' RSS between comment 16 and 17 and they're close to the same.
(In reply to Justin Lebar [:jlebar] from comment #7)
> I was under the impression that the DECOMMIT wins on mac were due to madvise
> bugs -- if madvise FREE/DONT_NEED doesn't actually release memory, then of
> course decommit would help.

No, these are two separate issues.  The madvise issue is that madvise doesn't work on Mac OS X 10.5.  The DECOMMIT issue is that enabling MALLOC_DECOMMIT on Mac (that is, decommitting with mmap instead of madvise) causes a huge slowdown.
But did DECOMMIT give a massive footprint win on 10.6, where madvise works properly, or only on 10.5?  My thesis is that non-decommit plus a working madvise should use approximately as much memory as decommit.
DECOMMIT gave a much larger win than madvise, on 10.6 (where madvise works).
So there are two things to do here:

(1) Account for the slop in the media/decoded-{audio,video} memory reporters.  This can be done by using moz_malloc_usable_size, and if it returns 0 (which it can on platforms that don't support it), then fall back to the old calculation which doesn't account for slop.  You could fold in the slop to the existing reporters, or report it separately, depending on which you think is better.

(2) Determine whether the slop can be reduced.  That's the investigation that Randell agreed to do.

We can do both in this bug or file a separate bug for (1).  Any preferences?
I think it makes sense to have a separate bug, and to track the slop separately since in this case it's "only" wasted address space.
Filed bug 682195.
Randell, any progress on (2) from comment 11?
Whiteboard: [MemShrink:P2] → [MemShrink:P2][see comment 11]
Yes, sorry.

The problem is that the frames are allocated very separately - it's not creating a queue of 10 frames to (re)use, it's allocating and pushing frames into a queue until there are 10, and deallocating when they're consumed.

To fix this, we would need a recycling allocator (aka buffer pool) for a stream's video frames, pre-allocating 10 (AMPLE_VIDEO_FRAMES) buffers.  That would remove all the slop (except <1MB at the end of the pool) and also avoid a trip through the allocator on every frame -- not a huge win by itself, but useful when other allocations fragment or reuse the video's memory and force it to obtain new memory from the OS at times.

As an optimization, the buffer pool could either be told when playback is not in operation (and drop its memory), or drop its memory whenever all frames are consumed (even if this churns a little, it's better than the current behavior, and it may never churn).

With the current algorithm, it's largely impossible to reduce the slop.  The change would not need to be very intrusive; we would need a bufferpool object for each queue and the create-a-frame code would allocate through there.  Minor side thing to check or deal with - if the video changes resolution in mid-stream, the buffers may have to be dropped and re-allocated (and hold both temporarily).  (Yes, h.264 and probably others can change resolutions in mid-stream.)

Work is contained and not particularly complex though definitely non-trivial.  Win is potentially significant on video depending on the resolution; small win perhaps on performance.

Since the investigation is done, dropping the bug back in the pool for the video people to fight over (or really almost any person comfortable with allocators/etc could work on it).
Assignee: rjesup → nobody
So in the best case this would save us just under 10MB per video, i.e. just under 1MB per decoded frame.  And on Windows and Linux at least, this 10MB is wasted address space but probably has no other cost.

I guess this wouldn't hurt to fix but I think there are plenty of other MemShrink bugs with better effort:benefit ratios.
Whiteboard: [MemShrink:P2][see comment 11] → [MemShrink:P2][see comment 15 and 16]
> So in the best case this would save us just under 10MB per video, i.e. just under 1MB per 
> decoded frame.

Actually, it's no more than 4KB per decoded frame.  Although we round the virtual size up to the next MB, the committed size is rounded up to the next page.

As such, this is not a particularly important bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
We do actually care about how much address space is wasted, so this is still worth fixing at some point.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Summary: Investigate internal fragmentation in large media allocations → Fuse frame queue allocation to avoid wasted address space and per-frame malloc calls
Also, batched allocation and avoiding churn may help reduce VM fragmentation, so perhaps bump up on the radar?
dmajor, does this bug match up with your previous measurements of HTML <video> decoding causing additional memory usage or fragmentation?
Flags: needinfo?(dmajor)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #20)
> dmajor, does this bug match up with your previous measurements of HTML
> <video> decoding causing additional memory usage or fragmentation?

Yes. Here's the data from bug 930797 comment 54, scoped to VideoData::Create:

Reserve Type, Desired Size (B), Stack, Count, Committed Actual (B), Outstanding Committed (B)
AIFO, 2097152, xul.dll!mozilla::VideoData::Create, 176, 369098752, 243662848, 369098752, 369098752
AIFI, , , 66, 162529280, 0, 162529280, 0
, 2097152, xul.dll!mozilla::VideoData::Create, 43, 90177536, 0, 90177536, 0
, 3145728, xul.dll!mozilla::VideoData::Create, 23, 72351744, 0, 72351744, 0

How to read this: At the point of the OOM crash, there were 176 outstanding 2MB allocations from VideoData::Create. That was 369MB originally reserved, of which 243MB is still committed. Drilling down to individual allocations, nearly all were 1348448 committed out of 2097152 reserved -- so about 33% slop.

Over the course of the trace, there were an additional 43 2MB allocations and 23 3MB allocations that were allocated and released before the crash (no data on slop since the blocks are freed).
Flags: needinfo?(dmajor)
A correction and clarification:

(In reply to David Major [:dmajor] from comment #21)
> That was 369MB originally reserved, of which 243MB is still committed. 
That wording may be misleading. More like: 396MB *still* reserved, 243MB still committed. So we have lots of ~700KB blocks of VA reserved-but-uncommitted.

> nearly all were 1348448 committed out of 2097152 reserved 
Typo, the committed size should be 1384448 (in case that number has significance to you)
> More like: 396MB *still* reserved, 243MB
Sigh. 369MB *still* reserved. Can't type today.
Component: Audio/Video → Audio/Video: Playback
Severity: normal → S3