Open Bug 680358 Opened 13 years ago Updated 2 years ago

Fuse frame queue allocation to avoid wasted address space and per-frame malloc calls

Categories

(Core :: Audio/Video: Playback, defect)


REOPENED

People

(Reporter: kinetik, Unassigned)

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P2][see comment 15 and 16])

Running DMD on a local test video revealed larger than expected internal fragmentation on our video frame allocations:

==5965== Unreported: 7,708,672 (cumulative: 7,708,672) bytes in 11 heap block(s) in record 1 of 11426:
==5965==  Requested bytes unreported: 1,536,000 / 16,896,000
==5965==  Slop      bytes unreported: 6,172,672 / 6,172,672
==5965==    at 0x4A06065: malloc (vg_replace_malloc.c:261)
==5965==    by 0x5678F26: moz_xmalloc (mozalloc.cpp:110)
==5965==    by 0x7A49C75: mozilla::layers::BasicPlanarYCbCrImage::SetData(mozilla::layers::PlanarYCbCrImage::Data const&) (mozalloc.h:241)
==5965==    by 0x73BCEEB: VideoData::Create(nsVideoInfo&, mozilla::layers::ImageContainer*, long, long, long, VideoData::YCbCrBuffer const&, int, long, nsIntRect) (nsBuiltinDecoderReader.cpp:186)

The test video is 800x480, so each frame allocation is 1536000 bytes.  The decoded frame queue holds 10 frames.  The allocator appears to be allocating 2097152 bytes per frame, resulting in 25% slop.

(Note that one entire frame is classified as unreported because it has been allocated but is not currently present in the decode queue used to calculate the allocated memory size.)

Looking at jemalloc's implementation, these allocations will be taking the huge_malloc path, which rounds allocations up to the next chunksize (~1MB).

Nice find.  I'm surprised but glad that someone other than me is running DMD :)

The jemalloc size requests you have to watch out for are in the ranges 512--8192 bytes and 1--2MB, because in those ranges all the sizes are powers-of-two so you can end up with up to 50% slop if you're unlucky.
Whiteboard: [MemShrink] → [MemShrink:P2]
Taking bug to do investigation per memshrink meeting
Assignee: nobody → rjesup
Here's how this works in jemalloc:

In huge_malloc, let csize be size rounded up to the nearest megabyte, and let psize be size rounded up to the nearest page.

Allocate csize bytes.  Then, if MALLOC_DECOMMIT is defined, do

  if (csize - psize > 0)
    pages_decommit((void *)((uintptr_t)ret + psize), csize - psize);

So on Windows, where DECOMMIT is defined, we're not wasting anything.  On Linux, I think the assumption is that the mmapped memory beyond psize is never touched, so the OS never gives it a physical page.
IOW, I don't think the round-up-to-nearest-mb behavior is actually causing us to use more memory.
I think I agree with Justin - it will take memory space in the map, but won't take actual physical memory.  Taking space in the map still matters, since 32-bit processes will start to run out of usable VM space (and start having major perf problems) around 1.7-2.1GB VSS.  And in Win32 (though the logic is obtuse) MALLOC_DECOMMIT is set.

It may be useful to use MALLOC_DECOMMIT on Linux if it frees up VM space, but it's not a huge issue.  It would apply to all >1MB allocs (large images mostly).  It's easy to test (jlebar?).

There may be a use in grouping allocs of the video buffers when they're *smaller* than 1MB to reduce total slop, and above 1MB it may save address space if we don't do MALLOC_DECOMMIT.  I'll look into that.
(In reply to Randell Jesup [:jesup] from comment #5)

> It may be useful to use MALLOC_DECOMMIT on Linux if it frees up VM space,
> but it's not a huge issue.  It would apply to all >1MB allocs (large images
> mostly).  It's easy to test (jlebar?).

This would be worthwhile anyway.  I tried enabling it on Mac and it reduced memory footprint by 43%(!), but totally killed browser performance.
I was under the impression that the DECOMMIT wins on mac were due to madvise bugs -- if madvise FREE/DONT_NEED doesn't actually release memory, then of course decommit would help.

I did an informal experiment with DECOMMIT in bug 636220 comment 16 and 17 -- compare anonymous maps' RSS between comment 16 and 17 and they're close to the same.
(In reply to Justin Lebar [:jlebar] from comment #7)
> I was under the impression that the DECOMMIT wins on mac were due to madvise
> bugs -- if madvise FREE/DONT_NEED doesn't actually release memory, then of
> course decommit would help.

No, these are two separate issues.  The madvise issue is that madvise doesn't work on Mac OS X 10.5.  The DECOMMIT issue is that enabling MALLOC_DECOMMIT on Mac (that is, decommitting with mmap instead of madvise) causes a huge slowdown.
But did DECOMMIT give a massive footprint win on 10.6, where madvise works properly, or only on 10.5?  My thesis is that non-decommit plus a working madvise should use approximately as much memory as decommit.
DECOMMIT gave a much larger win than madvise, on 10.6 (where madvise works).
So there are two things to do here:

(1) Account for the slop in the media/decoded-{audio,video} memory reporters.  This can be done by using moz_malloc_usable_size, and if it returns 0 (which it can on platforms that don't support it), then fall back to the old calculation which doesn't account for slop.  You could fold in the slop to the existing reporters, or report it separately, depending on which you think is better.

(2) Determine whether the slop can be reduced.  That's the investigation that Randell agreed to do.

We can do both in this bug or file a separate bug for (1).  Any preferences?
I think it makes sense to have a separate bug, and to track the slop separately since in this case it's "only" wasted address space.
Filed bug 682195.
Randell, any progress on (2) from comment 11?
Whiteboard: [MemShrink:P2] → [MemShrink:P2][see comment 11]
Yes, sorry.

The problem is that the frames are allocated very separately - it's not creating a queue of 10 frames to (re)use, it's allocating and pushing frames into a queue until there are 10, and deallocating when they're consumed.

To fix this, we would need a recycling allocator (aka buffer pool) for a stream's video frames, pre-allocating 10 (AMPLE_VIDEO_FRAMES) buffers.  That would remove all the slop (except <1MB at the end of the pool) and also avoid a trip through the allocator on every frame -- not a huge win by itself, but useful when other allocations fragment or reuse the video's memory and force it to obtain new memory from the OS at times.

As an optimization, the buffer pool could either be told when playback is not in operation (and drop its memory), or drop its memory whenever all frames are consumed (even if this churns a little, it's better than the current behavior, and it may never churn).

With the current algorithm, it's largely impossible to reduce the slop.  The change would not need to be very intrusive; we would need a bufferpool object for each queue and the create-a-frame code would allocate through there.  Minor side thing to check or deal with - if the video changes resolution in mid-stream, the buffers may have to be dropped and re-allocated (and hold both temporarily).  (Yes, h.264 and probably others can change resolutions in mid-stream.)

Work is contained and not particularly complex though definitely non-trivial.  Win is potentially significant on video depending on the resolution; small win perhaps on performance.

Since the investigation is done, dropping the bug back in the pool for the video people to fight over (or really almost any person comfortable with allocators/etc could work on it).
Assignee: rjesup → nobody
So in the best case this would save us just under 10MB per video, i.e. just under 1MB per decoded frame.  And on Windows and Linux at least, this 10MB is wasted address space but probably has no other cost.

I guess this wouldn't hurt to fix but I think there are plenty of other MemShrink bugs with better effort:benefit ratios.
Whiteboard: [MemShrink:P2][see comment 11] → [MemShrink:P2][see comment 15 and 16]
> So in the best case this would save us just under 10MB per video, i.e. just under 1MB per 
> decoded frame.

Actually, it's no more than 4KB per decoded frame.  Although we round the virtual size up to the next MB, the committed size is rounded up to the next page.

As such, this is not a particularly important bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
We do actually care about how much address space is wasted, so this is still worth fixing at some point.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Summary: Investigate internal fragmentation in large media allocations → Fuse frame queue allocation to avoid wasted address space and per-frame malloc calls
Also, batched allocation and avoiding churn may help reduce VM fragmentation, so perhaps bump up on the radar?
dmajor, does this bug match up with your previous measurements of HTML <video> decoding causing additional memory usage or fragmentation?
Flags: needinfo?(dmajor)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #20)
> dmajor, does this bug match up with your previous measurements of HTML
> <video> decoding causing additional memory usage or fragmentation?

Yes. Here's the data from bug 930797 comment 54, scoped to VideoData::Create:

Reserve Type, Desired Size (B), Stack, Count, Committed Actual (B), Outstanding Committed (B)
AIFO, 2097152, xul.dll!mozilla::VideoData::Create, 176, 369098752, 243662848, 369098752, 369098752
AIFI, , , 66, 162529280, 0, 162529280, 0
, 2097152, xul.dll!mozilla::VideoData::Create, 43, 90177536, 0, 90177536, 0
, 3145728, xul.dll!mozilla::VideoData::Create, 23, 72351744, 0, 72351744, 0

How to read this: At the point of the OOM crash, there were 176 outstanding 2MB allocations from VideoData::Create. That was 369MB originally reserved, of which 243MB is still committed. Drilling down to individual allocations, nearly all were 1348448 committed out of 2097152 reserved -- so about 33% slop.

Over the course of the trace, there were an additional 43 2MB allocations and 23 3MB allocations that were allocated and released before the crash (no data on slop since the blocks are freed).
Flags: needinfo?(dmajor)
A correction and clarification:

(In reply to David Major [:dmajor] from comment #21)
> That was 369MB originally reserved, of which 243MB is still committed. 
That wording may be misleading. More like: 396MB *still* reserved, 243MB still committed. So we have lots of ~700KB blocks of VA reserved-but-uncommitted.

> nearly all were 1348448 committed out of 2097152 reserved 
Typo, the committed size should be 1384448 (in case that number has significance to you)
> More like: 396MB *still* reserved, 243MB
Sigh. 369MB *still* reserved. Can't type today.
Component: Audio/Video → Audio/Video: Playback
Severity: normal → S3