Closed Bug 1697476 Opened 3 years ago Closed 2 years ago

NS_ERROR_OUT_OF_MEMORY after 1hour+ of playback due to accumulation in TrackBuffersManager

Categories

(Core :: Audio/Video, defect, P1)

Firefox 86
Desktop
Unspecified
defect

Tracking


VERIFIED FIXED
99 Branch
Tracking Status
firefox-esr91 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox97 --- wontfix
firefox98 --- wontfix
firefox99 --- verified

People

(Reporter: jordan, Assigned: bryce)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(3 files)

Attached file memory-report.json.gz

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 OPR/74.0.3911.203

Steps to reproduce:

Visit either:

on Windows and play back the video for 1h to 1h30.

Actual results:

After 1h to 1h30 of playback, the video stops/freezes.
The following error can be seen in the console:
NS_ERROR_OUT_OF_MEMORY

I've attached a memory report retrieved with the about:memory tool.

Expected results:

The playback of the video should continue smoothly without the NS_ERROR_OUT_OF_MEMORY error.

I have reproduced this issue by opening the first link and letting the video play between 40 minutes and 1 hour on Nightly v88.0a1 and Release v86.0.

It is still unknown whether it also reproduces on Mac and Ubuntu, or whether it is a regression, but considering the long investigation time, I will perform the remaining investigations only if necessary.

@mccr8 : Can you deduce something that might help address this issue further from the memory report attached? Thanks!
Please NI me and let me know if more testing is needed.

Status: UNCONFIRMED → NEW
Component: Untriaged → Audio/Video: Playback
Ever confirmed: true
Flags: needinfo?(continuation)
Product: Firefox → Core
Hardware: Unspecified → Desktop

Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.

Most of the stuff in the memory report looks completely normal.

The big exception is this entry in the GPU process:
4,460,498.35 MB ── gfx-d2d-vram-source-surface

This value is incremented in SourceSurfaceD2D1::DrawTargetWillChange(), so presumably we're leaking these bitmaps. Hopefully a video or graphics person might understand how that could happen.

Flags: needinfo?(continuation)
Summary: NS_ERROR_OUT_OF_MEMORY after 1hour+ of playback with DRM → gfx-d2d-vram-source-surface leaks on video with DRM

Thanks for the report and the confirmation that it can be reproduced. Given that the leak appears to be a distinct type of object, hopefully people will be able to investigate it without running the video for quite as long.

Moving to the gfx component since the leak seems to happen in D2D code.

Component: Audio/Video: Playback → Graphics

Sotaro, are you able to reproduce this?

Severity: -- → S2
Flags: needinfo?(sotaro.ikeda.g)

jordan, can you attach about:support to this bug? Thank you.

Flags: needinfo?(sotaro.ikeda.g) → needinfo?(jordan)
Flags: needinfo?(sotaro.ikeda.g)
Attached file about_support.txt
Flags: needinfo?(jordan)

SourceSurfaceD2D1 usage in the GPU process happens only with remoted canvas 2d, and it seems that SourceSurfaceD2D1::DrawTargetWillChange() is no longer called since the bug 1696967 fix. By reverting that fix on latest m-c, I saw the DrawTargetWillChange() call in the GPU process with the STR.

It is weird that mVRAMUsageSS is incremented in only one place and the value is never decremented.

mVRAMUsageSS seems to be broken on current m-c.

jordan, can you confirm if the problem is addressed with latest nightly?

Flags: needinfo?(jordan)
Depends on: 1699352
Flags: needinfo?(sotaro.ikeda.g)
Depends on: 1696967

(In reply to Sotaro Ikeda [:sotaro] from comment #11)

jordan, can you confirm if the problem is addressed with latest nightly?

I saw the problem even with latest m-c.

Flags: needinfo?(jordan)

Per comment 10, the D2D memory leak did not happen, so it might not be a graphics bug.

There was a case where the console had the following error.

Error Code: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005)
Details: ipc::IPCResult __cdecl mozilla::gmp::ChromiumCDMParent::RecvIncreaseShmemPoolSize(void): Failled to ensure CDM has enough shmems.

With latest m-c, I got the following error in console.


Media resource could not be decoded. 6.0.0-EAB-2
08:00:55.060 [clpp.video.events] Received Video Element Event: 'error' cl.core.min.js:225:456
Player Error: undefined
error { target: video.md-image-xxx.clpp-fill, isTrusted: true, srcElement: video.md-image-xxx.clpp-fill, currentTarget: video.md-image-xxx.clpp-fill
, eventPhase: 2, bubbles: false, cancelable: false, returnValue: true, defaultPrevented: false, composed: false, … }
AppPlayer.vue:195:18
This error message will be blank when privacy.resistFingerprinting = true. If it is really necessary, please add it to the whitelist in MediaError::GetMessage: NS_ERROR_OUT_OF_MEMORY (0x8007000e) cl.mse.min.js:104:481
08:00:55.136 [clpp.mse.Player] Player state change: PLAYING -> ERROR cl.core.min.js:225:456
Player Error: FATAL Clpp-Error [Category 3 - Code 3016]: NS_ERROR_OUT_OF_MEMORY (0x8007000e)
Object { message: "FATAL Clpp-Error [Category 3 - Code 3016]: NS_ERROR_OUT_OF_MEMORY (0x8007000e)", stack: "2582/</</</s.H@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:20056\nNn/<@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:232499\n", severity: 2, code: 3016, data: {…}, a: null }
AppPlayer.vue:195:18
08:00:55.328 [clpp.mse.Streaming] (video:1) failed fetch and append: code=3014 cl.core.min.js:226:31
Player Error: FATAL Clpp-Error [Category 3 - Code 3014]: {"mediaError":3}
Object { message: "FATAL Clpp-Error [Category 3 - Code 3014]: {"mediaError":3}", stack: "2582/</</</s.H@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:20056\no/s.t.hf@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:257722\n", severity: 2, code: 3014, data: {…}, a: null }
AppPlayer.vue:195:18
08:00:55.631 [clpp.mse.Streaming] (audio:7) failed fetch and append: code=3015 cl.core.min.js:226:31
Player Error: FATAL Clpp-Error [Category 3 - Code 3015]: {"mediaError":{}}
Object { message: "FATAL Clpp-Error [Category 3 - Code 3015]: {"mediaError":{}}", stack: "2582/</</</s.H@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:20056\npe@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:197137\noe@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:195989\nhn/<@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:225235\n", severity: 2, code: 3015, data: {…}, a: null }
AppPlayer.vue:195:18
Media resource could not be decoded, error: Error Code: NS_ERROR_OUT_OF_MEMORY (0x8007000e) 6.0.0-EAB-2
08:01:23.552 [clpp.mse.Player] Player is in ERROR state and can not switch to BUFFERING cl.core.min.js:225:456
08:01:23.611 [clpp.video.events] Received Video Element Event: 'ratechange'

gfx-d2d-vram-source-surface leaks did not happen; the gfx-d2d-vram-source-surface report was wrong. Then it seems better for this to be looked at by the media team.

Component: Graphics → Audio/Video

There is also a Canvas2D memory leak (bug 1697344). It might be related.

Depends on: 1697344

Hello,

Do you have any news about this issue?

Best,
Philippe

Flags: needinfo?(jmuizelaar)

Bryce, can you reproduce this? Do you have an idea what might be leaking?

Flags: needinfo?(jmuizelaar) → needinfo?(bvandyk)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)

Bryce, can you reproduce this? Do you have an idea what might be leaking?

At this stage I just get CORS errors for both the original test cases.

Reporter, has the configuration changed on the server hosting the media? If so, could it be reverted so we can test this further?

Flags: needinfo?(jordan)

Philippe, are you able to help with the questions in comment 21?

Flags: needinfo?(philippe)

Clearing NI as this is hard to address without a repro.

The info that we have suggests we are running out of shmems to send the CDM. I'm not sure why this would happen, but if I had to speculate it could be as a result of the codec taking up more memory than we anticipate and/or the existing shmems becoming fragmented (not clear how).

Please NI me if we have a repro.

Flags: needinfo?(bvandyk)

Thinking about this, I can probably MITM myself to overcome the CORS issues with tests. So I'll hold NI, but don't currently have a lot of cycles.

Flags: needinfo?(bvandyk)

The Codepen link has been updated with a working source that reproduced the NS_ERROR_OUT_OF_MEMORY issue.
https://codepen.io/JordanKinow/pen/eYdaZaz

Flags: needinfo?(jordan)

(In reply to jordan from comment #25)

The Codepen link has been updated with a working source that reproduced the NS_ERROR_OUT_OF_MEMORY issue.
https://codepen.io/JordanKinow/pen/eYdaZaz

Thanks, much appreciated.

Flags: needinfo?(bvandyk)

I've been debugging this and bug 1732492. I think they're likely the same underlying issue, but am holding off duping until we have a better idea what exactly is going on.

I'm about to be on an extended pto, so dumping my investigation so far for other engineers to follow up.

  • The repro case in comment 25 is a good one. It can be reproduced with a local build, whilst the Amazon case requires an official build (which means you can't easily debug local changes).
  • This seems to be a product of high resolution. The cases we have are ~1920*1080. I've played lower res videos (~720p) for 5+ hours without failure. It's possible that the lower res videos would eventually fault, but I don't have any evidence of that. It would be interesting to know if even higher resolution triggered the fault faster.
  • We appear to be throwing an OOM in the TrackBuffersManager when we fail to append.
    • Catching this in a debugger, I see that line failing when the task involved has mBuffer = {mRawPtr=0x0000028c1416d570, size = 2279083} RefPtr<mozilla::MediaByteBuffer>, and when mInputBuffer has
        mBuffer   {mRawPtr=0x0000028c18cfd0b0, size = 2145931084, capacity = 2146435064}   RefPtr<mozilla::MediaByteBuffer>
        mStart    2145931008   unsigned __int64
        mLength   76   unsigned __int64
      In other words, the underlying buffer has grown to roughly 2 GiB while only 76 bytes of it are still live; nearly all of it is already-consumed data that was never released.

Based on the above, it seems like the issue is in our media source code, and is not specific to EME/DRM. However, I remain hesitant to rule out an EME/DRM connection, given both the repro cases involve it. That said, I think the media source lead is the best one to follow here.

I've gathered the following sets of logs while reproing the issue here. They are too large to attach to the bug, so I am providing GDrive links that should work for other Mozillians.

These logs are me reproing with MOZ_LOG=sync,timestamp,GMP:5,EME:5,MediaDecoder:5,PlatformDecoderModule:5,MediaDemuxer:5,MediaFormatReader:5,MP4Metadata:5

and these with MOZ_LOG=sync,timestamp,GMP:5,EME:5,MediaDecoder:5,PlatformDecoderModule:5,MediaDemuxer:5,MediaFormatReader:5,MP4Metadata:5,MediaSource:5

Flags: needinfo?(jmuizelaar)

I'm continuing to look into this. Further notes

  • This can reproduce on the example provided with a video only stream, and does not reproduce with an audio only stream. Not surprising, as video is going to take up more space than audio, but useful to know we can narrow the scope.
    • The video only case may trigger OOMs in different parts of the code than I note in comment 27, so I need to chase this more.
  • I've crafted some very large videos (4K, 2 hour long loop of big buck bunny) and have been trying to reproduce using Shaka player, and cannot. This is true for both encrypted and un-encrypted media.
    • It's possible there's something different about this media in the test case above compared to the media I have.
    • It's possible that there's a difference in the players being used (e.g. shaka vs bitmovin).

I'm currently investigating those last points to see if I can further narrow down what's going on.

Further investigation suggests this problem is triggered by the media involved. I.e. of the two possibilities I was following up on from comment 28, the relevant one is:

It's possible there's something different about this media in the test case above compared to the media I have.

I was not able to reproduce the issue using my own media (encrypted or not) in the bitmovin player. So if there is a JS component involved, it's only part of the problem. However, at this stage I think what's going on is specifically media related.

I'm still tracing the problem. The issue does appear to originate in our Media Source code, and looks to be caused by a logic error where we get stuck not clearing the input buffer as expected. I hesitate to speculate further at this stage, and will report back once I've debugged further.

I'm pretty confident I've found the media aspect of this issue. I think we can fix this with a small patch that's suitable for uplift, and deal with further work in follow ups. It would be nice to get testing in place to cover this, which I think we can do if we get better memory reporting in place. I'll raise a follow up for that too.

A detailed explanation of the issue follows, with mitigations after that. TL;DR: we will accumulate data forever in some of our Media Source machinery when fed certain mp4 files, and I have a fix.


Bug 1558364 made changes to how we handle data being input to our Media Source stack. Loosely, it means that instead of doing removals from the start of an array, and incurring the cost of relocating the array each time, we instead use a span object to track our buffers.

I.e. instead of having an array and handling removal from the front of the array as follows

[chunk, chunk] -remove->  [empty, chunk] -resize-> [chunk]

we use a span and track removal as follows

     span                         span
  ----------                      ----
 v          v                    v   v
[chunk, chunk] -remove-> [chunk, chunk]

In the span case, we don't end up needing to resize the underlying buffer as we remove data. This means if we have a series of removals needed as we process appended data, we don't incur a repeated cost from removing them.
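
To make that concrete, here is a minimal standalone sketch of the idea (simplified, assumed names, not the actual TrackBuffersManager code): removing data from the front just advances an offset into the shared underlying buffer instead of shifting the remaining bytes down each time.

// Minimal sketch of the span-style input buffer (assumed/simplified names,
// not the real Gecko types). Append grows the underlying storage; RemoveFront
// is O(1) because it only moves the span's start offset forward.
#include <cstddef>
#include <cstdint>
#include <vector>

struct SpanInputBuffer {
  std::vector<uint8_t> mBuffer;  // underlying storage, only ever grows
  size_t mStart = 0;             // offset of the live span into mBuffer
  size_t mLength = 0;            // number of live (unparsed) bytes

  void Append(const uint8_t* aData, size_t aSize) {
    mBuffer.insert(mBuffer.end(), aData, aData + aSize);
    mLength += aSize;
  }

  void RemoveFront(size_t aSize) {
    // The consumed bytes stay allocated in mBuffer; only the span moves.
    mStart += aSize;
    mLength -= aSize;
  }

  bool IsEmpty() const { return mLength == 0; }
};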

However, the span approach needs to clear its underlying buffer at some point; if we keep appending, it will grow indefinitely and we'll eventually run out of memory. There are a few places we can clear the underlying buffer, and code was added in bug 1558364 specifically to avoid this issue. This is the code.

However, that code is only triggered if mInputBuffer is Nothing or isEmpty, i.e. if our input span is empty. Our anticipated use case is that this span is empty when we start an append. The high-level flow expected is: append data -> parse all the data -> start the next append -> clear the buffer because all previous data has been parsed. In this case, the buffer is cleared as we start new appends. This code also covers the case where we sometimes don't have an empty span, but where we eventually do: so long as we eventually parse all the data and leave the input empty, the underlying buffer will be cleared.
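
A paraphrased sketch of that guard, continuing the SpanInputBuffer example above (the names are simplified; the real check is the mInputBuffer Nothing/isEmpty test described in the previous paragraph):

// Continuing the SpanInputBuffer sketch: the underlying storage is only
// released when the previous append was fully parsed, mirroring the
// "mInputBuffer is Nothing or isEmpty" check described above.
void StartAppend(SpanInputBuffer& aInput, const uint8_t* aData, size_t aSize) {
  if (aInput.IsEmpty()) {
    // Fast path from bug 1558364: everything was parsed, so drop the old
    // storage before taking the new data.
    aInput.mBuffer.clear();
    aInput.mBuffer.shrink_to_fit();
    aInput.mStart = 0;
  }
  // If the span is NOT empty we simply keep appending to the same
  // underlying buffer.
  aInput.Append(aData, aSize);
}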

Where this code doesn't work is if we consistently have data left over from parses, which is what happens here. In that case the buffer grows forever.

This happens because every append in the stream on this bug has trailing styp and sidx boxes. I don't believe placing these boxes after the media data is forbidden by the spec (though I'd appreciate a check of my reading), but it is unusual. Gecko's mp4 container parser would normally read over that data as part of parsing the media segment, but when it follows the media segment, the parser leaves it there until it parses the next segment.

So instead of each segment being parsed like so when the boxes precede the media data

[[styp],[sidx],[moof],[mdat]]
 ^                         ^
 ---------------------------
            parse

the data is not fully parsed, like so

[[moof],[mdat],[styp],[sidx]]
  ^         ^
   ----------
    parse

then when we get another segment appended, the span state and area parsed look like this

                                span
                -----------------------------------------
                v                                       v
[[moof],[mdat],[styp],[sidx]],[[moof],[mdat],[styp],[sidx]]
                ^                         ^
                --------------------------
                            parse

So every time we parse we're left with some trailing data in the span. Because there's always a tail, we never clear the underlying buffer and eventually OOM.
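
As a rough consistency check against the debugger state in comment 27 (the per-append size is taken from that single failing append and is only an assumption about the rest of the stream), the growth rate lines up with the reported failure window: at a few seconds of media per ~2.3 MB segment, reaching the ~2 GiB capacity takes on the order of an hour of playback.

// Back-of-the-envelope check using the numbers from comment 27: if each
// append adds a ~2.28 MB segment and nothing is ever released, how many
// appends does it take to reach the ~2 GiB capacity observed at the OOM?
#include <cstdio>

int main() {
  const double segmentBytes = 2279083.0;      // size of the failing append
  const double capacityBytes = 2146435064.0;  // mInputBuffer capacity at OOM
  std::printf("appends to reach capacity: ~%.0f\n",
              capacityBytes / segmentBytes);  // prints ~942
  return 0;
}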


I propose we mitigate this by clearing the underlying buffer whether the span is empty or not. If the span isn't empty, we copy any remaining data into a new buffer. For the cases where we don't have any leftover data, we'll continue to take the fast path introduced in bug 1558364. For cases where we have data left over, we'll pay some cost for an allocation and copy, but we'll avoid this OOM, and I believe we'll typically have better performance than the pre-bug-1558364 case (primarily because we can batch what would previously have been several removals from our buffer into one large move).
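
Sketched against the same simplified SpanInputBuffer type (this is not the actual patch, which operates on MSE's mInputBuffer / MediaByteBuffer): at the start of an append, any already-consumed prefix is released, copying only the leftover tail into a fresh allocation when the span is not empty.

// Sketch of the proposed mitigation (simplified names): always release the
// already-consumed prefix at append time, copying only the unparsed tail.
void StartAppendShrinking(SpanInputBuffer& aInput, const uint8_t* aData,
                          size_t aSize) {
  if (aInput.mStart != 0) {
    // Keep only the unparsed tail; the old, possibly huge buffer is freed
    // when `fresh` is swapped in and the original goes out of scope.
    std::vector<uint8_t> fresh(aInput.mBuffer.begin() + aInput.mStart,
                               aInput.mBuffer.begin() + aInput.mStart +
                                   aInput.mLength);
    aInput.mBuffer.swap(fresh);
    aInput.mStart = 0;
  }
  // When mStart == 0 nothing has been consumed yet, so there is nothing to
  // shrink and the append proceeds as before.
  aInput.Append(aData, aSize);
}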

We can do further mitigations by having the mp4 container parser skip over this data so our span becomes empty and we take the fast path. I'd looked at doing so, and think it could make sense to follow up with this. However, I see this as a bigger patch and harder to uplift. I have further thoughts on this, but will save them for a follow up bug.

It would also be good to have better memory reporting from Media Source code. I spent a lot of time chasing the OOM location, and better reporting could have helped narrow that down faster.

I'll get a patch up and follow up bugs raised.

Flags: needinfo?(philippe)
Flags: needinfo?(jmuizelaar)
Regressed by: 1558364

Set release status flags based on info from the regressing bug 1558364

This prevents us eventually overflowing the buffer when getting certain data
appended.

Assignee: nobody → bvandyk
Priority: -- → P1
Summary: gfx-d2d-vram-source-surface leaks on video with DRM → NS_ERROR_OUT_OF_MEMORY after 1hour+ of playback with DRM

FWIW, I don't think there's anything DRM specific about the bug except that our repros happen to be encrypted streams. I'll verify when I create a test case next week, but the boxes tripping us up are just normal mp4 ones. I.e. this could happen in non-encrypted streams in the same way.

Edit: confirming this is not DRM specific. The testcase in bug 1755822 is DRM free but repros.

Summary: NS_ERROR_OUT_OF_MEMORY after 1hour+ of playback with DRM → NS_ERROR_OUT_OF_MEMORY after 1hour+ of playback due to accumulation in TrackBuffersManager
Pushed by bvandyk@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a5c20fbe5d5e
Shrink MSE's mInputBuffer during append even if it's not empty. r=media-playback-reviewers,kinetik,alwu
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch

The patch landed in nightly and beta is affected.
:bryce, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(bvandyk)

(In reply to Release mgmt bot [:marco/ :calixte] from comment #36)

The patch landed in nightly and beta is affected.
:bryce, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Yep. Waiting on

  • Someone to test that this fixes bug 1732492, as the account I have is geolocked and doesn't work in my region.
  • Bug 1755822 to land so we have some coverage.
Flags: needinfo?(bvandyk)

Hey Bryce, can we dupe bug 1732492 over to this one?

(In reply to Jim Mathies [:jimm] from comment #38)

Hey Bryce, can we dupe bug 1732492 over to this one?

Yes. Could you test that bug 1732492 is fixed using Amazon's test stream? I'd like to confirm that it's fixed before we dupe it.

Our last beta shipped, we build our release build on Monday.

Flags: qe-verify+

I reproduced this issue using the second link from the Description on Firefox 98.0(20220304153049) on Win10. Confirmed as fixed on Firefox 99.0b8(20220324185704), Nightly 100.0a1(20220327213550) on Win10 64-bits. If there's a need to verify this on other platforms, please let me know.
Later edit: Also verified on MacOS 12.

Status: RESOLVED → VERIFIED
Flags: qe-verify+