NS_ERROR_OUT_OF_MEMORY after 1 hour+ of playback due to accumulation in TrackBuffersManager
Categories
(Core :: Audio/Video, defect, P1)
Tracking
()
People
(Reporter: jordan, Assigned: bryce)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: regression)
Attachments
(3 files)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 OPR/74.0.3911.203
Steps to reproduce:
Visit either:
on Windows and play the video for 1h–1h30.
Actual results:
After 1h–1h30 of playback, the video stops/freezes.
The following error can be seen in the console:
NS_ERROR_OUT_OF_MEMORY
I've attached a memory report retrieved with the about:memory tool.
Expected results:
The playback of the video should continue smoothly without the NS_ERROR_OUT_OF_MEMORY error.
Comment 1•3 years ago
I have reproduced this issue by opening the first link and letting the video play between 40 minutes and 1 hour on Nightly v88.0a1 and Release v86.0.
It is still unknown whether it also reproduces on Mac and Ubuntu, or whether it is a regression, but considering the long investigation time, I will perform the remaining investigations only if necessary.
@mccr8 : Can you deduce something that might help address this issue further from the memory report attached? Thanks!
Please NI me and let me know if more testing is needed.
Comment 2•3 years ago
Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.
Comment 3•3 years ago
Most of the stuff in the memory report looks completely normal.
The big exception is this entry in the GPU process:
4,460,498.35 MB ── gfx-d2d-vram-source-surface
This value is incremented in SourceSurfaceD2D1::DrawTargetWillChange(), so presumably we're leaking these bitmaps. Hopefully a video or graphics person might understand how that could happen.
Updated•3 years ago
Comment 4•3 years ago
Thanks for the report and the confirmation that it can be reproduced. Given that the leak appears to be a distinct type of object, hopefully people will be able to investigate it without running the video for quite as long.
Comment 5•3 years ago
Moving to the gfx component, as the leak seems to happen in D2D code.
Comment 6•3 years ago
Sotaro, are you able to reproduce this?
Comment 7•3 years ago
jordan, can you attach about:support to this bug? Thank you.
Updated•3 years ago
Comment 9•3 years ago
SourceSurfaceD2D1 usage in the GPU process happens only with remoted canvas 2d. And it seems that SourceSurfaceD2D1::DrawTargetWillChange() has not been called since the Bug 1696967 fix. By reverting that fix on latest m-c, I saw the DrawTargetWillChange() call in the GPU process with the STR.
Comment 10•3 years ago
It is weird that mVRAMUsageSS is incremented only at one place and the value is not decremented.
mVRAMUsageSS seems to be broken on current m-c.
Comment 11•3 years ago
jordan, can you confirm if the problem is addressed with latest nightly?
Comment 12•3 years ago
Bug 1699352 was added for comment 10.
Comment 13•3 years ago
(In reply to Sotaro Ikeda [:sotaro] from comment #11)
jordan, can you confirm if the problem is addressed with latest nightly?
I saw the problem even with latest m-c.
Comment 14•3 years ago
Per comment 10, the D2D memory leak did not actually happen. Then it might not be a graphics bug.
Comment 15•3 years ago
There was a case where the console had the following error.
Error Code: NS_ERROR_DOM_MEDIA_FATAL_ERR (0x806e0005)
Details: ipc::IPCResult __cdecl mozilla::gmp::ChromiumCDMParent::RecvIncreaseShmemPoolSize(void): Failled to ensure CDM has enough shmems.
Comment 16•3 years ago
With latest m-c, I got the following error in console.
Media resource could not be decoded. 6.0.0-EAB-2
08:00:55.060 [clpp.video.events] Received Video Element Event: 'error' cl.core.min.js:225:456
Player Error: undefined
error { target: video.md-image-xxx.clpp-fill, isTrusted: true, srcElement: video.md-image-xxx.clpp-fill, currentTarget: video.md-image-xxx.clpp-fill
, eventPhase: 2, bubbles: false, cancelable: false, returnValue: true, defaultPrevented: false, composed: false, … }
AppPlayer.vue:195:18
This error message will be blank when privacy.resistFingerprinting = true. If it is really necessary, please add it to the whitelist in MediaError::GetMessage: NS_ERROR_OUT_OF_MEMORY (0x8007000e) cl.mse.min.js:104:481
08:00:55.136 [clpp.mse.Player] Player state change: PLAYING -> ERROR cl.core.min.js:225:456
Player Error: FATAL Clpp-Error [Category 3 - Code 3016]: NS_ERROR_OUT_OF_MEMORY (0x8007000e)
Object { message: "FATAL Clpp-Error [Category 3 - Code 3016]: NS_ERROR_OUT_OF_MEMORY (0x8007000e)", stack: "2582/</</</s.H@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:20056\nNn/<@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:232499\n", severity: 2, code: 3016, data: {…}, a: null }
AppPlayer.vue:195:18
08:00:55.328 [clpp.mse.Streaming] (video:1) failed fetch and append: code=3014 cl.core.min.js:226:31
Player Error: FATAL Clpp-Error [Category 3 - Code 3014]: {"mediaError":3}
Object { message: "FATAL Clpp-Error [Category 3 - Code 3014]: {"mediaError":3}", stack: "2582/</</</s.H@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:20056\no/s.t.hf@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:257722\n", severity: 2, code: 3014, data: {…}, a: null }
AppPlayer.vue:195:18
08:00:55.631 [clpp.mse.Streaming] (audio:7) failed fetch and append: code=3015 cl.core.min.js:226:31
Player Error: FATAL Clpp-Error [Category 3 - Code 3015]: {"mediaError":{}}
Object { message: "FATAL Clpp-Error [Category 3 - Code 3015]: {"mediaError":{}}", stack: "2582/</</</s.H@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:20056\npe@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:197137\noe@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:195989\nhn/<@https://demo.castlabs.com/v6/6.0.0-EAB-2/js/app.fa7631ca.js:1:225235\n", severity: 2, code: 3015, data: {…}, a: null }
AppPlayer.vue:195:18
Media resource could not be decoded, error: Error Code: NS_ERROR_OUT_OF_MEMORY (0x8007000e) 6.0.0-EAB-2
08:01:23.552 [clpp.mse.Player] Player is in ERROR state and can not switch to BUFFERING cl.core.min.js:225:456
08:01:23.611 [clpp.video.events] Received Video Element Event: 'ratechange'
Comment 17•3 years ago
The gfx-d2d-vram-source-surface leak did not happen; the gfx-d2d-vram-source-surface memory report was wrong. It therefore seems better for the media team to look at this.
Comment 18•3 years ago
There is also a Canvas2D memory leak (Bug 1697344). It might be related.
Comment 19•3 years ago
Hello,
Do you have any news about this issue?
Best,
Philippe
Updated•3 years ago
Comment 20•3 years ago
Bryce, can you reproduce this? Do you have an idea what might be leaking?
Assignee
Comment 21•3 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)
Bryce, can you reproduce this? Do you have an idea what might be leaking?
At this stage I just get CORS errors for both the original test cases.
Reporter, has the configuration changed on the server hosting the media? If so, could it be reverted so we can test this further?
Assignee
Comment 22•3 years ago
Philippe, are you able to help with the questions in comment 21?
Assignee
Comment 23•3 years ago
Clearing NI as this is hard to address without a repro.
The info that we have suggests we are running out of shmems to send the CDM. I'm not sure why this would happen, but if I had to speculate it could be as a result of the codec taking up more memory than we anticipate and/or the existing shmems becoming fragmented (not clear how).
Please NI me if we have a repro.
Assignee
Comment 24•3 years ago
Thinking about this, I can probably MITM myself to overcome the CORS issues with tests. So I'll hold NI, but don't currently have a lot of cycles.
Reporter
Comment 25•2 years ago
The Codepen link has been updated with a working source that reproduced the NS_ERROR_OUT_OF_MEMORY issue.
https://codepen.io/JordanKinow/pen/eYdaZaz
Assignee
Comment 26•2 years ago
(In reply to jordan from comment #25)
The Codepen link has been updated with a working source that reproduced the NS_ERROR_OUT_OF_MEMORY issue.
https://codepen.io/JordanKinow/pen/eYdaZaz
Thanks, much appreciated.
Assignee
Comment 27•2 years ago
I've been debugging this and bug 1732492. I think they're likely the same underlying issue, but am holding off duping until we have a better idea what exactly is going on.
I'm about to be on an extended PTO, so I'm dumping my investigation so far for other engineers to follow up.
- The repro case in comment 25 is a good one. It can be reproduced with a local build, whilst the Amazon case requires an official build (which means you can't easily debug local changes).
- This seems to be a product of high resolution. The cases we have are ~1920*1080. I've played lower res videos (~720p) for 5+ hours without failure. It's possible that the lower res videos would eventually fault, but I don't have any evidence of that. It would be interesting to know if even higher resolution triggered the fault faster.
- We appear to be throwing an OOM in the TrackBuffersManager when we fail to append.
- Catching this in a debugger, I see that line failing when the task involved has the following mBuffer
  ={mRawPtr=0x0000028c1416d570 size = 2279083 } RefPtr<mozilla::MediaByteBuffer>
  and when mInputBuffer has
  + mBuffer  {mRawPtr=0x0000028c18cfd0b0 size = 2145931084, capacity = 2146435064 } RefPtr<mozilla::MediaByteBuffer>
    mStart   2145931008 unsigned __int64
    mLength  76 unsigned __int64
Based on the above, it seems like the issue is in our media source code, and is not specific to EME/DRM. However, I remain hesitant to rule out an EME/DRM connection, given both the repro cases involve it. That said, I think the media source lead is the best one to follow here.
I've gathered the following sets of logs while reproing the issue. They are too large to attach to the bug, so I'm providing GDrive links that should work for other Mozillians.
These logs are me reproing with MOZ_LOG=sync,timestamp,GMP:5,EME:5,MediaDecoder:5,PlatformDecoderModule:5,MediaDemuxer:5,MediaFormatReader:5,MP4Metadata:5
and these with MOZ_LOG=sync,timestamp,GMP:5,EME:5,MediaDecoder:5,PlatformDecoderModule:5,MediaDemuxer:5,MediaFormatReader:5,MP4Metadata:5,MediaSource:5
Updated•2 years ago
Assignee
Comment 28•2 years ago
I'm continuing to look into this. Further notes
- This reproduces on the example provided with a video-only stream, and does not reproduce with an audio-only stream. Not surprising, as video takes up more space than audio, but useful to know we can narrow the scope.
- The video only case may trigger OOMs in different parts of the code than I note in comment 27, so I need to chase this more.
- I've crafted some very large videos (4K, 2 hour long loop of big buck bunny) and have been trying to reproduce using Shaka player, and cannot. This is true for both encrypted and un-encrypted media.
- It's possible there's something different about this media in the test case above compared to the media I have.
- It's possible that there's a difference in the players being used (e.g. shaka vs bitmovin).
I'm currently investigating those last points to see if I can further narrow down what's going on.
Assignee
Comment 29•2 years ago
Further investigation suggests this problem is triggered by the media involved. That is, of the two issues I was following up on from comment 28, the problem is:
It's possible there's something different about this media in the test case above compared to the media I have.
I was not able to reproduce the issue using my own media (encrypted or not) in the bitmovin player. So if there is a JS component involved, it's only part of the problem. However, at this stage I think what's going on is specifically media related.
I'm still tracing the problem. The issue does appear to originate in our Media Source code, and looks to be caused by a logic error where we get stuck not clearing the input buffer as expected. I hesitate to speculate further at this stage, and will report back once I've debugged further.
Assignee
Comment 30•2 years ago
I'm pretty confident I've found the media aspect of this issue. I think we can fix this with a small patch that's suitable for uplift, and deal with further work in follow ups. It would be nice to get testing in place to cover this, which I think we can do if we get better memory reporting in place. I'll raise a follow up for that too.
A detailed explanation of the issue follows, then mitigations. TL;DR: we will accumulate data forever in some of our Media Source machinery when fed certain mp4 files, and I have a fix.
Bug 1558364 made changes to how we handle data being input to our Media Source stack. Loosely, it means that instead of doing removals from the start of an array, and incurring the cost of relocating the array each time, we instead use a span object to track our buffers.
I.e. instead of having an array and handling removal from the front of the array as follows
[chunk, chunk] -remove-> [empty, chunk] -resize-> [chunk]
we use a span and track removal as follows
     span                            span
--------------                      ------
v            v                      v    v
[chunk, chunk]   -remove->   [chunk, chunk]
In the span case, we don't end up needing to resize the underlying buffer as we remove data. This means that if we need a series of removals as we process appended data, we don't incur a repeated cost from them.
However, the span approach needs to clear its underlying buffer at some point; if we keep appending, it will grow indefinitely and we'll eventually run out of memory. There are a few places where we can clear the underlying buffer, and code was added in bug 1558364 specifically to avoid this issue. This is the code.
However, that code is only triggered if mInputBuffer is Nothing or isEmpty, i.e. if our input span is empty. Our anticipated use case is that this span is empty when we start an append. The high-level flow expected is: append data -> parse all the data -> start next append -> clear the buffer because all previous data is parsed.
In this case, the buffer is cleared as we start new appends. This code also covers the case where we sometimes don't have an empty span, but where we eventually do. So long as we eventually parse all the data and leave the input empty, the underlying buffer will be cleared.
Where this code doesn't work is if we consistently have data left over from parses, which is what happens here. In that case the buffer grows forever.
This happens because every append in the stream on this bug has trailing styp and sidx boxes. I don't believe the placement of these boxes after the media data is forbidden by the spec (though I'd appreciate checking of my reading), but it is unusual. Gecko's mp4 container parser would normally read over that data as part of parsing the media segment, but when it follows the media segment, the parser leaves it there until it parses the next segment.
So instead of each segment being parsed like so when the boxes precede the media data
[[styp],[sidx],[moof],[mdat]]
^                           ^
-----------------------------
            parse
the data is not fully parsed, like so
[[moof],[mdat],[styp],[sidx]]
^            ^
--------------
    parse
then when we get another segment appended, the span state and area parsed look like this
               span
               --------------------------------------------
               v                                          v
[[moof],[mdat],[styp],[sidx]],[[moof],[mdat],[styp],[sidx]]
               ^                           ^
               -----------------------------
                          parse
So every time we parse, we're left with some trailing data in the span. Because there's always a tail, we never clear the underlying buffer, and we eventually OOM.
I propose we mitigate this by clearing the underlying buffer whether the span is empty or not. If the span isn't empty, we copy any remaining data into a new buffer. For the cases where we don't have any leftover data, we'll continue to take the fast path introduced in bug 1558364. For cases where we do have data left over, we'll pay some cost from doing an allocation and copy, but we'll avoid this OOM, and I believe we'll typically have better performance than the pre-bug-1558364 case (primarily because we can batch what would previously have been several removals from our buffer into one large move).
We can do further mitigations by having the mp4 container parser skip over this data so our span becomes empty and we take the fast path. I'd looked at doing so, and think it could make sense to follow up with this. However, I see this as a bigger patch and harder to uplift. I have further thoughts on this, but will save them for a follow up bug.
It would also be good to have better memory reporting from the Media Source code. I spent a lot of time chasing the OOM location, and better reporting could have helped narrow that down.
I'll get a patch up and follow up bugs raised.
Comment 31•2 years ago
Set release status flags based on info from the regressing bug 1558364
Updated•2 years ago
Assignee
Comment 32•2 years ago
This prevents us from eventually overflowing the buffer when certain data is appended.
Assignee
Updated•2 years ago
Assignee
Comment 33•2 years ago
FWIW, I don't think there's anything DRM specific about the bug except that our repros happen to be encrypted streams. I'll verify when I create a test case next week, but the boxes tripping us up are just normal mp4 ones. I.e. this could happen in non-encrypted streams in the same way.
Edit: confirming this is not DRM specific. The testcase in bug 1755822 is DRM free but repros.
Comment 34•2 years ago
Pushed by bvandyk@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a5c20fbe5d5e Shrink MSE's mInputBuffer during append even if it's not empty. r=media-playback-reviewers,kinetik,alwu
Comment 35•2 years ago
bugherder
Comment 36•2 years ago
The patch landed in nightly and beta is affected.
:bryce, is this bug important enough to require an uplift?
If not, please set status_beta to wontfix.
For more information, please visit auto_nag documentation.
Assignee
Comment 37•2 years ago
(In reply to Release mgmt bot [:marco/ :calixte] from comment #36)
The patch landed in nightly and beta is affected.
:bryce, is this bug important enough to require an uplift?
If not please set status_beta to wontfix. For more information, please visit auto_nag documentation.
Yep. Waiting on
- Someone to test this fixes bug 1732492 as the account I have is geolocked and doesn't work in my region.
- Bug 1755822 to land so we have some coverage.
Comment 38•2 years ago
Hey Bryce, can we dupe bug 1732492 over to this one?
Assignee
Comment 39•2 years ago
(In reply to Jim Mathies [:jimm] from comment #38)
Hey Bryce, can we dupe bug 1732492 over to this one?
Yes. Could you test that bug 1732492 is fixed using Amazon's test stream? I'd like to confirm that it's fixed before we dupe it.
Comment 40•2 years ago
Our last beta has shipped; we build our release build on Monday.
Updated•2 years ago
Comment 41•2 years ago
I reproduced this issue using the second link from the Description on Firefox 98.0 (20220304153049) on Win10. Confirmed as fixed on Firefox 99.0b8 (20220324185704) and Nightly 100.0a1 (20220327213550) on Win10 64-bit. If there's a need to verify this on other platforms, please let me know.
Later edit: Also verified on MacOS 12.