Closed Bug 1102642 Opened 9 years ago Closed 8 years ago

Large OOM in mozilla::MP4ContainerParser::ParseStartAndEndTimestamps

Categories

(Core :: Audio/Video, defect, P1)

x86_64
Windows 7
defect

Tracking

()

VERIFIED FIXED
mozilla38
Tracking Status
firefox36 + verified
firefox37 + verified
firefox38 + verified

People

(Reporter: rowbot, Assigned: jya)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 2 obsolete files)

While attempting to reproduce the crash in Bug 1100063, I ran into this crash.  I do not have reliable steps to reproduce.  All I can say is that I was doing an excessive amount of seeking in YouTube videos trying to reproduce the crash in Bug 1100063.

bp-b75a193d-c509-4dbb-93ed-e172e2141120
 Also reproduced this bug while seeking through YouTube videos, but no reliable STR 75e0d0a8-450a-400f-b6c6-caca82141221
Status: UNCONFIRMED → NEW
Ever confirmed: true
[Tracking Requested - why for this release]:

I looked at sample of crashes from this signature - see https://crash-stats.mozilla.com/report/list?signature=OOM%20%7C%20large%20%7C%20mozalloc_abort%28char%20const%2A%20const%29%20%7C%20mozalloc_handle_oom%28unsigned%20int%29%20%7C%20moz_xmalloc%20%7C%20nsTArray_base%3CnsTArrayInfallibleAllocator%2C%20nsTArray_CopyWithMemutils%3E%3A%3AEnsureCapacity%28unsigned%20int%2C%20unsigned%20int%29%20%7C%20nsTArray_Impl%3Cunsigned%20char%2C%20nsTArra... for more reports - and I found that they all go through mozilla::MP4ContainerParser::ParseStartAndEndTimestamps.

This is #6 with 2.3% of all DevEd 36 crashes in the last 3 days, I'm pretty sure this is MSE-related.
Summary: crash [OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xmalloc | nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity(unsigned int, unsigned int)] → Large OOM in mozilla::MP4ContainerParser::ParseStartAndEndTimestamps
ni on Anthony - have we landed stuff on beta yet that would account for this crash?
Flags: needinfo?(ajones)
(In reply to Marcia Knous [:marcia - use needinfo] from comment #3)
> ni on Anthony - have we landed stuff on beta yet that would account for this
> crash?

Sorry I meant Aurora/DevEd instead of Beta.
Ralph - anything missing from 36?
Flags: needinfo?(giles)
From the stack it looks like this area might be involved: http://hg.mozilla.org/releases/mozilla-aurora/annotate/17c6cdb0f99f/dom/media/mediasource/ContainerParser.cpp#l250

http://bit.ly/1FnZT8K shows the first small spike using Build ID 2014123000. From there there have been small spikes using Build IDs 2015010300, 2015010400 and 2015010500.
There's a fair number of MediaSource changes we haven't backported. I'll see if I can find something.
Assignee: nobody → giles
Flags: needinfo?(giles)
Flags: needinfo?(ajones)
Does this affect 37 too?

Anyway, tracking for 36 as it is an important crash.
Keywords: crash
I'm going to assume that this affects 36+ until we know better.
Ralph, this bug is now 2% of our crashes. Have you been able to find a potential fix?
Flags: needinfo?(giles)
I haven't been able to figure this one out. Everything is uplifted to beta now, so keep me posted on statistics and whether it's beta 2 or beta 3.
Flags: needinfo?(giles)
Jya, Matt, do you have any ideas here? Does ~2M seem unreasonably large for an AppendBuffers call? That doesn't seem like it should be enough to OOM. Perhaps the length is getting corrupted?

It looks like it's the mStream->AppendElements() in ParseStartAndEndTimestamps(), which is only called by SourceBuffer::AppendBuffer() via TrackBuffer::AppendBuffer(). If we could get a reproduction with NSPR_LOG_MODULES=MediaSource:5 we could confirm the theory.
Blocks: MSE
Priority: -- → P1
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jyavenard)
(In reply to Ralph Giles (:rillian) from comment #12)
> Jya, Matt, do you have any ideas here? Does ~2M seem unreasonably large for
> an AppendBuffers call? That doesn't seem like it should be enough to OOM.

On 32-bit builds, especially on a 32-bit OS, 2MiB can certainly OOM. Our allocator (jemalloc) tends to leave 1MiB-sized holes behind, so you can run out of 2MiB chunks well before all the address space is used up. Our chunk cache (bug 1073662) potentially makes this worse on Windows by holding onto 1MiB chunks of address space for later reuse (which can't be used for larger allocations).

Even 1MiB allocations aren't great, since jemalloc will fail if the first 1MiB-sized chunk isn't 1MiB-aligned and there are no more 2MiB chunks left (bug 1005844). It would probably still be an improvement though, especially since 1MiB allocations would benefit from the chunk cache (but note that the chunk cache is at least temporarily going away as we transition to jemalloc3, bug 762449).

I don't know this code at all, so feel free to disregard, but perhaps mozilla::SegmentedVector could be used to provide some encapsulation while avoiding large allocations?
(In reply to Ralph Giles (:rillian) from comment #12)
> Jya, Matt, do you have any ideas here? Does ~2M seem unreasonably large for
> an AppendBuffers call? That doesn't seem like it should be enough to OOM.
> Perhaps the length is getting corrupted?
> 
> It looks like it's the mStream->AppendElements() in
> ParseStartAndEndTimestamps(), which is only called by
> SourceBuffer::AppendBuffer() via TrackBuffer::AppendBuffer(). If we could
> get a reproduction with NSPR_LOG_MODULES=MediaSource:5 we could confirm the
> theory.

I've seen plenty of videos / dash using 3MB segment (about 5s worth of 1080p videos).

I've been testing MSE for month using this page: http://people.mozilla.org/~jyavenard/tests/mse_mp4/paper.html the media segment is 3MB.
Flags: needinfo?(jyavenard)
(In reply to Ralph Giles (:rillian) from comment #11)
> I haven't been able to figure this one out. Everything is uplifted to beta
> now, so keep me posted on statistics and whether it's beta 2 or beta 3.

According to crash stats the large spike in these crashes occurred using 2015011412. Since that spike I don't see very many crashes in this stack on beta: http://bit.ly/1B0OpjI. The most recent beta I see has  ID:20150122214638
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #13)

> On 32-bit builds, especially on a 32-bit OS, 2MiB can certainly OOM. Our
> allocator (jemalloc) tends to leave 1MiB-sized holes behind, so you can run
> out of 2MiB chunks well before all the address space is used up.

Ouch. So much for infallible allocations.

> I don't know this code at all, so feel free to disregard, but perhaps
> mozilla::SegmentedVector could be used to provide some encapsulation while
> avoiding large allocations?

Doesn't look like a drop in replacement. We need random access, which could certainly be added, but I worry about the performance. See the mp4_demuxer::BufferedStream wrapper at https://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/BufferStream.cpp#19

What about making mStream fallible here?
From what I have heard in different places, we should make anything fallible that uses allocations >256k - so 1M or 2M should definitely be fallible if possible. In the end, we should try to fail gracefully rather than crash on large videos, I guess. ;-)
(In reply to Marcia Knous [:marcia - use needinfo] from comment #15)

> According to crash stats the large spike in these crashes occurred using
> 2015011412. Since that spike I don't see very many crashes in this stack on
> beta: http://bit.ly/1B0OpjI. The most recent beta I see has 
> ID:20150122214638

I'd certainly like to believe the last two weeks of uplift fixed this, but Sylvestre said the opposite in #10, and that ID sounds 36b3? We'll know more after the breakpad bump from bug 1124892 lands.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #17)
> From what I have heard in different places, we should make anything fallible
> that uses allocations >256k

I supposed I'd heard that before too. It's just not reasonable to call ~MB allocations 'large' where video is concerned.
(In reply to Ralph Giles (:rillian) from comment #18)
> (In reply to Marcia Knous [:marcia - use needinfo] from comment #15)
> 
> > According to crash stats the large spike in these crashes occurred using
> > 2015011412. Since that spike I don't see very many crashes in this stack on
> > beta: http://bit.ly/1B0OpjI. The most recent beta I see has 
> > ID:20150122214638
> 
> I'd certainly like to believe the last two weeks of uplift fixed this, but
> Sylvestre said the opposite in #10, and that ID sounds 36b3? We'll know more
> after the breakpad bump from bug 1124892 lands.

http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/36.0b1-candidates/build2/win32_info.txt shows the build ID 2015011412 was beta 1, not beta 3. So this may no longer be an issue in the betas.
Oh, I see. You're saying you _haven't_ seen crashes with 20150122214638.

Quoth Bobby Holly on email, "I don't think we need to reproduce this per se - we just need to do fallible allocation for any buffers whose size is controlled by JS (this is a standard rule), and certainly any buffer in the hundreds of k or larger."
Like this?
Attachment #8554125 - Flags: review?(bobbyholley)
Comment on attachment 8554125 [details] [diff] [review]
Make mp4_demuxer::BufferReader fallible

I get instant crashes on youtube with this. Don't have time to debug tonight.
Attachment #8554125 - Flags: review?(bobbyholley)
Comment on attachment 8554125 [details] [diff] [review]
Make mp4_demuxer::BufferReader fallible

Review of attachment 8554125 [details] [diff] [review]:
-----------------------------------------------------------------

I think we need to change the call signature of ParseStartAndEndTimestamps to disambiguate between "init segment not there" and "error that we should throw for". The API is pretty ugly right now, but the simplest possible solution is probably to just pass a bool* that the callee can set in the error cases, and that the caller can use to propagate a DOMException.

This only covers high-level init segment parsing. The other thing we need to make fallible is ResourceQueue::AppendItem (and callers), which covers the regular SourceBuffer append case. That could probably go in a separate patch though.
Attachment #8554125 - Flags: review-
That will conflict with jya's patches to make us run the buffer append algorithm asynchronously.

The spec has explicit handling for insufficient memory during the prepare append algorithm (throw a QUOTA_EXCEEDED_ERR exception), but not for the same during the buffer append algorithm.

I think that means we need to allocate all memory we're going to use during the prepare append algorithm, so that we know we have enough to run the asynchronous part.

We could also just use the append error algorithm and put the MediaSource into an error state, this is the expected behaviour when running out of memory during a stream append.
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #25)
> That will conflict with jya's patches to make us run the buffer append
> algorithm asynchronously.
> 
> The spec has explicit handling for insufficient memory during the prepare
> append algorithm (throw a QUOTA_EXCEEDED_ERR exception), but not for the
> same during the buffer append algorithm.
> 
> I think that means we need to allocate all memory we're going to use during
> the prepare append algorithm, so that we know we have enough to run the
> asynchronous part.

I was thinking the exact same thing yesterday as I was going over the async append patch once again (bug 1118589).

With that patch the OOM would now happen right when the AppendDataRunnable is created which is not good.

I was thinking that the only way to prevent a OOM down the chain, would be to allocate the memory in PrepareAppend, and store that in something similar to the mp4_demuxer::ByteBuffer which is a refcounted array. And subsequent usage of the AppendBuffer and further would be to re-use/share that data including the internal nsQueue data used by the SourceBufferResource.

Even if we check for OOM when allocating in PrepareAppend, there's no guarantee there won't be more further down the track as right now, for each appendBuffer the data appended is temporarily duplicated at least three times later.
Attached patch Use shared data (obsolete) — Splinter Review
This is what I had in mind. Not fully functional yet as I haven't completed ParseStartAndEndTimestamps BufferStream conversion into MP4Stream, (the MediaSourceResource doesn't allow to remove partial data). All buffers are now allocated just once, and then shared; nothing is duplicated
Assignee: giles → jyavenard
Status: NEW → ASSIGNED
Adding a new signature. This crash is reproducible in my Windows 8 lab machine.
Use a ref counted, faillible TArray to store our data. Only MP4 is making full use of it at this stage.
Attachment #8554915 - Flags: review?(matt.woodrow)
Attachment #8554546 - Attachment is obsolete: true
Crash Signature: , nsTArra... ] → , nsTArra... ] [@ OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xmalloc | nsTArray_base<nsTArrayInfallibleAllocator, nsTArray_CopyWithMemutils>::EnsureCapacity(unsigned int, unsigned int) | nsTArray_Impl<bool, …
Attachment #8554915 - Flags: review?(matt.woodrow) → review+
(In reply to Marcia Knous [:marcia - use needinfo] from comment #29)
> Adding a new signature. This crash is reproducible in my Windows 8 lab
> machine.

Can you see memory usage climbing before we crash?

Could you try getting an about:memory report when the usage is high?
Flags: needinfo?(mozillamarcia.knous)
Missed an explicit declaration:
Added quickly in this push
https://hg.mozilla.org/integration/mozilla-inbound/rev/d72836297b6b
https://hg.mozilla.org/mozilla-central/rev/b7b87042f254
https://hg.mozilla.org/mozilla-central/rev/d72836297b6b
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Depends on: 1126499
The addition of #include "MP4Stream.h" to ContainerParser.cpp in this patch is causing Lollipop build bustage ("MP4Stream.h: No such file or directory"). Filed bug 1126499 for this.
Attachment #8554125 - Attachment is obsolete: true
Comment on attachment 8554915 [details] [diff] [review]
Use ref counted compressed data within mediasource

Approval Request Comment
[Feature/regressing bug #]: MSE
[User impact if declined]: Crashes playing YouTube video.
[Describe test coverage new/current, TreeHerder]: Landed on m-c.
[Risks and why]: Almost entirely MSE-specific. Though not small.
[String/UUID change made/needed]: None.

NB this one may need some rebase love.
Attachment #8554915 - Flags: approval-mozilla-beta?
Attachment #8554915 - Flags: approval-mozilla-aurora?
Attachment #8554915 - Flags: approval-mozilla-beta?
Attachment #8554915 - Flags: approval-mozilla-beta+
Attachment #8554915 - Flags: approval-mozilla-aurora?
Attachment #8554915 - Flags: approval-mozilla-aurora+
(In reply to Matt Woodrow (:mattwoodrow) from comment #32)
> (In reply to Marcia Knous [:marcia - use needinfo] from comment #29)
> > Adding a new signature. This crash is reproducible in my Windows 8 lab
> > machine.
> 
> Can you see memory usage climbing before we crash?
> 
> Could you try getting an about:memory report when the usage is high?

I let the machine run overnight with logging - hope to have something this morning when I get in.
Here is a memory log which was run while the video was playing overnight. May not help, but since the crash is reproducible on this machine.
Flags: needinfo?(mozillamarcia.knous)
(In reply to Marcia Knous [:marcia - use needinfo] from comment #40)
> Created attachment 8556009 [details]
> Memory log captured overnight while video was playing
> 
> Here is a memory log which was run while the video was playing overnight.
> May not help, but since the crash is reproducible on this machine.

This looks like a cycle collector log, which probably doesn't help here because these media buffers aren't in the cycle collection graph. The more interesting data would be the reports in about:memory.
Flags: qe-verify+
(The messed up signature is this bug in disguise, due to bug 1124892)
Crash Signature: , nsTArrayInfallib...] → , nsTArrayInfallib...] [@ OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xmalloc | xul.dll@0x278d96 | xul.dll@0x7bfa98 | xul.dll@0x10f419a | xul.dll@0x10f0227 | xul.dll@0x10f004a | xul.dll@0x1e8527 | xul.dll@0x1…
Crash Signature: xul.dll@0x10867e | xul.dll@0x286684...] → xul.dll@0x10867e | xul.dll@0x286684...] [@ OOM | large | mozalloc_abort(char const* const) | mozalloc_handle_oom(unsigned int) | moz_xmalloc | xul.dll@0x26add6 | xul.dll@0x1150ab4 | xul.dll@0x10bcaa3 | xul.dll@0x10b8be6 | xul.dll@0x10b8a09 | xul.dll@0x2…
The crash rates over the past two weeks are the following:
- 38.0a1 Nightly: 3 crashes
- 37.0a2 Aurora: 9 crashes
- 36.0b6: 4 crashes (down from 601 crashes in 36.0b4).

Given the low crash rates I consider it's now safe to close this issue.
Status: RESOLVED → VERIFIED
QA Contact: cornel.ionce
You need to log in before you can comment on or make changes to this bug.