Closed Bug 791344 Opened 12 years ago Closed 12 years ago

Browser freeze, allocates all memory when viewing a particular type of video with a particular build.


(Core :: Audio/Video, defect)

16 Branch
Not set



Tracking Status
firefox17 --- verified


(Reporter: padenot, Assigned: padenot)



(Keywords: regression)


(3 files, 3 obsolete files)

- Go to
- Click on one of the data point, that is quite recent (bottom right corner), notice that a video appears on the right of the graph.
- Click on the video to start the playback

The video plays

Actual result:
The browser freezes, allocates all the memory, and gets killed by the oom killer. The CPU usage showed by `top` is around 150-200%, indicating that at least two threads are doing things. I managed to run sysprof while my system was dying to get a profile (attached, open it in sysprof on Linux).

It shows a couple things (profiles are cumulative, i.e. include time spent by called functions):
- nsBuiltinDecoderStateMachine::AdvanceFrame - 17%
  - nsBuiltinDecoderStateMachine::StartBuffering - 8%
    - nsBuiltinDecodedStateMachine::UpdateReadyState - 5%
      - NS_DispatchToMainThread - 3%
  - nsBuiltinDecoderStateMachine::HasLowUndecodedData - 5%
    - nsBuiltinDecoderStateMachine::GetUndecodedData - 4.5%
      - nsBuiltinDecoderStateMachine::GetBuffered - 4%
        - nsWebMReader::GetBuffered - 4%
- nsBuiltinDecoderStateMachine::UpdateReadyState - 6%
  - NS_DispatchToMainThread - 3.5%
- nsBuiltinDecoderStateMachine::GetUndecodedData - 4%
  - nsWebMReader::GetBuffered - 3.5%

Also, there is a _very_ large amount of time spent in system calls related to locking (sys_futex, do_futex, etc.), and thos are called by the Firefox process

This happens only on those videos, and only on my main nightly (from the tarballs on, updated to version 2012-09-13), not on my builds, and only on Linux (I tested the same nightly on Mac, works fine). It works fine on release.
If I update this build to a new version (2012-09-14), it works, I can't reproduce.

I can also reproduce on another machine, with a build from 2012-09-10. When I updated this machine, the problem goes away as well. Funnily enough, nothing relevant has landed in content/media or media/ in that time range.
Testing against

I can reproduce this in a 2012-09-13 build, 2012-09-16 build, and my out-of-date debug build from September 1stish (3e2f50279d90), all on Linux.  The video plays fine in Firefox 15, and in all builds when played from a local download.

It seems like we're entering a tight loop when running low on data/pausing to buffer.  The main thread is spending most of its time in nsBuiltinDecoder::NextFrameAvailable, and NSPR logging shows the state machine furiously switching between BUFFERING and DECODING.

Also seem to be a surprisingly larger number of "Decoder discarding video frame" messages logged both when reproducing this and when the video plays normally.
This seems to have regressed between Firefox 15 and Beta 16. which should correspond to the range 26dcd1b1a208:6fdf9985acfe if I've interpreted the tags correctly.
Keywords: regression
Hardware: x86_64 → All
Version: unspecified → 16 Branch
849da4232313 is good, 45dbfd1debfd is bad.  That makes this a regression from bug 760336.
Blocks: 760336
(In reply to Matthew Gregan [:kinetik] from comment #2)
> Also seem to be a surprisingly larger number of "Decoder discarding video
> frame" messages logged both when reproducing this and when the video plays
> normally.

This might have to do with the fact that those particular videos are actually 60fps (just checked with mkvinfo). AdvanceFrame _should_ handle that, but apparently does not.

I've checked, and we are each time slightly (less that 200 microseconds) late to make the frame we want, so it gets skipped. I'll see if we can get something better by tweaking some constants or something.

I'm investigating the other problem, which might be related (here, we have to decode more that twice the amount of frames we usually do, and maybe the buffering logic is making assumptions it should not make).
Attached patch Properly log frame skipping. (obsolete) — Splinter Review
The logic that logged frame skipping is off, and is reporting that every frame is skipped. This patch actually makes it report the number of frames skipped.

With this patch applied, we don't skip frames at all.

I'll continue investigating for the other problem.
Attachment #661983 - Flags: review?(kinetik)
Comment on attachment 661983 [details] [diff] [review]
Properly log frame skipping.

Review of attachment 661983 [details] [diff] [review]:

::: content/media/nsBuiltinDecoderStateMachine.cpp
@@ +2292,5 @@
>    int64_t remainingTime = AUDIO_DURATION_USECS;
>    NS_ASSERTION(clock_time >= mStartTime, "Should have positive clock time.");
>    nsAutoPtr<VideoData> currentFrame;
> +#ifdef PR_LOGGING
> +  int32_t popedFrames = 0;

Call this droppedFrames instead.
Attachment #661983 - Flags: review?(kinetik) → review+
I can reproduce reliably when I follow those steps:

- Activate the NSPR log for the media decoder
- Load the video using cpearce's HttpMediaServer, setting the network rate to 100k/s or so, that is slower that what would be needed for this particular video to play through (in the canplaythrough sense), because the video is around 350k/s
- Start playing the video, that hit the end of the buffered region pretty quickly
- Notice that we switch super quickly from buffering to decoding (we see that in the log)
- Hit escape so we kill the http connection

Then an infinite loop occurs, the |clock_time| variable goes way after then end of the video, and never stops.
Attachment #661983 - Attachment is obsolete: true
This was the problem, I guess, some units were not coherent, messing up the
buffering logic. I can't reproduce the bug with this patch applied, and
I'm still able to reproduce it with 100% success rate if I qpop the patches.

Also, when a user press espace when a video loads, the state machine goes go
into shutdown state, and then does something I can't figure out at 11pm, but
it seems sketchy.

It's still switching back and forth between buffering and decoding state, but of
course, no more data are going to come (the resource is closed, there is no
decoder, no element, nothing). I'll investigate tomorrow. Or maybe this belongs
to another bug.
Attachment #662030 - Flags: review?(kinetik)
Comment on attachment 662030 [details] [diff] [review]
Use the right method to build a timestamp from microseconds. r=

Thanks for digging in to this!
Attachment #662030 - Flags: review?(kinetik) → review+
Attachment #662030 - Attachment is obsolete: true
Attachment #661992 - Attachment is obsolete: true
Keywords: checkin-needed
Assignee: nobody → paul
Pushed to Try since I don't see any results here. Will checkin if it goes green.
Looks like this or bug 792274 is causing OSX crashtest failures.

TEST-UNEXPECTED-FAIL | file:///builds/slave/talos-slave/test/build/reftest/tests/content/media/test/crashtests/752784-1.html | Exited with code 1 during test run
INFO | | Application ran for: 0:01:23.503429
INFO | | Reading PID log: /var/folders/3t/f8_jgcpj1_s59lzd72ny3wyh00000w/T/tmputbtvjpidlog
Downloading symbols from:
PROCESS-CRASH | file:///builds/slave/talos-slave/test/build/reftest/tests/content/media/test/crashtests/752784-1.html | application crashed (minidump found)
Crash dump filename: /var/folders/3t/f8_jgcpj1_s59lzd72ny3wyh00000w/T/tmpqjrRQd/minidumps/27D53FF3-B731-4410-B87B-63FB306229E4.dmp
Operating system: Mac OS X
                  10.8.0 12A269
CPU: amd64
     family 6 model 42 stepping 7
     8 CPUs

Crash address: 0x40

Thread 0 (crashed)
 0  XUL!nsBuiltinDecoder::ConnectDecodedStreamToOutputStream [nsAutoPtr.h : 1003 + 0x0]
    rbx = 0x04b643f0   r12 = 0x19a6bb60   r13 = 0x19a6ba80   r14 = 0x19a6ba80
    r15 = 0x42334978   rip = 0x01cdb5f5   rsp = 0x5fbf95d0   rbp = 0x5fbf95f0
    Found by: given as instruction pointer in context
 1  XUL!nsBuiltinDecoder::AddOutputStream [nsBuiltinDecoder.cpp : 181 + 0xa]
    rip = 0x01cdc56c   rsp = 0x5fbf9600
    Found by: stack scanning
 2  XUL + 0xcdc45f
    rip = 0x01cdc460   rsp = 0x5fbf9610
    Found by: stack scanning
 3  XUL!nsHTMLMediaElement::CaptureStreamInternal [nsHTMLMediaElement.cpp : 1551 + 0x13]
    rip = 0x0188e542   rsp = 0x5fbf9640
    Found by: stack scanning
 4  XUL!nsHTMLMediaElement::MozCaptureStream [nsHTMLMediaElement.cpp : 1559 + 0x4]
    rip = 0x0188e590   rsp = 0x5fbf9680
    Found by: stack scanning
 5  XUL!NS_InvokeByIndex_P [xptcinvoke_x86_64_unix.cpp : 162 + 0x3]
    rip = 0x0274ebcf   rsp = 0x5fbf96a0
    Found by: stack scanning
 6  XUL!CallMethodHelper::ConvertIndependentParam [XPCWrappedNative.cpp : 2836 + 0x15]
    rip = 0x01f3bd38   rsp = 0x5fbf96b0
    Found by: stack scanning
 7  XUL!CallMethodHelper::InitializeDispatchParams [nsTArray.h : 192 + 0x0]
    rip = 0x01f3ac92   rsp = 0x5fbf9700
    Found by: stack scanning
 8  XUL!CallMethodHelper::Call [XPCWrappedNative.cpp : 3105 + 0x4]
    rip = 0x01f39e15   rsp = 0x5fbf9740
    Found by: stack scanning
 9  XUL!XPCWrappedNative::CallMethod [XPCWrappedNative.cpp : 2405 + 0x7]
    rip = 0x01f37b22   rsp = 0x5fbf9780
    Found by: stack scanning
10  XUL + 0xc1029f
    rip = 0x01c102a0   rsp = 0x5fbf97b8
    Found by: stack scanning

Thread 1
 0  libsystem_kernel.dylib + 0x12d16
    rbx = 0x04704648   r12 = 0x7317ff40   r13 = 0x7317ff88   r14 = 0x731809c0
    r15 = 0x73180934   rip = 0x8a972d16   rsp = 0x04704508   rbp = 0x04704680
Keywords: checkin-needed
Pushed to try along with other things:
Keywords: checkin-needed
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Comment on attachment 662240 [details] [diff] [review]
Use the right method to build a timestamp from microseconds and properly convert between units

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 760336
User impact if declined: high CPU and memory use, possible main thread hangs; also makes QA of bug 789075's patch difficult
Testing completed (on m-c, etc.): baked on mozilla-central for 3 weeks
Risk to taking this patch (and alternatives if risky): very low, corrects time calculation in two places
String or UUID changes made by this patch: none
Attachment #662240 - Flags: approval-mozilla-beta?
Attachment #662240 - Flags: approval-mozilla-aurora?
Comment on attachment 662240 [details] [diff] [review]
Use the right method to build a timestamp from microseconds and properly convert between units

Oops, this is on aurora already.  It'd still be nice to take on beta.
Attachment #662240 - Flags: approval-mozilla-aurora?
Comment on attachment 662240 [details] [diff] [review]
Use the right method to build a timestamp from microseconds and properly convert between units

We get enough feedback from users about memory use that this seems very worth taking for uplift to Beta, please land this week so it's in Beta 3 and we have time to ensure there are no unintended regressions.
Attachment #662240 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/17.0 Firefox/17.0

Verified fixed with F17 beta 6. Could previously reproduce the issue. Checked with sites in comments 0 and 2.
You need to log in before you can comment on or make changes to this bug.