Closed Bug 506061 Opened 15 years ago Closed 15 years ago

Videos stop playing after 26 minutes

Categories

(Core :: Audio/Video, defect)

All
Linux
defect
Not set
major

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta1-fixed

People

(Reporter: j, Assigned: kinetik)

References

Details

Attachments

(5 files, 6 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.1) Gecko/20090716 Ubuntu/9.04 (jaunty) Shiretoko/3.5.1
Build Identifier: 

playing longer videos in mozilla, they stop after 26 minutes and one second. this happens each time the video is played. before it stops the video stutters.


Reproducible: Always

Steps to Reproduce:
1. open video about an hour long
2. start playing it, no seeking, just play
3. after 26 minutes it hangs
Actual Results:  
video hangs, see strace of firefox attached

Expected Results:  
play the video to the end
using pulseaudio the video was hanging before, at 24 minutes and 31 seconds.
just using also all videos seam to hang at 26:01
playing the video in oggplayer it does not hang.

also seamed to play through with Firefox 3.5.1 on Mac OS X 10.5
http://stream.kein.org/unp/26.01.ogv (1gb)
this video, even if downloaded, will only play 26.01 minutes here
another video that hangs reliably

http://ia331342.us.archive.org/2/items/night_of_the_living_dead/night_of_the_living_dead.ogv

gets out of sync after about 25 minutes, it hangs at 43:18

this bug was introduced after 3.5b4, verified here with several samples, they all play through with 3.5b4 and hang with 3.5b99 and 3.5.1

<kinetik> iirc, doublec's a/v sync stuff landed after beta 4, so that makes some sense
<kinetik> my guess is that something's going wrong with the audio clock
<kinetik> landed 2009-05-20, so would've been in b99 onwards
Do those videos hang if you seek to some time shortly before the hang-time and playback from there? Or do you need to play through from the beginning to get the behaviour?
they only hang if you play through and do not seek.
tested the videos with 26 minute hang again with a nightly build from
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-trunk/

and it does no longer hang but stutter at 26 minutes.
pressing pause and play does not help, it still stutters,
it plays ok after seeking to another place.
Just reproduced this on my Linux machine (Fedora 10, PulseAudio, Intel HDA sound).  The 26.01 video plays up until about ~24 minutes, starts playing frames very slowly, and then eventually stopped at 24:37.

nsOggDecodeStateMachine::PlayFrame is waiting at:

    mon.Wait(PR_MillisecondsToInterval(PRInt64((frame->mTime - time)*1000)));

Where frame->mTime is 1478, and time is 1456, so we're sleeping for over 20 seconds.  time comes directly from hwtime via calling sa_stream_get_position.

Breaking in sa_stream_get_position the next time, it turns out the PCM device is in XRUN state at this point, so the audio clock is no longer advancing, and since we're not writing more data (because we're waiting for it to catch up to the frame time), we don't make progress at the normal rate.

This might be related to bug 495895, except in that case we XRUN immediately.
Status: UNCONFIRMED → NEW
Ever confirmed: true
FWIW, seeking when this problem happens will workaround it temporarily, because we close and reopen the audio stream.
in that loop it should never wait longer than mCallbackPeriod no?
also in this case the sound card just had an xrun so it needs more data,
waiting is the wrong thing to do. calling PlayAudio(frame); again might help.
or could it be that more data needs to be decoded to continue audio playback?
I wonder if the first patch in bug 506434 fixes this? That changes the loop.
The loop assumes that we've written enough audio data to advance the audio clock  to the presentation time of the current frame.  So we're waiting for the audio to actually play (and clock to reach an assumed point) before we play the next frame.

Bug 506434 doesn't seem to change the behaviour in this case.

I instrumented the time and frame->mType variables and discovered that the time we're getting from the audio clock stops advancing, then starts jumping backwards slightly.  This happens a few sa_stream_get_position calls before the PCM device enters XRUN state.
When the alsa backend gets into the underrun state the time doesn't advance. This results in the loop in nsOggDecoder never exiting and playback never continues. We usually do have audio data available to be played but because this loop plays audio for a the single current frame it never queues the remaining audio which would cause the current play position to start moving again.

One approach might be to immediately play the audio data when we first get it in QueueDecodedFrames. This will result in it being buffered in nsAudioStream and eventually written as we call PlayFrame in the loop. If we do this we need to change nsAudioStream so the buffering is more efficient.

A quick test shows that it does prevent this video from stopping at that point.

I'm unsure why we are underrunning though and it would be good to find out the ultimate cause of that.
The ultimate reason we underrun is that there's accumulated error in the FrameData time calculation that, by the 26 minute mark, is around 270ms later than it should be.

Switching mLastFrameTime and mCallbackPeriod from floats to doubles allows the video to play through without underrunning, and won't suffer a similar level of error for many months of continuous playback.  A cleaner fix would be to change the way we calculate mFramerate/mCallbackPeriod/mLastFrameTime so that it's less susceptible to error accumulation.
We also need to fix the handling when we do underrun, because it's easy to get into situations where we can't recover (when using the ALSA backend).  The other backends don't tend to show as many symptoms as their audio clocks keep running during underrun, or they write silence to the hardware on underrun.

Other stuff I noticed:
 - The underrun recovery in sa_stream_write looks suspect, since we subtract len from nframes after recovery, but as far as I can tell, that data has not been successfully written (frames == 0 after recovery).
 - sa_stream_get_write_size should use snd_pcm_avail_update, and either recover on underrun or return s->buffer_size in that situation.
 - sa_steam_get_position caps delay at s->buffer_size, which makes no sense to me.  Also, I can't work out why it calls snd_pcm_forward.  The call to snd_pcm_avail_update seems to be unnecessary, too.
(In reply to comment #16)
>  - The underrun recovery in sa_stream_write looks suspect, since we subtract
> len from nframes after recovery, but as far as I can tell, that data has not
> been successfully written (frames == 0 after recovery).

Yes, this is wrong. The result from the recover call is zero on success, not a number of frames.

>  - sa_stream_get_write_size should use snd_pcm_avail_update, and either recover
> on underrun or return s->buffer_size in that situation.

Why snd_pcm_avail_update and not snd_pcm_avail?
(In reply to comment #17)
> (In reply to comment #16)
> >  - The underrun recovery in sa_stream_write looks suspect, since we subtract
> > len from nframes after recovery, but as far as I can tell, that data has not
> > been successfully written (frames == 0 after recovery).
> 
> Yes, this is wrong. The result from the recover call is zero on success, not a
> number of frames.
> 
in addition to that, len is the not the number of frames written,
that would be frames. as returned from snd_pcm_writei
> Why snd_pcm_avail_update and not snd_pcm_avail?

It's not really clear what snd_pcm_avail does, but probably more than we need, maybe less than we need, and possibly not what we need.  This is the best documentation I've found about what API calls can be used to get particular pieces of information, from one of the ALSA developers:

http://mailman.alsa-project.org/pipermail/alsa-devel/2008-June/008558.html

> in addition to that, len is the not the number of frames written,
> that would be frames. as returned from snd_pcm_writei

That code makes a couple of unit errors with frames vs bytes, too.  Also, I can't work out why it caps the write size to min(nframes, period_size) (and makes another unit error at the same time), since it already writes in a loop to sort-of deal with short writes.  Except it just does a printf in that case and goes ahead and subtracts len.

Actually, we can probably just remove buffer_{size,bytes} and period_{size,bytes} from struct sa_stream as they don't seem to be necessary and every existing use of them is incorrect.
Oh, maybe a simple way to fix the floating point error accumulation is to change mLastFrameTime into a PRUint64 that counts frames, and then do mLastFrame * mCallbackPeriod when necessary to calculate the current frame time.  I haven't thought that idea through or checked to see if it'd make sense everywhere in the code, though.
* only update nframes and data with the actual amount written
* recover from xruns in sa_stream_get_write_size
* just calls snd_pcm_avail_update in sa_stream_get_write_size

tested in oggplayer and mozilla-central, both seamed to work.
tested here and file does no longer hang as suggested in previous comment.
Matthew's suggestion to calculate mLastFrame * mCallbackPeriod sounds like a better idea since it does not accumulate errors.
(In reply to comment #21)
> Created an attachment (id=392466) [details]
> cleanup alsa backend and recover from xruns

It's possible for the state to be XRUN in get_position as well. I noticed this during my testing. If it is, and the current frame has no audio data to write we can get stuck in the loop mentioned in comment 14. I think we should return an error from get_position and handle it so that doesn't happen.
Attached patch further cleanup alsa backend (obsolete) — Splinter Review
in addition to previous patch,
* simplify write loop
* recover xruns on get_position too
* remove not used variables from struct sa_stream
Attachment #392466 - Attachment is obsolete: true
i think i found a way to trigger playback to hang faster:
if you play a video, use cpulimit on firefox

sudo cpulimit -e firefox-bin -l 10

that way you can trigger and xrun in get_position almost right away.
video is stuck after that
an attempt to store last frame and calculating mLastFrameTime from that.
this should not cause errors to accumulate for longer periods of time.

i could still break it with above mentioned trick limiting the cpu time.
(In reply to comment #26)
> i could still break it with above mentioned trick limiting the cpu time.

To fix that the code needs to be changed to constantly feed the audio. Currently when the frame's audio is played and an underrun occurs it gets stuck in this loop:


> double time;
> PRUint32 hasAudio = frame->mAudioData.Length();
> for (;;) {
> ...
>        PlayAudio(frame);
>        double hwtime = ...
>        time = hwtime < 0.0 ?
>         (TimeStamp::Now() - mPlayStartTime - mPauseDuration).ToSeconds() :
>          hwtime;
> ...
>       }

One approach could be to call PlayAudio(Frame) in QueueDecodedFrames so that we have the audio ready to be played immediately when we decode it. This will mean when we get the underrun in get_position and we have audio in other frames, then they will be played and the audio clock will continue to advance.

If this change is made then nsAudioStream's write call will also need to be changed to fix the buffering of data not yet written to Sydney Audio - it currently does a lot of copying.

Another approach is to return an error on underrun in get_position and detect that error in nsAudioStream to return -1. This will force this loop to use the system clock and continue.
What's happening is that we're stuck in that loop because the audio clock doesn't advance while the PCM device is still in the prepared state (see the code in get_position).  We write a bit of audio data then wait for the clock to advance but we're not actually playing yet so it returns 0.  With the old code, we'd break out once time == prevTime, write a bit more data and go through the same cycle.  Eventually we'd write enough data to start playback, but the audio clock is still 0 (or very small) in that loop iteration, so we wait for almost frame->mTime (which is now quite large), all of the queued audio plays while we're waiting, and we underrun.

Writing more audio data up front sort-of solves the problem, but there's still an issue where there's a (potentially large, depending on the audio buffer size) time period where the audio clock isn't advancing at all.  In fact, that's probably why we've had so much trouble with PulseAudio systems--the default buffer size will be quite large there, and we're far more likely to run into this stop-start-underrun problem in that case.
Noting some other things here so I don't forget them:

sa_stream_close leaks all of the ALSA resources if s->playing isn't true (which it won't be until after the first sa_stream_write).  We should also get doublec's other leak fix (https://trac.annodex.net/ticket/487) in the tree.

sa_stream_get_position is wrong, I think:

> s->bytes_played = s->bytes_written - snd_pcm_frames_to_bytes(s->output_unit, delay);

delay is the total delay from write to hardware output, which may be larger than the number of bytes we've written (think networked audio server, or USB sound hardware), so s->bytes_played could become negative.
j^, does the playback stall you mention in comment #25 happen even with all of the discussed changes?  And is this with raw ALSA, or PulseAudio?

The problem I'm looking at now seems to be PulseAudio specific.  I described it incorrectly in comment #28.  What's actually happening is:

- We write a bunch of audio data.
- Eventually we hit the play threshold and ALSA starts playback automatically.
- PCM state is SND_PCM_STATE_RUNNING, snd_pcm_delay returns the amount of data we've written (up to buffer_size) and snd_pcm_avail_update returns 0.
- We spend some time in the wait-for-next-frame loop, sleeping for a few frame periods, but delay and avail still haven't changed.
- Eventually (after 2-2.5 seconds!) avail fails due to XRUN, so we recover and return buffer_size.  snd_pcm_delay still reports buffer_size.
- As there's some buffer space available, we flush a bunch of queued audio data.
- The dance begins again.

This seems to happen for about 10-20 seconds, after which the playback recovers and plays like normal.

In this case, the XRUN happens somewhere between 2 and 2.5 seconds.  At the 2s mark, we sleep for 500ms in the loop, and once we wake up we've hit XRUN.  If I change the loop to only sleep for mCallbackPeriod at a time, what happens is interesting.  The behaviour is the same as described above until the 2s mark, and at around 2.2s avail finally returns non-zero (~33kBytes) and we refill the buffer.  After this point, everything seems to work correctly and we never XRUN because we've refilled the buffer early enough.
(In reply to comment #30)
> j^, does the playback stall you mention in comment #25 happen even with all of
> the discussed changes?  And is this with raw ALSA, or PulseAudio?

this was with the latest patche attached here for the alsa backend and pulseaudio but i can try with raw alsa again.
Attached file test program
Simple test program.  Run it with no arguments to use the default PCM device or "hw:0" to test against direct hardware.

An earlier version of this used to sleep for 90% of the calculated buffer playback time (maxfill - avail), leaving us 50ms (with a 500ms requested latency) to refill the buffer.  It would constantly XRUN when run with PulseAudio on my Fedora 10 machine.  Eventually I understood that this behaviour is legal according to the ALSA docs, because avail/delay/etc. may be reported with only period granularity, and so 90% would effectively wait for all the periods (depending on period/buffer ratio) to empty, causing an XRUN.

Firefox shouldn't hit that case, because we write data in approximately mCallbackTime chunks, and (generally) only sleep for a similar period when waiting for the next frame.

That leaves the snd_pcm_delay issue, which we use to calculate the expected audio position.  Using hw:0, this is relatively constant, and our expected audio position is very close to the elapsed time.  With PulseAudio, it varies widely for the first 12 seconds(!) before settling at ~350ms.  I tested with the latest alsa-lib and alsa-plugins (but kept the distro version of PA at 0.9.14, as it was too hard to build).  Adding printfs to the pcm_pulse.c plugin code, pa_stream_get_latency and pa_stream_get_time report the same values that we're getting from snd_pcm_delay and calculating as the audio position.

Next steps are:
- Test against current version of PA, then follow up with PulseAudio/ALSA developers
- Rework audio code to avoid situations where we've buffered data in the nsAudioStream and have XRUN in the backend
Assignee: nobody → kinetik
Status: NEW → ASSIGNED
> - Test against current version of PA, then follow up with PulseAudio/ALSA
developers

The strange delay behaviour doesn't happen in Fedora 12 Alpha 1 running under VirtualBox.  Under VMWare Fusion, Linux sound seems to be completely broken (still).

> - Rework audio code to avoid situations where we've buffered data in the
nsAudioStream and have XRUN in the backend

I'm working on this.
Attached patch patch v0Splinter Review
This is basically j^'s patch, with the frame counter changed to PRUint64 (since it'll never be negative), and a few minor cleanups.  Mochitests pass on OS X.
Attachment #392467 - Attachment is obsolete: true
Attachment #392508 - Attachment is obsolete: true
Attachment #398565 - Flags: review?(chris.double)
I also reviewed other places where we're using floats or doubles in nsOggDecoder and nsWaveDecoder, but it doesn't look like there's anything else immediately problematic.  There are places where we're storing frame times as floats, but they're calculated once and (at most) only have mCallbackPeriod added to them a single time after that.
Attachment #398565 - Flags: review?(chris.double) → review+
Remembered to credit j^ when I pushed:
http://hg.mozilla.org/mozilla-central/rev/3d43b81874cf
Attached patch sydneyaudio/ALSA patch v0 (obsolete) — Splinter Review
Again, a variant of j^'s patch with a few additional things changed.  Mainly using a signed value to store the result of snd_pcm_avail_update so that error results don't underflow.
Attachment #392475 - Attachment is obsolete: true
Attachment #400404 - Flags: review?(chris.double)
Attached that patch to Trac #485 upstream: https://trac.annodex.net/ticket/485
Attached patch sydneyaudio/ALSA patch v0.1 (obsolete) — Splinter Review
Actually, let's include doublec's leak fix from https://trac.annodex.net/ticket/487 in here too, since it seems to have been forgotten.
Attachment #400404 - Attachment is obsolete: true
Attachment #400404 - Flags: review?(chris.double)
Attachment #400406 - Flags: review?(chris.double)
Try to free the sa_stream_t even if snd_pcm_close fails.  This matches the behaviour of the other backends.
Attachment #400406 - Attachment is obsolete: true
Attachment #400407 - Flags: review?(chris.double)
Attachment #400406 - Flags: review?(chris.double)
Attachment #400407 - Flags: review?(chris.double) → review+
http://hg.mozilla.org/mozilla-central/rev/ad095a435171

With these patches, the symptoms in this bug are fixed.  I'll attach my patches for the sydneyaudio/buffering changes mentioned in comment 34 to one of the other audio-problem bugs.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Is there a chance this will hit 1.9.2?
Attachment #398565 - Flags: approval1.9.2?
Comment on attachment 398565 [details] [diff] [review]
patch v0

Requesting approval for both patches.  They both significantly improve playback behaviour in a number of cases.

The first patch is pretty simple and safe, and has been on trunk for a couple of weeks without any problems.  As well as fixing the problems described in this bug, it's expected to fix a number of related A/V sync issues that people may be experiencing.

The second patch is a little more invasive, but fixes obvious bugs in the Linux-specific sound code.  It has been tested by a few people with no reports of problems.
Attachment #400407 - Flags: approval1.9.2?
CCing Robert for approval consideration.
Roc: it's pretty annoying testing video on Linux without these patches, we often get hangs. It would be great if we could get them into FF3.6.
Flags: wanted1.9.2?
Attachment #398565 - Flags: approval1.9.2? → approval1.9.2+
Attachment #400407 - Flags: approval1.9.2? → approval1.9.2+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: