Closed Bug 573924 Opened 14 years ago Closed 14 years ago

sa_stream_write hangs indefinitely inside snd_pcm_writei

Categories

(Core :: Audio/Video, defect)

x86
Linux
defect
Not set
normal

RESOLVED FIXED
Tracking Status
blocking2.0 --- final+

People

(Reporter: kinetik, Assigned: kinetik)

Attachments

(1 file)

This has been mentioned elsewhere (e.g. bug 557432 comment 48), but needs a bug to track it.
This is the cause of some of the intermittent timeouts on the new Linux mochitest machines.  It has been observed but never fully debugged in other environments where PulseAudio was in use.  I believe it's a PulseAudio-only bug in the alsa-pulse plugin.

We call sa_stream_write with a chunk of audio data, expecting that it may block in snd_pcm_writei until there is enough buffer space to accommodate the new data.  In some cases, the call to snd_pcm_writei never returns and causes the audio thread (and, potentially, the entire decoder) to hang.

I can reproduce this hang fairly easily in a VirtualBox Fedora 12 VM by running multiple copies of test_playback in parallel.  Usually a single attempt where six copies of the test are run is sufficient to cause a test hang due to this particular bug.

My current theory is that the following patch (which has never been upstreamed) will fix the problem:  http://www.mail-archive.com/pulseaudio-discuss@mail.0pointer.de/msg06027.html
Note that the patch to PulseAudio supplied in the linked thread (here: http://www.mail-archive.com/pulseaudio-discuss@mail.0pointer.de/msg06093.html) has been upstreamed and is available in Fedora 12 updates.  I've tested with this update applied and it does not resolve the problem we're hitting.
Blocks: 557432
More discussion about upstreaming the patch (but no action) here: https://tango.0pointer.de/pipermail/pulseaudio-discuss/2010-April/006913.html

I emailed David Henningsson to find out if he can shed any light on getting this fix upstream.

Since we have plans to rewrite sydneyaudio anyway, the quickest approach to getting this fixed locally may be to get that library rewrite done.  One of the plans for the rewrite was to use PulseAudio directly where available (which I believe would avoid this bug).
David emailed alsa-devel today: http://mailman.alsa-project.org/pipermail/alsa-devel/2010-June/028837.html

I intended to follow that email up with details on what's going wrong in our case and how the patch fixes it, but now I can't reproduce the hang.
So, the bad news is I made a mistake while testing this.

The good news is that the pulseaudio 0.9.21-5 package updates already available in Fedora 12 updates do fix this problem (probably due to http://www.mail-archive.com/pulseaudio-discuss@mail.0pointer.de/msg06093.html).

What happened during testing was:

1. Reproduced the problem with the shipped pulseaudio/F12 packages
2. Updated to pulseaudio 0.9.21-5 from F12 updates
3. Logged out to restart pulseaudio (which runs per-user)
4. Retested and still reproduced the problem
5. Tried David's patch and failed to reproduce the problem

Overnight, Windows 7 rebooted my machine (and, with it, my F12 VM).

Having reverted to the original pulseaudio (0.9.19-2) and repeated the steps, it turns out that logging out wasn't sufficient.  When I logged back in after restarting pulseaudio, there were *two* instances running against my user ID.  Playing sound confirmed that the old version left running from the previous login session was still being used for audio.

Rebooting (or killing the old pulseaudio instance) ensures the new version is the one being used.  Having done this, it looks very much like the pulseaudio 0.9.21-5 update solves this problem.
Filed releng bug 574190 to get PulseAudio updated on the build/test slaves.
Depends on: 574190
FWIW, David's patch has been upstreamed into alsa-plugins now: http://mailman.alsa-project.org/pipermail/alsa-devel/2010-June/028851.html
On my new F13 x86_64 laptop I can reproduce this even with PA 0.9.21-6 and a custom build of the ALSA plugins containing David's patch. :-(  I'll investigate more when I have some time.
I can't reproduce this after removing the calls to snd_pcm_pause in sa_stream_{pause,resume}, so it's likely connected with PA's async stream corking stuff.
I wrote a couple of simple standalone clients to reproduce this.  The ALSA one can reproduce the problem within a few seconds.  The PA one never hangs, which either means the bug is in the alsa-plugin code (or our use of it), or my PA test client isn't doing exactly the same thing the alsa-plugin code is.  I'm still digging into this.

I did discover that a reliable workaround is to never write more than the value returned by snd_pcm_avail_update.  This works because the write then never enters the wait-for-free-buffer code in which we hang.  Our old decoder backend used to do this, but we changed to a blocking-writes model for the new decoder for simplicity.
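
A minimal sketch of that workaround, not the code from sydneyaudio: each write is clamped to the free space the backend reports.  The helper name clamp_write_frames is hypothetical.  It assumes the caller has already obtained `avail` from snd_pcm_avail_update (which returns a negative value on error) and will pass the result to snd_pcm_writei.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of ALSA or sydneyaudio.
 *
 * wanted: frames the caller would like to write (must be >= 0).
 * avail:  value the caller got from snd_pcm_avail_update(), i.e. the
 *         number of frames that fit without blocking, or a negative
 *         error code.
 *
 * Returns the frame count to hand to snd_pcm_writei() so that the
 * write never exceeds the reported free space.  Returns 0 when there
 * is no space or when avail is an error; handling that error is left
 * to the caller.
 */
static long clamp_write_frames(long wanted, long avail)
{
    assert(wanted >= 0);
    if (avail <= 0) {
        return 0;
    }
    return wanted < avail ? wanted : avail;
}
```

A caller would loop, writing clamp_write_frames(remaining, avail) frames per iteration and waiting for space whenever the result is 0, which is roughly the non-blocking model the old decoder backend used.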
Attached patch: patch v0
The hang happens on the first snd_pcm_writei call after snd_pcm_pause is called to resume playback, when that write is larger than the space snd_pcm_avail_update has reported.  If this first write is restricted in size to what snd_pcm_avail_update reported, the hang doesn't occur and subsequent blocking writes work correctly.  If the first write is smaller than avail, it's still possible to hang in snd_pcm_writei.

Based on this, it's possible to introduce a workaround to sydney_audio_alsa.c that sets a flag when sound playback is resumed, which is then used to limit the size of the first few writes after resuming to the available buffer size.  The actual bug still needs to be found and fixed upstream, but given the time it'll take to resolve that (and get fixes distributed), I think it's worth working around the problem until the fix is widespread (or we've moved to the new audio backend).  This fix should resolve a lot of random orange on the Linux test machines caused by tests hanging.
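
The idea can be sketched roughly as below.  This is an illustration under stated assumptions, not the attached patch: the struct, the function names, and the count of limited writes (the text above only says "the first few") are all hypothetical.  In real code `avail` would come from snd_pcm_avail_update and the returned count would be passed to snd_pcm_writei.

```c
#include <assert.h>

/* Assumption: the number of writes to limit after a resume is a guess. */
#define LIMITED_WRITES_AFTER_RESUME 2

/* Hypothetical stand-in for the relevant part of the stream state. */
struct stream_state {
    int limited_writes_left; /* > 0 while writes must be clamped */
};

/* Call right after playback is resumed via snd_pcm_pause(). */
static void note_resumed(struct stream_state *s)
{
    s->limited_writes_left = LIMITED_WRITES_AFTER_RESUME;
}

/*
 * Decide how many frames to pass to the next snd_pcm_writei() call.
 * wanted: frames still to be written (must be >= 0).
 * avail:  value from snd_pcm_avail_update(), negative on error.
 */
static long next_write_frames(struct stream_state *s, long wanted, long avail)
{
    assert(wanted >= 0);
    if (s->limited_writes_left == 0) {
        return wanted; /* normal case: full-size blocking write */
    }
    if (avail <= 0) {
        /* No space (or an error): write nothing, keep the limit armed. */
        return 0;
    }
    s->limited_writes_left--;
    return wanted < avail ? wanted : avail;
}
```

The limit is only consumed by writes that actually go out, so a full buffer right after a resume does not use up the protected writes.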

I've tested this on F13 x86_64 (including mochitests), F12 x86 in a VM, and on the N900.  The bug doesn't actually occur on the N900 as its PulseAudio predates support for the functionality necessary for snd_pcm_pause.
Assignee: nobody → kinetik
Status: NEW → ASSIGNED
Attachment #486816 - Flags: review?(chris.double)
Requesting blocking since we have a workaround fix and this causes a large amount of random orange on the test machines.  It's also likely we'll see complaints about this bug in the wild once Firefox 4 goes out, as the behaviour of audio writes in the new decoder triggers this bug whereas the old decoder was (mostly) immune to it.
blocking2.0: --- → ?
OS: Windows 7 → Linux
Attachment #486816 - Flags: review?(chris.double) → review+
Whiteboard: [needs landing]
http://hg.mozilla.org/mozilla-central/rev/a8c05640aa51
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: [needs landing]