sa_stream_write hangs indefinitely inside snd_pcm_writei

RESOLVED FIXED

Status

Core
Audio/Video
7 years ago
7 years ago

People

(Reporter: kinetik, Assigned: kinetik)

Tracking

unspecified
x86
Linux

Firefox Tracking Flags

(blocking2.0 final+)

Details

Attachments

(1 attachment)

(Assignee)

Description

7 years ago
This has been mentioned elsewhere (e.g. bug 557432 comment 48), but needs a bug to track it.
(Assignee)

Comment 1

7 years ago
This is the cause of some of the intermittent timeouts on the new Linux mochitest machines.  It has been observed but never fully debugged in other environments where PulseAudio was in use.  I believe it's a PulseAudio-only bug in the alsa-pulse plugin.

We call sa_stream_write with a chunk of audio data, expecting that it may block in snd_pcm_writei until there is enough buffer space to accommodate the new data.  In some cases, the call to snd_pcm_writei never returns and causes the audio thread (and, potentially, the entire decoder) to hang.

I can reproduce this hang fairly easily in a VirtualBox Fedora 12 VM by running multiple copies of test_playback in parallel.  Usually a single attempt where six copies of the test are run is sufficient to cause a test hang due to this particular bug.

My current theory is that the following patch (which has never been upstreamed) will fix the problem:  http://www.mail-archive.com/pulseaudio-discuss@mail.0pointer.de/msg06027.html
(Assignee)

Comment 2

7 years ago
Note that the patch to PulseAudio supplied in the linked thread (here: http://www.mail-archive.com/pulseaudio-discuss@mail.0pointer.de/msg06093.html) has been upstreamed and is available in Fedora 12 updates.  I've tested with this update applied and it does not resolve the problem we're hitting.
(Assignee)

Updated

7 years ago
Blocks: 557432
(Assignee)

Comment 3

7 years ago
More discussion about upstreaming the patch (but no action) here: https://tango.0pointer.de/pipermail/pulseaudio-discuss/2010-April/006913.html

I emailed David Henningsson to find out if he can shed any light on getting this fix upstream.

Since we have plans to rewrite sydneyaudio anyway, the quickest approach to getting this fixed locally may be to get that library rewrite done.  One of the plans for the rewrite was to use PulseAudio directly where available (which I believe would avoid this bug).
(Assignee)

Comment 4

7 years ago
David emailed alsa-devel today: http://mailman.alsa-project.org/pipermail/alsa-devel/2010-June/028837.html

I intended to follow that email up with details on what's going wrong in our case and how the patch fixes it, but now I can't reproduce the hang.
(Assignee)

Comment 5

7 years ago
So, the bad news is I made a mistake while testing this.

The good news is that the pulseaudio 0.9.21-5 package updates already available in Fedora 12 updates do fix this problem (probably due to http://www.mail-archive.com/pulseaudio-discuss@mail.0pointer.de/msg06093.html).

What happened during testing was:

1. Reproduced the problem with the shipped pulseaudio/F12 packages
2. Updated to pulseaudio 0.9.21-5 from F12 updates
3. Logged out to restart pulseaudio (which runs per-user)
4. Retested and still reproduced the problem
5. Tried David's patch and failed to reproduce the problem

Overnight, Windows 7 rebooted my machine (and, with it, my F12 VM).

Having reverted to the original pulseaudio (0.9.19-2) and repeated the steps, it turns out that logging out wasn't sufficient.  When I logged back in after restarting pulseaudio, there were *two* instances running against my user ID.  Playing sound confirmed that the old version left running from the previous login session was still being used for audio.

Rebooting (or killing the old pulseaudio instance) ensures the new version is the one being used.  Having done this, it looks very much like the pulseaudio 0.9.21-5 update solves this problem.
(Assignee)

Comment 6

7 years ago
Filed releng bug 574190 to get PulseAudio updated on the build/test slaves.
Depends on: 574190
(Assignee)

Comment 7

7 years ago
FWIW, David's patch has been upstreamed into alsa-plugins now: http://mailman.alsa-project.org/pipermail/alsa-devel/2010-June/028851.html
(Assignee)

Comment 8

7 years ago
On my new F13 x86_64 laptop I can reproduce this even with PA 0.9.21-6 and a custom build of the ALSA plugins containing David's patch. :-(  I'll investigate more when I have some time.
(Assignee)

Comment 9

7 years ago
I can't reproduce this after removing the calls to snd_pcm_pause in sa_stream_{pause,resume}, so it's likely connected with PA's async stream corking stuff.
(Assignee)

Comment 10

7 years ago
I wrote a couple of simple standalone clients to reproduce this.  The ALSA one can reproduce the problem within a few seconds.  The PA one never hangs, which either means the bug is in the alsa-plugin code (or our use of it), or my PA test client isn't doing exactly the same thing the alsa-plugin code is.  I'm still digging into this.

I did discover that a reliable workaround is to never write more than the value returned by snd_pcm_avail_update.  This avoids the wait-for-free-buffer code path in which we hang.  Our old decoder backend used to do this, but we changed to a blocking-writes model for the new decoder for simplicity.
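
A minimal sketch of that workaround, assuming the caller has already obtained the value of snd_pcm_avail_update for the stream.  The helper name frames_safe_to_write is hypothetical and is not taken from sydney_audio_alsa.c; it only shows the size check that keeps a write out of the blocking path.

```c
#include <assert.h>

/* Hypothetical helper, not from sydney_audio_alsa.c.
 *
 * wanted: number of frames the caller would like to write.
 * avail:  value most recently returned by snd_pcm_avail_update(), which is
 *         a frame count on success or a negative error code on failure.
 *
 * Returns the number of frames that can be handed to snd_pcm_writei()
 * without asking for more space than the buffer reported as free. */
static long frames_safe_to_write(long wanted, long avail)
{
    assert(wanted >= 0);

    if (avail <= 0) {
        /* Zero means the buffer is full; a negative value is an error.
         * In both cases write nothing and let the caller retry or
         * run its usual error recovery. */
        return 0;
    }
    return wanted < avail ? wanted : avail;
}
```

A caller would pass the returned count to snd_pcm_writei and keep the remaining frames for a later write.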
(Assignee)

Comment 11

7 years ago
Created attachment 486816 [details] [diff] [review]
patch v0

The hang happens on the first snd_pcm_writei after snd_pcm_pause is called to resume playback when the write is larger than snd_pcm_avail_update has reported.  If this first write is restricted in size to what snd_pcm_avail_update reported, the hang doesn't occur and subsequent blocking writes work correctly.  If the first write is smaller than avail, it's still possible to hang in snd_pcm_writei.

Based on this, it's possible to introduce a workaround to sydney_audio_alsa.c that sets a flag when sound playback is resumed, which is then used to limit the size of the first few writes after resuming to the buffer available size.  The actual bug still needs to be found and fixed upstream, but given the time it'll take to resolve that (and get fixes distributed), I think it's worth working around the problem until the fix is widespread (or we've moved to the new audio backend).  This fix should resolve a lot of random orange on the Linux test machines caused by tests hanging.
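
The flag-based workaround described above could look roughly like the sketch below.  This is not the attached patch: the struct, the function names, and the choice of two clamped writes are assumptions made for illustration, and the ALSA calls themselves are only mentioned in comments so that the logic stands alone.

```c
#include <assert.h>
#include <stddef.h>

/* How many writes after a resume get clamped.  The description says
 * "the first few writes"; the value 2 is an assumption. */
#define CLAMPED_WRITES_AFTER_RESUME 2

/* Hypothetical per-stream state; the real stream struct has other fields. */
struct resume_guard {
    int clamped_writes_left;
};

/* Call from the resume path, after snd_pcm_pause(handle, 0) has succeeded. */
static void resume_guard_on_resume(struct resume_guard *g)
{
    assert(g != NULL);
    g->clamped_writes_left = CLAMPED_WRITES_AFTER_RESUME;
}

/* Call before each snd_pcm_writei().  avail is the value returned by
 * snd_pcm_avail_update().  Returns the frame count to pass to writei. */
static long resume_guard_limit(struct resume_guard *g, long wanted, long avail)
{
    assert(g != NULL);
    assert(wanted >= 0);

    if (g->clamped_writes_left == 0) {
        /* Not recently resumed: use the normal blocking write. */
        return wanted;
    }
    if (avail <= 0) {
        /* Buffer full or error: write nothing and keep the clamp armed,
         * so the next real write is still limited. */
        return 0;
    }
    g->clamped_writes_left--;
    return wanted < avail ? wanted : avail;
}
```

Once the counter reaches zero, writes go back to the plain blocking model, so the extra snd_pcm_avail_update call is only paid for right after a resume.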

I've tested this on F13 x86_64 (including mochitests), F12 x86 in a VM, and on the N900.  The bug doesn't actually occur on the N900 as its PulseAudio predates support for the functionality necessary for snd_pcm_pause.
Assignee: nobody → kinetik
Status: NEW → ASSIGNED
Attachment #486816 - Flags: review?(chris.double)
(Assignee)

Comment 12

7 years ago
Requesting blocking since we have a workaround fix and this causes a large amount of random orange on the test machines.  It's also likely we'll see complaints about this bug in the wild once Firefox 4 goes out, as the behaviour of audio writes in the new decoder triggers this bug whereas the old decoder was (mostly) immune to it.
blocking2.0: --- → ?
(Assignee)

Updated

7 years ago
OS: Windows 7 → Linux

Updated

7 years ago
Attachment #486816 - Flags: review?(chris.double) → review+
blocking2.0: ? → final+
(Assignee)

Updated

7 years ago
Whiteboard: [needs landing]
(Assignee)

Comment 13

7 years ago
Chasing up the underlying bug on alsa and pulse mailing lists:

http://mailman.alsa-project.org/pipermail/alsa-devel/2010-November/033242.html
https://tango.0pointer.de/pipermail/pulseaudio-discuss/2010-November/008139.html
(Assignee)

Comment 14

7 years ago
http://hg.mozilla.org/mozilla-central/rev/a8c05640aa51
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Whiteboard: [needs landing]