Closed Bug 1340718 Opened 3 years ago Closed 3 years ago

If a audio device (Citrix) disappears and fails on re-open attempts, output can be hung until restart

Categories

(Core :: Audio/Video: MediaStreamGraph, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox-esr45 53+ fixed
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 53+ fixed
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: jesup, Assigned: padenot)

References

(Depends on 1 open bug)

Details

Attachments

(3 files, 1 obsolete file)

If using an audio device such as a Citrix pseudo-mic & speakers, if the Citrix connection disconnects, the audio device fails and returns error codes, and also we see devices being removed.  When cubeb tries to reopen devices, it can end up in an ERROR state (3), and after that audio cannot be restarted since datacallbacks will not happen again.
Rank: 12
Comment on attachment 8839520 [details]
Bug 1340718 - Retry getting the default device when audio devices are present and the first call failed.

https://reviewboard.mozilla.org/r/114136/#review115750

::: media/libcubeb/src/cubeb_wasapi.cpp:709
(Diff revision 1)
> -  if (FAILED(hr)) {
> +    if (FAILED(hr)) {
> -    LOG("Could not get default audio endpoint: %x\n", hr);
> +      LOG("Could not get default audio endpoint: %x\n", hr);
> -    SafeRelease(enumerator);
> +      Sleep(100);
> -    return hr;
> -  }
> +    }
> +  } while (FAILED(hr) && hr != E_NOTFOUND);

It's hard to review this without information about how it was debugged with a description of what', specifically, is happening.  What error is being returned from GetDefaultAudioEndpoint in this situation?

This fix seems a bit hacky and my concern is there are other situations where we get an error here and instead of falling back to no sound we end up stuck in this loop.

Which caller of get_default_endpoint does this matter for?  If this kind of loop/sleep is a necessary workaround, it might be better handled by the caller.

::: media/libcubeb/src/cubeb_wasapi.cpp:713
(Diff revision 1)
> -  }
> +    }
> +  } while (FAILED(hr) && hr != E_NOTFOUND);
>  
>    SafeRelease(enumerator);
>  
>    return ERROR_SUCCESS;

There's now a situation where GetDefaultAudioEndpoint fails (hr == E_NOTFOUND) where we used to return E_NOTFOUND and now return ERROR_SUCCESS.
Attachment #8839520 - Flags: review?(kinetik) → review-
If this is happening after a device loss event where we try to reconfigure but the new target/no available device is ready to configure, I think a better approach would be to have a "settle" delay between the device loss event (which we don't control) and when we attempt to reconfigure.  IIRC, Chromium has something like this (a single 200ms delay, last time I looked) to try to deal with this sort of thing - the basic problem being that we trigger reconfigure on the first device loss event but there can be a stream of device change events over time and there's no logical beginning or end of them at which to start the reconfigure process.
I forgot to send my message explaining the patch, sorry. Also, thinking again about it, I think this is not ideal (for the reason you found), I'll come up with something better.

Also, yes, I've implemented some sort of debouncing for the reconfigure events: we receive a bunch of notification, and only reconfigure once, after things have settled. This is sitting on a branch on my windows machine for now.
Attached patch patchSplinter Review
Much better patch, only touching `GraphDriver.cpp`. As far as cubeb is concerned, the job is done: the error state is being signaled properly. The graph driver now decides to fallback to a `SystemClockDriver`. The graph then decides to re-try periodically to switch to a new `AudioCallbackDriver` (because at least one track needs to output audio). I removed the error case about returning when we can't get the preferred latency, this makes us go into the case where we fallback to a `SystemClockDriver`.

STR:
- On windows, disable all the audio devices, apart from a device that can be unplugged (I used a USB headset)
- Run any demo that outputs sound and uses an MSG (web audio/webrtc, whatever)
- Notice that there is sound
- Unplug the USB device
- At this point, the machine has no audio devices, falls back to a system thread, notice that with this patch, the graph still runs (can't output audio though)
- Re-plug the USB device, notice that audio starts to flow again, with correct timing (no buffering or anything)

I think this is similar with what the folks we're talking with experience with Citrix.

This is easily back-portable, and is a mozilla-central patch for now.
Attachment #8840522 - Flags: review?(rjesup)
Comment on attachment 8840522 [details] [diff] [review]
patch

Review of attachment 8840522 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/GraphDriver.cpp
@@ +1048,5 @@
>  AudioCallbackDriver::StateCallback(cubeb_state aState)
>  {
>    LOG(LogLevel::Debug, ("AudioCallbackDriver State: %d", aState));
> +  if (aState == CUBEB_STATE_ERROR) {
> +    // Fall back to a driver using a normal thread.

Add a comment that device changes (re-plug) will cause it to reopen and switch back to an AudioCallbackDriver

@@ +1051,5 @@
> +  if (aState == CUBEB_STATE_ERROR) {
> +    // Fall back to a driver using a normal thread.
> +    MonitorAutoLock lock(GraphImpl()->GetMonitor());
> +    SystemClockDriver* nextDriver = new SystemClockDriver(GraphImpl());
> +    SetNextDriver(nextDriver);

Comment that we can't use SwitchOnNextIteration since there won't be one.
Attachment #8840522 - Flags: review?(rjesup) → review+
Component: Audio/Video: cubeb → Audio/Video: MediaStreamGraph
Attached patch patch for esr45Splinter Review
Same fix, changed a few calls so that it compiles on ESR45, tested, it still works as intended.
Attachment #8840916 - Flags: review?(rjesup)
Pushed by paul@paul.cx:
https://hg.mozilla.org/integration/mozilla-inbound/rev/71bbaf9a0d0f
When the cubeb stream errors out while driving a MediaStreamGraph, fall back to a SystemClockDriver. r=jesup
Comment on attachment 8840916 [details] [diff] [review]
patch for esr45

Review of attachment 8840916 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/GraphDriver.cpp
@@ +892,5 @@
> +    mNextDriver = nextDriver;
> +    DebugOnly<bool> found = mGraphImpl->RemoveMixerCallback(this);
> +    nextDriver->SetGraphTime(this, mIterationStart, mIterationEnd);
> +    mGraphImpl->SetCurrentDriver(nextDriver);
> +    nextDriver->Start();

Clone the comment about SwitchAtNextIteration.
Attachment #8840916 - Flags: review?(rjesup) → review+
https://hg.mozilla.org/mozilla-central/rev/71bbaf9a0d0f
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment on attachment 8840916 [details] [diff] [review]
patch for esr45

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
This is critical to Intuit who is using Firefox for doing video CSR support for tax preparation (and also other Intuit products).  They use ESR, and are a prime example of an ESR user.  All updates to Firefox are done centrally on the Citrix servers, and they try to avoid CSRs being bugged by Firefox update requests they can't resolve.

User impact if declined: When a CSR has a network glitch (or citrix glitch) and loses connectivity for a few seconds, they'll have to restart the browser on the server, and likely will lose contact with the customer and have to have them call in again.

Fix Landed on Version: 54.  Now that it's verified it will be uplifted to all versions.  We're asking for ESR52 as well.

Risk to taking this patch (and alternatives if risky): Very low risk patch; uses the normal sequence to restart the MediaStreamGraph on cubeb/driver failure callback.

String or UUID changes made by this patch: none

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #8840916 - Flags: approval-mozilla-esr52?
Attachment #8840916 - Flags: approval-mozilla-esr45?
Not sure I understand why we need to uplift it to 45? This isn't a new regression. 
Can Intuit workaround the issue?
We don’t usually uplift patch that isn’t a security fix this late in ESR. We can keep this open, and if we end up with a dot release it could ride along.
Comment on attachment 8840916 [details] [diff] [review]
patch for esr45

This isn't something that we usually do but here, Randell explained me the rational (partner related) and it makes sense to have this exception.
Attachment #8840916 - Flags: approval-mozilla-esr52?
Attachment #8840916 - Flags: approval-mozilla-esr52+
Attachment #8840916 - Flags: approval-mozilla-esr45?
Attachment #8840916 - Flags: approval-mozilla-esr45+
Polo, I guess we want to uplift that to all other branches, right?
Merci et bon week end!
Flags: qe-verify+
Flags: needinfo?(padenot)
(In reply to Sylvestre Ledru [:sylvestre] from comment #16)
> Polo, I guess we want to uplift that to all other branches, right?
> Merci et bon week end!

Yes, thanks.
Flags: needinfo?(padenot)
Attached patch Beta patchSplinter Review
Attachment #8839520 - Attachment is obsolete: true
Attachment #8847656 - Flags: approval-mozilla-beta?
Comment on attachment 8847656 [details] [diff] [review]
Beta patch

Thanks Paul, let's get this in 53.0b3
Attachment #8847656 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
[Tracking Requested - why for this release]:
Still an issue on Fx 53b10, Win 10 x64:
1. Open https://mozilla.github.io/webrtc-landing/gum_test.html
2. Click 'audio'
3. Unplug the USB headset

Actual results:
"Firefox has stopped working" error message displayed. Closing the error pop-up also closes Firefox.
Browser console log after the browser restarts:
"IndexedDB UnknownErr: ActorsParent.cpp:594 (unknown)
UnknownError  IndexedDBHelper.jsm:71:12
PushService:stateChangeProcessEnqueue: Error transitioning state UnknownError  PushService.jsm:160"
Also, there are no crashes in about:crashes.
Thoughts?
Flags: needinfo?(padenot)
Paul, this is in contradiction with https://bugzilla.mozilla.org/show_bug.cgi?id=1354528#c10, are you sure you're testing with beta 10 ?
Flags: needinfo?(padenot) → needinfo?(paul.silaghi)
Yes, but I tested on https://archive.mozilla.org/pub/firefox/candidates/53.0b10-candidates/build2/win32/en-US/ which was build 07-Apr-2017, while Emil tried on a tinderbox build from 09-Apr-2017.
So, I guess bug 1354528 is a follow-up for this, but only for the beta branch?
Flags: needinfo?(paul.silaghi) → needinfo?(padenot)
Anyway, 54.0a2 (2017-04-10) and 55.0a1 (2017-04-09) don't crash like beta, but after re-plug the headset, I can no longer hear myself on https://mozilla.github.io/webrtc-landing/gum_test.html / audio.
Paul, please open a new bug about this.
Flags: needinfo?(padenot) → needinfo?(paul.silaghi)
Depends on: 1355773
Filed bug 1355773.
Meanwhile dropping qe-verify+, will verify later when bug 1355773 gets fixed.
Flags: qe-verify+
Flags: needinfo?(paul.silaghi)
You need to log in before you can comment on or make changes to this bug.