Open Bug 1472410 Opened 2 years ago Updated 1 year ago

Broken audio/video handing (I/O interrupts on video-buffering; random rewinds on livestreams; duplicated sounds on video stopping)

Categories

(Core :: Audio/Video: Playback, defect, P3)

60 Branch
x86_64
Linux
defect

Tracking

()

UNCONFIRMED

People

(Reporter: sworddragon2, Unassigned)

References

Details

(Keywords: regression)

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
Build ID: 20100101



Actual results:

For a while (~1-2 months) I'm noticing several audio/video issues around the same time. It begun with stopping videos on YouTube (non-livestreams) which had a very rare chance that the last few milliseconds of the sound played were played immediately on stopping the video again. This is the hardest to reproduce issue and has the least impact so I'm not mainly focusing on this behavior here.

Shortly after this I noticed I/O interrupts on YouTube videos (non-livestreams; causing the audio/video processing to stop for a few milliseconds while the buffer is far ahead (but those interrupts never appear if the buffer is full - very likely because Firefox now hasn't to store anything in the cache anymore thus eliminating I/O activity and any interrupts based on it)) and random rewinds on Twitch streams (the rewinds can only appear on ~the first 2 minutes of a livestream in any amount causing a seamless audio/video rewind a few seconds into the past).

The last days I focused on trying to find the cause for the random rewinds on Twitch but I could not get a definite result for now as it needs a long time to make reliable tests as this issue reproduces very randomly (I could even get a full day without seeing this issue once while on the next day it appears nonstop). So far I did also try to downgrade some libraries that might affect this issue to a version where they were working with Firefox including some more and some less related stuff like dbus, cairo, freetype, glib, gtk, pango, ffmpeg/x264 and several X-Server related libraries but this issue did still always reproduce.

But now I'm focusing on tbe I/O interrupts as they are much more annoying, reproduce much more often and do not have a semi-workaround (like the random rewinds). While I saw both issues with Firefox 60.0 (my usual Firefox installation on Ubuntu) I could reproduce the I/O interrupts also on yesterdays nightly. I did try to reproduce the I/O interrupts on Firefox 57 but failed so far to reproduce them there. From my memory I'm also relative sure to not remember to have seen any of the 3 issues in Firefox 59 or earlier since the last versions had an issue which required in some configurations to start pulseaudio always manually if Firefox was being started to get working sound (which was already an annoying issue but had an easy workaround - and more importantly its fix landed in Firefox 60) and I'm pretty sure I was happy and didn't had any other major issues in mind once I validated it being fixed with Firefox 60.

I did even play with the thought to invest even much more time to bisect to the offending commit (based on the I/O interrupts) as those issues are causing a such significant breakage here (basically half of the web is not working well with audio/video that malfunctioning). So I did setup a building environment for Firefox and did successfully do a testbuild based on the newest nightly. On testing for the I/O interrupts here (more or less just for curiosity) interestingly I wasn't able to reproduce it anymore (an immediate retest on my default Firefox 60 installation reproduced this issue immediately). I see 3 potential causes for this:
- A random patch made it into the master the last 24 hours randomly fixing this issue (I think this is a more unlikely scenario)
- I was just unlucky and had a very bad reproduction rate (in theory this can happen anytime thus making a bisecting a bit tricky)
- Not sure if this is the case but maybe user-specific builds and public builds from Mozilla result in different binary code (like stronger binded arch-builds on endusers) causing to not have the bug in self-compiled builds (which would make bisecting to the offending commit the default way impossible)

Finding the cause for this issue is really exhausting and I'm trying to find it for around a week (maybe a bit more but I have not counted the days over this time). I mean, Firefox had several issues in the past which made working audio/video barely possible for around the last year (the current issues; the pulseaudio bug already mentioned; and the alsa_to_pulseaudio switch as a whole (which is still causing conceptional issues here)). Since the issues in the past were already significant, they needed a long time to land into a final release (I think it was a few months for the pulseaudio bug that was fixed in Firefox 60), there is not a low chance I will see the current issues for another several months and based on this I already fear the next audio/video issues soon causing another horrible experience I had serious considerations to change to another browser. But since a migration can be tricky and who knows what other issues are awaiting me I'm preferring to stay at Firefox as long as possible (otherwise I would not invest all the time here to get the current issue fixed as fast as possible).


I'm a bit out of ideas now what to do. Has anybody an idea what this behavior could have been caused (with the most likely appearance between Firefox 59 and 60 if my memory and the reproduction rate did not have rekt me). Or any other ideas how I could find the cause for this issue?
Component: Untriaged → Audio/Video: Playback
OS: Unspecified → Linux
Product: Firefox → Core
Hardware: Unspecified → x86_64
What does you machine look like? This works well for most people, so we need to understand what is different in your setup. Can you think of anything?
Flags: needinfo?(sworddragon2)
Since one of the issues is an I/O issue one could think that it might also be a bug introduced in newer kernels. However, due to a kernel panic on newer kernel versions I'm currently on an older kernel version than this issue so this case is not true.

I'm using an old HDD (~12 years old) which is probably significantly slower than todays HDD's and much more slower than SSD's. Eventually this contributes to the reproducibility of the I/O interrupts.

2 days ago this issue happened again causing I/O interrupts every few seconds and I thought what would happen if I set browser.cache.disk.enable to false. And indeed, the I/O interrupts immediately stopped. On making today tests again I could not reproduce the issue with browser.cache.disk.enable still set to false but on setting it to true the I/O interrupts appeared again every few seconds and on setting it back to false while a video was still running the I/O interrupts immediately stopped.
So at least setting browser.cache.disk.enable to false is a workaround to stop the I/O interrupts.


I also noticed that it seemed that those I/O interrupts appeared more often when I had multiple tabs open. I just tried then what would happen if I reload them while a video was running in another tab - and the I/O interrupts drastically increased in frequency and length. But at least this makes this issue very easy to reproduce here and I did test if I could reproduce the I/O interrupts on a complete new profile without any applied settings - and they did indeed appear on a new profile.
Flags: needinfo?(sworddragon2)
Priority: -- → P3
Here is an update to all 3 issues:

Since (or very shortly after) I reported this issue I did not see any random rewinds anymore on Twitch livestreams. In the meantime I did also create a new Firefox profile (and reapplied all my settings) and mainly updated my old Linux kernel. I also noticed that the Twitch layout changed a bit recently. My guess is that this was just a bug in the Twitch player that got more or less fixed in a short time (maybe that potentially broken update has only been rollout to a few people so it went more or less unnoticed to the world).

I did also not hear anymore the duplicated sound since around I reported the bug. However, this issue was very rare and maybe I was just lucky to not see it. Otherwise since I remember to have seen it only on YouTube and in the meantime also PulseAudio updated maybe one of those components fixed it.


But the worst of the 3 issues - the I/O interrupts - still exist. But with the easier reproduction from the last post I was able to definitely confirm Firefox 59.0 (downloaded from https://ftp.mozilla.org/pub/firefox/releases/59.0/linux-x86_64/de/firefox-59.0.tar.bz2) to not have this issue while Firefox 60.0 (downloaded from https://ftp.mozilla.org/pub/firefox/releases/60.0/linux-x86_64/de/firefox-60.0.tar.bz2) is affected by it. I did also test both versions with a complete new profile without applying my usual settings. I also checked if both versions are successfully writing to the cache to exclude that Firefox 59 did just not expose this bug due to being unable to write to the cache but both versions did successfully write to the cache.


Over a month ago I did also try to bisect to the offending commit. While I was shortly before able to compile the Firefox master I couldn't determine which branch/tag was responsible for Firefox 59.0 and Firefox 60.0 so I could not set the commit range for bisecting. I would give it a try again but I do not know where to continue now.
Have you seen https://mozilla.github.io/mozregression/quickstart.html ?
The following should download and test appropriate builds:

  mozregression --good 59 --bad 60
I did never use mozregression for some specific reasons. But it appears to download the nightlys to get first bad build so I just did this now manually:

Last good build: https://ftp.mozilla.org/pub/firefox/nightly/2017/10/2017-10-18-22-00-49-mozilla-central/firefox-58.0a1.en-US.linux-x86_64.tar.bz2
First bad build: https://ftp.mozilla.org/pub/firefox/nightly/2017/10/2017-10-19-10-01-07-mozilla-central/firefox-58.0a1.en-US.linux-x86_64.tar.bz2
(In reply to sworddragon2 from comment #5)
> I did never use mozregression for some specific reasons. But it appears to
> download the nightlys

After which it should narrow the range further, with autoland / inbound builds.

> I just did this now manually:

Thank you for making the effort, though this range may be too loose to be useful.
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9f8b3fc384ae6fe8053b087548c57020a55f00fd&tochange=87e3813e79396f88e63fc6cfd7f07c3dbb415195
Is there a source for autoland/inbound builds to download and test them manually? My guess would have been somewhere in https://ftp.mozilla.org/pub/firefox/tinderbox-builds/ but the directories in there seem to have stopped filling several releases ago like autoland-linux. Also I get always the error 504 on entering autoland-linux64 as this would have been my first guess (but I guess it contains only outdated builds too).

And now I'm wondering if I could just download the repository again and use 9f8b3fc384ae6fe8053b087548c57020a55f00fd as last known good commit and 87e3813e79396f88e63fc6cfd7f07c3dbb415195 as first known bad commit.
In the meantime I tried to bisect this again for https://github.com/mozilla/gecko-dev (master) with "git bisect start", "git bisect good 35a42fae81ab7979749c4c56cb12f412df651a07" and "git bisect bad 0d1588d9f76c2d382673e4a7c480092a8cf03661" (hopefully I did choose the correct commits as the checksums(?) appear to be different here (I'm not a git expert)). Unfortunately even with a second try via "git bisect skip" I get errors in the form of:

 9:20.46 make[3]: *** [/home/sworddragon/tmp/firefox/gecko-dev/config/recurse.mk:33: compile] Error 2
 9:20.46 make[2]: *** [/home/sworddragon/tmp/firefox/gecko-dev/config/rules.mk:452: default] Error 2
 9:20.46 make[1]: *** [/home/sworddragon/tmp/firefox/gecko-dev/client.mk:424: realbuild] Error 2
 9:20.46 make: *** [client.mk:175: build] Error 2


So this commitrange appears to not be able to successfully build on Linux x86_64 and I do not know a way to workaround this so bisecting to the offending commit seems to not be possible here.
I noticed that compiling with multiple threads caused parts of the real error to be shown much earier and the sheer amount of lines of compiler warnings (tens of thousands) made it not obvious (I have no idea why across pretty much any software project nobody cares about compiler warnings). In total I did run into 3 different compiling errors which required the use of --disable-elf-hack, --disable-stylo and --disable-webrtc to successfully compile. But with the above commit range I was able to successfully bisect to the offending commit: 05de18af54459cf30bcd9dfcb41325e6d1844c67 (P2: Enable cubeb remoting on Nightly. r=kinetik)

Unfortunately that patch appears to not directly introduce the bug but instead it does making cubeb sandboxing the default. Testing on an affected version of Firefox if setting media.cubeb.sandbox to false does fix the issue does indeed so. I don't know what cubeb sandboxing does in detail (I guess sandboxing audio in some manner) but it causes the performance here to be significantly worse - worse enough to be not useful at all. For now I just set media.cubeb.sandbox to false until the performance issue with the cubeb sandboxing is solved.


Dan, do you have any idea why cubeb sandboxing causes those extreme I/O-related interrupts as described in this ticket?
Flags: needinfo?(dglastonbury)
Flags: needinfo?(dglastonbury)
(In reply to sworddragon2 from comment #9)
> Dan, do you have any idea why cubeb sandboxing causes those extreme
> I/O-related interrupts as described in this ticket?

Cubeb sandboxing enables an audio interprocesss communication system to move access to pulse audio system into the master process.  With Firefox content sandboxing, the content processes that render webpages don't have access to very many system calls, ones that pulse audio needs.  This is one of the reasons that Firefox was unable to start pulse audio and needed a manual work around before FF60.

By I/O-related interrupts, do you mean audio "glitches" as-in the playback isn't smooth and has portions of silence?  With cubeb sandboxing this would be caused by the IPC system not delivering rendered audio fast enough. As @padenot mentions in comment 1, the system works well for most people, but in your system case it's not working well enough.

Can you copy the results of about:support and add it as as an attachment to this bug?
Flags: needinfo?(sworddragon2)
(In reply to Dan Glastonbury (:kamidphish) | needinfo me from comment #10)
> By I/O-related interrupts, do you mean audio "glitches" as-in the playback
> isn't smooth and has portions of silence?

Indeed, the audio just stops playing for estimated around a few hundred milliseconds. For testing usually I open some websites and a YouTube video and while the video plays I'm reloading the other websites over and over again (this triggers those interrupts a few times per minute). I did also test now what happens when I use 2 different profiles (via --no-remote and --profile) while 1 Firefox instance is only playing the video and the other is only reloading websites: This caused the issue to trigger at the same rate as with a single profile.


(In reply to Dan Glastonbury (:kamidphish) | needinfo me from comment #10)
> As
> @padenot mentions in comment 1, the system works well for most people, but
> in your system case it's not working well enough.

There are some things that might cause I/O operations here to be a bit slower than on other systems:
- I'm on Linux and using ecryptfs on my home directory (which also contains ~/.cache/mozilla) which will result in a bit overhead. However, I think ecryptfs is enabled in Ubuntu installations at default (but I'm not absolutely sure about this).
- My home directory does enable full data journaling (ecryptfs on an ext4 filesystem) to reduce potential data loss (but my system also disables storing of file access times to improve performance).
- Dirty bytes (storage data still in memory) is flushed around every second to reduce data loss on a power loss.
- I'm still on a HDD (and it is pretty old - around 12 years if I'm not wrong).


Also I'm not perfectly fine with the I/O performance on Linux and created several years ago a ticket about it (freezes of several seconds in extreme and usually rare cases - up to minutes when the I/O is caused by swapping). Swapping is to be configured to be done as late as possible and since I'm on 16 GiB of memory here I usually never see any swapping activities.

That all might be sort of an edge-case: While the I/O is not that much it is probably higher on watching videos than on other common systems. But it should not cause those crazy interrupts.


> Can you copy the results of about:support and add it as as an attachment to
> this bug?

I did temporary create a new profile, started Firefox, reproduced the issue there and closed all tabs and opened about:support in a new tab. The raw data is in the attachments.
Flags: needinfo?(sworddragon2)
500 Internal Server Error
Sorry, something went wrong.

A team of highly trained monkeys has been dispatched to deal with this situation.

If you see them, send them this information as text (screenshots frighten them):
    APkpgMXAMs6gJmudm24S8dSp6yNyq6hrtk4YIIXpxT4zIgJ8B824-VEG
    gd7SYvHsP17RX6F-Qn5sy275nhMETFTuOhOR55IBCdtw9cfHVShnvaPj
    LnkTzChGvi9BJ-CKaViMC9wtwa7cL31gDa955zP6bzNPlNI6Bg08CPyo
    zuoLbrobQ5y5aJvRRaOMz641ksAMgJDK3N5ErGvwMW8A9-MRSO2KhL9X
    lkpRHLV8RQ8PAkSfDkIaTh6DtKEBYac8PLdBZ3YcAiAeES5Mk0X0G0BD
    3JPuSJiCJBvfnsYvrTSmUBKW3PwnMN4iwav08jMrXrJ3rGnw9sfr-rb8
    1BQ-d845S9HMZ_3NUAsctgXMWKagC_LXNrKo7lR2Ynhawv4CVHp7MWky
You need to log in before you can comment on or make changes to this bug.