Closed Bug 1299588 Opened 8 years ago Closed 8 years ago

nightly 20160830 Video players fail at vimeo and youtube without multiprocess enabled.

Categories

(Core :: Graphics, defect)

51 Branch
defect
Not set
normal

Tracking

RESOLVED DUPLICATE of bug 1304347
Tracking Status
platform-rel --- ?

People

(Reporter: u532768, Unassigned)

References

Details

(Whiteboard: [platform-rel-Vimeo][gfx-noted])

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
Build ID: 20160829154746

Steps to reproduce:

Compiled nightly from local hg repository.
Go to Vimeo or Youtube and try to watch a video



Actual results:

Sometimes the first video will play without multiprocess enabled.  But eventually, the video starts to play, then seems to get stuck showing just the same two frames in alternating sequence.  If I click to a new location, this happens with a different two frames.  The sound is not affected, and plays as if the video was playing normally.


Expected results:

The video should have played as usual.
This all started after https://bugzilla.mozilla.org/show_bug.cgi?id=1296684 was fixed.  Before this, nightly worked properly.  When the above bug was fixed, I began getting an error about accessibility, which I usually have turned off.  I opened this bug for that, https://bugzilla.mozilla.org/show_bug.cgi?id=1298229, and in the meantime enabled accessibility to allow nightly to compile.  This is the nightly that is failing to play videos properly.

If multi process is required to play videos, why not leave it turned on, you might ask.  Because with multi process enabled, video downloaders don't work, and the maff plugin can't save pages.  I realize you don't maintain those, but shouldn't the non multi process version of nightly still be able to play videos?  If the answer is no, I'll close this ticket since it has nothing to do with firefox.
New information.  After experiencing the video failures and opening the ticket, I went back to the pages that failed, and they started working.  I had to close the videos and play them again, and one flashed a little to start with, but firefox seems to be working well again in non multi process mode.  Maybe it was just tired.  Or it got wind of the ticket.  :-)
It started again after a few that played correctly, and the files that played before exhibit the problem now.  I waited a while, and it didn't correct itself this time.  Looks like it's multi process time.
Still happening with today's nightly, 20160831.  Happened immediately on the first video I played.
And then the next two videos worked fine.  I don't see how anyone can work on this with such intermittent behavior.  It's like it's a race condition that occurs only occasionally, and can be reset by as yet unknown actions in the browser.
So, the previous errors were all at vimeo.  I went to youtube to try to duplicate them.  And could not.  I thought this had occurred at youtube, but maybe I'm mis-remembering.  Anyway, it doesn't seem to be a problem there with the latest version of nightly (20160831).  Played maybe a dozen videos, and it didn't happen.
Component: Untriaged → Audio/Video: Playback
Product: Firefox → Core
This morning it is happening on youtube again.  How can it be so hit and miss?
The bug for accessibility has been fixed, and it seemed to fix this issue as well.  But eventually the video player at youtube started misbehaving, and then vimeo also stopped working.  The same files that had just played successfully no longer would.  Closing and opening the browser didn't work; the only solution seems to be using multi-process.

When I get some time I'll try compiling with the nightly internal defaults for configuration.  And do a bisection using the mozilla binaries to see if they have the regression.  The latter is a long shot, since there would be a lot more excitement about this problem in that case.  No, I think this is a corner case caused by my configuration selections.  Some subtle dependency, only occasionally triggered.
I realise it will take some time, but it would be helpful to isolate it down to which configuration options are necessary to reproduce the issue. I suggest removing the --enable-optimize first to rule out compiler bugs.
I compiled without the O3 in the optimization.  I left in the rest of the switches, since I am compiling with gcc 6.1, and without those switches, nightly crashes on start up.  Not the warning, the other three.

" -Wall -fno-schedule-insns2 -fno-lifetime-dse -fno-delete-null-pointer-checks "

The browser still has the problem, especially at youtube.  I noticed yesterday that it also has the problem at finance.yahoo.com.  The little videos that play automatically were all doing the flicker.  There, it's an improvement.  :-)

And the user response time was noticeably slower without the O3.  I'll be going back to using that.

The next options I'll re-enable are the evercookie options for safe-browsing, url-conv, since I only disabled those recently.

No other options have changed since it stopped working.
(In reply to stan from comment #10)
> No other options have changed since it stopped working.

If the official builds work and the builds you're making don't then we want to know the difference between those two sets so that whichever developer ends up looking at the issue doesn't need to make a large number of changes in their environment. For example gcc-5.4 comes with standard Ubuntu stable.
Your comment makes perfect sense.

Unfortunately, if you look at the attached .mozconfig, you'll see that there are *lots* of differences from the official builds.  The number of possibilities of all the possible combinations is probably something like 8!.  Way too many to try them all.

I did try with the safe-browsing and url-classifier enabled, along with O3, and that had the problem also.  

With O3 turned off, the problem wasn't as frequent, and the browser sometimes recovered from the problem if I opened another tab and played another video, and then went back to the tab with the problem video.  The slow response time with O2 was partly due to the background compile of nightly using a lot of resources.  But even after that, the O2 compile was perceptibly slower than the O3 compile.  I immediately noticed the difference when I opened the browser with the newest configuration using O3.

I think, for now, as a workaround, I'll use O2, and turn off all the options I originally had turned off.  The problem happens rarely enough that it isn't that annoying, and I'll just accept the slower browser responses.  And I can always just swap between multi process and no multi process (I think that's called e10s) if I decide to use O3 instead.

Three weeks ago, everything was working fine.  I've considered the possibility that there was an update to the kernel or a package that changed something, but mplayer has zero problems playing any videos locally, and it would be using the same drivers and kernel resources as nightly.  I am using a 4.8 kernel, which is the development kernel right now.  But I was using a 4.8 kernel when it was working.

It would be useful if I could use my original configuration, and try bisecting to find the commit that caused the problem.  But the original webrtc include problem precludes doing that.  That's a thought: I'll try enabling webrtc and see if that compile has the problem.  If it does, I can bisect until I find the commit that caused the issue.

I'm sanguine with you just letting this slide since it is such a hard problem to approach.  If the problem persists, I'll keep trying tweaks to see if anything fixes the issue.  If it is better for you to have the ticket closed, go ahead and close it.  I'll just update the closed ticket if I find something.
nightly compiled with webrtc enabled exhibits the problem, so bisection is possible.  When I look at the bisect command, it wants revisions, but what I know are dates.  How do I use hg to tell me what revision a specified time on a date corresponds to?  I didn't see a way to do this in the man page.  I presume it has something to do with the hg log command, but it isn't clear to me.

The dates are 
2016-08-17 15:00 good
2016-08-29 15:00 bad

Thanks.
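
A minimal sketch of one way to turn those two timestamps into revisions and seed the bisection, assuming a local Mercurial clone and the `date()` and `last()` revset functions. The helper names `rev_at_date` and `start_bisect_by_date` and the `HG` override variable are hypothetical, introduced only for illustration. Note that a changeset's date is the date recorded when it was committed, which can be earlier than the day it landed in the repository, and that local revision numbers differ between clones.

```shell
HG=${HG:-hg}   # hypothetical override so the helpers can be dry-run

# Print the highest local revision number dated at or before "$1".
# "<DATE" is Mercurial's date-spec syntax for "at or before".
rev_at_date() {
    "$HG" log -r "last(date('<$1'))" --template '{rev}\n'
}

# Seed a bisection from a known-good and a known-bad timestamp.
start_bisect_by_date() {
    good_rev=$(rev_at_date "$1") || return 1
    bad_rev=$(rev_at_date "$2") || return 1
    "$HG" bisect --reset &&
    "$HG" bisect --good "$good_rev" &&
    "$HG" bisect --bad "$bad_rev"
}

# Usage, run from inside the clone:
#   start_bisect_by_date '2016-08-17 15:00' '2016-08-29 15:00'
# then build, test, and mark each step with
#   hg bisect --good    or    hg bisect --bad
```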
Errata: in comment 12, the number of combinations is 2**8 not 8!, since the options are binary.
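
The arithmetic behind that correction, as a small sketch: with 8 independent on/off options there are 2^8 configurations, far fewer than 8! orderings. The variable names are only illustrative.

```shell
n=8                       # number of binary (on/off) options

# Every option is either on or off, so there are 2^n configurations.
combos=$((1 << n))

# For comparison, n! is the figure quoted earlier.
fact=1
i=2
while [ "$i" -le "$n" ]; do
    fact=$((fact * i))
    i=$((i + 1))
done

echo "2^$n = $combos configurations, versus $n! = $fact"
```

Even 256 full builds is impractical by hand, which is why bisecting over commits is the more attractive route.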

The gcc 6.2 compiler just installed, so I will be trying the various nightly compiles using that compiler to see if it will make a difference to this error.
Compiling with gcc6.2 and O3 and O2 both had the problem.  I decided to try O0, without the switches, in case it was related to https://bugzilla.mozilla.org/show_bug.cgi?id=1245783  But, compiling with O0 also has the issue, so it isn't an optimization issue.  It seems to be more prevalent at youtube than elsewhere, at least lately.

I'll keep plugging.
Next trial, used only default config options except for

ac_add_options --enable-application=browser
ac_add_options --prefix="/usr/local"
mk_add_options MOZ_MAKE_FLAGS="-j5" 
ac_add_options --enable-optimize=" -Wall -fno-schedule-insns2 -fno-lifetime-dse -fno-delete-null-pointer-checks -O3"

Compiled, and everything seems to work properly.  I've done everything I can to make this fail, and it hasn't.  So, narrowed down to a dependency that isn't explicitly defined on one or more of the options I'm disabling.

Back to the combinatorics.  I think I'll look at the bisect some more.  Lots of compiles required, but at least it gets to the root of the problem.
It failed.  So I compiled it with only 

ac_add_options --enable-application=browser
ac_add_options --prefix="/usr/local"
mk_add_options MOZ_MAKE_FLAGS="-j5"

letting nightly select its own level of optimization.  And that worked at first too, but eventually failed also.  Back to the drawing board.
platform-rel: --- → ?
Whiteboard: [platform-rel-Vimeo]
Today, I used mozregression to try and find the problem.  I've had good luck with it in the past, but that was for errors that were simple to test.  I ran the 'bad' version for about half an hour and had no failures.  But, because the error can take a while to show up, I can't categorically say that it is 'good'.  The experience of using a browser with no protection was so bad that I just gave up; I don't want to run a browser like that.

Which led me to a thought.  Is it possible that the plugins I am using that e10s disables, (video downloaders, maff saved pages), or even those it doesn't, are actually causing the problem?  It seems unlikely, since the browser can work fine for long periods of time, but it is a difference with the stock compile.

The other difference is the compiler and compile environment (gcc, glibc, etc).  It seems newer here than on the stock compile, and maybe that is the cause.

Just emphasizes the need to get the bisect done.
OK, the bisect is done, but the result doesn't seem to make sense.

First I did a 
hg log -d "2016-08-17 to 2016-08-29" | less
to get the revision numbers.  
The bad revision was the first item, 313379
The good revision was the last item, 309733

Then I ran the bisect, which took 11 compiles and tests of the resulting builds.  The bad ones were easy; the problem is very obvious.  For the good ones, I ran 50 videos and, if there were no errors, called it good.  No way to prove a negative.
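
A rough sketch of the numbers involved, under stated assumptions. The step count treats history as a straight line of local revision numbers, which a real repository is not, so it is only an estimate. The failure rate `p` is made up, and the model assumes each video fails independently, which may not hold for a bug that seems to stick once triggered.

```shell
good=309733   # known-good local revision quoted above
bad=313379    # known-bad local revision quoted above
span=$((bad - good))

# Each bisect step roughly halves the remaining range.
steps=0
n=$span
while [ "$n" -gt 1 ]; do
    n=$((n / 2))
    steps=$((steps + 1))
done
echo "range of $span revisions, about $steps build-and-test steps"

# Chance that a bad build still passes 50 videos in a row, if each
# video independently shows the glitch with probability p.
p=0.05        # hypothetical per-video failure rate
miss=$(awk -v p="$p" 'BEGIN { printf "%.3f", (1 - p) ^ 50 }')
echo "chance of wrongly calling a bad build good at p=$p: $miss"
```

A lower `p` makes a false "good" much more likely, so a suspicious bisect result is worth re-testing on the revisions either side of it.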

The bisect said that this revision is causing the problem, but it is from outside the date window.

$ hg log -v -r 310316
changeset:   310316:32fd2e8a3be8
user:        Andrew Comminos <andrew@comminos.com>
date:        Fri Jun 24 17:56:26 2016 -0400
files:       gfx/thebes/gfxPlatformGtk.h
description:
Bug 594876 - Accelerate layers by default for nightlies on Linux. r=nical

MozReview-Commit-ID: FtGqib9SIFD

It definitely failed with this one, and not the one before it, but how did it get here?  Or, more to the point, how did a commit in June get a revision number from August?
GL layers is nightly only, so I can P3 this.
Priority: -- → P3
Thanks for the regression window, stan.  The date in the message is when the patch was written, but it was pushed in August.
http://hg.mozilla.org/mozilla-central/rev/32fd2e8a3be8

Workaround for Nightly is
Preferences -> Advanced -> Use hardware acceleration when available

I think it needs a restart to take effect properly, though new windows might use the new preference.

If that works, then that confirms the regression window.

Can you paste the graphics section of about:support, please?
Moving to graphics.
Component: Audio/Video: Playback → Graphics
Priority: P3 → --
Thanks for the clarification on the date.

I already have that preference enabled.  In fact, I had it enabled while I was doing the bisect.  Are you saying that I shouldn't have the problem if that is enabled?  If so, then that isn't true, because it has been enabled all along.  Is it possible that having that enabled in combination with e10s is the workaround you mean?  Because the problem doesn't occur if e10s (Preferences -> General -> Enable multi-process nightly) is enabled.

Here's the graphics section

Graphics
Features
Compositing	Basic
Asynchronous Pan/Zoom	none
WebGL Renderer	X.Org -- Gallium 0.4 on AMD CAICOS (DRM 2.46.0 / 4.8.0-0.rc5.git4.1.20160911.fc25.x86_6
WebGL2 Renderer	X.Org -- Gallium 0.4 on AMD CAICOS (DRM 2.46.0 / 4.8.0-0.rc5.git4.1.20160911.fc25.x86_6
Hardware H264 Decoding	No
Audio Backend	pulse
GPU #1
Active	Yes
Description	X.Org -- Gallium 0.4 on AMD CAICOS (DRM 2.46.0 / 4.8.0-0.rc5.git4.1.20160911.fc25.x86_6
Vendor ID	X.Org
Device ID	Gallium 0.4 on AMD CAICOS (DRM 2.46.0 / 4.8.0-0.rc5.git4.1.20160911.fc25.x86_6
Driver Version	3.0 Mesa 12.0.2
Diagnostics
AzureCanvasAccelerated	0
AzureCanvasBackend	skia
AzureContentBackend	cairo
AzureFallbackCanvasBackend	none
CairoUseXRender	0
Decision Log
HW_COMPOSITING	
blocked by default: Acceleration blocked by platform
OPENGL_COMPOSITING	
unavailable by default: Hardware compositing is disabled
As expected, nightly with today's updates still has the problem when hardware acceleration is selected and multi-process is not.

If I disable hardware acceleration, the problem seems to be gone even without multi-process, but the video response is noticeably degraded (at least it seems so to me).

So, I came up with another workaround.
I do the hg pull to update.
Then run the command
hg backout -r 310316 --no-commit
Compile and install nightly
Then run the command
hg update --clean .
This reverts the backout so hg won't cavil at me for uncommitted changes when I pull in the next day's updates.
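
The steps above could be wrapped in a small helper, sketched here. The function name `rebuild_without` and the `HG` and `BUILD_CMD` variables are hypothetical. The default build command `./mach build` is an assumption, since the original steps do not say how the compile and install are run, so substitute whatever you normally use.

```shell
HG=${HG:-hg}                          # hypothetical override for dry runs
BUILD_CMD=${BUILD_CMD:-./mach build}  # assumed build step; adjust to taste

# Pull, back out one changeset without committing, build, then discard
# the uncommitted backout so the next pull starts from a clean tree.
rebuild_without() {
    rev=$1
    "$HG" pull -u || return 1
    "$HG" backout -r "$rev" --no-commit || return 1
    $BUILD_CMD
    status=$?
    # Drop the backout even if the build failed.
    "$HG" update --clean .
    return "$status"
}

# Usage, run from the top of the source tree:
#   rebuild_without 310316
```

The backout may stop applying cleanly once later changes touch the same file, in which case the helper returns early and the tree needs attention by hand.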

And I have the behavior I had before the revision went in.  I don't really like having to do this, but if I have to, I have to.  And responsiveness seems to be better again.  Subjective, but that's what matters.
I'm assuming the config from comment 23 is once the acceleration has been disabled?

Does the official nightly build (not your local build) have a problem on your platform when acceleration is enabled, and e10s is on?  There is a lot of information in this bug, and I want to make sure I understand.

Until August we wouldn't have had the acceleration anyway, so disabling it gets you back to where we were, but it would be nice to be able to have it, even without e10s.
"I'm assuming the config from comment 23 is once the acceleration has been disabled?"

Yes, that was from the last good bisect, so revision 310316 wasn't in place.  I would have had the hardware acceleration preference enabled, with e10s turned off.  Do you need it with the revision in place?  I'll compile tomorrow's updates with the revision active if you do.

"Does the official nightly build (not your local build) have a problem on your platform when acceleration is enabled, and e10s is on?"

When I used mozregression to test the official builds, they didn't have a problem for the testing I did.  But I didn't check the configuration.  Whatever they were compiled with is what I tested with.  Do you need a test with a specific configuration?  mozregression is easy to use, and I could use it to pull a single official build from after the revision, change the preferences, and test again.

"There is a lot of information in this bug, and I want to make sure I understand."

The 'lots of information' is because I was floundering trying to figure out why this was suddenly happening.  That it happened during two other bugs that prevented me from compiling nightly with my custom configuration only muddied the waters further.

"Until August we wouldn't have had the acceleration anyway, so disabling it gets you back to where we were, but it would be nice to be able to have it, even without e10s."

So, my subjective experience was probably due to slow response from youtube.  I notice that when prime time on the East Coast of the US hits, it gets slow, and my testing was around that time.  But that's good news because it means I don't need to do all the manipulation to avoid the problem, just turn off hardware acceleration.
Whiteboard: [platform-rel-Vimeo] → [platform-rel-Vimeo][gfx-noted]
It seems to be more than just two frames repeated.  Once this starts happening, then seeking seems to introduce another frame.

Graphics
--------

Features
Compositing: OpenGL
Asynchronous Pan/Zoom: none
WebGL Renderer: X.Org -- Gallium 0.4 on AMD JUNIPER (DRM 2.43.0, LLVM 3.5.0)
WebGL2 Renderer: X.Org -- Gallium 0.4 on AMD JUNIPER (DRM 2.43.0, LLVM 3.5.0)
Hardware H264 Decoding: No
Audio Backend: pulse
GPU #1
Active: Yes
Description: X.Org -- Gallium 0.4 on AMD JUNIPER (DRM 2.43.0, LLVM 3.5.0)
Vendor ID: X.Org
Device ID: Gallium 0.4 on AMD JUNIPER (DRM 2.43.0, LLVM 3.5.0)
Driver Version: 3.0 Mesa 11.0.6

Diagnostics
AzureCanvasAccelerated: 0
AzureCanvasBackend: skia
AzureContentBackend: skia
AzureFallbackCanvasBackend: none
CairoUseXRender: 0

(In reply to stan from comment #23)
> I already have that preference enabled.  In fact, I had it enabled while I
> was doing the bisect.

Disable "Use hardware acceleration when available" to work around.
Status: UNCONFIRMED → NEW
Ever confirmed: true
See Also: → 1304347
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
comment 27:
Disable "Use hardware acceleration when available" to work around.

This is what I'm doing, and it has been working fine.  So the workarounds are either to enable both multi-process and hardware acceleration, or to disable hardware acceleration.