Closed Bug 643454 Opened 13 years ago Closed 6 years ago

Video is very choppy on Maemo/Android

Categories

(Core :: Audio/Video: Playback, defect)

x86
Linux
defect
Not set
normal

Tracking


RESOLVED INACTIVE

People

(Reporter: romaxa, Unassigned)

References


Details

Attachments

(7 files, 3 obsolete files)

I was playing with the latest upstream fennec with a patched pixman and found that video and sound stutter all the time.

I've checked oprofile and found that ~45% of the CPU is free, but video and sound are still choppy.

IIRC, ~2 months ago it was working fine.
It looks like the problem is not in the YUV conversion or slow rendering; the problem is somewhere in the decoder...

Did we change anything recently?
Yes, we've made changes over the past two months, `hg log -l 50 content/media` will give you a list of changes to the decoder engine. Can you find a regression range?
Yes, I will try to do a quick bisect.
OK, as a starting point for the bisect: m-c revision 59432:b9dcbc836bb3 plays fine even with the old pixman.
Will continue bisecting.
BAD - changeset:   63232:635bb4ffe6ad
user:        Matthew Gregan <kinetik@flim.org>
date:        Wed Mar 02 14:40:44 2011 +1300
summary:     Bug 636894 - Revert bug 634787's change to AUDIO_DURATION_MS to work around a regression in MozAudioAvailable event delivery.  r=roc a=roc

GOOD - changeset:   62891:d30bc9781cfd
user:        Matthew Gregan <kinetik@flim.org>
date:        Mon Feb 21 16:38:29 2011 +1300
summary:     Bug 546700 - Recover gracefully from servers that send Accept-Ranges but don't.  r=roc a=roc


OK, I found that the regression is roughly in this range.
Will try to test 635bb4ffe6ad with the bug 636894 changes reverted.
OK, I found
http://hg.mozilla.org/mozilla-central/rev/23cf0cedfd4a
and without this commit the video plays smoothly; with this commit, audio and video playback are choppy.
Hmm... something is wrong here...
When I use revision http://hg.mozilla.org/mozilla-central/rev/b2d9d4028d67 (63258), it plays video smoothly.
When I use 63259 with the patch http://hg.mozilla.org/mozilla-central/rev/23cf0cedfd4a reverted, it is still choppy.
Hmm... I'm stuck... sometimes it plays video smoothly, and sometimes it is choppy.
The CPU is free all the time, but the audio stream keeps getting interrupted...
OK, something is wrong with the sound, and it seems related to the PulseAudio interaction/writes...
I've tested this with sintel_trailer_800x480 from http://people.xiph.org/~tterribe/tmp/

The no-sound version plays smoothly and fast, with lots of CPU free, etc.,
but the sound version is choppy (both sound and video)...
When I sent kill -STOP to the pulseaudio process, the whole video and the fennec content process got stuck.
Not sure how it is supposed to behave, but it looks like our write to PulseAudio is synchronous, blocking decoding, and something else is going on...

I've tested Flash playback in the same environment, and it is smooth/fast (30 FPS), and audio works fine.
I've tested with the Flash plugin, and when I STOP pulseaudio, Flash continues rendering video for some time and stops decoding only after 6-7 seconds.
Commenting out the nsAudioStreamLocal::Write function makes the video play at 21 FPS (6 FPS before)...
Are there any tricks to the audio write path? Write chunk size, or non-blocking writes?
We do an extra copy of the audio data due to the Audio API. You could try applying the most recent (but obsolete) patch from bug 604682, which eliminates this copy. That patch needs to be reworked, but I'd be curious to see if it makes an impact.
Wait, scratch that, the Audio API gets called outside of nsAudioStreamLocal::Write(), so that won't help this particular problem.
Is our final write happening on a non-main thread? It looks like the sound writes de-sync the video playback.
All of the audio writes happen on a dedicated audio thread, so blocking writes are expected and shouldn't hold up video playback in general.

Would you mind trying a build with the line at

http://mxr.mozilla.org/mozilla-central/source/media/libsydneyaudio/src/sydney_audio_alsa.c#166

changed from 500000 to 1000000?  This behaviour sounds like bug 607200, which I thought we had worked around.
Tried that and it does not help... Removing nsAudioStreamLocal::Write makes the video smooth...
Also tried changing the min_write value; that has no effect either.
I tried the GStreamer backend from bug 422540, and that plays video smoothly with sound.
This is weird; we can get 25 FPS for HTML5 video with CPU to spare (I'm using HW-accelerated fennec), but audio somehow blocks us for nothing...
Has anyone experienced the same problem on Android with HW acceleration enabled? Could it be some platform/audio process scheduling problem?
ajuma and I both noticed it. We thought it was a regression with OGL layers, but we noticed the same problem without OGL layers.
I've also heard that on Maemo we should write with a specific buffer size (4096*2); if a write is not exactly that size, smaller or bigger, it will cause perf problems.
Is there some place or pref I can use to specify the preferred write buffer size?
I'm pretty sure this is going to turn out to be a(nother) bad interaction with PulseAudio.  I'll post a debug logging patch a bit later today.

(In reply to Oleg Romashin (:romaxa) from comment #20)
> I've also heard that on Maemo we should write with a specific buffer size
> (4096*2); if a write is not exactly that size, smaller or bigger, it will
> cause perf problems.
> Is there some place or pref I can use to specify the preferred write buffer
> size?

Oh, sorry, I was missing a bit of information during the IRC discussion.  You could try modifying the logic inside the else branch:

http://mxr.mozilla.org/mozilla-central/source/media/libsydneyaudio/src/sydney_audio_alsa.c#259

...so that avail is whatever your magic buffer size is.  Note that avail's units are frames (not bytes), so if your 4096*2 magic size is in bytes you'll need to convert it to frames with snd_pcm_bytes_to_frames first.
Checked it, and now we have Write calls with aCount = 1024, but I need 4096. How do I ask the decoder to decode bigger chunks?
Attached patch debug patch v0 (obsolete) — Splinter Review
Please try applying this patch and reproducing the bug, then attach a copy of the output to the bug.

There's some additional test code disabled via #if, where the original code is inside the #if 1 block and the test code is in the else block. Please also test those. The first is below; change it to #if 0 to enable the test code:

+// Disable this to test for bug 669556.
+#if 1


The second is in two parts, one in nsBuiltinDecoderStateMachine:

+// Disable to use 8k write batching path.
+#if 1

and one in sydney_audio_alsa.c:

+/* Disable to use 8k write batching path. */
+#if 1
Attached file Output with patch
Attached file Enabled 8kb buffer size (obsolete) —
Thanks.

It looks like you didn't enable the second part of the 8k buffer code for the second run; otherwise it should be logging |write(8192)->1| rather than higher numbers after the ->.

If the low frame rate happens from the start of playback, it'd be useful to have complete logs from the first 5 seconds or so.  There's a bunch of debug info printed at the start of playback that would be useful to see, too.

Given the lack of |write xrun| messages (and assuming they're not happening frequently in the parts of the log not included), that excludes bug 607200.
Attached file Enabled 8kb buffer size (obsolete) —
Attachment #559045 - Attachment is obsolete: true
Attachment #559049 - Attachment is obsolete: true
Attached file pactl list
Also found that the WebM video is choppy with the cubeb backend
http://clips.vorwaerts-gmbh.de/big_buck_bunny.webm
and even worse than with SA.
But the .ogv version:
http://clips.vorwaerts-gmbh.de/big_buck_bunny.ogv
works fine with both the SA and cubeb backends, and produces 24 FPS.

So it seems the problem is somewhere in the WebM codec / audio/video sync.
Plus, on the N9 with HW-accelerated OGV playback we have 25% CPU free at 25 FPS;
for the same WebM video with sa_write disabled, I get 20 FPS and 20% CPU free.
Attached file Profile data
OK, retested once again on another device and found different results.
I disabled skipToNextKeyframe = PR_TRUE; in order to avoid decoding interruptions and get full profile data.
I found that in both the .ogv and WebM cases we use almost all of the CPU and frame dropping kicks in; it is more visible in the WebM case because WebM is more expensive to decode.
In practice, if we disable frame dropping we get smoother video (almost no problems) while using the full CPU to good effect.
When frame dropping triggers, we seem to just break video/audio sync, stop decoding most frames, and free up CPU; instead of dropping some frames we drop almost all of them (60%).

One way to fix this problem is to free up CPU by optimizing the rendering pipeline and give the decoder more headroom (or get a more powerful device).
Another way is to make the frame-dropping mechanism more effective and skip frames without busting the whole playback...
(In reply to Oleg Romashin (:romaxa) from comment #33)
> One way to fix this problem is to free up CPU by optimizing the rendering
> pipeline and give the decoder more headroom (or get a more powerful device)

That will fail to work as soon as someone makes a larger video.

> Another way is to make the frame-dropping mechanism more effective and skip
> frames without busting the whole playback...

There isn't really a way to just skip some frames in Theora, and for VP8 you could only do it if the file was encoded specifically to allow it, but I don't think libvpx has the API to support it (basically to skip frame n in VP8 you'd need to check that a) it is not a new golden or alt-ref frame (easy) and b) frame n+1 does not use the previous frame as a predictor... for almost every file in existence b) is unlikely to happen for non-keyframes, and requires a significant amount of decoding to check).

What we _should_ do is make it harder to go into keyframe skipping mode. Because it's a decision we can't undo until we get to the next keyframe, we shouldn't activate it when there are "almost no problems".
Yes, we probably should tweak the frame-dropping conditions... but it is still bad that one trigger of skipToNextKeyframe = PR_TRUE breaks video and audio playback for 1 second.
There are many hacks that could be used to speed up video decoding at the cost of introducing visual artefacts. One more trick is to play the video a bit slower than normal and correct the audio pitch.

But the most useful solution, the one I myself would like to see as a user, would be video transcoding support: the browser detects that the video can't be played back in realtime on the available hardware and suggests that the user wait a bit until it gets re-encoded to a lower resolution. If the user has a charger and battery life is not an issue, that may be a viable solution. I actually found myself in such a situation at least once (in a hotel room with just a phone and no laptop) and regretted not being able to watch some video on the web.
Another way is to find a DSP decoding implementation for WebM, and possibly for Theora, and just use them on mobile where possible... IIUC, right now we have only x264 DSP-optimized codecs, which are accessible via GStreamer...
We have a Theora implementation for TI C64x+. http://code.entropywave.com/leonora/
(In reply to Oleg Romashin (:romaxa) from comment #37)
> Another way is to find a DSP decoding implementation for WebM, and possibly
> for Theora, and just use them on mobile where possible... IIUC, right now we
> have only x264 DSP-optimized codecs, which are accessible via GStreamer...

Right, we commissioned a C64x port of Theora called Leonora (and in fact I worked on it some myself). There were a number of issues with making it production-ready:

a) it failed on small frame sizes due to some cache flushing bug (probably could be worked around by just decoding those in software),
b) it has all the DSP resource limit and robustness problems (it'll work fine for one video in one tab, but more than that you'll have problems, many of which result in device reboots, which is a suboptimal thing to allow web page content to produce),
c) actually getting the data to the screen in RGB required either significant CPU, or custom TI kernel modules not shipped with the device that often failed to work (e.g., the first attempt to play video always failed for me) and also sometimes panicked the kernel. But see http://blog.mjg.im/2010/04/16/theora-on-n900.html for more details on that part.

On the N900 it was only a _little_ slower than the pure-software version Robin Watts and I did later with ARM asm (for an A8 chip running at 600 MHz vs. the DSP at 430 MHz). WebM would be even worse. Getting better performance would probably require explicitly managing the cache, which requires some major re-architecting of the decoder (Leonora used a mostly-unmodified libtheora with some accelerator functions written with TI intrinsics), or figuring out how to use the programmable hardware for motion compensation and loop filtering, etc., that the H.264 decoders use (at least theoretically programmable... the only docs I was ever able to find were "stick this blob of hex values into this address to enable RV9, this other blob for MPEG4, this other blob for H.264, etc.").

In other words, producing something like this is not an easy undertaking.
OK, sounds tricky, but on the other hand I think I know how to optimize the rendering pipeline in order to free up CPU for decoding... at least it is doable on Maemo:
1) Create IPC channel from video decoding thread to Chrome main-thread (could help also for android pipeline so we can avoid sync with main thread and related planes copy)
2) Make texture swapping from decoding thread to Chrome (no upload)
3) Try different ways of uploading texture:
  a) Decode yuv directly into locked EGL texture
  b) upload planes to normal texture and use YUV shader

2) and 3b) could be used on Android if we find a way to share textures between processes.

That should give us CPU headroom for decoding.
Checked the skipKeyFrame conditions in more detail:
First we lack GetDecodedAudioDuration;
if we disable that check, then later we run short of data in the video queue.

But with skipToNextKeyframe disabled, we get more essential frame dropping and the whole video becomes watchable... so I'm wondering, can we just drop it? Because with skipToNextKeyframe enabled we break the whole video experience completely...


Another idea: when decoding slowness happens, do not set skipToNextKeyframe = true (which breaks audio/video sync and needs about 1-2 seconds to recover); instead, just stop sending updates to Layout, so we temporarily give the decoder more CPU, build up more decoded audio and video data, and then resume layout rendering...
(In reply to Oleg Romashin (:romaxa) from comment #40)
> 2) Make texture swapping from decoding thread to Chrome (no upload)

Right, this would help a lot, and was planned, but hasn't happened yet. See bug 656185 comment 15. libtheora and libvpx would also benefit from modifications to allow them to decode into a user-specified buffer. I started a libtheora API design for this at http://pastebin.mozilla.org/1203306 but never did the actual implementation. Google seemed interested in doing something similar for libvpx, but that hasn't happened yet, either.
(In reply to Oleg Romashin (:romaxa) from comment #41)
> But with skipToNextKeyframe disabled, we get more essential frame dropping
> and the whole video becomes watchable... so I'm wondering, can we just drop
> it? Because with skipToNextKeyframe enabled we break the whole video
> experience completely...

Right, I think this is the easiest avenue to explore. And as I said above, will still be useful even in the face of other performance optimizations (otherwise you still get failures, just on slightly larger videos).

> Another idea: when decoding slowness happens, do not set
> skipToNextKeyframe = true (which breaks audio/video sync and needs about
> 1-2 seconds to recover); instead, just stop sending updates to Layout, so
> we temporarily give the decoder more CPU, build up more decoded audio and
> video data, and then resume layout rendering...

Yes, this is also a good idea, but it would be even better to get the cost of doing updates with GL layers turned on low enough that this doesn't matter.
> of doing updates with GL layers turned on low enough that this doesn't
> matter.
That is not only the GL layers update, but also painting, LayerManager manipulations, etc.

> modifications to allow them to decode into a user-specified buffer. I
> started a libtheora API design for this at

Currently, if we go with the locked-texture approach, we can just take the planes and do the YUV conversion directly into the locked texture buffer...
Otherwise, we upload the YUV planes into 3 textures with glTexImage2D and use a YUV shader...

But if the decoder allows decoding into a user-specified buffer, then that buffer can be a locked YUV texture, which provides a memory buffer where you can write YUV data directly. That is practically texture streaming...
> that buffer can be a locked YUV texture, which provides a memory buffer where
That is actually already available on Maemo Harmattan (N9).
Attachment #559036 - Attachment is obsolete: true
The intention of the current frame skipping logic is that the audio should continue playing back seamlessly.  If that's not happening, that's a bug.

It sounds like the decode-time frame skipping needs to be less aggressive.  I'm not sure how much tuning it has seen on low powered devices.
I tried to use texture streaming for WebM some time ago but gave up on it, because vp8 does not support writing to packed UYVY formats, which the N9 can display directly just by enabling the format via flags. The planar images would have to be displayed through a YUV shader and 3 separate textures, I guess.
(In reply to Timothy B. Terriberry (:derf) from comment #42)
> http://pastebin.mozilla.org/1203306 but never did the actual implementation.

heeen pointed out on IRC that this link is dead. I guess I just had a copy saved by SessionStore in a window I hadn't closed for a few months. http://pastebin.mozilla.org/1326966 should work.
Attached file vp8 CPU usage
I've actually implemented direct rendering and freed up some CPU, and now I have 24 FPS on that video with sound enabled, plus ~2% CPU free.
But the VP8 codec is still too expensive, ~12% CPU, with vp8_decode_mb_tokens at the top of the profile.
I think we should get an ARM version of vp8_decode_mb_tokens without waiting for bug 645284. That could give us really fast YouTube HTML5 rendering.
I've implemented direct compositing from the video thread -> chrome process, also added back the old SW yuv2rgb565 conversion path, and found that
and found that
DecoderYUV->Copy YUV to ShmemYUV->upload data into locked Texture memory
is actually slower on N9 than
DecoderYUV->Convert YUV to ShmemRGB(neon565)->copy data into locked Texture memory

With HW YUV conversion I see libGLES_v2 using almost 4x more CPU than the simple paint-into-locked-texture path; I also see some weird generic kernel interrupt eating almost the same amount of CPU.

Attached top of oprofile for both cases
Of course, copying into a locked texture is again possible only on Maemo, but it could also be that our plane upload + YUV shader code is not very friendly to generic GLES drivers...
Also noticed a strange thing... on the YouTube page, while playing video, we destroy and re-create shadow Image layers almost every second...
Depends on: 686770
Depends on: 688363
Bug 688363 covers the frame skipping problem.  I'll try to look at that soon.
Depends on: 693131, 693095
Depends on: 693905
No longer depends on: 693095
Since pastebin keeps expiring things even if they're set to be kept forever, I should just attach the proposed API from comment 42.
Component: Audio/Video → Audio/Video: Playback
Mass closing due to inactivity.
Feel free to re-open if still needed.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INACTIVE