Closed Bug 870564 Opened 7 years ago Closed 7 years ago

Youtube video freezes after a long time (~30 mins)

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:leo+, firefox22 wontfix, firefox23 wontfix, firefox24 fixed, b2g18+ verified, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)

RESOLVED FIXED
1.1 QE2 (6jun)
blocking-b2g leo+
Tracking Status
firefox22 --- wontfix
firefox23 --- wontfix
firefox24 --- fixed
b2g18 + verified
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix
b2g-v1.1hd --- fixed

People

(Reporter: diego, Assigned: mikeh)

References

()

Details

(Whiteboard: c=video [TD-26549], MiniWW, [YouTubeCertBlocker+])

Attachments

(8 files, 1 obsolete file)

This youtube video freezes after a very long time

https://www.youtube.com/watch?v=V1IBH1bz5-c

I've seen it fail in two different ways:

* Green screen. This seems to be a wi-fi connection issue. May be device specific

* Screen freeze. This looks like a problem somewhere in the media graph. I often see a "Timed out waiting for output buffers: 0/0" message for audio buffers.
blocking-b2g: --- → tef?
Assignee: nobody → sotaro.ikeda.g
Depends on: 863441
By Bug 863441, youtube video playback becomes somewhat stable. But there are still some problems.
Depends on: 864230
I did not see green. But failed in 2 ways.
  - [1] video freeze(but ui works)
  - [2] video app crash at
         mozilla::net::HttpChannelChild::OnStartRequest ( Bug 868073 )
I debugged about [1].
[1] happend by nsMediaCacheStream::Read() stall forever. OMXCodec calls the function with grabbing OMXCodec's mutex. Then the other threads(decoder thread and ImageBridge thread) waited forever to get the mutex.
It seems that both [1] and [2] are caused by HttpChannelChild.

The following diagram is about ChannelMediaResource.

https://github.com/sotaroikeda/firefox-diagrams/blob/master/media/content_media_ChannelMediaResource_FirefoxOS_1_01.pdf?raw=true
Whiteboard: [TD-26549]
Target Milestone: --- → 1.1 QE2
Not necessarily a blocker for v1.0.1/v1.1, but we would take a low risk uplift if available. Videos on YouTube that are >30min long are very uncommon, and it's not yet clear if this is specific to long videos or this video (we need more repro cases).
blocking-b2g: tef? → -
tracking-b2g18: --- → +
Blocks: 871753
Whiteboard: [TD-26549] → c= [TD-26549]
Whiteboard: c= [TD-26549] → c=video [TD-26549]
Status: NEW → ASSIGNED
blocking-b2g: - → leo?
Priority: -- → P1
Assignee: sotaro.ikeda.g → mhabicher
Triage 5/20 -
TEF? per blocking a tef+ bug 871753.
If decided not to be tef+, then please leo+.
blocking-b2g: leo? → tef?
(In reply to Wayne Chang [:wchang] from comment #7)
> Triage 5/20 -
> TEF? per blocking a tef+ bug 871753.
> If decided not to be tef+, then please leo+.

bug 871753 is already fixed in all branches. Can you clarify how does this one blocks 871753?
needsinfo wayne here to help with comment #8
Flags: needinfo?(wchang)
Bug 871753 seems to help, but I still saw the time out message around 40 minutes into the video:

E/OMXCodec(  674): [OMX.qcom.audio.decoder.aac] Timed out waiting for output buffers: 0/2

Sotaro,

Are you able to play the youtube video in the description all the way?
I cannot play youtube at the end of contents.

If I try to play URL in comment 0, the screen freezes at 200 seconds from the start, 70~80% probability.
Depends on: 874325
(In reply to Daniel Coloma:dcoloma from comment #8)
> (In reply to Wayne Chang [:wchang] from comment #7)
> > Triage 5/20 -
> > TEF? per blocking a tef+ bug 871753.
> > If decided not to be tef+, then please leo+.
> 
> bug 871753 is already fixed in all branches. Can you clarify how does this
> one blocks 871753?

This block was set by Sotaro, perhaps we need him to comment if this is still true.
Flags: needinfo?(wchang) → needinfo?(sotaro.ikeda.g)
The block was set because Youtube freeze happens by some reasons on a same youtube video. And it is difficult to differentiate which freeze happens just from the freezed UI.
Flags: needinfo?(sotaro.ikeda.g)
thread 18, blocked here:
#7  nsMediaCacheStream::Read (this=0x446b8030, aBuffer=<value optimized out>, aCount=<value optimized out>, aBytes=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/nsMediaCache.cpp:2103
#8  0x40899d08 in mozilla::ChannelMediaResource::Read (this=<value optimized out>, aBuffer=0x80 <Address 0x80 out of bounds>, aCount=4294965384, aBytes=0x0)
    at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/MediaResource.cpp:754
#9  0x40872a9c in android::MediaStreamSource::readAt (this=0x44011320, offset=5362745, data=0x43bbfc00, size=340) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:101

thread 17 and 16, blocked on 'mutex=0x41f25410':
- thread 17: android::OMXCodec::signalBufferReturned
- thread 16: android::OMXCodec::read

thread 9, blocked here:
#0  poll () at bionic/libc/arch-arm/syscalls/poll.S:10
#1  0x4167036a in _pr_poll_with_poll (pds=0x42ca8020, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920
#2  0x41670546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322
#3  0x404426d2 in nsSocketTransportService::Poll (this=<value optimized out>, wait=<value optimized out>, interval=0x434c2e24) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:397
#4  0x40442bf0 in nsSocketTransportService::DoPollIteration (this=0x41f13100, wait=true) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:781
#5  0x40442ddc in nsSocketTransportService::Run (this=0x41f13100) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:645

- thread 8: XPCJSRuntime::WatchdogMain
- thread 7: js::SourceCompressorThread::threadLoop
- thread 6: js::GCHelperThread::threadLoop
- thread 5: nsCycleCollectorRunner::Run

thread 4, here:
#0  __ioctl () at bionic/libc/arch-arm/syscalls/__ioctl.S:9
#1  0x40107b5c in ioctl (fd=<value optimized out>, request=8) at bionic/libc/bionic/ioctl.c:41
#2  0x41858a6e in android::IPCThreadState::talkWithDriver (this=0x41bfe110, doReceive=true) at frameworks/base/libs/binder/IPCThreadState.cpp:804
#3  0x41859354 in android::IPCThreadState::joinThreadPool (this=0x41bfe110, isMain=false) at frameworks/base/libs/binder/IPCThreadState.cpp:457

thread 2, blocked here:
#5  android::MPEG4DataSource::readAt (this=0x42cd3a20, offset=2, data=0x44030000, size=17040) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:164
#6  0x4173e4da in android::MPEG4Source::read (this=0x41b4f240, out=0x100ffc70, options=<value optimized out>) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:2410
#7  0x4174c6f4 in android::OMXCodec::drainInputBuffer (this=0x41f25360, info=0x42cd29f0) at frameworks/base/media/libstagefright/OMXCodec.cpp:4046
Sotaro, I'm trying to reproduce this bug on Inari, and what I'm seeing is the video play for ~5 to 10 minutes without trouble, but it progressively starts to degrade. It doesn't stop all of a sudden.

Instead, it looks like the video just stops playing; then a few minutes later, there is some renewed network activity (at least according to the arrows in the status bar), then the video resumes (albeit it stutters a lot) with audio; then everything seems to catch up for a while and play smoothly, before the video stops again.

With each stop/restart, it takes longer for the video to resume, and it plays for a much shorter time before stopping again.

logcat shows regular occurrences of these statements while this is happening:

E(  872:0x3ea) [OMX.qcom.audio.decoder.aac] Timed out waiting for output buffers: 0/0
W(  872:0x405) obtainBuffer() track 0x43e794c0 disabled, restarting
E(  117:0xe0) Parsing error in AudioFilter.csv.
W(  872:0x405) obtainBuffer() track 0x43e794c0 disabled, restarting
W(  117:0xe0) Thread AudioOut_1 cannot connect to the power manager service
Flags: needinfo?(sotaro.ikeda.g)
mikeh,

I see the same timeout. I don't see any stutter though.

As a side note a very common problem I've been seeing is MAC address clashing. The reference devices my team has all share the same MAC address by default. Fortunately we were able to override it via a setting file called softmac.

Any chance the inaris have the same problem? Can you compare the MAC address of two inaris?
Relevant code:
- "obtainBuffer() track 0x43e794c0 disabled, restarting"
  http://androidxref.com/4.0.4/xref/frameworks/base/media/libmedia/AudioTrack.cpp#915
- comment:
  // restart track if it was disabled by audioflinger due to previous underrun

- "Timed out waiting for output buffers: 0/0"
  http://androidxref.com/4.0.4/xref/frameworks/base/media/libstagefright/OMXCodec.cpp#3429
(In reply to Diego Wilson [:diego] from comment #16)
> 
> As a side note a very common problem I've been seeing is MAC address
> clashing. The reference devices my team has all share the same MAC address
> by default. Fortunately we were able to override it via a setting file
> called softmac.
> 
> Any chance the inaris have the same problem? Can you compare the MAC address
> of two inaris?

Diego, I would hope not--I'm running the tests on my home network.  :)  The MAC on my Inari is 90:1D:27:02:B5:A7.

It's interesting that after a ~10 to 15-minute "freeze", the YouTube video in comment 0 has now resumed playing completely smoothly, with no more pausing.
I also observe that during the periods where the audio plays smoothly, the video is either frozen, or significantly behind the audio track.
mikeh,

Sounds like your seeing bug 863441 and bug 864230. The first one was fixed in gecko a couple of weeks ago. The second one is a fix in gonk. You may need an updated vendor build for that.
diego, I'm running a build from a completely fresh repo that I pulled this morning (~10am, EDT). Would that include the fix in bug 864230?
Flags: needinfo?(dwilson)
It is not enabled on Mozbuild. Add patch to https://github.com/mozilla-b2g/gonk-patches is necessary. The github hadles patchs only for leo and hamachi now.
Flags: needinfo?(sotaro.ikeda.g)
Maybe there needs to be a new flag for that? Or was it supposed to be marked "Fixed" until it made it there?
Flags: needinfo?(dwilson)
(In reply to Mike Habicher [:mikeh] from comment #15)
> Sotaro, I'm trying to reproduce this bug on Inari, and what I'm seeing is
> the video play for ~5 to 10 minutes without trouble, but it progressively
> starts to degrade. It doesn't stop all of a sudden.
> 
//snip
> 
> logcat shows regular occurrences of these statements while this is happening:
> 
> E(  872:0x3ea) [OMX.qcom.audio.decoder.aac] Timed out waiting for output
> buffers: 0/0

This error could be reported, when MediaStreamSource::readAt() was blocked for a longer time.
(In reply to Sotaro Ikeda [:sotaro] from comment #22)
>
> It is not enabled on Mozbuild. Add patch to
> https://github.com/mozilla-b2g/gonk-patches is necessary. The github hadles
> patchs only for leo and hamachi now.

If the fix for these video issues is common to all platforms, it sounds like gonk-patches needs an 'all' subfolder.
The rationale from comment 13 does not change our blocking decision, and comment 15 is being resolved in other bugs. Not blocking.
blocking-b2g: tef? → leo+
diego, sotaro, with:
- gecko: b2g18:2f8bf6531f39 ...which includes the bug 863441 patch
- gaia: c51082d92f18c4c7cdbc4f96c977be735500562a

...and with the bug 864230 patch applied to gonk, I see YouTube videos freeze.

Not all the time: the video in comment 0 has played through at least twice today, completely, without error; but on the third attempt, I see the incoming TCP connection's receive window collapse and the video stop after ~1 second of playback.

Are there any other patches I could be missing?
blocking-b2g: leo+ → tef?
Flags: needinfo?(dwilson)
- thread 19, mutex=0x41f26510:
#5  android::OMXCodec::read (this=0x41f26460, buffer=0x42da5360, options=0x2) at frameworks/base/media/libstagefright/OMXCodec.cpp:5061
#6  0x40880de8 in android::OmxDecoder::ReadVideo (this=0x42da5280, aFrame=0x426c6cc0, aTimeUs=187238095, aKeyframeSkip=false, aDoSeek=false) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:556
#7  0x4087f59e in nsMediaOmxReader::DecodeVideoFrame (this=0x41f42780, aKeyframeSkip=@0x426c6e1f, aTimeThreshold=2) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/nsMediaOmxReader.cpp:133

- thread 16, also mutex=0x41f26510:
#5  android::OMXCodec::signalBufferReturned (this=0x41f26460, buffer=0x80) at frameworks/base/media/libstagefright/OMXCodec.cpp:5227
#6  0x41773d26 in android::MediaBuffer::release (this=0x43dcb980) at frameworks/base/media/libstagefright/MediaBuffer.cpp:102
#7  0x4087fcac in android::OmxDecoder::ReleaseAllPendingVideoBuffersLocked (this=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:769

- thread 18, cond=0x442c7824, mutex=0x43d77284:
#8  0x408a6a70 in mozilla::ChannelMediaResource::Read (this=<value optimized out>, aBuffer=0x80 <Address 0x80 out of bounds>, aCount=4294964942, aBytes=0x0) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/MediaResource.cpp:754
#9  0x4087fa94 in android::MediaStreamSource::readAt (this=0x442c7a50, offset=15881796, data=0x43d8b000, size=238) at /home/mikeh/dev/mozilla/m-c/b2g18/content/media/omx/OmxDecoder.cpp:101
#10 0x4176bc34 in android::MPEG4DataSource::readAt (this=<value optimized out>, offset=15881796, data=0x43d8b000, size=238) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:172
#11 0x4176bc34 in android::MPEG4DataSource::readAt (this=<value optimized out>, offset=15881796, data=0x43d8b000, size=238) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:172
#12 0x4176bee8 in android::MPEG4Source::read (this=0x43d930b0, out=0x45163c70, options=<value optimized out>) at frameworks/base/media/libstagefright/MPEG4Extractor.cpp:2329

- thread 9:
#2  0x4167c546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322
#3  0x4044f6d2 in nsSocketTransportService::Poll (this=<value optimized out>, wait=<value optimized out>, interval=0x435dae24) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:397
#4  0x4044fbf0 in nsSocketTransportService::DoPollIteration (this=0x41f13100, wait=true) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:781
#5  0x4044fddc in nsSocketTransportService::Run (this=0x41f13100) at /home/mikeh/dev/mozilla/m-c/b2g18/netwerk/base/src/nsSocketTransportService2.cpp:645

In the PR_Poll call (frame #2), timeout=4294967295 is 0xFFFFFFFF, or -1, which means (for the POSIX version of poll()[1], at least) no timeout.

1. http://pubs.opengroup.org/onlinepubs/007908799/xsh/poll.html
Depending on how much I trust the gdb backtrace, the thread 9 frame 2 call looks completely wrong:

#2  0x4167c546 in PR_Poll (pds=0xfffffffc, npds=1, timeout=4294967295) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:4322

According to this[1], 'pds' is a pointer type, in which case a value of 0xfffffffc (-2 as a uint32_t) is most definitely bogus.

1. https://developer.mozilla.org/en-US/docs/PR_Poll
Further to comment 20, nsSocketTransportService::Poll()[1] sets the 'pds' argument to one of three values:
- if nsSocketTransportService::mPollList->fd is non-zero, pds = mPollList
- if nsSocketTransportService::mPollList->fd is zero, and:
  - if nsSocketTransportService::mActiveCount is non-zero, pds = mPollList + 1
  - if nsSocketTransportService::mActiveCount is zero, pds = nullptr

Assuming mPollList is a valid pointer, none of the valid paths should lead to pds = 0xfffffffc.

mPollList is initialized in the constructor by calling moz_xmalloc() (which is infallible[2]), and freed in the destructor by calling moz_xfree().

mPollList can also grow if a process uses a LOT of sockets[3] (which watching a long YouTube video seems to do--I've observed one new TCP connection per 800KiB to 1.8MiB of data), by calling moz_xrealloc(). It does this in 100-socket increments, once mActiveListSize exceeds SOCKET_LIMIT_MIN (currently defined as 50)[4].

1. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#361
2. https://developer.mozilla.org/en-US/docs/Infallible_memory_allocation
3. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#318
4. http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsSocketTransportService2.cpp#40
One thing I've noticed about the all-green screen: it seems to happen when WiFi cycles. I haven't been able to determine if it's due to a complete disconnect, or if it's just a rekeying; but I wonder if there's an error code somewhere that isn't getting propagated properly.
Of course, having posted that, I just managed to get a green-screen without a WiFi cycle.
Assuming comment 27 is a mid-air collision and setting back to leo+. No new tef block information provided.
blocking-b2g: tef? → leo+
Some gdb debugging:

(gdb) list 3920
3905	                pds[index].out_flags = 0;
3906	            }
3907	        }
3908	        if (0 == ready)
3909	        {
3910	            switch (timeout)
3911	            {
3912	            case PR_INTERVAL_NO_WAIT: msecs = 0; break;
3913	            case PR_INTERVAL_NO_TIMEOUT: msecs = -1; break;
3914	            default:
3915	                msecs = PR_IntervalToMilliseconds(timeout);
3916	                start = PR_IntervalNow();
3917	            }
3918	
3919	retry:
3920	            ready = poll(syspoll, npds, msecs);
3921	            if (-1 == ready)
3922	            {
3923	                PRIntn oserror = errno;
3924	
3925	                if (EINTR == oserror)
3926	                {
3927	                    if (timeout == PR_INTERVAL_NO_TIMEOUT)
3928	                        goto retry;
3929	                    else if (timeout == PR_INTERVAL_NO_WAIT)
3930	                        ready = 0;  /* don't retry, just time out */
3931	                    else
3932	                    {
3933	                        elapsed = (PRIntervalTime) (PR_IntervalNow()
3934	                                - start);
(gdb) p npds
$1 = 1
(gdb) p msecs
$2 = -1
(gdb) p syspoll
$3 = (struct pollfd *) 0x4359dbcc
(gdb) p *syspoll
$4 = {fd = 21, events = 1, revents = 0}
(gdb) info frame
Stack level 1, frame at 0x4359ddf8:
 pc = 0x4015236a in _pr_poll_with_poll (/home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920); saved pc 0x40152546
 called by frame at 0x4359de00, caller of frame at 0x4359dbb0
 source language c.
 Arglist at 0x4359dbb0, args: pds=0x42ba8020, npds=1, timeout=4294967295
 Locals at 0x4359dbb0, Previous frame's sp is 0x4359ddf8
 Saved registers:
  r4 at 0x4359ddd4, r5 at 0x4359ddd8, r6 at 0x4359dddc, r7 at 0x4359dde0, r8 at 0x4359dde4, r9 at 0x4359dde8, r10 at 0x4359ddec, r11 at 0x4359ddf0, lr at 0x4359ddf4
(gdb) p pds
$5 = (PRPollDesc *) 0x42ba8020
(gdb) p *pds
$6 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1}
(gdb) p *fd
Cannot access memory at address 0xffffffff
(gdb) p fd
$7 = -1
(gdb) p pds->fd
$8 = (PRFileDesc *) 0x41a03c00
(gdb) p *pds->fd
warning: can't find linker symbol for virtual table for `(null)' value
$9 = warning: can't find linker symbol for virtual table for `(null)' value
{methods = 0x4015bbe8, secret = 0x41a01270, lower = 0x42ba6820, higher = 0x0, dtor = 0x40143d21 <pl_FDDestructor>, identity = 1}
(gdb) p POLLIN
No symbol "POLLIN" in current context.
(gdb) p (PRInt16*)malloc(sizeof(PRInt16))
$10 = (PRInt16 *) 0x1f83ef0
(gdb) p *(PRInt16*)0x1f83ef0
$11 = 30024
(gdb) p (pds[0].fd->methods->poll)(pds[0].fd, pds[0].in_flags & 0xfffd, (PRInt16*)0x1f83ef0)
warning: can't find linker symbol for virtual table for `(null)' value
$12 = 1
(gdb) p pds[0]
$13 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1}
(gdb) p *(PRInt16*)0x1f83ef0
$14 = 0
(gdb) p 0xd
$15 = 13
(gdb) p (pds[0].fd->methods->poll)(pds[0].fd, pds[0].in_flags & 0xfffe, (PRInt16*)0x1f83ef0)
warning: can't find linker symbol for virtual table for `(null)' value
$16 = 0
(gdb) p *(PRInt16*)0x1f83ef0
$17 = 0
(gdb) p PR_GetIdentitiesLayer(pds[0].fd, (PRDescIdentity)0)
warning: can't find linker symbol for virtual table for `(null)' value
$18 = (PRFileDesc *) 0x42ba6820
(gdb) p pds[0]
$19 = {fd = 0x41a03c00, in_flags = 1, out_flags = 1}
(gdb) p *(PRFileDesc*)0x42ba6820
warning: can't find linker symbol for virtual table for `(null)' value
$20 = warning: can't find linker symbol for virtual table for `(null)' value
{methods = 0x4015be70, secret = 0x41a03c20, lower = 0x0, higher = 0x41a03c00, dtor = 0, identity = 0}
(gdb) p *(PRFileDesc*)0x42ba6820->secret
Attempt to extract a component of a value that is not a structure pointer.
(gdb) p ((PRFileDesc*)0x42ba6820)->secret
warning: can't find linker symbol for virtual table for `(null)' value
$21 = (PRFilePrivate *) 0x41a03c20
(gdb) p *((PRFileDesc*)0x42ba6820)->secret
warning: can't find linker symbol for virtual table for `(null)' value
$22 = {state = -1431655766, nonblocking = 0, inheritable = _PR_TRI_TRUE, next = 0x0, lockCount = 0, md = {osfd = 21, tcp_nodelay = 0}}
(gdb) p/x *((PRFileDesc*)0x42ba6820)->secret
warning: can't find linker symbol for virtual table for `(null)' value
$23 = {state = 0xaaaaaaaa, nonblocking = 0x0, inheritable = 0x1, next = 0x0, lockCount = 0x0, md = {osfd = 0x15, tcp_nodelay = 0x0}}
(gdb) p ready
$24 = 0
(gdb)
blocking-b2g: leo+ → leo?
It also looks like fd=21 has no data waiting:

(gdb) p (unsigned int*)malloc(sizeof(unsigned int))
$25 = (unsigned int *) 0x1f83f00
(gdb) p FIONREAD
No symbol "FIONREAD" in current context.
(gdb) p ioctl(21, 0x541b, (unsigned int*)0x1f83f00)
$26 = 0
(gdb) p *(unsigned int*)0x1f83f00
$27 = 0

This would be inconsistent with the TCP.rwin=0 observed in the video data stream.
Further to comment 29 and comment 30, the value pds=0xfffffffc seems to be incorrect, possibly a by-product of an optimized build.  Probing with gdb (see comment 35) suggests that a valid PRPollDesc pointer is ultimately making it into PR_Poll().
Hmm, looks like the socket (in this case, fd=21) is not (really) a socket. I can't make any sense out of the 'st_mode' member of the struct stat returned by fstat():

(gdb) p (struct stat*)malloc(sizeof(struct stat))
$50 = (struct stat *) 0x1f86000
(gdb) p fstat(21, (struct stat*)0x1f86000)
$51 = 0
(gdb) p *(struct stat*)0x1f86000
$52 = {st_dev = 7, __pad0 = "\000\000\000", __st_ino = 2979, st_mode = 4480, st_nlink = 1, st_uid = 0, st_gid = 0, st_rdev = 0, __pad3 = "\000\000\000", st_size = 0, st_blksize = 4096, st_blocks = 0, 
  st_atime = 1369663045, st_atime_nsec = 661191246, st_mtime = 1369663045, st_mtime_nsec = 661191246, st_ctime = 1369663045, st_ctime_nsec = 661191246, st_ino = 2979}
(gdb) p ((struct stat*)0x1f86000)->st_mode
$53 = 4480
(gdb) p S_IFMT
No symbol "S_IFMT" in current context.
(gdb) p 00170000
$54 = 61440
(gdb) p ((struct stat*)0x1f86000)->st_mode & 61440
$55 = 4096

According to the AOSP version of stat.h, 4096 is 'S_ISVTX', which seems to mean:

  "on directories, restricted deletion flag"[1]

1. http://pubs.opengroup.org/onlinepubs/7908799/xsh/sysstat.h.html
Most recent reproduction of this problem occurred after 50m30s of playing the comment 0 video. gdb shows that the socket thread is blocked on the poll() call, while Wireshark shows there is no longer any network traffic between the DUT and YouTube.

Wireshark (a 798MB log--too big to upload) shows:

1. a session on local port 46297 receiving ~1.5MiB of data from YT, its Rwin hitting 0
2. Rwin update of 10880
3. FIN send to YT
4. TCP keep-alives from YT
5. YT sends 14 more packets (up to ACK 1510469, 19304 more bytes following item 1 above), which we respond to with TCP RSTs
6. we open a new TCP connection on local port 46298 (the 3-way handshake completes right away)
7. we send 4 HTTP GETs (TCP retransmissions) which are ignored by YT
8. we get a FIN-ACK, and send a FIN-ACK on port 46928
9. YT sends us a RST packet
10. we repeat steps 6 through 9 above on port 46299

There is no more traffic between the DUT and YT: no new connection attempts.  In gdb, the socket is still blocked on poll().  The video player UI is stuck where it was when the connection(s) failed; pressing Pause/Play has no effect on the player, and causes no new traffic to appear in Wireshark.
Is there any test we can do to confirm once and for all if it's a network connection issue or a hardware decoder issue?
Flags: needinfo?(dwilson)
(In reply to Mike Habicher [:mikeh] from comment #27)
> Are there any other patches I could be missing?

Not that I know of. You're probably seeing the same thing I'm seeing now. But does seem to be at least dependent (or even caused by) network conditions.
(In reply to Diego Wilson [:diego] from comment #41)
> 
> Not that I know of. You're probably seeing the same thing I'm seeing now.
> But does seem to be at least dependent (or even caused by) network
> conditions.

I'm investigating that now.

I have my DUT sniffing packets into Wireshark over adb[1], and my test AP is plugged into an old-school Ethernet hub[2] between it and my Internet connection, with an extra computer also running Wireshark sniffing packets on the wire.

I'm checking to see if packets are lost when the video playback stalls.

1. https://wiki.mozilla.org/B2G/Packet_Sniffing_with_Wireshark
2. http://en.wikipedia.org/wiki/Ethernet_hub
Whiteboard: c=video [TD-26549] → c=video [TD-26549], MiniWW
Per the stand-up, I have filed some other bugs I have observed while investigating this one:
- bug 876838: Screen goes green when YouTube video stream stalls
- bug 876843: YouTube video display stops but audio continues to play
dougt, can your team help with this?
Flags: needinfo?(doug.turner)
Comment 35 unintentionally cleared the flag - resetting leo+
blocking-b2g: leo? → leo+
mike, can you put the wireshark profile somewhere?
Flags: needinfo?(doug.turner)
After sleeping on comment 39, this occurred to me: we _are_ talking to the YouTube server, or at least some machine with a TCP server socket on port 80, because in step 6, we successfully complete the TCP handshake!

The SYN/SYN+ACK RTT on the port 46298 connection is 26ms, and on port 46299 it's 34ms, both of which are consistent with earlier connections; also, in the SYN+ACK, MSS, window scaling, the lack of timestamps, and Rwin=14600 are consistent with earlier SYN+ACK packets.

Also, after the 4 TCP retransmissions in step 7, the server sends us a FIN+ACK in step 8 to close the connection.

Here's the weird part: we get the FIN+ACK packets with ACK numbers of 1, which means the server sending them never saw the HTTP GETs sent in step 7 (consistent with the SEQ number of 1 in these packets, which indicates the server never set us any data either).

We respond to the FIN+ACK with our own FIN+ACK with SEQ=980 and ACK=2.

So where are the HTTP GETs getting lost, and why? So far, I have been unable to reproduce this case with a wired-side sniffing connection:

  DUT <--> AP-test <--> EthHub <--> Router <--> {Internet}
                          ^--> Netbook/Wireshark

Next step: raw wireless sniffing.
(In reply to Mike Habicher [:mikeh] from comment #47)

> 
> So where are the HTTP GETs getting lost, and why? So far, I have been unable
> to reproduce this case with a wired-side sniffing connection:
> 

So that's a good question - but its worth thinking about the fact that even if you answer the question for this use case the fact remains the GETs could always get lost for some other reason (radio interference for example) that you aren't experiencing here and which isn't fixable.. because that's true it might make more sense to focus on why retries totally stop at step 9 or 10..
As if to reinforce mcmanus' comment, my most recent reproduction of the YouTube stall occurred 17m48s into the video, where there is an open TCP connection but it is stalled.  The video is frozen on the DUT's screen, and Wireshark shows incoming TCP keep-alives from YT (approximately every 10s), to which we respond with TCP ACKs with Rwin=0.

roc, any idea where to look to see where the video path might be stalled?
Flags: needinfo?(roc)
Further to comment 49, looking at the socket service in the parent process:

(gdb) frame 1
#1  0x403bb36a in _pr_poll_with_poll (pds=0x42f9d020, npds=3, timeout=65535000) at /home/mikeh/dev/mozilla/m-c/b2g18/nsprpub/pr/src/pthreads/ptio.c:3920
3920	            ready = poll(syspoll, npds, msecs);

(gdb) p npds
$1 = 3
(gdb) p syspoll
$2 = (struct pollfd *) 0x42ef5bcc
(gdb) p *syspoll
$3 = {fd = 33, events = 1, revents = 0}
(gdb) p syspoll[0]
$4 = {fd = 33, events = 1, revents = 0}
(gdb) p syspoll[1]
$5 = {fd = 61, events = 3, revents = 0}
(gdb) p syspoll[2]
$6 = {fd = 62, events = 3, revents = 0}
(gdb) p (unsigned int*)malloc(sizeof(unsigned int))
$7 = (unsigned int *) 0x315e80
(gdb) p ioctl(21, 0x541b, (unsigned int*)0x315e80)
$8 = -1
(gdb) p ioctl(33, 0x541b, (unsigned int*)0x315e80)
$9 = 0
(gdb) p *(unsigned int*)0x315e80
$10 = 1
(gdb) p ioctl(61, 0x541b, (unsigned int*)0x315e80)
$11 = -1
(gdb) p ioctl(62, 0x541b, (unsigned int*)0x315e80)
$12 = -1
(gdb) p __errno()
$13 = (volatile int *) 0x42ef5f08
(gdb) p *__errno()
$14 = 22
(gdb) p msecs
$15 = 65535000
(gdb) 

The errno value of 22 is EINVAL, which in conjunction with FIONREAD on bionic, supposedly indicates that fds 61 and 62 are listeners: http://stackoverflow.com/questions/13353146/android-linux-ioctl-fionread-failing (assuming they're even sockets in the first place).

(With both the parent and video player app processes stopped in gdb, Wireshark shows the TCP keep-alives and ACK dance every 10 seconds, seemingly indefinitely.)
You want NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5.

I don't know why data would not be flowing if the HTTP port is open and we're trying to play the video. Definitely sounds like a bug on our side.
Flags: needinfo?(roc)
The media cache does at times suspend reading from stream, but it should only do that if we've cached data ahead of where we're playing from and the cache is full. I can help you interpret the nsMediaCache logs to figure out if and why it stopped reading.
mikeh, are you attaching logs?
Flags: needinfo?(mhabicher)
(In reply to Andreas Gal :gal from comment #53)
>
> mikeh, are you attaching logs?

gal, I haven't been attaching Wireshark .pcap logs because they tend to compress down to ~800MiB; hence the descriptive summaries in (e.g.) comment 39 and comment 47.  On the other hand, logcat logs don't show anything relevant.
Flags: needinfo?(mhabicher)
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #51)
>
> You want NSPR_LOG_MODULES=nsMediaCache:5,MediaResource:5.
> 
> I don't know why data would not be flowing if the HTTP port is open and
> we're trying to play the video. Definitely sounds like a bug on our side.

Is there any way to coax PRLogging out of a build aside from setting DEBUG=1? b2g18 debug builds currently throw assertions like "MOZ_Assert: Assertion failure: status == OK, at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:315"[1]

1. http://mxr.mozilla.org/mozilla-b2g18/source/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp#315

(This issue has already been submitted as bug 841206.)
mike, can you try to reproduce this on desktop?  it sounds like this problem might be reproducible there as well.
(In reply to Doug Turner (:dougt) from comment #56)
>
> mike, can you try to reproduce this on desktop?  it sounds like this problem
> might be reproducible there as well.

dougt, how do I go about doing that?
Flags: needinfo?(doug.turner)
As requested.  This is with the video on the DUT stalled, and the TCP stack receiving and ACKing TCP keep-alives, with Rwin=0.

The log ends with the following output, more or less continuously:

05-30 13:11:39.721  1277  1295 I Gecko - SYDNEY_AUDIO: 0x4435f340 - get position
05-30 13:11:39.721  1277  1277 I PRLog   : 1074652408[425c92b0]: Stream 44ccc030 catching up
Flags: needinfo?(roc)
Thanks for these logs! Very helpful.

OK it looks like the media cache fills up and we start suspending and resuming the connection to throttle the download. This shows up in the log like this:

05-30 13:10:01.635  1277  1277 I PRLog   : 1074652408[425c92b0]: Stream 44ccc030 predict next data in 42.297000, current worst block is 42.004089
05-30 13:10:01.645  1277  1277 I PRLog   : 1074652408[425c92b0]: Stream 44ccc030 Suspended

That means the next data we read via HTTP is expected to play 42.297 seconds from now, but least-useful data in the cache is for 42.004089 seconds from now, so we want to stop reading temporarily, and we tell Necko to suspend reading from the stream.

You see a sequence of Suspends and Resumes as the video playback advances and we try to slide forward the window of data that we have around the current playback position. This is normal. You will also see some DataReceived notifications signalling that we received data from Necko immediately after a Suspend; this is also normal, and I guess this is expected when Suspend calls race with Necko data delivery.

Here is the last time at which data is received:
05-30 13:10:01.645  1277  1277 I PRLog   : 1074652408[425c92b0]: Stream 44ccc030 DataReceived at 105742336 count=59532

There's a couple of Suspends and Resumes after that. Then the media cache does its last Resume call:
05-30 13:10:02.055  1277  1277 I PRLog   : 1074652408[425c92b0]: Stream 44ccc030 Resumed
and there are no Suspend calls after that. So we're trying to read data from Necko, but none arrives.

Everything after that shows the window of cached data after the playback position gradually shrinking as the playback position advances, until finally it runs out, with predictable results.

So as far as I can tell, the media cache is behaving as expected.

Based on previous experience we know that if we suspend for too long, some servers will close the HTTP connection. We have logic to reopen the connection automatically in that case, but these logs seem to show that doesn't happen. (MediaChannelResource::CacheClientSeek would be called in that case, and would have been logged.) Perhaps suspending the HTTP channel has messed something up in another way --- e.g. flow control at some level, or maybe the server has reacted incorrectly and can't resume delivering data?

If you still have the wireshark logs I guess you should be able to infer from those whether our TCP receive window closed and then opened again.

Whatever the problem is, we may be able to work around it by detecting when we have received no data for some amount of time (fairly long, say ten seconds), and if our last action was a Resume (and the server supports byte-range requests) automatically close the connection and open a new one. I will try to make a patch for that.
Flags: needinfo?(roc)
I suspect some state inconsistency happened around IPC. Bug 868073 is similar area bug.
Mike should be able to tell by looking at logs whether that last Resume() actually took effect.
Attached patch timer patch (obsolete) — Splinter Review
Mike, this patch might help. It should at least prevent the kind of log that you supplied here --- the patch should reset the connection after not receiving data for five seconds after the Resume().
maybe jduell can write some tests around ipc and pause/resume?
Flags: needinfo?(jduell.mcbugs)
Sure--we have basic suspend/resume tests, but I'll knock off something that does it repeatedly with a very large file.  I doubt it'll turn anything up (I suspect we really need the specific flakiness of the network connection here), but it's worth trying.

Meanwhile I think it would be useful here to get a run where we have both a wireshark log (post it to a website somewhere so we can download it) AND an NSPR HTTP log:

   https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging#Firefox_OS_phones

(I think using a single log file is fine)
Flags: needinfo?(jduell.mcbugs)
roc:

> You will also see some DataReceived notifications signalling that we 
> received data from Necko immediately after a Suspend; this is also normal,
> and I guess this is expected when Suspend calls race with Necko data delivery.

As long as you suspend a channel on the main thread, Necko should never call OnDataAvailable until you call Resume [1]. I.e no races.  That would be a bug.  Can you verify for me whether you're actually seeing that in the log?

[1] For single process necko, we suspend the underlying inputStreamPump which handles calling OnDataAvailable. For e10s we send a message back to the parent to suspend the pump, and queue any OnDataAvail IPDL messages that arrive in the meantime.
Flags: needinfo?(roc)
Sorry, in case it wasn't clear, we want an NSPR log with both media/HTTP logging, ie.

 NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5:nsHttp:5,nsSocketTransport:5

>> can you try to reproduce this on desktop?
>
> how do I go about doing that?

I'm guessing forging the UserAgent header to B2G's (via about:config 'general.useragent.override') and going to the URI would be a good start.
Flags: needinfo?(doug.turner) → needinfo?(mhabicher)
roc, with your patch, the video plays until the end; the display stalled quite early on (see bug 877461) but the audio kept on going. Wireshark confirms that we hit the end of the stream.

For posterity, I added a printf_stderr() call into the NoDataDeliveredTimeout() callback in your patch; not sure if 'mOffset' value being the same means anything but there you go.

15:14:52 ➜  bug870564  grep -n  "mikeh:" log06_media_timer.txt 
64950:05-31 13:42:44.650  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
101956:05-31 13:44:40.844  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
129423:05-31 13:46:03.845  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
143889:05-31 13:46:49.739  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
181937:05-31 13:48:44.471  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
199981:05-31 13:49:40.996  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
210232:05-31 13:50:11.446  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
223187:05-31 13:50:53.077  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
236334:05-31 13:51:34.377  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
292244:05-31 13:54:32.921  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
313632:05-31 13:55:40.497  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
358557:05-31 13:58:45.167  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
420596:05-31 14:05:17.040  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
427522:05-31 14:06:54.675  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
465309:05-31 14:15:19.327  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
491143:05-31 14:22:47.214  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
522485:05-31 14:31:19.034  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
579806:05-31 14:43:09.917  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
629090:05-31 14:53:20.983  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
664792:05-31 15:01:53.533  1338  1338 I Gecko   : mikeh: handling NoDataDeliveredTimeout (mOffset=1152241664)
(In reply to Jason Duell (:jduell) from comment #66)
> Sorry, in case it wasn't clear, we want an NSPR log with both media/HTTP
> logging, ie.
> 
>  NSPR_LOG_MODULES=timestamp,nsMediaCache:5,MediaResource:5:nsHttp:5,
> nsSocketTransport:5

I'll roll back roc's change and add the extra logging options.
Flags: needinfo?(mhabicher)
jduell: here's the logcat output; the corresponding Wireshark output can be found here: http://people.mozilla.com/~mhabicher/log07_dut.pcap
Attachment #756747 - Flags: feedback?(jduell.mcbugs)
(In reply to Mike Habicher [:mikeh] from comment #67)
> roc, with your patch, the video plays until the end; 

Awesome possum! I'll make sure to try it out.

> the display stalled
> quite early on (see bug 877461) but the audio kept on going. Wireshark
> confirms that we hit the end of the stream.

Yeah, I totally agree this is a separate bug. I can actually reproduce it earlier if I leave on the overlay progress bar.
Blocks: 878343
Ran some e10s xpcshell tests on my linux desktop with large files (10 to 100 MB), where at every 5th OnDataAvailable I call suspend, wait a second, then resume.  With a localhost http server, 25 MB file downloads fine that way.  100 MB caused a crash (which is a bug, but not this one).  Over remote server loading 10 MB file once saw server drop the connection half-way through (but OnStopRequest called), once saw a crash.  Clearly work to be done with the crash issue, but nothing here so far points to a cause for this bug.  I'll take a look at the wireshark/NSPR log next.
On Firefox OS, I saw a lot of consecutive pause()s/resume()s were called without OnDataAvailable.
(In reply to Jason Duell (:jduell) from comment #71)
> Ran some e10s xpcshell tests on my linux desktop with large files (10 to 100
> MB), where at every 5th OnDataAvailable I call suspend, wait a second, then
> resume.  With a localhost http server, 25 MB file downloads fine that way. 
> 100 MB caused a crash (which is a bug, but not this one).  Over remote
> server loading 10 MB file once saw server drop the connection half-way
> through (but OnStopRequest called), once saw a crash.  Clearly work to be
> done with the crash issue, but nothing here so far points to a cause for
> this bug.  I'll take a look at the wireshark/NSPR log next.

Try reducing the size of the media cache if you haven't already (media.cache_size). On Firefox OS it's something pretty small - 4MB IIRC.
(In reply to Jason Duell (:jduell) from comment #65)
> As long as you suspend a channel on the main thread, Necko should never call
> OnDataAvailable until you call Resume [1]. I.e no races.  That would be a
> bug.  Can you verify for me whether you're actually seeing that in the log?

From comment #59:
> Then the media cache does its last Resume call:
> 05-30 13:10:02.055  1277  1277 I PRLog   : 1074652408[425c92b0]: Stream 44ccc030 Resumed
> and there are no Suspend calls after that. So we're trying to read data from Necko, but none
> arrives.

Mike confirmed on IRC that after our last Resume call, Wireshark shows the TCP receive window never opens. Which is why I think it's a Necko bug.
Flags: needinfo?(roc)
Comment on attachment 756494 [details] [diff] [review]
timer patch

Review of attachment 756494 [details] [diff] [review]:
-----------------------------------------------------------------

It sounds like this patch is working. I think we should take it as well as fix any underlying Necko bug. It gives us some extra protection against possible bugs in the network or server.

::: layout/base/FrameLayerBuilder.cpp
@@ +2947,5 @@
>        layer->SetClipRect(nullptr);
>        layer->SetMaskLayer(nullptr);
>        ImageLayer* imgLayer = static_cast<ImageLayer*>(layer.get());
>        layer.forget();
> +      printf("### RECYCLING IMAGE LAYER\n");

Er, ignore this.
Attachment #756494 - Flags: review?(cpearce)
Attachment #756494 - Flags: review?(cpearce) → review+
I adopt patch on this version..
Mozilla build ID: 20130526070207
Gaia Revision : 4d10e1297b859cacc174c0a54af61a7678d7c32d
Gecko Revision : 52341e43539a0e8b9aa77a9128cc3871439b8aa6

There's more severe problem occurred.
Play is lagged from the start and then it is stuck after few seconds.

Is there anyone who check this patch?
(In reply to leo.bugzilla.gecko from comment #76)
> There's more severe problem occurred.
> Play is lagged from the start and then it is stuck after few seconds.

Does audio keep playing? If so, that is a separate issue tracked in bug 877461
(In reply to Diego Wilson [:diego] from comment #77)

> Does audio keep playing? If so, that is a separate issue tracked in bug
> 877461

No, A/V is just stuck forever.
(In reply to leo.bugzilla.gecko from comment #76)
> I adopt patch on this version..
> Mozilla build ID: 20130526070207
> Gaia Revision : 4d10e1297b859cacc174c0a54af61a7678d7c32d
> Gecko Revision : 52341e43539a0e8b9aa77a9128cc3871439b8aa6

Can you give a URL for this revision? I assume this is from git but I don't know which repo...

> There's more severe problem occurred.
> Play is lagged from the start and then it is stuck after few seconds.
> 
> Is there anyone who check this patch?

You mean this is a regression? Does it happen on all Youtube videos? Can you give specific steps to reproduce?

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #79)
> Can you give a URL for this revision? I assume this is from git but I don't
> know which repo...
Do you mean this URL?
Gaia : http://git.mozilla.org/?p=releases/gaia.git;a=shortlog;h=refs/heads/v1-train
Gecko : http://git.mozilla.org/?p=releases/gecko.git;a=shortlog;h=refs/heads/gecko-18

> You mean this is a regression? Does it happen on all Youtube videos? Can you
> give specific steps to reproduce?
I just checked URL in comment 0, just try to play.
I will check for another youtube videos also.
I think......Bug 871485 could affect this problem...
Because I check patch here after patching Bug 871485.
(It seems like youtube got more problem with patch Bug 871485)

I am checking ...
Sorry for confusing. The stuck problem is definitely because of Bug 871485.

After adapting this patch, it's lagged so much and only audio is keep playing.
I think it's same as bug 877461.
There are so many message from to log like this..when it's lagging

MediaResource          Could not trim cache block 268 (destination 127, predicted next use 38.952000, latest predicted use for overflow 198.109000
(In reply to Diego Wilson [:diego] from comment #77)
> Does audio keep playing? If so, that is a separate issue tracked in bug
> 877461

It fills like worse than before.
Is Bug 877461 same as Bug 863441.

I think this patch reopen Bug 863441.
More information..

When a network connection is good, it occurred less than before.
If network connection is not good, it occurred whenever DataReceived notifications signalling is show up.
(In reply to leo.bugzilla.gecko from comment #82)
> Sorry for confusing. The stuck problem is definitely because of Bug 871485.

The patch does not change how video playback is done, but change the timing of codec allocation. I think Bug 871485 does not cause the problem, but hit more often the hidden problem.
(In reply to Jason Duell (:jduell) from comment #71)
>
> Clearly work to be
> done with the crash issue, but nothing here so far points to a cause for
> this bug.  I'll take a look at the wireshark/NSPR log next.

jduell, anything I can do to help?
It seems that on leo device by using MozBuild, we can see more freeze easily. It might be true also on vendor built ROM. Last week, I was in lg b2g work week, lg engineer said, it is easy to generate freeze.
I've got :mikeh running a build that asserts if the necko channel's suspend/resume methods are called off main thread.  We seem to be seeing that (or some other kind of error) in bug 877708, and that could explain the "lost resume" we seem to seeing here.
So we're not seeing any off-main thread use of suspend/resume. But :mikeh has been running some builds with various diagnostic checks, and we did catch a hang and attached a debugger to it, and we saw

1) nsHttpChannel::mSuspendCount = 1
2) the last printf in our log of suspend/resume is a suspend()
3) wireshark shows TCP window is 0 (i.e. full client TCP buffer because channel is suspended)

This really seems to point to the code above necko not calling resume(), yet expecting data. roc: I smell a media cache bug at this point.
Flags: needinfo?(roc)
(In reply to Jason Duell (:jduell) from comment #90)
> This really seems to point to the code above necko not calling resume(), yet
> expecting data. roc: I smell a media cache bug at this point.

Maybe there is a media cache bug too, but in the log in comment #58, it's clearly a Resume that's the last call. Please don't ignore that.
Flags: needinfo?(roc)
(In reply to Jason Duell (:jduell) from comment #90)
> This really seems to point to the code above necko not calling resume(), yet
> expecting data. roc: I smell a media cache bug at this point.

Did you get the nsMediaCache log for this? If so can you drop a link here?
Hmm, it's possible the resume gets lost somewhere between where nsMediaCache logs it and calling Necko. I'll dig.
I now suspect bug 877708 is the real culprit here.  Taking roc off the hook.
mikeh: could you try my patch (the simple one for b2g18 branch) from bug 877708 (without roc's patch) and see if it fixes the freeze here?

Alas, there's a good chance my bug 877708 fix won't actually fix this issue.  The bug there was that if suspend was called again very shortly after resume, we could wind up flushing events from the parent (like OnDataAvailable) that we have queued on the child, even though we're supposed to be suspended (these were the "racy" OnDataAvailable calls mentioned in comment 65).  My patch fixes that, but AFAICT it won't help with any "lost" resume calls (i.e we shouldn't remain suspended if the media code really has issued a matching number of suspends/resumes).  But I don't see any obvious problems with the way necko e10s does suspend/resume, so I'm still hoping that not issuing the racy ODA calls will somehow make the media code behave better.
Flags: needinfo?(mhabicher)
Why does stuck happen at the same position all the time??

and...according to logs, it seems like h/w codec keeps consuming data even though it needs
Does anyone have any objections to landing attachment 756494 [details] [diff] [review] ?
Keywords: checkin-needed
Also, perhaps we can spin off another bug for continuing the media cache debugging discussion?
(In reply to Diego Wilson [:diego] from comment #97)
>
> Does anyone have any objections to landing attachment 756494 [details] [diff] [review] ?

My concern is that while is does unstick the video, it doesn't address the underlying problem that causes the stream to get stuck in the first place. I'm currently testing a patch in bug 877708 that may do that, so I'm going to remove the 'checkin-needed' keyword for now. I'll update both bugs once I have a couple of runs-through to examine.
Flags: needinfo?(mhabicher)
Keywords: checkin-needed
With bug 877708 attachment 757784 [details] [diff] [review] applies (and not attachment 756494 [details] [diff] [review] in this bug), the TCP stream(s) no longer stall. Instead, Wireshark shows periodic TCP window collapses which linger for 1~3s and are then cleared and streaming resumes.

(The video is frozen, however--something that sounds like it is resolved in another bug. Audio was stalled too, despite the network traffic; but when I moved the DUT to temporarily increase WiFi RSSI, the audio resumed with frozen display.)
Further to comment 100, that would probably be bug 879297.
(In reply to Diego Wilson [:diego] from comment #98)
>
> Also, perhaps we can spin off another bug for continuing the media cache
> debugging discussion?

In light of comment 100, I don't think a new bug is needed; I need to conduct more testing, but if the TCP stream stall is fixed, I don't think we need to revisit the MediaCache.
jduell: I've put bug 877708 attachment 757784 [details] [diff] [review] through its paces today, and over several hours of testing, the TCP stream never gets stuck. I think you've done it. :)
Yay. That's good news!  I guess I'll dupe that bug over to this one (so I don't have to ask for leo+ on it).

mikeh: Re: TCP window collapses:  IIRC this is normal--when the media code suspends a channel, we let its TCP buffer fill up (and the TCP recv window becomes 0).  When we resume we read from the TCP buffer and the window gets bigger again.  If that's all you're seeing there's no problem here (but I seem to remember you saying we were sending hard RSTs to server when we talked yesterday?  That sounded like a new, different bug).
Duplicate of this bug: 877708
This is patch for m-c from bug 870564, already approved by jdm (and nits fixed).
Attachment #758276 - Flags: review+
This is patch for mb2g18 from bug 870564, already +r'd by jdm (and nits fixed).

[Approval Request Comment]
Bug caused by (feature/regressing bug #): been in tree for long time.
User impact if declined: youtube on b2g will freeze on long videos 
Testing completed: yes see comment 103.
Risk to taking this patch (and alternatives if risky): low.
String or UUID changes made by this patch: none
Attachment #758280 - Flags: review+
Attachment #758280 - Flags: approval-mozilla-b2g18?
Comment on attachment 756494 [details] [diff] [review]
timer patch

Marking roc's patch obsolete, as it was a bit of a hack and doesn't seem to be needed.  Correct me if I'm wrong about that.
Attachment #756494 - Attachment is obsolete: true
typo in comment 106 && 107--I moved those patches here from bug 877708.
(In reply to Jason Duell (:jduell) from comment #109)
> Marking roc's patch obsolete, as it was a bit of a hack and doesn't seem to
> be needed.  Correct me if I'm wrong about that.

Agreed.
(In reply to Jason Duell (:jduell) from comment #104)
> mikeh: Re: TCP window collapses:  IIRC this is normal--when the media code
> suspends a channel, we let its TCP buffer fill up (and the TCP recv window
> becomes 0).  When we resume we read from the TCP buffer and the window gets
> bigger again.

This is expected. This is the result of the media cache doing flow control.
Whiteboard: c=video [TD-26549], MiniWW → c=video [TD-26549], MiniWW, [YouTubeCertBlocker]
Whiteboard: c=video [TD-26549], MiniWW, [YouTubeCertBlocker] → c=video [TD-26549], MiniWW, [YouTubeCertBlocker+]
jason this is awesome.
For consistency of symptom what I see, please try it via 3G connection.
If you cannot use 3G connection, try it via other device's tethering.
I cannot get any progress even though I applied many patches mentioned in Bug 877461, Bug 878343 and here.
https://hg.mozilla.org/mozilla-central/rev/850b4e5dc45e
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment on attachment 758280 [details] [diff] [review]
less risky patch for b2g18 branch

Has leo+, doesn't need additional approval.
Attachment #758280 - Flags: approval-mozilla-b2g18?
(In reply to Jason Duell (:jduell) from comment #104)
> 
> mikeh: Re: TCP window collapses:  IIRC this is normal--when the media code
> suspends a channel, we let its TCP buffer fill up (and the TCP recv window
> becomes 0).  When we resume we read from the TCP buffer and the window gets
> bigger again.  If that's all you're seeing there's no problem here (but I
> seem to remember you saying we were sending hard RSTs to server when we
> talked yesterday?  That sounded like a new, different bug).

jduell, I still see the RSTs from us to the server. Basically it looks like:
1. regular TCP streaming
2. TCP receive window goes to zero, then:
3. http > us TCP keepalive
4. us > http TCP ACK, Rwin=0
5. us > http TCP RST
6. open new stream

The only time this doesn't happen is on the last chunk, where we close it off gracefully with FIN+ACKs.
(In reply to leo.bugzilla.gecko from comment #114)
> For consistency of symptom what I see, please try it via 3G connection.
> If you cannot use 3G connection, try it via other device's tethering.
> I cannot get any progress even though I applied many patches mentioned in
> Bug 877461, Bug 878343 and here.

FYI, for confirmation a patch in Bug 879529 is also necessary.
> 
> jduell, I still see the RSTs from us to the server. Basically it looks like:
> 1. regular TCP streaming
> 2. TCP receive window goes to zero, then:
> 3. http > us TCP keepalive
> 4. us > http TCP ACK, Rwin=0
> 5. us > http TCP RST
> 6. open new stream
> 
> The only time this doesn't happen is on the last chunk, where we close it
> off gracefully with FIN+ACKs.


The RST gets sent because we closed the socket without reading data that was queued in the kernel before doing that. If a stream was paused and then not resumed before being closed that is what I would expect would happen. (it happens on lots of other close scenarios too) Its not necessarily an indication of a problem unless its associated with some other issue.
Keywords: verifyme
QA Contact: jsmith
> WTF happened to the header on this patch?

Not sure--I didn't tweak it manually, so I'll blame hg.
(In reply to Jason Duell (:jduell) from comment #107)
> Created attachment 758280 [details] [diff] [review]
> less risky patch for b2g18 branch
> 
> This is patch for mb2g18 from bug 870564, already +r'd by jdm (and nits
> fixed).
> 
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): been in tree for long time.
> User impact if declined: youtube on b2g will freeze on long videos 
> Testing completed: yes see comment 103.
> Risk to taking this patch (and alternatives if risky): low.
> String or UUID changes made by this patch: none

This patch worked for me. Thanks Jason!
Flags: in-moztrap?
Flags: in-moztrap? → in-moztrap+
Played fine with some long videos, although i did some of the behavior described in bug 880601.
Attachment #756747 - Flags: feedback?(jduell.mcbugs)
You need to log in before you can comment on or make changes to this bug.