live video stream using webm fails

RESOLVED FIXED


People

(Reporter: asa, Assigned: cpearce)

Tracking

Trunk
x86
Windows 7
Points:
---

Firefox Tracking Flags

(blocking2.0 betaN+)

Attachments

(1 attachment)

(Reporter)

Description

8 years ago
This live video stream fails in Firefox trunk builds. http://zaheer.merali.org/webm/
(Reporter)

Comment 1

8 years ago
More info: when I load the page, I get the (presumably) first frame and then nothing. A reload picks up the latest first frame and then, again, nothing.  Tested on today's trunk nightly build.  This works in the latest Chromium nightly builds per http://code.google.com/p/chromium/issues/detail?id=44891
It seems to work fine if you right-click on the video and select "View Video", which will open it in a new document as an autoplay video.  I wonder if it's a bad interaction with the default (non-autoplay) suspend after the first frame loads.
blocking2.0: --- → ?
blocking2.0: ? → betaN+
This seems to be working for me now.  Not sure what might've changed to fix it.
(Reporter)

Comment 4

8 years ago
The test URL still fails for me on today's trunk build on Win7.
How long are you waiting?  It works for me on trunk/Win7 too, but it takes about 25 seconds (buffering) before playback starts.
(Assignee)

Comment 6

8 years ago
(In reply to comment #5)
> How long are you waiting?  It works for me on trunk/Win7 too, but it takes
> about 25 seconds (buffering) before playback starts.

Ditto. Probably a regression from bug 543769.
(Reporter)

Comment 7

8 years ago
You're correct. I didn't wait long enough. I saw the statusbar indicator disappear and assumed it was all loaded. 

Video does display after about 25 seconds.
Any better with cpearce's latest buffering changes?
(Assignee)

Comment 9

8 years ago
(In reply to comment #8)
> Any better with cpearce's latest buffering changes?

Hmm, seems not.
Status: NEW → ASSIGNED
It looks like there are two issues with this stream.  Bad interaction with our buffering logic (which cpearce is looking into) and a problem parsing the WebM stream at certain points.  cpearce asked me to look into the second issue.

Periodically within the stream, there are unknown elements that nestegg skips over.  In one case we lose sync in the stream after skipping an unknown element.  mkvinfo 4.3.0 shows the following elements:

| + SimpleBlock (track number 2, 1 frame(s), timecode 9225658.368s = 2562:40:58.609) at 243831
|  + Frame with size 3951
| + (Unknown element: DummyElement; ID: 0xc4 size: 100) at 247789
| + (Unknown element: DummyElement; ID: 0xa1 size: 37) at 247891
| + Block group at 247928
|  + Block duration: 12.000000ms at 247937
|  + Block (track number 1, 1 frame(s), timecode 9225659.545s = 2562:40:59.777) at 247940

nestegg correctly reads the unknown element 0xc4 with size 98 (adding the two bytes for ID and size storage, which we've already read, gives 100 bytes).  nestegg then skips forward 98 bytes and lands at unknown element 0x81, which has a very large size (177723616382336).  It then calls io_read_skip attempting to skip this amount of data; eventually the stream reaches EOF and the call returns an error.  As far as I can tell, this parse is correct and the file is corrupt.

gstreamer-plugins-good 0.10.25-1's parsing agrees with nestegg's:

gst_matroska_demux_loop: Offset 247789, Element id 0xc4, size 98, needed 2
gst_matroska_demux_parse_id: skipping Element 0xc4
gst_matroska_demux_flush: skipping 100 bytes
gst_matroska_demux_loop: pos 2562:40:58.600000000
gst_matroska_demux_loop: Offset 247889, Element id 0x81, size 177723616382336, needed 8
gst_matroska_demux_parse_id: skipping Element 0x81
gst_matroska_demux_flush: skipping -2130326136 bytes
mkvinfo's parsing is rather confusing.  It seems to be incorrect and just happens to retain valid sync.  Looking at the hex dump of the stream:

0003c7e0: a380 0118 cc4b b286 ff20 4294 00c4 e207  .....K... B.....
                                          ^^ ^^
                                          ID SIZE (e2h & ~80h == 98d)

0003c850: 9b81 02a1 a381 05cd 80d4 e497 d8ad c9af  ................
            ^^   ^^
            #1   #2

nestegg and gst land at #1 and read the next seven bytes as the element's size.  mkvinfo lands at #2 and reads the next byte as the element's size.  The element 0xc4 starts at 0x3c7ed, so after reading the ID (1 byte) and size (1 byte) and skipping 98 bytes, we should land at 0x3c851.  mkvinfo appears to skip forward 102 bytes in total, 1 each for ID and size, then 100 in total, landing at 0x3c853, which is then treated as unknown element 0xa1.
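
The size arithmetic above can be checked with a short sketch of EBML variable-length integer decoding. This is a minimal illustration in Python, not nestegg's or gstreamer's code; the helper names (vint_length, read_size, to_int32) are hypothetical, and the byte values are copied from the hex dump. The last step, which reproduces the negative skip count in the gstreamer log by truncating the size plus its header bytes to a signed 32-bit integer, is only an assumption about where that number comes from.

```python
def vint_length(first_byte):
    """Length in bytes of an EBML variable-length integer, from its first byte."""
    if first_byte == 0:
        raise ValueError("no length marker bit in first byte")
    # The position of the highest set bit encodes the length (1 to 8 bytes).
    return 9 - first_byte.bit_length()


def read_size(data):
    """Decode an EBML element size; returns (value, bytes_consumed)."""
    n = vint_length(data[0])
    if len(data) < n:
        raise ValueError("truncated size field")
    raw = int.from_bytes(data[:n], "big")
    # Clear the length marker bit, keeping the low 7*n bits.
    return raw & ((1 << (7 * n)) - 1), n


def to_int32(value):
    """Truncate to a signed 32-bit integer (assumed cause of the negative skip)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# Element 0xc4 at 0x3c7ed: one ID byte, then the size byte 0xe2.
size_c4, size_len = read_size(bytes.fromhex("e2"))
next_offset = 0x3C7ED + 1 + size_len + size_c4

# Element 0x81 at that offset: its size field starts with 0x02, so it is 7 bytes long.
size_81, size_81_len = read_size(bytes.fromhex("02a1a38105cd80"))

print(size_c4, hex(next_offset))             # 98 0x3c851
print(size_81, size_81_len)                  # 177723616382336 7
print(to_int32(size_81 + 1 + size_81_len))   # -2130326136
```

The first two results match the values nestegg and gstreamer report; the third matches the "skipping -2130326136 bytes" line only under the truncation assumption stated above.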

Further evidence that mkvinfo has gone crazy here is that 0xa1 is a perfectly valid element ID (Block), but here it appears outside of its container (0xa0, BlockGroup).
Looking at the dump more closely, I think we've already lost sync by the time we see unknown element 0xc4.  

0003c7e0: a380 0118 cc4b b286 ff20 4294 00c4 e207  .....K... B.....
                                          ^^ Unknown element 0xc4

0003c7f0: 2d79 c90f 0223 0362 d900 4c9a 424c 311a  -y...#.b..L.BL1.
0003c800: 1bee 3bd5 8d7e 5d7b 1fa7 9838 b119 5324  ..;..~]{...8..S$
0003c810: 01a0 0100 0000 0000 002d 9b81 02a1 a881  .........-......
                                             ^^ Size (40)
                                          ^^ Block
                                        ^^ Value (2ms)
                                     ^^ Size (1)
                                   ^^ BlockDuration
               ^^^^ ^^^^ ^^^^ ^^^^ Size (45)
            ^^ BlockGroup (skipped by all parsers)
0003c820: 05cb 80c4 e267 1f26 263f fb70 1507 2c05  .....g.&&?.p..,.
0003c830: 2029 8886 8a04 94d3 9b4f ef1d 997d 9bbb   ).......O...}..
0003c840: 32a4 8bb3 2199 03a0 0100 0000 0000 0028  2...!..........(
                              ^^^^ ^^^^ ^^^^ ^^^^ Size (40)
                           ^^ BlockGroup (skipped by all parsers)

            #1   #2
            vv   vv
0003c850: 9b81 02a1 a381 05cd 80d4 e497 d8ad c9af  ................
                 ^^ Block
               ^^ Value (2ms)
            ^^ Size (1) 
          ^^ BlockDuration

0003c860: b1ab 5809 0e20 82e2 8938 8a1b 7ced 76fd  ..X.. ...8..|.v.
0003c870: 34b2 77cf 061b 4f02 a001 0000 0000 0000  4.w...O.........
                              ^^ BlockGroup where mkvinfo resyncs

Scanning backwards from the 0xc4 unknown element, I can't see any valid looking WebM elements until 0x3b877, which is the last valid SimpleBlock parsed before going off the rails.  My only guess is that the size of that SimpleBlock element is 36 bytes too small for some reason.
Having debugged the libebml/libmatroska parser, it turns out it does start parsing the same way nestegg and gstreamer's matroskadec do: after skipping unknown element 0xc4 it sees element 0x81.  It reads the size of unknown element 0x81 as 177723616382336 and then rejects the element due to a 2GB-per-element size check.  At this point it skips over the 0x81 ID byte and retries parsing at 0x02, fails, retries at 0xa1 and succeeds.

So the stream is definitely corrupt, but libebml's parsing is robust enough to work around it and continue parsing at the next valid element.
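
The recovery behaviour described above can be sketched as a byte-by-byte resync loop. This is a rough Python model, not libebml's code: the names (try_header, resync, MAX_ELEMENT_SIZE) are hypothetical, the exact 2GB limit value is assumed, and rejecting 0x02 because it would imply an ID longer than 4 bytes is my guess at why that retry fails. A real parser would also check candidate IDs against the elements it expects at that level.

```python
MAX_ELEMENT_SIZE = 2 * 1024**3  # assumed value for the 2GB-per-element check


def vint_length(first_byte):
    """Length of an EBML variable-length integer, or None without a marker bit."""
    return 9 - first_byte.bit_length() if first_byte else None


def try_header(buf, pos):
    """Try to read (element_id, size) at pos; return None if it looks invalid."""
    id_len = vint_length(buf[pos])
    if id_len is None or id_len > 4 or pos + id_len >= len(buf):
        return None  # EBML IDs are at most 4 bytes long
    element_id = int.from_bytes(buf[pos:pos + id_len], "big")
    size_pos = pos + id_len
    size_len = vint_length(buf[size_pos])
    if size_len is None or size_pos + size_len > len(buf):
        return None
    raw = int.from_bytes(buf[size_pos:size_pos + size_len], "big")
    size = raw & ((1 << (7 * size_len)) - 1)  # strip the length marker bit
    if size > MAX_ELEMENT_SIZE:
        return None  # implausibly large element, treat as lost sync
    return element_id, size


def resync(buf, pos):
    """Advance one byte at a time until a plausible element header is found."""
    while pos < len(buf):
        header = try_header(buf, pos)
        if header is not None:
            return pos, header[0], header[1]
        pos += 1
    return None


row = bytes.fromhex("9b8102a1a38105cd80d4e497d8adc9af")  # dump row at 0x3c850
found = resync(row, 1)  # start at 0x3c851, where the 0x81 byte sits
print(found)  # (3, 161, 35)
```

Under these assumptions the loop rejects 0x81 and 0x02 and settles on 0xa1 at 0x3c853 with a payload size of 35, which with its two header bytes gives the 37 bytes mkvinfo reports.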
(Assignee)

Comment 13

8 years ago
Created attachment 478878 [details] [diff] [review]
Patch: Don't consider canplaythrough status when leaving buffering state

On the live stream page, the video's decoder is going into buffering state because the video has the default preload=metadata, and script is play()ing the video. When we play() a video which has suspended its load due to preload=*, we switch to buffering state to ensure the playback doesn't stop soon after we've started playback.

The problem (in the decoder) is that our "should we keep buffering" logic in the BUFFERING case of nsBuiltinDecoderStateMachine::Run() remains in buffering state if !mDecoder->CanPlayThrough(), but CanPlayThrough() returns PR_FALSE if the playback rate is not reliable. The playback rate is not considered reliable if we don't know the length of the stream. We don't know the length of stream when we're playing live streams, so CanPlayThrough() will always return PR_FALSE, and we'll remain in BUFFERING state until our 30s time limit expires.

This patch bypasses the CanPlayThrough check if the stream is a live stream. This has the side effect that we'll stop buffering when HaveAmpleDecodedData() starts to return PR_TRUE, which will happen when we've decoded 2s of audio.  I think this should be ok; no matter what threshold we put there, if a live stream's download can't keep up with the media, you'll have a bad experience.
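
To make the before/after behaviour concrete, here is a small Python model of the decision described above. It is only a sketch of the logic as stated in this comment, not the code in nsBuiltinDecoderStateMachine::Run(); the function name keep_buffering and its parameters are hypothetical, and the real condition may be ordered differently.

```python
BUFFERING_LIMIT_S = 30  # the 30s time limit mentioned above


def keep_buffering(elapsed_s, can_play_through, have_ample_decoded_data,
                   is_live_stream, bypass_for_live):
    """Return True if the modelled state machine should stay in BUFFERING."""
    if elapsed_s >= BUFFERING_LIMIT_S:
        return False  # time limit expired, leave buffering regardless
    # With the patch, the can-play-through check is skipped for live streams.
    check_play_through = not (bypass_for_live and is_live_stream)
    if check_play_through and not can_play_through:
        return True
    return not have_ample_decoded_data


# A live stream: its length is unknown, so can_play_through is always False.
live = dict(can_play_through=False, have_ample_decoded_data=True,
            is_live_stream=True)

before_patch = [keep_buffering(t, bypass_for_live=False, **live) for t in (5, 29, 30)]
after_patch = [keep_buffering(t, bypass_for_live=True, **live) for t in (5, 29, 30)]
print(before_patch)  # [True, True, False]
print(after_patch)   # [False, False, False]
```

In this model the unpatched path only leaves buffering when the time limit expires, while the patched path leaves as soon as enough data has been decoded.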
Attachment #478878 - Flags: review?(kinetik)
Whiteboard: [needs review]
(In reply to comment #12)
> So the stream is definitely corrupt, but libebml's parsing is robust enough to
> work around it and continue parsing at the next valid element.

Turns out the stream corruption is a local issue rather than a bug in the muxer producing the stream.  It's possible to reproduce the issue using the Auckland office's internet connection (TelstraClear DSL), but not via my home connection (Orcon DSL).

I verified this by fetching the stream via the home and office connections, then used dd to split and rejoin the files at the last matching WebM Cluster before the corruption occurs.  These files have the same first 22690 bytes.  Once this point was found, I manually searched forward in the corrupt file for anything looking like a sync point, then found the matching sync point in the valid version of the file.  That sync point is present within 2868 bytes of the corruption in the corrupt file, but doesn't occur until 73167 bytes in the valid file, suggesting that the cause of the corruption is that some 70kB of data is missing from the file.
Attachment #478878 - Flags: review?(kinetik) → review+
(Assignee)

Updated

8 years ago
Keywords: checkin-needed
Whiteboard: [needs review] → [needs-landing]
(In reply to comment #12)
> So the stream is definitely corrupt, but libebml's parsing is robust enough to
> work around it and continue parsing at the next valid element.

I have other samples of this stream where libebml's parsing fails in a similar way to nestegg's, so not only is the stream corrupt, but it's corrupt in a way that no WebM parser I've tested can recover from.

I've had a brief email exchange with Zaheer Merali who is hosting the stream and he mentioned that the Flumotion server drops buffers if the client can't keep up.  It makes more sense that losing ~70kB of data is caused by that server-side behaviour rather than something strange with the local office ISP.  I've asked him for more information, but I haven't heard back yet.

If this is indeed caused by Flumotion, it must be a bug.  It doesn't make sense to allow the server to drop buffers in such a way that it generates invalid streams that parsers are unable to recover from in a sensible manner.

blizzard, do you have contacts at Fluendo that might be able to help out?
(In reply to comment #14)
> Turns out the stream corruption is a local issue rather than a bug in the muxer
> producing the stream.  It's possible to reproduce the issue using the Auckland
> office's internet connection (TelstraClear DSL), but not via my home connection
> (Orcon DSL).

Per comment 15, this is unlikely to be the cause of the problem.  I've reproduced the same problem on the home connection now as well.  It took 4hr 25m before it happened, though.

04111330: block t 0 pts 9571644.434000 f 80 frames: 1
04111330: simpleblock t 1 pts 9571644.449000 f 0 frames: 1
04111330: parent element a0
04111330: multi master element a0 (ID_BLOCK_GROUP)
04111330:  -> using data 30910700
04111330: element 9b (ID_BLOCK_DURATION) -> 30910700 (0)
04111330: suspend parse at a1
04111330: block t 0 pts 9571644.456000 f 80 frames: 1
04111330: unknown element a2
04111330: unknown element e0
04111330: unknown element 57b1
04111330: unknown element b4

Comment 17

8 years ago
OK, this is probably a bug in the flumotion/gstreamer solution. We are only meant to drop full clusters... I will look into this.
(Assignee)

Comment 18

8 years ago
Landed: http://hg.mozilla.org/mozilla-central/rev/d08b67f76bfc

We should spin off another bug if we want to change our WebM parser to handle incomplete clusters.
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [needs-landing]