Closed Bug 1641972 Opened 5 years ago Closed 5 years ago

"Parse MP4 metadata failed" error when trying to play certain QTFF/MP4 files

Categories

(Core :: Audio/Video, defect, P3)

RESOLVED INVALID
Tracking Status
firefox77 --- wontfix
firefox78 --- wontfix
firefox79 --- wontfix

People

(Reporter: Chris.Paucar, Assigned: jbauman)

Details

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36

Steps to reproduce:

We are using LIVE555 software to create QTFF/MP4 video files from RTSP cameras, configured for MP4 output. Chrome plays all videos from all cameras. Firefox fails to parse videos from certain cameras, but not others. Firefox gave the same results on both Windows and Linux.

Link to "failed" video: https://rocdevdata.blob.core.windows.net/firefox/B0C55431BD50-evtid-5EBEE83A-5EBEE858.mp4

Link to working video: https://rocdevdata.blob.core.windows.net/firefox/Pablo.mp4

Open the links in Firefox. Also attached the "failed" video just in case.

Actual results:

The error "No video with supported format and MIME type found" appears on the page.

The console logs show "Media resource https://rocdevdata.blob.core.windows.net/firefox/B0C55431BD50-evtid-5EBEE83A-5EBEE858.mp4 could not be decoded."

as well as "Media resource https://rocdevdata.blob.core.windows.net/firefox/B0C55431BD50-evtid-5EBEE83A-5EBEE858.mp4 could not be decoded, error: Error Code: NS_ERROR_DOM_MEDIA_METADATA_ERR (0x806e0006)
Details: virtual RefPtr<MP4Demuxer::InitPromise> __cdecl mozilla::MP4Demuxer::Init(void): Parse MP4 metadata failed"

Expected results:

Video from the file should begin playing, such as in the working video link. Audio does not matter since Firefox does not support all the audio codecs our cameras use, so we expect some videos to be silent or missing audio.

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0

Hi,

I have managed to reproduce this issue on Release version 76.0.1, Beta 78.0b1 and latest Nightly 79.0a1 (2020-05-01) using Windows 10.

Further, I will move this over to a component so developers can take a look over it. If this is not the correct component please feel free to change it to an appropriate one.

Thanks for your report.

Status: UNCONFIRMED → NEW
Component: Untriaged → Audio/Video
Ever confirmed: true
Product: Firefox → Core
Version: 76 Branch → Trunk
Assignee: nobody → jbauman
Severity: -- → S3
Priority: -- → P3

Please let me know if any additional information or action is needed from me, thanks.

Flags: needinfo?(jbauman)

Thanks for your patience. I can reproduce this, so I don't think there's anything else I need from you. I see that the failure is occurring in parsing the descriptor of the ESDS box here: https://github.com/mozilla/mp4parse-rust/blob/309b9b8acdafa1fc34bb6e3df5b27979b272bf42/mp4parse/src/lib.rs#L2423, but I need to do a bit more digging to determine if this is an error in the parser or an invalid encoding in the file.

Flags: needinfo?(jbauman)
Attached video Fixed MP4

After taking some time to analyze the code and the attached video which failed to parse, I believe there is an error in the encoding of the file, specifically the esds box. Here's a hex dump I've generated with hexdump -C -s 0x9bba0 -n 112 failed.mp4 to show the relevant portions:

0009bba0  00 00 00 67 73 74 73 64  00 00 00 00 00 00 00 01  |...gstsd........|
0009bbb0  00 00 00 57 6d 70 34 61  00 00 00 00 00 00 00 01  |...Wmp4a........|
0009bbc0  00 00 00 00 00 00 00 00  00 02 00 10 ff fe 00 00  |................|
0009bbd0  1f 40 00 00 00 00 00 33  65 73 64 73 00 00 00 00  |.@.....3esds....|
0009bbe0  03 80 80 80 2a 00 00 00  04 80 80 80 1c 40 15 00  |....*........@..|
0009bbf0  18 00 00 00 6d 60 00 00  6d 60 05 80 80 80 02 15  |....m`..m`......|
0009bc00  90 06 80 80 80 01 02 00  00 00 18 73 74 74 73 00  |...........stts.|

I'm not sure how familiar you are with the structure of MP4 files (AKA ISOBMFF), so let me know if anything I'm saying doesn't make sense and I can go into more detail.

The stsd box starts at offset 0x0009bba0 in the file with 4 bytes giving the size: 00 00 00 67 (103 bytes in decimal), followed by 4 bytes for the name of the box: 73 74 73 64 ("stsd" in ASCII). Adding 0x67 to 0x0009bba0 gives 0x0009bc07, which is where the stts box starts (00 00 00 18, 4 bytes for the length, followed by 73 74 74 73, "stts"). This tells us that the stsd box covers the half-open offset range 0x0009bba0..0x0009bc07.

The structure of the stsd box is given in § 8.5.2 "Sample Description Box" of ISO/IEC 14496-12:2015, which is publicly available free of charge. Interpreting the contents of that box happens in the read_stsd function. This tells us that there's 1 sample entry, which follows in the form of an AudioSampleEntry (see § 12.2.3). This is parsed by read_audio_sample_entry and we can see the codingname (just name in the code) is mp4a. Looking at the beginning of that box at offset 0x0009bbb0, we again see a 4-byte length 00 00 00 57 (87 bytes in decimal) followed by 6d 70 34 61 ("mp4a" in ASCII). Adding 0x57 to 0x0009bbb0 again gives 0x0009bc07 (the start of the stts box), so we know the mp4a box continues to the end of the stsd box which contains it.

After some information about the channels and sample characteristics, the mp4a box contains an esds box, which unfortunately doesn't seem to be described in any freely available spec. Most of the information comes from ISO/IEC 14496-1:2010, but you can see from the read_esds code that it's a container for an ES_Descriptor defined in § 7.2.6.5. read_esds reads the remainder of the box (again, at offset 0x0009bbd4 we have 00 00 00 33, meaning 51 bytes long, then 65 73 64 73, "esds", so the esds box goes all the way to the end: 0x0009bc07). This array, starting at offset 0x0009bbe0 and continuing until (but not including) 0x0009bc07, is passed to the find_descriptor function.
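
The box arithmetic above can be checked mechanically. The following is a minimal sketch in Python, not the mp4parse-rust code: read_box_header, BASE and DUMP are names made up for this illustration, DUMP is just the 112 bytes from the hex dump, and only the plain 32-bit size form is handled (no 64-bit largesize).

```python
import struct

BASE = 0x9BBA0  # file offset of the first byte in the hex dump above

# The 112 bytes shown in the hex dump, one string per dump row.
DUMP = bytes.fromhex(
    "00 00 00 67 73 74 73 64 00 00 00 00 00 00 00 01 "
    "00 00 00 57 6d 70 34 61 00 00 00 00 00 00 00 01 "
    "00 00 00 00 00 00 00 00 00 02 00 10 ff fe 00 00 "
    "1f 40 00 00 00 00 00 33 65 73 64 73 00 00 00 00 "
    "03 80 80 80 2a 00 00 00 04 80 80 80 1c 40 15 00 "
    "18 00 00 00 6d 60 00 00 6d 60 05 80 80 80 02 15 "
    "90 06 80 80 80 01 02 00 00 00 18 73 74 74 73 00 "
)


def read_box_header(data, base, offset):
    """Return (size, name, end_offset) for the box at absolute file offset `offset`."""
    # A box starts with a big-endian 32-bit size followed by a 4-character name.
    size, name = struct.unpack_from(">I4s", data, offset - base)
    return size, name.decode("ascii"), offset + size


for offset in (0x9BBA0, 0x9BBB0, 0x9BBD4, 0x9BC07):
    size, name, end = read_box_header(DUMP, BASE, offset)
    print(f"{name} at {offset:#x}: size {size}, ends at {end:#x}")
```

The stsd, mp4a and esds headers all yield an end offset of 0x9bc07, which is exactly where the stts header is found.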

According to the spec § 7.2.2.2, BaseDescriptor starts with a 1-byte tag. We see 0x03 at offset 0x0009bbe0, which corresponds to the expected ES_DescrTag. Following that is a variable-length size field described in § 8.3.3 "Expandable classes". You can see how this works from the code of find_descriptor, but basically, for each byte, if the high bit is set, it indicates that the size continues with the subsequent byte, and the low 7 bits are concatenated together to form the value. If the high bit is not set, the low 7 bits are still concatenated, but the process stops. In this file, we have (starting at offset 0x0009bbe1 after the 0x03 tag) 80 80 80 2a or:

binary       decimal
--------------------
0b1000_0000  128
0b1000_0000  128
0b1000_0000  128
0b0010_1010  42

The high bit is set on the first 3 bytes, so we continue accumulating the size, but the low 7 bits are all 0, so we end up with 0000_0000_0000_0000_0000_0010_1010 or 42. Per the specification:

The size information shall not include the number of bytes needed for the size and the object_id encoding.

So this size indicates that the ES_Descriptor should continue for 42 bytes starting after the size, meaning from offset 0x0009bbe5. However, adding 42 gives offset 0x0009bc0f, and we know that the esds (and the mp4a and stsd boxes which contain it) end prior to offset 0x0009bc07. So, I think the encoding here is incorrect.
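
As a rough sketch of that size decoding and the bounds check (Python for illustration only, not the actual find_descriptor code), the following decodes the four size bytes and compares the declared end with the end of the esds box. read_expandable_size and the constants are hypothetical names, and the four-byte cap is an assumption matching the 28-bit value shown above.

```python
def read_expandable_size(data, pos):
    """Decode a descriptor size: 7 value bits per byte, high bit means 'more follows'.

    Returns (size, index of the first byte after the size field).
    """
    size = 0
    for _ in range(4):  # assumption: at most 4 size bytes (28 bits)
        byte = data[pos]
        pos += 1
        size = (size << 7) | (byte & 0x7F)
        if not (byte & 0x80):
            break
    return size, pos


ES_TAG_OFFSET = 0x9BBE0  # file offset of the 0x03 ES_DescrTag
ESDS_END = 0x9BC07       # first byte after the esds box (start of stts)

# Tag byte plus the four size bytes, copied from the hex dump.
header = bytes.fromhex("03 80 80 80 2a")

size, consumed = read_expandable_size(header, 1)  # skip the 1-byte tag
payload_start = ES_TAG_OFFSET + consumed
declared_end = payload_start + size
overrun = declared_end - ESDS_END

print(f"size={size} payload={payload_start:#x}..{declared_end:#x} overrun={overrun}")
```

With the bytes from the failing file the declared end comes out as 0x9bc0f, 8 bytes past the end of the enclosing esds box.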

What exactly it should be is a little involved. This length is for an ES_Descriptor box, which contains a DecoderConfigDescriptor. We can see that starting at offset 0x0009bbe8, since we know it starts with tag 0x04 and has a similar pattern for the length: 80 80 80 1c. However, that length appears to have a similar problem: it converts to a value of 28, and adding that to the offset 0x0009bbed (immediately following the length itself) gives 0x0009bc09, which is beyond the end of the esds box. It's not immediately clear what the length should be, though.

Continuing on, the specification indicates a DecoderConfigDescriptor contains 0 or 1 DecoderSpecificInfo boxes and a variable number of ProfileLevelIndicationIndexDescriptor boxes. We can see the DecSpecificInfoTag (0x05) at offset 0x0009bbfa followed by 80 80 80 02, indicating a length of 2 bytes. If there were any ProfileLevelIndicationIndexDescriptor boxes after that, we'd expect to see the ProfileLevelIndicationIndexDescrTag (0x14) at offset 0x0009bc01, but instead we see 06 80 80 80 01, indicating a SLConfigDescrTag and a length of 1.

Since an ES_Descriptor box does contain a SLConfigDescriptor following the DecoderConfigDescriptor, this allows us to conclude that the ES_Descriptor box includes the DecoderConfigDescriptor box, which includes the DecoderSpecificInfo box, at which point those two boxes end and the SLConfigDescriptor occurs as part of the ES_Descriptor, continuing right up to the beginning of the stts box at 0x0009bc07. Or, graphically:

offset    box
-------------------------------------
0009bbd4  esds
0009bbe0    ES_Descriptor
0009bbe8      DecoderConfigDescriptor
0009bbfa        DecoderSpecificInfo
0009bc01      SLConfigDescriptor
0009bc07  stts

Knowing that, and the fact that the descriptor box lengths don't include the bytes for the tag or the length itself, the lengths of the first two boxes should be changed accordingly:

offset   old-value  new-value
-------------------------------
0009bbe4 0x2a       0x22
0009bbec 0x1c       0x14
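
The new values are just offset differences, which the sketch below recomputes and applies to an in-memory copy of the 39 descriptor bytes from the hex dump. This is an illustration in Python, not the tool actually used to produce the attached file; patch_byte and the constant names are hypothetical.

```python
ESDS_PAYLOAD_START = 0x9BBE0  # file offset of the ES_DescrTag
ESDS_END = 0x9BC07            # first byte after the esds box (start of stts)

# The 39 descriptor bytes as found in the failing file (from the hex dump).
original = bytes.fromhex(
    "03 80 80 80 2a 00 00 00 04 80 80 80 1c 40 15 00 "
    "18 00 00 00 6d 60 00 00 6d 60 05 80 80 80 02 15 "
    "90 06 80 80 80 01 02 "
)


def patch_byte(data, file_offset, old, new):
    """Overwrite one byte, but only if it currently holds the expected old value."""
    index = file_offset - ESDS_PAYLOAD_START
    if data[index] != old:
        raise ValueError(f"unexpected byte {data[index]:#x} at {file_offset:#x}")
    data[index] = new


# Each payload runs from just after its size field to where the analysis says it ends.
es_size = ESDS_END - 0x9BBE5  # ES_Descriptor ends with the esds box
dc_size = 0x9BC01 - 0x9BBED   # DecoderConfigDescriptor ends at the SLConfigDescrTag

fixed = bytearray(original)
patch_byte(fixed, 0x9BBE4, 0x2A, es_size)
patch_byte(fixed, 0x9BBEC, 0x1C, dc_size)

print(f"ES_Descriptor size {es_size:#x}, DecoderConfigDescriptor size {dc_size:#x}")
```

Checking the old value before writing guards against patching the wrong offset if the same edit is attempted on a different file.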

Finally, we can check this against the mp4dump utility. With the original file as input, we get a crash:

              [esds] size=12+39
                [ESDescriptor] size=5+42
                  es_id = 0
                  stream_priority = 0
                  [DecoderConfig] size=5+28
                    stream_type = 5
                    object_type = 64
                    up_stream = 0
                    buffer_size = 6144
                    max_bitrate = 28000
                    avg_bitrate = 28000
                    DecoderSpecificInfo = 15 90 
                    [Descriptor:06] size=5+1
Segmentation fault: 11

and with the updated file (attached):

              [esds] size=12+39
                [ESDescriptor] size=5+34
                  es_id = 0
                  stream_priority = 0
                  [DecoderConfig] size=5+20
                    stream_type = 5
                    object_type = 64
                    up_stream = 0
                    buffer_size = 6144
                    max_bitrate = 28000
                    avg_bitrate = 28000
                    DecoderSpecificInfo = 15 90 
                  [Descriptor:06] size=5+1
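
For reference, the DecoderConfig values printed by mp4dump can be recovered by hand from the 13 fixed bytes that follow the DecoderConfigDescriptor size field (offsets 0x0009bbed up to the DecSpecificInfoTag at 0x0009bbfa). A minimal Python sketch, assuming the usual field layout of 8-bit object type, 6-bit stream type, 1-bit upstream flag, 1 reserved bit, 24-bit buffer size and two 32-bit bitrates; parse_decoder_config is a hypothetical helper, not part of any of the tools mentioned here:

```python
import struct

# The 13 bytes at file offsets 0x9bbed..0x9bbfa, copied from the hex dump.
PAYLOAD = bytes.fromhex("40 15 00 18 00 00 00 6d 60 00 00 6d 60")


def parse_decoder_config(payload):
    """Split the fixed part of a DecoderConfigDescriptor into named fields."""
    flags = payload[1]
    max_bitrate, avg_bitrate = struct.unpack_from(">II", payload, 5)
    return {
        "object_type": payload[0],
        "stream_type": flags >> 2,       # top 6 bits
        "up_stream": (flags >> 1) & 1,   # next bit; the last bit is reserved
        "buffer_size": int.from_bytes(payload[2:5], "big"),
        "max_bitrate": max_bitrate,
        "avg_bitrate": avg_bitrate,
    }


for name, value in parse_decoder_config(PAYLOAD).items():
    print(f"{name} = {value}")
```

These fields sit before either of the patched size bytes takes effect, which is consistent with mp4dump printing the same values for both files.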

So, all that is enough to get the file to pass the parser, but there are still other issues with the audio that cause errors when trying to decode it. Removing the audio track works, but I'd need to spend some more time digging to see what's going on.

Thank you for the detailed write-up. I can confirm the parsing error no longer shows up with the fixed MP4 file, though the audio remains missing, as you stated. I will debug the LIVE555 library that created the ESDS atom/box to see how this could have occurred.

For the video in comment 4 I get audio in Fx under Linux, but not under Windows.

If I take the original file and ask ffmpeg to remux it then it works for me in Fx everywhere.

ffmpeg.exe -i B0C55431BD50-evtid-5EBEE83A-5EBEE858.mp4 -c copy re-mux2.mp4
ffmpeg version git-2020-03-11-36aaee2 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 9.2.1 (GCC) 20200122
  configuration: --enable-gpl --enable-version3 --enable-sdl2 --enable-fontconfig --enable-gnutls --enable-iconv --enable-libass --enable-libdav1d --enable-libbluray --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libtheora --enable-libtwolame --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libzimg --enable-lzma --enable-zlib --enable-gmp --enable-libvidstab --enable-libvorbis --enable-libvo-amrwbenc --enable-libmysofa --enable-libspeex --enable-libxvid --enable-libaom --enable-libmfx --enable-ffnvcodec --enable-cuda-llvm --enable-cuvid --enable-d3d11va --enable-nvenc --enable-nvdec --enable-dxva2 --enable-avisynth --enable-libopenmpt --enable-amf
  libavutil      56. 42.100 / 56. 42.100
  libavcodec     58. 75.100 / 58. 75.100
  libavformat    58. 41.100 / 58. 41.100
  libavdevice    58.  9.103 / 58.  9.103
  libavfilter     7. 77.100 /  7. 77.100
  libswscale      5.  6.100 /  5.  6.100
  libswresample   3.  6.100 /  3.  6.100
  libpostproc    55.  6.100 / 55.  6.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'B0C55431BD50-evtid-5EBEE83A-5EBEE858.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isom
    creation_time   : 2020-05-15T19:06:34.000000Z
  Duration: 00:00:30.34, start: 0.000000, bitrate: 170 kb/s
    Stream #0:0(eng): Audio: aac (LC) (mp4a / 0x6134706D), 8000 Hz, stereo, fltp, 26 kb/s (default)
    Metadata:
      creation_time   : 2020-05-15T19:06:34.000000Z
      handler_name    : ?Apple Sound Media Handler
    Stream #0:1(eng): Video: h264 (High) (avc1 / 0x31637661), yuvj420p(pc, bt470bg/bt470bg/smpte170m), 1280x720, 141 kb/s, 15 fps, 15 tbr, 600 tbn, 30 tbc (default)
    Metadata:
      creation_time   : 2020-05-15T19:06:34.000000Z
      handler_name    : ?Apple Video Media Handler
      encoder         : H.264
Output #0, mp4, to 're-mux2.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: mp42isom
    encoder         : Lavf58.41.100
    Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuvj420p(pc, bt470bg/bt470bg/smpte170m), 1280x720, q=2-31, 141 kb/s, 15 fps, 15 tbr, 19200 tbn, 600 tbc (default)
    Metadata:
      creation_time   : 2020-05-15T19:06:34.000000Z
      handler_name    : ?Apple Video Media Handler
      encoder         : H.264
    Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 8000 Hz, stereo, fltp, 26 kb/s (default)
    Metadata:
      creation_time   : 2020-05-15T19:06:34.000000Z
      handler_name    : ?Apple Sound Media Handler
Stream mapping:
  Stream #0:1 -> #0:0 (copy)
  Stream #0:0 -> #0:1 (copy)
Press [q] to stop, [?] for help
frame=  453 fps=0.0 q=-1.0 Lsize=     627kB time=00:00:30.20 bitrate= 170.1kbits/s speed=7.58e+03x
video:523kB audio:99kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.781984%

You can see it changes the order of the streams, but it shouldn't be modifying them. Assuming that's true, the fact that the file now works suggests there is an issue in the metadata. Comparing the output of the remux to the fixed file from comment 4, it appears that the data in the esds box is different. I don't know for sure, but I suspect this metadata is bad, and when we give it to decoders on Windows and macOS those decoders become incorrectly configured and fail to decode. On Linux, if ffmpeg is present we will use it, and I suspect it will behave more robustly (as when asking it to remux). This would also explain why the file works in VLC and Chrome.

So it seems like a muxing issue to me.

I've also tried remuxing with ffmpeg (attached) on macOS and confirm it creates a file which plays correctly (with sound) on Firefox, Chrome and QuickTime Player.

Unfortunately, dumping the structure of the two files shows a lot of differences, so there's not an obvious thing to point to, but I feel fairly confident in saying that there are some additional errors in the way the original file was muxed.

I'm going to close this since I don't think Firefox's behavior here is wrong (even if some other tools are more lenient about processing files with errors). If you discover something contrary to that, feel free to comment and we can look at addressing it.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID