Closed Bug 1692582 Opened 3 years ago Closed 3 years ago

Crash in [@ mozilla::MediaFormatReader::Update] with MOZ_DIAGNOSTIC_ASSERT(mStart <= mEnd) (Invalid Interval)

Categories

(Core :: Audio/Video: Playback, defect, P3)

Unspecified
Windows 10
defect

Tracking


RESOLVED FIXED
90 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 --- fixed

People

(Reporter: mccr8, Assigned: bryce)

References

Details

(Keywords: crash, Whiteboard: [not-a-fission-bug][media-crash])

Crash Data

Attachments

(2 files, 1 obsolete file)

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/de95cab3-e06a-4b8c-b26c-e8aef0210211

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(mStart <= mEnd) (Invalid Interval)

Top 10 frames of crashing thread:

0 xul.dll mozilla::MediaFormatReader::Update dom/media/MediaFormatReader.cpp:2335
1 xul.dll mozilla::detail::RunnableMethodImpl< xpcom/threads/nsThreadUtils.h:1201
2 xul.dll mozilla::AutoTaskDispatcher::TaskGroupRunnable::Run xpcom/threads/TaskDispatcher.h:227
3 xul.dll mozilla::TaskQueue::Runner::Run xpcom/threads/TaskQueue.cpp:158
4 xul.dll nsThreadPool::Run xpcom/threads/nsThreadPool.cpp:302
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1152
6 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:302
7 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:328
8 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:310
9 xul.dll static nsThread::ThreadFunc xpcom/threads/nsThread.cpp:391
See Also: → 1530835

Assert looks like it's actually in the Interval code[0], i.e. we're constructing an invalid interval during mozilla::MediaFormatReader::Update.

The first case I can see of us hitting this breakpoint (and not some other crash with a similar signature) is 85, but there's a lot more volume for 86. I'm confused by the lines being reported in the crashes, as I don't see where interval creation takes place on those lines.

[0] https://searchfox.org/mozilla-central/rev/bf03edd45ee0804f382c1f05ec88a05f5d88833c/dom/media/Intervals.h#49,57
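
For reference, a minimal sketch of the assert in question, simplified from the Interval class in dom/media/Intervals.h (the real class also tracks fuzz and has more constructors; this shows only the failing check):

    #include "mozilla/Assertions.h"

    // Simplified sketch of mozilla::media::Interval: the constructor
    // diagnostic-asserts that the interval is well-formed, so any caller
    // passing an end before the start crashes with "Invalid Interval".
    template <typename T>
    class Interval {
     public:
      Interval(const T& aStart, const T& aEnd) : mStart(aStart), mEnd(aEnd) {
        MOZ_DIAGNOSTIC_ASSERT(mStart <= mEnd, "Invalid Interval");
      }

      T mStart;
      T mEnd;
    };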

Severity: -- → S3
Priority: -- → P3

I've audited the code involved, and IMO this points to us getting bad data from a decoder. The Windows-specific nature of these failures supports that: something weird with the Windows decoder (we have some Android crashes in there too, but they do not appear to be the same assert). It's not clear to me why we're getting odd lines for the failures in crash-stats, but I'm fairly confident the actual lines we're asserting on are different.

mozilla::MediaFormatReader::Update currently has three TimeInterval construction sites[0][1][2]. Two of those should not fail: one constructs an empty interval[2] and one constructs an interval where start == end[1]. That leaves only one site that can fail, and there the interval is driven by decoder data[0] (sketched after the links below).

[0] https://searchfox.org/mozilla-central/rev/c8ce16e4299a3afd560320d8d094556f2b5504cd/dom/media/MediaFormatReader.cpp#2157
[1] https://searchfox.org/mozilla-central/rev/c8ce16e4299a3afd560320d8d094556f2b5504cd/dom/media/MediaFormatReader.cpp#2301
[2] https://searchfox.org/mozilla-central/rev/c8ce16e4299a3afd560320d8d094556f2b5504cd/dom/media/MediaFormatReader.cpp#2313
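
A hedged sketch of what site [0] does (the function and variable names are illustrative, not copied from the source): the interval is built from a decoded sample's start and end times, so the decoder fully controls whether the assert holds.

    #include "MediaData.h"  // mozilla::MediaData
    #include "TimeUnits.h"  // mozilla::media::TimeInterval

    using namespace mozilla;

    // Illustrative only: aSample stands in for the decoded sample being
    // processed at site [0]. GetEndTime() is mTime + mDuration, so a
    // negative duration from the decoder yields end < start and fires the
    // "Invalid Interval" assert.
    media::TimeInterval SampleInterval(const MediaData* aSample) {
      return media::TimeInterval(aSample->mTime, aSample->GetEndTime());
    }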

Not a Fission crash even though the report in comment 0 had DOMFissionEnabled=1.

Some bug report comments:

  • "Mainly happens when you jump through time sequences of a video rapidly."
  • "Used YouTube scrollbar to go to end of the video (Click and Hold) once released at desired time, crashed. Happened twice in a row."
Whiteboard: [not-a-fission-bug]

:bryce, we're seeing a lot of these across all channels. Can anything be done here? Do the crash report comments from comment #3 help?

Flags: needinfo?(bvandyk)

I can't repro, but will look further, taking notes as I go...

Echoing and continuing my thought process from comment 2:

  • I think this is a Windows-specific decoder issue. Crashes from other platforms have different stacks than the Windows ones, and I believe those are different crashes.
  • I think our crashing line is the one constructing an interval from decoded data[0]. These interval assert crashes can have incorrect line information in the reports (see bug 1697641), and I think this is a case where we cannot trust the lines in the reports.

I've just done some debugging of crash dumps and manual assembly walking and am fairly confident it bears out the above. The actual assembly we're dying in appears to be a MOZ_ASSERT inlined into Update (symbolication misidentifies this for reasons I don't understand), and the branches that end in that assert appear to come from interval construction.

Assuming I'm correct above, our problem is that we're getting data with an end time less than its start time[0]. Since the end time is calculated as start time + duration[1], this suggests a negative duration, which is odd. Because our time units use a checked integer internally, we would expect checked-int assertions to fire if we had an overflow: even if the duration overflowed, the comparison operator[2] used when constructing the interval would trigger the checked-int asserts before the interval assertion. That is to say, the only way I can see us arriving here is a negative duration.

I have a proposed fix. We could land it and see if it affects the crash rates. If it does, I think it would be good to follow up with some telemetry to understand the problem better.

[0] https://searchfox.org/mozilla-central/rev/083983b7f09b00cdfe4f70396e39ea4f8a1735e1/dom/media/MediaFormatReader.cpp#2157-2158
[1] https://searchfox.org/mozilla-central/rev/54097530955a98c768f2aaf56925578ec886ec77/dom/media/MediaData.h#302
[2] https://searchfox.org/mozilla-central/rev/54097530955a98c768f2aaf56925578ec886ec77/dom/media/TimeUnits.h#127-136
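
A sketch of the overflow argument above, assuming TimeUnit wraps a CheckedInt64 as in TimeUnits.h[2]: the comparison asserts validity before comparing values, so an overflowed (invalid) duration would crash on the CheckedInt assertion before the interval one ever runs, and only a valid, negative duration reaches MOZ_DIAGNOSTIC_ASSERT(mStart <= mEnd).

    #include "mozilla/Assertions.h"
    #include "mozilla/CheckedInt.h"

    // Simplified stand-in for media::TimeUnit; mirrors how the real
    // comparison operator asserts validity before comparing.
    struct TimeUnitSketch {
      mozilla::CheckedInt64 mValue;

      bool operator<=(const TimeUnitSketch& aOther) const {
        MOZ_ASSERT(mValue.isValid() && aOther.mValue.isValid());
        return mValue.value() <= aOther.mValue.value();
      }
    };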

Assignee: nobody → bvandyk
Flags: needinfo?(bvandyk)
Keywords: leave-open
Whiteboard: [not-a-fission-bug] → [not-a-fission-bug][media-crash]

If the Windows decoder outputs a negative duration, use the last input sample
duration instead. We do not expect to see negative durations, and if the Windows
decoder is outputting them it can lead to crashes elsewhere in the media
pipeline due to asserts.

Attachment #9219934 - Attachment is obsolete: true

This patch changes our handling of durations in the WMF video MFT manager.
Prior to this patch we had some handling for VP9 samples being returned with
zero duration. This patch extends that to the general case, so that we will
try to estimate the duration from our own data in the following cases:

  • If Windows returns an error code when getting the duration. This can be
    because the duration is unknown, but the API also specifies it can fail due to
    other, unstated reasons (we handle these too).
  • If Windows returns a 0 duration. This is also used to indicate an unknown
    duration. We now handle this for arbitrary codecs, not just VP9.
  • If Windows returns a negative duration. It's unclear if Windows will do this,
    and if it does, under what circumstances. However, since the API returns the
    duration as a signed int it's possible, so we cover this too.

In all cases above, we will use the duration we record from our last input
sample as the output sample duration.
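
A minimal sketch of that fallback, using the real IMFSample::GetSampleDuration API (the duration is a signed LONGLONG in 100 ns units); aLastInputDuration is an illustrative parameter, not necessarily what the patch names it.

    #include <windows.h>
    #include <mfobjects.h>  // IMFSample

    // GetSampleDuration() may fail, return 0 for "unknown", or (since the
    // value is signed) in principle return a negative count; in all three
    // cases fall back to the duration recorded from the last input sample.
    LONGLONG GetOutputDuration(IMFSample* aSample, LONGLONG aLastInputDuration) {
      LONGLONG duration = 0;
      HRESULT hr = aSample->GetSampleDuration(&duration);
      if (FAILED(hr) || duration <= 0) {
        duration = aLastInputDuration;
      }
      return duration;
    }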

Attachment #9220140 - Attachment description: Bug 1692582 - Handle non-ok error codes + zero and negative durations for IMFSamples. r?alwu → Bug 1692582 - Handle non-ok error codes + zero and negative values when getting duration for IMFSamples. r?alwu
Pushed by bvandyk@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/05e95301fee9
Handle non-ok error codes + zero and negative values when getting duration for IMFSamples. r=alwu

My last patch used the incorrect format specifier (%d) when logging the seconds
for a negative timestamp. This should be %f, as the value is a double.
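
Illustratively (the LOG macro and message are stand-ins, not the actual log line):

    double seconds = timestamp.ToSeconds();  // hypothetical source; the value is a double
    // Buggy: passing a double where %d expects an int is undefined behavior
    // and logs garbage.
    // LOG("sample has negative duration of %d seconds", seconds);
    // Fixed:
    LOG("sample has negative duration of %f seconds", seconds);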

Pushed by bvandyk@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/727f19323e02
Fix incorrect format specifier for negative duration log. r=alwu

I'm adding some additional testing around this area in bug 1709664. It doesn't explicitly cover this crash as the reason for this crash remains unclear (even if I'm right above, I don't know what causes the Windows decoder to output negative durations). However, my hope is it improves our overall Windows decoder coverage, which may include cases like this one.

I do not see any new crashes in nightly for build IDs after this landed. This suggests the Windows decoder was giving us negative durations on output. Surprising, but consistent with my investigation above.

I'll leave this open. Because we don't know what triggers these crashes, I'd like to see this go to beta and see the crash rate drop there too.

I think this is fixed. There are still some crashes being registered against the signature, but the ones I've checked are other crashes, not the assert on this bug. Closing this.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
