Closed Bug 1162364 Opened 9 years ago Closed 9 years ago

detect MF_E_TRANSFORM_STREAM_CHANGE infinite loops

Categories

(Core :: Audio/Video, defect)

Unspecified
Windows
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox39 --- fixed
firefox40 --- fixed
firefox41 --- fixed

People

(Reporter: karlt, Assigned: karlt)

References

Details

Attachments

(4 files, 3 obsolete files)

Some stacks (e.g. [1]) tracked in bug 1129455 have the platform decoder thread
running in WMFVideoMFTManager::ConfigureVideoFrameGeometry()
which indicates that MF_E_TRANSFORM_STREAM_CHANGE
is involved.

If the transform implementation is returning MF_E_TRANSFORM_STREAM_CHANGE
again and again, then we would have an infinite loop, which would seem the
simplest explanation for most of the hangs in bug 1129455

We rely on the transform to behave as specified, returning
MFT_OUTPUT_DATA_BUFFER_FORMAT_CHANGE only once on each change.

We handle MF_E_TRANSFORM_STREAM_CHANGE assuming
MFT_OUTPUT_DATA_BUFFER_FORMAT_CHANGE [2].  However, if the transform creates a
new stream then we are not handling that properly [3], though I don't know why
that would happen.

It should not be necessary to repeat ProcessOutput on the transform more than
once.  We can detect if MF_E_TRANSFORM_STREAM_CHANGE is returned a second time
and error out of the decode.

[1] https://crash-stats.mozilla.com/report/index/b901bff8-3d02-4fda-bcb8-dec762150427#allthreads
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/ms704014%28v=vs.85%29.aspx#stream_changes
[3] https://msdn.microsoft.com/en-us/library/windows/desktop/ms699875%28v=vs.85%29.aspx
Attachment #8605531 - Flags: review?(cpearce)
Attachment #8605514 - Flags: review?(cpearce) → review+
Comment on attachment 8605531 [details] [diff] [review]
report telemetry on WMFMediaDataDecoder errors

Review of attachment 8605531 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/fmp4/wmf/WMFMediaDataDecoder.cpp
@@ +78,5 @@
> +    sample = 0;
> +  } else if (hr < 0xc00d36b0) {
> +    sample = 1; // low bucket
> +  } else if (hr < 0xc00d3700) {
> +    sample = hr & 0xffU; // MF_E_*

What if only one of the last two bits is set? Then couldn't this value be misreported as 1 or 2? Ditto for the MF_E_TRANSFORM_* case below. Or is there something about the layout of the error codes that makes it safe?

@@ +89,5 @@
> +  } else {
> +    sample = 3; // high bucket
> +  }
> +
> +  nsCOMPtr<nsIRunnable> runnable = new WMFTelemeteryRunnable(sample);

You could use NS_NewRunnableFunction() instead of defining a new Runnable for this, and either pass a lambda, or (I assume) &Telemetry::Accumulate and the arguments to the call you want to make.
Comment on attachment 8605531 [details] [diff] [review]
report telemetry on WMFMediaDataDecoder errors

(In reply to Chris Pearce (:cpearce) from comment #3)
> ::: dom/media/fmp4/wmf/WMFMediaDataDecoder.cpp
> @@ +78,5 @@
> > +    sample = 0;
> > +  } else if (hr < 0xc00d36b0) {
> > +    sample = 1; // low bucket
> > +  } else if (hr < 0xc00d3700) {
> > +    sample = hr & 0xffU; // MF_E_*
> 
> What if only one of the last two bits is set? Then couldn't this value be
> misreported as 1 or 2? Ditto for the MF_E_TRANSFORM_* case below. Or is
> there something about the layout of the error codes that makes it safe?

In this block the last two bytes range from 0xb0 to 0xff.
Similarly in the MF_E_TRANSFORM_* case, they range from 0x60 to 0x78.

BTW these are useful references for error codes:
http://errco.de/win32/mferror-h/
https://github.com/Alexpux/mingw-w64/blob/master/mingw-w64-headers/include/mferror.h

But the non-overlapping layout is provided by the "if" blocks rather than
the set of possible codes.

> You could use NS_NewRunnableFunction() instead of defining a new Runnable
> for this, and either pass a lambda, or (I assume) &Telemetry::Accumulate and
> the arguments to the call you want to make.

Sounds like that could be simpler.  Will look thanks.
Attachment #8605531 - Flags: review?(cpearce)
NS_NewRunnableFunction currently accepts only a single parameter.
I like how the lamba associates parameters with the function call anyway.
Attachment #8605587 - Flags: review?(cpearce)
Attachment #8605531 - Attachment is obsolete: true
(In reply to Karl Tomlinson (ni?:karlt) from comment #4)
> (In reply to Chris Pearce (:cpearce) from comment #3)
> > ::: dom/media/fmp4/wmf/WMFMediaDataDecoder.cpp
> > @@ +78,5 @@
> > > +    sample = 0;
> > > +  } else if (hr < 0xc00d36b0) {
> > > +    sample = 1; // low bucket
> > > +  } else if (hr < 0xc00d3700) {
> > > +    sample = hr & 0xffU; // MF_E_*
> > 
> > What if only one of the last two bits is set? Then couldn't this value be
> > misreported as 1 or 2? Ditto for the MF_E_TRANSFORM_* case below. Or is
> > there something about the layout of the error codes that makes it safe?
> 
> In this block the last two bytes range from 0xb0 to 0xff.
> Similarly in the MF_E_TRANSFORM_* case, they range from 0x60 to 0x78.

I mean last to hex digits. i.e. last byte.
There is no overlap with 1 or 2 because higher bits are set in this byte.
Comment on attachment 8605587 [details] [diff] [review]
report telemetry on WMFMediaDataDecoder errors (lamba version)

Review of attachment 8605587 [details] [diff] [review]:
-----------------------------------------------------------------

You also need approval from a Telemetry peer.
https://wiki.mozilla.org/Firefox/Data_Collection
Attachment #8605587 - Flags: review?(cpearce) → review+
Comment on attachment 8605587 [details] [diff] [review]
report telemetry on WMFMediaDataDecoder errors (lamba version)

https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Adding_a_new_Telemetry_probe#Declaring_a_Histogram
"You need approval from a data collection peer to add an opt-out measurement:
https://wiki.mozilla.org/Firefox/Data_Collection" seems to imply that default
opt-in releaseChannelCollection doesn't require approval, but

https://wiki.mozilla.org/Firefox/Data_Collection
"For every new measurement, even a simple new Telemetry probe, please request
approval by setting the NEEDINFO flag for the module owner or a peer." says
otherwise.

The intention of this probe is to find out the common causes of Windows Media
Framework video decode failure in browser usage.

Only one error code corresponds to the MF_E_TRANSFORM_STREAM_CHANGE failure
suspected in this bug, but this probe will tell us whether that is one of the common causes.
Attachment #8605587 - Flags: feedback?(benjamin)
Comment on attachment 8605587 [details] [diff] [review]
report telemetry on WMFMediaDataDecoder errors (lamba version)

I notice Vladan has been looking at some of these approval requests recently.
Attachment #8605587 - Flags: feedback?(vdjeric)
Afaik, you don't need approval for opt-in probes, unless you're reporting sensitive/potentially private information. I agree the wording on the Data Collection page is contradictory, I'll ask Benjamin for clarification
Attachment #8605587 - Flags: feedback?(vdjeric) → feedback+
Attachment #8605587 - Flags: feedback?(benjamin) → feedback+
https://hg.mozilla.org/mozilla-central/rev/3f28c4380901
https://hg.mozilla.org/mozilla-central/rev/6bdabdc61b68
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
So it turns out all new data collection needs privacy peer approval. For the vast majority of opt-in histograms, you'll receive a quick r+ within 24 hours.
I updated the docs
https://crash-stats.mozilla.com/report/index/a9c4f9ac-1b1c-4c4f-9175-3d1a02150518#allthreads
is a 40.0a2 hang report with MF_E_TRANSFORM_STREAM_CHANGE involved.
Depends on: 1165802
3f28c4380901 was backed out in
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8c823f6e6c5
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It turns out that sometimes MF_E_TRANSFORM_STREAM_CHANGE is returned on 2
consecutive calls to ProcessOutput().

On Windows 7, try runs with some logging [1][2] show that the video decoder
using MFT_OUTPUT_STREAM_PROVIDES_SAMPLES returns MF_E_TRANSFORM_STREAM_CHANGE
twice.  Sometimes MF_E_TRANSFORM_NEED_MORE_INPUT is returned between the two
calls returning MF_E_TRANSFORM_STREAM_CHANGE and sometimes not.  When the
MF_E_TRANSFORM_STREAM_CHANGE calls are consecutive we hit the loop abort
introduced in attachment 8605514 [details] [diff] [review].

cpearce observed a similar situation with the AAC decoder on Windows 8.1.

The try logging of reveals that on the first MF_E_TRANSFORM_STREAM_CHANGE from
the video decoder the number of available output types changes as well as
cbSize and cbAlignment in MFT_OUTPUT_STREAM_INFO.  There is no change in
MFT_OUTPUT_STREAM_INFO::dwFlags.  On the second MF_E_TRANSFORM_STREAM_CHANGE,
there is no change in any of these properties.

It seems that the transform provides an initial set of output types and then
provides an updated list when it has enough input to determine more
information about the stream.  This is consistent with "To avoid an initial
format change, provide as much information in the input type as possible,
including:" [3].

"To guard against endless loops, the MFT should not set the
MFT_OUTPUT_DATA_BUFFER_FORMAT_CHANGE flag again until there is another change"
[4] and the transform has no new information between the 2 calls so I think
there should not be another consecutive change, but perhaps an async handling of input or other implementation detail means we are getting the extra notification.

If this can happen twice consecutively, then there is no reason why it won't
happen more often.

MFT_OUTPUT_DATA_BUFFER_FORMAT_CHANGE is being returned in
MFT_OUTPUT_DATA_BUFFER::dwStatus with each MF_E_TRANSFORM_STREAM_CHANGE, as
expected.

What is not expected is that the pdwStatus parameter from ProcessOutput() is
returning MFT_PROCESS_OUTPUT_STATUS_NEW_STREAMS.
GetStreamIDs() returns E_NOTIMPL which means "The transform has a fixed
number of streams" [5].  I suspect someone accidentally tried to return MFT_OUTPUT_DATA_BUFFER_FORMAT_CHANGE in this parameter because MFT_OUTPUT_DATA_BUFFER_FORMAT_CHANGE == MFT_PROCESS_OUTPUT_STATUS_NEW_STREAMS, which should be OK because the values should be returned in different places.

The logging reveals some other decoder transforms do not follow the
prescribed behavior.

MFT_OUTPUT_DATA_BUFFER_NO_SAMPLE is frequently in output.dwStatus when
returning MF_E_TRANSFORM_NEED_MORE_INPUT, even though "If no streams are ready
to produce output, the MFT does not set this flag. Instead, the ProcessOutput
method returns MF_E_TRANSFORM_NEED_MORE_INPUT" [6] and GetStreamCount() is
returning one input and one output stream.  This transform returns
mOutputStreamInfo.dwFlags=1 and reports no stream change return values.

Another transform is returning 0x1000007 in MFT_OUTPUT_DATA_BUFFER::dwStatus
even though this is not a valid _MFT_OUTPUT_DATA_BUFFER_FLAGS (which are "not
bit flags") [7].  This is MFT_OUTPUT_DATA_BUFFER_INCOMPLETE with 3 extra bits
set.  This transform returns dwFlags=0 and also reports no stream change return values.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b92a36ffc58
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=42d948267019
[3] https://msdn.microsoft.com/en-us/library/windows/desktop/dd797815%28v=vs.85%29.aspx
[4] https://msdn.microsoft.com/en-us/library/windows/desktop/ms704014%28v=vs.85%29.aspx#stream_changes
[5] https://msdn.microsoft.com/en-us/library/windows/desktop/ms693988%28v=vs.85%29.aspx
[6] https://msdn.microsoft.com/en-us/library/windows/desktop/ms702281%28v=vs.85%29.aspx
[7] https://msdn.microsoft.com/en-us/library/windows/desktop/ms702281%28v=vs.85%29.aspx
Comment on attachment 8609135 [details] [diff] [review]
detect and abort MF_E_TRANSFORM_STREAM_CHANGE infinite loops v2

Review of attachment 8609135 [details] [diff] [review]:
-----------------------------------------------------------------

Don't forget to address my comment please!

::: dom/media/platforms/wmf/WMFAudioMFTManager.cpp
@@ +216,5 @@
>        NS_ENSURE_TRUE(SUCCEEDED(hr), hr);
> +      // Catch infinite loops, but some decoders perform at least 2 stream
> +      // changes on consecutive calls, so be permissive.
> +      // 100 is arbitrarily > 2.
> +      NS_ENSURE_TRUE(typeChangeCount < 100, MF_E_TRANSFORM_STREAM_CHANGE);

++typeChangeCount;
Attachment #8609135 - Flags: review?(cpearce) → review+
(In reply to Chris Pearce (:cpearce) from comment #20)
> Don't forget to address my comment please!

> ++typeChangeCount;

Thank you.  Added.
This was backed out due to failures caused by another changeset and relanded.
https://hg.mozilla.org/mozilla-central/rev/975a4f9dea3d
https://hg.mozilla.org/mozilla-central/rev/4ff2c52aad5c
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Attachment #8605514 - Attachment is obsolete: true
Approval Request Comment
[Feature/regressing bug #]: WMF media playback (several releases old)

[User impact if declined]:
Potential shutdown hang.  See comment 25.
I don't know that this will work around the observed shutdown hangs, and so I'm requesting approval only for Aurora right now.  We can then see if it changes the shutdown hang reports and perhaps if there is time consider again for Beta.

[Describe test coverage new/current, TreeHerder]:
Existing test coverage was sufficient to detect that the first version of this patch was too strict.  This version is very lenient, bailing only when things are obviously wrong.

[Risks and why]: Risks are low given the simplicity of the patch and existing test.
[String/UUID change made/needed]: none.
Attachment #8611587 - Flags: approval-mozilla-aurora?
Approval Request Comment
[Feature/regressing bug #]:
Telemetry on WMF decode errors including reporting when these suspected infinite loops are aborted.
[User impact if declined]:
Less feedback provided to developers.
[Describe test coverage new/current, TreeHerder]:
On m-c.
[Risks and why]: 
This is a little more complicated than the other patch, but any failure should be limited to problems in reporting this specific telemetry data, and testing coverage provides assurance of that.
[String/UUID change made/needed]: none.
Attachment #8611594 - Flags: approval-mozilla-aurora?
Comment on attachment 8611587 [details] [diff] [review]
40 branch patch: detect and abort infinite loops

Shutdown hangs are a pain. Taking it.
Attachment #8611587 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8611594 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8611587 [details] [diff] [review]
40 branch patch: detect and abort infinite loops

This doesn't apply.
Flags: needinfo?(karlt)
Attachment #8611594 - Attachment description: 40 branch patch: telemetry → 40 branch patch 2: telemetry
Comment on attachment 8612592 [details] [diff] [review]
40 branch patch 1: detect and abort infinite loops (with adjustments to diff comment)

Approval Request Comment

I'm requesting beta approval for this patch only.
I'm not requesting approval for the telemetry patch because we wouldn't have time to act on data before release anyway, so it is not worth taking that code change.

[Feature/regressing bug #]: WMF media playback (several releases old)

[User impact if declined]:
One thread may spin and cause a shutdown hang.  See comment 25.

There are no hang reports on aurora like those in comment 25 since this change landed on aurora.  The small number of aurora data points, and the apparently different distribution on aurora, means that doesn't prove anything.  Only one of the dozen media shutdowns hangs reported on the build prior to this change had such a hang report.

However, telemetry on Aurora shows 1 data point out of 35k indicating that the suspected situation here is really happening for some small proportion of users.

[Describe test coverage new/current, TreeHerder]:
Existing test coverage was sufficient to detect that the first version of this patch was too strict.  This version is very lenient, bailing only when things are obviously wrong.

[Risks and why]: Risks are low given the simplicity of the patch and existing tests.

[String/UUID change made/needed]: none
Attachment #8612592 - Flags: approval-mozilla-beta?
Comment on attachment 8612592 [details] [diff] [review]
40 branch patch 1: detect and abort infinite loops (with adjustments to diff comment)

Seems reasonable, let's uplift it to beta 4.
Attachment #8612592 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.