Bug 1557724 (Closed) - Opened 5 years ago - Closed 5 years ago

54.82% raptor-youtube-playback-geckoview-live (android-hw-g5-7-0-arm7-api-16) regression (Wed Jun 5 2019)

Categories

(Testing :: Raptor, defect, P2)

Tracking

(firefox-esr60 unaffected, firefox67 unaffected, firefox68 unaffected, firefox69+ fixed)

RESOLVED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox67 --- unaffected
firefox68 --- unaffected
firefox69 + fixed

People

(Reporter: igoldan, Assigned: alexandru.irimovici)

References

Details

(Keywords: perf, regression)

Raptor has detected a Firefox performance regression from push:

https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=ebe7fdd19be5dbca6817987e2d80bbe46b357ad7

We need your help to address this regression.

Regressions:

55% raptor-youtube-playback-geckoview-live android-hw-g5-7-0-arm7-api-16 pgo 36.29 -> 56.18

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=21372

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a Treeherder page showing the Raptor jobs in a pushlog format.

To learn more about the regressing test(s) or reproducing them, please see: https://wiki.mozilla.org/Performance_sheriffing/Raptor

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling

Other Android platforms seem to experience this, according to :whimboo. We need to look up and present those cases as well.
As this is a mozilla-central alert, the offending patch must be found by bisection; it cannot be backfilled via Treeherder.

Note that there is also an increase for AArch64 on Windows and macOS, although the latter shows only a slight increase.

It might be good to re-run Raptor against the last good commit. If the numbers for that commit are still lower, we would know that something in our code caused this regression; otherwise it might be a problem with the host serving the media files.

Assignee: nobody → airimovici

Hi Jesup, have you already investigated this?

Flags: needinfo?(rjesup)

I looked at the list of commits. This is bug 1555370, which increased the dropped-frames subtest values by 100x (a percentage instead of a ratio). It's very clear when you look at https://treeherder.mozilla.org/perf.html#/comparesubtest?framework=10&originalProject=mozilla-central&originalSignature=2051167&newProject=mozilla-central&newSignature=2051167&originalRevision=4d1920fc1dd0d97630d5d59f978e14b9f3ded156&newRevision=ebe7fdd19be5dbca6817987e2d80bbe46b357ad7

Because of geomeaning, this appears as a much smaller regression in the top-line numbers. Geomeaning these values is bad: you can geomean dropped-frames counts with each other, but geomeaning decoded frames with dropped frames, etc., gets ugly. Even geomeaning dropped frames with % dropped frames is problematic.
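
To illustrate the effect, here is a minimal sketch (the subtest values are made up, not real Raptor data) of how multiplying just one of ten geomeaned subtests by 100 inflates the suite score by roughly 1.6x, in the same ballpark as the ~55% alert above:

  # Minimal sketch with invented numbers; not Raptor code.
  from math import prod

  def geomean(values):
      return prod(values) ** (1.0 / len(values))

  # Ten hypothetical subtests: nine unchanged, one dropped-frames ratio.
  before = [30, 12, 45, 8, 50, 22, 17, 33, 40, 0.5]
  after = before[:-1] + [before[-1] * 100]  # same metric, now reported as a percentage

  print(geomean(after) / geomean(before))  # 100 ** (1/10) ~= 1.58, i.e. a ~58% "regression"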

Flags: needinfo?(rjesup)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #6)

> I looked at the list of commits. This is bug 1555370, which increased the dropped-frames subtest values by 100x (a percentage instead of a ratio). It's very clear when you look at https://treeherder.mozilla.org/perf.html#/comparesubtest?framework=10&originalProject=mozilla-central&originalSignature=2051167&newProject=mozilla-central&newSignature=2051167&originalRevision=4d1920fc1dd0d97630d5d59f978e14b9f3ded156&newRevision=ebe7fdd19be5dbca6817987e2d80bbe46b357ad7

Should we revert the changes from bug 1555370?

> Because of geomeaning, this appears as a much smaller regression in the top-line numbers. Geomeaning these values is bad: you can geomean dropped-frames counts with each other, but geomeaning decoded frames with dropped frames, etc., gets ugly. Even geomeaning dropped frames with % dropped frames is problematic.

So there is no real regression, only a harness update. Do you suggest we should use a different averaging algorithm for the benchmark score?

Flags: needinfo?(rjesup)

(In reply to Alexandru Irimovici from comment #7)

> (In reply to Randell Jesup [:jesup] (needinfo me) from comment #6)

> > I looked at the list of commits. This is bug 1555370, which increased the dropped-frames subtest values by 100x (a percentage instead of a ratio). It's very clear when you look at https://treeherder.mozilla.org/perf.html#/comparesubtest?framework=10&originalProject=mozilla-central&originalSignature=2051167&newProject=mozilla-central&newSignature=2051167&originalRevision=4d1920fc1dd0d97630d5d59f978e14b9f3ded156&newRevision=ebe7fdd19be5dbca6817987e2d80bbe46b357ad7

> Should we revert the changes from bug 1555370?

No - that simply made it record the right value; it had been recording as a % something that was 1/100th of the percentage.

> > Because of geomeaning, this appears as a much smaller regression in the top-line numbers. Geomeaning these values is bad: you can geomean dropped-frames counts with each other, but geomeaning decoded frames with dropped frames, etc., gets ugly. Even geomeaning dropped frames with % dropped frames is problematic.

> So there is no real regression, only a harness update. Do you suggest we should use a different averaging algorithm for the benchmark score?

Doing any type of "mean" across things that measure different kinds of values can cause these problems. It's made worse because the geomean (and means in general) gives everything equal weight unless you adjust it, so when you geomean loadtime, FCPaint, FNBPaint, and DCF, changes to FCP/FNBP almost always move in lockstep (they're very close in time), so they effectively carry "double" weight.

People want a single top-line number - and inherently this can hide information, even if you avoid that sort of problem.
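
A similarly minimal sketch (again with invented numbers, not real Raptor metrics) of the weighting point: two subtests that move together, like FCPaint and FNBPaint, pull the geomean about twice as hard as a single independent subtest regressing by the same amount:

  # Made-up metrics; only meant to show equal weighting in a geomean.
  from math import prod

  def geomean(values):
      return prod(values) ** (1.0 / len(values))

  baseline = {"loadtime": 800, "fcpaint": 300, "fnbpaint": 310, "dcf": 250}
  correlated = dict(baseline, fcpaint=360, fnbpaint=372)  # FCP/FNBP regress 20% together
  single = dict(baseline, dcf=300)                        # only DCF regresses 20%

  print(geomean(correlated.values()) / geomean(baseline.values()))  # ~1.10
  print(geomean(single.values()) / geomean(baseline.values()))      # ~1.05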

Flags: needinfo?(rjesup)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #8)

> (In reply to Alexandru Irimovici from comment #7)

> > (In reply to Randell Jesup [:jesup] (needinfo me) from comment #6)

> > > I looked at the list of commits. This is bug 1555370, which increased the dropped-frames subtest values by 100x (a percentage instead of a ratio). It's very clear when you look at https://treeherder.mozilla.org/perf.html#/comparesubtest?framework=10&originalProject=mozilla-central&originalSignature=2051167&newProject=mozilla-central&newSignature=2051167&originalRevision=4d1920fc1dd0d97630d5d59f978e14b9f3ded156&newRevision=ebe7fdd19be5dbca6817987e2d80bbe46b357ad7

> > Should we revert the changes from bug 1555370?

> No - that simply made it record the right value; it had been recording as a % something that was 1/100th of the percentage.

Regarding the increase: this cannot be THIS change. The percentage value is not taken into account for displaying results in Perfherder; it is only used for the health dashboard. Also note that if it were, the difference wouldn't be just a few decimals but hundreds of them.

The priority flag is not set for this bug.
:gbrown, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gbrown)
Component: General → Raptor
Flags: needinfo?(gbrown) → needinfo?(rwood)
Flags: needinfo?(rwood)
Priority: -- → P2

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #9)

> (In reply to Randell Jesup [:jesup] (needinfo me) from comment #8)
> ...

> > > Should we revert the changes from bug 1555370?

> > No - that simply made it record the right value; it had been recording as a % something that was 1/100th of the percentage.

> Regarding the increase: this cannot be THIS change. The percentage value is not taken into account for displaying results in Perfherder; it is only used for the health dashboard. Also note that if it were, the difference wouldn't be just a few decimals but hundreds of them.

To make things clear: if you look at the subtests (comment 6), you see the %-dropped-frames tests have "regressions" on the order of 10000% (+/-); if you look at the numbers, they're on the order of 100x different, which matches (1 -> 100, 0.45 -> 38.39, 0.82 -> 92.47, etc.).

Due to geomeaning of all the different tests (only the %-dropped-frames tests changed), this maps to a ~55% change in the geomean.

If you filter the subtests by "X_d" (which happens to exclude the "X_%_d" tests), the results look more typical - there's plenty of noise, but no clear regression.

Rob - I think we should just mark this as invalid.

Flags: needinfo?(rwood)

Strange! The initial alert indeed references the % values only! Why is that? Those values should not be taken into account for Perfherder, as already explained above.

Here are the referenced subtests:
https://treeherder.mozilla.org/perf.html#/comparesubtest?framework=10&originalProject=mozilla-central&originalSignature=2051167&newProject=mozilla-central&newSignature=2051167&originalRevision=4d1920fc1dd0d97630d5d59f978e14b9f3ded156&newRevision=ebe7fdd19be5dbca6817987e2d80bbe46b357ad7

The overall suite value should be computed from the [..]X_dropped_frames values:

https://searchfox.org/mozilla-central/rev/06bd14ced96f25ff1dbd5352cb985fc0fa12a64e/testing/raptor/raptor/output.py#1026

But I can see the problem now! Check that code:

https://searchfox.org/mozilla-central/rev/06bd14ced96f25ff1dbd5352cb985fc0fa12a64e/testing/raptor/raptor/output.py#768

It should not check for name.endswith("dropped_frames"), because that also matches the percentage subtests. :/
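
A minimal sketch of what goes wrong, assuming subtest names of the form seen in the comparison above (the names below and the tightened filter are illustrations only; the actual fix is tracked in bug 1561889 and may differ):

  # Hypothetical subtest names, modeled on the comparison link above.
  names = ["H264.1080p30@1X_dropped_frames", "H264.1080p30@1X_%_dropped_frames"]

  # The current check matches both the raw count and the percentage variant:
  print([n for n in names if n.endswith("dropped_frames")])                   # both names

  # One possible tightening, excluding the percentage variants:
  print([n for n in names if n.endswith("dropped_frames") and "%" not in n])  # count only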

As such this bug is not invalid but clearly needs a fix! Alexandru, can you work on that soon so we can get all the numbers fixed?

Status: NEW → ASSIGNED
Flags: needinfo?(rwood) → needinfo?(airimovici)
Depends on: 1561889

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #12)

> Strange! The initial alert indeed references the % values only! Why is that? Those values should not be taken into account for Perfherder, as already explained above.

> Here are the referenced subtests:
> https://treeherder.mozilla.org/perf.html#/comparesubtest?framework=10&originalProject=mozilla-central&originalSignature=2051167&newProject=mozilla-central&newSignature=2051167&originalRevision=4d1920fc1dd0d97630d5d59f978e14b9f3ded156&newRevision=ebe7fdd19be5dbca6817987e2d80bbe46b357ad7

> The overall suite value should be computed from the [..]X_dropped_frames values:

> https://searchfox.org/mozilla-central/rev/06bd14ced96f25ff1dbd5352cb985fc0fa12a64e/testing/raptor/raptor/output.py#1026

> But I can see the problem now! Check that code:

> https://searchfox.org/mozilla-central/rev/06bd14ced96f25ff1dbd5352cb985fc0fa12a64e/testing/raptor/raptor/output.py#768

> It should not check for name.endswith("dropped_frames"), because that also matches the percentage subtests. :/

> As such this bug is not invalid but clearly needs a fix! Alexandru, can you work on that soon so we can get all the numbers fixed?

I filed bug 1561889 (https://bugzilla.mozilla.org/show_bug.cgi?id=1561889) and I'm working on it.

Flags: needinfo?(airimovici)

With bug 1561889 landed, we should now have real numbers; expect some ups and downs given the adjustments to the harness.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
OS: Android → All
Hardware: x86 → All
Resolution: --- → FIXED
Target Milestone: --- → mozilla69