Closed Bug 1514172 Opened 5 years ago Closed 5 years ago

2.42 - 4.84% sessionrestore / sessionrestore_no_auto_restore / ts_paint / ts_paint_webext (windows10-64, windows7-32) regression on push ded3b65ca045796830bf03282370926b5810dd25 (Thu Dec 13 2018)

Categories

(Core :: Audio/Video: Playback, defect, P2)

61 Branch
All
Windows
defect

Tracking

()

VERIFIED FIXED
mozilla66
Tracking Status
firefox-esr60 --- unaffected
firefox64 --- unaffected
firefox65 --- fixed
firefox66 --- fixed

People

(Reporter: igoldan, Assigned: mjf)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=c38131baf66082a231aa7ce376b12ff886516ab7&tochange=ded3b65ca045796830bf03282370926b5810dd25

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  5%  sessionrestore windows10-64 pgo e10s stylo                     304.83 -> 319.58
  5%  ts_paint windows10-64 pgo e10s stylo                           303.96 -> 318.67
  5%  ts_paint windows10-64 pgo e10s stylo                           309.25 -> 323.75
  5%  sessionrestore_no_auto_restore windows10-64 pgo e10s stylo     318.50 -> 333.20
  4%  ts_paint_webext windows10-64 pgo e10s stylo                    304.83 -> 317.33
  4%  sessionrestore windows7-32 pgo e10s stylo                     305.00 -> 317.42
  4%  ts_paint_webext windows7-32 pgo e10s stylo                    309.88 -> 322.33
  4%  sessionrestore_no_auto_restore windows7-32 pgo e10s stylo     323.08 -> 335.42
  2%  ts_paint windows10-64 opt e10s stylo                           322.38 -> 330.17


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=18164

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling
Flags: needinfo?(mfroman)
Component: General → Audio/Video: Playback
Product: Testing → Core
thanks for pointing this out :igoldan, this is a regression from 3 days ago and documented a bit more in bug 1514257, I assume that should be higher priority to fix.
Flags: needinfo?(jmaher)
The pref flip in that patch turns on launching the new RemoteDataDecoder (RDD) process on Windows in order to decode AV1.
Flags: needinfo?(mfroman)
Rolling through this bug on triage -- I'm a little lost. We have a performance regression, possibly caused by the RemoteDataDecoder (RDD) per comment #3? We have a confounding issue with Gecko profile generation as noted in comment 2?

We're approaching 3 days since reporting, are we looking at a backout? Does the confounding issue need to be fixed before we can be sure this is a true regression?
To enable the AV1 video decoder we run it in a sandboxed process. This process is currently started at run time, just like the GPU process.

We have plans to only create the process on demand which would resolve this problem.
Depends on: 1514874
Rank: 15
Priority: -- → P2
(In reply to Ionuț Goldan [:igoldan][PTO Dec 21-Jan 2], Performance Sheriffing from comment #0)
> Regressions:
> 
>   5%  sessionrestore windows10-64 pgo e10s stylo                     304.83
> -> 319.58
>   5%  ts_paint windows10-64 pgo e10s stylo                           303.96
> -> 318.67
>   5%  ts_paint windows10-64 pgo e10s stylo                           309.25
> -> 323.75
>   5%  sessionrestore_no_auto_restore windows10-64 pgo e10s stylo     318.50
> -> 333.20
>   4%  ts_paint_webext windows10-64 pgo e10s stylo                    304.83
> -> 317.33
>   4%  sessionrestore windows7-32 pgo e10s stylo                     305.00
> -> 317.42
>   4%  ts_paint_webext windows7-32 pgo e10s stylo                    309.88
> -> 322.33
>   4%  sessionrestore_no_auto_restore windows7-32 pgo e10s stylo     323.08
> -> 335.42
>   2%  ts_paint windows10-64 opt e10s stylo                           322.38
> -> 330.17

5% feels like a lot. Given:

(In reply to Jean-Yves Avenard [:jya] from comment #5)
> We have plans to only create the process on demand which would resolve this
> problem.

who can make the call about accepting the regressions (or not) until we have the on-demand process startup (it's not clear to me how long that will take but it doesn't feel like a super-short task)? Let's see if Vicky has thoughts :)
Flags: needinfo?(vchin)
I agree with overholt. My preference is to back out until we have on-demand process startup.
Flags: needinfo?(vchin)
Agree with :vchin and :overholt. :jya can you please back out pending bug 1514874?
Flags: needinfo?(jyavenard)
Let me take this one on behalf of Jya and Nils.  Nils is off until Jan 7th and Jya is away until Jan 15th.

I'm needinfo-ing Andreas (Product) for his awareness and comment because backing this out means that AV1 will not go to release in Fx 66;  getting AV1 to release is something Product is invested in.  I appreciate we're in a catch-22: we don't want to take this big a regression hit (5% perf hit on browser startup), but we also don't want AV1 to miss the Fx 66 train.

I'll talk with devs to see if there is a reasonable "3rd option" that gets AV1 to ship to release in Fx 66 AND saves us from taking the regression hit.  Stay tuned...
Flags: needinfo?(jyavenard) → needinfo?(abovens)
Thanks for looping me in.

What is the estimated timeline for converting the process startup to be on-demand?

Do we know how much of the overhead is from the process launch itself (the RDD process contending with the main/content processes for CPU), vs. from blocking the parent process main thread waiting for the RDD process to start up and connect to the IPC channel? Profiling could help determine that, or adding code to record the time spent in RDDProcessManager::EnsureRDDReady if the profiler doesn't work this early.

If it's mainly the latter, making RDD launch more asynchronous could help. Copying what I wrote in bug 1515826 comment #11:

RDDProcessManager::CreateContentBridge would need to be changed to return a promise, and then the SendInitRemoteDecoder call could be chained from that and happen asynchronously, and then in the content process RecvInitRemoveDecoder does an async dispatch, so that's already unordered relative to PContent IPC and in theory shouldn't need further changes.

As for the “changed to return a promise“ part, chaining from WhenProcessHandleReady should be enough: the process handle is all that's needed to construct endpoints, and RDDProcessHost::InitAfterConnect (and thus RDDChild::Open) already doesn't happen-before anything blocking on WaitUntilConnected, so making the latter run earlier shouldn't make a difference.

We do have a patch over in bug 1514874 which needs to get reviewed and ready for landing. So it looks like we should be able to fix this soon (assuming no further major blockers showing up).

One other factor which I believe hasn't been raised here before is that the RDD changed got uplifted to 65. So theoretically 65 should be affected by this as well. Are we running Talos on Beta as well, or is it only executed for Nightly?

(In reply to Nils Ohlmeier [:drno] from comment #12)

Are we running Talos on Beta as well, or is it only executed for Nightly?

Talos runs on Beta also.

So here are the Talos results from running with the patch from bug 1514874:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=ed9d5eb646eb&framework=1&showOnlyImportant=1&selectedTimeRange=172800

Which clearly shows that starting RDD on demand reduces session_restore and ts_paint numbers by ~5%. So with that patch landed we should be back to where we were before RDD got turned on.

I'm confirming this got fixed:

== Change summary for alert #18722 (as of Sat, 12 Jan 2019 04:02:11 GMT) ==

Improvements:

6% sessionrestore windows7-32 pgo e10s stylo 307.38 -> 290.25
5% sessionrestore windows10-64 opt e10s stylo 323.67 -> 307.42
5% ts_paint windows7-32 pgo e10s stylo 319.33 -> 303.33
4% sessionrestore_no_auto_restore windows7-32 pgo e10s stylo 325.83 -> 311.33
4% sessionrestore_no_auto_restore windows10-64 opt e10s stylo 345.83 -> 330.50
4% ts_paint_webext windows7-32 pgo e10s stylo 319.38 -> 305.50
4% ts_paint windows10-64 opt e10s stylo 330.50 -> 316.17
4% ts_paint_webext windows10-64 opt e10s stylo 329.92 -> 316.25
2% ts_paint_webext windows10-64-qr opt e10s stylo 335.17 -> 327.33

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=18722

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #16)

I'm confirming this got fixed:

Thanks to bug 1514874.

Assignee: nobody → mfroman
Flags: needinfo?(abovens)
Target Milestone: --- → mozilla66
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.