2.42 - 4.84% sessionrestore / sessionrestore_no_auto_restore / ts_paint / ts_paint_webext (windows10-64, windows7-32) regression on push ded3b65ca045796830bf03282370926b5810dd25 (Thu Dec 13 2018)
Categories
(Core :: Audio/Video: Playback, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox64 | --- | unaffected |
firefox65 | --- | fixed |
firefox66 | --- | fixed |
People
(Reporter: igoldan, Assigned: mjf)
References
Details
(Keywords: perf, regression, talos-regression)
Talos has detected a Firefox performance regression from push: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=c38131baf66082a231aa7ce376b12ff886516ab7&tochange=ded3b65ca045796830bf03282370926b5810dd25 As author of one of the patches included in that push, we need your help to address this regression. Regressions: 5% sessionrestore windows10-64 pgo e10s stylo 304.83 -> 319.58 5% ts_paint windows10-64 pgo e10s stylo 303.96 -> 318.67 5% ts_paint windows10-64 pgo e10s stylo 309.25 -> 323.75 5% sessionrestore_no_auto_restore windows10-64 pgo e10s stylo 318.50 -> 333.20 4% ts_paint_webext windows10-64 pgo e10s stylo 304.83 -> 317.33 4% sessionrestore windows7-32 pgo e10s stylo 305.00 -> 317.42 4% ts_paint_webext windows7-32 pgo e10s stylo 309.88 -> 322.33 4% sessionrestore_no_auto_restore windows7-32 pgo e10s stylo 323.08 -> 335.42 2% ts_paint windows10-64 opt e10s stylo 322.38 -> 330.17 You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=18164 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Performance_sheriffing/Talos/Running *** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
I tried getting the Gecko profiles, but without luck [1] [2]. :jmaher can you figure out the reason? [1] https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=windows%2C10%2Cx64%2Cpgo%2Ctalos%2Cperformance%2Ctests%2Cwith%2Ce10s%2Ctest-windows10-64-pgo%2Fopt-talos-other-e10s%2Ct-e10s%28o-p%29&tochange=ded3b65ca045796830bf03282370926b5810dd25&fromchange=c38131baf66082a231aa7ce376b12ff886516ab7&selectedJob=216978135 [2] https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&searchStr=windows%2C10%2Cx64%2Cpgo%2Ctalos%2Cperformance%2Ctests%2Cwith%2Ce10s%2Ctest-windows10-64-pgo%2Fopt-talos-g5-e10s%2Ct-e10s%28g5-p%29&tochange=ded3b65ca045796830bf03282370926b5810dd25&fromchange=c38131baf66082a231aa7ce376b12ff886516ab7&selectedJob=216978546
Comment 2•5 years ago
|
||
thanks for pointing this out :igoldan, this is a regression from 3 days ago and documented a bit more in bug 1514257, I assume that should be higher priority to fix.
Assignee | ||
Comment 3•5 years ago
|
||
The pref flip in that patch turns on launching the new RemoteDataDecoder (RDD) process on Windows in order to decode AV1.
Rolling through this bug on triage -- I'm a little lost. We have a performance regression, possibly caused by the RemoteDataDecoder (RDD) per comment #3? We have a confounding issue with Gecko profile generation as noted in comment 2? We're approaching 3 days since reporting, are we looking at a backout? Does the confounding issue need to be fixed before we can be sure this is a true regression?
Updated•5 years ago
|
Comment 5•5 years ago
|
||
To enable the AV1 video decoder we run it in a sandboxed process. This process is currently started at run time, just like the GPU process. We have plans to only create the process on demand which would resolve this problem.
Updated•5 years ago
|
Comment 6•5 years ago
|
||
(In reply to Ionuț Goldan [:igoldan][PTO Dec 21-Jan 2], Performance Sheriffing from comment #0) > Regressions: > > 5% sessionrestore windows10-64 pgo e10s stylo 304.83 > -> 319.58 > 5% ts_paint windows10-64 pgo e10s stylo 303.96 > -> 318.67 > 5% ts_paint windows10-64 pgo e10s stylo 309.25 > -> 323.75 > 5% sessionrestore_no_auto_restore windows10-64 pgo e10s stylo 318.50 > -> 333.20 > 4% ts_paint_webext windows10-64 pgo e10s stylo 304.83 > -> 317.33 > 4% sessionrestore windows7-32 pgo e10s stylo 305.00 > -> 317.42 > 4% ts_paint_webext windows7-32 pgo e10s stylo 309.88 > -> 322.33 > 4% sessionrestore_no_auto_restore windows7-32 pgo e10s stylo 323.08 > -> 335.42 > 2% ts_paint windows10-64 opt e10s stylo 322.38 > -> 330.17 5% feels like a lot. Given: (In reply to Jean-Yves Avenard [:jya] from comment #5) > We have plans to only create the process on demand which would resolve this > problem. who can make the call about accepting the regressions (or not) until we have the on-demand process startup (it's not clear to me how long that will take but it doesn't feel like a super-short task)? Let's see if Vicky has thoughts :)
Comment 7•5 years ago
|
||
I agree with overholt. My preference is to back out until we have on-demand process startup.
Comment 8•5 years ago
|
||
Agree with :vchin and :overholt. :jya can you please back out pending bug 1514874?
Comment 9•5 years ago
•
|
||
Let me take this one on behalf of Jya and Nils. Nils is off until Jan 7th and Jya is away until Jan 15th. I'm needinfo-ing Andreas (Product) for his awareness and comment because backing this out means that AV1 will not go to release in Fx 66; getting AV1 to release is something Product is invested in. I appreciate we're in a catch-22: we don't want to take this big a regression hit (5% perf hit on browser startup), but we also don't want AV1 to miss the Fx 66 train. I'll talk with devs to see if there is a reasonable "3rd option" that gets AV1 to ship to release in Fx 66 AND saves us from taking the regression hit. Stay tuned...
Comment 10•5 years ago
|
||
Thanks for looping me in. What is the estimated timeline for converting the process startup to be on-demand?
Comment 11•5 years ago
|
||
Do we know how much of the overhead is from the process launch itself (the RDD process contending with the main/content processes for CPU), vs. from blocking the parent process main thread waiting for the RDD process to start up and connect to the IPC channel? Profiling could help determine that, or adding code to record the time spent in RDDProcessManager::EnsureRDDReady
if the profiler doesn't work this early.
If it's mainly the latter, making RDD launch more asynchronous could help. Copying what I wrote in bug 1515826 comment #11:
RDDProcessManager::CreateContentBridge
would need to be changed to return a promise, and then theSendInitRemoteDecoder
call could be chained from that and happen asynchronously, and then in the content processRecvInitRemoveDecoder
does an async dispatch, so that's already unordered relative toPContent
IPC and in theory shouldn't need further changes.As for the “changed to return a promise“ part, chaining from
WhenProcessHandleReady
should be enough: the process handle is all that's needed to construct endpoints, andRDDProcessHost::InitAfterConnect
(and thusRDDChild::Open
) already doesn't happen-before anything blocking onWaitUntilConnected
, so making the latter run earlier shouldn't make a difference.
Comment 12•5 years ago
|
||
We do have a patch over in bug 1514874 which needs to get reviewed and ready for landing. So it looks like we should be able to fix this soon (assuming no further major blockers showing up).
One other factor which I believe hasn't been raised here before is that the RDD changed got uplifted to 65. So theoretically 65 should be affected by this as well. Are we running Talos on Beta as well, or is it only executed for Nightly?
Reporter | ||
Comment 13•5 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #12)
Are we running Talos on Beta as well, or is it only executed for Nightly?
Talos runs on Beta also.
Assignee | ||
Comment 14•5 years ago
|
||
The latest comparing current moz-central modified with RDD and AV1 pref'd off[1] vs RDD/AV1 pref'd on and launching RDD on-demand instead of at startup[2] is here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=f94624b316c5&newProject=try&newRevision=ed9d5eb646eb&framework=1
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=f94624b316c5
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=ed9d5eb646eb
Comment 15•5 years ago
|
||
So here are the Talos results from running with the patch from bug 1514874:
Which clearly shows that starting RDD on demand reduces session_restore and ts_paint numbers by ~5%. So with that patch landed we should be back to where we were before RDD got turned on.
Reporter | ||
Comment 16•5 years ago
|
||
I'm confirming this got fixed:
== Change summary for alert #18722 (as of Sat, 12 Jan 2019 04:02:11 GMT) ==
Improvements:
6% sessionrestore windows7-32 pgo e10s stylo 307.38 -> 290.25
5% sessionrestore windows10-64 opt e10s stylo 323.67 -> 307.42
5% ts_paint windows7-32 pgo e10s stylo 319.33 -> 303.33
4% sessionrestore_no_auto_restore windows7-32 pgo e10s stylo 325.83 -> 311.33
4% sessionrestore_no_auto_restore windows10-64 opt e10s stylo 345.83 -> 330.50
4% ts_paint_webext windows7-32 pgo e10s stylo 319.38 -> 305.50
4% ts_paint windows10-64 opt e10s stylo 330.50 -> 316.17
4% ts_paint_webext windows10-64 opt e10s stylo 329.92 -> 316.25
2% ts_paint_webext windows10-64-qr opt e10s stylo 335.17 -> 327.33
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=18722
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 17•5 years ago
|
||
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #16)
I'm confirming this got fixed:
Thanks to bug 1514874.
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Description
•