Closed Bug 1216253 Opened 9 years ago Closed 9 years ago

2.5% Linux 64 sessionrestore regression on Fx-Team on October 19, 2015 from push 40c6bcb88bc48ccc49e6be3c31f0deff5d0652a9

Categories: Firefox :: General, defect
Priority: Not set
Severity: normal
Status: RESOLVED WORKSFORME
People: Reporter: jmaher; Assignee: Unassigned
Keywords: perf, regression
Whiteboard: [talos_regression]

Talos has detected a Firefox performance regression from your commit 40c6bcb88bc48ccc49e6be3c31f0deff5d0652a9 in bug 1207137.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=40c6bcb88bc48ccc49e6be3c31f0deff5d0652a9&showAll=1

On the page above you can see a Talos alert for each affected platform, as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#sessionrestore.2Fsessionrestore_no_auto_restore

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64 -u none -t other  # add "mozharness: --spsProfile" to generate profile data
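For reference, a minimal sketch of one way to get that syntax onto try (this assumes an hg checkout with your patch applied as the tip commit; the push URL below is the standard try repository, but double-check against the try server documentation):

hg commit --amend -m "try: -b o -p linux64 -u none -t other mozharness: --spsProfile"  # fold the try syntax into the tip commit message
hg push -f ssh://hg.mozilla.org/try  # -f is needed because try pushes create a new head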

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a sessionrestore
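For reference, a rough sketch of what the local setup looks like (the clone URL and steps here follow the wiki page above at the time of writing; treat the wiki as authoritative):

virtualenv talos-env && source talos-env/bin/activate  # isolated python environment
hg clone https://hg.mozilla.org/build/talos && cd talos  # talos source
python setup.py develop  # install talos into the virtualenv
talos --develop -e <path>/firefox -a sessionrestore  # point -e at your local firefox build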

Making a decision:
As the patch author, you are best placed to help us handle this regression; we need your feedback.
*** Please let us know your plans by Thursday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
All of the code added in bug 1207137 is executed only when a connection error happens, so it is very unlikely to be the cause.
Flags: needinfo?(VYV03354)
Doing the backout yields a 3% performance improvement:
https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=557e2b48fed6&newProject=try&newRevision=7a1cf4d3057b

I suspect we might be hitting a lot of connection errors in session restore. That would make sense: since we don't want external network connections, the test uses a crafted session restore file whose URLs are unique enough yet never hit the network:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/sessionrestore/profile/sessionstore.js
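(As a quick sanity check on that suspicion, assuming a local mozilla-central checkout, one could count how many lines of that file mention https at all:)

grep -c "https://" testing/talos/talos/startup_test/sessionrestore/profile/sessionstore.js  # lines containing https:// in the crafted session file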

This set of changes really does affect this performance test; the question is whether we want to accept it as a regression, given that it appears to only affect connection errors.
(In reply to Joel Maher (:jmaher) from comment #3)
> I suspect we might be hitting a lot of connection errors in session
> restore. That would make sense: since we don't want external network
> connections, the test uses a crafted session restore file whose URLs are
> unique enough yet never hit the network:
> https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/sessionrestore/profile/sessionstore.js

Do those http://localhost URLs redirect to https URLs somehow? Otherwise, the code that was changed is never even run (it's only relevant to https connections that fail in certain ways, the only relevant ones being a connection reset or an end-of-file error).

If the test does cause https URLs to be accessed, the only explanation I can come up with is that nsNSSComponent::AreAnyWeakCiphersEnabled() is slow and changeset 942e5f11f8ce calls it more often than before.
I am not sure; Yoric would know better than I do how session restore works.
Flags: needinfo?(dteller)
Normally, this code never hits the network.
Flags: needinfo?(dteller)
I couldn't reproduce the same performance impact with the same backout:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ccfcbf6b433e&newProject=try&newRevision=0961c3741f56

I suspect the perf result is unreliable.
Thanks for looking into this. I couldn't tell from the try pushes whether there was a backout; sometimes the try server has issues like that where it doesn't show the full set of changes. We have a sustained regression, and retriggers do show the original push as the culprit. That still leaves random infrastructure changes as a possible cause.

I did 5 extra retriggers on the original suspect changeset and on the one prior, and the numbers still show a stark difference, which indicates there are no time-dependent or infrastructure issues. So I have more data to back up the conclusion that https://hg.mozilla.org/integration/fx-team/pushloghtml?changeset=40c6bcb88bc4 is the root cause.

Let's look into why the backout didn't work on try.
Another thought: there might be something odd in the build proper. Let's first figure out what the earlier try push in comment 7 was, then look at doing either more try pushes or examining the builds in detail.

:emk, can you explain more about what the two try pushes were?
Flags: needinfo?(VYV03354)
I tried to break down the offending commit.
https://hg.mozilla.org/try/rev/ccfcbf6b433e
https://hg.mozilla.org/try/rev/a21a72d0dd6d
https://hg.mozilla.org/try/rev/6124dd27eb33
https://hg.mozilla.org/try/rev/0961c3741f56
But I got stuck because I couldn't find the perf difference.
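(For anyone retracing this, a rough sketch of one way to split a commit into pieces for this kind of try bisection; the patch file name is hypothetical and the exact commands depend on your hg extensions:)

hg export 40c6bcb88bc4 > offending.patch  # save the full offending commit as a patch
hg strip 40c6bcb88bc4  # remove it locally (requires the strip extension)
hg import --no-commit offending.patch  # re-apply the changes without committing
# revert the files you want to exclude from this piece, then commit and push:
hg commit -m "try: -b o -p linux64 -u none -t other"
hg push -f ssh://hg.mozilla.org/try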
Flags: needinfo?(VYV03354)
This is not seen on mozilla-aurora after the uplift, and we cannot figure out the root cause when pushing to try. I vote for resolving this as WORKSFORME.
Flags: needinfo?(VYV03354)
I agree.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(VYV03354)
Resolution: --- → WORKSFORME