Closed Bug 1389453 Opened 7 years ago Closed 7 years ago

65.92 - 76.75% sessionrestore_many_windows (windows10-64) regression on push a0a79fd3dd8b6adf8f51906b27de9271b1967eea (Fri Aug 11 2017)

Categories

(Core :: Networking: HTTP, defect, P1)

57 Branch
Unspecified
Windows 10
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox57 --- unaffected

People

(Reporter: igoldan, Assigned: dragana)

References

Details

(Keywords: perf, regression, talos-regression, Whiteboard: [necko-active])

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=a0a79fd3dd8b6adf8f51906b27de9271b1967eea

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

 77%  sessionrestore_many_windows windows10-64 pgo e10s     2,634.08 -> 4,655.75
 66%  sessionrestore_many_windows windows10-64 opt e10s     3,039.47 -> 5,043.00
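
The percentages in the alert can be reproduced from the before/after values listed above. A minimal sketch, where the helper name `percent_change` is only illustrative and not part of Talos or Perfherder:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the alert above (lower is better for this test).
alerts = {
    "pgo e10s": (2634.08, 4655.75),
    "opt e10s": (3039.47, 5043.00),
}

for name, (before, after) in alerts.items():
    print(f"{name}: {percent_change(before, after):+.2f}%")
# pgo e10s: +76.75%
# opt e10s: +65.92%
```

These two values match the 65.92 - 76.75% range given in the bug summary.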


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=8706

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → Networking: HTTP
Product: Firefox → Core
The last time I turned on TFO we had the opposite situation: https://bugzilla.mozilla.org/show_bug.cgi?id=1377004#c17

I am not sure why in https://bugzilla.mozilla.org/show_bug.cgi?id=1377004#c17 the improvement was 24%.

I will look into this one.
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Whiteboard: [necko-active]
Bug 1384633 and bug 1363372 are pushed as well. These 2 bugs do not change the firefox behavior without patch bug_turn_on_tfo.patch from bug 1389079.
Blocks: 1384633, 1363372
Maybe this will be resolved by bug 1390503, if not I will look further.
Bug 1390503 improved one test:

19%  sessionrestore_many_windows windows10-64 pgo e10s     4,649.83 -> 3,783.67

see bug 1390503 comment 4

The performance is still worse with TFO turned on than without it. There is an explanation for this. TFO tries to send a TFO cookie request, which is a TCP option. Our test infrastructure rejects all connections with a TFO option. Necko will retry the connection without TFO, but this means that for every connection we try to connect 2 times, and this affects the performance. TFO will be turned off if 5 domains in a row reject TFO.
I will take a look at a log, just to be sure that we do the right thing and to check whether there is another bug in the code.
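
The fallback behavior described in this comment can be modeled as a small simulation. This is only a sketch under stated assumptions, not Necko's implementation: the names `TfoState` and `connect` are hypothetical, every call is assumed to target a new domain, and the plain TCP retry is counted as one more attempt regardless of its outcome.

```python
from dataclasses import dataclass

TFO_DISABLE_THRESHOLD = 5  # from the comment above: 5 domains in a row


@dataclass
class TfoState:
    enabled: bool = True
    consecutive_rejections: int = 0
    attempts: int = 0  # total connect attempts made so far


def connect(state: TfoState, server_accepts_tfo: bool) -> None:
    """Simulate one connection to a new domain, with fallback to plain TCP."""
    if state.enabled:
        state.attempts += 1  # first try, carrying the TFO option
        if server_accepts_tfo:
            state.consecutive_rejections = 0
            return
        state.consecutive_rejections += 1
        if state.consecutive_rejections >= TFO_DISABLE_THRESHOLD:
            state.enabled = False
    state.attempts += 1  # plain TCP connect (a retry, or TFO already disabled)


state = TfoState()
for _ in range(8):  # 8 domains that all reject the TFO option
    connect(state, server_accepts_tfo=False)
print(state.enabled, state.attempts)  # False 13
```

In this model the first 5 domains cost 2 attempts each and the remaining 3 cost 1 each, which is where the doubled connection cost comes from until TFO is switched off.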
can we change anything in our test infrastructure?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> can we change anything in our test infrastructure?

yes, we need to find out why connections are refused if the TCP SYN packet has the TCP FastOpen option. The socket error that we get is ERROR_CONNECTION_REFUSED.
would that have anything to do with using localhost?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #7)
> would that have anything to do with using localhost?

no, it should work. I did all my initial tests on localhost only.
This sounds like an infra bug, or a bug that should be blocked by one revolving around "why connections are refused if the TCP SYN packet has the TCP FastOpen option". Joel, can you help get that work prioritized?
Flags: needinfo?(jmaher)
Priority: -- → P2
I am not sure what the problem is- all our testing is on localhost with a python webserver, this looks to be windows10 specific- I have that locally and can use a loaner, but I really don't know what to do- if there are specific requests for me to verify or change, I can look into that.
Flags: needinfo?(jmaher)
I can try to debug this on a Windows VM, setting up the same python webserver, the same version.

I will not have time before 08/22 (Nightly freeze).
Flags: needinfo?(dd.mozilla)
I noticed bug 1394818 landed and looks like it resolved the OPT regression completely. The PGO one looks only partially fixed.
Dragana, were you expecting to resolve this bug entirely? If so, be aware that we recently updated the way we do PGO builds.
TFO will be turned off on 57.

This problem will be hard to resolve since it is a Windows OS problem.
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P2 → P1
With help from Honza, I tested this on 2 Windows machines. The good news is that the problem does not affect an Apache server. Only our test servers are affected. The httpd.js server (our small test server) seems to not work at all: consecutive requests without TFO are rejected as well.
(TFO is turned off, so this was not that urgent to look at.)

I installed Windows on a VM and ran one of the tests. I noticed that there were a lot of NS_ERROR_CONNECTION_REFUSED errors. It was also showing the "Unable to connect" error page. I thought it was my local installation, so I ran try tests with logging turned on, and in the log I see the same behavior. I noticed that the tests are trying to connect to localhost:8080, but the server does not listen on that port.

When we turn on TFO and we see an NS_ERROR_CONNECTION_REFUSED, we retry without TFO, but in this case the retry will fail with NS_ERROR_CONNECTION_REFUSED as well. This influences the performance measurements for these tests but should not reflect reality.


Can we double check whether my logs are correct and whether we really try to connect to a non-existent port?
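
One way to check independently of the Firefox logs whether anything is listening on a given port is a plain TCP connect. A minimal sketch using only the Python standard library; the helper name `is_listening` is illustrative, and the host and port are the ones mentioned in this comment:

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connect to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        # (for example ECONNREFUSED when nothing listens on the port).
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("localhost:8080 listening:", is_listening("127.0.0.1", 8080))
```

Running this on the test machine while the Talos web server is up would show whether port 8080 is actually served there.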
Flags: needinfo?(dd.mozilla) → needinfo?(jmaher)
which test were you running.  All our talos is run via a python web server- in some cases where we load pages from the tp5 pageset there will be 404's, mostly because we downloaded the files and hacked out specifics that were not possible to get when we cleaned up the pageset.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #17)
> which test were you running.  All our talos is run via a python web server-
> in some cases where we load pages from the tp5 pageset there will be 404's,
> mostly because we downloaded the files and hacked out specifics that were
> not possible to get when we cleaned up the pageset.

sessionrestore-many-windows

Locally I ran only this one, and on try I looked only at this one (at the group of tests containing this test).
as per: https://wiki.mozilla.org/Buildbot/Talos/Tests#sessionrestore.2Fsessionrestore_no_auto_restore.2Fsessionrestore_many_windows, I would contact :Yoric or :mikedeboer.  I am not familiar with the dataset used for sessionrestore tests.
Flags: needinfo?(mdeboer)
Hi Dragana, it's indeed possible that there are pages in the set of tabs to open by this test that are probing ports that are not open.
The set of pages that it's loading can be found at [1] and the uncompressed version is at [2]. This file is put in the profile directory and Firefox is started with it. Moving the compressed file into your own profile allows you to run the test locally and selectively, by clicking 'Restore Previous Session'.

[1] http://searchfox.org/mozilla-central/source/testing/talos/talos/startup_test/sessionrestore/profile-manywindows/sessionstore.jsonlz4
[2] http://searchfox.org/mozilla-central/source/testing/talos/talos/startup_test/sessionrestore/profile-manywindows/sessionstore.js
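
To see which hosts and ports the restored tabs point at, the uncompressed file [2] could be summarized with a short script. This is a sketch that assumes the session data is a JSON object laid out as windows → tabs → entries → url; that layout, the helper name `count_endpoints`, and the sample data below are assumptions for illustration, not taken from the real file.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_endpoints(session: dict) -> Counter:
    """Count history-entry URLs per (host, port) in a session-store dict."""
    counts: Counter = Counter()
    for window in session.get("windows", []):
        for tab in window.get("tabs", []):
            for entry in tab.get("entries", []):
                parts = urlsplit(entry.get("url", ""))
                if parts.hostname:
                    # port is None when the URL does not name one explicitly
                    counts[(parts.hostname, parts.port)] += 1
    return counts


# Tiny hand-written stand-in for the real file, for illustration only.
sample = {
    "windows": [
        {
            "tabs": [
                {"entries": [{"url": "http://localhost:8080/a.html"}]},
                {
                    "entries": [
                        {"url": "http://localhost:8080/b.html"},
                        {"url": "http://localhost/c.html"},
                    ]
                },
            ]
        }
    ]
}

print(count_endpoints(sample).most_common())
```

For the real data, the dict would come from `json.load` on a local copy of sessionstore.js instead of the `sample` literal.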
Flags: needinfo?(mdeboer)
Dragana, have you got the time to look more closely over this issue?
Flags: needinfo?(dd.mozilla)
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #21)
> Dragana, have you got the time to look more closely over this issue?

TFO is turned off on 58 for other issues as well, which will take longer to resolve. So this is not very critical.

I did try to look at this issue, and I think that our tests are the problem, not TFO.
Flags: needinfo?(dd.mozilla)
(In reply to Mike de Boer [:mikedeboer] from comment #20)
> Hi Dragana, it's indeed possible that there are pages in the set of tabs to
> open by this test that are probing ports that are not open.
> The set of pages that it's loading can be found at [1] and the uncompressed
> version is at [2]. This file is put in the profile directory and Firefox is
> started with it. Moving the compressed file into your own profile allows you
> to run the test locally and selectively, by clicking 'Restore Previous
> Session'.
> 
> [1]
> http://searchfox.org/mozilla-central/source/testing/talos/talos/startup_test/
> sessionrestore/profile-manywindows/sessionstore.jsonlz4
> [2]
> http://searchfox.org/mozilla-central/source/testing/talos/talos/startup_test/
> sessionrestore/profile-manywindows/sessionstore.js

There are a lot of connections to port 8080. From the HTTP log, the server is not listening on port 8080 and all these connections fail with a connection-refused error.

From the log that I have seen, TFO was turned off, there were more than 500 connections that were refused because the server is not listening at that port.

With TFO turned on, Firefox tries to use TFO in 50% of connections, and these fail because the server is not listening on that port. Our fallback mechanism for TFO retries these connections without TFO, and the retries fail as well. These retries are overhead that we cannot do anything about. On the real network this happens very rarely (only 0.01% according to telemetry data). TFO will turn itself off on such an error, therefore only 50% of connections try it.
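
The size of this overhead can be estimated with simple arithmetic. A rough sketch, assuming exactly one retry per refused TFO attempt and using 500 as a round stand-in for the "more than 500" refused connections seen in the log; the function name `connect_attempts` is illustrative:

```python
def connect_attempts(refused: int, tfo_fraction: float) -> int:
    """Total connect attempts when a fraction of the refused connections
    first tries TFO and is then retried once without it."""
    retries = round(refused * tfo_fraction)
    return refused + retries


refused = 500  # round figure standing in for "more than 500" from the log
print(connect_attempts(refused, 0.0))  # TFO off: 500
print(connect_attempts(refused, 0.5))  # TFO on, 50% try TFO: 750
```

Under these assumptions the test performs about 50% more failing connection attempts with TFO on, all of them against a port where nothing is listening.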
Closing this bug. TFO is turned off at the moment, and from the analysis this is a regression that is not really relevant for the normal usage of FF. See comment 23.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME