If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Networking stops occasionally

RESOLVED FIXED in Firefox 55

Status

()

Core
Networking
--
major
RESOLVED FIXED
5 months ago
29 days ago

People

(Reporter: atlanto, Assigned: dragana)

Tracking

({regression})

55 Branch
mozilla55
x86_64
Windows 10
regression
Points:
---

Firefox Tracking Flags

(firefox-esr45 unaffected, firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 fixed)

Details

(Whiteboard: [necko-active])

Attachments

(2 attachments, 3 obsolete attachments)

(Reporter)

Description

5 months ago
As of nightly build 20170506030204, networking stops occasionally while "Performing a TLS handshake to ...", "Transferring data from ...", "Waiting for ...", ... on various sites at random timing.

It seems that disabling TCP fastopen (network.tcp.tcp_fastopen_enable=false) solves the problem.

OS: Windows10 1703 (15063.138) x64
CPU: Intel Core i3 4010U
Net: Realtek PCIe GBE Family Controller (driver ver. 9.1.406.2015)
(Reporter)

Comment 1

5 months ago
"networking" means each operation(Performing a TLS handshake, Transferring data, Waiting, ..."), "stop" means it won't be completed/finished. Reload or any other actions can be done after this issue.
Adding ni on dragana to see if she has ideas.
Flags: needinfo?(dd.mozilla)
(Assignee)

Comment 3

5 months ago
(In reply to atlanto from comment #0)
> As of nightly build 20170506030204, networking stops occasionally while
> "Performing a TLS handshake to ...", "Transferring data from ...", "Waiting
> for ...", ... on various sites at random timing.
> 
> It seems that disabling TCP fastopen (network.tcp.tcp_fastopen_enable=false)
> solves the problem.
> 
> OS: Windows10 1703 (15063.138) x64
> CPU: Intel Core i3 4010U
> Net: Realtek PCIe GBE Family Controller (driver ver. 9.1.406.2015)

Can you please try to make a http log when this happens. This feature depend a lot on a internet path, so I am not sure it could be easy to reproduce. It would be very helpful if you could capture it in a log:

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

rotate:200 limits log to only 200MB, when that limit is reach the oldest entry in the log will be overwritten. maybe increase it a bit or at least stop Firefox as soon as it start happening so that the relevant part of the log is still in the log and is not overwritten.

also change the logging to:
set MOZ_LOG=timestamp,rotate:200,nsHttp:5,nsSocketTransport:5

Thanks a lot!
Flags: needinfo?(dd.mozilla) → needinfo?(euthanasia_waltz)
(Assignee)

Updated

5 months ago
Assignee: nobody → dd.mozilla
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [necko-active]
(Reporter)

Comment 4

5 months ago
Created attachment 8865203 [details]
http-log.zip

On startup, opening https://www.yahoo.co.jp/ as home page.
Staying "Performing a TLS handshake to s.yimg.jp...", then exit.
Flags: needinfo?(euthanasia_waltz)
(Assignee)

Comment 5

5 months ago
Thanks a lot!
(Assignee)

Comment 6

5 months ago
The problem here is that a TFO connection is in progress when another connection that uses H2 to the same host is established. Therefore the backup HalfOpenSocket for the TFO connection gets destroyed and also wrongly clears input/outputStream callbacks.

In this case we can close the TFO connection and return transaction to the pending queue.
(Assignee)

Comment 7

5 months ago
I will write this down to see if I do not mix up logic here:

a TFO connection has transport and that transport has a reference to HalfOpenSocket as a backup. This HalfOpenSocket is in the HalfOpenSocket list but it is not count as active socket:

1) socketTransport can use TFO (mFastOpenCallback->FastOpenEnabled() == true). It calls mFastOpenCallback->StartFastOpen() to start a new connection but HalfOpenSocket is kept as well. The connection is counted as active but HalfOpenSocket is not counted as active (i.e. it calls gHttpHandler->ConnMgr()->RecvdConnect() to decrease counter) The halfOpenSocket has reference to SocketTransport but that socketTransport belongs to the connection. It also keeps reference to the connection(mConnectionNegotiatingFastOpen). If Fast open is not supported we continue in the same way (this is to keep a timer for a backup transport).

2)
  a) the TFO socketTransport succeeds to connect -> socketTransport::OnSocketConnected calls SetFastOpenConnected. The HalfOpenSocket backup timer is canceled and the HalfOpenSocket's reference to socketTransport is deleted.
If backup transport has started this HalfOpenSocket will be counted as active for its backup stream (gHttpHandler->ConnMgr()->StartedConnect() is called in SetupStream)

  b) socketTransport receives an error and it cannot recover from that error -> nsSocketTransport::OnSocketDetached calls SetFastOpenConnected ->  The HalfOpenSocket backup timer is canceled and the HalfOpenSocket's reference to socketTransport is deleted (the same as in case of success)

  c) socketTransport receives an error and it can recover. This errors are:
        NS_ERROR_CONNECTION_REFUSED and NS_ERROR_NET_TIMEOUT -> disable fastopen and try again, i.e. nsSocketTransport::RecoverFromError calls SetFastOpenConnected -> it closes connection (but it does not close socketTransport), transaction is return to pending queue and the halfOpenSocket takes the socketTransport and continues as if TFO has never been used. It also add its self to the active halfOpenSocket counter(i.e. it calls gHttpHandler->ConnMgr()->StartedConnect()).
        NS_ERROR_PROXY_CONNECTION_REFUSED -> I will add this one to the list above.
        NS_ERROR_UNKNOWN_HOST and NS_ERROR_UNKNOWN_PROXY_HOST are happening before TFO has started.

3) The backup transport for the connection is connected, i.e. nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady for backup transport is called. if the HalfOpenSocket still has mConnectionNegotiatingFastOpen, we will cancel the connection also closed its socketTransport and return transaction to the pending queue (this transaction is going to be picked up by this new backup socketTransport that is connected)

4) transaction gets canceled during TFO:
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnectionMgr.cpp#2240
The transaction has connection and transaction cannot be IsDone() so we call conn->CloseTransaction(trans, closeCode);
This will set socketTransport condition to error code. And in the next poll iteration the socket will be detached -> OnSocketDetached is called and this calls SetFastOpenConnected which performs action described in 2b above. Here everything is fine, this code path does not cause this hang or crashes.

5) nsHttpConnectionMgr::TimeoutTick calls the tick handler for each active connection. The TFO connection is one of them. and then it cancels stalled halfOpenSockets. This will be called for the HalfOpenSocket belonging to TFO connection as well. I think this will never happen because of timing but if someone changes timeout durations it could happen. I will fix this: do not cancel HalfOpenSocket belonging to TFO connection here.

6) nsHttpConnectionMgr::OnMsgShutdown: conn->CloseTransaction for active connection is called first. The TFO connections are among them, then all HalfOpenSockets are Abandon(). The HalfOpenSocket belonging to a TFO connection as well. This can be a problem because socketTransport holds reference to the HalfOpenSocket that will be deleted. But is probably not causing bug 1362832. I will fix this one.

7) nsHttpConnectionMgr::UpdateCoalescingForNewConn this is this bug and I am almost 100% sure bug 1362832 as well. Patrick rewrote this code when TFO was almost done and I have not look into this properly, my fault. We could cancel the socketTransport and move transaction into pending queue. This probably causes bug 1362959 as well :)
(Assignee)

Comment 8

5 months ago
Calling mFastOpenCallback->SetTransportConnected(reason) in nsSocketTransport::Close will solve problem under 6).
(Assignee)

Comment 9

5 months ago
Created attachment 8865422 [details] [diff] [review]
bug_1362821_v1.patch
Attachment #8865422 - Flags: review?(mcmanus)
(Assignee)

Comment 10

5 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=34a9426f59caf6ceacf2e7235043ee3f0015d69a
comment 7 is helpful. sounds like 1362959 will be solved with this - I'm not sure about 1362832. dragana?
Attachment #8865422 - Flags: review?(mcmanus) → review+
(Assignee)

Comment 12

5 months ago
(In reply to Patrick McManus [:mcmanus] from comment #11)
> comment 7 is helpful. sounds like 1362959 will be solved with this - I'm not
> sure about 1362832. dragana?

I have added a check where 1362832 fails:
        if (mFastOpenCallback) {
            mFastOpenCallback->SetFastOpenConnected(mCondition);
        }

and also adding 
mFastOpenCallback->SetFastOpenConnected(mCondition);
mFastOpenCallback = nullptr;

to nsSocketTransport::Close will resolved bug 1362832. I think, but I will observe crash stats.
Blocks: 1188435
status-firefox53: --- → unaffected
status-firefox54: --- → unaffected
status-firefox55: --- → affected
status-firefox-esr45: --- → unaffected
status-firefox-esr52: --- → unaffected
Keywords: regression
(Assignee)

Comment 13

5 months ago
Created attachment 8865433 [details] [diff] [review]
bug_1362821_v1.patch

I have realized that in the same patch I have added mFastOpenCallback->SetFastOpenConnected(mCondition);

to nsSocketTransport::Close

which is going to cancel BackupTimer and probably destroy halfOpenSocket. So we need to hold a reference. Sorry.
Attachment #8865422 - Attachment is obsolete: true
Attachment #8865433 - Flags: review?(mcmanus)
(Assignee)

Updated

5 months ago
Attachment #8865433 - Flags: review?(mcmanus)
(Assignee)

Comment 14

5 months ago
Created attachment 8865434 [details] [diff] [review]
bug_1362821_v1.patch
Attachment #8865433 - Attachment is obsolete: true
Attachment #8865434 - Flags: review?(mcmanus)
(Assignee)

Comment 15

5 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4c0f63313da2829da4df5ccf235c94a15e147559
Comment on attachment 8865434 [details] [diff] [review]
bug_1362821_v1.patch

Review of attachment 8865434 [details] [diff] [review]:
-----------------------------------------------------------------

the interdiff aspects of mozreview would really shine here :)

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +793,4 @@
>          LOG(("UpdateCoalescingForNewConn() forcing halfopen abandon %p\n",
> +             half));
> +
> +        RefPtr<nsHalfOpenSocket> halfPtr;

maybe name it deleteProtector
Attachment #8865434 - Flags: review?(mcmanus) → review+
(Assignee)

Comment 17

4 months ago
Created attachment 8865485 [details] [diff] [review]
bug_1362821_v1.patch
Attachment #8865434 - Attachment is obsolete: true
Attachment #8865485 - Flags: review+

Comment 18

4 months ago
Pushed by dd.mozilla@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b31a06dd4936
Properly destroy a HalfOpeSocket that is used as a backup for a TFO connection if another H2 connection is established for the same host. r=mcmanus

Comment 19

4 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/b31a06dd4936
Status: ASSIGNED → RESOLVED
Last Resolved: 4 months ago
status-firefox55: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
I still see this on nightly with network.tcp.tcp_fastopen_enable=true.
Certain script fetches (usually from google-analytics.com or amazon ad subdomains) frequently hang on "Performing TLS handshake" or "Waiting for...".
uBlock Origin usually prevents this due to ad-blocking, but when uBlock is disabled I see hangs that can only be stopped once the slow-script warning appears. 
I can consistently reproduce this on https://servethehome.com/.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sorry for the bug-spam; turns out this is not related to tcp fastopen.
I'll collect more info and open another bug report.
Status: REOPENED → RESOLVED
Last Resolved: 4 months ago2 months ago
Resolution: --- → FIXED

Comment 22

2 months ago
I am getting the error message on startup, opening https://www.google.com as home page.
Staying "Performing a TLS handshake to ...", then exit. The issue appears might be resolved or has it been reopened? My problem started 5 days ago when a Windows 10 patch was done. OS: Windows10 Home 1703 (15063.483) x64. Can anyone confirm what the fix is?

Comment 23

a month ago
Same issue as 1desechalhan. Started a few days ago. Will frequently, and somewhat randomly hang on "Performing a TLS handshake to...". Sometimes refreshing the page or restarting the browser fixes the issue, but only temporarily. Already tried creating a new Firefox profile, but it doesn't help. Same thing happens on Firefox Developer Edition.

Google search is where this most commonly happens, but it also happened while browsing to various other sites such as Patreon.

Comment 24

a month ago
Same issue here (randomly triggered). FF: 15.0.2 (64bit), windows 10 pro N (1703) os build 15063.54
(Assignee)

Comment 25

a month ago
I will open another bug for the last 3 comments. It is easier to track the issue, since this bug is closed.
(Assignee)

Comment 26

a month ago
The new bug is 1391543.

Comment 27

29 days ago
Needs "perf" key word?
You need to log in before you can comment on or make changes to this bug.