Closed Bug 1362821 Opened 7 years ago Closed 7 years ago

Networking stops occasionally

Categories

(Core :: Networking, defect)

55 Branch
x86_64
Windows 10
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox-esr45 --- unaffected
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- fixed

People

(Reporter: euthanasia_waltz, Assigned: dragana)

References

Details

(Keywords: regression, Whiteboard: [necko-active])

Attachments

(2 files, 3 obsolete files)

As of nightly build 20170506030204, networking stops occasionally while "Performing a TLS handshake to ...", "Transferring data from ...", "Waiting for ...", ... on various sites at random timing.

It seems that disabling TCP fastopen (network.tcp.tcp_fastopen_enable=false) solves the problem.

OS: Windows10 1703 (15063.138) x64
CPU: Intel Core i3 4010U
Net: Realtek PCIe GBE Family Controller (driver ver. 9.1.406.2015)
"networking" means each operation(Performing a TLS handshake, Transferring data, Waiting, ..."), "stop" means it won't be completed/finished. Reload or any other actions can be done after this issue.
Adding ni on dragana to see if she has ideas.
Flags: needinfo?(dd.mozilla)
(In reply to atlanto from comment #0)
> As of nightly build 20170506030204, networking stops occasionally while
> "Performing a TLS handshake to ...", "Transferring data from ...", "Waiting
> for ...", ... on various sites at random timing.
> 
> It seems that disabling TCP fastopen (network.tcp.tcp_fastopen_enable=false)
> solves the problem.
> 
> OS: Windows10 1703 (15063.138) x64
> CPU: Intel Core i3 4010U
> Net: Realtek PCIe GBE Family Controller (driver ver. 9.1.406.2015)

Can you please try to make a http log when this happens. This feature depend a lot on a internet path, so I am not sure it could be easy to reproduce. It would be very helpful if you could capture it in a log:

https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

rotate:200 limits log to only 200MB, when that limit is reach the oldest entry in the log will be overwritten. maybe increase it a bit or at least stop Firefox as soon as it start happening so that the relevant part of the log is still in the log and is not overwritten.

also change the logging to:
set MOZ_LOG=timestamp,rotate:200,nsHttp:5,nsSocketTransport:5

Thanks a lot!
Flags: needinfo?(dd.mozilla) → needinfo?(euthanasia_waltz)
Assignee: nobody → dd.mozilla
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Whiteboard: [necko-active]
Attached file http-log.zip
On startup, opening https://www.yahoo.co.jp/ as home page.
Staying "Performing a TLS handshake to s.yimg.jp...", then exit.
Flags: needinfo?(euthanasia_waltz)
Thanks a lot!
The problem here is that a TFO connection is in progress when another connection that uses H2 to the same host is established. Therefore the backup HalfOpenSocket for the TFO connection gets destroyed and also wrongly clears input/outputStream callbacks.

In this case we can close the TFO connection and return transaction to the pending queue.
I will write this down to see if I do not mix up logic here:

a TFO connection has transport and that transport has a reference to HalfOpenSocket as a backup. This HalfOpenSocket is in the HalfOpenSocket list but it is not count as active socket:

1) socketTransport can use TFO (mFastOpenCallback->FastOpenEnabled() == true). It calls mFastOpenCallback->StartFastOpen() to start a new connection but HalfOpenSocket is kept as well. The connection is counted as active but HalfOpenSocket is not counted as active (i.e. it calls gHttpHandler->ConnMgr()->RecvdConnect() to decrease counter) The halfOpenSocket has reference to SocketTransport but that socketTransport belongs to the connection. It also keeps reference to the connection(mConnectionNegotiatingFastOpen). If Fast open is not supported we continue in the same way (this is to keep a timer for a backup transport).

2)
  a) the TFO socketTransport succeeds to connect -> socketTransport::OnSocketConnected calls SetFastOpenConnected. The HalfOpenSocket backup timer is canceled and the HalfOpenSocket's reference to socketTransport is deleted.
If backup transport has started this HalfOpenSocket will be counted as active for its backup stream (gHttpHandler->ConnMgr()->StartedConnect() is called in SetupStream)

  b) socketTransport receives an error and it cannot recover from that error -> nsSocketTransport::OnSocketDetached calls SetFastOpenConnected ->  The HalfOpenSocket backup timer is canceled and the HalfOpenSocket's reference to socketTransport is deleted (the same as in case of success)

  c) socketTransport receives an error and it can recover. This errors are:
        NS_ERROR_CONNECTION_REFUSED and NS_ERROR_NET_TIMEOUT -> disable fastopen and try again, i.e. nsSocketTransport::RecoverFromError calls SetFastOpenConnected -> it closes connection (but it does not close socketTransport), transaction is return to pending queue and the halfOpenSocket takes the socketTransport and continues as if TFO has never been used. It also add its self to the active halfOpenSocket counter(i.e. it calls gHttpHandler->ConnMgr()->StartedConnect()).
        NS_ERROR_PROXY_CONNECTION_REFUSED -> I will add this one to the list above.
        NS_ERROR_UNKNOWN_HOST and NS_ERROR_UNKNOWN_PROXY_HOST are happening before TFO has started.

3) The backup transport for the connection is connected, i.e. nsHttpConnectionMgr::nsHalfOpenSocket::OnOutputStreamReady for backup transport is called. if the HalfOpenSocket still has mConnectionNegotiatingFastOpen, we will cancel the connection also closed its socketTransport and return transaction to the pending queue (this transaction is going to be picked up by this new backup socketTransport that is connected)

4) transaction gets canceled during TFO:
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnectionMgr.cpp#2240
The transaction has connection and transaction cannot be IsDone() so we call conn->CloseTransaction(trans, closeCode);
This will set socketTransport condition to error code. And in the next poll iteration the socket will be detached -> OnSocketDetached is called and this calls SetFastOpenConnected which performs action described in 2b above. Here everything is fine, this code path does not cause this hang or crashes.

5) nsHttpConnectionMgr::TimeoutTick calls the tick handler for each active connection. The TFO connection is one of them. and then it cancels stalled halfOpenSockets. This will be called for the HalfOpenSocket belonging to TFO connection as well. I think this will never happen because of timing but if someone changes timeout durations it could happen. I will fix this: do not cancel HalfOpenSocket belonging to TFO connection here.

6) nsHttpConnectionMgr::OnMsgShutdown: conn->CloseTransaction for active connection is called first. The TFO connections are among them, then all HalfOpenSockets are Abandon(). The HalfOpenSocket belonging to a TFO connection as well. This can be a problem because socketTransport holds reference to the HalfOpenSocket that will be deleted. But is probably not causing bug 1362832. I will fix this one.

7) nsHttpConnectionMgr::UpdateCoalescingForNewConn this is this bug and I am almost 100% sure bug 1362832 as well. Patrick rewrote this code when TFO was almost done and I have not look into this properly, my fault. We could cancel the socketTransport and move transaction into pending queue. This probably causes bug 1362959 as well :)
Calling mFastOpenCallback->SetTransportConnected(reason) in nsSocketTransport::Close will solve problem under 6).
Attached patch bug_1362821_v1.patch (obsolete) — Splinter Review
Attachment #8865422 - Flags: review?(mcmanus)
comment 7 is helpful. sounds like 1362959 will be solved with this - I'm not sure about 1362832. dragana?
Attachment #8865422 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #11)
> comment 7 is helpful. sounds like 1362959 will be solved with this - I'm not
> sure about 1362832. dragana?

I have added a check where 1362832 fails:
        if (mFastOpenCallback) {
            mFastOpenCallback->SetFastOpenConnected(mCondition);
        }

and also adding 
mFastOpenCallback->SetFastOpenConnected(mCondition);
mFastOpenCallback = nullptr;

to nsSocketTransport::Close will resolved bug 1362832. I think, but I will observe crash stats.
Attached patch bug_1362821_v1.patch (obsolete) — Splinter Review
I have realized that in the same patch I have added mFastOpenCallback->SetFastOpenConnected(mCondition);

to nsSocketTransport::Close

which is going to cancel BackupTimer and probably destroy halfOpenSocket. So we need to hold a reference. Sorry.
Attachment #8865422 - Attachment is obsolete: true
Attachment #8865433 - Flags: review?(mcmanus)
Attachment #8865433 - Flags: review?(mcmanus)
Attached patch bug_1362821_v1.patch (obsolete) — Splinter Review
Attachment #8865433 - Attachment is obsolete: true
Attachment #8865434 - Flags: review?(mcmanus)
Comment on attachment 8865434 [details] [diff] [review]
bug_1362821_v1.patch

Review of attachment 8865434 [details] [diff] [review]:
-----------------------------------------------------------------

the interdiff aspects of mozreview would really shine here :)

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +793,4 @@
>          LOG(("UpdateCoalescingForNewConn() forcing halfopen abandon %p\n",
> +             half));
> +
> +        RefPtr<nsHalfOpenSocket> halfPtr;

maybe name it deleteProtector
Attachment #8865434 - Flags: review?(mcmanus) → review+
Attachment #8865434 - Attachment is obsolete: true
Attachment #8865485 - Flags: review+
Pushed by dd.mozilla@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b31a06dd4936
Properly destroy a HalfOpeSocket that is used as a backup for a TFO connection if another H2 connection is established for the same host. r=mcmanus
https://hg.mozilla.org/mozilla-central/rev/b31a06dd4936
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
I still see this on nightly with network.tcp.tcp_fastopen_enable=true.
Certain script fetches (usually from google-analytics.com or amazon ad subdomains) frequently hang on "Performing TLS handshake" or "Waiting for...".
uBlock Origin usually prevents this due to ad-blocking, but when uBlock is disabled I see hangs that can only be stopped once the slow-script warning appears. 
I can consistently reproduce this on https://servethehome.com/.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sorry for the bug-spam; turns out this is not related to tcp fastopen.
I'll collect more info and open another bug report.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
I am getting the error message on startup, opening https://www.google.com as home page.
Staying "Performing a TLS handshake to ...", then exit. The issue appears might be resolved or has it been reopened? My problem started 5 days ago when a Windows 10 patch was done. OS: Windows10 Home 1703 (15063.483) x64. Can anyone confirm what the fix is?
Same issue as 1desechalhan. Started a few days ago. Will frequently, and somewhat randomly hang on "Performing a TLS handshake to...". Sometimes refreshing the page or restarting the browser fixes the issue, but only temporarily. Already tried creating a new Firefox profile, but it doesn't help. Same thing happens on Firefox Developer Edition.

Google search is where this most commonly happens, but it also happened while browsing to various other sites such as Patreon.
Same issue here (randomly triggered). FF: 15.0.2 (64bit), windows 10 pro N (1703) os build 15063.54
I will open another bug for the last 3 comments. It is easier to track the issue, since this bug is closed.
The new bug is 1391543.
Needs "perf" key word?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: