Sometimes network never finish loading

RESOLVED FIXED in Firefox 57

Status

()

P1
normal
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: kanru, Assigned: dragana)

Tracking

unspecified
mozilla58
x86_64
Linux
Points:
---

Firefox Tracking Flags

(firefox57+ fixed, firefox58 fixed)

Details

Attachments

(4 attachments, 1 obsolete attachment)

(Reporter)

Description

a year ago
Recently in Nightly some websites never finish loading. When it happens all new connection to the same website hang forever.

I've captured a recording and a network log.
(Reporter)

Comment 1

a year ago
Created attachment 8905440 [details]
network-hang.webm.webm
(Reporter)

Comment 2

a year ago
Created attachment 8905441 [details]
necko-log.txt-child.4219
Might be the same issue as bug 1396380.

Comment 4

a year ago
Created attachment 8909025 [details]
network.log-main.85989

I met the same problem. Here is the main process log. From devtool, it seems getting blocked on loading javascript file from "cdnjs.cloudflare.com" domain.
From the log provided by @evelyn, the http request is pending because cdnjs.cloudflare.com already has six connections.
>nsHttpConnectionMgr::AtActiveConnectionLimit [ci=.S.....[tlsflags0x00000000]cdnjs.cloudflare.com:443 caps=129,totalCount=6, maxPersistConns=6]

cdnjs.cloudflare.com support HTTP2, however in this log we hit the six connection limit for this domain. @kershaw and I guess that there were 6 pending halfopened connections and we didn't aggressively prune these "half dead" connections. We should do that before this line of code https://searchfox.org/mozilla-central/rev/05c4c3bc0cfb9b0fc66bdfc8c47cac674e45f151/netwerk/protocol/http/nsHttpConnectionMgr.cpp#1438.
(Assignee)

Comment 6

a year ago
I am going through my bug mail back from vacation. We should really look into this and not lose track of it.

I will try to look at it tomorrow. If someone else have time please take it.
(Assignee)

Updated

a year ago
tracking-firefox57: --- → ?
Priority: -- → P1
(Assignee)

Comment 7

a year ago
We have 6 http2 connections and non of them can accept new transaction. I do not know why is so because log starts to late. At the end of the log the 6 connections are closed.

Kan-Ru, Evelyn, can one of you make a http log that starts before opening the affected page?

To make http log:
https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Flags: needinfo?(kchen)
Flags: needinfo?(ehung)

Comment 8

a year ago
Unfortunately I don't have a reliable STR but I will try to log more next time if I meet the problem.
Flags: needinfo?(ehung)

Comment 9

a year ago
FWIW I created a clean profile for my previous log. There were only three pages being loaded with this profile:
1. https://bugzilla.mozilla.org/show_bug.cgi?id=1397686
2. http://localhost:8000/example/index.html
3. https://kanru.info/blog/
and the problematic "cdnjs.cloudflare.com" domain was loaded by (2)

I tried to do it again but couldn't reproduce it. It was relatively easy to reproduce two days ago. I applied Nightly updated everyday, btw.
(Assignee)

Comment 10

a year ago
Can you download the 2-days-old Nightly and try to reproduce. I do not remember seeing anything landing in last 2-3 days that could influence this behavior.

You can find Nightlies at:
https://download-installer.cdn.mozilla.net/pub/firefox/nightly/2017/09/

Comment 11

a year ago
I downloaded a 9/16 build but still failed to reproduce. :(

What I did last time was keep refreshing the localhost page because I was testing some css properties.
The page I ran was very simple, it just loaded two JS files (hard-coded urls to cdnjs.cloudflare.com). These two files should be cached in this case (I hit ctrl+r for reload), so I didn't know why it's blocked by networking.
(Assignee)

Comment 12

a year ago
I missed to see the first time that http2 connections are not reading any data for a log time:

2017-09-16 18:40:14.825905 UTC - [Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x11ea13000 delta since last read 18s
2017-09-16 18:40:14.825923 UTC - [Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x12ea15800 delta since last read 18s
2017-09-16 18:40:14.825940 UTC - [Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x1354d4000 delta since last read 18s
2017-09-16 18:40:14.825958 UTC - [Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x12ab39000 delta since last read 36s
2017-09-16 18:40:14.825974 UTC - [Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x12ab47000 delta since last read 36s
2017-09-16 18:40:14.825989 UTC - [Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x135bed000 delta since last read 45s


This sounds like bug 1395494, but TFO is turned off since 9th so it is off on the 9/16 Nightly build.
(In reply to Dragana Damjanovic [:dragana] from comment #12)
> I missed to see the first time that http2 connections are not reading any
> data for a log time:
> 
> 2017-09-16 18:40:14.825905 UTC - [Socket Thread]: I/nsHttp
> Http2Session::ReadTimeoutTick 0x11ea13000 delta since last read 18s
> 2017-09-16 18:40:14.825923 UTC - [Socket Thread]: I/nsHttp
> Http2Session::ReadTimeoutTick 0x12ea15800 delta since last read 18s
> 2017-09-16 18:40:14.825940 UTC - [Socket Thread]: I/nsHttp
> Http2Session::ReadTimeoutTick 0x1354d4000 delta since last read 18s
> 2017-09-16 18:40:14.825958 UTC - [Socket Thread]: I/nsHttp
> Http2Session::ReadTimeoutTick 0x12ab39000 delta since last read 36s
> 2017-09-16 18:40:14.825974 UTC - [Socket Thread]: I/nsHttp
> Http2Session::ReadTimeoutTick 0x12ab47000 delta since last read 36s
> 2017-09-16 18:40:14.825989 UTC - [Socket Thread]: I/nsHttp
> Http2Session::ReadTimeoutTick 0x135bed000 delta since last read 45s
> 
> 
> This sounds like bug 1395494, but TFO is turned off since 9th so it is off
> on the 9/16 Nightly build.

It doesn't seem to be the case of bug 1395494.  In that case you should see "Ping Timer Exhaustion" from the log.
(Assignee)

Comment 14

a year ago
(In reply to Shian-Yow Wu [:swu] from comment #13)
> (In reply to Dragana Damjanovic [:dragana] from comment #12)
> > I missed to see the first time that http2 connections are not reading any
> > data for a log time:
> > 
> > 2017-09-16 18:40:14.825905 UTC - [Socket Thread]: I/nsHttp
> > Http2Session::ReadTimeoutTick 0x11ea13000 delta since last read 18s
> > 2017-09-16 18:40:14.825923 UTC - [Socket Thread]: I/nsHttp
> > Http2Session::ReadTimeoutTick 0x12ea15800 delta since last read 18s
> > 2017-09-16 18:40:14.825940 UTC - [Socket Thread]: I/nsHttp
> > Http2Session::ReadTimeoutTick 0x1354d4000 delta since last read 18s
> > 2017-09-16 18:40:14.825958 UTC - [Socket Thread]: I/nsHttp
> > Http2Session::ReadTimeoutTick 0x12ab39000 delta since last read 36s
> > 2017-09-16 18:40:14.825974 UTC - [Socket Thread]: I/nsHttp
> > Http2Session::ReadTimeoutTick 0x12ab47000 delta since last read 36s
> > 2017-09-16 18:40:14.825989 UTC - [Socket Thread]: I/nsHttp
> > Http2Session::ReadTimeoutTick 0x135bed000 delta since last read 45s
> > 
> > 
> > This sounds like bug 1395494, but TFO is turned off since 9th so it is off
> > on the 9/16 Nightly build.
> 
> It doesn't seem to be the case of bug 1395494.  In that case you should see
> "Ping Timer Exhaustion" from the log.

In this log I firefox or the tab had been closed before the timeout. That cause canceling connection before timeout (I think it must be closing of ff not the tab)
(Reporter)

Comment 15

a year ago
I haven't been able to reproduce this for a while. I'll update once I see this again.
Flags: needinfo?(kchen)
(Reporter)

Comment 16

a year ago
It happened today. From about:networking HTTP tab

Hostname 	        Port 	HTTP/2 	SSL 	Active 	Idle
cdnjs.cloudflare.com	443	true	true	6	0

When this happens all sites using cdnjs will stuck on reload.

It looks like so far this has only happened on cdnjs.cloudflare.com
(Reporter)

Comment 17

a year ago
I found a way to reproduce.

Just open a url that uses cdnjs.cloudflare.com, for example https://kanru.github.io/hasal-dashboard/?app=xxx and repeatedly hit F5 to reload. The connection count in about:networking will increase slowly and finally reach 6.
(Reporter)

Comment 18

a year ago
The log files are huge so I put them in a dropbox folder.

https://www.dropbox.com/sh/i70n0txr3hmsgdo/AAB0sCHdG2Zngy4Njg3GkteYa?dl=0
(Assignee)

Comment 19

a year ago
Thank you! I will take a look.
(Assignee)

Updated

a year ago
Duplicate of this bug: 1401701
(Assignee)

Comment 21

a year ago
The problem is TLS early-data and transactions that are not safe to send early data, e.g. POST method. During early-data we find a transaction in mReadyForWrite that cannot do early data. We remove it from the queue but we never put it back to the queue after early-data finishes.

It is not surprise that this is happening with cloudflare, since they were the first to support early-data on the server side.
(Assignee)

Comment 22

a year ago
Created attachment 8910747 [details] [diff] [review]
bug_1397686.patch
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Attachment #8910747 - Flags: review?(hurley)
status-firefox57: --- → affected
tracking-firefox57: ? → +
Comment on attachment 8910747 [details] [diff] [review]
bug_1397686.patch

Review of attachment 8910747 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/Http2Session.cpp
@@ +3388,5 @@
>        mOutputQueueSent = 0;
>      }
>    } else {
>      // 0RTT succeeded
> +    for (size_t i = 0; i < mCannotDo0RTTStreams.Length(); ++i) {

This looks like it's fine for when 0rtt succeeds, but what about the case just above here, where 0rtt fails, but we still are doing h2? I think we need to put these transactions back in the queue then too, right?
Attachment #8910747 - Flags: review?(hurley)
(Assignee)

Comment 24

a year ago
(In reply to Nicholas Hurley [:nwgh][:hurley] (also hurley@todesschaf.org) from comment #23)
> Comment on attachment 8910747 [details] [diff] [review]
> bug_1397686.patch
> 
> Review of attachment 8910747 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/protocol/http/Http2Session.cpp
> @@ +3388,5 @@
> >        mOutputQueueSent = 0;
> >      }
> >    } else {
> >      // 0RTT succeeded
> > +    for (size_t i = 0; i < mCannotDo0RTTStreams.Length(); ++i) {
> 
> This looks like it's fine for when 0rtt succeeds, but what about the case
> just above here, where 0rtt fails, but we still are doing h2? I think we
> need to put these transactions back in the queue then too, right?

yes, :)
(Assignee)

Comment 25

a year ago
Created attachment 8910886 [details] [diff] [review]
bug_1397686.patch
Attachment #8910747 - Attachment is obsolete: true
Attachment #8910886 - Flags: review?(hurley)
Attachment #8910886 - Flags: review?(hurley) → review+

Comment 26

a year ago
Pushed by dd.mozilla@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5cf7391a9825
The streams that cannot do 0RTT data need to be put in list and added to mReadyForWrite when 0RTT is done. r=hurley

Comment 27

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/5cf7391a9825
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox58: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Do we want this in ff57?
Oh, forget I asked.  This won't ever reach release.
(Assignee)

Comment 30

a year ago
I will uplift it to 57, it is a rather small isolated change.

I assume that TLS 1.3 will be disabled on 57?
(Assignee)

Comment 32

a year ago
Anyway we need this for beta, it is annoying behavior that this bug produces.
(Assignee)

Comment 33

a year ago
Comment on attachment 8910886 [details] [diff] [review]
bug_1397686.patch

Approval Request Comment
[Feature/Bug causing the regression]: TLS early-data with Http2 (bug 1322373)
[User impact if declined]: If TLS early-data is use it can happen that a request never finish loading. Depens which resource is affected it can produce no content on the page and the load will never finish. 
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: yes
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: It a small isolated piece of code. This is a TLS 1.3 only feature and tls 1.3 will be disabled on the 57 release(bug 1400291), but it is still enabled on beta. 
[String changes made/needed]: none
Attachment #8910886 - Flags: approval-mozilla-beta?
Comment on attachment 8910886 [details] [diff] [review]
bug_1397686.patch

Sure, let's take it in 57b3
Attachment #8910886 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 35

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/d285547c68af
status-firefox57: affected → fixed
You need to log in before you can comment on or make changes to this bug.