Closed Bug 1430659 (ipv4-faster-fallback) Opened 6 years ago Closed 6 years ago

Network requests in "blocked" state for long time (1+ minute) when ipv6 hosts are no longer available

Categories

(Core :: Networking: DNS, defect, P2)

Other Branch
defect

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: kats, Assigned: mayhemer)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(3 files, 2 obsolete files)

Attached image blocked-for-minutes.png
Backstory: I've noticed a few times now that if I've left Firefox open for a long time (many days to a few weeks) I sometimes get really bad network behaviour, where page loads get really slow. Once it starts happening it seems easy enough to repro, and I usually have to restart Firefox to fix it. It happened again today and I opened the network console while loading the page. This was from reloading a google sheets page - it took many minutes to finish, and per the network console, a couple of requests were in the "blocked" state for a long time (see attached screenshot). Is this expected behaviour, or is there something I can do to debug this further when this happens?
I've seen this too. I have no idea how to reproduce it, but when it happens, sites that aren't fully functional before the load event, and sites that have UI elements that block on AJAX, become pretty much unusable until I restart.
@kats and @kmag, it sounds like the network blocking issue happens randomly but it's reproducible once it happened?
Can you follow https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging#Using_aboutnetworking and provide corresponding MOZ_LOG when you hit this issue again?

It will be also helpful if you can provide the URL that seems reproducing this issue, therefore we can try it on our own.

BTW, what platform you use when encountering this issue?
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(bugmail)
Not sure if it is the same as bug 1405231 based on current information.
See Also: → 1405231
Thanks, I'll update this bug with the requested info next time I see it. In this instance I saw it on https://docs.google.com/spreadsheets/d/1DcqaMvt3I38HIl3-RDS_6aiX67nQWItNcwmVt5utbDk/edit#gid=659543591 on OS X, Firefox 57.0.4.
Flags: needinfo?(bugmail)
Whiteboard: ni?2018-01-16
This happened again, same URL as comment 4. Firefox 58.0.1 release this time. I got the log as requested - i started the log, reloaded the google sheet, let it go for a few minutes, then stopped the log. At this point the page still hadn't finished loading. gzipped log is attached.
^
Flags: needinfo?(polo.hellfire)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)
> Created attachment 8949854 [details]
> network-slow.txt-main.796.gz
> 
> This happened again, same URL as comment 4. Firefox 58.0.1 release this
> time. I got the log as requested - i started the log, reloaded the google
> sheet, let it go for a few minutes, then stopped the log. At this point the
> page still hadn't finished loading. gzipped log is attached.

There is only one channel for that URL in this log and it's done in 740ms.  Child logs are missing, so hard to say if the problem is on the child side.  But I bed this is more a devtools bug.
Ah!  There is a number of other channels in a connected state (unfinished).  Will take a look!
(In reply to Honza Bambas (:mayhemer) from comment #7)
> There is only one channel for that URL in this log and it's done in 740ms. 
> Child logs are missing, so hard to say if the problem is on the child side. 
> But I bed this is more a devtools bug.

Note that there were no child logs produced, at least not in the same folder as the parent log. Also this time I didn't have the devtools open at all. The page itself does not complete loading.
Thanks for the log!

So, the nsHalfOpenSocket @10f2f3ef0 created 2018-02-09 20:15:47.518, key = .S.....[tlsflags0x00000000]ssl.gstatic.com:443, fails to connect ssl.gstatic.com (2 sockets, prim/back, each trying the same ipv6 address 2607:f8b0:400b:80d::2003 twice [1], all fail to connect with 804b000e = NS_ERROR_NET_TIMEOUT, creates two http connections:

2018-02-09 20:17:17.954073 UTC - [Socket Thread]: V/nsHttp Creating nsHttpConnection @0x12279a400
- used to dispatch one of the 5 pending transactions for the host and propagate the connection error to it.

2018-02-09 20:17:17.956008 UTC - [Socket Thread]: V/nsHttp Creating nsHttpConnection @0x120f9c000
- immediately closed (no activation).

This took nearly a minute and half.

The problem is that the remaning 4 transactions
nsHttpTransaction @1274bc400
nsHttpTransaction @12acb4800
nsHttpTransaction @128e0ec00
nsHttpTransaction @12acb4c00

are left in the pending queue.  We then try to connect the server again to dispatch those remaining transactions, what again takes time.  nsHalfOpenSocket @10f2f3ae0 created 2018-02-09 20:17:17.964 doesn't even finish the time you stopped the log.  Its sockets still use the same, blacklisted address to connect to.

I think this is a DNS problem.  We definitely blacklist the address (see [1] ***) but sockets still are getting it and try to connect it.  Pretty weird.



[1]
2018-02-09 20:17:02.796306 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketDetached [this=0x11e588800 cond=804b000e]
2018-02-09 20:17:02.796318 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=0x11e588800 state=3 cond=804b000e]
2018-02-09 20:17:02.796330 UTC - [Socket Thread]: D/nsHostResolver Adding address to blacklist for host [ssl.gstatic.com], host record [0x11eeec200].

(***) 

2018-02-09 20:17:02.796342 UTC - [Socket Thread]: D/nsHostResolver Successfully adding address [2607:f8b0:400b:80d::2003] to blacklist for host [ssl.gstatic.com].
2018-02-09 20:17:02.796360 UTC - [Socket Thread]: D/nsSocketTransport   failed to connect all ipv4-only or ipv6-only hosts, trying lookup/connect again with both ipv4/ipv6
2018-02-09 20:17:02.796373 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x11e588800 type=0 status=0 param=0x0]
2018-02-09 20:17:02.796412 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport: calling PR_Close [this=0x11e588800]
2018-02-09 20:17:02.796942 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x11e588800 type=0 status=0 param=0x0]
2018-02-09 20:17:02.796964 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::ResolveHost [this=0x11e588800 ssl.gstatic.com:443]
2018-02-09 20:17:02.796977 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::SendStatus [this=0x11e588800 status=804b0003]
  804b0003 = STATUS_RESOLVING
2018-02-09 20:17:02.796990 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::OnSocketStatus [this=0x127b6fc00 status=804b0003 progress=0]
2018-02-09 20:17:02.797299 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x11e588800 type=1 status=0 param=0x1151f2910]
2018-02-09 20:17:02.797552 UTC - [Socket Thread]: D/nsSocketTransport   active [0] { handler=0x120f9c000 condition=0 pollflags=6 }
2018-02-09 20:17:02.797840 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x11e588800 type=1 status=0 param=0x1151f2910]
2018-02-09 20:17:02.797860 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::SendStatus [this=0x11e588800 status=804b000b]
  804b000b = STATUS_RESOLVED
2018-02-09 20:17:02.797881 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::OnSocketStatus [this=0x127b6fc00 status=804b000b progress=0]
2018-02-09 20:17:02.798016 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::InitiateSocket [this=0x11e588800]
2018-02-09 20:17:02.798104 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::BuildSocket [this=0x11e588800]
2018-02-09 20:17:02.798397 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransportService::AttachSocket [handler=0x11e588800]
2018-02-09 20:17:02.798411 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransportService::AddToIdleList [handler=0x11e588800]
2018-02-09 20:17:02.798450 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::SendStatus [this=0x11e588800 status=804b0007]
  804b0007 = STATUS_CONNECTING_TO
2018-02-09 20:17:02.798463 UTC - [Socket Thread]: V/nsHttp nsHttpTransaction::OnSocketStatus [this=0x127b6fc00 status=804b0007 progress=0]
2018-02-09 20:17:02.798499 UTC - [Socket Thread]: D/nsSocketTransport   trying address: 2607:f8b0:400b:80d::2003
Component: Networking → Networking: DNS
Summary: Network requests in "blocked" state for long time (1+ minute) → Network requests in "blocked" state for long time (1+ minute) , DNS black-listing broken?)
Another snippet from the log, see the marked spots, the address is already blacklisted before:


2018-02-09 20:17:17.970027 UTC - [Socket Thread]: D/nsHostResolver Resolving host [ssl.gstatic.com].

***

2018-02-09 20:17:17.970065 UTC - [Socket Thread]: D/nsHostResolver   Using cached record for host [ssl.gstatic.com].
2018-02-09 20:17:17.970141 UTC - [Socket Thread]: D/nsHostResolver   Using positive cache entry for host [ssl.gstatic.com] but starting async renewal.
2018-02-09 20:17:17.970178 UTC - [Socket Thread]: D/nsHostResolver   DNS thread counters: total=3 any-live=0 idle=3 pending=1
2018-02-09 20:17:17.970243 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x11e588800 type=1 status=0 param=0x111980af0]
2018-02-09 20:17:17.970261 UTC - [Socket Thread]: D/nsSocketTransport STS dispatch [0x111adcdf0]
2018-02-09 20:17:17.970333 UTC - [Socket Thread]: D/nsSocketTransport OnDispatchedEvent Same Thread Skip Signal
2018-02-09 20:17:17.970352 UTC - [Socket Thread]: D/nsSocketTransport   advancing to STATE_RESOLVING
2018-02-09 20:17:17.970369 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x11e588800 type=0 status=0 param=0x0]
2018-02-09 20:17:17.970382 UTC - [Socket Thread]: D/nsSocketTransport   MSG_ENSURE_CONNECT
2018-02-09 20:17:17.970394 UTC - [Socket Thread]: D/nsSocketTransport   ignoring redundant event
2018-02-09 20:17:17.970409 UTC - [Socket Thread]: D/nsSocketTransport STS poll iter
2018-02-09 20:17:18.222270 UTC - [Socket Thread]: D/nsHostResolver Resolving host [ssl.gstatic.com].
2018-02-09 20:17:18.222281 UTC - [Socket Thread]: D/nsHostResolver   Using cached record for host [ssl.gstatic.com].
2018-02-09 20:17:18.222291 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=0x12279a400 type=1 status=0 param=0x115205040]
2018-02-09 20:17:18.222299 UTC - [Socket Thread]: D/nsSocketTransport STS dispatch [0x11521a940]
2018-02-09 20:17:18.222306 UTC - [Socket Thread]: D/nsSocketTransport OnDispatchedEvent Same Thread Skip Signal
2018-02-09 20:17:18.222313 UTC - [Socket Thread]: D/nsSocketTransport   advancing to STATE_RESOLVING
2018-02-09 20:17:18.222322 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x12279a400 type=0 status=0 param=0x0]
2018-02-09 20:17:18.222329 UTC - [Socket Thread]: D/nsSocketTransport   MSG_ENSURE_CONNECT
2018-02-09 20:17:18.222335 UTC - [Socket Thread]: D/nsSocketTransport   ignoring redundant event
2018-02-09 20:17:18.222342 UTC - [Socket Thread]: D/nsSocketTransport STS poll iter
2018-02-09 20:17:18.222349 UTC - [Socket Thread]: D/nsSocketTransport   active [4] { handler=0x11e588800 condition=0 pollflags=6 }
2018-02-09 20:17:18.222358 UTC - [Socket Thread]: D/nsSocketTransport   active [3] { handler=0x11f5a8c00 condition=0 pollflags=6 }
2018-02-09 20:17:18.222365 UTC - [Socket Thread]: D/nsSocketTransport   active [2] { handler=0x12ae71400 condition=0 pollflags=5 }
2018-02-09 20:17:18.222372 UTC - [Socket Thread]: D/nsSocketTransport   active [1] { handler=0x11f512400 condition=0 pollflags=5 }
2018-02-09 20:17:18.222379 UTC - [Socket Thread]: D/nsSocketTransport   active [0] { handler=0x111c07400 condition=0 pollflags=6 }
2018-02-09 20:17:18.222385 UTC - [Socket Thread]: D/nsSocketTransport   calling PR_Poll [active=5 idle=0]
2018-02-09 20:17:18.222392 UTC - [Socket Thread]: D/nsSocketTransport     timeout = 0 milliseconds
2018-02-09 20:17:18.222463 UTC - [Socket Thread]: D/nsSocketTransport     ...returned after 0 milliseconds
2018-02-09 20:17:18.222472 UTC - [Socket Thread]: D/nsSocketTransport STS dispatch [0x11193e8c0]
2018-02-09 20:17:18.222479 UTC - [Socket Thread]: D/nsSocketTransport OnDispatchedEvent Same Thread Skip Signal
2018-02-09 20:17:18.222486 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::OnSocketEvent [this=0x12279a400 type=1 status=0 param=0x115205040]
2018-02-09 20:17:18.222493 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::SendStatus [this=0x12279a400 status=804b000b]
  804b000b = STATUS_RESOLVED
2018-02-09 20:17:18.222500 UTC - [Socket Thread]: D/nsSocketTransport   MSG_DNS_LOOKUP_COMPLETE

***

2018-02-09 20:17:18.222508 UTC - [Socket Thread]: D/nsHostResolver Checking blacklist for host [ssl.gstatic.com], host record [0x11eeec200].
2018-02-09 20:17:18.222515 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::InitiateSocket [this=0x12279a400]
2018-02-09 20:17:18.222522 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::BuildSocket [this=0x12279a400]
2018-02-09 20:17:18.222529 UTC - [Socket Thread]: D/nsSocketTransport   pushing io layer [0:ssl]
2018-02-09 20:17:18.222595 UTC - [Socket Thread]: D/nsSocketTransport   [secinfo=0x10f37e5d0 callbacks=0x10f37b190]
2018-02-09 20:17:18.222620 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransportService::AttachSocket [handler=0x12279a400]
2018-02-09 20:17:18.222630 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransportService::AddToIdleList [handler=0x12279a400]
2018-02-09 20:17:18.222642 UTC - [Socket Thread]: D/nsSocketTransport   active=5 idle=1
2018-02-09 20:17:18.222683 UTC - [Socket Thread]: D/nsSocketTransport   advancing to STATE_CONNECTING
2018-02-09 20:17:18.222699 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::SendStatus [this=0x12279a400 status=804b0007]
  804b0007 = STATUS_CONNECTING_TO

***

2018-02-09 20:17:18.222712 UTC - [Socket Thread]: D/nsSocketTransport   trying address: 2607:f8b0:400b:80d::2003
Whiteboard: ni?2018-01-16
Daniel, since you are slowly becoming our DNS expert, could you take a look please?
Flags: needinfo?(daniel)
The address blacklist logic resets the blacklist again if it all addresses for that single record are blacklisted (otherwise it would end up with no addresses to try) and that seems to happen a lot in this log, including for host "ssl.gstatic.com". Search for "Resetting blacklist".

They all seem to get blacklisted when they've all failed connections, so I suspect the blacklisting a secondary outcome of connection problems and that the connection problems is the primary problem. Unfortunately, I can't explain why it would have connect problems.
Flags: needinfo?(daniel)
Thanks Daniel.

Kartikaya, just for completeness, what is your setting for network.http.fast-fallback-to-IPv4 preference?  I expect true (the default).


We probably need to change these things here:
- lower the connection timeout of the backup socket to only say 5 seconds (quite arbitrarily chosen const) when family already has a preference from earlier time, have a pref for the timeout
- make sure the socket transport tries a different family on error recovery, mainly on the timeout error: when disable ipv4 was spec'd, switch to disable ipv6 and retry, (and vise versa)
- drop the family preference flag on connentry when we fail connecting that family (in 5 seconds or in the regular timeout?) or rather reset the preference when we recover using a different family (needs a callback from st to hl or to expose the specific family failure somehow)


to support that, I can see in the log:

- we do (also) get an ipv4 address for the ssl.gstatic.com host from getaddrinfo
- we never try to connect that ipv4 address through out the log, though
- when nsSocketTransport @1253be400 (primary) recovers from its first connection error:

2018-02-09 20:14:01.728967 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=0x1253be400 state=3 cond=804b000e]

...it resolves the host again, because:

2018-02-09 20:14:01.729037 UTC - [Socket Thread]: D/nsSocketTransport   failed to connect all ipv4-only or ipv6-only hosts, trying lookup/connect again with both ipv4/ipv6
2018-02-09 20:14:01.729335 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::ResolveHost [this=0x1253be400 ssl.gstatic.com:443]

and again tries to connect the same ipv6 address because it gets it first as a result of that resolve:

2018-02-09 20:14:01.730397 UTC - [Socket Thread]: D/nsSocketTransport   trying address: 2607:f8b0:400b:80d::2003

despite the fact we just blacklisted that ip address (according Daniel's comment, we restart the blacklisting when all ip addresses we have are blacklisted - here it was the only one - but we still have the ipv4 address, so I'm not fully clear on this bit)

the logs are quite poor in the area what family we resolve for which socket transport, so the following is based on studying the code.  I assume that in the past we were able to connect that ipv6 address and remembered that in mEnt->mPreferIPv6.  That makes the primary transport use only ipv6 addresses.  But, the code at [1] makes even the backup socket use ipv6 only, still, it's likely correct - bug 725587.  I can see in the log we go via the same steps for the backup socket as well:

2018-02-09 20:14:02.720711 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=0x126e8e000 state=3 cond=804b000e]
2018-02-09 20:14:02.720739 UTC - [Socket Thread]: D/nsSocketTransport   failed to connect all ipv4-only or ipv6-only hosts, trying lookup/connect again with both ipv4/ipv6
2018-02-09 20:14:02.720874 UTC - [Socket Thread]: D/nsSocketTransport nsSocketTransport::ResolveHost [this=0x126e8e000 ssl.gstatic.com:443]

It then again tries an IPv6 address as first again (we should try an ipv4 address here..) and goes via the same dance again.  We only do a try for the ipv6 address (and not the ipv4 address that's waiting there to be picked) since something called Close on the transport and thus set the mDoNotRetryToConnect flag.

[1] https://searchfox.org/mozilla-central/rev/33cc9e0331da8d9ff3750f1e68d72d61201176cb/netwerk/protocol/http/nsHttpConnectionMgr.cpp#4079-4085
Flags: needinfo?(bugmail)
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Priority: -- → P2
Whiteboard: [necko-triaged]
(In reply to Honza Bambas (:mayhemer) from comment #14)
> Thanks Daniel.
> 
> Kartikaya, just for completeness, what is your setting for
> network.http.fast-fallback-to-IPv4 preference?  I expect true (the default).
> 

Yes, it's the default (true).
Flags: needinfo?(bugmail)
Summary: Network requests in "blocked" state for long time (1+ minute) , DNS black-listing broken?) → Network requests in "blocked" state for long time (1+ minute) when ipv6 hosts are no longer available
Depends on: 1444160
Attached patch v1 (obsolete) — Splinter Review
the patch does what's described at the top of comment 14, also see the detailed commit message

I've done a successful local manual testing using the hosts file and iis.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4e73d7b5a5dcf7f6f5bb1148d0d290f0bb455bdd
Attachment #8957636 - Flags: review?(mcmanus)
Comment on attachment 8957636 [details] [diff] [review]
v1

Review of attachment 8957636 [details] [diff] [review]:
-----------------------------------------------------------------

I'm ok with this in prinicple but I haven't done a real code review.
Attachment #8957636 - Flags: review?(mcmanus) → review?(valentin.gosu)
I think we should make the "try other family" a specific connection flag.  When connecting a host the first time, no preference for an ip fam, we do primary ipv6+ipv4 and backup ipv4+ipv6 attempts, as both sockets have a DISABLE_IPVx flag.  This is not necessary IMO.

It should be like this:
- no preference (so far first connection attempt, or we never reached the server):
  - primary: disable ipv4
  - backup: disable ipv6

- preference is already known:
  - primary: disable ipv-non-preferred
  - backup: timeout=5s, disable ipv-non-preferred + try-other-ip-fam
Blocks: 1444923
(In reply to Honza Bambas (:mayhemer) from comment #19)
> I think we should make the "try other family" a specific connection flag. 

Filed bug 1444923 for that.
Comment on attachment 8957636 [details] [diff] [review]
v1

Review of attachment 8957636 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/init/all.js
@@ +1801,5 @@
>  // Close a connection if tls handshake does not finish in given number of
>  // seconds.
>  pref("network.http.tls-handshake-timeout", 30);
>  
> +// The number of seconds we time out a connection of a retry (fallback) socket

nit: seconds <after which> we time out.

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +206,5 @@
>      , mNetworkChangedTimeout(5000)
>      , mMaxRequestAttempts(6)
>      , mMaxRequestDelay(10)
>      , mIdleSynTimeout(250)
> +    , mFallbackSynTimeout(5000)

Since we clamp the pref to (0, 600) I guess we should default this to either 0 or 600.
Attachment #8957636 - Flags: review?(valentin.gosu) → review+
(In reply to Valentin Gosu [:valentin] from comment #21)
> > +    , mFallbackSynTimeout(5000)
> 
> Since we clamp the pref to (0, 600) I guess we should default this to either
> 0 or 600.

Ups!  This should actually be 5, it's in seconds (!), the default I added to all.js.

thanks!
Flags: needinfo?(kmaglione+bmo)
One more problem found, uncovered by a completely unrelated test on android:
netwerk/test/unit/test_fallback_request-error_canceled.js

the test shuts the server down and then tries if some appcache stuff is loading as expected.  the state of the conn-entry is to prefer ipv4.  so, after the server has been shutdown, any connection attempt to localhost is made with the disable-ipv6 flag.  we try 127.0.0.1 and then try to resolve ipv6.  that fails with "unknown host" as there is no ipv6 on android test infra!

this actually reveals a long standing bug, we never drop the mAttached flag and when handling the unknown-host error in nsSocketTransport::OnSocketEvent, we believe sts will signal with detach and we handle the error there.  but at that moment the socket has already been detached from sts, and thus we bypass call to OnSocketDetached from inside OnSocketEvent.  the socket never notifies upper layers about the connection failures, it's timed out by the http conn manager 90s timeout logic and never resets the preference.

adding mAttached = false to OnSocketDetached helps.

going another try round...
then, I start to believe we should have the TRY_OTHER_IP_FAMILY flag within this patch to prevent delays and unnecessary conn attempts.
Depends on: 1444928
No longer blocks: 1444923
After more research, I will rewrite the patch here.  the retry-other-family logic will move to half-open.
Flags: needinfo?(polo.hellfire)
Attachment #8957636 - Attachment is patch: false
Attachment #8957636 - Flags: review+
Attachment #8957636 - Attachment is obsolete: true
Attachment #8957636 - Attachment is patch: true
Attached patch v2 (obsolete) — Splinter Review
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd88732414c1bc84bd0be6d45765297062a95af8

- eventually I rather implemented RETRY_WITH_DIFFERENT_IP_FAMILY connection flag, as moving the logic to half-open turned out to be too complicated (the motivation was a cleanup + possibility to ignore errors from backup sockets under certain conditions, but that can be a followup, not needed until we decide what to do with bug 1444928)

- I deliberately keep reference to mSynTimer (timer for the backup socket) to not fire it more than once (I've seen logs where HOs were scheduling another backup sockets prolonging timeout detection even more) ; no leak detected with that change

otherwise the patch is identical to the v1 patch.
Attachment #8958928 - Flags: review?(valentin.gosu)
Valentin, the interdiff by bugzilla [1] seems sane, so you can review that

[1] https://bugzilla.mozilla.org/attachment.cgi?oldid=8957636&action=interdiff&newid=8958928&headers=1
Blocks: 1444928
No longer depends on: 1444928
Comment on attachment 8958928 [details] [diff] [review]
v2

Review of attachment 8958928 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsISocketTransport.idl
@@ +253,5 @@
>       */
>      const unsigned long REFRESH_CACHE = (1 << 9);
>  
>      /**
> +     * If connecting the preferred ip family has failed, retry with

Let's rephrase this: "If this flag is set then it means that..."

::: netwerk/base/nsSocketTransport2.h
@@ +300,5 @@
>      bool mProxyTransparent;
>      bool mProxyTransparentResolvesHost;
>      bool mHttpsProxy;
>      uint32_t     mConnectionFlags;
> +    // When we fail to connect a prefered family, we tell the consumer to reset

nit: to connect _using a preferred_ family

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +4084,5 @@
> +        // From the same reason, let the backup socket fail faster to try the other family.
> +        uint16_t fallbackTimeout = isBackup ? gHttpHandler->GetFallbackSynTimeout() : 0;
> +        if (fallbackTimeout) {
> +            socketTransport->SetTimeout(nsISocketTransport::TIMEOUT_CONNECT,
> +                                        fallbackTimeout);

nit: You don't need the temporary fallbackTimeout. Just do if (isBackup) ... unless you think this is clearer.

@@ +4242,5 @@
>      if (!mSynTimer)
>          return;
>  
>      LOG(("nsHalfOpenSocket::CancelBackupTimer()"));
>      mSynTimer->Cancel();

Add a comment about why we deliberately don't null mSynTimer, otherwise someone might add it back

@@ +4329,5 @@
>      MOZ_ASSERT(mEnt);
>  
>      DebugOnly<nsresult> rv = SetupBackupStreams();
>      MOZ_ASSERT(NS_SUCCEEDED(rv));
>  

Same - needs comment.

::: netwerk/protocol/http/nsHttpConnectionMgr.h
@@ +330,5 @@
>          // Set the IP family preference flags according the connected family
>          void RecordIPFamilyPreference(uint16_t family);
>          // Resets all flags to their default values
>          void ResetIPFamilyPreference();
> +        // True iff there is currently an established preference for the host

nit: typo iff
Also, the preference is for IP family, not host.
Attachment #8958928 - Flags: review?(valentin.gosu) → review+
thanks for r!

repushed to try https://treeherder.mozilla.org/#/jobs?repo=try&revision=81606b96bc337f9b80d05e3332703e472064c3de

there were some tests failing, so again on top of bug 1446117
(In reply to Valentin Gosu [:valentin] from comment #29)
> > +        if (fallbackTimeout) {
> > +            socketTransport->SetTimeout(nsISocketTransport::TIMEOUT_CONNECT,
> > +                                        fallbackTimeout);
> 
> nit: You don't need the temporary fallbackTimeout. Just do if (isBackup) ...
> unless you think this is clearer.

0 means to use default (so, don't change the TO).

> >          void ResetIPFamilyPreference();
> > +        // True iff there is currently an established preference for the host
> 
> nit: typo iff

iff == if and only if abbrev (equivalence, not implication)
Attached patch v2.1Splinter Review
Attachment #8958928 - Attachment is obsolete: true
Attachment #8959575 - Flags: review+
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2f4395f5ca24
Network requests in "blocked" state for long time (1+ minute) when ipv6 hosts are no longer available. r=valentin
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/2f4395f5ca24
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Depends on: 1447814
Depends on: 1452028
Alias: ipv4-faster-fallback
You need to log in before you can comment on or make changes to this bug.