Closed Bug 749209 Opened 12 years ago Closed 12 years ago

Happy Eyeballs implementation still not quite right

Categories

(Core :: Networking: HTTP, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla15
Tracking Status
firefox13 + affected
firefox14 + fixed
firefox15 + fixed

People

(Reporter: lorenzo, Assigned: mayhemer)

References

(Depends on 1 open bug)

Details

(Whiteboard: [http-conn])

Attachments

(4 files, 3 obsolete files)

Looks like the fast fallback from IPv6 to IPv4 implemented in bug 621558 and then reimplemented in bug 684893 is still not quite right.

I configured broken IPv6 on my machine (global IPv6 address and default route that blackholed traffic, such that all IPv6 TCP connections would time out) and tried to load dual-stack websites. I was to load one or two standalone URLs fine, but when I tried to load a whole dual-stack website (www.ripe.net or www.arin.net), the browser would start to load but hang for ~20s before loading the page.

I think it might be blocking on full connection queues, because in the logs I see lots of these:

-228590736[f724aa00]: nsHttpConnectionMgr::GetConnection [ci = ...www.ripe.net:80]at active connection limit - will queue

Is it possible that fast-fallback doesn't trigger if the connection queue is already full with other connections?

This is using Firefox 11.0 (20120310194926) on Linux. network.http.fast-fallback-to-IPv4 is true (default). tcpdump attached.
Attached file HTTP log (obsolete) —
Just to make it clear: our implementation is not a full implementation of the spec.  It's just a quick decision algorithm and we know about at least bug 725587.

I'll take a look at this.
Assignee: nobody → honzab.moz
Ack. The bug is not "fails to fully implement the spec" but rather "fails to fall back to IPv4 in some cases", which is basically failing to do what it was designed to do.
ripe.net swallows all ICMP packets which is fatal on IPv6. They should fix their network configuration.
> "fails to fall back to IPv4 in some cases"
We can't fallback if it looks like the connection has been established, but do not respond. Otherwise it may cause something like double POST (one for IPv6 but we fails to receive the response somehow, and one for IPv4).
From the tcpdump it's clear that no IPv6 connection is ever established. My machine sends SYN packets, but never gets SYN+ACK back. The problem is that even though the IPv6 TCP connections aren't establishing, the browser doesn't fall back to IPv4.

I don't see how ICMP is involved at all, since it's not in the tcpdump. If I fix my connection not to drop packets, IPv6 connections to www.ripe.net work fine.
Lorenzo, to make sure you don't have in any way broken preference setting or any extension is involved, can you please try to reproduce with a clean profile?
Attached patch v0.9 (obsolete) — Splinter Review
I've managed to setup an env to reproduce "a" problem like in the description.

It was a regression from speculative connect caused by not setting up the backup timer since nsNullTransaction::IsDone was returning true all the time.

But, I don't see the same issue in the log, however the description conforms with symptoms of what I've found.

Lorenzo, with what version are you able to reproduce the problem?  Is it consistent?  Can you add "timestamp" module (no ":5" after it) to the list of NSPR_LOG_MODULES and produce a new log?
Attachment #622436 - Flags: feedback?(mcmanus)
Sorry for the delay.

Yes, I can reproduce the problem at will. I think it only happens on sites that have at least a few subresources (CSS, images, and so on). For example, it happens on www.ripe.net, but not on www.ietf.org, which is a simple page.

I just reproduced using a clean profile on Firefox 12.0 for Linux downloaded from getfirefox.com, HG revision a294a5b4f12d

Attaching timestamped log and tcpdump again. From the log it looks like the whole process stalls for 20s when the connection limit is hit.

I can't do my own build, but I'm happy to test a patched linux binary if you want to send me one.
Attached file Timestamped HTTP log
Attachment #618657 - Attachment is obsolete: true
Attachment #618656 - Attachment is obsolete: true
Lorenzo, thanks a lot.  I can see it now.

Please, first check the bug is present in the latest beta build:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/13.0b3-candidates/build1/

Then please try this build that contains a potential fix, there should be no stalling with it:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/honzab.moz@firemni.cz-79e13e14f433/

and let me know.  (You don't need to setup and save a log this time, the issues should simply go away with the try build.)

Thank you.
Status: NEW → ASSIGNED
Comment on attachment 622436 [details] [diff] [review]
v0.9

I know this isn't the core patch for this bug, but I think we ought to get this into the tree while the problem is only on nightly. Thanks!
Attachment #622436 - Flags: review+
Attachment #622436 - Flags: feedback?(mcmanus)
Attachment #622436 - Flags: feedback+
from https://hg.mozilla.org/try/rev/79e13e14f433

    1.12 -    if (!gHttpHandler->ConnMgr()->
    1.13 -        AtActiveConnectionLimit(mEnt, mTransaction->Caps())) {
    1.14 -        SetupBackupStreams();
    1.15 -    }
    1.16 +    SetupBackupStreams();

Interesting! I can see how that's a problem if we're using the backup stream to do HE.

that's a big step forward (at the cost of accidentally going beyond 6 once in a while in a slow pure ipv4 env, but that doesn't bother me).. but there will be HE cases where there are 3 active v4 and 3 half-opens that are taking forever to timeout on a v6 address and we won't admit any new connections requests (not backup ones)... we almost need to cancel the original connect if a HE attempt has taken over.
Attached file Log from patched build
Unfortunately, the patched build doesn't work either. Attaching HTTP log from the linux_64 version.

I think the problem is that once the stuck connections saturate the connection pool, that's it - so if the website wants to load enough resources, you always end up hitting the problem and hanging. I think the only way to build a robust implementation is to design it with the assumption that any connect() may never return, ever.

This took a few attempts to get right in chromium as well. IIRC, in the initial implementations, the problems were similar (backup connections not actually triggering because the pool was full, and so on). I believe that in the end it was implemented at the lowest-possible layer, basically just above the TCP socket, and that worked better.

Happy to provide feedback as I have been so far, but it might be better if you were able to reproduce the hang. Do you have a linux box? If so, the following might suffice to give you a broken IPv6 connection that allows you to reproduce the problem:

sudo ip tunnel add broken0 mode sit local any remote 8.8.8.8
sudo ip link set broken0 up
sudo ip -6 addr add dev broken0 2001:db8::1/127
sudo ip -6 route add default via 2001:db8:: metric 1

to unbreak:

sudo ip tunnel del broken0

Does that help you see the problem?
Lorenzo, thanks again.

I think nsHalfOpenSocket should be removed from the array (or somehow marked) when it has established at least one connection.  While it hangs for those 21 seconds before broken IPv6 connection times out, it is measured as an existing connection for the overall limit.  Probably similar issue as chromium had.

Anyway, I'm not willing to write any quick fix here, we have some other issues around the connection limit code right now.  However, Patrick, feel free to write any quick fix you believe it fits.

The proper fix will be done in bug 715905, though.
Assignee: honzab.moz → nobody
Depends on: 715905
Status: ASSIGNED → NEW
honza I think we ought to at least land the patch "0.9".. yes?
(In reply to Patrick McManus [:mcmanus] from comment #16)
> honza I think we ought to at least land the patch "0.9".. yes?

Definitely.  And we could also add a flag to nsHalfOpenSocket that would be used in AtActiveConnectionLimit to exclude it from the total count when a connection has already been created by that half-open.  I looked at that a bit deeper and it might be safe to do and fix this bug also for Lorenzo.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
(In reply to Lorenzo Colitti from comment #14)
> sudo ip tunnel add broken0 mode sit local any remote 8.8.8.8
> sudo ip link set broken0 up
> sudo ip -6 addr add dev broken0 2001:db8::1/127
> sudo ip -6 route add default via 2001:db8:: metric 1

This makes the ipv6 connection attempt timeout in some 600ms... too soon to see whether the patch fixes or not.  Maybe I'm doing something wrong, though.  I'm not a linux expert.
Anyway, here should soon (in 1-2 hours) be a new try build: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/honzab.moz@firemni.cz-6a5cac0f6a72

Please check it if you get to it.  Thanks.
Attached patch v1Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=6a5cac0f6a72

Patrick, please take a look.  So far green on try.
Attachment #622436 - Attachment is obsolete: true
Attachment #624794 - Flags: review?(mcmanus)
So that's pretty much what I was thinking of.. cool.

I think there are a couple other places that want to use the new pendinghalfopen concept too if we're going to assume a low probability of the 2nd half of a half-open structure succeeding: RestrictConnections() and that bit in MakeNewConnection() that is looking for a half-open speculative conection.

also I think it makes sense to land the nullhttptransaction::isdone bits separately as a bugfix that can stick no matter what happens to the rest.
Comment on attachment 624794 [details] [diff] [review]
v1

Assuming we agree it needs to be used those 2 other places as per comment 21
Attachment #624794 - Flags: review?(mcmanus) → review+
trybuild 6a5cac0f6a72 fixes the problem for me. Thanks!
(In reply to Patrick McManus [:mcmanus] from comment #21)

We have 2 weeks until the next train.  Let's land this patch that for sure fixes this bug.  I want to keep this patch minimal.

The rest should be landed in a different bug with also some time to bake this to identify a culprit of a potential regressions.
(In reply to Honza Bambas (:mayhemer) from comment #24)
> (In reply to Patrick McManus [:mcmanus] from comment #21)
> 
> We have 2 weeks until the next train.  Let's land this patch that for sure
> fixes this bug.  I want to keep this patch minimal.
> 
> The rest should be landed in a different bug with also some time to bake
> this to identify a culprit of a potential regressions.

if that's what you want to do that's ok by me.. please remove the comment about 21 seconds though.. I think its misleading as that number varies a lot by OS and by the arp status of your route selection.. I don't want people thinking that we enforce some kind of particular timeout as part of necko beacuse afaict we don't.
As you say, the timeout varies by OS. Windows is 21 seconds, OS X can be up to 75 seconds, and Linux is typically 3 minutes (!). Maybe just say "... that can take tens of seconds to time out if there are connectivity problems"?
https://hg.mozilla.org/mozilla-central/rev/530d57b90461
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla15
Any chance this could get cherry-picked into a release before World IPv6 Launch? On that date users with broken IPv6 (approximately 0.02% of users worldwide) will start experiencing substantial delays when visiting websites such as Google, Facebook, Yahoo, Netflix, AOL, and so on.
Comment on attachment 624794 [details] [diff] [review]
v1

I'd be supportive of at least taking this to Aurora (Firefox 14) before the June 5 merge. Once that happens we can discuss beta.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 684893
User impact if declined: Significant connection delays with IPv6.
Testing completed (on m-c, etc.): Been on m-c for nearly a week.
Risk to taking this patch (and alternatives if risky): My guess is that risk is moderate. No alternatives.
String or UUID changes made by this patch: None
Attachment #624794 - Flags: approval-mozilla-aurora?
Josh - is Lorenzo's assertion in Comment 29 true? If so, that'd only be ~2000 Firefox 10-12 users affected on 6/6. I think we should weigh that against the moderate risk evaluation if this has the possibility of regressing unaffected users.

If there's more reason to be concerned (esp. w/r/t FF12/13), please attend today's channel meeting at 2PM PT to help clarify the situation.
(In reply to Alex Keybl [:akeybl] from comment #31)

I think we've made a pretty strong commitement to working well with IPv6 and this bug is a significant problem for our IPv6 support. If this problem affects you you'll simply have to use another web browser, it doesn't seem like the kind of thing people can put up with for even a short amount of time. The reporter here was reporting a 20 second hang in load time.

The number of users affected by this bug will go up significantly starting on June 5, though it will remain small. I think this fix is safe enough to take in Aurora (FF14) now, which means it'll sit on beta for six weeks. If there is a problem we can simply back it out.
Comment on attachment 624794 [details] [diff] [review]
v1

[Triage Comment]
Given Josh's last comment, I'm happy to see this land on Aurora before our merge of 14 to the Beta channel tomorrow.  Having some bake time, with more users, should minimize the risk of regressions going uncaught and supporting iPv6 in the wider audience is indeed important to user retention.
Attachment #624794 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I believe honza is away - so I'll land this on aurora tonight or tomorrow morning unless we hear from him to do otherwise.
(In reply to Patrick McManus [:mcmanus] from comment #35)
> https://hg.mozilla.org/releases/mozilla-aurora/rev/762d16222aba

Thanks.
Whiteboard: [http-conn]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: