Closed Bug 749209 Opened 12 years ago Closed 12 years ago

Happy Eyeballs implementation still not quite right

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla15

Tracking Flags:

Tracking

Status

firefox13

affected

firefox14

fixed

firefox15

fixed

People

(Reporter: lorenzo, Assigned: mayhemer)

References

(Depends on 1 open bug)

Details

(Whiteboard: [http-conn])

Attachments

(4 files, 3 obsolete files)

tcpdump of failed attempt to load dual-stack website 12 years ago Lorenzo Colitti 890.95 KB, application/cap		Details
HTTP log 12 years ago Lorenzo Colitti 1.14 MB, text/plain		Details
v0.9 12 years ago Honza Bambas (:mayhemer) 2.25 KB, patch	mcmanus : review+ mcmanus : feedback+	Details \| Diff \| Splinter Review
Timestamped HTTP log 12 years ago Lorenzo Colitti 2.29 MB, text/plain		Details
tcpdump matching log file 12 years ago Lorenzo Colitti 917.69 KB, application/cap		Details
Log from patched build 12 years ago Lorenzo Colitti 2.77 MB, text/plain		Details
v1 12 years ago Honza Bambas (:mayhemer) 9.33 KB, patch	mcmanus : review+ lsblakk : approval-mozilla-aurora+ mayhemer : checkin+	Details \| Diff \| Splinter Review

Lorenzo Colitti

Reporter

Description

•

12 years ago

Attached file tcpdump of failed attempt to load dual-stack website (obsolete) — Details

Looks like the fast fallback from IPv6 to IPv4 implemented in bug 621558 and then reimplemented in bug 684893 is still not quite right.

I configured broken IPv6 on my machine (global IPv6 address and default route that blackholed traffic, such that all IPv6 TCP connections would time out) and tried to load dual-stack websites. I was to load one or two standalone URLs fine, but when I tried to load a whole dual-stack website (www.ripe.net or www.arin.net), the browser would start to load but hang for ~20s before loading the page.

I think it might be blocking on full connection queues, because in the logs I see lots of these:

-228590736[f724aa00]: nsHttpConnectionMgr::GetConnection [ci = ...www.ripe.net:80]at active connection limit - will queue

Is it possible that fast-fallback doesn't trigger if the connection queue is already full with other connections?

This is using Firefox 11.0 (20120310194926) on Linux. network.http.fast-fallback-to-IPv4 is true (default). tcpdump attached.

Lorenzo Colitti

Reporter

Comment 1

•

12 years ago

Attached file HTTP log (obsolete) — Details

Honza Bambas (:mayhemer)

Assignee

Comment 2

•

12 years ago

Just to make it clear: our implementation is not a full implementation of the spec.  It's just a quick decision algorithm and we know about at least bug 725587.

I'll take a look at this.

Assignee: nobody → honzab.moz

Lorenzo Colitti

Reporter

Comment 3

•

12 years ago

Ack. The bug is not "fails to fully implement the spec" but rather "fails to fall back to IPv4 in some cases", which is basically failing to do what it was designed to do.

Masatoshi Kimura [:emk]

Comment 4

•

12 years ago

ripe.net swallows all ICMP packets which is fatal on IPv6. They should fix their network configuration.
> "fails to fall back to IPv4 in some cases"
We can't fallback if it looks like the connection has been established, but do not respond. Otherwise it may cause something like double POST (one for IPv6 but we fails to receive the response somehow, and one for IPv4).

Lorenzo Colitti

Reporter

Comment 5

•

12 years ago

From the tcpdump it's clear that no IPv6 connection is ever established. My machine sends SYN packets, but never gets SYN+ACK back. The problem is that even though the IPv6 TCP connections aren't establishing, the browser doesn't fall back to IPv4.

I don't see how ICMP is involved at all, since it's not in the tcpdump. If I fix my connection not to drop packets, IPv6 connections to www.ripe.net work fine.

Honza Bambas (:mayhemer)

Assignee

Comment 6

•

12 years ago

Lorenzo, to make sure you don't have in any way broken preference setting or any extension is involved, can you please try to reproduce with a clean profile?

Honza Bambas (:mayhemer)

Assignee

Comment 7

•

12 years ago

Attached patch v0.9 (obsolete) — Details — Splinter Review

I've managed to setup an env to reproduce "a" problem like in the description.

It was a regression from speculative connect caused by not setting up the backup timer since nsNullTransaction::IsDone was returning true all the time.

But, I don't see the same issue in the log, however the description conforms with symptoms of what I've found.

Lorenzo, with what version are you able to reproduce the problem?  Is it consistent?  Can you add "timestamp" module (no ":5" after it) to the list of NSPR_LOG_MODULES and produce a new log?

Attachment #622436 - Flags: feedback?(mcmanus)

Lorenzo Colitti

Reporter

Comment 8

•

12 years ago

Sorry for the delay.

Yes, I can reproduce the problem at will. I think it only happens on sites that have at least a few subresources (CSS, images, and so on). For example, it happens on www.ripe.net, but not on www.ietf.org, which is a simple page.

I just reproduced using a clean profile on Firefox 12.0 for Linux downloaded from getfirefox.com, HG revision a294a5b4f12d

Attaching timestamped log and tcpdump again. From the log it looks like the whole process stalls for 20s when the connection limit is hit.

I can't do my own build, but I'm happy to test a patched linux binary if you want to send me one.

Lorenzo Colitti

Reporter

Comment 9

•

12 years ago

Attached file Timestamped HTTP log — Details

Attachment #618657 - Attachment is obsolete: true

Lorenzo Colitti

Reporter

Comment 10

•

12 years ago

Attached file tcpdump matching log file — Details

Attachment #618656 - Attachment is obsolete: true

Honza Bambas (:mayhemer)

Assignee

Comment 11

•

12 years ago

Lorenzo, thanks a lot.  I can see it now.

Please, first check the bug is present in the latest beta build:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/13.0b3-candidates/build1/

Then please try this build that contains a potential fix, there should be no stalling with it:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/honzab.moz@firemni.cz-79e13e14f433/

and let me know.  (You don't need to setup and save a log this time, the issues should simply go away with the try build.)

Thank you.

Status: NEW → ASSIGNED

Patrick McManus [:mcmanus]

Comment 12

•

12 years ago

Comment on attachment 622436 [details] [diff] [review]
v0.9

I know this isn't the core patch for this bug, but I think we ought to get this into the tree while the problem is only on nightly. Thanks!

Attachment #622436 - Flags: review+

Attachment #622436 - Flags: feedback?(mcmanus)

Attachment #622436 - Flags: feedback+

Patrick McManus [:mcmanus]

Comment 13

•

12 years ago

from https://hg.mozilla.org/try/rev/79e13e14f433

    1.12 -    if (!gHttpHandler->ConnMgr()->
    1.13 -        AtActiveConnectionLimit(mEnt, mTransaction->Caps())) {
    1.14 -        SetupBackupStreams();
    1.15 -    }
    1.16 +    SetupBackupStreams();

Interesting! I can see how that's a problem if we're using the backup stream to do HE.

that's a big step forward (at the cost of accidentally going beyond 6 once in a while in a slow pure ipv4 env, but that doesn't bother me).. but there will be HE cases where there are 3 active v4 and 3 half-opens that are taking forever to timeout on a v6 address and we won't admit any new connections requests (not backup ones)... we almost need to cancel the original connect if a HE attempt has taken over.

Lorenzo Colitti

Reporter

Comment 14

•

12 years ago

Attached file Log from patched build — Details

Unfortunately, the patched build doesn't work either. Attaching HTTP log from the linux_64 version.

I think the problem is that once the stuck connections saturate the connection pool, that's it - so if the website wants to load enough resources, you always end up hitting the problem and hanging. I think the only way to build a robust implementation is to design it with the assumption that any connect() may never return, ever.

This took a few attempts to get right in chromium as well. IIRC, in the initial implementations, the problems were similar (backup connections not actually triggering because the pool was full, and so on). I believe that in the end it was implemented at the lowest-possible layer, basically just above the TCP socket, and that worked better.

Happy to provide feedback as I have been so far, but it might be better if you were able to reproduce the hang. Do you have a linux box? If so, the following might suffice to give you a broken IPv6 connection that allows you to reproduce the problem:

sudo ip tunnel add broken0 mode sit local any remote 8.8.8.8
sudo ip link set broken0 up
sudo ip -6 addr add dev broken0 2001:db8::1/127
sudo ip -6 route add default via 2001:db8:: metric 1

to unbreak:

sudo ip tunnel del broken0

Does that help you see the problem?

Honza Bambas (:mayhemer)

Assignee

Comment 15

•

12 years ago

Lorenzo, thanks again.

I think nsHalfOpenSocket should be removed from the array (or somehow marked) when it has established at least one connection.  While it hangs for those 21 seconds before broken IPv6 connection times out, it is measured as an existing connection for the overall limit.  Probably similar issue as chromium had.

Anyway, I'm not willing to write any quick fix here, we have some other issues around the connection limit code right now.  However, Patrick, feel free to write any quick fix you believe it fits.

The proper fix will be done in bug 715905, though.

Assignee: honzab.moz → nobody

Depends on: 715905

Honza Bambas (:mayhemer)

Assignee

Updated

•

12 years ago

Status: ASSIGNED → NEW

Patrick McManus [:mcmanus]

Comment 16

•

12 years ago

honza I think we ought to at least land the patch "0.9".. yes?

Honza Bambas (:mayhemer)

Assignee

Comment 17

•

12 years ago

(In reply to Patrick McManus [:mcmanus] from comment #16)
> honza I think we ought to at least land the patch "0.9".. yes?

Definitely.  And we could also add a flag to nsHalfOpenSocket that would be used in AtActiveConnectionLimit to exclude it from the total count when a connection has already been created by that half-open.  I looked at that a bit deeper and it might be safe to do and fix this bug also for Lorenzo.

Assignee: nobody → honzab.moz

Status: NEW → ASSIGNED

Honza Bambas (:mayhemer)

Assignee

Comment 18

•

12 years ago

(In reply to Lorenzo Colitti from comment #14)
> sudo ip tunnel add broken0 mode sit local any remote 8.8.8.8
> sudo ip link set broken0 up
> sudo ip -6 addr add dev broken0 2001:db8::1/127
> sudo ip -6 route add default via 2001:db8:: metric 1

This makes the ipv6 connection attempt timeout in some 600ms... too soon to see whether the patch fixes or not.  Maybe I'm doing something wrong, though.  I'm not a linux expert.

Honza Bambas (:mayhemer)

Assignee

Comment 19

•

12 years ago

Anyway, here should soon (in 1-2 hours) be a new try build: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/honzab.moz@firemni.cz-6a5cac0f6a72

Please check it if you get to it.  Thanks.

Honza Bambas (:mayhemer)

Assignee

Comment 20

•

12 years ago

Attached patch v1 — Details — Splinter Review

https://tbpl.mozilla.org/?tree=Try&rev=6a5cac0f6a72

Patrick, please take a look.  So far green on try.

Attachment #622436 - Attachment is obsolete: true

Attachment #624794 - Flags: review?(mcmanus)

Patrick McManus [:mcmanus]

Comment 21

•

12 years ago

So that's pretty much what I was thinking of.. cool.

I think there are a couple other places that want to use the new pendinghalfopen concept too if we're going to assume a low probability of the 2nd half of a half-open structure succeeding: RestrictConnections() and that bit in MakeNewConnection() that is looking for a half-open speculative conection.

also I think it makes sense to land the nullhttptransaction::isdone bits separately as a bugfix that can stick no matter what happens to the rest.

Patrick McManus [:mcmanus]

Comment 22

•

12 years ago

Comment on attachment 624794 [details] [diff] [review]
v1

Assuming we agree it needs to be used those 2 other places as per comment 21

Attachment #624794 - Flags: review?(mcmanus) → review+

Lorenzo Colitti

Reporter

Comment 23

•

12 years ago

trybuild 6a5cac0f6a72 fixes the problem for me. Thanks!

Honza Bambas (:mayhemer)

Assignee

Comment 24

•

12 years ago

(In reply to Patrick McManus [:mcmanus] from comment #21)

We have 2 weeks until the next train.  Let's land this patch that for sure fixes this bug.  I want to keep this patch minimal.

The rest should be landed in a different bug with also some time to bake this to identify a culprit of a potential regressions.

Patrick McManus [:mcmanus]

Comment 25

•

12 years ago

(In reply to Honza Bambas (:mayhemer) from comment #24)
> (In reply to Patrick McManus [:mcmanus] from comment #21)
> 
> We have 2 weeks until the next train.  Let's land this patch that for sure
> fixes this bug.  I want to keep this patch minimal.
> 
> The rest should be landed in a different bug with also some time to bake
> this to identify a culprit of a potential regressions.

if that's what you want to do that's ok by me.. please remove the comment about 21 seconds though.. I think its misleading as that number varies a lot by OS and by the arp status of your route selection.. I don't want people thinking that we enforce some kind of particular timeout as part of necko beacuse afaict we don't.

Lorenzo Colitti

Reporter

Comment 26

•

12 years ago

As you say, the timeout varies by OS. Windows is 21 seconds, OS X can be up to 75 seconds, and Linux is typically 3 minutes (!). Maybe just say "... that can take tens of seconds to time out if there are connectivity problems"?

Honza Bambas (:mayhemer)

Assignee

Comment 27

•

12 years ago

Comment on attachment 624794 [details] [diff] [review]
v1

https://hg.mozilla.org/integration/mozilla-inbound/rev/530d57b90461

Attachment #624794 - Flags: checkin+

Ed Morley [:emorley]

Comment 28

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/530d57b90461

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla15

Lorenzo Colitti

Reporter

Comment 29

•

12 years ago

Any chance this could get cherry-picked into a release before World IPv6 Launch? On that date users with broken IPv6 (approximately 0.02% of users worldwide) will start experiencing substantial delays when visiting websites such as Google, Facebook, Yahoo, Netflix, AOL, and so on.

Josh Aas

Comment 30

•

12 years ago

Comment on attachment 624794 [details] [diff] [review]
v1

I'd be supportive of at least taking this to Aurora (Firefox 14) before the June 5 merge. Once that happens we can discuss beta.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 684893
User impact if declined: Significant connection delays with IPv6.
Testing completed (on m-c, etc.): Been on m-c for nearly a week.
Risk to taking this patch (and alternatives if risky): My guess is that risk is moderate. No alternatives.
String or UUID changes made by this patch: None

Attachment #624794 - Flags: approval-mozilla-aurora?

Alex Keybl [:akeybl]

Comment 31

•

12 years ago

Josh - is Lorenzo's assertion in Comment 29 true? If so, that'd only be ~2000 Firefox 10-12 users affected on 6/6. I think we should weigh that against the moderate risk evaluation if this has the possibility of regressing unaffected users.

If there's more reason to be concerned (esp. w/r/t FF12/13), please attend today's channel meeting at 2PM PT to help clarify the situation.

tracking-firefox13: --- → +

tracking-firefox14: --- → +

tracking-firefox15: --- → +

Josh Aas

Comment 32

•

12 years ago

(In reply to Alex Keybl [:akeybl] from comment #31)

I think we've made a pretty strong commitement to working well with IPv6 and this bug is a significant problem for our IPv6 support. If this problem affects you you'll simply have to use another web browser, it doesn't seem like the kind of thing people can put up with for even a short amount of time. The reporter here was reporting a 20 second hang in load time.

The number of users affected by this bug will go up significantly starting on June 5, though it will remain small. I think this fix is safe enough to take in Aurora (FF14) now, which means it'll sit on beta for six weeks. If there is a problem we can simply back it out.

Lukas Blakk [:lsblakk] use ?needinfo

Comment 33

•

12 years ago

Comment on attachment 624794 [details] [diff] [review]
v1

[Triage Comment]
Given Josh's last comment, I'm happy to see this land on Aurora before our merge of 14 to the Beta channel tomorrow.  Having some bake time, with more users, should minimize the risk of regressions going uncaught and supporting iPv6 in the wider audience is indeed important to user retention.

Attachment #624794 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Patrick McManus [:mcmanus]

Comment 34

•

12 years ago

I believe honza is away - so I'll land this on aurora tonight or tomorrow morning unless we hear from him to do otherwise.

Patrick McManus [:mcmanus]

Comment 35

•

12 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/762d16222aba

status-firefox13: --- → affected

status-firefox14: --- → fixed

status-firefox15: --- → fixed

Honza Bambas (:mayhemer)

Assignee

Comment 36

•

12 years ago

(In reply to Patrick McManus [:mcmanus] from comment #35)
> https://hg.mozilla.org/releases/mozilla-aurora/rev/762d16222aba

Thanks.

Honza Bambas (:mayhemer)

Assignee

Updated

•

12 years ago

Whiteboard: [http-conn]

You need to log in before you can comment on or make changes to this bug.