Closed Bug 725587 Opened 8 years ago Closed 7 years ago

Firefox jumps randomly from IPv6 to IPv4 and vice versa in dual-stack environment

Categories

(Core :: Networking: HTTP, defect)

10 Branch
defect
Not set

RESOLVED FIXED
mozilla21

People

(Reporter: schaal, Assigned: mayhemer)

References

(Depends on 1 open bug)

Details

(Keywords: regression, Whiteboard: [http-conn])

Attachments

(2 files)

User Agent: Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Build ID: 20111221214946

Steps to reproduce:

In our dual-stack environment we have client and host machines running on both IPv4 and IPv6 with native IPv6 peering. All employees here use an ORTS ticketing system which runs on a dual-stacked web server in our local area network. We received an update to Firefox 10.0 on our Ubuntu, Windows XP, and Windows 7 machines. As usual, we logged on to the ticketing system's site via SSL.


Actual results:

After the update several people had problems with the ticket system. The SSL sessions were interrupted while they were editing tickets. We had never noticed such behaviour from the ticketing system before and started investigating.

The logs on the web server showed that the clients with Firefox 10.0 randomly jumped from IPv6 to IPv4 and vice versa. This causes the SSL session to become invalid. We tried pinning the hostname to its IPv6 address in the hosts files on Ubuntu, Windows XP, and Windows 7. This did not solve the problem.

After installing the 4or6 plug-in on one of the Windows XP systems, deactivating fallback to IPv4, and setting our ticketing system's site to IPv6 in the plug-in, the behaviour stopped.

We also changed network.http.fast-fallback-to-IPv4 from true to false on one of the Ubuntu machines, and the behaviour also stopped.

We tested our environment as well and it is reliable in every respect: DNS resolution, the web server, the ticketing system's server config, and all Layer 3 switches and network interfaces.


Expected results:

Firefox should have connected to the ticketing system's site, established a stable SSL session over IPv6, and then stayed on IPv6 for that site, which all earlier Firefox builds treated as stable. It should not have flapped between IPv4 and IPv6 within minutes, causing the active SSL session to become invalid.
OS: Linux → All
Hardware: x86 → All
Addition:

Our senior network engineer informed me that all other websites in our environment are now regularly retrieved over IPv4 since Firefox was updated to 10.0. The ticketing system attracted our attention mainly because the logins are bound to IP addresses. In our opinion this bug impacts all dual-stacked websites.
We implemented "happy eyeballs" with bug 684893. Setting network.http.fast-fallback-to-IPv4 to false will disable it.
"Happy eyeballs" can cause a jump from IPv6 to IPv4 if the IPv6 connection responds more slowly than the IPv4 link, and vice versa.
I agree that this can cause problems if logins are bound to IPs.

Are you willing to provide a Mozilla HTTP log if necessary?
Status: UNCONFIRMED → NEW
Component: Untriaged → Networking: HTTP
Depends on: 684893
Ever confirmed: true
Keywords: regression
Product: Firefox → Core
QA Contact: untriaged → networking.http
Sure I am. But I have to set up a test environment to produce this data first. Unfortunately I am not allowed to hand out any data from our production systems to a third party. I hope you understand that.

I will try to provide you with the data from a testing scenario with a similar build this month.

Kind regards
To explain how our "effective implementation of happy eyeballs" works:

- we do a normal DNS resolution process (IPv4 and IPv6 both enabled)
- we try to connect to each IP in the order they come from the DNS server
- if the TCP connection takes more than 250 ms (the default value of the pref) to establish, we start another DNS resolution process with just IPv4 enabled; we still wait for the first connection to finish
- when the IPv4 resolution is done, we also try to connect using the list of now only IPv4 addresses; that is the second connection attempt, while the first still proceeds
- the outcome now depends on which wins: the first connection, which is probably IPv6, or the second, which is certainly IPv4

This way we quickly fall back from "faulty" IPv6 to IPv4.
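The steps above can be sketched as a small timing model. This is only an illustration under stated assumptions, not the actual Gecko code: the function name `race`, the labels "primary" and "backup", and the millisecond inputs are hypothetical, and only the 250 ms default comes from the description above.

```python
from typing import Optional

# Default delay before the IPv4-only backup is started (value from the comment above).
FALLBACK_DELAY_MS = 250.0


def race(primary_connect_ms: Optional[float],
         ipv4_resolve_ms: float,
         backup_connect_ms: Optional[float],
         delay_ms: float = FALLBACK_DELAY_MS) -> Optional[str]:
    """Return which attempt finishes first: "primary", "backup", or None.

    primary_connect_ms: time for the first (IPv4+IPv6 lookup) attempt to
        establish TCP, or None if it never connects.
    ipv4_resolve_ms: duration of the second, IPv4-only DNS resolution.
    backup_connect_ms: TCP connect time of the IPv4-only attempt, or None.
    """
    finish = {}
    if primary_connect_ms is not None:
        finish["primary"] = primary_connect_ms

    # The backup is only started when the primary has not connected in time.
    primary_is_slow = primary_connect_ms is None or primary_connect_ms > delay_ms
    if primary_is_slow and backup_connect_ms is not None:
        finish["backup"] = delay_ms + ipv4_resolve_ms + backup_connect_ms

    if not finish:
        return None
    # The primary keeps running, so whichever attempt finishes first wins.
    return min(finish, key=finish.get)
```

With connect times of a few milliseconds the backup never starts in this model, so any flapping has to come from an attempt that exceeds the delay.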

The current implementation is a quick-and-easy way to work around faulty dual-stack environments and is not a full implementation of the happy eyeballs spec. I wanted to implement the full spec first, but this seemed to be simple and to work in most cases.

To solve this bug, we have to remember which address we successfully connected to and, if it is IPv6, remember that we are in a working IPv6 environment and not try IPv4-only on the second attempt. Note: this information will persist only for the duration of the browser session. We don't have a way to persist it, and persisting it is probably not a good idea anyway, since laptops and tablets move between various network configurations. Based on that, I also think we could expire this information after some 6 h/12 h and after that prolong the timeout for a secondary IPv4-only connection attempt.
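A minimal sketch of that idea, assuming an in-memory table keyed by hostname. The class name `FamilyMemory`, its methods, and the injectable clock are hypothetical and not taken from the patch. The sketch only re-enables the backup after expiry; it does not model the prolonged timeout mentioned above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

SIX_HOURS_S = 6 * 60 * 60  # one of the expiry values floated above


class FamilyMemory:
    """Session-only memory of the address family that worked for each host."""

    def __init__(self, ttl_s: float = SIX_HOURS_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._seen: Dict[str, Tuple[str, float]] = {}  # host -> (family, timestamp)

    def record_success(self, host: str, family: str) -> None:
        """Remember that `host` was reached over `family` ("ipv4" or "ipv6")."""
        self._seen[host] = (family, self._clock())

    def known_family(self, host: str) -> Optional[str]:
        """Return the remembered family, or None if unknown or expired."""
        entry = self._seen.get(host)
        if entry is None:
            return None
        family, when = entry
        if self._clock() - when >= self._ttl_s:
            del self._seen[host]  # expired: forget it
            return None
        return family

    def should_start_ipv4_backup(self, host: str) -> bool:
        """Skip the IPv4-only backup only while IPv6 is known to work."""
        return self.known_family(host) != "ipv6"
```

Nothing here is written to disk, so the memory disappears with the process, matching the session-only lifetime described above.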

This fix is simple but will take time to get into the release.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Just to point it out again:

We use IPv6 primarily in our environment, as we are in the ISP business, and use IPv4 only as a backup. I've done some name resolution tests, and both IPv4 and IPv6 lookups took between 1 and 6 ms.

ICMP from the web server to the client and vice versa: 3 ms average over 500 ICMP packets.
(Tested with the ping6 command.)

There is still something wrong, as we don't even come close to the 250 ms threshold with an IPv6 connection.

It seems to me as if the algorithm gets the IPv4 resolution handed to it first by the getaddrinfo() function and then tests IPv4 against IPv4. This is not what the happy eyeballs implementation should do.

It should always test IPv6 against IPv4, regardless of which DNS resolution is handed to the process first.

All the more, it should consider an IPv6 link stable if the connection is established in under 50 ms, which is a fifth of the threshold you mentioned.
It may happen that the first SYN packet is lost. Then we can easily reach the 250 ms timeout. That is actually the original purpose of the backup connection code.
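A rough arithmetic sketch of why one lost SYN is enough. It assumes an initial retransmission timeout of 1 second that doubles after each loss (the value recommended by RFC 6298; real stacks may differ), and the function name `connect_time_ms` is hypothetical.

```python
def connect_time_ms(rtt_ms: float, lost_syns: int,
                    initial_rto_ms: float = 1000.0) -> float:
    """Estimate the time until the SYN-ACK arrives when the first
    `lost_syns` SYN packets are dropped.

    Assumption: the sender waits initial_rto_ms before the first
    retransmission and doubles the wait after every further loss.
    """
    waited = sum(initial_rto_ms * 2 ** i for i in range(lost_syns))
    return waited + rtt_ms
```

Under these assumptions a 3 ms link with a single lost SYN takes roughly a second to connect, far beyond the 250 ms threshold, even though the path itself is healthy.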

The order/preference in which the addresses are returned depends on the system. We use the AF_UNSPEC | AI_ADDRCONFIG flags for the GetAddrInfo call in the case of the first (both IPv4 and IPv6 enabled) connection attempt.
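For anyone who wants to check the order their own system returns, here is a sketch using Python's standard `socket.getaddrinfo`, not the call Firefox itself makes. The helper names `resolve` and `family_order` are hypothetical, and `AI_ADDRCONFIG` is assumed to be available on the platform.

```python
import socket
from typing import List, Tuple


def resolve(host: str, port: int, ipv4_only: bool = False) -> List[Tuple[int, str]]:
    """Resolve `host` and return (family, address) pairs in system order.

    AF_UNSPEC asks for both families; AI_ADDRCONFIG limits the results to
    families that are configured on the local machine.
    """
    family = socket.AF_INET if ipv4_only else socket.AF_UNSPEC
    infos = socket.getaddrinfo(host, port, family=family,
                               type=socket.SOCK_STREAM,
                               flags=socket.AI_ADDRCONFIG)
    return [(fam, sockaddr[0]) for fam, _type, _proto, _canon, sockaddr in infos]


def family_order(addresses: List[Tuple[int, str]]) -> List[str]:
    """Return the address families in the order they first appear."""
    order: List[str] = []
    for fam, _address in addresses:
        name = "IPv6" if fam == socket.AF_INET6 else "IPv4"
        if name not in order:
            order.append(name)
    return order
```

Calling `family_order(resolve("your.host.example", 443))` on a dual-stack client shows which family the connection code would try first; the hostname here is a placeholder.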

We may also add a check that the first connection is stalled on an IPv6 address, and only then start the second connection for IPv4, when the host state is not known.

Also, the first attempt for a host might use a prolonged timeout. After we remember which address family has been reached successfully, we may drop back to the default 250 ms (or whatever the pref is set to).
I believe implementing this correctly depends on bug 715905.
Depends on: 715905
On the other hand, the information that we were able to connect to a host is enough for us, and that is available without that bug being fixed.

So, the dependency is just "weak".
Whiteboard: [http-conn]
Attached patch v1
- I'm storing the IP family preference state in nsConnectionEntry
- based on the IP family of the server we connect to first, we start preferring only that IP family
- this state is reset by Ctrl-F5 for now; this is debatable, since after Ctrl-F5 the user could potentially experience this bug again

Followup ideas:
- we may simply reset the preference when we connect with an IPv6-only preferred transport to an IPv4 server or vice versa
- the only reason we would ever need to reset is a potential slowdown when first connecting to a suddenly dead set of IP addresses; however, we still connect to something, since we then drop the DISABLE_IPVx flags in the socket transport and retry the lookup with IN_ADDR_ANY
Attachment #693347 - Flags: review?(mcmanus)
Comment on attachment 693347
v1

Review of attachment 693347:
-----------------------------------------------------------------

lgtm.

ideally we would want to tie clearing this stickiness to dns somehow, right? (i.e. prefer a particular v6 address but if that's not the one we're going to try now drop the preference) - but the connection entry hash is the only persistent list we've really got and hostname ought to get us >99% of the way there
Attachment #693347 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #10)
> ideally we would want to tie clearing this stickiness to dns somehow, right?
> (i.e. prefer a particular v6 address but if that's not the one we're going
> to try now drop the preference) - but the connection entry hash is the only
> persistent list we've really got and hostname ought to get us >99% of the
> way there

I'm not sure I completely understand. You mean to reset the pref when we are simply no longer able to connect to an IP that we could connect to before and based the preference on? I wanted to prefer a whole family and not just a single IP. We may still detect that a preferred family is dead by inspecting the peer address and checking the connection flags, i.e. whether we preferred IPv4 but connected over IPv6 or the like. A followup would probably be OK for this. I'd first want to check that this bug is now fixed.
(In reply to Honza Bambas (:mayhemer) from comment #11)
> (In reply to Patrick McManus [:mcmanus] from comment #10)
> > ideally we would want to tie clearing this stickiness to dns somehow, right?
> > (i.e. prefer a particular v6 address but if that's not the one we're going
> > to try now drop the preference) - but the connection entry hash is the only
> > persistent list we've really got and hostname ought to get us >99% of the
> > way there
> 
> I'm not sure I completely understand. You mean to reset the pref when we
> are simply no longer able to connect to an IP that we could connect to
> before and based the preference on? I wanted to prefer a whole family and
> not just a single IP. We may still detect that a preferred family is dead
> by inspecting the peer address and checking the connection flags, i.e.
> whether we preferred IPv4 but connected over IPv6 or the like. A followup
> would probably be OK for this. I'd first want to check that this bug is
> now fixed.

This patch should go ahead. I'm not even sure we need a follow on bug - what I'm suggesting would I think be good but also definitely more work/complexity than justifies it.

I was just saying that right now we might make entry A with address ::1 prefer ipv6, and if later that is still entry A but the address has changed to ::2, we still maintain that preference. my understanding of where ipv6 failures occur tells me that this is not necessarily a good assumption. does that make sense?
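To make the ::1 versus ::2 concern concrete, here is a sketch of stickiness tied to the address the preference was based on. It only illustrates the idea and is not what the patch does; the class `StickyFamily` and its methods are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


class StickyFamily:
    """Per-host family preference that is dropped when the address it
    was based on no longer appears in the current DNS answer."""

    def __init__(self) -> None:
        self._pref: Dict[str, Tuple[str, str]] = {}  # host -> (family, address)

    def remember(self, host: str, family: str, address: str) -> None:
        """Record the family and the exact address that connected."""
        self._pref[host] = (family, address)

    def preferred(self, host: str, current_addresses: Iterable[str]) -> Optional[str]:
        """Return the sticky family, or None if there is no valid preference."""
        entry = self._pref.get(host)
        if entry is None:
            return None
        family, address = entry
        if address not in set(current_addresses):
            # The host now resolves elsewhere: the old evidence is stale.
            del self._pref[host]
            return None
        return family
```

Keying only on the hostname, as the patch does, is the simpler variant of this; the extra address check is the additional complexity being weighed here.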
https://hg.mozilla.org/mozilla-central/rev/e4f69649d417
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Depends on: 825501
Going to back this out to prevent bug 825501 getting to Aurora.

This issue comes from the fact that I simply don't have a good IPv6 testing environment.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch backout v1
Attachment #697480 - Flags: review?(mcmanus)
Does a backout patch need a review?
Attachment #697480 - Flags: review?(mcmanus) → review+
Status: REOPENED → ASSIGNED
(In reply to Masatoshi Kimura [:emk] from comment #17)
> Does a backout patch need a review?

my understanding is it depends on how long the patch has been in the tree and is done at the discretion of the backout author.
Target Milestone: mozilla20 → ---
(In reply to Honza Bambas (:mayhemer) from comment #21)
> Comment on attachment 693347
> v1
> 
> Relanded: https://hg.mozilla.org/integration/mozilla-inbound/rev/816f076c2c15

Backed out due to build bustage(s).  Thanks Ehsan.
https://hg.mozilla.org/mozilla-central/rev/51a772f811e2
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
Blocks: 1190502
Duplicate of this bug: 816889