Closed Bug 1245059 Opened 8 years ago Closed 8 years ago

network interfaces coming and going frequently causes connection timeouts

Categories

(Core :: Networking, defect)

38 Branch
Unspecified
Windows
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox48 --- fixed

People

(Reporter: lesliewu2008, Assigned: bagder)

Details

(Whiteboard: [necko-active])

Attachments

(4 files, 2 obsolete files)

Attached image File_WireSharkError.png
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36

Steps to reproduce:

manually access a login page in firefox 38


Actual results:

after click sign  in in login page, 
during the page load process, firefox suddenly stopped load the page. and a blank page open . 

by checking wireshark, see a [FIN,ACK] is sent from firefox to server after Firefox try to load the new page, see attached picture. 


Expected results:

firefox load the page successfully
Attached file log.txt
add http log for debug
I am working with bug reporter on this bug....
we found that change preference "network.http.network-changed.timeout" to bigger value can solve this bug. 

==================================================================================

more information:
As mentioned in last post,  we finally found change preference "network.http.network-changed.timeout" to bigger value can solve this bug. 

as we spend lots of time in checking this bug, following is our other findings which may help in analyze the bug:
definition of this bug: 
we see [FIN,ACK] sent from firefox to server to trigger stop the connection. 
we use two different login page to reproduce this. (both login page need >10s to login).
what it appears from user side is that after click sign in in our login page, page begin to load, but after a while , it suddenly stopped to load the page. (in wireshark log check, we see [FIN,ACK] sent from firefox to server)

1. the bug seems is quite intermittent,  for manually reproduce it, we found that only certain VM in our cloud can reproduce it. we didn't reproduce this issue on our local PC
2. for firefox version, although this preference we found "network.http.network-changed.timeout", was introduced in firefox 36, but during our huge testing , in all (>1000 cases), 

it shows that we can't reproduce on the following firefox version:
24ESR, 31 ESR, 34, 35, 36. 
we can reproduce on the following firefox version:
37,38, 40. 

we only tried the above firefox version, for any firefox version, i use its last available small version. 

3. we also found some wired things that in most our reproduced evn, when reproduced,  firefox sent  [fin,ack] after it doesn't receive anything from server after around 5 seconds. (as we didn't change the default setting before. )
but on one system(with the same default setting ), Firefox sent [FIN,ACK] after 10 seconds, even this preference is set to 5 on the machine. 

while for all reproduced machine, if this preference is set bigger enough (like 500), the issue gone.
for "network change event" , which is related to ""network.http.network-changed.timeout ", does it mean that the client(which has installed firefox) changed network connection. 
or can also be, firefox itslef, changed from 1 network to another network (for example , by change proxy)? 


seems there is no network change in our env when we reproduce the bug. we can check with our IT people related to "network change".

in one our example which has reproduced this bug, we have used a most easy login page, which was build for this bug on purpose,   the login does nothing, but after user click login, server side just wait 20 seconds, and then output such text. 
 
when we use a testing tool to automate such process run >100 times continuously , we can produce it intermittently.  some times around 1 in 100, some times, a little bigger chance. 

we have a very complex login page, during login, lots of js run and have redirection, and in total ,takes >10s to login. for this case, it can reproduce some times. 

also as i mentioned, we see this bug only since firefox 37. 

i didn't see we involve net work change . we didn't change firefox proxy. we also didn't change system's network.
Component: Untriaged → Networking
Product: Firefox → Core
firefox detects network changes, like change from cable to wifi, routing table changes. Firefox is not changing network. So your implementation cannot provoke these change events.

I will need info bagder, he knows this code the most.
Flags: needinfo?(daniel)
When a network change is detected - and this log clearly says that Firefox detected one - Firefox can't know if the existing connections are still "alive" and functional. It therefor checks them in various ways and each existing HTTP/1 connection has [network.http.network-changed.timeout] seconds to get traffic to show that it is still alive/in-use and if not, it is assumed to have stalled due to the network change and it is killed.

Increasing the timeout will of course reduce the risk of false positives with terribly slow servers.
Flags: needinfo?(daniel)
why does the testing tool generate network change events? Is that expected? IF we're reacting to spurious events, that would be a problem and nothing seems obvious from the reported that they would expect this.
Flags: needinfo?(daniel)
Summary: Firefox 38 show a blank page after login → unexpected network change events causing connection timeouts
Daniel, excuse my ignorance, please.
I'd like to understand where in the log file you have read that " When a network change is detected - and this log clearly says that Firefox detected one".
I saw this fragment, and I don't know what corresponds to the network change detection.

2016-02-02 03:22:54.701000 UTC - 0[2611140]: nsInputStreamPump::OnInputStreamReady [this=b7b8b20]
>2016-02-02 03:22:54.701000 UTC - 0[2611140]:   OnStateStop [this=b7b8b20 status=0]
>2016-02-02 03:22:54.701000 UTC - 0[2611140]: nsHttpChannel::OnStopRequest [this=b797000 request=b7b8b20 status=0]
>2016-02-02 03:22:54.701000 UTC - 0[2611140]:   calling OnStopRequest
>2016-02-02 03:22:54.701000 UTC - 0[2611140]: nsHttpChannel::CloseCacheEntry [this=b797000] mStatus=0 mCacheEntryIsWriteOnly=0
>2016-02-02 03:22:54.701000 UTC - 0[2611140]: Destroying nsHttpChannel [this=b797000]
>2016-02-02 03:22:54.701000 UTC - 0[2611140]: Destroying HttpBaseChannel @b797000
>2016-02-02 03:22:54.701000 UTC - 0[2611140]: Destroying nsHttpConnectionInfo @b7b8ac0
>2016-02-02 03:22:55.685000 UTC - 5340[2611680]:     ...returned after 1000 milliseconds
>2016-02-02 03:22:55.685000 UTC - 5340[2611680]: nsHttpConnectionMgr::Observe [topic="timer-callback"]

It is not just a lazy curiosity. We are trying to understand what could be a real cause of nework change event.
(In reply to Daniel Stenberg [:bagder] from comment #6)
> When a network change is detected - and this log clearly says that Firefox
> detected one - Firefox can't know if the existing connections are still
> "alive" and functional. It therefor checks them in various ways and each
> existing HTTP/1 connection has [network.http.network-changed.timeout]
> seconds to get traffic to show that it is still alive/in-use and if not, it
> is assumed to have stalled due to the network change and it is killed.
> 
> Increasing the timeout will of course reduce the risk of false positives
> with terribly slow servers.
The testing tool doesn't generate network change event. But anyway intermittently FF detects(? we not sure that it is a case) such event. This is the reason for this bug filing.
Could it be that FF on its own test connection upon sending request for data, if responce is not complete during timeout?
More Background:
A testing tool is involved as a tool to  functional testing the target webs site. And this bug is encountered when do functional testing.

after that , in order to analyze this issue. we worked on reproduce the bug, we successfully worked on reproduce this bug intermittently on a product's login page manually on Firefox (we disabled all plugin and add-on during reproduce), version 37 and 38.  So it is not related to the testing tool.  

Another finding is that using the same environment which can manually reproduce this bug on firefox version 37 and 38 (only change firefox version ), the same issue just can't be reproduced on firefox 36. 

But we found that the setting and feature related to network change is introduced in Firefox 36. 
see: https://hg.mozilla.org/mozilla-central/rev/0a066b7ffa46 

This seems conflict with the assumption that the issue is caused by a real network change , and related to the server side's low performance.  (as if this is true, we should be able to reproduce on firefox 36)
(In reply to arkady from comment #8)

> I'd like to understand where in the log file you have read that " When a
> network change is detected - and this log clearly says that Firefox detected
> one".

Sure. Look at this line:

>2016-02-02 03:22:49.679000 UTC - 0[2611140]: nsHttpHandler::Observe [topic="network:link-status-changed"]

That's the httphandler's observe function that subscribes to the network change notice. And then you'll see that 5 seconds later, at 03:22:54 it will prune connections that haven't had any traffic in that time frame:

>2016-02-02 03:22:54.681000 UTC - 5340[2611680]: nsHttpConnectionMgr::PruneNoTraffic
Flags: needinfo?(daniel)
(In reply to lesliewu2008 from comment #10)

> But we found that the setting and feature related to network change is
> introduced in Firefox 36. 
> see: https://hg.mozilla.org/mozilla-central/rev/0a066b7ffa46 
> 
> This seems conflict with the assumption that the issue is caused by a real
> network change , and related to the server side's low performance.  (as if
> this is true, we should be able to reproduce on firefox 36)

I think it is the other way around. It only helps us confirm that what you're observing is exactly this triggered network change and what Firefox does as a result of that, and that causes this issue.

Before Firefox acted on network changes, it instead suffered from other problems as it didn't properly realize things had changed and it would get stuck or serve old stuff.

What we really want here is to understand what the network change is that Firefox detects that you're not aware of, as it seems like a sort of ghost event that we don't really expect or desire at this point and it would be worthwhile investigating what we can do to filter that one out.

On Windows, we use the Operating systems's notification system and relies on that to tell us when "something" has changed in the network, then we do a checksum on a bunch of network info and make sure that the new checksum is different than the previous run before we send the network change event.
Assignee: nobody → daniel
Whiteboard: [necko-active]
(In reply to Daniel Stenberg [:bagder] from comment #12)
> (In reply to lesliewu2008 from comment #10)
> 
> > But we found that the setting and feature related to network change is
> > introduced in Firefox 36. 
> > see: https://hg.mozilla.org/mozilla-central/rev/0a066b7ffa46 
> > 
> > This seems conflict with the assumption that the issue is caused by a real
> > network change , and related to the server side's low performance.  (as if
> > this is true, we should be able to reproduce on firefox 36)
> 
> I think it is the other way around. It only helps us confirm that what
> you're observing is exactly this triggered network change and what Firefox
> does as a result of that, and that causes this issue.
> 
> Before Firefox acted on network changes, it instead suffered from other
> problems as it didn't properly realize things had changed and it would get
> stuck or serve old stuff.
> 
> What we really want here is to understand what the network change is that
> Firefox detects that you're not aware of, as it seems like a sort of ghost
> event that we don't really expect or desire at this point and it would be
> worthwhile investigating what we can do to filter that one out.
> 
> On Windows, we use the Operating systems's notification system and relies on
> that to tell us when "something" has changed in the network, then we do a
> checksum on a bunch of network info and make sure that the new checksum is
> different than the previous run before we send the network change event.

so it is confirmed that something happens. We need to understand what and why.
Daniel would you help us catch it?

I mean that we need a guidance in creating additional logs and special settings for FF in order to catch the orginator of this network change event.
Hopefully it didn't take creation of special debug instance of FF itself.

Please direct us through these steps
(In reply to arkady from comment #13)

> I mean that we need a guidance in creating additional logs and special
> settings for FF in order to catch the orginator of this network change event.
> Hopefully it didn't take creation of special debug instance of FF itself.

If you can somehow narrow down when this event happens, it could possibly be useful to capture network activities during a few seconds before and under the Firefox detecting the network change. The logging module in Firefox  called "nsNotifyAddr" is otherwise what you can enable to see some additional details on the detection (the code for it is in netwerk/system/win32/nsNotifyAddrListener.cpp)

You can also of course add additional logging in there, but if you don't know what network change that is happening then I don't think more logging will help much.

Which Windows version are you using? Are you using IPv6 on the network where this happens?
(In reply to Daniel Stenberg [:bagder] from comment #14)
> (In reply to arkady from comment #13)
> 
> > I mean that we need a guidance in creating additional logs and special
> > settings for FF in order to catch the orginator of this network change event.
> > Hopefully it didn't take creation of special debug instance of FF itself.
> 
> If you can somehow narrow down when this event happens, it could possibly be
> useful to capture network activities during a few seconds before and under
> the Firefox detecting the network change. The logging module in Firefox 
> called "nsNotifyAddr" is otherwise what you can enable to see some
> additional details on the detection (the code for it is in
> netwerk/system/win32/nsNotifyAddrListener.cpp)
> 
> You can also of course add additional logging in there, but if you don't
> know what network change that is happening then I don't think more logging
> will help much.
> 
> Which Windows version are you using? Are you using IPv6 on the network where
> this happens?

We cannot find Network change event analyzing Windows event viewer -system events.
Could you guide us what sources and/or event type ids of Windows events you are listening to generate Newtwork change event?
Then  we can set filter in Windows event viewer to identify events by source and event Ids.
Probably you can suggest some alternative method of catching this ghost event.
(In reply to arkady from comment #15)

> We cannot find Network change event analyzing Windows event viewer -system
> events.
> Could you guide us what sources and/or event type ids of Windows events you
> are listening to generate Newtwork change event?
> Then  we can set filter in Windows event viewer to identify events by source
> and event Ids.
> Probably you can suggest some alternative method of catching this ghost
> event.

The code we're talking about is here: http://mxr.mozilla.org/mozilla-release/source/netwerk/system/win32/nsNotifyAddrListener.cpp#175

It basically calls NotifyIpInterfaceChange (on Vista or later) and lets the operating tell when something has changed. When it is notified, it calls CheckAdaptersAddresses() in that same source file, to get a checksum of all the network adapters and IP addresses used - and if that checksum is different than previously, it consider it a valid network change. See http://mxr.mozilla.org/mozilla-release/source/netwerk/system/win32/nsNotifyAddrListener.cpp#573

Hence my question about IPv6, as that has caused similar problems before on other platforms since it has link-local addresses etc that can come and go more frequently.

Are you capable of building your own Firefox build? If so, you could instrument the above mentioned code with some additional logging to help track down what changes that are happening. Or would you be able to run a special build I can make for you with that sort logging added?

(and feel free to needinfo me when you ask questions, as then I'll find my way back to respond easier and faster)
Flags: needinfo?(lesliewu2008)
Our servers where we had this issue use IPv6 and IPv4.  If you are able to provide a special build for us with the logging turned on, we would try to reproduce the issue with it and share the log with you.
Flags: needinfo?(daniel)
We're also affected by this. Some of our internal users reported that long running requests were aborted prematurely after upgrading to Windows 10.

I used a clean Windows 10 VM from Microsoft to verify that I do get the network:link-status-changed message, and setting network.notify.changed to false as a workaround reportedly makes the problem go away for the users.

I have no idea if the issue is with the configuration of our internal network, or perhaps with our uncommon use of requests that span 10s of seconds (or minutes).
I'll produce a build with more logging to help us detect what causes this to happen.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Flags: needinfo?(daniel)
OS: Unspecified → Windows
Ok, here we go.

I submitted a try build with an additional logging patch for network change events. The added logging will display all detected "network adapters" and their IP addresses and then checksum for that which Firefox uses to determine if it actually was a change worth bothering about.

For reference, the try build is here (where the patch can also be inspected): https://treeherder.mozilla.org/#/jobs?repo=try&revision=463dc0728d38&selectedJob=17044338

The 64 bit build is available from http://archive.mozilla.org/pub/firefox/try-builds/daniel@haxx.se-463dc0728d3800f67673b9457512fd71549ce790/try-win64-debug/ and the 32bit alternative is at http://archive.mozilla.org/pub/firefox/try-builds/daniel@haxx.se-463dc0728d3800f67673b9457512fd71549ce790/try-win32-debug/

To get sufficient logging to help us, you should probably consider the following environment variable values:

NSPR_LOG_MODULES=timestamp,nsNotifyAddr:5,nsIOService:5
NSPR_LOG_FILE=logfile.txt

It'd be the most helpful if you'd run this version when you catch more than one of these network events, ideally a whole range of them. Then we can try to figure out why they show up and potentially how to ignore the non-interesting ones.

I don't think this logging will reveal any sensitive data for you, but it will include IP address details. Inspect it before you submit it here!

The "HTTP Logging" concept with Firefox is also further explained and detailed here: https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

Thanks a lot for helping out here!
Attached file win10-vbox.log
I tried the patched build in a Win10 VM running in VirtualBox. It should be running with the default network settings as configured by Microsoft for the VM.

I just started firefox and left it running for a short while. The address logging appears to be somewhat broken, but as far as I can tell "F21DA649-1F54-4518-9D5C-DB98292F6FB7" is the "Microsoft Teredo Tunneling Adapter".
Thank you! I obviously screwed up the showing of the IP address part of the adapters, but that turned out to be next to non-relevant anyway. Your log clearly shows this F21DA649 pop up and vanish again every few minutes and that is going to be devastating to performance and proper behavior.

Now I have to admit that it doesn't seem completely healthy to have a network adapter pop in and out that often, but of course Firefox needs to handle the situation nonetheless.
I've really struggled to find a good fix for this problem (ongoing discussion here: https://groups.google.com/d/msg/mozilla.dev.tech.network/LWQYQz1g6RQ/mEK-3-beGwAJ).

My suggested short-term work-around is to disable the detection of IPv6 network changes by switching over to the old API (that doesn't know about IPv4) and the attached patch does exactly that - with a pref that can be set to force Firefox detect IPv6 changes as well (pref change requires a restart to "take").

This would be a work-around to address the immediate problems, not a proper long-term solution.
Flags: needinfo?(lesliewu2008)
Attachment #8731581 - Flags: review?(mcmanus)
(blah, of course I meant to write that the old API doesn't know about IPv6)
Summary: unexpected network change events causing connection timeouts → network interfaces coming and going frequently causes connection timeouts
Comment on attachment 8731581 [details] [diff] [review]
0001-bug-1245059-disable-NotifyIpInterfaceChange-use-for-.patch

Review of attachment 8731581 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/libpref/init/all.js
@@ +1237,5 @@
>  // setup change is noticed while running.
>  pref("network.notify.changed", true);
> +#if defined(XP_WIN)
> +// Allow network detection of IPv6 related changes (bug 1245059)
> +pref("network.notify.IPv6", false);

please define as true for other platforms.. hidden prefs suck
Attachment #8731581 - Flags: review?(mcmanus) → review+
Updated
Attachment #8731581 - Attachment is obsolete: true
Attachment #8732124 - Flags: review+
Keywords: checkin-needed
I had to back this out because something from the push that included this seems to have broken Windows mochitests like https://treeherder.mozilla.org/logviewer.html#?job_id=24152010&repo=mozilla-inbound
https://hg.mozilla.org/integration/mozilla-inbound/rev/e0b052f1f5cd
I'm thinking it was Bug 1209273 (also checked in in the same push), but I'll wait for confirmation before re-landing anything.
Flags: needinfo?(daniel)
Oh wow how silly, yes there's clearly a '#' missing for #else!!
Flags: needinfo?(daniel)
fix stupid 'else' mistake
Attachment #8732124 - Attachment is obsolete: true
Attachment #8732434 - Flags: review+
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/7c4757cf0988
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Is it possible for this bug or a similar effect to have re-surfaced?  We were running into a very random and hard-to-dupe problem where Firefox would abandon a connection, and it seemed to be fixed in FF48.  But we've now started seeing these mysterious dying connections again in FF50 (windows only).

Has anybody else seen this problem get fixed and now come back?

I will continue investigating; we never could duplicate the behaviour with any regularity, and it's possible we've got something else going on.  But so far it's only cropped up when using Firefox on windows, and smells a whole lot like this bug.
(In reply to Owen Swerkstrom from comment #36)

> Is it possible for this bug or a similar effect to have re-surfaced?

When it comes to bugs and weirdo network effects, everything is possible! =) But I'm not aware of anything in the network area that changed that should make this happen again.

The problem that we fixed in this bug had to do with (Ipv6) network adapters coming and going every few minutes, and Firefox treating each such occasion as a "network" change. If you have that happening, then it certainly feels like this bug - again. If you're not seeing network changes or network adapters coming and going, then I would say you're seeing something different (even if the symptoms may very well be quite similar).
After a bit more digging, it looks like the effects of this bug never really went away, they just became more rare.  A network event happens, and Firefox drastically shrinks the timeouts on its existing connections.  For cached connections that makes sense, but in our case, we have something actively open (on an interface that is not going away) which isn't necessarily going to get its response in 5 seconds, and it gets killed.

We're putting in some workarounds and a way to reconnect and get the original connection's intended response, but if you're interested, here's what we're doing to test the situation:

0) Make a request via windows Firefox which will take a long time to get any data back. (to localhost or any IPv4 address)
1) Open the windows Control Panel.
2) Open "Network and Internet"
3) Open "Network Connections"
4) Right click on an adapter, and select "Properties".
5) Uncheck "Internet Protocol Version 6 (TCP/IPv6) and hit the "OK" button.
6) Right click on an adapter, and select "Properties".
7) Check "Internet Protocol Version 6 (TCP/IPv6) and hit the "OK" button.
8) Wait about 7 seconds

Firefox's connection dies, and any mission-critical response it was going to get is lost.
Doing the same test using Chrome (or even IE!) does not kill the connection.

On our windows VM's, this behaviour happens seemingly at random.  Clearly there are other things that trigger the networking events Firefox is responding to, but regardless of where they come from, it would sure be nice to have existing connections stay open.
Right, a network change risks making all the existing connections stale and that's a really bad experience when things just sit there. That is why we use a timeout to do this (lame) detection to guess if the HTTP/1 connections are still functional.

Firefox will allow network.http.network-changed.timeout number of seconds and with nothing received then, the connection is considered dead. The default time there is 5, because it is still unusual that you sit waiting for that long (in the exact moment you get a network change). But as you say, it happens to you.

You can to bump up that number a bit as a work-around at least. Especially if you're not moving legitimately around between networks a lot

It is also curious why you keep getting that frequent network changes, as that is also not really normal. Do you have any idea what's going on at that level?
This one's out of our hands, it's a web-accessible setup tool our customers use to configure our services.  So changing timeouts on our own copies of Firefox won't help.  :^)

We don't know where these network events come from; it may be an odd effect of how our VMs are arranged.  But we have seen this happen on customer sites too.  Maybe there's DHCP chatter going on, maybe the VM host is doing something to keep tabs on resources, we can only guess.  Plus it's windows, all kinds of spooky **** happens for no apparent reason.  ;^)  We don't actually know how frequent these are, just that it's bad news if one happens to hit during a slow FF request.

Anyway, our workarounds seem to be sufficient, but it's worth mulling over if there is anything more intelligent that can be done with open, active, awaiting-a-response connections than just cutting them off.  (As mentioned, Chrome and even IE manage not to act this way.)

Whatever these events are, they do not (necessarily, always) correlate very strongly with "we've switched networks".  Maybe there's no better mechanism available for knowing that on windows, though?  I can absolutely see how you'd want to close stale connections instead of waiting forever, and how it may be hard or impossible to tell whether a connection has any chance of coming back or can be safely left for dead.  I wish I had some more constructive suggestions!  :^)

Sorry to resurrect this ancient bug report, but I ran into the same issue and may be able to add some context. I'm dealing with an unlucky combination of two things:

  1. A web-based application that uses HTTP Long Polling to react to server events. This technique uses connections that go without traffic for up to 30 seconds in the absence of events.
  2. I'm frequently starting and stopping VMs on my PC, causing (completely unrelated) network interfaces to appear and disappear.

My PC is never roaming between networks, so I can easily work around the issue by setting network.notify.changed to false. Still, I'd be very happy if Firefox could implement some more sophisticated check to see whether existing connections are still alive and functional. Maybe you could temporarily enable TCP keep-alive to send some probes to the peer?

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: