Closed Bug 939318 Opened 11 years ago Closed 9 years ago

NetworkLinkService should be enabled so Necko can respond to network changes (not offline auto-detection)

Categories

(Core :: Networking: DNS, enhancement)

25 Branch
enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35
Tracking Status
b2g-v1.3T --- wontfix
b2g-v1.4 --- wontfix
b2g-v2.0 --- wontfix
b2g-v2.0M --- affected
b2g-v2.1 --- wontfix
b2g-v2.1S --- affected
b2g-v2.2 --- fixed

People

(Reporter: sworkman, Assigned: bagder)

References

Details

(Keywords: dev-doc-needed)

Attachments

(10 files, 49 obsolete files)

89.26 KB, image/png
Details
8.61 KB, patch
bagder
: review+
Details | Diff | Splinter Review
8.23 KB, patch
bagder
: review+
Details | Diff | Splinter Review
29.94 KB, patch
bagder
: review+
Details | Diff | Splinter Review
4.42 KB, patch
bagder
: review+
Details | Diff | Splinter Review
9.05 KB, patch
bagder
: review+
Details | Diff | Splinter Review
39.99 KB, patch
bagder
: review+
Details | Diff | Splinter Review
299.96 KB, application/gzip
Details
610.19 KB, application/gzip
Details
379.03 KB, application/gzip
Details
Currently, the NetworkManager in Gecko is disabled and thus Necko does not respond to network state changes, e.g. switch WiFi networks, moving from VPN to non-VPN, resuming after laptop is sleeping etc. Host reachability/addressing could have changed: it's important for connectivity to refresh the DNS cache with accurate information pertaining to the current network, as well as dropping HTTP connections.

Note: It's likely that this work should be coordinated with enhancements to Gonk's NetworkManager, Bug 904514.
I've been told there are reasons that the NetworkLinkDetection is disabled, including something to do with autodial starting modem dialups on Windows - unsure if these anecdotes are relevant any more. It would be good to get a list of issues and start knocking them down one by one.

For Firefox OS, I think the work in 904514 might be useful. There may be more accurate information available for network state notifications.

Also, we could enable this on a per platform basis if there are too many issues on a particular platform.
After some investigation and discussion with :bagder, we're going to bypass offline auto-detection, and concentrate on detecting changes in the network link that would require a reset of the DNS cache or the current set of TCP connections. We're going to start with that and worry about offline auto-detection in a later bug.
Assignee: nobody → daniel
Status: NEW → ASSIGNED
Summary: NetworkManager should be enabled so Necko can respond to network changes → NetworkLinkService should be enabled so Necko can respond to network changes (not offline auto-detection)
First, there's a system-specific part of this functionality that monitors when there's a change in IP address/interface on OS-level. I've started focusing on the Windows side of things but you'll see that there are problems both in platform specific as well as in common code. 

Logic in nsNotifyAddrListener.cpp:

It has at least two current problems:

A - it uses the windows API NotifyAddrChange() which only checks IPv4 addresses

B - there's a race situation which makes it possible for the notification function to get notified and then when it checks for a "good" interface it will find one and it'll consider the network as up (again) and never report it as down.

The problem (B) is however not such a big deal it turns out.

When the notification funtion is called because Windows has found a change, it will send a notification event to the nsIOService no matter if an interface was found to be up or down or whatever. Something happened so it tells nsIOService.

Logic in nsIOService.cpp:

nsIOService subsribes to NS_NETWORK_LINK_TOPIC and if it has been told to monitor this it will eventually check the network status by asking the mNetworkLinkService (UP or DOWN are the only possible answers) and call SetOffline() to set it as offline or not.

Here's a problem:

C - SetOffline(), if told that the network is up when already considered up, will just merrily move on and be happy. Nothing to do, moving on to avoid doing work it doesn't have to do. But this could've been a huge change between two interfaces and networks. Similarly it doesn't do anything if told to be down when already down but that's not a concern of mine.

Plan:

For (A) we replace the code to use the improved API for modern windows. Probably with #ifdef or similar.

For (B) it's not really a problem but we can simplify the function since it won't have to tell the event listeners if the network is up or down, just that it changed. An improvement to consider is to somehow detect if the changed network is minor and not a reason to tell the world about (especially with my suggested C change below). I'm not sure exactly when this function triggers but for example DHCP refreshes or just a quick pull out/insert cable could possibly be such cases.

Ideally we should consider the UP-UP case on the _same_ network "adapter" (the windows term for an interface) to not be a change and not send any notification in that case.

For (C) we consider the UP => UP scenario to be a network refresh in need and do a DOWN/UP toggle. I'll do this first since I believe this is the biggest concern.

The underlying code should then make an effort to not send UP-UP events if it can detect that the network conditions are still the same: interface, IP address and gateway basically.

-------------------------

I've worked on cleaning up the code to handle (C) better and to make a difference between when someone sets the offline mode and when the network change notification is sent from the "lower" layers. The key point being to handle UP-UP which would indicate a network topology change. I've also added a new PREF for this purpose so that we can limit testing for this manage-network-change behavior to only those who actually enable this.

This is what my initial patch does and I'm very open for comments and critiques of whatever is in here. I'm perfectly aware that this single patch is not the end solution to all managed-offline problems.
Comment on attachment 8374705 [details] [diff] [review]
version 1, not merge-worthy yet but ready for discussion and feedback

Review of attachment 8374705 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like we're going in the right direction to me.

So, next I think you're adding a patch for Windows (NotifyAddrListener), right? We also need a patch for the DNS Service - it's SetOffline function doesn't really do a whole lot. We'll need it to clear out the cache and notify all callbacks. We can discuss that offline.

Good work. f+.

::: netwerk/base/src/nsIOService.cpp
@@ +47,5 @@
>  #define PORT_PREF_PREFIX           "network.security.ports."
>  #define PORT_PREF(x)               PORT_PREF_PREFIX x
>  #define AUTODIAL_PREF              "network.autodial-helper.enabled"
>  #define MANAGE_OFFLINE_STATUS_PREF "network.manage-offline-status"
> +#define MANAGE_NETWORK_CHANGE_PREF "network.manage-network-change"

I was going to suggest adding a default value to all.js that disables this for all platforms. But I think it might be better to have it hidden/non-existent first, then only add it for Windows.

@@ +660,5 @@
>  
> +// NetworkDown and NetworkUp do the actual bring down and up activities but do
> +// not notify observers.
> +
> +void nsIOService::NetworkDown()

How about OnNetworkLinkDown/Up?

@@ +688,5 @@
> +    // trigger a PAC reload when we come back online
> +    if (mProxyService)
> +        mProxyService->ReloadPAC();
> +
> +    return NS_OK;

Doesn't look like any other return value is used. If so, just return void.

@@ +712,5 @@
> +        // Network change only, refresh! still UP. Do a quick DOWN/UP
> +        // Checking 'notified' makes us avoid to do this dance when just told
> +        // by the outside world twice in a row to go online.
> +        NetworkDown();
> +        DebugOnly<nsresult> rv = NetworkUp();

I think this Down/Up should be ok. It's a nice code reuse, but I'd like to make sure that this is better than writing new reset code which doesn't set offline status to true for anything.

@@ +1222,2 @@
>  #else
> +            return SetNetworkStatus(false, false);

This is not 100% clear. I think I'd like to see an enum in use here.

enum {
  LINK_OFFLINE,
  LINK_CHANGED,
  LINK_ONLINE
};

SetNetworkStatus(LINK_ONLINE) looks much clearer to me.

@@ +1228,5 @@
>      bool isUp;
>      nsresult rv = mNetworkLinkService->GetIsLinkUp(&isUp);
>      NS_ENSURE_SUCCESS(rv, rv);
> +
> +    return SetNetworkStatus(!isUp, true);

If we use the enums, then the 'offline' and 'mOffline' conditions should be checked here - maybe not the prettiest, but I think it's clearer than two boolean params for three conditions.

::: netwerk/base/src/nsIOService.h
@@ +80,5 @@
>      // - destroy using Release
>      nsIOService() NS_HIDDEN;
>      ~nsIOService() NS_HIDDEN;
>  
> +    NS_HIDDEN_(nsresult) TrackNetworkLinkStatus();

If we're going to rename this, I'd prefer CheckNetworkLinkStatus.

To me, 'Track' suggests that the function will be running in the background, or will register itself for notifications. This function does a one-shot check and updates if necessary; then it returns. I think 'Check' is more accurate.

@@ +104,5 @@
>  
> +    NS_HIDDEN_(nsresult) NetworkUp();
> +    void NetworkDown();
> +
> +    NS_IMETHODIMP SetNetworkStatus(bool offline, bool notified);

How about SetNetworkLinkStatus(newStatus)?

newStatus can be an enum type - see the cpp.
Attachment #8374705 - Flags: feedback+
Blocks: 972262
Attached patch v2 - ioservice.patch (obsolete) — Splinter Review
All very good and appreciated comments! Going with the enum instead of bool improved readability a lot since I could then also drop the extra argument to the function. v2 attached.
Attachment #8374705 - Attachment is obsolete: true
Attachment #8375416 - Flags: feedback?(sworkman)
Attached patch nsNotifyAddrListener-v1.patch (obsolete) — Splinter Review
Here's a glimpse on the state of my work on the Windows-specific side of things. Addressing issue (A) in my original explanation.

This change makes use of NotifyIpInterfaceChange() if built for a new enough Windows version, which should make it IPv6 aware an everything.

NotifyIpInterfaceChange causes _many_ notifications and I haven't figured out why it does so. In my VM I can get a notification per minute for a long time without me being aware of any network changes at all. They're far more frequent than what NotifyAddrChange() generates. This triggered me to do a checksum approach so that we better can ignore subsequent seemingly unnecessary notifications.

I also did some experiments with using InternetGetConnectedState() to figure out the state when notifications arrive, but it turns out it is rather slow in detecting change so I've not used it further.
InternetGetConnectedState() captures something more than IP changes, and we're interested in what it is capturing. so please pursue that path.

its conceivably ok that its slow as long as its async and not blocking anything.
Comment on attachment 8378278 [details] [diff] [review]
nsNotifyAddrListener-v1.patch

Review of attachment 8378278 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/system/win32/nsNotifyAddrListener.h
@@ +71,5 @@
> +    //Static Callback function for NotifyIpInterfaceChange API.
> +
> +    static void WINAPI OnInterfaceChange(PVOID callerContext,
> +                                         PMIB_IPINTERFACE_ROW row,
> +                                         MIB_NOTIFICATION_TYPE notificationType)

Re: the high frequency of notifications, seemingly redundant, you might want to look at these params as well.
Re: InternetGetConnectedState(), I agree with Pat that it could prove to be useful. I did some reading, however, and it seems like in some cases (autodial), it says if the client CAN connect vs if the client IS connected. Or at least you need to careful reading the flags that are returned.

But still potentially useful for things like captive portal and offline state.
Comment on attachment 8375416 [details] [diff] [review]
v2 - ioservice.patch

Review of attachment 8375416 [details] [diff] [review]:
-----------------------------------------------------------------

Getting there. You may have made changes since uploading this version, so attend to my comments only if they are relevant. Some are just to do with Mozilla style (which you can read here if you haven't already -> https://developer.mozilla.org/en-US/docs/Developer_Guide/Coding_Style).

::: netwerk/base/public/nsIIOService2.idl
@@ +26,5 @@
>     * this management.
>     */
>    attribute boolean manageOfflineStatus;
>  
> +  attribute boolean manageNetworkChange;

Description needed here.

::: netwerk/base/src/nsIOService.cpp
@@ +671,5 @@
> +    if (mSocketTransportService)
> +        mSocketTransportService->SetOffline(true);
> +}
> +
> +NS_IMETHODIMP nsIOService::OnNetworkLinkUp()

nsresult

@@ +698,5 @@
> +// Set new status for network link.
> +//
> +
> +NS_IMETHODIMP
> +nsIOService::SetNetworkLinkStatus(linkstatus link)  // new status

LinkStatus aLink

@@ +711,5 @@
> +    if ((link == LINK_CHANGED) && mManageNetworkChange) {
> +        // Network change only, refresh if we manage this
> +        OnNetworkLinkDown();
> +        DebugOnly<nsresult> rv = OnNetworkLinkUp();
> +        if (NS_FAILED(rv)) {

So, this is fine for a WIP to evaluate the Windows-specific part. But this is where Pat was talking about using the observer service to send out a new notification - something like what happens for online->offline and offline->online.

Then we can have STS and DNS register with the observer service to get those notifications and reset internally.

@@ +792,3 @@
>      while (mSetOfflineValue != mOffline) {
>          offline = mSetOfflineValue;
> +        SetNetworkLinkStatus(offline?LINK_OFFLINE:LINK_ONLINE);

Spaces please:
  (offline ? kLinkOffline : kLinkOnline)

If |offline| is not used in the rest of the function, you may be able to just use mSetOfflineValue.

::: netwerk/base/src/nsIOService.h
@@ +102,5 @@
>      void LookupProxyInfo(nsIURI *aURI, nsIURI *aProxyURI, uint32_t aProxyFlags,
>                           nsCString *aScheme, nsIProxyInfo **outPI);
>  
> +    NS_HIDDEN_(nsresult) OnNetworkLinkUp();
> +    void OnNetworkLinkDown();

I don't know about the use of NS_HIDDEN here. I think I understand the reason you added it, it just hasn't been my experience for it's usage to be encouraged - i.e. no-one has told me to use it while I've been here.

Maybe Pat has a better idea.

If you do want to use it, use it for OnNetworkLinkDown too:

void OnNetworkLinkDown() NS_HIDDEN;

@@ +107,5 @@
> +
> +    enum linkstatus {
> +        LINK_OFFLINE,
> +        LINK_ONLINE,
> +        LINK_CHANGED /* link change detected, but still online */

I've been steered towards using kLinkOffline instead of ALL CAPS lately. Only macros are allowed to shout.

Also, LinkStatus instead of linkstatus.

@@ +110,5 @@
> +        LINK_ONLINE,
> +        LINK_CHANGED /* link change detected, but still online */
> +    };
> +
> +    NS_IMETHODIMP SetNetworkLinkStatus(linkstatus link);

NS_IMETHOD - not an implementation here.
But even so, nsresult - NS_IMETHOD{IMP} is usually reserved for interface functions from IDLs.
Attachment #8375416 - Flags: feedback?(sworkman) → feedback+
InternetGetConnectedState() is not really checking "internet" connection with the meaning that anyone sane would include in those words. It is basically just doing a variation of what we're already doing but yeah, adding that to the mix will not hurt. The documentation for the function says:

"indicates that at least one connection to the Internet is available. It does not guarantee that a connection to a specific host can be established"

It goes on to say:

"InternetCheckConnection can be called to determine if a connection to a specific destination can be established."

So I've done experiments with InternetCheckConnection, but it is also a disappointment since both with an explicit URL (to my test site) and with NULL to let it figure out one by itself, my builds running in a Windows7 VM simply always return from that function saying no connection was possible.

It seems Windows7 itself does plain connects to a well-known HTTP site somewhere to detect captive portables: http://blog.superuser.com/2011/05/16/windows-7-network-awareness/

I also found these references to a Network List Manager API => http://msdn.microsoft.com/en-us/library/aa370803%28v=vs.85%29.aspx that sounds like something of interest.
(In reply to Steve Workman [:sworkman] from comment #10)

Thanks for all the feedback, most of your comments I'll just silently absorb into my next version of my patch that'll come later.

> @@ +711,5 @@
> > +    if ((link == LINK_CHANGED) && mManageNetworkChange) {
> > +        // Network change only, refresh if we manage this
> > +        OnNetworkLinkDown();
> > +        DebugOnly<nsresult> rv = OnNetworkLinkUp();
> > +        if (NS_FAILED(rv)) {
> 
> So, this is fine for a WIP to evaluate the Windows-specific part. But this
> is where Pat was talking about using the observer service to send out a new
> notification - something like what happens for online->offline and
> offline->online.
> 
> Then we can have STS and DNS register with the observer service to get those
> notifications and reset internally.

Yes, I fully support that idea and I'll change my code to use that way instead. Then there's the basic question whose job it is to signal "network topology has changed" ? Right now we have this OS-specific layer (nsNotifyAddrListener) than sends NS_NETWORK_LINK_TOPIC as UP or DOWN and I currently put logic into nsIOService that detects UP-UP and consider that a network-topology change.

Do you agree with me that we should then make that also capable of sending "CHANGED" in addition to UP and DOWN, as then we can allow each system work hard to figure that out to the best of each system's ability. (So I'll remove my code from nsIOService again basically)

Then I'll proceed to make STS and DNS observe that and act on their own without having nsIOservive having to tell them.
(In reply to Daniel Stenberg [:bagder] from comment #12)
> Do you agree with me that we should then make that also capable of sending
> "CHANGED" in addition to UP and DOWN, as then we can allow each system work
> hard to figure that out to the best of each system's ability.

Yes - this seems appropriate to me.

> (So I'll remove my code from nsIOService again basically)
> 
> Then I'll proceed to make STS and DNS observe that and act on their own
> without having nsIOservive having to tell them.

If I understood Pat's rationale correctly, the point of the notification was that so anyone could register for them, and we wouldn't need to manage a hard-coded list of listeners. Adding the CHANGED sub-topic should do that.

Maybe we could plan to have DNS and STS listen to UP and DOWN events themselves in a follow-up? This way, DNS, STS and HTTP would all be run-time registered listeners for all the LINK_STATUS notifications. But this is not urgent - there's just something cleaner about it; it decouples DNS, STS and HTTP from nsIOService, at least for these notifications where they don't need to be coupled with it. But it's not something I'm going to argue for right now :)

So, yes, have DNS and nsHTTPConnectionMgr listen for the CHANGED notification. You can probably leave STS as we discussed in the meeting - when HTTP connections are dropped, the relevant sockets will drop too.
Here's a new take on this problem after discussions and previous feedback. The longer summary is here: http://daniel.haxx.se/firefox/networkstat.html

This adds event listeners to the DNS and sockettransport services to act on network changes and removes that same functionality from ioservice.
Attachment #8375416 - Attachment is obsolete: true
Attachment #8378278 - Attachment is obsolete: true
Attachment #8396233 - Flags: feedback?(sworkman)
Comment on attachment 8396233 [details] [diff] [review]
v3-0001-bug-939318-detect-network-interface-changes-on-wi.patch

Review of attachment 8396233 [details] [diff] [review]:
-----------------------------------------------------------------

Cool - some progress on getting the arch solid.

-- Check with Pat re nsHttpConnectionMgr vs SocketTransportService observing the CHANGED topic. I remember him suggesting that the former was better.
-- Check what interfaces you're examining in CheckAdaptersAddresses - looks like gateway and DNS aren't being included.

And just some other comments :)

::: netwerk/base/src/nsSocketTransportService2.cpp
@@ +492,5 @@
>      nsCOMPtr<nsIObserverService> obsSvc = services::GetObserverService();
>      if (obsSvc) {
>          obsSvc->AddObserver(this, "profile-initial-state", false);
>          obsSvc->AddObserver(this, "last-pb-context-exited", false);
> +        obsSvc->AddObserver(this, NS_NETWORK_LINK_TOPIC, false);

I thought you were going to have nsHttpConnectionMgr observe this topic instead of the socket transport service, no? This was Pat's suggestion to have more control over which HTTP connections were closed and which were timed out etc.

::: netwerk/dns/nsDNSService2.cpp
@@ +802,5 @@
>  
> +// data is UP/DOWN/UKNOWN/CHANGED (utf8'ed)
> +void nsDNSService::NetworkChange(const char *state)
> +{
> +    if(!strcmp(state, "CHANGED")) {

I think CHANGED should be implicit for this function. Move the strcmp to Observe.

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +106,5 @@
>  nsNotifyAddrListener::Run()
>  {
>      PR_SetCurrentThreadName("Link Monitor");
> +#ifndef HAS_NOTIFYIPINTERFACECHANGE
> +    // For older Windows versions lacking NotifyIpInterfaceChange

You can decide what you want to do here, but for me, it's a little clearer if it's just:

#ifdef _WIN32_WINNT < WIN32_WINNT_VISTA
   // Windows versions older that Vista lack NotifyIpInterfaceChange.

@@ +138,5 @@
> +    DWORD ret = NotifyIpInterfaceChange(
> +        AF_UNSPEC, // IPv4 and IPv6
> +        (PIPINTERFACE_CHANGE_CALLBACK)OnInterfaceChange,
> +        this,  // pass to callback
> +        true, //  initial notification please

Ah, so this will give us an initial notification, which will set up the checksum. And that's why you can also set initial status to offline. Cool.

@@ +214,1 @@
>   */

Since you're changing this comment, go ahead and put it in the .h file instead.

@@ +361,5 @@
> +    //
> +    // Since NotifyIpInterfaceChange() signals a change more often than we
> +    // think is a worthy change, we checksum the entire state of all interfaces
> +    // that are UP. If the checksum is the same as previous check, nothing
> +    // of interest changed!

So, I guess you never found out what was happening here?

How much of the IP interface information did you check? I see FirstUnicastAddress - maybe it was one of the other's that changed?

Not a huge deal though - checksum is at least useful to avoid CHANGED after UP-DOWN-UP if nothing changed.

@@ +374,3 @@
>              if (ptr->OperStatus == IfOperStatusUp &&
> +                ptr->IfType != IF_TYPE_SOFTWARE_LOOPBACK &&
> +                !CheckIsGateway(ptr)) {

nit: Can you change this condition to its inverse, then 'continue'? Means you don't need to indent do much.

More importantly, I think we need to look at the information requested by GetAdaptersAddresses. I started thinking about this because it looks like gateway addresses are excluded from the checksum here, but it's my understanding that we need to consider those. So, then I looked at CheckIsGateway, and it's just checking 192.168.0.1 and just for IPv4. GetAdaptersAddresses seems to be able to return gateways (at least in Vista+).

Local IP changes
Gateway changes
DNS changes
Default interface/gateway changes

I think you want to checksum all of those, right?

@@ +378,5 @@
> +                int i;
> +
> +                pip = ptr->FirstUnicastAddress;
> +
> +                while(pip) {

while (pip) {

Try running ./mach clang-format over your patch for an auto style cleanup.

@@ +381,5 @@
> +
> +                while(pip) {
> +                    SOCKET_ADDRESS *adr = &pip->Address;
> +                    for(i=0; i < adr->iSockaddrLength; i++)
> +                        sum += ((unsigned char *)adr->lpSockaddr)[i];

Braces please.

::: netwerk/system/win32/nsNotifyAddrListener.h
@@ +65,5 @@
>  
>      HANDLE        mShutdownEvent;
> +
> +private:
> +    ULONG mCsum;

s/mCsum/mIPInterfaceChecksum/

Add a comment too.

@@ +76,5 @@
> +    {
> +        nsNotifyAddrListener *notify = (nsNotifyAddrListener *)callerContext;
> +        notify->CheckLinkStatus();
> +    }
> +#endif

I think I'd prefer all this Vista+ code to be in the cpp. At the moment, this block is only included if HAS_NOTIFYIPINTERFACECHANGE is defined, and that is only defined in the cpp. This creates a chain of macro definitions that could be hard to follow, and could be missed for inclusions outside of the cpp, e.g. nsNetModule.cpp etc.

Since the function is static and only used within the cpp, you can probably keep it there. That should leave the class definition a little cleaner.
Attachment #8396233 - Flags: feedback?(sworkman) → feedback+
(In reply to Steve Workman [:sworkman] from comment #17)

Thanks Steve, I'll just silently absorb your style comments and address them for my next round and here are my responses to functionality comments:

> -- Check with Pat re nsHttpConnectionMgr vs SocketTransportService observing
> the CHANGED topic. I remember him suggesting that the former was better.

Yes thanks, I have a list of things that are subject to act on this event. In this patch I'm moving the existing "actions" over to the distributed event-based approach. It mostly shows the idea and the concept, now I need to polish what to actually do when the event comes.

One of my concerns in general when it comes to this Network Changed event is that it doesn't necessarily have to imply a change that requires _anything_ to be done, although it might and we need to balance that correctly.

I assume nsHttpConnectionMgr::PruneDeadConnections() is what we should use for this?

> -- Check what interfaces you're examining in CheckAdaptersAddresses - looks
> like gateway and DNS aren't being included.

This is pretty much on purpose. The idea is to note the network interfaces that are up and react on changes to those. We don't specifically keep track of routing table changes or DNS server config changes. Each IP_ADAPTER_ADDRESSES struct has FirstGatewayAddress etc but I'm not including that in the checksum for now.

I think changes to the DNS or gateway entries in the interfaces with everything else remaining, is a change that is *supposed* to be treated as the same topology.

> Not a huge deal though - checksum is at least useful to avoid CHANGED after
> UP-DOWN-UP if nothing changed.

Ideally, yes. Unfortunately in my tests that I've been doing I disable/enable the network adapter in the Windows control panel thing and when I do that I get several notifications when the interfaces go back up and since they go up independently from each other and not equally fast (there's at least a physical interface and a tunnel interface) and they will pop up in a different order and with some time interval so it inevitably leads to different checksums anyway.

I haven't figured out a good way to address this once and for all, but I'm also not sure this is a problem for when switching between interfaces/wifi etc in a live situation. More testing is required.

> More importantly, I think we need to look at the information requested by
> GetAdaptersAddresses. I started thinking about this because it looks like
> gateway addresses are excluded from the checksum here, but it's my
> understanding that we need to consider those. So, then I looked at
> CheckIsGateway, and it's just checking 192.168.0.1 and just for IPv4.
> GetAdaptersAddresses seems to be able to return gateways (at least in Vista+).

I left the CheckIsGateway() in there because I figured it has to have been added that for an actual reason. Bug #465158 explains why it "has to" be like that, and I'm not knowledgeable enough about those details to question the conclusions there. I think the function name is misleading since it makes a reader think it checks for a generic gateway when in reality it checks for a very special kind of "ICS" case - presumably that's a windows XP only thing too.

I've been experimenting with leaving out that check (including the much more complicated sub-check CheckICSStatus) and while my test cases are limited still I don't see any downsides for me on Windows 7.
(In reply to Daniel Stenberg [:bagder] from comment #18)
> (In reply to Steve Workman [:sworkman] from comment #17)
> 
> Thanks Steve, I'll just silently absorb your style comments and address them
> for my next round ...

Sounds good to me :)

> and here are my responses to functionality comments:
> 
> > -- Check with Pat re nsHttpConnectionMgr vs SocketTransportService observing
> > the CHANGED topic. I remember him suggesting that the former was better.
> 
> Yes thanks, I have a list of things that are subject to act on this event.
> In this patch I'm moving the existing "actions" over to the distributed
> event-based approach. It mostly shows the idea and the concept, now I need
> to polish what to actually do when the event comes.

Ah I see. Understood. Focusing on that for now. Please use a shortened version of that in the patch description for the next version.

> One of my concerns in general when it comes to this Network Changed event is
> that it doesn't necessarily have to imply a change that requires _anything_
> to be done, although it might and we need to balance that correctly.

Interesting. So you mean something like DNS might be ok, but sockets should be reset? Should there be more than one type of CHANGED event? Or perhaps an API in nsIOService so DNSService (for ex.) could query what has changed?

> I assume nsHttpConnectionMgr::PruneDeadConnections() is what we should use
> for this?

Not 100% sure on this, but that might be one of the functions to call. I think some of the active connections need to be reset too.

> > -- Check what interfaces you're examining in CheckAdaptersAddresses - looks
> > like gateway and DNS aren't being included.
> 
> This is pretty much on purpose. The idea is to note the network interfaces
> that are up and react on changes to those. We don't specifically keep track
> of routing table changes or DNS server config changes. Each
> IP_ADAPTER_ADDRESSES struct has FirstGatewayAddress etc but I'm not
> including that in the checksum for now.
> 
> I think changes to the DNS or gateway entries in the interfaces with
> everything else remaining, is a change that is *supposed* to be treated as
> the same topology.

I see. But I'm not clear on what you mean by a topology change then. I think we should define a list of the events we want to detect and then map them to the information we can get here.

For example, suppose a user changes their network settings to use wifi instead of ethernet as the default interface. They could be on different networks with different DNS so we should reset the DNS cache, and probably reset the connections, no? The checksum wouldn't be different in this case, right?

It's very possible I don't understand this correctly, or I'm thinking along different lines.

Oh, and if this is for a follow up patch, that's ok too, but I think the list needs some discussion first.

> I haven't figured out a good way to address this once and for all, but I'm
> also not sure this is a problem for when switching between interfaces/wifi
> etc in a live situation. More testing is required.

Yay testing! ;)

> > More importantly, I think we need to look at the information requested by
> > GetAdaptersAddresses. I started thinking about this because it looks like
> > gateway addresses are excluded from the checksum here, but it's my
> > understanding that we need to consider those. So, then I looked at
> > CheckIsGateway, and it's just checking 192.168.0.1 and just for IPv4.
> > GetAdaptersAddresses seems to be able to return gateways (at least in Vista+).
> 
> I left the CheckIsGateway() in there because I figured it has to have been
> added that for an actual reason. Bug #465158 explains why it "has to" be
> like that, and I'm not knowledgeable enough about those details to question
> the conclusions there. I think the function name is misleading since it
> makes a reader think it checks for a generic gateway when in reality it
> checks for a very special kind of "ICS" case - presumably that's a windows
> XP only thing too.
> 
> I've been experimenting with leaving out that check (including the much more
> complicated sub-check CheckICSStatus) and while my test cases are limited
> still I don't see any downsides for me on Windows 7.

I see. More testing! It makes sense that with the code being old and not being used for a long time that there are things that might not apply any more. Let's see what happens here.
(In reply to Steve Workman [:sworkman] from comment #19)

> I see. But I'm not clear on what you mean by a topology change then. I think
> we should define a list of the events we want to detect and then map them to
> the information we can get here.
> 
> For example, suppose a user changes their network settings to use wifi
> instead of ethernet as the default interface. They could be on different
> networks with different DNS so we should reset the DNS cache, and probably
> reset the connections, no? The checksum wouldn't be different in this case,
> right?

DNS server is generally not a property that is tied to a specific interface though. We set the DNS server as a machine global setting. If I change the DNS, my host will just ask another server for the address. That's not an interface change so it won't be noticed by this code.

Further: I would imagine that the normal use case is not having both an Ethernet and wifi interface available and working at the same time. If you do, it's basically the routing table and default gateway that decide which interface to use. If you then change the routing table so that you suddenly use the other interface for a remote host you already communicated with, how do you detect that? It's not an interface change either. We'd have to subscribe to routing table changes to discover that. Which we don't (yet).

I think the majority of network related problems we get reports on, are people switching from one network to another - these days that's mostly switching between different wifi access points. That will almost always involve either a new IP address and/or a new set of network interfaces like when VPNs/tunnels are setup or torn down. Also, disconnecting a cable from one network and inserting a different one in another network should work similarly.

> Oh, and if this is for a follow up patch, that's ok too, but I think the
> list needs some discussion first.

My position is that we wait and hold off detecting routing table and DNS server changes. I think we're already on track to automatically fix a lot of problems and I would rather have us get the slightly smaller approach merged first and then once we get a feel for what remaining problems we actually have in the real-world we add more efforts to detect the remaining problematic cases.

Of course I'm all ears for different opinions and I'll adapt accordingly!

> > I left the CheckIsGateway() in there because I figured it has to have been
> > added that for an actual reason. Bug #465158 explains why it "has to" be
> > like that,

> I see. More testing! It makes sense that with the code being old and not
> being used for a long time that there are things that might not apply any
> more. Let's see what happens here.

I'll rename that function at least so that it gets slightly more obvious that the function is there to detect a very specific kind of gateway.
I'm attaching v4 of this patch. Changes this time:

- minor style changes after :sworkman's comments

- moved away the event listener from socketransport service to the HTTP handler, as I believe was how :mcmanus suggested it a while ago. I'd like to get your comment on that piece Patrick!

Getting closer I think.
Attachment #8396233 - Attachment is obsolete: true
Attachment #8399397 - Flags: feedback?(mcmanus)
(In reply to Daniel Stenberg [:bagder] from comment #20)
> (In reply to Steve Workman [:sworkman] from comment #19)
> Further: I would imagine that the normal use case is not having both an
> Ethernet and wifi interface available and working at the same time...

I see what you're saying. But, I'm wondering about the cases where there is a Wifi connection and a cellular connection. In saying that though, this bug has become focused on the Windows desktop case only, so Wifi+Cellular which is more Android and FxOS focused is probably outside of the "normal use case" for desktop as you put it. Once we do look at those platforms some verification might be needed to confirm that IP change detection is enough.

For example: I'm imagining that both interfaces could be up and have IP addresses, but the OS switches the default to better use the radio? But that's probably better served in another bug to do with mobile platforms.

> I think the majority of network related problems we get reports on, are
> people switching from one network to another - these days that's mostly
> switching between different wifi access points. That will almost always
> involve either a new IP address and/or a new set of network interfaces like
> when VPNs/tunnels are setup or torn down. Also, disconnecting a cable from
> one network and inserting a different one in another network should work
> similarly.

Yup, I would agree with that. At least, focusing on such changes (Wifi to Wifi, VPN to non-VPN etc. as you say) would seem to get us the greatest benefit now. And, there's an element of refactoring to your patches that will allow us to deal with more cases in follow ups.

> My position is that we wait and hold off detecting routing table and DNS
> server changes. I think we're already on track to automatically fix a lot of
> problems and I would rather have us get the slightly smaller approach merged
> first and then once we get a feel for what remaining problems we actually
> have in the real-world we add more efforts to detect the remaining
> problematic cases.
> 
> Of course I'm all ears for different opinions and I'll adapt accordingly!

Yeah, I think that approach is a good idea. Apologies for being the annoying co-worker asking for the world and more! :)
(In reply to Steve Workman [:sworkman] from comment #22)

> For example: I'm imagining that both interfaces could be up and have IP
> addresses, but the OS switches the default to better use the radio? But
> that's probably better served in another bug to do with mobile platforms.
> 

According to https://bugzilla.mozilla.org/show_bug.cgi?id=786419#c28 there could be more than one interface in a connected state at the same time. There are a few more details about the way it works on B2G in the linked bug.
Status update:

Testing this patch properly requires a "real" windows install since switching between networks and/or wifi access points isn't really that close to the real thing when running windows in a vm (which is what I develop this on).

In my attempts to do this, I've unfortunately run into bug #992739
It wasn't a bug in mach, it is a bug in this patch. It causes 'mach package' to hang.
(In reply to Daniel Stenberg [:bagder] from comment #21)
> Created attachment 8399397 [details] [diff] [review]
> v4-0001-bug-939318-detect-network-interface-changes-on-wi.patch

> - moved away the event listener from socketransport service to the HTTP
> handler, as I believe was how :mcmanus suggested it a while ago. I'd like to
> get your comment on that piece Patrick!
> 
> Getting closer I think.

by "moved away" you mean wrt an earlier patch right? I don't see any part of the diff remving socket transport code.

if so, that seems right - yes. We can probably make the actions a bit stronger here too (starting timeouts or the like on existing connections) but it seems to be going in the right direction
Attachment #8399397 - Flags: feedback?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #27)

> by "moved away" you mean wrt an earlier patch right? I don't see any part of
> the diff remving socket transport code.
> 
> if so, that seems right - yes.

Yes, exactly. Sorry for not stating it clearer but that's what I meant.
Blocks: 886487
Blocks: 756364
New patch version. This version brings the following functional changes:

- nsNotifyAddr for windows sends the events: UP, DOWN and CHANGED for "network link status".

- nsIOService no longer has "manage offline mode" to act on network events and it doesn't tell DNS nor SocketTransport how to act on network stat events

- nsIOService listens to network events and acts on DOWN (for the autodial magic which is Windows specific stuff)

- DNSservice listens to network events and acts on CHANGED. It flushes the cache.

- HTTPhandler listens to network events and acts on CHANGED.

Since this patch intends to fix problems (on Windows to start with) when switching between network conditions, like between wifis, between captive portal and not and more, I will appreciate feedback if someone test drives this a bit in the wild and report back - especially if you've had problems in the past. It is hard to write up good synthetic tests for this functionality.

I would like to see this patch land first for windows-only, then work on adding better network CHANGE detection for other platforms and see what other parts that need to act on the CHANGE event properly to improve further.
Attachment #8399397 - Attachment is obsolete: true
Blocks: 987802
Here's the try server results for this patch (I rebased it just before the push): https://tbpl.mozilla.org/?tree=Try&rev=976834fce528
:mcmanus and :sworkman, can you help out a newbie here, what's left for me here to get this pushed?
Comment on attachment 8405280 [details] [diff] [review]
v7-0001-bug-939318-detect-network-interface-changes-on-wi.patch

Review of attachment 8405280 [details] [diff] [review]:
-----------------------------------------------------------------

I think your list in the description for this patch should get us to landing this. But I have review comments ;)

The following is my wording of your description to make sure we're on the same page - please comment/respond as needed:

-- Add support for NotifyIpInterfaceChange for Win Vista+ to enable network link change detection for IPv6 links.

-- Add NS_NETWORK_LINK_DATA_CHANGED to global notifications.
-- Remove direct calls to DNS Service and Socket Transport Service from nsIOService.
-- Add listening in DNS Service and HTTP Handler to deal with NS_NETWORK_LINK_DATA_CHANGED notifications. 
==> Results in nsIOService being decoupled from other network services (at least as far as CHANGED is concerned).

-- Disable 'manage-offline-status' for Windows only.
==> This is different from your patch in which you remove it completely. I'm not sure that we want to do that, primarily because the pref is enabled for Android, which makes me concerned that there are apps/sites and other parts of our code that may be dependent on Necko setting offline status on Android. Apologies if I gave the impression earlier that this code should be removed - just disabled or bypassed for Windows is what I think we really want right now.

-- Have DNS Service reset its cache upon NS_NETWORK_LINK_DATA_CHANGED notifications.

-- Have HTTP Handler drop connections appropriately. I see you're using some of the ShiftReload mechanism and PruneDeadConnections - I need to look into this in more detail, but I think that's moving along the right tracks.


After looking through the patch, I only see two things that need to be addressed (besides general patch comments):
1. Bypassing 'manage-offline-status' for Windows only.
2. Making sure that HTTP handler deals with CHANGED appropriately.

::: netwerk/base/src/nsIOService.cpp
@@ +1060,5 @@
> +                // On Windows, we should first check with the OS to see if
> +                // autodial is enabled.  If it is enabled then we are allowed
> +                // to manage the offline state.
> +                if (nsNativeConnectionHelper::IsAutodialEnabled()) 
> +                    return SetOffline(false);

Since you're indenting the code here, please add braces around this single line if.

@@ +1065,2 @@
>  #else
> +                return SetOffline(false);

These double negative functions always confuse me :) So, while you're making the other changes, could you change this to be two functions: SetOnline() and SetOffline()? - no params. It'll be a nice change for quicker reading of the code.

But don't let it block you from the main part of the work.

::: netwerk/base/src/nsIOService.h
@@ +80,5 @@
>      // - destroy using Release
>      nsIOService() NS_HIDDEN;
>      ~nsIOService() NS_HIDDEN;
>  
> +    NS_HIDDEN_(nsresult) NetworkLinkEvent(const char *data);

Please rename to OnNetworkLinkEvent.

::: netwerk/dns/nsDNSService2.cpp
@@ +512,2 @@
>              observerService->AddObserver(this, "last-pb-context-exited", false);
> +            observerService->AddObserver(this, NS_NETWORK_LINK_TOPIC, false);

Just to clarify, the DNS Service will listen for NS_NETWORK_LINK_TOPIC for CHANGED notifications. But it will still get online/offline status from nsIOService, right?

@@ +805,5 @@
>  {
> +    if (!strcmp(topic, NS_NETWORK_LINK_TOPIC)) {
> +        const char *state = NS_ConvertUTF16toUTF8(data).get();
> +        if (!strcmp(state, NS_NETWORK_LINK_DATA_CHANGED)) {
> +            // flush the cache

Flush the cache by shutting down and re-initializing.

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1839,5 @@
> +        const char *state = NS_ConvertUTF16toUTF8(data).get();
> +        if (strcmp(state, NS_NETWORK_LINK_DATA_CHANGED) == 0) {
> +            if (mConnMgr) {
> +                mConnMgr->DoShiftReloadConnectionCleanup(nullptr);
> +                mConnMgr->PruneDeadConnections();

I need to review this in more detail, but ShiftReload seems a good place to start.

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +102,5 @@
>    return NS_OK;
>  }
>  
> +#ifdef HAS_NOTIFYIPINTERFACECHANGE
> +//Static Callback function for NotifyIpInterfaceChange API.

nit: Add space - "// Static..."

@@ +106,5 @@
> +//Static Callback function for NotifyIpInterfaceChange API.
> +static void WINAPI OnInterfaceChange(PVOID callerContext,
> +                                     PMIB_IPINTERFACE_ROW row,
> +                                     MIB_NOTIFICATION_TYPE notificationType)
> +{

Do you know what thread this runs on? Not urgent, but we could add a thread check at some point.

@@ +107,5 @@
> +static void WINAPI OnInterfaceChange(PVOID callerContext,
> +                                     PMIB_IPINTERFACE_ROW row,
> +                                     MIB_NOTIFICATION_TYPE notificationType)
> +{
> +    nsNotifyAddrListener *notify = (nsNotifyAddrListener *)callerContext;

static_cast<nsNotifyAddrListener*>(callerContext);

@@ +117,5 @@
>  nsNotifyAddrListener::Run()
>  {
>      PR_SetCurrentThreadName("Link Monitor");
> +#ifndef HAS_NOTIFYIPINTERFACECHANGE
> +    // For older Windows versions lacking NotifyIpInterfaceChange

// For Windows versions which are older than Vista which lack NotifyIpInterfaceChange. Note this means no IPv6 support.

@@ +142,5 @@
>              shuttingDown = true;
>          }
>      }
>      CloseHandle(ev);
> +#else

// Windows Vista and newer versions.

@@ +250,5 @@
>  
> +
> +// Bug 465158 features an explanation for this check. ICS being "Internet
> +// Connection Sharing). The description says it is always IP address
> +// 192.168.0.1 for this case.

Nice explanation. Thank you.

@@ +357,4 @@
>  
>      PIP_ADAPTER_ADDRESSES addresses = (PIP_ADAPTER_ADDRESSES) malloc(len);
>      if (!addresses)
>          return ERROR_OUTOFMEMORY;

For these mallocs we should file a bug to change them to their infallible equivalents (i.e. moz_xmalloc etc).

@@ +366,3 @@
>      if (ret == ERROR_BUFFER_OVERFLOW) {
>          free(addresses);
>          addresses = (PIP_ADAPTER_ADDRESSES) malloc(len);

static_cast<PIP_ADAPTER_ADDRESSES>(...)

@@ +387,2 @@
>      if (ret == ERROR_SUCCESS) {
>          PIP_ADAPTER_ADDRESSES ptr;

Declare ptr in for loop.

@@ +387,5 @@
>      if (ret == ERROR_SUCCESS) {
>          PIP_ADAPTER_ADDRESSES ptr;
>          bool linkUp = false;
>  
> +        for (ptr = addresses; ptr; ptr = ptr->Next) {

Let's rename these.

s/ptr/adapter/
s/addresses/adapterList/

I think this is a bit clearer and accurate enough for our purposes.

@@ +396,5 @@
> +                continue;
> +            }
> +
> +            PIP_ADAPTER_UNICAST_ADDRESS pip;
> +            int i;

Declare the int in the for loop statement.
Declare the PIP_... at assignment time.
s/pip/unicastAddr/

@@ +398,5 @@
> +
> +            PIP_ADAPTER_UNICAST_ADDRESS pip;
> +            int i;
> +
> +            for (i=0; ptr->AdapterName[i]; i++) {

// Add chars from AdapterName to the checksum.

@@ +403,5 @@
> +                sum <<= 2;
> +                sum += ptr->AdapterName[i];
> +            }
> +
> +            pip = ptr->FirstUnicastAddress;

// Add bytes from each socket address to the checksum.

@@ +405,5 @@
> +            }
> +
> +            pip = ptr->FirstUnicastAddress;
> +            while (pip) {
> +                SOCKET_ADDRESS *adr = &pip->Address;

s/adr/sockAddr/

@@ +407,5 @@
> +            pip = ptr->FirstUnicastAddress;
> +            while (pip) {
> +                SOCKET_ADDRESS *adr = &pip->Address;
> +                for (i=0; i < adr->iSockaddrLength; i++) {
> +                    sum += ((unsigned char *)adr->lpSockaddr)[i];

static_cast<unsigned char*>(...)

@@ +421,5 @@
>  
> +    if (mLinkUp) {
> +        /* Store the checksum only if one or more interfaces are up */
> +        mIPInterfaceChecksum = sum;
> +    }

So, if we move from UP to DOWN to UP, then the checksum will be compared to its previous UP state only. Cool.

@@ +465,3 @@
>  
> +        if (prevLinkUp != mLinkUp) {
> +            // UP/DOWN status changed, send the event

// ...send appropriate UP/DOWN event.

@@ +468,5 @@
> +            SendEvent(mLinkUp ?
> +                      NS_NETWORK_LINK_DATA_UP : NS_NETWORK_LINK_DATA_DOWN);
> +        }
> +        if (mLinkUp && (prevCsum != mIPInterfaceChecksum)) {
> +            // topology changed!

// Notify topology has changed.

::: netwerk/system/win32/nsNotifyAddrListener.h
@@ +14,5 @@
> +#include <ws2ipdef.h>
> +#include <tcpmib.h>
> +#include <iphlpapi.h>
> +#include <netioapi.h>
> +#endif

I don't think we need these for the class definition, do we? Please move any required ones into the cpp.

@@ +62,1 @@
>      bool  CheckICSStatus(PWCHAR aAdapterName);

Please add a comment above CheckICSGateway to say what ICS is.

@@ +68,5 @@
> +private:
> +    // this is a checksum of various meta data for all network interfaces
> +    // considered UP at last check
> +    ULONG mIPInterfaceChecksum;
> +

nit: Remove extra line please. Capitalize first word. Add period. :)
Attachment #8405280 - Flags: feedback+
(In reply to Steve Workman [:sworkman] from comment #34)

Thanks for the review! All fair remarks.

> -- Disable 'manage-offline-status' for Windows only.
> ==> This is different from your patch in which you remove it completely. I'm
> not sure that we want to do that, primarily because the pref is enabled for
> Android, which makes me concerned that there are apps/sites and other parts
> of our code that may be dependent on Necko setting offline status on
> Android. Apologies if I gave the impression earlier that this code should be
> removed - just disabled or bypassed for Windows is what I think we really
> want right now.

I recognize and understand your caution here. I agree that I need to do something to make sure the manage-offline-status use cases keep working.

I'm not sure your suggestion of only removing that functionality for windows is the ideal way forward.

A) That would then still have nsIOService being the hub that would decide what to do or not when going offline/online instead of the logic being suitable distributed to each individual service to each decide on their own what needs to be done.

B) It would make the logic in nsIOservice (and elsewhere) fairly complicated and potentially hard to follow. If manage-offline-status should remain, then I think it should be kept for all platforms.

I suggest this slightly modified way forward and I'll of course greatly appreciate your feedback and opinon:

1 - I put back 'manage-offline-status' so that it is used to set the "work offline mode" automatically when the network status goes UP or DOWN. In case there's something somewhere that cares about the mode being set.

2 - On platforms such as Android that that has manage-offline-status enabled by default, we make sure that the CHANGED event is sent every time the network goes UP (until we figure out how to not do it when it has indeed not changed). That should make HTTPhandler and DNSservice act correctly - as in it would take the safe approach and always flush caches much in the way DOWN/UP-events do on for example Android today (I would probably also include Linux to do the same). It would have the additional side-effect or improvement, depending on the situation, that if Android would detect a network event when already UP it would do a CHANGED for a case that currently wouldn't do anything.

3 - I would like the setoffline() handling in nsIOService to *not* touch DNS or sockettransportservice, in fact I'd like to remove that completely from there.

4 - File individual separate bugs for Android and Linux (dbus) to have them optimize away the case where UP => UP happens and the topology did in fact not change (in case it actually even can occur).
While discussing the final bits on how to deal with manage-offline-status for non-windows, here's the updated patch after addressing review remarks from sworkman.
Attachment #8405280 - Attachment is obsolete: true
Comment on attachment 8413735 [details] [diff] [review]
v8-0001-bug-939318-detect-network-interface-changes-on-wi.patch

Review of attachment 8413735 [details] [diff] [review]:
-----------------------------------------------------------------

Cool. Thanks for making those changes. Some more comments. Also, please put feedback?sworkman or r?sworkman on the next patch, depending on what you want :)

(In reply to Daniel Stenberg [:bagder] from comment #35)
> I'm not sure your suggestion of only removing that functionality for windows
> is the ideal way forward.
> 
> A) That would then still have nsIOService being the hub that would decide
> what to do or not when going offline/online instead of the logic being
> suitable distributed to each individual service to each decide on their own
> what needs to be done.

For now, I think this is ok. We're working on this incrementally; the endgame is for these components to be nicely decoupled, but for now I think it's ok to make one step forward.

> B) It would make the logic in nsIOservice (and elsewhere) fairly complicated
> and potentially hard to follow. If manage-offline-status should remain, then
> I think it should be kept for all platforms.

Complicated/ugly is ok for now. #ifdef WIN_XP for all the code in nsIOService that deals with CHANGED isn't pretty, but we can remove it later.

> I suggest this slightly modified way forward and I'll of course greatly
> appreciate your feedback and opinon:
> 
> 1 - I put back 'manage-offline-status' so that it is used to set the "work
> offline mode" automatically when the network status goes UP or DOWN. In case
> there's something somewhere that cares about the mode being set.

I think we need to do this, yup. I had been thinking we could just do "manage-offline-status" for Android, but I bet there are folks out there who have enabled that pref on Win or other platforms. If possible, we should try to keep CHANGED from changing (hehe) the handling of that pref.

> 2 - On platforms such as Android that that has manage-offline-status enabled
> by default, we make sure that the CHANGED event is sent every time the
> network goes UP (until we figure out how to not do it when it has indeed not
> changed). That should make HTTPhandler and DNSservice act correctly - as in
> it would take the safe approach and always flush caches much in the way
> DOWN/UP-events do on for example Android today (I would probably also
> include Linux to do the same). It would have the additional side-effect or
> improvement, depending on the situation, that if Android would detect a
> network event when already UP it would do a CHANGED for a case that
> currently wouldn't do anything.

I'm fine with doing this, but as a follow-up. UP-UP as CHANGED seems like a decent intermediary step, but it will require more testing, and that testing could block landing of this patch. I think we have more to gain by landing the patch now and letting the code bake on Nightly for Windows while we do the FxOS/Android follow-up(s).

> 3 - I would like the setoffline() handling in nsIOService to *not* touch DNS
> or sockettransportservice, in fact I'd like to remove that completely from
> there.

Indeed. I think we're on the same page with where this ends up, i.e. more decoupling. But I don't think we can afford to remove this yet until DNS, STS or HTTP are observing the right topics and we're matching current behavior at least. I'd like it to be another follow-up.

> 4 - File individual separate bugs for Android and Linux (dbus) to have them
> optimize away the case where UP => UP happens and the topology did in fact
> not change (in case it actually even can occur).

Sounds good to me. Let's make sure FxOS is first, then Android and Linux. If solving one solves all three, great; if not, we should work in that order of priority.

So, this bug has many comments and is quite squarely focused on Windows right now. Let's change the description to say Windows only and open new bugs for the follow-ups. If it's helpful, a new meta bug can be created for the general topic of enabling NetworkLinkService, with all these follow-ups as blockers.

How does this all sound? Are we on the same page?

::: netwerk/dns/nsDNSService2.cpp
@@ +858,5 @@
> +            // Flush the cache by shutting down and re-initializing.
> +            if (mResolver) {
> +                Shutdown();
> +            }
> +            Init();

Looks like this is the same process as what happens for the other two topics observed here (NS_PREFBRANCH_PREFCHANGE_TOPIC_ID and "last-pb-context-exited"). So, we should probably combine those....

// pseudocode...

NS_ASSERTION(topic == PREFCHANGE || "last-pb..." || NETWORK_LINK, "unexpected observe call");

if (topic == NETWORK_LINK) {
  nsDependentString subTopic(data);
  if (data.EqualsLiteral(CHANGED)) {
    // Ignore all other data items for this topic.
    return NS_OK;
  }
}

if (mResolver) { ... etc.

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +369,2 @@
>      if (ret == ERROR_BUFFER_OVERFLOW) {
> +        free(adapterList);

moz_free. It maps to free currently, but just in case a change is made in the future, it's better to pair it with moz_xmalloc now.

@@ +390,3 @@
>          bool linkUp = false;
>  
> +        for (PIP_ADAPTER_ADDRESSES adapter = adapterList; adapter; adapter = adapter->Next) {

nit: 80 chars max per line :)

@@ +408,5 @@
> +                 pip; pip = pip->Next) {
> +                SOCKET_ADDRESS *sockAddr = &pip->Address;
> +                for (int i=0; i < sockAddr->iSockaddrLength; i++) {
> +                    // static_cast<> doesn't work here
> +                    sum += ((unsigned char *)sockAddr->lpSockaddr)[i];

Oh. Out of interest, why not?

@@ +453,5 @@
> +        if (mStatusKnown) {
> +            event = mLinkUp ? NS_NETWORK_LINK_DATA_UP :
> +                NS_NETWORK_LINK_DATA_DOWN;
> +        }
> +        else {

nit: 

} else {
Attachment #8413735 - Flags: feedback+
:sworkman: yes sure, we're certainly on the same page. I have a refreshed patch here now, but I ran into some (unrelated) build issues on Windows so I've not yet been able to verify my changes locally. I'll be back with a patch with a review flag "soon"! ;-)
(In reply to Daniel Stenberg [:bagder] from comment #38)
> :sworkman: yes sure, we're certainly on the same page. I have a refreshed
> patch here now, but I ran into some (unrelated) build issues on Windows so
> I've not yet been able to verify my changes locally. I'll be back with a
> patch with a review flag "soon"! ;-)

Awesome - looking forward to it!
Another iteration. Addresses review comments and now network.manage-offline-status can't be enabled on windows but remains functional on other platforms.
Attachment #8413735 - Attachment is obsolete: true
Attachment #8417204 - Flags: review?(sworkman)
Comment on attachment 8417204 [details] [diff] [review]
v9-0001-bug-939318-detect-network-interface-changes-on-wi.patch

Review of attachment 8417204 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/src/nsIOService.cpp
@@ +1059,5 @@
>  
> +#ifndef XP_WIN
> +    // As per bug 939318 we switch off the manage-offline-status functionality
> +    // for Windows builds. It was disabled by default. Only non-windows builds
> +    // can enable the mode here.

Ah, so I'm not so sure about this. For those users who have enabled manage-offline on Windows, this effectively disables that.

Looking at this again, I don't think we need to worry about the #ifndef here. CHANGED is only created and sent from nsNotifyAddrListener, and that class/file only exists on Windows. So, we have automatic isolation from other platforms.

As far as conflicting with manage-offline on Win:
1. We should be good for our default setting. manage-offline is disabled, so for the majority, it should be fine.
2. On systems where users have enabled manage-offline, it should still be ok. We might be notifying UP/DOWN and CHANGED, so there would be a bit of redundancy, but it should be ok. There might even be a way to optimize this in nsNotifyAddrListener. (I added a suggestion near that part of the patch).

We're kinda coming full circle here and it looks like nsIOService might not need many changes. (The cleanup and rename changes are nice though, so I think we should leave them).

So, when we get to landing, the communication should be something like:

-- nsNotifyAddrListener
   creates and sends UP, DOWN and CHANGED notifications.
-- nsIOService
   listens for UP and DOWN, but only acts if manage-offline is enabled.
-- DNS and STS
   called by nsIOService for UP and DOWN (we can clean this up eventually).
-- DNS and HTTP
   listen for CHANGED.

Once we get more platforms notifying CHANGED, we can start to pull UP and DOWN out of nsIOService.

@@ +1099,5 @@
>  
>      if (mShutdown)
>          return NS_ERROR_NOT_AVAILABLE;
> +
> +    if (!strcmp(data, NS_NETWORK_LINK_DATA_DOWN)) {

Looks like this should be wrapped with if (mManageOfflineStatus) as well, no?

::: netwerk/dns/nsDNSService2.cpp
@@ +857,3 @@
>      NS_ASSERTION(strcmp(topic, NS_PREFBRANCH_PREFCHANGE_TOPIC_ID) == 0 ||
> +                 strcmp(topic, "last-pb-context-exited") == 0 ||
> +                 strcmp(topic, NS_NETWORK_LINK_TOPIC),

== 0

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +455,5 @@
> +        if (mStatusKnown) {
> +            event = mLinkUp ? NS_NETWORK_LINK_DATA_UP :
> +                NS_NETWORK_LINK_DATA_DOWN;
> +        } else {
> +            event = NS_NETWORK_LINK_DATA_UNKNOWN;

We should make sure that this branch running on the main thread matches the off-main-thread branch below, i.e. can we send CHANGED here too? If not, please add a comment explaining why not.

@@ +470,5 @@
> +            // UP/DOWN status changed, send appropriate UP/DOWN event
> +            SendEvent(mLinkUp ?
> +                      NS_NETWORK_LINK_DATA_UP : NS_NETWORK_LINK_DATA_DOWN);
> +        }
> +        if (mLinkUp && (prevCsum != mIPInterfaceChecksum)) {

Can this be:

if (mLinkup &&
    mLinkup == prevLinkUp &&
    prevCsum != mIPInterfaceChecksum) { 
...

?

Looks like that might avoid sending UP and CHANGED at the same time, no?
Attachment #8417204 - Flags: review?(sworkman) → review-
(In reply to Steve Workman [:sworkman] from comment #41)

> Ah, so I'm not so sure about this. For those users who have enabled
> manage-offline on Windows, this effectively disables that.

Yes, that's what I intended. Primarily because having both the new CHANGED actions and the old manage-offline logic makes a very strange combination (to me) and double work that at best leads to doing almost the same thing multiple times and at worst leads to some subtle side-effects. Are you happy with doing "both" activities for people with manage-offline enabled?
(In reply to Daniel Stenberg [:bagder] from comment #42)
> Yes, that's what I intended. Primarily because having both the new CHANGED
> actions and the old manage-offline logic makes a very strange combination
> (to me) and double work that at best leads to doing almost the same thing
> multiple times and at worst leads to some subtle side-effects. Are you happy
> with doing "both" activities for people with manage-offline enabled?

Yes, because it's a pref that they have to explicitly enable, so they're taking their browser settings into their own hands. And it's only on Windows. And the long term plan is to look at/fix manage-offline too.

However, what kind of side effects are you talking about? Are they mitigated by not sending CHANGED with UP? I added a suggestion at the end of comment 41 about how that might be accomplished in nsNotifyAddrListener.
(In reply to Steve Workman [:sworkman] from comment #41)

> @@ +1099,5 @@
> >  
> >      if (mShutdown)
> >          return NS_ERROR_NOT_AVAILABLE;
> > +
> > +    if (!strcmp(data, NS_NETWORK_LINK_DATA_DOWN)) {
> 
> Looks like this should be wrapped with if (mManageOfflineStatus) as well, no?

Ah yes.

It had that check before so yes that would be consistent with how it has been working up until now. I do however think it is overloading the mManageOfflineStatus pref with more functionality than just "managing offline status" so I'm not sure I agree it should have the check. The autodial stuff is controlled by its own pref anyway and I think that's the only pref to check for that condition.

It begs the question about how many users that are actually (successfully) using autodial...

> ::: netwerk/dns/nsDNSService2.cpp
> @@ +857,3 @@
> >      NS_ASSERTION(strcmp(topic, NS_PREFBRANCH_PREFCHANGE_TOPIC_ID) == 0 ||
> > +                 strcmp(topic, "last-pb-context-exited") == 0 ||
> > +                 strcmp(topic, NS_NETWORK_LINK_TOPIC),
> 
> == 0

Ugh, very good catch!

> ::: netwerk/system/win32/nsNotifyAddrListener.cpp
> @@ +455,5 @@
> > +        if (mStatusKnown) {
> > +            event = mLinkUp ? NS_NETWORK_LINK_DATA_UP :
> > +                NS_NETWORK_LINK_DATA_DOWN;
> > +        } else {
> > +            event = NS_NETWORK_LINK_DATA_UNKNOWN;
> 
> We should make sure that this branch running on the main thread matches the
> off-main-thread branch below, i.e. can we send CHANGED here too? If not,
> please add a comment explaining why not.

Right, since we don't call CheckAdaptersAddresses() on the main thread we can't catch the CHANGED condition so it can't happen there. I'll add a comment about this.

> Can this be:
> 
> if (mLinkup &&
>     mLinkup == prevLinkUp &&
>     prevCsum != mIPInterfaceChecksum) { 
> ...
> 

> Looks like that might avoid sending UP and CHANGED at the same time, no?

It would, but I think it would be wrong. I think the events have this meaning:

UP - the network link's new status is UP. (Ideally and usually it means that it was previously DOWN) The network connectivity just came on.

DOWN - network connectivity disappeared

CHANGED - network connectivity is present and the topology has changed since previously.

So a listener that doesn't care about topology changes can listen on DOWN/UP only. Also, it makes Firefox basically send UP/DOWN for the same scenarios independently if CHANGED has been implemented or not.

Make sense?
(In reply to Steve Workman [:sworkman] from comment #43)

> However, what kind of side effects are you talking about? Are they mitigated
> by not sending CHANGED with UP? I added a suggestion at the end of comment
> 41 about how that might be accomplished in nsNotifyAddrListener.

Yes I guess the primary "risk" I thinking of would be UP immediately followed by CHANGED.

By removing the UP event in that case the risk for "double action" would be removed, but removing the UP for that case would then instead circumvent the manageofflinestatus logic in nsIOService that we're talking about preserving!
(In reply to Daniel Stenberg [:bagder] from comment #45)

(Talking to myself, sorry for this but I'm thinking slower than I type apparently)

> but removing the UP for that case would then instead circumvent the
> manageofflinestatus logic in nsIOService that we're talking about preserving!

Nah, that was wrong. Since the code checks for UP and DOWN explicitly and otherwise reads the status, the nsIOService logic will remain functional even if we remove UP in the UP+CHANGED case. The question left is then _only_ if not sending UP is really what want for that scenario.
(In reply to Daniel Stenberg [:bagder] from comment #44)
> I do however think it is overloading the
> mManageOfflineStatus pref with more functionality than just "managing
> offline status" so I'm not sure I agree it should have the check. The
> autodial stuff is controlled by its own pref anyway and I think that's the
> only pref to check for that condition.

Hmm, I'm ok with that I think. My understanding has been that autodial only works or makes sense if manage-offline-status is enabled. In any case, we can leave it as is for now and then come back to it.

> It begs the question about how many users that are actually (successfully)
> using autodial...

Indeed - My inclination is to not enable it by default, and probably to exclude it in the future/potential work on manage-offline-status.

> > ::: netwerk/system/win32/nsNotifyAddrListener.cpp
> > @@ +455,5 @@
> > > +        if (mStatusKnown) {
> > > +            event = mLinkUp ? NS_NETWORK_LINK_DATA_UP :
> > > +                NS_NETWORK_LINK_DATA_DOWN;
> > > +        } else {
> > > +            event = NS_NETWORK_LINK_DATA_UNKNOWN;
> > 
> > We should make sure that this branch running on the main thread matches the
> > off-main-thread branch below, i.e. can we send CHANGED here too? If not,
> > please add a comment explaining why not.
> 
> Right, since we don't call CheckAdaptersAddresses() on the main thread we
> can't catch the CHANGED condition so it can't happen there. I'll add a
> comment about this.

Hmm. I think it would be better to 1. Dispatch this off the main thread to make sure we catch and notify network changes. And 2. Add some kind of NS_WARNING so we can get a rough idea of how often we get these on the main thread.
(In reply to Daniel Stenberg [:bagder] from comment #44)
> > Looks like that might avoid sending UP and CHANGED at the same time, no?
>
> So a listener that doesn't care about topology changes can listen on DOWN/UP
> only. Also, it makes Firefox basically send UP/DOWN for the same scenarios
> independently if CHANGED has been implemented or not.
> 
> Make sense?

Yes, specifically the part about listeners not caring about UP and DOWN. Tricky, huh.

(In reply to Daniel Stenberg [:bagder] from comment #46)
> (In reply to Daniel Stenberg [:bagder] from comment #45)
> Since the code checks for UP and DOWN explicitly and
> otherwise reads the status, the nsIOService logic will remain functional
> even if we remove UP in the UP+CHANGED case. The question left is then
> _only_ if not sending UP is really what want for that scenario.

Actually, I didn't mean removing UP; instead I meant not sending CHANGED. So, UP and DOWN would always be sent, and CHANGED would be implicit in an UP event. BUT, it seems like that doesn't fit with the idea that listeners could ignore UP and DOWN.

So, here's what I think: we can view this as the listener's problem. At present, we only have STS and DNS that care - they're indirectly listening via nsIOService. To start, we can make sure that nothing is broken by having UP + CHANGED. I imagine it's fine. Then, as a follow-up, we can adjust things. For example, in DNS, let's not blow away the cache in DOWN/SetOffline events. Instead, let's just set a flag and make sure that the DNS service is blocked while offline. Then, when we come back up, we just use the same cache as before. However, if we get a CHANGED event, then we should blow away the cache and start again. (Might actually be an optimization for current manage-offline cases).

In this case, I can a see a race condition:

UP->DNS online;
socket requests host record; gets it
CHANGED->DNS reset

=> socket has host record from old version of cache.


So, let's set a rule that if the network is coming online, and the topology has changed, then we must send CHANGED first, followed by UP.

How does this sound?

(Btw, I think changing the behavior of DNS listening to UP should be done after we fix the other platforms. That way we can avoid platform ifdefs. But if we need to do those, I'm fine with that).
(In reply to Steve Workman [:sworkman] from comment #47)

> > Right, since we don't call CheckAdaptersAddresses() on the main thread we
> > can't catch the CHANGED condition so it can't happen there. I'll add a
> > comment about this.
> 
> Hmm. I think it would be better to 1. Dispatch this off the main thread to
> make sure we catch and notify network changes. And 2. Add some kind of
> NS_WARNING so we can get a rough idea of how often we get these on the main
> thread.

The comment in nsNotifyAddrListener::CheckLinkStatus() says a reason it doesn't check the status in the main thread is that it can take a long time - too long time to do synchronously. And if we are to do the action without waiting for it to complete, then I don't think there's much point in doing it in the main thread anymore as the only time it can get called on that thread is when someone explicitly asks for current status. Delayed/asynch information about link status is already provided!

Regarding (2), there already is an NS_WARNING() for that situation and I've not yet seen it happen.
(In reply to Steve Workman [:sworkman] from comment #48)

> Actually, I didn't mean removing UP; instead I meant not sending CHANGED.
> So, UP and DOWN would always be sent, and CHANGED would be implicit in an UP
> event. BUT, it seems like that doesn't fit with the idea that listeners
> could ignore UP and DOWN.

Right. I wanted the CHANGED event to mean "network topology changed since the last time the network was UP" as that can let us keep (some) caches survive ordinary DOWN/UP cycles as long as we get back to the same "network topology" that we had before we want offline.
(In reply to Steve Workman [:sworkman] from comment #48)

> So, let's set a rule that if the network is coming online, and the topology
> has changed, then we must send CHANGED first, followed by UP.
> 
> How does this sound?

That's a really good idea! me like.
Next iteration. I believe I've addressed all nits discussed so far... Try link for this version: https://tbpl.mozilla.org/?tree=Try&rev=d3bed454c770 (still working when I post this)
Attachment #8417204 - Attachment is obsolete: true
Attachment #8420108 - Flags: review?(sworkman)
Comment on attachment 8420108 [details] [diff] [review]
v10-0001-bug-939318-detect-network-interface-changes-on-w.patch

Review of attachment 8420108 [details] [diff] [review]:
-----------------------------------------------------------------

Ok ok - good stuff. I am going to spend a little time today looking at the HTTP part - I just want to make sure about those calls. I'll comment on that later today.

::: netwerk/base/src/nsIOService.cpp
@@ +1093,5 @@
>  
>      if (mShutdown)
>          return NS_ERROR_NOT_AVAILABLE;
> +
> +    if (!strcmp(data, NS_NETWORK_LINK_DATA_DOWN)) {

I re-read my comment on this and it could have been read two ways quite easily, sorry. What I meant was that I was ok with manage-offline wrapping autodial, and that we should leave it like that for now. I really do want to come back to this offline stuff, but I want to avoid an autodial and manage-offline kerfuffle in the meantime.

So, if (mManageOfflineStatus && !strcmp...)

Sound ok?

@@ +1124,5 @@
> +        } else if (!strcmp(data, NS_NETWORK_LINK_DATA_UP)) {
> +            isUp = true;
> +        } else {
> +            nsresult rv = mNetworkLinkService->GetIsLinkUp(&isUp);
> +            NS_ENSURE_SUCCESS(rv, rv);

Can we ignore if it data is something else? CHANGED shouldn't be overloaded right?

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +481,5 @@
> +        if (prevLinkUp != mLinkUp) {
> +            // UP/DOWN status changed, send appropriate UP/DOWN event
> +            SendEvent(mLinkUp ?
> +                      NS_NETWORK_LINK_DATA_UP : NS_NETWORK_LINK_DATA_DOWN);
> +        }

Yay!
Comment on attachment 8420108 [details] [diff] [review]
v10-0001-bug-939318-detect-network-interface-changes-on-w.patch

Review of attachment 8420108 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1856,5 @@
> +        const char *state = NS_ConvertUTF16toUTF8(data).get();
> +        if (strcmp(state, NS_NETWORK_LINK_DATA_CHANGED) == 0) {
> +            if (mConnMgr) {
> +                mConnMgr->DoShiftReloadConnectionCleanup(nullptr);
> +                mConnMgr->PruneDeadConnections();

Pat, would like your input here please.

I remember from the meeting that Daniel, you and I had a couple of months ago that you had said something about letting some of the active connections time out. In this code, Idle connections are closed and active connections are set as DontReuse; looks like SPDY connections are closed in PruneDeadConnections. So HTTP/1 active connections will time out based on the socket, right? Or is there another timer in play here?

Is this the kind of thing you were thinking about? If these connections get stalled, and they're timing out based on TCP, that seems like a long time to wait. Do we need to add another timeout mechanism here?

How to deal with these active connections is cloudy for me... :)

(Btw, Daniel, I think we need to ResetIPFamilyPreference for each connection entry's nsHttpConnectionInfo. So, I think we need another function like OnMsgDoShiftReloadConnectionCleanup which enumerates on mCT using a new function, ResetPersistentConnectionsCB (?) which is like ClosePersistentConnectionsCB, but also calls ResetIPFamilyPreference).
Flags: needinfo?(mcmanus)
(In reply to Steve Workman [:sworkman] from comment #54)

> I re-read my comment on this and it could have been read two ways quite
> easily, sorry. What I meant was that I was ok with manage-offline wrapping
> autodial, and that we should leave it like that for now. I really do want to
> come back to this offline stuff, but I want to avoid an autodial and
> manage-offline kerfuffle in the meantime.
> 
> So, if (mManageOfflineStatus && !strcmp...)
> 
> Sound ok?

Okay, I'll adapt in my next version. I just want to mention that I can't find any documentation anywhere that says you need both these options for autodial to work. Like https://developer.mozilla.org/en/docs/Autodial_for_Windows_NT and http://kb.mozillazine.org/Network.autodial-helper.enabled which makes me suspect the relying on both prefs is unintented.

> @@ +1124,5 @@
> > +        } else if (!strcmp(data, NS_NETWORK_LINK_DATA_UP)) {
> > +            isUp = true;
> > +        } else {
> > +            nsresult rv = mNetworkLinkService->GetIsLinkUp(&isUp);
> > +            NS_ENSURE_SUCCESS(rv, rv);
> 
> Can we ignore if it data is something else? CHANGED shouldn't be overloaded
> right?

It's not really overloaded here. It is now simply using the info it received. So if we actually got a network UP or a DOWN event there's no point in asking the network service for the state when we were just told about it. All other states passed in will then make the code do as before and ask, since it didn't get the info told up front.

Possibly I should clarify that with a comment of some sorts there.
(In reply to Daniel Stenberg [:bagder] from comment #56)
> It's not really overloaded here. It is now simply using the info it
> received. So if we actually got a network UP or a DOWN event there's no
> point in asking the network service for the state when we were just told
> about it. All other states passed in will then make the code do as before
> and ask, since it didn't get the info told up front.

I see what you're saying. nsIOService has already been given data in the NETWORK_LINK notification to say that the link is now UP or DOWN. So, there is not need to query mNetworkLinkService for those cases. That part I understood.

I guess I just don't understand why we need to query up/down status for CHANGED events. We already defined the meaning of CHANGED and UP to be mutually exclusive, right? i.e. only UP and DOWN say if the network link is up or down. So, maybe we can just return in this else branch before calling SetOffline, because I'm not sure if we need to if there hasn't been a reported UP/DOWN event.

It also seems like this should be ok for all platforms, right? The other platforms are only aware of UP and DOWN anyway, so they shouldn't be getting other notifications. What do you think?
> 
> Possibly I should clarify that with a comment of some sorts there.
(In reply to Steve Workman [:sworkman] from comment #57)

> I guess I just don't understand why we need to query up/down status for
> CHANGED events. We already defined the meaning of CHANGED and UP to be
> mutually exclusive, right? i.e. only UP and DOWN say if the network link is
> up or down. So, maybe we can just return in this else branch before calling
> SetOffline, because I'm not sure if we need to if there hasn't been a
> reported UP/DOWN event.

The else clause is in fact intended for the NS_NETWORK_LINK_DATA_UNKNOWN cases (when this method is called without an actual event having triggered it) so yes for the CHANGED case we should make sure that it doesn't do anything at all. Thanks!
(In reply to Daniel Stenberg [:bagder] from comment #58)
> The else clause is in fact intended for the NS_NETWORK_LINK_DATA_UNKNOWN
> cases (when this method is called without an actual event having triggered
> it) so yes for the CHANGED case we should make sure that it doesn't do
> anything at all. Thanks!

Ah yes - UNKNOWN. Good one. So, as you imply, adding an 'else if (!strcmp(...CHANGED))' and returning silently sounds good :)
- added the check for managed-offline-status for autodial

- added check for CHANGED in nsIOService::OnNetworkLinkEvent as discussed
Attachment #8420108 - Attachment is obsolete: true
Attachment #8420108 - Flags: review?(sworkman)
Attachment #8422930 - Flags: review?(sworkman)
Comment on attachment 8422930 [details] [diff] [review]
v11-0001-bug-939318-detect-network-interface-changes-on-w.patch

Review of attachment 8422930 [details] [diff] [review]:
-----------------------------------------------------------------

Woohoo! Looks great. I'd like to hear back from Pat on the HTTP Handler part, but apart from that r=me.

::: netwerk/base/src/nsIOService.cpp
@@ +1117,4 @@
>          }
>      }
>  
> +    if (mManageOfflineStatus) {

nit: (had to be one!) you can probably put:

if (!mManageOfflineStatus) {
  return NS_OK;
}

... near the top of the function, since both of the important 'if' branches require it.
Attachment #8422930 - Flags: review?(sworkman) → review+
Flags: needinfo?(mcmanus)
Me, :sworkman and :mcmanus had a meeting in which we decided this bug will benefit from a few more items to get done (there are also items left for follow-up bugs but I'll ignore them for now):

Make some more actions on the CHANGED event:
 - send a websocket ping and require a response
 - proxy service reload
 - add timer to existing alive HTTP connections to tear down if no traffic go through to avoid them stalling

To address the case where a laptop is closed and then opened again, figure out how to get a power event and send CHANGED for that too (probably)

Test on XP
Here's a separate patch that only introduces a nsHttpConnectionMgr::VerifyTraffic() that gets called on a network changed event. The new method does this:

A) first marks all the active connections with the current amount of data having been transfrered

B) sets a timer to run in 5 seconds (arbitrary time, too long will be annoying and too short may risk cutting off things it doesn't have to)

C) when the timer expires, check all the active connections and close those that haven't had any traffic during those 5 seconds.

The idea here being of course that when a network change is registered, we don't know if the connections can still go on or not as they are and the active connections risk "hanging" in case the network change was big enough.

I'm asking for feedback on the general approach first and the exact implementation secondary.

This patch is supposed to be applied after the v11 one but should be readable and understandable without knowing much of the first patch.
Attachment #8426178 - Flags: feedback?(mcmanus)
Comment on attachment 8426178 [details] [diff] [review]
0001-Bug-939318-find-and-close-HTTP-connections-without-t.patch

Review of attachment 8426178 [details] [diff] [review]:
-----------------------------------------------------------------

this seems like the right direction

::: netwerk/protocol/http/nsHttpConnection.h
@@ +193,5 @@
>      void    SetInSpdyTunnel(bool arg);
>  
> +    // Get a snapshot of the transferred data amount to allow a later check to
> +    // see if there's been any traffic on it since the "mark"
> +    void MarkForTrafficCheck() { mTrafficCheck = mTotalBytesWritten + mTotalBytesRead; }

I think its fine to use bytesRead like this, but bytesWritten needs to account for the fact that we can do some writing that juts gets buffered as waiting for ack in the kernel socket.. any ideas? its probably not critical

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +341,5 @@
> +nsHttpConnectionMgr::VerifyTraffic()
> +{
> +    LOG(("nsHttpConnectionMgr::VerifyTraffic\n"));
> +
> +    nsresult rv = PostEvent(&nsHttpConnectionMgr::OnMsgVerifyTraffic);

I would prefer that the timer be init'd after the various verify calls had been made (i.e from the onmsgverifytraffic) to enforce the ordering. Its probably not possible for the timer to fire synchronously, but let's not depend on that.

@@ +344,5 @@
> +
> +    nsresult rv = PostEvent(&nsHttpConnectionMgr::OnMsgVerifyTraffic);
> +    if (NS_SUCCEEDED(rv)) {
> +        if(!mTrafficTimer)
> +          mTrafficTimer = do_CreateInstance("@mozilla.org/timer;1");

braces

@@ +347,5 @@
> +        if(!mTrafficTimer)
> +          mTrafficTimer = do_CreateInstance("@mozilla.org/timer;1");
> +
> +        // failure to create a timer is not a fatal error, but dead
> +        // connections will not be cleaned up as nicely

what do we want to do when the timer is currently running from a previous changed event?

@@ +351,5 @@
> +        // connections will not be cleaned up as nicely
> +        if (mTrafficTimer) {
> +            // give active connections 5 long seconds to get more traffic before we
> +            // kill them off
> +            mTrafficTimer->Init(this, 5000, nsITimer::TYPE_ONE_SHOT);

make 5000 a pref

@@ +1024,5 @@
> +                                     nsAutoPtr<nsConnectionEntry> &ent,
> +                                     void *closure)
> +{
> +    // Iterate over all active connections and "mark them"
> +    for (uint32_t index = 0; index < ent->mActiveConns.Length(); ++index) {

some of your active connections are spdy/http2.. those conns should use a ping instead of this mechanism.

@@ +1042,5 @@
> +    nsHttpConnectionMgr *self = (nsHttpConnectionMgr *) closure;
> +    for (uint32_t index = 0, len = ent->mActiveConns.Length();
> +         index < len; ++index) {
> +        nsHttpConnection *conn = ent->mActiveConns[index];
> +        if(!conn->HasTraffic()) {

so the set of active conns at this point in time is not necessarily the same as the set that was marked earlier. In a healthy system some will arrive and some will leave. we don't care about the members that weren't around last time (they weren't around during the change event) - so this loop is right, I think HasTraffic() just needs a little state.

@@ +1043,5 @@
> +    for (uint32_t index = 0, len = ent->mActiveConns.Length();
> +         index < len; ++index) {
> +        nsHttpConnection *conn = ent->mActiveConns[index];
> +        if(!conn->HasTraffic()) {
> +            ent->mActiveConns.RemoveElementAt(index);

you are iterating forward through the array and also removing elements from it. not good :)

@@ +1046,5 @@
> +        if(!conn->HasTraffic()) {
> +            ent->mActiveConns.RemoveElementAt(index);
> +            self->DecrementActiveConnCount(conn);
> +
> +            conn->Close(NS_ERROR_ABORT);

LOG() about why

::: netwerk/protocol/http/nsHttpConnectionMgr.h
@@ +92,5 @@
>  
> +    // called to close active connections with no registered "traffic"
> +    nsresult PruneNoTraffic();
> +
> +    // "Verify" means setting the mDead field to true now, and then again in N

I don't think this comment about mDead is current
Attachment #8426178 - Flags: feedback?(mcmanus) → feedback+
Here's a new iteration of the patch for the HTTP handler acting on the network changed event. I believe I've addressed all but one of the feedback I received previously. I'm convinced my new stuff have problems or issues to identify.

I changed the code so that it better handles that new connections appear when the time times to check for stalled ones.

I added a comment about the fact that if we want to set a timer and it already exists, we re-init it. spdy-based connections get a ping sent on them, only non-spdy get the data check after timeout treatment. I'm not sure what else is needed for the ping logic to work, but I don't think this is complete for also checking that there's a response coming.

I hope I iterate through the active connections better now.

I added a pref for the networked changed timeout.

What I haven't addressed: the fact that mTotalBytesWritten may change without it actually sending that data over the network thus avoiding getting triggered by this logic. Ideas on how to improve that are welcome!
Attachment #8426178 - Attachment is obsolete: true
Attachment #8427773 - Flags: feedback?(mcmanus)
This is the third patch for this bug, it requires the main patch (detect-network-interface-changes) to be applied first to work.

This patch makes the protocol proxy service act on the network changed event and reloads the PAC. It is actually the same action that nsioservice already does if manage-offline-status is enabled.
Attachment #8429995 - Flags: review?(mcmanus)
This is the fourth patch for this bug, it requires the main patch (detect-network-interface-changes) to be applied first to work.

This patch make the websockets channel act on network changed events, It simply sends off a ping and sets a timer (using the regular websockets ping timeout), if not pending ping response is pending. If the channel is already waiting for a ping response it is left and thus assumed to be dealt with accordingly anyway.
Attachment #8429998 - Flags: review?(mcmanus)
Comment on attachment 8429998 [details] [diff] [review]
0004-bug-939318-websockets-act-on-network-interface-chang.patch

Review of attachment 8429998 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/websocket/WebSocketChannel.cpp
@@ +1123,5 @@
> +{
> +  LOG(("WebSocketChannel::Observe [topic=\"%s\"]\n", topic));
> +
> +  if (strcmp(topic, NS_NETWORK_LINK_TOPIC) == 0) {
> +    const char *state = NS_ConvertUTF16toUTF8(data).get();

the usual problem

@@ +1134,5 @@
> +      } else {
> +        LOG(("nsWebSocketChannel:: Generating Ping as network changed\n"));
> +        mPingOutstanding = 1;
> +        GeneratePing();
> +        mPingTimer->InitWithCallback(this, mPingResponseTimeout,

I think you may need to initialize the ping timer in this path.. under the default config it isn't used at all. See StartWebsocketData()

ResetPingTimer()also needs to cope with being called when pinginterval is 0.
Attachment #8429998 - Flags: review?(mcmanus)
Comment on attachment 8429995 [details] [diff] [review]
0003-bug-939318-protocolproxy-service-acts-on-network-int.patch

Review of attachment 8429995 [details] [diff] [review]:
-----------------------------------------------------------------

with fix for .get() problem r=mcmanus

::: netwerk/base/src/nsProtocolProxyService.cpp
@@ +442,5 @@
>          if (mPACMan) {
>              mPACMan->Shutdown();
>              mPACMan = nullptr;
>          }
>      }

} else if (foo) {

@@ +444,5 @@
>              mPACMan = nullptr;
>          }
>      }
> +    else if (strcmp(aTopic, NS_NETWORK_LINK_TOPIC) == 0) {
> +        const char *state = NS_ConvertUTF16toUTF8(aData).get();

same comment as previous patch.. I'm pretty sure state is pointing to dangling memory.

@@ +454,1 @@
>      else {

} else {
Attachment #8429995 - Flags: review?(mcmanus) → review+
Comment on attachment 8422930 [details] [diff] [review]
v11-0001-bug-939318-detect-network-interface-changes-on-w.patch

Review of attachment 8422930 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1851,5 @@
>      }
>      else if (strcmp(topic, "last-pb-context-exited") == 0) {
>          mPrivateAuthCache.ClearAll();
>      }
> +    else if (strcmp(topic, NS_NETWORK_LINK_TOPIC) == 0) {

} else if (foo) {
}

@@ +1852,5 @@
>      else if (strcmp(topic, "last-pb-context-exited") == 0) {
>          mPrivateAuthCache.ClearAll();
>      }
> +    else if (strcmp(topic, NS_NETWORK_LINK_TOPIC) == 0) {
> +        const char *state = NS_ConvertUTF16toUTF8(data).get();

NS_ConvertUTF16toUTF8(data) returns a nsCString and .get() returns a ptr into that nsCString.. that nsCString then immediately goes out of scope leaving you with a dangling pointer.

{
 nsCString converted = NS_ConvertUTF16toUTF8(data);
 const char *state = converted.get();
 state stuff;
}

@@ +1855,5 @@
> +    else if (strcmp(topic, NS_NETWORK_LINK_TOPIC) == 0) {
> +        const char *state = NS_ConvertUTF16toUTF8(data).get();
> +        if (strcmp(state, NS_NETWORK_LINK_DATA_CHANGED) == 0) {
> +            if (mConnMgr) {
> +                mConnMgr->DoShiftReloadConnectionCleanup(nullptr);

we don't want to land this because it will terminate (gracefully) http/2 connections which have a ping mechanism available to them to better handle this situation.
Attachment #8422930 - Flags: review-
looking good.. we need sleep and http/2 changes.. anything else?
(In reply to Patrick McManus [:mcmanus] from comment #72)
> looking good.. we need sleep and http/2 changes.. anything else?

I didn't see the http/2 patch.. looking at it now.

so just sleep?
Comment on attachment 8427773 [details] [diff] [review]
v2-0001-Bug-939318-find-and-close-HTTP-connections-withou.patch

Review of attachment 8427773 [details] [diff] [review]:
-----------------------------------------------------------------

so this spdy code just sends pings.. it doesn't make us timeout on them. normally that is driven through readtimeouttick().. you can probly push a smaller mPingThreshold onto the stack and the pop it back to the original value after a reply is received - you'll need to call ActivateTimeoutTick() in the connectionmanager to get the tick on the right cadence. (its probably gone to a long delay by the time you need it, and this will get it back on a frequent tick).

consider if the http/1 conns can be managed the same way rather than with another timer.

::: netwerk/protocol/http/Http2Session.cpp
@@ +3224,5 @@
>  }
>  
> +void
> +Http2Session::SendPing()
> +{

assert socket thread .. also in spdy versions

::: netwerk/protocol/http/Http2Session.h
@@ +209,5 @@
>    nsISocketTransport *SocketTransport() { return mSocketTransport; }
>    int64_t ServerSessionWindow() { return mServerSessionWindow; }
>    void DecrementServerSessionWindow (uint32_t bytes) { mServerSessionWindow -= bytes; }
>  
> +  void SendPing();

sendping() moz_override also in spdy versions

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +2062,5 @@
>  
> +void
> +nsHttpConnection::CheckForTraffic()
> +{
> +    if(UsingSpdy()) {

if(mSpdySession) (usingSpdy() can actually outlive that object)

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +1038,5 @@
> +            LOG(("  closed active connection due to no traffic [conn=%p]\n", conn));
> +            NS_RELEASE(conn);
> +            len--; // one less to loop over
> +        } else {
> +            // bump only if entry not removed

let's just walk the list backwards from n-1 to 0.. then you can delete safely and use the normal --index iterator. I _think_ the only code that does similar things to your fragment is the code that needs to honor the order of the list as a priority.

@@ +1041,5 @@
> +        } else {
> +            // bump only if entry not removed
> +            index++;
> +        }
> +    }

I think we need to also clear mTrafficStamp in all the connections (both active and idle) so as not to be confused about the state of things the next time this event runs.

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +142,5 @@
>      , mProxyPipelining(true)
>      , mIdleTimeout(PR_SecondsToInterval(10))
>      , mSpdyTimeout(PR_SecondsToInterval(180))
>      , mResponseTimeout(PR_SecondsToInterval(600))
> +    , mNetworkChangedTimeout(5000)

libpref/src/init/all.js should have a corresponding entry
Attachment #8427773 - Flags: feedback?(mcmanus) → feedback-
Any news on when this is planning to land, folks? Fennec is suffering pretty badly from totally stalling whenever you switch from wifi to cellular, and as we understand it this is the cause.

(Extra points if we can uplift part or all of the fix.)
Flags: needinfo?(daniel)
(In reply to Richard Newman [:rnewman] from comment #75)

> Any news on when this is planning to land, folks? Fennec is suffering pretty
> badly from totally stalling whenever you switch from wifi to cellular, and
> as we understand it this is the cause.

Once I have addressed the above mentioned remarks I would guess that we'll be in a good position to land these patches.

Just note that the current work in this bug only has the CHANGED event being sent on Windows. We need to apply the same/similar logic to the other various plaforms' network backends to have this functionality, something I plan to do once I've closed this report as separate issue - probably one per platform. Of course it could also be done in parallel should anyone want and be able to.
Flags: needinfo?(daniel)
Blocks: 1008091
(1/4) 

Change since last iteration:

* mConnMgr->DoShiftReloadConnectionCleanup(nullptr) removed, the closing of stalled connection will be handled by a separate patch, see (2/4)
Attachment #8422930 - Attachment is obsolete: true
Attachment #8441160 - Flags: review?(mcmanus)
(2/4)

Changes since last iteration:

* code formatting in several places

* NS_ConvertUTF16toUTF8() dangling pointers in several places

* the spdy code didn't make us timeout on them

* Http2Session::SendPing() => assert socket thread .. also in spdy versions

* sendping() MOZ_OVERRIDE in spdy and http2 code

* if(mSpdySession) and not usingSpdy()

* walk the list of active connections backwards from n-1 to 0 and fixed the
  removal of entries

* clear mTrafficStamp in all idle connections to avoid triggering on connections
  that have changed since last

* updated all.js with the new prefs
Attachment #8427773 - Attachment is obsolete: true
Attachment #8441161 - Flags: review?(mcmanus)
(3/4)

Rebased. No changes.
Attachment #8429995 - Attachment is obsolete: true
Attachment #8441162 - Flags: review+
(4/4)

Changes:

* initialize the ping timer too if necessary
Attachment #8429998 - Attachment is obsolete: true
Attachment #8441163 - Flags: review?(mcmanus)
Comment on attachment 8441160 [details] [diff] [review]
v3-0001-bug-939318-detect-network-interface-changes-on-wi.patch

Review of attachment 8441160 [details] [diff] [review]:
-----------------------------------------------------------------

awesome.. r=mcmanus with a bunch of style nits addressed

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1856,5 @@
>              mConnMgr->ReportFailedToProcess(uri);
>      }
>      else if (strcmp(topic, "last-pb-context-exited") == 0) {
>          mPrivateAuthCache.ClearAll();
>      }

} and else if ( on same line

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +398,5 @@
> +                continue;
> +            }
> +
> +            // Add chars from AdapterName to the checksum.
> +            for (int i=0; adapter->AdapterName[i]; i++) {

I know its nearly impossible to intuit the current style because of all the legacy code :)

this should look like "for (int i = 0; adapter->AdapterName[i]; ++i) {"
(spaces around assignemment, and ++i is preferred to i++)

@@ +407,5 @@
> +            // Add bytes from each socket address to the checksum.
> +            for (PIP_ADAPTER_UNICAST_ADDRESS pip = adapter->FirstUnicastAddress;
> +                 pip; pip = pip->Next) {
> +                SOCKET_ADDRESS *sockAddr = &pip->Address;
> +                for (int i=0; i < sockAddr->iSockaddrLength; i++) {

formatting nit

@@ +408,5 @@
> +            for (PIP_ADAPTER_UNICAST_ADDRESS pip = adapter->FirstUnicastAddress;
> +                 pip; pip = pip->Next) {
> +                SOCKET_ADDRESS *sockAddr = &pip->Address;
> +                for (int i=0; i < sockAddr->iSockaddrLength; i++) {
> +                    // static_cast<> doesn't work here:

c++ casts.. const_cast or reinterpret_cast<> probly. but never c casts.

@@ +454,5 @@
>          mLinkUp = true;
> +
> +        if (!mStatusKnown) {
> +            event = NS_NETWORK_LINK_DATA_UNKNOWN;
> +        }

} else if (!prevLinkUp) {

@@ +457,5 @@
> +            event = NS_NETWORK_LINK_DATA_UNKNOWN;
> +        }
> +        else if(!prevLinkUp) {
> +            event = NS_NETWORK_LINK_DATA_UP;
> +        }

} else {

@@ +463,5 @@
> +            // Known status and it was already UP
> +            event = NULL;
> +        }
> +
> +        if (event != NULL) {

if (event) {

(or event != nullptr if you strongly prefer it - but the style guide says to avoid comparisons to null)
Attachment #8441160 - Flags: review?(mcmanus) → review-
Comment on attachment 8441161 [details] [diff] [review]
v3-0002-Bug-939318-find-and-close-HTTP-connections-withou.patch

Review of attachment 8441161 [details] [diff] [review]:
-----------------------------------------------------------------

same comments from http2session also apply to spdysession3 and spdysession3.1

::: netwerk/protocol/http/Http2Session.cpp
@@ +282,2 @@
>        mPingSentEpoch = 0;
> +      if(mPreviousUsed) {

if - space - (

@@ +289,2 @@
>  
> +    return PR_IntervalToSeconds(threshold) -

I think this should be mPringThreshold - no? (its calculating the delta to the next timer).. I'm not sure why you're keeping the temp value around after its been restored.

@@ +3256,5 @@
> +    return;
> +  }
> +
> +  mPingSentEpoch = PR_IntervalNow();
> +  if (!mPingSentEpoch)

always braces these days - even for single line conditionals

::: netwerk/protocol/http/Http2Session.h
@@ +441,5 @@
>    PRIntervalTime       mLastDataReadEpoch; // used for IdleTime()
>    PRIntervalTime       mPingSentEpoch;
>  
> +  PRIntervalTime       mPreviousPingThreshold; // backup for the former value
> +  bool                 mPreviousUsed;          // true when backup is used

unitialized

::: netwerk/protocol/http/nsHttpConnection.cpp
@@ +2062,5 @@
> +{
> +    if (check) {
> +        if (mSpdySession) {
> +            // Send a ping to verify it is still alive
> +            mSpdySession.get()->SendPing();

why not mSpdySession->SendPing()?

::: netwerk/protocol/http/nsHttpConnection.h
@@ +296,5 @@
>      bool                            mInSpdyTunnel;
>  
> +    // A snapshot of current number of transfered bytes
> +    int64_t                         mTrafficCount;
> +    bool                            mTrafficStamp; // true then the above is set

mtrafficstamp should be initialized to false

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +333,5 @@
> +// traffic since they were "marked" and nuke them.
> +nsresult
> +nsHttpConnectionMgr::PruneNoTraffic()
> +{
> +    return PostEvent(&nsHttpConnectionMgr::OnMsgPruneNoTraffic);

LOG() for consistency

@@ +1035,5 @@
> +    uint32_t numConns = ent->mActiveConns.Length();
> +    if (numConns) {
> +        // walk the list backwards to allow us to remove entries easily
> +        for (int index = numConns-1; index >= 0; index--) {
> +            nsHttpConnection *conn = ent->mActiveConns[index];

I know you're copying the horrendous mix of reference management used in this file, but let's use the better practice with new code. (I've tried unsuccessfully to fix the whole module a couple of times - I'll try again if nobody beats me to it)

if (ent->mActiveConns[index]->NoTraffic()) {
 nsRefPtr<nsHttpConnection> conn = dont_AddRef(ent->mActiveConns[index]);
 removeelement;
 deceremnetactiveconncount(conn);
 conn->close(abort);
 log();
}

@@ +2428,5 @@
> +    mCT.Enumerate(PruneNoTrafficCB, this);
> +}
> +
> +void
> +nsHttpConnectionMgr::OnMsgVerifyTraffic(int32_t, void *)

I think this code needs a "verifyandprune-in-progress" guard bool that makes this method just return early if anything is in the cycle from onmsgverifytraffic through onmsgprunenotraffic.. its not real clear what will happen with multiple events entering at various points in the life cycle and it doesn't seem meaningful to figure it out when we can just set a limit of 1.

if we don't do that you're minimally going to need to checkfortraffic(false) on every conn after doing the prune scan.. but I don't think that's necessary with the guard.
Attachment #8441161 - Flags: review?(mcmanus) → review-
Comment on attachment 8441163 [details] [diff] [review]
v3-0004-bug-939318-websockets-act-on-network-interface-ch.patch

Review of attachment 8441163 [details] [diff] [review]:
-----------------------------------------------------------------

r=mcmanus with nits addressed

::: netwerk/protocol/websocket/WebSocketChannel.cpp
@@ +78,5 @@
>                    nsIHttpUpgradeListener,
>                    nsIRequestObserver,
>                    nsIStreamListener,
>                    nsIProtocolHandler,
> +                  nsIObserver,

end of the list please

::: netwerk/protocol/websocket/WebSocketChannel.h
@@ +13,5 @@
>  #include "nsIAsyncInputStream.h"
>  #include "nsIAsyncOutputStream.h"
>  #include "nsITimer.h"
>  #include "nsIDNSListener.h"
> +#include "nsIObserverService.h"

can you just double check that you need this in the .h and not just the .cpp?

@@ +63,5 @@
>                           public nsIInputStreamCallback,
>                           public nsIOutputStreamCallback,
>                           public nsITimerCallback,
>                           public nsIDNSListener,
> +                         public nsIObserver,

end of the list please
Attachment #8441163 - Flags: review?(mcmanus) → review+
Attachment #8441162 - Flags: review+
in addition to these patches we still need (from comment 62)
 - deal with sleeping
 - confirmation on xp

 and InternetGetConnectedState()for captive portal transitions (comment 11)

and of course non windows things in another bug.
Blocks: 973169
Please do not forget SOCKS-Proxys!
In some cases the entire Connection to the public internet goes through a SOCKS-Proxy, this can including DNS-Functions in case the SOCKS-Proxy does its own domain name resolution.
And the Proxy can be accessible only with IPv6 or only with IPv4 or with both IP-Versions, independent from the used IP-Version for the external Communication with the desired Hosts in the public Internet.

In the case the domain suffix for the Host (on which Firefox runs) changes the Web Proxy Autodiscovery Protocol must be restarted.

Greetings
Erik
(In reply to Patrick McManus [:mcmanus] from comment #81)

> >      }
> >      else if (strcmp(topic, "last-pb-context-exited") == 0) {
> >          mPrivateAuthCache.ClearAll();
> >      }
> 
> } and else if ( on same line

But surely you don't want me to modify existing lines that I don't change otherwise? I didn't write this code quoted above and in fact several of the files I modify here already have source code guide violations. I'll try to make sure that my edited or added lines follow the right style, but for other lines I would prefer we leave those and clean them up later in a separate patch instead!
generally if you touch it, fix it. you don't need to rewrite the whole file you aren't touching - but you might need to reformat the whole if/else/else tree.

sometimes it just gets out of hand and you're better off just fitting in with what's there. its a judgment call. certainly not the most important thing to worry about.
(In reply to Patrick McManus [:mcmanus] from comment #84)
> in addition to these patches we still need (from comment 62)
>  - deal with sleeping
>  - confirmation on xp
> 
>  and InternetGetConnectedState()for captive portal transitions (comment 11)
> 
> and of course non windows things in another bug.

Sorry, but "deal with sleeping" means sending a CHANGED event when the computer comes back from "sleep" right? I would prefer us taking that in a separate bug, as it is quite easy to separate from the rest and doesn't hurt what we have so far. And it will help us land this particular set sooner.

Testing this on XP is indeed important. I don't personally have any winxp around my house. Anyone reading this than can take the patches for a spin on XP? I'm basically looking for other ways to test on XP than me having to install it in my end...

The captive portal part could possibly be done in/for Bug 664266.
(In reply to Patrick McManus [:mcmanus] from comment #82)

> @@ +2428,5 @@
> > +    mCT.Enumerate(PruneNoTrafficCB, this);
> > +}
> > +
> > +void
> > +nsHttpConnectionMgr::OnMsgVerifyTraffic(int32_t, void *)
> 
> I think this code needs a "verifyandprune-in-progress" guard bool that makes
> this method just return early if anything is in the cycle from
> onmsgverifytraffic through onmsgprunenotraffic.. its not real clear what
> will happen with multiple events entering at various points in the life
> cycle and it doesn't seem meaningful to figure it out when we can just set a
> limit of 1.
> 
> if we don't do that you're minimally going to need to checkfortraffic(false)
> on every conn after doing the prune scan.. but I don't think that's
> necessary with the guard.

I believe this is the only remark I'm struggling with to address properly - I think because I don't see how the problem you mention could be triggered. Let me try elaborate so that you can point out the flaw(s) in my thinking/code:

The current patch currently handles getting another "traffic check" event while a prior hasn't yet been handled, and it does so by just resetting the timeout. In my mind, that would properly solve the case where we get multiple CHANGED events within the grace period. We check for traffic activity TIMEOUT milliseconds after the last event.

If then we actually fire the timeout and check the existing connections for activity, that function only runs in the socket thread and after the timeout has expired. How would a flow be able to run more than one of those within a short period of time?
Flags: needinfo?(mcmanus)
(In reply to Daniel Stenberg [:bagder] from comment #88)

> 
> Sorry, but "deal with sleeping" means sending a CHANGED event when the
> computer comes back from "sleep" right?

deal with sleeping means detecting stale network config after a sleep event. Its definitely one of our known problem cases.

 I would prefer us taking that in a
> separate bug, as it is quite easy to separate from the rest and doesn't hurt
> what we have so far. And it will help us land this particular set sooner.
> 

you can organize it how you like. From a process view I want to be clear its part of this project and isn't to be deferred.
Flags: needinfo?(mcmanus)
Refresh. I believe all previous remarks have been addressed.
Attachment #8441160 - Attachment is obsolete: true
Attachment #8462440 - Flags: review?(mcmanus)
Refreshed. I believe all remarks are addressed, except the one mentioned in comment 89.
Attachment #8441161 - Attachment is obsolete: true
Attachment #8462444 - Flags: review?(mcmanus)
And here's a try run with all four patches in v4: https://tbpl.mozilla.org/?tree=Try&rev=63cfabebfbc3
(In reply to Patrick McManus [:mcmanus] from comment #90)

> deal with sleeping means detecting stale network config after a sleep event.
> Its definitely one of our known problem cases.

It seems we already have an observer topic for exactly this: NS_WIDGET_WAKE_OBSERVER_TOPIC "After the OS wakes up, this topic is notified". (It seems to be listening to the power events on Windows at least)

I suggest we then subscribe to that notification (as well) in the suitable places that need to act on it. Right? An alternative approach would be to listen to that at a single spot and then send out a "network changed" event, but that seems less good to me as it makes it less fine-grained preventing listeners from making a distinction when they feel a need to.
(In reply to Daniel Stenberg [:bagder] from comment #94)
> (In reply to Patrick McManus [:mcmanus] from comment #90)
> 
> > deal with sleeping means detecting stale network config after a sleep event.
> > Its definitely one of our known problem cases.
> 
> It seems we already have an observer topic for exactly this:
> NS_WIDGET_WAKE_OBSERVER_TOPIC "After the OS wakes up, this topic is
> notified". (It seems to be listening to the power events on Windows at least)
> 
> I suggest we then subscribe to that notification (as well) in the suitable
> places that need to act on it. Right? An alternative approach would be to
> listen to that at a single spot and then send out a "network changed" event,
> but that seems less good to me as it makes it less fine-grained preventing
> listeners from making a distinction when they feel a need to.

I'd actually be inclined to listen to it at one spot and generate a changed for it.. do you see any code that would do something different on WAKE vs the changes we already have?

> 
> If then we actually fire the timeout and check the existing connections for
> activity, that function only runs in the socket thread and after the timeout
> has expired. How would a flow be able to run more than one of those within a
> short period of time?

the timeout firing results in PostEvent(onmsgprunenotraffic).. 

there is a period of time after this has happened, when the timer is not armed, when verifytraffic can be invoked (these events can be queued before the onmsgprunenotraffic).. connections marked by that last verifytraffic won't get the grace period before onmsgprunenotraffic is executed.

one easy way to handle that is to handle the timeout synchronously (it should get fired on the socket thread because that is where the init was called, but assert it)
I'll wait till we get comment 96 sorted out before reviewing, to save myself any churn that may (or may not) happen
(In reply to Patrick McManus [:mcmanus] from comment #95)

> I'd actually be inclined to listen to it at one spot and generate a changed
> for it.. do you see any code that would do something different on WAKE vs
> the changes we already have?

I don't think I can foresee any real problems with always generating CHANGED for it (except for the little thing that I need to figure out a good place to add that logic, nsIOService?).

What I could possibly think wouldn't be strictly necessary to do when coming back for sleep is for example the flushing of the DNS cache. Possibly there can pop up others later. But I'm perfectly fine with doing it with a single CHANGED event now and if we ever need to split them at a later point in time we can deal with that then.
This is patch 5, it simply adds a notifyobserver in the ioservice file that listens to the wake-up event and sends a network-changed event on arrival.
Attachment #8463270 - Flags: review?(mcmanus)
Attachment #8463270 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #96)

> there is a period of time after this has happened, when the timer is not
> armed, when verifytraffic can be invoked (these events can be queued before
> the onmsgprunenotraffic).. connections marked by that last verifytraffic
> won't get the grace period before onmsgprunenotraffic is executed.

Thanks, I see it too. I have implemented a simple boolean protector for that now. Would you prefer to have it squashed into the original patch or would you prefer to have this fix handled separately for review purposes?
new patch please - thanks for asking. you can squash them together before landing
Comment on attachment 8462440 [details] [diff] [review]
v4-0001-bug-939318-detect-network-interface-changes-on-wi.patch

Review of attachment 8462440 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +457,5 @@
> +        } else if (!prevLinkUp) {
> +            event = NS_NETWORK_LINK_DATA_UP;
> +        } else {
> +            // Known status and it was already UP
> +            event = NULL;

s/NULL/nullptr
Attachment #8462440 - Flags: review?(mcmanus) → review+
this is a change on top of patch 8462444: v4-0002-bug-939318-find-and-close-HTTP-connections-withou.patch

This adds a boolean that avoids that the check function runs again until the pruning has completed.

I intend to squash the two patches into a single one pending a granted review.
Attachment #8464743 - Flags: review?(mcmanus)
Comment on attachment 8464743 [details] [diff] [review]
0001-bug-939318-avoid-a-pruning-race-condition-r-mcmanus.patch

Review of attachment 8464743 [details] [diff] [review]:
-----------------------------------------------------------------

there is a massive patch here to the .h file that I think is a vcs mistake of some sort...
Attachment #8464743 - Flags: review?(mcmanus) → review-
Comment on attachment 8462444 [details] [diff] [review]
v4-0002-bug-939318-find-and-close-HTTP-connections-withou.patch

Review of attachment 8462444 [details] [diff] [review]:
-----------------------------------------------------------------

r=mcmanus contingent on fixing up the initialization and the inclusion of a r+'d avoid-a-pruning-race-condition patch

thanks - this is exciting!

::: netwerk/protocol/http/SpdySession3.h
@@ +380,5 @@
>    PRIntervalTime       mPingSentEpoch;
>    uint32_t             mNextPingID;
>  
> +  PRIntervalTime       mPreviousPingThreshold; // backup for the former value
> +  bool                 mPreviousUsed;          // true when backup is used

this is still uninitialized in the spdy code (both versions) but not in the http2 code
Attachment #8462444 - Flags: review?(mcmanus) → review+
First attempt was a total screw-up. New try. Patch on top of 8462444.
Attachment #8464743 - Attachment is obsolete: true
Attachment #8464846 - Flags: review?(mcmanus)
FYI: I learned that we use the same windows binary for all windows versions so I need to fix up the XP/non-vista code split in nsNotifyAddrListener.cpp in the first patch and do the detection in run-time.
Comment on attachment 8464846 [details] [diff] [review]
0001-bug-939318-avoid-a-pruning-race-condition-r-mcmanus.patch

Review of attachment 8464846 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/protocol/http/nsHttpConnectionMgr.cpp
@@ +345,5 @@
>  nsresult
>  nsHttpConnectionMgr::VerifyTraffic()
>  {
>      LOG(("nsHttpConnectionMgr::VerifyTraffic\n"));
> +    if (mPruningNoTraffic) {

I think this check needs to be in onmsgverifytraffic()..

::: netwerk/protocol/http/nsHttpConnectionMgr.h
@@ +712,5 @@
>      static PLDHashOperator PrintDiagnosticsCB(const nsACString &key,
>                                                nsAutoPtr<nsConnectionEntry> &ent,
>                                                void *closure);
>      nsCString mLogData;
> +    bool mPruningNoTraffic;

it would be better to just put this right after mTrafficTimer for logical grouping
Take two. Fixed remarks from first review.
Attachment #8464846 - Attachment is obsolete: true
Attachment #8464846 - Flags: review?(mcmanus)
Attachment #8464895 - Flags: review?(mcmanus)
Attachment #8464895 - Flags: review?(mcmanus) → review+
(In reply to Daniel Stenberg [:bagder] from comment #107)
> FYI: I learned that we use the same windows binary for all windows versions
> so I need to fix up the XP/non-vista code split in nsNotifyAddrListener.cpp
> in the first patch and do the detection in run-time.

For the record: I fixed the version detection to happen in run-time and I've done basic manual testing on XP and it didn't go down in flames. 

I've rebased and squashed the patch set into 5 individual ones and I'm doing another try-server run now while rebuilding locally to see no major hiccups.

Will post the updated set once I've got that through.
The main "network changed" event sending functionality.
Attachment #8462440 - Attachment is obsolete: true
Attachment #8465241 - Flags: review+
The new http2/spdy logic to detect and kill dead connections after a network change
Attachment #8462444 - Attachment is obsolete: true
Attachment #8464895 - Attachment is obsolete: true
Attachment #8465242 - Flags: review+
protocolproxy handling on network changed
Attachment #8441162 - Attachment is obsolete: true
Attachment #8465243 - Flags: review+
detect idle websockets connections after network change
Attachment #8441163 - Attachment is obsolete: true
Attachment #8465244 - Flags: review+
send "network change" when coming back from sleep
Attachment #8463270 - Attachment is obsolete: true
Attachment #8465245 - Flags: review+
Try-server results with this set of 5 patches: https://tbpl.mozilla.org/?tree=Try&rev=eb8014a43d1c
Keywords: checkin-needed
Whole lotta Windows xpcshell failure in that push....
https://tbpl.mozilla.org/php/getParsedLog.php?id=44939359&tree=Try etc
Keywords: checkin-needed
Contents of my /etc/hosts is as follows:

127.0.1.1       amidea.dev      amidea sanad.sanad.amidea.dev

Firefox is NOT in Offline Mode, however with the network disconnected I'm unable to access a local site using Firefox v30. Firefox requires network connection to even browse local sites resolving to 127.0.1.1/127.0.0.1.

Chrome (v35) can access the site normally as depicted in the screenshot.

Occurs in Linux Mint 17 64-bit.

Should I report a new bug? Because this happens irrespective of Work Offline checked or not. Bug 339814 and bug 939318 seems to only cater to when Work Offline is checked.
(In reply to Hendy Irawan from comment #118)

> Firefox is NOT in Offline Mode, however with the network disconnected I'm
> unable to access a local site using Firefox v30. Firefox requires network
> connection to even browse local sites resolving to 127.0.1.1/127.0.0.1.
> 
> Chrome (v35) can access the site normally as depicted in the screenshot.
> 
> Occurs in Linux Mint 17 64-bit.
> 
> Should I report a new bug? Because this happens irrespective of Work Offline
> checked or not. Bug 339814 and bug 939318 seems to only cater to when Work
> Offline is checked.

I would appreciate a separate report (as I suspect this particular bug and patch series will not fix it). Also please detail what circumstances that's necessary to trigger this error situation.
Here's an updated version of the websockets patch. This makes sure that the sending of the ping is always done on the correct thread, and it makes sure that the sending is not happening when it is "too early".

Asking for a new review because of the larger change in logic and flow.

The new patch series try-run looks like this: https://tbpl.mozilla.org/?tree=Try&rev=ba44b58ba1cb

I'm having a hard time to see which of the problems I should really care about (I'm clearly still a TBPL newbie) so I'd appreciate an extra set of eyes on that.
Attachment #8467727 - Flags: review?(mcmanus)
Attachment #8465244 - Attachment is obsolete: true
(In reply to Daniel Stenberg [:bagder] from comment #120)
> Created attachment 8467727 [details] [diff] [review]

> Asking for a new review because of the larger change in logic and flow.
> 
> The new patch series try-run looks like this:
> https://tbpl.mozilla.org/?tree=Try&rev=ba44b58ba1cb
> 
> I'm having a hard time to see which of the problems I should really care
> about (I'm clearly still a TBPL newbie) so I'd appreciate an extra set of
> eyes on that.

I don't see anything really concerning in there, but I didn't go through it in great detail. The basic strategy is to click on each non-green element and see if the suggestions are either
 1] infrastructure related
or 2] a well known existing orange failure

often retriggering a specific test (the plus icon) a time or two will help you make a judgment. But it definitely boils down to judgment, not a discrete process. The worst failures will break the same test on all platforms so are easy to see - but that's not always the case.
Comment on attachment 8467727 [details] [diff] [review]
v6-0004-bug-939318-websockets-act-on-network-interface-ch.patch

Review of attachment 8467727 [details] [diff] [review]:
-----------------------------------------------------------------

its curious that the NS_ABORT_IF_FALSE(PR_GetCurrentThread() == gSocketThread in ::notify timer==pingtimer didn't show up in the tests (because that's what you're fixing). I guess it just never failed on debug?

::: netwerk/protocol/websocket/BaseWebSocketChannel.h
@@ +71,5 @@
>    uint32_t                        mClientSetPingTimeout      : 1;
>  
>    uint32_t                        mPingInterval;         /* milliseconds */
>    uint32_t                        mPingResponseTimeout;  /* milliseconds */
> +  uint32_t                        mPingForced                : 1;

move this up to be with the other bit field items - you'll save 4 bytes.
Attachment #8467727 - Flags: review?(mcmanus) → review+
The websockets patch, updated with mcmanus' suggestion.
Attachment #8467727 - Attachment is obsolete: true
Attachment #8468393 - Flags: review+
Thank you Daniel. I've found another bug that is exactly as my issue, which is Bug 698302, reported 3 years ago (!!!).
Updated with two notable changes:

1 - no longer do the initial network notification, prevent an early CHANGED event that mostly is useless and interferes with our tests =)

2 - more importantly: added a FlushCache() method for the hostresolver so that the DNSservice can only flush the cache on the CHANGED event as the previous method with shutdown/init would introduce a short period (race condition) of offline time where the resolver would return an OFFLINE error. I could use a comment on my approach for doing that.
Attachment #8465241 - Attachment is obsolete: true
Attachment #8473641 - Flags: feedback?(mcmanus)
(In reply to Daniel Stenberg [:bagder] from comment #125)
> Created attachment 8473641 [details] [diff] [review]
> v8-0001-bug-939318-detect-network-interface-changes-on-wi.patch
> 
> Updated with two notable changes:
> 
> 1 - no longer do the initial network notification, prevent an early CHANGED
> event that mostly is useless and interferes with our tests =)

i'm concerned about the comment about the tests. can you explain further what was going on?
https://tbpl.mozilla.org/?tree=Try&rev=15bb62c8149f - I'm leaning toward that the few remaining oranges are not caused by this patch set.
(In reply to Patrick McManus [:mcmanus] from comment #126)

> > 1 - no longer do the initial network notification, prevent an early CHANGED
> > event that mostly is useless and interferes with our tests =)
> 
> i'm concerned about the comment about the tests. can you explain further
> what was going on?

The CHANGED event causes a dns flush and depending on exactly when that comes it messes up some of the tests that relies on it - the fact that it is tricky timing and so far windows-specific also makes it awkward to adapt tests for it. (Of course before I fixed the race condition it was even worse.)

But primarily I think we should try to avoid an early changed event "in vain".
(In reply to Daniel Stenberg [:bagder] from comment #128)

> The CHANGED event causes a dns flush and depending on exactly when that
> comes it messes up some of the tests that relies on it - the fact that it is
> tricky timing and so far windows-specific also makes it awkward to adapt
> tests for it. (Of course before I fixed the race condition it was even
> worse.)
> 

tell me more :) What test relies on it - and are we breaking something that relies on it in the real world?
(In reply to Patrick McManus [:mcmanus] from comment #129)

> tell me more :) What test relies on it - and are we breaking something that
> relies on it in the real world?

Anything that relies on that the DNS cache holds contents from a previous operation risk getting fooled.

The specific test I've been fighting with is netwerk/test/unit/test_ping_aboutnetworking.js. It makes a connection to localhost and then checks that the host name exists in the cache. With the possibility of network-changed events, such tests become fragile since the connect and the host name check aren't atomic. 

With the initial notification I could get a CHANGED event exactly in that little time period after the connection was made but before the cache was read, so it would report a blank cache bank in the test and it would fail.
(In reply to Daniel Stenberg [:bagder] from comment #130)
> (In reply to Patrick McManus [:mcmanus] from comment #129)
> 
> > tell me more :) What test relies on it - and are we breaking something that
> > relies on it in the real world?
> 
> Anything that relies on that the DNS cache holds contents from a previous
> operation risk getting fooled.
> 
> The specific test I've been fighting with is
> netwerk/test/unit/test_ping_aboutnetworking.js. It makes a connection to
> localhost and then checks that the host name exists in the cache. With the
> possibility of network-changed events, such tests become fragile since the
> connect and the host name check aren't atomic. 
> 

ok - thanks. That makes sense. so your patch doesn't fix that problem - it just takes away the racing trigger we happen to see right now, right? Let's have that test disable the new feature via pref for the duration of that test. That at least sounds deterministic. There can't be that many things that rely on this kind of cache behavior, are there?

I'm not opposed to the change in the initial behavior, I'd just like to see something deterministic go with it.


> With the initial notification I could get a CHANGED event exactly in that
> little time period after the connection was made but before the cache was
> read, so it would report a blank cache bank in the test and it would fail.
(In reply to Patrick McManus [:mcmanus] from comment #131)

> Let's
> have that test disable the new feature via pref for the duration of that
> test. That at least sounds deterministic.

Yes, a pref for this purpose feels like really a good idea. I'll add one.

> There can't be that many things
> that rely on this kind of cache behavior, are there?

No, I think the number of tests affected by this are limited by I haven't done any deeper research but mostly fell over this since I managed to trigger the problem for it.
Here's a separate patch that adds the new boolean pref "network.notify.changed". Set to true by default. If set false, no network changed events is sent.

Modified both nsIOService and nsNotifyAddrListener to respect the pref.

I modified test_ping_aboutnetworking.js to disable this setting, to avoid the risk that such an event comes in and affects the test results.

This patch depends on the presence of some (2?) of the previous patches.
Attachment #8475166 - Flags: review?(mcmanus)
Comment on attachment 8475166 [details] [diff] [review]
v9-0006-bug-939318-introduce-the-network.notify.changed-p.patch

Review of attachment 8475166 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/system/win32/nsNotifyAddrListener.cpp
@@ +196,5 @@
> +nsNotifyAddrListener::updateFromPref(nsIPrefBranch *prefs)
> +{
> +    MOZ_ASSERT(NS_IsMainThread());
> +
> +    bool allow=true;

nit spacing
Attachment #8475166 - Flags: review?(mcmanus) → review+
Attachment #8473641 - Flags: feedback?(mcmanus) → feedback+
Rebased, no news.
Attachment #8473641 - Attachment is obsolete: true
Attachment #8478095 - Flags: review+
Rebased, no news.
Attachment #8465242 - Attachment is obsolete: true
Attachment #8478098 - Flags: review+
Rebased, no news.
Attachment #8465243 - Attachment is obsolete: true
Attachment #8478099 - Flags: review+
Rebased, no news.
Attachment #8468393 - Attachment is obsolete: true
Attachment #8478100 - Flags: review+
Rebased, no news.
Attachment #8465245 - Attachment is obsolete: true
Attachment #8478101 - Flags: review+
Rebased, no news.
Attachment #8475166 - Attachment is obsolete: true
Attachment #8478102 - Flags: review+
This suite of patches just did this try-run: https://tbpl.mozilla.org/?tree=Try&rev=cccb513011ce

I can no longer spot any problems introduced by this patch set. I think it is getting ready to land.
(In reply to Daniel Stenberg [:bagder] from comment #140)
> Created attachment 8478102 [details] [diff] [review]
> v9-0006-bug-939318-introduce-the-network.notify.changed-p.patch
> 
> Rebased, no news.

Oops, this was actually modified ever so slightly. It now also disables the network changed event in the test_proxy-failover_passing.js test.
Keywords: checkin-needed
Flags: needinfo?(daniel)
Thanks for the UUID fix, I wasn't aware that I needed that!

There are no tests for this feature yet. Bug 1058755 has been filed to make sure we don't miss out, and I have that marked in my short-term TODO list.
Flags: needinfo?(daniel)
I made the prefs variable actually work so that when disabled in the sensitive tests they actually won't ever get the event that ruins them. Now also uses the fancy AddBoolVarCache() which makes the code much smaller and simpler.

I made a new try run: https://tbpl.mozilla.org/?tree=Try&rev=97ea2b4c0850
... and I manually re-triggered the tests on win32 there many times to make sure they don't pop up again.
Attachment #8478102 - Attachment is obsolete: true
Attachment #8481136 - Flags: review+
Refreshed the 0001 patch with UUID update.
Attachment #8478095 - Attachment is obsolete: true
Attachment #8481138 - Flags: review+
Keywords: checkin-needed
Still hitting intermittent Windows xpcshell failures :(
https://hg.mozilla.org/integration/mozilla-inbound/rev/11e4f1678eab

All the same test this time, at least (one of the few different ones it was hitting the first time):
https://tbpl.mozilla.org/php/getParsedLog.php?id=47038777&tree=Mozilla-Inbound
This test_resource_header.js is another test that uses proxy and that gets hurt if we get a changed event in the middle of it. We can easily disable network.notify.changed for it, but...

Now I can't tell exactly why we would get changed events here, but otoh I'm not sure that's entirely relevant to figure out right now since I figure our tests still need to survive such situations.

This is turning into a whack-a-mole to find tests that can possibly fail when a network changed event triggers. Since we don't even test this particular feature yet, I'm starting to consider if we shouldn't switch it around and explicitly disable it during tests and only enable it when we add the specific tests for it. Of course that has its downsides too...
Another iteration of the 0006 patch that adds a prefs variable to prevent the changed event from getting sent.

This version of the patch adds event-ignore to 6 xpcshell tests (and restores the prefs variable again after the test completion).
Attachment #8481136 - Attachment is obsolete: true
Attachment #8482256 - Flags: review+
For completeness I could show that the updated patch gets a clean try-run too, although the history shows that it isn't fool-proof: https://tbpl.mozilla.org/?tree=Try&rev=4246b657ac47
New iteration of the protocolproxyservice part of this series. Now the event receiver only reloads the PAC if it isn't using a data: or file: URL, which in effect will make it much less intrusive for tests - that mostly are using data: URLs for PAC.

Since the intention is to act on changed network conditions, it only needs to reload a PAC if that is actually read from the network.
Attachment #8478099 - Attachment is obsolete: true
Attachment #8483345 - Flags: review?(mcmanus)
With the modified protocolproxyservice code in v12-0003*, I could remove the event-disabling of 5 tests that use the PAC from a data: URL.

This patch now only disables the network changed event in a single test: test_ping_aboutnetworking.js
Attachment #8482256 - Attachment is obsolete: true
Attachment #8483347 - Flags: review+
Fixed a bug in the v12-0003 patch, here's the fixed version.
Attachment #8483345 - Attachment is obsolete: true
Attachment #8483345 - Flags: review?(mcmanus)
Attachment #8483413 - Flags: review?(mcmanus)
Comment on attachment 8483413 [details] [diff] [review]
v13-0003-bug-939318-protocolproxy-service-acts-on-network.patch

Review of attachment 8483413 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/src/nsProtocolProxyService.cpp
@@ +445,5 @@
> +    }
> +
> +    nsXPIDLCString pacSpec;
> +    if (type != PROXYCONFIG_PAC) {
> +        // not a PAC

so you need to do the reload thing for wpad or system types too.. as those can lead to pac files. we just have to assume they are not file or data schemes

mProxyConfig != PROXYCONFIG_PAC && mProxyConfig != PROXYCONFIG_WPAD &&
        mProxyConfig != PROXYCONFIG_SYSTEM

@@ +451,5 @@
> +    }
> +
> +    prefs->GetCharPref(PROXY_PREF("autoconfig_url"),
> +                       getter_Copies(pacSpec));
> +    if (!pacSpec.IsEmpty()) {

you're on the main thread here; so use the url parser and check the scheme the normal way
Attachment #8483413 - Flags: review?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #157)

> mProxyConfig != PROXYCONFIG_PAC && mProxyConfig != PROXYCONFIG_WPAD &&
>         mProxyConfig != PROXYCONFIG_SYSTEM

Ah, thanks.

But ReloadPAC() doesn't seem to do anything for PROXYCONFIG_SYSTEM, should it?
ReloadNetworkPAC() now uses existing functions to extract and compare the scheme properly - to avoid "data" and "file" reloads on network changed events.

I went with using GetProtocolInfo() in there just for simplicity since it exists already. It does a little bit more than needed there, but I think it should be fine.

ReloadNetworkPAC() now also calls ReloadPAC() if the config type is *_WPAD or *_SYSTEM
Attachment #8483413 - Attachment is obsolete: true
Attachment #8484154 - Flags: review?(mcmanus)
Attachment #8484154 - Flags: review?(mcmanus) → review+
Keywords: checkin-needed
Sorry, but given the track record of this bug, there aren't nearly enough retriggers on that last Try run to make me comfortable pushing this. I've triggered more. If they're green, go ahead and re-request checkin.
Keywords: checkin-needed
I sympathize with that, but with the new proxy reload logic there will be significantly less chances of breakage since all those PAC-using tests with data: URLs are no longer affected. But let's see if we get anything bad to show up.
That was 20 x 6 test runs (3 windows flavors, opt + debug), all green.
Re-requesting checkin as per comment 160 and comment 162.
Keywords: checkin-needed
Those errors make me suspect the 0003 patch doesn't work as they look exactly the same as the errors I could get when reloadpac actually were invoked unconditionally. They both are proxy related and it seems that patch the only part of the series that should affect proxies (or PACs rather) specifically...
Keywords: checkin-needed
Keywords: checkin-needed
Confirmed. I broke the 0003 patch after the try-run (as a response to review comments) so the 60-test try was done on an earlier still working generation of 0003. The latest version of that patch can't handle data: URLs (NS_NewURI errors on it) and thus the condition to avoid the reload isn't met and it gets reloaded anyway => the badness.

Improved version pending.
Added error checking so that it'll bail out instead of continue through, which was what happened which made data: URLs run the reloadpac anyway.

Also, this version removes a few trailing whitespaces in the source file.

Try-run in progress: https://tbpl.mozilla.org/?tree=Try&rev=81c5cbc8fbb2
Attachment #8484154 - Attachment is obsolete: true
Attachment #8485038 - Flags: review+
That try run shows 36 tests running all green.

Since I've now failed so many times with this, it feels appropriate to ask here: anyone think I should do anything else now to be more certain of this change?
Maybe another try push with all the tests? (not just xpcshell)
Keywords: checkin-needed
sorry had to backout this test changes again for test failures like https://tbpl.mozilla.org/php/getParsedLog.php?id=47590307&tree=Mozilla-Inbound
Minor updated of this patch. I discovered "junk" in one of the headers that must've been leftovers from a bad previous rebase. Removed in this version.
Attachment #8478098 - Attachment is obsolete: true
Attachment #8487903 - Flags: review+
This is an updated version of the 0001 patch. It A) initially clears the IP checksum just for completeness and less randomness, B) brings back a check for managedoffline state that I had accidentally removed before, C) it no longer sends any network-changed event during the first 2 seconds since start but most importantly D) it fixes the nsHostResolver::FlushCache() function:

Now it flushes only the already resolved host names, and those that are being resolved at the moment the event happens will be marked "again" and they will simply redo the resolve when the complete so that they are done on the new side of the network change. Resolve requests that are still only in one of the queues are not touched. The former version of this patch would cause a name resolve to return NS_ERROR_ABORT if a network change event hit in the middle at an unfortunate time.

The FlushCache() changes and modifications to nsHostResolver::ThreadFunc() are the primary ones I need a new review for.
Attachment #8481138 - Attachment is obsolete: true
Attachment #8492150 - Flags: review?(mcmanus)
Just rebased against current code.
Attachment #8478101 - Attachment is obsolete: true
Attachment #8492155 - Flags: review+
Rebased, no actual change.
Attachment #8483347 - Attachment is obsolete: true
Attachment #8492156 - Flags: review+
There's a memory leak in the 0001 patch and I have a fix for that. I just suspect I'll get some additional comments on that so I'll wait with uploading an update here until after the review of the v19-0001 version.
Comment on attachment 8492150 [details] [diff] [review]
v19-0001-bug-939318-detect-network-interface-changes-on-w.patch

Review of attachment 8492150 [details] [diff] [review]:
-----------------------------------------------------------------

on the one hand this is nicer behavior.. on the other its more uses of legacy datastructure implementations in dns that really ought to die. oh well.

please r?sworkman on the dns changes next time. other than dns r+mcmanus

stevem anytime you want to rewrite hostresolver to something sane - be my guest :)

::: netwerk/dns/nsDNSService2.cpp
@@ +881,5 @@
> +                 strcmp(topic, NS_NETWORK_LINK_TOPIC) == 0,
> +                 "unexpected observe call");
> +
> +    if (!strcmp(topic, NS_NETWORK_LINK_TOPIC)) {
> +        nsCString converted = NS_ConvertUTF16toUTF8(data);

try converted(NS_ConvertBlah(data)).. and if not, at least use nsAutoCString

::: netwerk/dns/nsHostResolver.cpp
@@ +428,5 @@
> +                  PLDHashEntryHdr *hdr,
> +                  uint32_t number,
> +                  void *arg)
> +{
> +    nsHostDBEnt* ent = static_cast<nsHostDBEnt*>(hdr);

nit: foo *bar = static_cast<foo *>(baz);

@@ +432,5 @@
> +    nsHostDBEnt* ent = static_cast<nsHostDBEnt*>(hdr);
> +
> +    // Try to remove the record, or mark it for refresh
> +    if (ent->rec->RemoveOrRefresh()) {
> +        return PL_DHASH_REMOVE;

I like PL_DHASH_REMOVE | PL_DHASH_NEXT for clarity even though NEXT is 0.

@@ +527,5 @@
> +        MoveCList(mEvictionQ, evictionQ);
> +        mEvictionQSize = 0;
> +
> +        // prune the hash from all hosts already resolved
> +        PL_DHashTableEnumerate(&mDB, HostDB_PruneEntry, nullptr);

pretty sure you need to be dropping references here too. is this the game of spot the leak your comment referred to?
Attachment #8492150 - Flags: review?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #180)

Thanks for your comments!

> > +        // prune the hash from all hosts already resolved
> > +        PL_DHashTableEnumerate(&mDB, HostDB_PruneEntry, nullptr);
> 
> pretty sure you need to be dropping references here too. is this the game of
> spot the leak your comment referred to?

Exactly. I need to free the references in the evictionQ there.
Comment on attachment 8492150 [details] [diff] [review]
v19-0001-bug-939318-detect-network-interface-changes-on-w.patch

Review of attachment 8492150 [details] [diff] [review]:
-----------------------------------------------------------------

My 2c re nsHostResolver, my old friend.

I think the idea is mostly fine, but I have a concern about where to check mAgain; I think it should be checked while the lock is held in nsHostResolver::OnLookupComplete. That's quite a bit of refactoring though, and I might be overthinking it. Take a look and see what you think.

Going to give an f+ for now.

::: netwerk/dns/nsDNSService2.cpp
@@ +883,5 @@
> +
> +    if (!strcmp(topic, NS_NETWORK_LINK_TOPIC)) {
> +        nsCString converted = NS_ConvertUTF16toUTF8(data);
> +        const char *state = converted.get();
> +        if (!strcmp(state, NS_NETWORK_LINK_DATA_CHANGED)) {

nit: Do you need state? Can you just place converted.get() in strcmp?

::: netwerk/dns/nsHostResolver.cpp
@@ +307,5 @@
> +bool
> +nsHostRecord::RemoveOrRefresh()
> +{
> +    // leave entries that are resolving but not in queue, and set mAgain
> +    if (resolving && !onQueue) {

// This condition implies that the request has been passed to the OS resolver. The resultant DNS record should be considered removed and not trusted; set a flag to ensure it is called again.

@@ +428,5 @@
> +                  PLDHashEntryHdr *hdr,
> +                  uint32_t number,
> +                  void *arg)
> +{
> +    nsHostDBEnt* ent = static_cast<nsHostDBEnt*>(hdr);

Or auto ent = ...;

@@ +527,5 @@
> +        MoveCList(mEvictionQ, evictionQ);
> +        mEvictionQSize = 0;
> +
> +        // prune the hash from all hosts already resolved
> +        PL_DHashTableEnumerate(&mDB, HostDB_PruneEntry, nullptr);

Ah, refcounting.

Yup, one needs dropped because of the eviction queue.

One from creation should be released in HostDB_ClearEntry (which I think I remember is called when REMOVE is returned in the enumerator functions). So, that should be ok.

And the lookup ref, added in IssueLookup, should be released in OnLookupComplete. And that one should be fine and you deal with calling the OS resolver again in RemoveOrRefresh.

@@ +1227,5 @@
> +            rec->mAgain = false;
> +            if (again) {
> +                LOG(("DNS lookup thread: repeat resolve [%s].\n", rec->host));
> +            }
> +        } while (again);

This part seems mostly ok. However, I don't think it's going to catch all the cases.

Since calling the OS resolver happens outside the lock, FlushCache could be executing at the same time even though it's in the lock. The state would be ->resolving == true and ->onQueue == false, so we would try to set mAgain = true. But let's say that the OS resolver lookup is already done and we're just waiting in OnLookupComplete to get the lock; we'd be waiting on FlushCache to complete. The second OS resolver lookup would not happen even though mAgain was set.

There is a small window of opportunity for this to happen, but it's possible that entries from before the network change could be returned to the callbacks in OnLookupComplete.

I wonder if mAgain should be examined in OnLookupComplete; perhaps OnLookupComplete could return a specific value if the record needs to be looked up again? This seems like a better guarantee that mAgain==true will be caught. What do you think? Maybe I'm overthinking it ....

@@ +1347,5 @@
>  nsHostResolver::GetDNSCacheEntries(nsTArray<DNSCacheEntries> *args)
>  {
>      PL_DHashTableEnumerate(&mDB, CacheEntryEnumerator, args);
>  }
> +

nit: Extra line added; please remove.

::: netwerk/dns/nsHostResolver.h
@@ +87,5 @@
>      void   ReportUnusable(mozilla::net::NetAddr *addr);
>  
>      size_t SizeOfIncludingThis(mozilla::MallocSizeOf mallocSizeOf) const;
>  
> +    bool RemoveOrRefresh();

nit: Comment here please.

@@ +102,5 @@
>      bool    onQueue;  /* true if pending and on the queue (not yet given to getaddrinfo())*/
>      bool    usingAnyThread; /* true if off queue and contributing to mActiveAnyThreadCount */
>      bool    mDoomed; /* explicitly expired */
>  
> +    bool    mAgain;  // when the results from this resolve is returned, it is not

nit: s/mAgain/mResolveAgain/

Just a little bit clearer and potentially more unique for searches.
Attachment #8492150 - Flags: feedback+
Updated the 0001 patch according to feedback.

I did a few smaller fix-ups according to suggestions and the bigger change is this:

To avoid the possible short time gap where a name resolve would not be remade in nsHostResolver::ThreadFunc() I moved the resolveagain-check to OnLookupComplete() as suggested by sworkman and made it return information about that fact.

I then also had to make it _not_ handle the resolve-again condition if NS_ERROR_ABORT is passed in, for cases like when the queue is being cleaned up perhaps due to shutdown.

My try-runs for this look good so far, but I'll await a "clean" review+ until I do a full try-run for the record.
Attachment #8492150 - Attachment is obsolete: true
Attachment #8493658 - Flags: review?(sworkman)
Comment on attachment 8493658 [details] [diff] [review]
v22-0001-bug-939318-detect-network-interface-changes-on-w.patch

Review of attachment 8493658 [details] [diff] [review]:
-----------------------------------------------------------------

Cool - thanks for making those changes. r=me for the DNS code.

::: netwerk/dns/nsHostResolver.cpp
@@ +1631,5 @@
> +            LOG(("DNS lookup thread - Re-resolving host [%s].\n",
> +                 rec->host));
> +        } else {
> +            rec = nullptr;
> +        }

Looks good to me. Thanks for making this fix.

I guess we could get stuck in a loop here if the network keeps changing, but that should be ok. We should drop out of the loop once things stabilize and then we'll get a 'good' DNS record. And I remember you mentioning you were thinking about introducing rate limiting of the CHANGE events at some point, so that should mitigate such situations too.

Looks good.

::: netwerk/dns/nsHostResolver.h
@@ +145,5 @@
>      uint32_t mBlacklistedCount;
>  
> +    // when the results from this resolve is returned, it is not to be
> +    // trusted, but instead a new resolve must be made!
> +    bool    mResolveAgain;

Thanks for changing the name.

@@ +304,5 @@
> +      LOOKUP_OK,
> +      LOOKUP_RESOLVEAGAIN,
> +    };
> +
> +    LookupStatus OnLookupComplete(nsHostRecord *, nsresult, mozilla::net::AddrInfo *);

Nice.
Attachment #8493658 - Flags: review?(sworkman) → review+
Here's the try-run: https://tbpl.mozilla.org/?tree=Try&rev=06c1fd3608bb

I triggered 10-something extra runs of the test on each of the windows-optimized versions as they are the ones that historically have given me pain, and they are all green.
Keywords: checkin-needed
They were on the Try push too. Sorry I missed it before pushing :(
Ack, I missed that too. Uninitialized class member...
New version of the 0001 patch. It inits the mResolveAgain field in the nsHostRecord constructor to fix the valgrind warning.

Try run: https://tbpl.mozilla.org/?tree=Try&rev=444459620f5e
Attachment #8493658 - Attachment is obsolete: true
Attachment #8495035 - Flags: review+
Keywords: checkin-needed
Depends on: 1075530
Keywords: dev-doc-needed
Depends on: 1077084
Blocks: Woodduck
blocking-b2g: --- → 2.0M?
Blocks: 1097440
Hello everybody,
this issue seems to be not fixed. I tried it with Firefox nightly: 37.0a1 (2014-12-14). Switching from WiFi to network works, but the other way around does not work.

Is there anything I can do to support you in finding the problem?

Best regards
Jan
(In reply to Jan Tietjens from comment #193)
> Hello everybody,
> this issue seems to be not fixed. I tried it with Firefox nightly: 37.0a1
> (2014-12-14). Switching from WiFi to network works, but the other way around
> does not work.
> 
> Is there anything I can do to support you in finding the problem?

Is this on Android? We are tracking down an issue related to that right now in bug 991923.
No longer blocks: Woodduck, 1097440
blocking-b2g: 2.0M? → 2.2?
This is on Windows 7.
(In reply to Jan Tietjens from comment #193)

> this issue seems to be not fixed. I tried it with Firefox nightly: 37.0a1
> (2014-12-14). Switching from WiFi to network works, but the other way around
> does not work.
> 
> Is there anything I can do to support you in finding the problem?

Can you please specify with more detail exactly what you do and exactly what "does not work" mean here?
(In reply to Daniel Stenberg [:bagder] from comment #196)

> Can you please specify with more detail exactly what you do and exactly what
> "does not work" mean here?

Of course: When a network switch from LAN to Wifi is performed by Windows then the Firefox is not loading webpages anymore. No timeout occurs, Firefox just shows the loading sign in the tab. In my case I undock my notebook and them Win7 is switching to Wifi. Btw. IE and Chrome works just fine. Same problem exists for all my co-workers.
(In reply to Daniel Stenberg [:bagder] from comment #196)

Can I do something to support you in finding the error for my network configuration? I can also test some nightly builds here...
(In reply to Jan Tietjens from comment #198)

> Can I do something to support you in finding the error for my network
> configuration? I can also test some nightly builds here...

When the problem occurs, can you try to set Firefox into "Work Offline" and then off again and see if that changes anything?

Do you have any ability to switch on some logging to get us to see what Firefox thinks when the network change occurs? I'll also add some more specific logging for the nsNotifyAddr logic for windows to improve the logging quality for this situation. I'll create a separate bug for that.

(re-opening this bug since the problem this bug is about still seems to happen)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
See Also: → 1119717
"Work offline" mode does not help. But I saw an interesting pattern here:

Switching from
1. WIFI -> LAN works. Afterwards switching back:
2. LAN -> WIFI does not work

Second case: 
1. LAN -> WIFI works. Afterwards
2. WIFI -> LAN does not work. Afterwards
3. LAN -> WIFI does also not work. 

Could you give me a hint how to enable logging and how to find the logs?
Am I wrong to think that this also should have fixed going between networks with proxies and ones without?  e.g. There's a corporate proxy at the office, and I don't have one at home.  I have to reset the proxy settings from "Use system proxy" to "No proxy" going back and forth, or pages don't load.  I'm accessing both networks via WiFi.  Neither IE nor Chrome has an issue with that.  I thought Firefox 35 was going to fix that, but it hasn't.

(I'm super glad to see someone engaging these issues, though. :) )
(In reply to Mark Anderson from comment #202)
> Am I wrong to think that this also should have fixed going between networks
> with proxies and ones without?  e.g. There's a corporate proxy at the
> office, and I don't have one at home.  I have to reset the proxy settings
> from "Use system proxy" to "No proxy" going back and forth, or pages don't
> load.  I'm accessing both networks via WiFi.  Neither IE nor Chrome has an
> issue with that.  I thought Firefox 35 was going to fix that, but it hasn't.

It reloads the PAC if you have one set, which is one way to control which proxy to use. But I'm very interested in learning more about your setup and what we can and should do to make it work! How do you configure the proxy in the office case (I guess you're not using a PAC)?

Would you mind filing that as a separate bug with as much details as you can think of (in an effort to not overload this already very extensive bug entry)?
Blocks: 1121800
I have a log file produce with a current nightly. Log can be found here: 
http://www.4shared.com/archive/egJgadGGba/logmodtxt.html

My log with timestamps:

Switching from WIFI->LAN
 15:34:34.777000

Switching from LAN->WIFI
Something around: 15:37:13

Calling website: 15:38:15.993
=> Again problem with loading website

Hope you can find something in the log.

Logging configured with: C:\Program Files (x86)\Nightly>set 
NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,nsHostResolver:5
(In reply to Jan Tietjens from comment #204)
> I have a log file produce with a current nightly. Log can be found here: 
> http://www.4shared.com/archive/egJgadGGba/logmodtxt.html

Thanks a lot, and sorry, but I really can't figure out how to download it from there! Is there a direct link somewhere there that we can get? Was it too big to just attach to this bug report?

> NSPR_LOG_MODULES=timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5,
> nsHostResolver:5

If you re-run a test, please also switch on logging for "nsNotifyAddr" which is the module doing the actual monitoring of the network situation and it'll log when it detects network changes and what it says to "the rest of Firefox" about them so they will be highly interesting!
Attached file Logfile for testrun @2015-01-21 (obsolete) —
Here is the new logfile. Hopefully everything is included.

Steps taken:
LAN->WIFI 2015-01-21 ~11:40:29.419000


Trying to reload page in WIFI: Did not work
at around 11:42 tried to open different webpage gmx.de... Did also not work.

Shows only loading sign in tab.

Shutting down firefox @~11:44
Flags: needinfo?(daniel)
As a Firefox user I just wanted to thank you all for taking time and working on this issue. I have been encountering it for months now and I am glad to have just discovered the reason of the issue and that it will soon be resolved :)

Any idea in which next final version of Firefox this patch may end-up (roughly)?

I suspect the same issue is affecting Thunderbird, would you know by any chance if the code base in which this patch will end-up is more likely to be shared code with Thunderbird?

Many thanks again for all the good work.
(In reply to Richard Leger from comment #207)

> Any idea in which next final version of Firefox this patch may end-up
> (roughly)?

Thanks for the moral support! I already had a patch set get landed for this bug that is included in Firefox 35. The idea was that with that patch set this kind of behavior was history, Turns out it was wrong and now I'll research what we need to do more, and once we know that we need to decide in which versions it will be merged/backported.

> I suspect the same issue is affecting Thunderbird, would you know by any
> chance if the code base in which this patch will end-up is more likely to be
> shared code with Thunderbird?

I'm positive the original bug 939318 existed in Thunderbird as well, but I'm not sure if this latest incarnation of the bug also happens in Thunderbird.
(In reply to Jan Tietjens from comment #206)

> Here is the new logfile. Hopefully everything is included.

Thank you, this is great! It'll take me a little while to analyze. Just a quick question on your setup: are you using any sort of proxy when this happens?
Flags: needinfo?(daniel)
Clearing the blocking nom for 2.2? as this is already fixed/verified on that branch per the status flag
blocking-b2g: 2.2? → ---
(In reply to Daniel Stenberg [:bagder] from comment #209)
> Thank you, this is great! It'll take me a little while to analyze. Just a
> quick question on your setup: are you using any sort of proxy when this
> happens?

Yes, I tested it with two different proxies in our company network. No difference with them in the behaviour of the application.

I will attached  another three files with some fresh testing today.
Attached file log-lan-wifi.txt (obsolete) —
Attached file log-lan-wifi.txt.gz
This is the gzipped version of the previous log file.

8555220: log-lan-wifi.txt could be deleted.
Attachment #8552399 - Attachment is obsolete: true
Attachment #8555220 - Attachment is obsolete: true
Attached file log-wifi-lan.txt.gz
Thanks a lot Jan, I believe they all show the same pattern in the logs:

nsSocketTransportService::Poll() gets stuck in a call to PR_Poll() as there is no " ...returned after X milliseconds" following.

In three out of the four cases, it gets called with no active and no idle connections in that call that gets stuck.
(In reply to Daniel Stenberg [:bagder] from comment #216)
What shall we now do about it? Could I help somehow?
(In reply to Jan Tietjens from comment #217)
> (In reply to Daniel Stenberg [:bagder] from comment #216)
> What shall we now do about it? Could I help somehow?

(Sorry, I've been stuck in other things.) Can you confirm if this problem only happens when you're using a proxy?
I'm a Mac user, currently running OS X 10.10.2 with Firefox 36.0.4.  I've had this bug for years now, and it persists to this day.  Will these fixes be rolled up into the OS X version of Firefox?

For reference, I'm experiencing the same problem in which I'm unable to browse with Firefox if I disconnect from a VPN.  Issue appears with either the Cisco Anyconnect VPN and also the built in VPN client within OS X.  I have to restart Firefox to get it working again.

I'm willing to provide logs if it helps any troubleshooting efforts.
Hi JB,

Until we turn it on by default, you can try going to about:config and setting the network.manage-offline-status pref to true.

Let us know if you encounter issues with this.
(In reply to Valentin Gosu [:valentin] from comment #220)
> Hi JB,
> 
> Until we turn it on by default, you can try going to about:config and
> setting the network.manage-offline-status pref to true.
> 
> Let us know if you encounter issues with this.

Valentin, 

Thank you, this fix worked.  I connected and disconnected from my VPN several times and all works perfectly. Thanks for taking the time to respond!
I have the same problem on Win7 with FF 37.0.2 (similar in previously versions, but not in all!). Setting network.manage-offline-status pref to true does not help, using "Work offline" manually also not. I would appreciate it if you could fix this issue, or provide a workaround, which is better than restart the browser.
(In reply to stefanr from comment #222)
> I have the same problem on Win7 with FF 37.0.2 (similar in previously
> versions, but not in all!).

Can you please be very specific and explain EXACTLY what problem you have?
Yes, it will try:
1) I use my notebook in a docking station, network is connected via LAN cable. Firefox can connect to internet (via proxy).
2) I disconnect the LAN cable (or dock out), the notebook is switching to W-LAN. Firefox cannot connect to internet anymore (in the most cases; sometimes it works, but is very very slow...)
3) An restart of Firefox will solve the problem immediately, but Firefox cannot be stopped normally. I have to kill the process. If I exit normally, the process will remain.

The same problem occurs, if I switch from W-LAN to LAN.

If I disable the proxy to use only websites from intranet I have the same problem too.
Encountered again the issue yesterday with Firefox 38.0.5 on Win7 Pro 64 bits fully up-to-date. Firefox lost connection to network for some reason. Possibly because I put my computer to sleep from one network, and wake it up on a different one... I tried restarting Firefox, disable/enable wireless, VPN in/out, unplug the network cable replug it, disable network adapters re-enable it, flush dns, release/renew dhcp lease, set no proxy, reset proxy setting to using system setting, work offline/online, run windows network troubleshooting... nothing worked.
It only restarted to work when I put my computer to sleep, move from office to home and opened computer again. Possibly because they're is no proxy at home while there is a transparent one at work. Though it happened outside office as well at else locations sometime mainly when using various VPN connections or switching between networks.
I have now set manually the network.manage-offline-status pref to true, hopping this would help, though I am aware that not all issue have been resolved yet. Similar loss of connection with Thunderbird. Both are set to use system proxy settings which is set to auto detect.
It would be expected the applications to sort out the connection issue by itself without a restart or at least the "Work offline/online" option to fully reset the entire network connection (and socket) of the application but that does not seems to be the case at the moment. 
Could some cache information not be fully reset properly?
In my case unfortunately to my despair, I haven't quite clearly manage to identify the reason of the issue nor manage to identify steps to reproduce it so it can be tested and fixed. I'll see how it goes with new option set to true.
Thank you guys for your good work and support. Please keep going we all keep hope that this nasty bug will be soon fully resolved and put back behind us :)
Windows 7
FF 39.0.3
Cisco AnyConnect Secure Mobility Client Ver: 4.1.02011
I am connecting to Intel VPN and of course there is a proxy sitting behind.

I browse internet perfectly, and then I connect to VPN above, and the Firefox connection is lost. 
Only way out of this - 
1: Restart Firefox.
2: Goto  settings - advance - Network - connection-setting -  and then 'toggle' between "Auto detect proxy settings" & "Use system proxy settings".

The interesting fact is that - this toggling is necessary, the Firefox-connection-lose issue happens in both of these options..but once toggled to another mode, the connection is back to normal..

This bug seems to bother so many people who are connected to company VPN and want to work from home.
If it sounds reasonable then please increase the priority to fix it..
(In reply to ashish5679 from comment #226)
> Windows 7
> FF 39.0.3
> Cisco AnyConnect Secure Mobility Client Ver: 4.1.02011
> I am connecting to Intel VPN and of course there is a proxy sitting behind.
> 
> I browse internet perfectly, and then I connect to VPN above, and the
> Firefox connection is lost. 

Let me ask you some further details on this! What exactly does "the Firefox connection is lost" mean? What's the exact error you get from Firefox after you start the VPN? Are you using a proxy with or without that VPN?
Friends in this bug entry. I landed the primary set of patches for this bug a long time ago by now, this is done.

This is way out of control and there's no way to address the problems within this bug since this is already handled and there seem to be several new/related things discussed here. I instead urge you to file new bug reports if you have outstanding problems rather than to fill up this bug - of course with the usual detailed-filled and thoroughness that we appreciate when trying to solve problems.

If you believe the issue you're having is related to this issue or other online/offline/network-change related changes I'll be happy to work with you on driving these forward and ultimately iron all these bugs out,

Thanks for your understanding.
Status: REOPENED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → FIXED
Depends on: 1244617
for above mentioned bug 1244617, 
please refer to the following bug: 

https://bugzilla.mozilla.org/show_bug.cgi?id=1245059
same issue in FF 50.0.1 and cisco any connect, im trying to replace chrome with FF but this issue is very annoying, and developer don't seem to be interested on this issue so i will have to get back to chrome and the devil...

sorry for this.
Baner, this bug was closed a long time ago and it has been working fine for many people. You refer to "this issue" but I don't understand what issue that is.

If you still suffer from a problem in version 50 like you say, I would urge you to file a new bug and give us all the details on exactly what you see and how to reproduce, rather than to add yet another comment to an old, very long-living and closed bug report without enough details for us to work with. Chances are that the underlying reasons are different than they were before we landed these fixes.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: