Closed Bug 465158 Opened 11 years ago Closed 11 years ago

Minefield Nightly fails to initiate dial-up login when using internet connection sharing

Categories

(Core :: Networking, defect, P2, major)

x86
Windows XP
defect

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

(Keywords: fixed1.9.1, regression)

Attachments

(1 file, 11 obsolete files)

The PC has a modem and one NIC. There is ICS enabled on it so, the NIC is assigned a permanent address. Bug 454381 has changed the code to raise dial-up only when all adapters doesn't have any IP address. In this case the code believes there is still a connection and prevents dial-up login dialog to open.

I have to think of a different approach for condition to start dial-up.
Flags: blocking1.9.1?
Attached patch v1 (obsolete) — Splinter Review
This is a quick workaround for this problem. Adapter assigned address 192.168.0.1 is considered as down because it is in no case a way to connect to internet.

More proper but much more complicated solution would be to use ICS API to collect all windows defined connections, enumerate them and check on all of them media type for NCM_SHAREDACCESSHOST_LAN. In that case remeber the name of the adapter/device joint with the connection in a list and then when checking adapters check against that list and ignore such adapters. It is quit heavy and complicated but if you will not like this simple and tested patch I will create a new patch as suggested.
Attachment #348457 - Flags: review?(cbiesinger)
Why is 192.168.0.1 privileged over, say, 10.0.0.1 here?  Is it something that Windows sets up by default without any user-configuration?
Exactly. When ICS is setup you choose one of the connection to WAN to share and one of remaining connections to behave as a gateway. The address of this gateway is then always 192.168.0.1, setup by windows automatically as fixed address.
Flags: blocking1.9.1? → blocking1.9.1+
Whiteboard: [needs review cbiesinger]
Priority: -- → P2
Attachment #348457 - Flags: review?(cbiesinger) → review+
Comment on attachment 348457 [details] [diff] [review]
v1

Hardcoding this address in three different places in three different ways is ugly :/

Why did you change the timeout handling? If you just want to make sure that CheckLinkStatus gets called, why not just add an explicit call to that before the loop?

+                            if (PL_strcmp(ipAddr->IpAddress.String, "0.0.0.0")
+                                && PL_strcmp(ipAddr->IpAddress.String,

Can you put the && on the previous line?
(In reply to comment #4)
> (From update of attachment 348457 [details] [diff] [review])
> Hardcoding this address in three different places in three different ways is
> ugly :/
> 

Is, but there are 3 different pieces of code that do the same check to support different OS versions, so there is no other way.

> Why did you change the timeout handling? If you just want to make sure that
> CheckLinkStatus gets called, why not just add an explicit call to that before
> the loop?
> 

True :) Done so.

> +                            if (PL_strcmp(ipAddr->IpAddress.String, "0.0.0.0")
> +                                && PL_strcmp(ipAddr->IpAddress.String,
> 
> Can you put the && on the previous line?

Done.

I did yet more intensive testing and fixed a crash in this new code.
Attachment #348457 - Attachment is obsolete: true
Attachment #359509 - Flags: review+
Whiteboard: [needs review cbiesinger]
Attachment #359509 - Flags: superreview?(jst)
Attachment #359509 - Flags: superreview?(jst)
Landed as changeset e644a19c8f23.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reopening, the patch was backed out from 1.9.1 because of test timeout and other failures.

I discovered that firing information about link state immediately after start of the address listener leads through observer service to SetOffline(PR_FALSE). When executing yet a second TUnit test this happens inside of XPCOM shutdown and makes all become crazy. Main thread indefinitely loops posting events. I haven't looked further why. If anyone has a clue why this doesn't happen on the trunk, let me know please.

As a quick fix we could remove the immediate state check. However, w/o it the patch will become incomplete, necko won't be aware of the state after start and dialup invoke will become unstable again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Attached patch v1.2 (obsolete) — Splinter Review
This is a new patch. Cause of deadlock was that socket transport service and dns service was started by async notification from the address listener that was received on the main thread after xpcom threads shutdown event. Socket transport service (not the main thread) started to loop with 100% CPU load (there were no sockets to poll nor the pollable event nor events to handle).

Main thread was waiting for the socket transport service thread to shutdown, indefinitely.

My fix (and the only difference from the v1.1 patch) is to drop the mManageOfflineStatus in IOService when xpcom shutdown has been received. This prevents any SetOffline manipulation after shutdown. TUnit tests are passing now. I am not sure of the test_Scriptaculous.html that failed too, it locally fails independently on the patch being applied or not.

This is IMHO bug that were always present, just revealed now. A fixup patch should be landed on trunk too.
Attachment #359509 - Attachment is obsolete: true
Attachment #360838 - Flags: review?(cbiesinger)
It's probably worthwhile to do something in shutdown to prevent *anyone* from trying to go back online, not just the link service (which mManageOfflineStatus controls).
Whiteboard: [has patch][needs review cbiesinger]
Attachment #360838 - Flags: review?(cbiesinger) → review+
Comment on attachment 360838 [details] [diff] [review]
v1.2

r=biesi, though I agree with campd that it would be better to disallow SetOffline(TRUE) after shutdown in general.
Attached patch v2 (obsolete) — Splinter Review
Christian, I was just about to add a new version addressing the Dave's objection. Tested on 1.9.1, no deadlocks.
Attachment #360838 - Attachment is obsolete: true
Attachment #363875 - Flags: review?(cbiesinger)
Attachment #363875 - Flags: review?(cbiesinger) → review+
Comment on attachment 363875 [details] [diff] [review]
v2

+    , mShuttedDown(PR_FALSE)

should be mShutdown

+        // movings with the offline status from now. We must not allow going

movings with -> changes of?
Attachment #363875 - Attachment is obsolete: true
Attachment #364084 - Flags: review+
Whiteboard: [has patch][needs review cbiesinger] → [has patch] needs landing
Keywords: checkin-needed
Whiteboard: [has patch] needs landing → [has patch]
Comment on attachment 364084 [details] [diff] [review]
v2 with addressed r comments
[Backout: Comment 18]


http://hg.mozilla.org/mozilla-central/rev/fe7134fc65ec
Attachment #364084 - Attachment description: v2 with addressed r comments → v2 with addressed r comments [Checkin: Comment 15]
Attachment #359509 - Attachment description: v1.1 → v1.1 [(1.9.1 only !?) Backout: Comment 8]
Attachment #359509 - Attachment description: v1.1 [(1.9.1 only !?) Backout: Comment 8] → v1.1 [(1.9.1 only checkin !) Backout: Comment 8]
(In reply to comment #6)
> Landed as changeset e644a19c8f23.

I can't find this in either m-c nor m-1.9.1 :-/
What is it ?
Status: ASSIGNED → RESOLVED
Closed: 11 years ago11 years ago
Keywords: fixed1.9.1
Resolution: --- → FIXED
Whiteboard: [has patch] → [needs 1.9.1 landing]
Target Milestone: --- → mozilla1.9.2a1
A 30% Ts regression appeared after this landed:

http://graphs.mozilla.org/#show=395018,395046,395006&sel=1235497474,1235843074

this bug was in the changeset:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=37d6dfd70bac&tochange=76592d149b61

#developers highly suspected this patch from that set, and was wondering if we could get it backed out?
Comment on attachment 364084 [details] [diff] [review]
v2 with addressed r comments
[Backout: Comment 18]


Backed out:
http://hg.mozilla.org/mozilla-central/rev/9a17545ac667
http://hg.mozilla.org/mozilla-central/rev/52110c34b6b4
Attachment #364084 - Attachment description: v2 with addressed r comments [Checkin: Comment 15] → v2 with addressed r comments [Backout: Comment 18]
Status: RESOLVED → REOPENED
Depends on: 480464
Keywords: checkin-needed
Resolution: FIXED → ---
Whiteboard: [needs 1.9.1 landing]
I agree this might be cause for Ts. There could be an easy workaround to fix it, one of the first versions of the patch was ready for that. The landed (and now backed out) patch calls some relatively complex code during startup. This apparently has an affect even it is on a background thread. There is no problem to add a delay before we call code first time.

(In reply to comment #4)
> Why did you change the timeout handling? If you just want to make sure that
> CheckLinkStatus gets called, why not just add an explicit call to that before
> the loop?
> 

Now I remember why I did it that way.
Hmm... as I'm thinking of this, the check that the patch does at the start should pass before we try to load a home page or pages from session store. It's needed to correctly show dial-up box when conditions for it are met - say it a different way - we need to know the connection state on all NICs before loading any page at start up to correctly fix this bug...
Status: REOPENED → ASSIGNED
(In reply to comment #18)
> (From update of attachment 364084 [details] [diff] [review])
> 
> Backed out:
> http://hg.mozilla.org/mozilla-central/rev/52110c34b6b4

Talos numbers returned to normal starting with this changeset.
(Pasted from Tinderbox):
id:20090226224734
rev:52110c34b6b4
ts: 457.47
Attached patch v3 (obsolete) — Splinter Review
I verified ones again that w/o detecting the net adapters state at the start this bug cannot be fixed. Also the dial-up dialog is not shown even there is not any ICS and all adapters are unplugged from cables or disconnected from wifi and I set windows to show dial-up every time (even there is a connection).

So, here is a patch that shortens time from 650 ms to half. We was calling GetAdaptersAddresses two times to first determine the buffer size then fill the addresses. Also there is a delay of 1s before the thread determines the link status after it's started. Try server build w/ this patch works as I would expect. Latest trunk doesn't show the dial-up dialog.

Other option is to use CheckAdaptersInfo as a first choice (after a bug in it was fixed). It works for me on an XP machine as expected and takes just <90ms even there is also a duplicate call w/o buffer pre-allocation and the time also includes send of UI notification event. Problem is that I have no chance to test this on other configs with e.g. PPPoE etc and on platforms different from XP. I have sent an email to Juan Lang, original author of this patch to get some closer info from him.
Attachment #364084 - Attachment is obsolete: true
Attachment #365032 - Flags: review?(cbiesinger)
+                        // Also not 192.168.0.1 - the LAN ICS gateway...
+                        table->table[i].dwAddr != 0x0100A8C0 &&

This looks fragile to me.  Is there no other way to check whether ICS is enabled?  Someone might well use 192.168.0.1 in other circumstances (I have done so.)

Also, I'm confused what the cause of the bug is.  Let me assume that you have a machine with a modem with no IP address assigned, and a network card with IP address 192.168.0.1 assigned by ICS.  With the existing code, the machine is assumed to be online, because the address 192.168.0.1 is assigned to the network code.  With your patch, the machine is assumed to be offline.

How does this cause the dial-up login to appear?
(In reply to comment #23)
> +                        // Also not 192.168.0.1 - the LAN ICS gateway...
> +                        table->table[i].dwAddr != 0x0100A8C0 &&
> 
> This looks fragile to me.  Is there no other way to check whether ICS is
> enabled?  Someone might well use 192.168.0.1 in other circumstances (I have
> done so.)
> 

It's a workaround. API to detect an adapter is quit heavy and I assume also very slow. It was decided to use this approach. I have never personally seen an adapter configured under windows to 192.168.0.1 unless it is a gateway. Gateway-configured adapter has not to be used to decide on link state.

> How does this cause the dial-up login to appear?

When a cable is connected to the ICS gate configured adapter (with 192.168.0.1), what happens very often, your code decides that we have a link up and prevents the dial-up. I had for some time very similar configuration at home.

I am more interested in your opinion to use the GetAdaptersInfo and GetOperationalStatus functions (CheckAdaptersInfo method) as a first choice which is much faster then GetAdaptersInfo. Why exactly have you decided to use GetAdapterAddresses function first?
> I have never personally seen an adapter configured under windows to
> 192.168.0.1 unless it is a gateway.

As I already implied, I have.  It's not a safe assumption.

> API to detect an adapter is quit heavy and I assume also
> very slow.

Don't assume, check.  Also, you can check for the presence of a registry value to decide if ICS is enabled on Windows XP.  See e.g.
http://www.cisco.com/en/US/products/sw/secursw/ps2308/products_tech_note09186a0080094b3c.shtml

> I am more interested in your opinion to use the GetAdaptersInfo and
> GetOperationalStatus functions (CheckAdaptersInfo method) as a first choice
> which is much faster then GetAdaptersInfo. Why exactly have you decided to use
> GetAdapterAddresses function first?

I preferred GetAdaptersAddresses because it explicitly states whether an adapter is a loopback adapter (IF_TYPE_SOFTWARE_LOOPBACK), and because it supports IPv6.  GetAdaptersInfo does neither.
Juan, thanks a lot for your quick feedback. 

I'll check the registry entry value and change the patch if it works for me. Do you personally have any suggestion how to avoid the Ts regression?
> Juan, thanks a lot for your quick feedback. 

Happy to be of help.

> Do you personally have any suggestion how to avoid the Ts regression?

No, sorry.  Good luck.
Comment on attachment 365032 [details] [diff] [review]
v3

per comment 26 clearing review request until you figured out if you want to go with this patch or something else
Attached patch pre-v4 (WIP) (obsolete) — Splinter Review
This is just a work in progress for consideration.

The idea with registry entry doesn't work for me. It is not changing when I turn on and off sharing. So, I'm using the ICS API to figure out if an online adapter assigned 192.168.0.1 is used as an ICS gateway. This disallows considering such adapters as 'not internet connected' when using NAT software different from internal ICS.

I was running Ts tests with an optimized build on 2G dual core intel notebook. 
This patch rise Ts by cca 160ms (1280ms - 1120ms w/o the patch).

I checked ones again that w/o any connection (ICS may not be involved) dial-up is not shown at the start w/o this patch. We introduced this issue by fixing bug 454381 that landed on mozilla-central before 1.9.1 branching.

Other solution then Ts increase seems to me to backout bug 454381 and force link state determination (with code from this patch) when a network request failed and after that decide if to dial-up or not.
Attachment #365032 - Attachment is obsolete: true
This approach looks better.  A few comments:

@@ -311,21 +326,25 @@ nsNotifyAddrListener::CheckIPAddrTable(v
(snip)
                 if (GetOperationalStatus(table->table[i].dwIndex) >=
                         MIB_IF_OPER_STATUS_CONNECTED &&
-                        table->table[i].dwAddr != 0)
+                        table->table[i].dwAddr != 0 &&
+                        // Also not 192.168.0.1 - the LAN ICS gateway...
+                        table->table[i].dwAddr != 0x0100A8C0 &&
+                        // ...and nor a loopback
+                        table->table[i].dwAddr != 0x0100007F)

This change seems superfluous.  GetAdaptersAddresses is available on all supported platforms where ICS is available.  ICS is also available on Windows 98, where GetAdaptersAddresses is not available, but Mozilla doesn't support that anymore.

+                                PL_strcmp(ipAddr->IpAddress.String,
+                                    "192.168.0.1")) {

Same here.

-    ULONG len = 0;
+    ULONG len = 16384;
 
-    DWORD ret = sGetAdaptersAddresses(AF_UNSPEC, 0, NULL, NULL, &len);
+    PIP_ADAPTER_ADDRESSES addresses = (PIP_ADAPTER_ADDRESSES) malloc(len);
+    if (!addresses)
+        return ERROR_BUFFER_OVERFLOW;
+
+    DWORD ret = sGetAdaptersAddresses(AF_UNSPEC, 0, NULL, addresses, &len);
(snip)

This change is an attempt to address the Ts regression, yes?  It seems that this is logically distinct from fixing this bug (Minefield fails to initiate dial-up) and should be in a separate patch.

Thanks very much for working on this.
We don't take regressions and then patch them with a separate changeset (I mean, not intentionally ;-). It's ok to have a patch that fixes the bug and has other changes necessary to avoid a Ts regression.

/be
Attached patch v4 (obsolete) — Splinter Review
- check of the link status moved to nsNotifyAddrListener::GetIsLinkUp, this way it's called when we fail a page load and ensures correct dial-up; best would be to do this asynchronously, but it means huge changes to nsSocketTransport::RecoverFromError()
- reverted previous changes to Win2000 and 98 compatibility code as on those plaforms could not be recognized if it were an ICS private adapter; rather make people do a manual dial-up then annoy them with dial-up on every FF start
- I'm missing synchronization of memebers in nsNotifyAddrListener, but it has already got r+ in the past...
Attachment #365526 - Attachment is obsolete: true
Attachment #365748 - Flags: review?(cbiesinger)
I know you're waiting for Christian's review, I just thought I'd chime in that I like this one.  Thanks again.
Attached patch v4 (obsolete) — Splinter Review
Just fixed early return w/o CoUninitialize() call from nsNotifyAddrListener::CheckICSStatus when CoCreateInstance of NetSharingManager failed. Otherwise identacial to the previous patch.
Attachment #365748 - Attachment is obsolete: true
Attachment #366290 - Flags: review?(cbiesinger)
Attachment #365748 - Flags: review?(cbiesinger)
Comment on attachment 366290 [details] [diff] [review]
v4

+    cohr = CoInitialize(NULL);

Quoting MSDN:
" New applications should call CoInitializeEx instead of CoInitialize. "

+    INetSharingManager *netSharingManager;

use an nsRefPtr for this and the other objects? (getter_AddRefs on an nsRefPtr lets you get a void**)


Don't use __uuidof, gcc/mingw doesn't support it. There should be an IID_ constant for this, right?

+            if (connectionVariant.vt != VT_UNKNOWN)
+                continue;

Shouldn't you call VariantClear before continue;?

+                if (!wcscmp(properties->pszwName, aAdapterName))
+                    isICSGatewayAdapter = TRUE;
+
+                connection->Release();

Have to free properties as well (NcFreeNetconProperties)

+    if (cohr == S_OK)
+        CoUninitialize();


Quoting MSDN:
"To close the COM library gracefully, each successful call to CoInitialize or CoInitializeEx, including those that return S_FALSE, must be balanced by a corresponding call to CoUninitialize"

+    if (!addresses)
+        return ERROR_BUFFER_OVERFLOW;

Not ERROR_OUTOFMEMORY?
Attachment #366290 - Flags: review?(cbiesinger) → review+
Attached patch v4.1 [Backout comment 38] (obsolete) — Splinter Review
Thanks for the review, comments addressed.
Attachment #366290 - Attachment is obsolete: true
Attachment #366466 - Flags: review+
Comment on attachment 366466 [details] [diff] [review]
v4.1 [Backout comment 38]

http://hg.mozilla.org/mozilla-central/rev/c3d89722f165
Attachment #366466 - Attachment description: v4.1 → v4.1 [Checkin comment 37]
Status: ASSIGNED → RESOLVED
Closed: 11 years ago11 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
I backed this out because we had a windows-only Ts regression.

http://hg.mozilla.org/mozilla-central/rev/25b0bf152130
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It appears that this patch caused the (very large) regression.
Hmm... only thing that could cause this is load of the Netshell.dll so early. We could potentially load all necessary dlls later as we don't need them at the very start.

I'll create a new patch ASAP and let's give it a new try.
Status: REOPENED → ASSIGNED
(In reply to comment #41)
> Hmm... only thing that could cause this is load of the Netshell.dll so early.
> We could potentially load all necessary dlls later as we don't need them at the
> very start.
> 
> I'll create a new patch ASAP and let's give it a new try.

The TryServers should be able to detect a Ts regression for you.

https://wiki.mozilla.org/Build:TryServer
So, I adjusted the patch and ran try server build. My Ts result is 543ms. At that time it was pretty stable at ~533ms, otherwise. I am loosing ideas what to do now. Could that be added static linking to oleaut32.lib necessary just to have VariantClear function?
Attached patch v4.3 (obsolete) — Splinter Review
This has just passed Ts test on the try server. There is no increase, but I have no function VariantClear that requires oleaut32.lib linking.

I changed only nsNotifyAddrListener::CheckICSStatus method.
Attachment #366466 - Attachment is obsolete: true
Attachment #367312 - Flags: review?(cbiesinger)
Attachment #366466 - Attachment description: v4.1 [Checkin comment 37] → v4.1 [Backout comment 38]
Attachment #367312 - Flags: review?(cbiesinger) → review+
Comment on attachment 367312 [details] [diff] [review]
v4.3

+                // We should call VariantClear here but it needs to link
+                // with oleaut32.lib that produces a Ts incrase about 10ms
+                // that is undesired. As it is quit unlikly the result would
+                // be of a different type anyway, let's pass the variant
+                // unfreed here.
+                continue;


unlikly -> unlikely

Can you add an NS_ERROR here?

Also are you sure it's the linking to oleaut32 that's causing the Ts increase and not actual clearing of variants?
(In reply to comment #46)
> Also are you sure it's the linking to oleaut32 that's causing the Ts increase
> and not actual clearing of variants?

The Ts increase was measured on a machine w/o ICS turned on that is a condition to let the code using VarianClear run. The last patch landed on trunk was also loading the Netshell.dll at the very start, that's why so large regression (~150ms), now deferred. On the try server I was testing two patches, one linking to oleaut32 and other not linking. The first increased Ts by 10ms. The other left it unaffected.
OK, thanks for testing. That is somewhat unfortunate, but I guess not that bad...
Status: ASSIGNED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Keywords: checkin-needed
Whiteboard: [needs landing]
Comment on attachment 368021 [details] [diff] [review]
as landed [Checkin comment 49][Checkin 1.9.1 comment 50]

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/4ccf1639165c
Attachment #368021 - Attachment description: as landed [Checkin comment 49] → as landed [Checkin comment 49][Checkin 1.9.1 comment 50]
Whiteboard: [needs landing]
Duplicate of this bug: 165335
You need to log in before you can comment on or make changes to this bug.