Closed Bug 854176 Opened 11 years ago Closed 5 years ago

[Win7] crash in [CPrivAlloc::operator delete] with MSIE 10 after calling InternetGetConnectedStateExW in nsWindowsSystemProxySettings.cpp

Categories

(Core :: General, defect)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: scoobidiver, Unassigned)

References

Details

(Keywords: crash, steps-wanted, Whiteboard: [tbird crash][ms-support][REG:114063011580728])

Crash Data

Attachments

(6 files)

It started spiking on March 14 across all Firefox versions.
It's #72 browser crasher in 19.0.2, #184 in 20.0b6, #63 in 21.0a2, and #211 in 22.0a1 with a rising tendency.

Windows Update from March 13 on Windows 7 contains IE 10 and other security fixes non-specific to Windows 7.

It's correlated to API contracts (see http://technet.microsoft.com/en-us/subscriptions/hh802935%28v=vs.85%29.aspx):
  CPrivAlloc::operator delete(void*)|EXCEPTION_PRIV_INSTRUCTION (98 crashes)
    100% (98/98) vs.   2% (3552/177218) api-ms-win-downlevel-shlwapi-l2-1-0.dll
          2% (2/98) vs.   0% (91/177218) 6.2.9200.16440
         98% (96/98) vs.   2% (3461/177218) 6.2.9200.16492
    100% (98/98) vs.   2% (3585/177218) netprofm.dll
        100% (98/98) vs.   2% (3535/177218) 6.1.7600.16385
    100% (98/98) vs.   2% (3615/177218) api-ms-win-downlevel-advapi32-l2-1-0.dll
          2% (2/98) vs.   0% (91/177218) 6.2.9200.16440
         98% (96/98) vs.   2% (3524/177218) 6.2.9200.16492
    100% (98/98) vs.   3% (4439/177218) npmproxy.dll
        100% (98/98) vs.   2% (4362/177218) 6.1.7600.16385
    100% (98/98) vs.   4% (7149/177218) api-ms-win-downlevel-ole32-l1-1-0.dll
          2% (2/98) vs.   0% (126/177218) 6.2.9200.16440
         98% (96/98) vs.   4% (7023/177218) 6.2.9200.16492
    100% (98/98) vs.   4% (7582/177218) api-ms-win-downlevel-normaliz-l1-1-0.dll
          2% (2/98) vs.   0% (128/177218) 6.2.9200.16440
         98% (96/98) vs.   4% (7454/177218) 6.2.9200.16492
    100% (98/98) vs.   4% (7582/177218) api-ms-win-downlevel-user32-l1-1-0.dll
          2% (2/98) vs.   0% (128/177218) 6.2.9200.16440
         98% (96/98) vs.   4% (7454/177218) 6.2.9200.16492
    100% (98/98) vs.   4% (7582/177218) api-ms-win-downlevel-version-l1-1-0.dll
          2% (2/98) vs.   0% (128/177218) 6.2.9200.16440
         98% (96/98) vs.   4% (7454/177218) 6.2.9200.16492
    100% (98/98) vs.   4% (7590/177218) api-ms-win-downlevel-shlwapi-l1-1-0.dll
          2% (2/98) vs.   0% (128/177218) 6.2.9200.16440
         98% (96/98) vs.   4% (7462/177218) 6.2.9200.16492
    100% (98/98) vs.   4% (7593/177218) api-ms-win-downlevel-advapi32-l1-1-0.dll
          2% (2/98) vs.   0% (128/177218) 6.2.9200.16440
         98% (96/98) vs.   4% (7465/177218) 6.2.9200.16492
    100% (98/98) vs.  17% (30090/177218) slc.dll
        100% (98/98) vs.  17% (30038/177218) 6.1.7600.16385
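
Each correlation line above compares how often a module (or module version) shows up in crashes with this signature against how often it shows up in all crashes. A minimal sketch of that arithmetic follows, assuming the report rounds to the nearest whole percent; `RoundedPercent` and `Lift` are hypothetical helper names, not part of crash-stats.

```cpp
#include <cassert>
#include <cstdint>

// Share of `hits` in `total` as an integer percentage, rounded to nearest.
// Assumption: the correlation report rounds to the nearest whole percent.
inline int RoundedPercent(std::uint64_t hits, std::uint64_t total) {
  assert(total > 0);
  return static_cast<int>((100 * hits + total / 2) / total);
}

// Ratio between a module's frequency in this signature and its frequency in
// all crashes. Values far above 1 mean the module is over-represented.
inline double Lift(std::uint64_t sigHits, std::uint64_t sigTotal,
                   std::uint64_t allHits, std::uint64_t allTotal) {
  assert(sigTotal > 0 && allTotal > 0 && allHits > 0);
  return (static_cast<double>(sigHits) / sigTotal) /
         (static_cast<double>(allHits) / allTotal);
}
```

With the numbers above, `Lift(98, 98, 3552, 177218)` for api-ms-win-downlevel-shlwapi-l2-1-0.dll is about 50, while `Lift(98, 98, 30090, 177218)` for slc.dll is about 6, so slc.dll is a much weaker signal.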

Signature 	CPrivAlloc::operator delete(void*)
UUID	3c09ace5-e293-40fd-bd6e-799802130324
Date Processed	2013-03-24 04:16:43
Uptime	6360
Install Age	2.3 weeks since version was first installed.
Install Time	2013-03-08 02:11:06
Product	Firefox
Version	19.0.2
Build ID	20130307023931
Release Channel	release
OS	Windows NT
OS Version	6.1.7601 Service Pack 1
Build Architecture	x86
Build Architecture Info	GenuineIntel family 6 model 37 stepping 5
Crash Reason	EXCEPTION_ACCESS_VIOLATION_READ
Crash Address	0xffffffffffffffff
User Comments	Was playing a game and it turned off.
App Notes	
AdapterVendorID: 0x8086, AdapterDeviceID: 0x0046, AdapterSubsysID: 05261028, AdapterDriverVersion: 8.15.10.2342
D2D? D2D+ DWrite? DWrite+ D3D10 Layers? D3D10 Layers+ 
Processor Notes 	sp-processor05.phx1.mozilla.com_7968:2008
EMCheckCompatibility	True
Adapter Vendor ID	0x8086
Adapter Device ID	0x0046
Total Virtual Memory	4294836224
Available Virtual Memory	3566931968
System Memory Use Percentage	73
Available Page File	1501822976
Available Physical Memory	530780160

Frame 	Module 	Signature 	Source
0 	ole32.dll 	CPrivAlloc::operator delete 	
1 	ole32.dll 	CClientContextActivator::CreateInstance 	
2 	ole32.dll 	CComActivator::DoCreateInstance 	
3 	ole32.dll 	CoCreateInstanceEx 	
4 	ole32.dll 	CoCreateInstance 	
5 	netprofm.dll 	CPubINetworkListManager::EnsureNLPConnected 	
6 	netprofm.dll 	CPubINetworkListManager::GetNetworks 	
7 	wininet.dll 	NETWORK_MANAGER::ReadGuidsForConnectedNetworks 	
8 	wininet.dll 	InternalReadGuidsForConnectedNetworks 	
9 	wininet.dll 	CSwpadSupport::ReadIdsForConnectedNetworks 	
10 	wininet.dll 	NETWORK_MANAGER::SetWpadDecisionForCurrentNetwork 	
11 	wininet.dll 	AutoProxyResolver::UpdateAutoproxyWithCompletedDetection 	
12 	wininet.dll 	AutoProxyWpadAndResultThread 	
13 	wininet.dll 	RefCountWorkItemThread 	
14 	ntdll.dll 	RtlpTpWorkCallback 	
15 	ntdll.dll 	TppCallbackCheckThreadBeforeCallback 	
16 	kernel32.dll 	BaseThreadInitThunk 	
17 	ntdll.dll 	__RtlUserThreadStart 	
18 	ntdll.dll 	_RtlUserThreadStart

More reports at:
https://crash-stats.mozilla.com/report/list?signature=CPrivAlloc%3A%3Aoperator+delete%28void*%29
With combined signatures, it's even higher: #40 in 19.0.2, #91 in 20.0b6, #30 in 21.0a2, and #103 in 22.0a1.

More reports also at:
https://crash-stats.mozilla.com/report/list?signature=%400x0+|+CClientContextActivator%3A%3ACreateInstance%28IUnknown*%2C+IActivationPropertiesIn*%2C+IActivationPropertiesOut**%29
https://crash-stats.mozilla.com/report/list?signature=CClientContextActivator%3A%3ACreateInstance%28IUnknown*%2C+IActivationPropertiesIn*%2C+IActivationPropertiesOut**%29
Crash Signature: [@ CPrivAlloc::operator delete(void*)] → [@ CPrivAlloc::operator delete(void*)] [@ @0x0 | CClientContextActivator::CreateInstance(IUnknown*, IActivationPropertiesIn*, IActivationPropertiesOut**) ] [@ CClientContextActivator::CreateInstance(IUnknown*, IActivationPropertiesIn*, IActivationProper…
That seems bad, since none of our code is on that thread's stack.
Summary: [Win7] crash in CPrivAlloc::operator with MSIE 10 → [Win7] crash in [CPrivAlloc::operator delete] with MSIE 10
Some brief searching indicates that the most likely reason for this crash is that somebody is calling CoCreateInstance without first having called CoInitialize. In this case it seems clear that the code at fault is either wininet or netprofm (and certainly not Mozilla code).
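
The rule behind that diagnosis is that a thread has to initialize COM (CoInitialize or CoInitializeEx) before it calls CoCreateInstance, and has to balance a successful initialization with CoUninitialize. A minimal sketch of that pairing as a scope guard follows. `ScopedInit` and `RunWithCom` are hypothetical names, not Mozilla or Windows APIs, and the Windows-only part is compiled only under `_WIN32`. It illustrates the rule for application code and would not by itself fix this crash, since the failing call is made inside wininet/netprofm on their own worker thread.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Runs `init` on construction and runs `uninit` on destruction only if
// `init` succeeded, so the two calls stay balanced on every exit path.
class ScopedInit {
 public:
  ScopedInit(const std::function<bool()>& init, std::function<void()> uninit)
      : uninit_(std::move(uninit)), ok_(init()) {
    assert(uninit_ && "uninit callback must be callable");
  }
  ~ScopedInit() {
    if (ok_) uninit_();
  }
  ScopedInit(const ScopedInit&) = delete;
  ScopedInit& operator=(const ScopedInit&) = delete;
  bool ok() const { return ok_; }

 private:
  std::function<void()> uninit_;
  bool ok_;
};

#ifdef _WIN32
#include <windows.h>
#include <objbase.h>  // CoInitializeEx / CoUninitialize; link with ole32.lib

// Windows binding: initialize COM for the calling thread, run `body`
// (which may then call CoCreateInstance), and uninitialize afterwards.
inline bool RunWithCom(const std::function<void()>& body) {
  ScopedInit com(
      [] { return SUCCEEDED(CoInitializeEx(nullptr, COINIT_MULTITHREADED)); },
      [] { CoUninitialize(); });
  if (!com.ok()) return false;  // e.g. RPC_E_CHANGED_MODE: do not uninitialize
  body();
  return true;
}
#endif
```

The guard is written against plain callbacks so that the balancing logic can be exercised without the Windows SDK; only `RunWithCom` touches COM.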

Interesting crash comment: "Closed Laptop at home (WLAN-Connection), restarted Laptop in Company (Connected to Company Ethernet)"

QA, could you take a fully up-to-date Windows 7 laptop and try playing around with steps such as these to see if we can come up with steps to reproduce that we can give to Microsoft?
Flags: needinfo?
Juan, could you find someone to look at the QA action in comment #3, last paragraph?

(I asked Tracy, but he doesn't have access to a machine like that.)
Flags: needinfo? → needinfo?(jbecerra)
With combined signatures, it's #12 top browser crasher in 20.0 and #20 in 21.0b1.
Keywords: topcrash
Flags: needinfo?(mschifer)
Marcia has a laptop that should work.
Flags: needinfo?(mschifer)
Flags: needinfo?(jbecerra)
QA Contact: mozillamarcia.knous
So far no luck trying to repro with a fully up-to-date Windows 7 laptop. I tried a variety of different things listed in the comments. I concentrated on Facebook operations since Facebook URLs come up top in the crash URLs.

One avenue I did pursue was visiting http://www.pinoy-ako.info/tv-show-replay/66-bubble-gang/84034-bubble-gang-05-april-2013.html, which was one of the URLs listed that was spoofing a HD version of flash. After installing lots of different programs and browsing around flash content, still no crashes.
I also have another machine in the lab that I am updating now and will try as well.
It seems clear that the crash is happening in Windows network proxy code. I think any testing ought to focus on setting up and using the network with a proxy and then doing things like switching networks, turning wifi off and on, etc.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130427 Firefox/23.0 ID:20130427030919 CSet: 0e45f1b9521f

I just got this crash on an unstable WiFi connection. The connection randomly times out (without Windows reporting connection loss to the AP), with network graphs flatlining for a few seconds every now and then. It's probably not an issue with proxy code because I have set my proxy preference to "No proxy".

bp-a410e448-86fc-4c59-8a74-07cda2130427
Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0

I've seen this crash too. I believe it happened after experimenting with a VPN connection.

bp-8c60b7f0-3970-49f2-83f3-8bc2b2130425

What makes this a real pain is that when FF starts the next time, it has forgotten the session information (i.e. all open tabs), and reset all installed extensions to their defaults. This pretty much renders the profile useless.

I've seen this odd behavior a number of times, previously the trigger possibly was an unavailable Wifi connection. But I couldn't nail it down to the Wifi yet. When this undesired 'profile reset' happened, FF did not always crash before. It just came up like this.
(In reply to Christian Riechers from comment #11)
Forgot to mention, I have no proxy configured.
(In reply to Christian Riechers from comment #11)

> What makes this a real pain is that when FF starts the next time, it has
> forgotten the session information (i.e. all open tabs), and reset all
> installed extensions to their defaults. This pretty much renders the profile
> useless.

I've never experienced that (in relation to this crash or otherwise). It must be a different problem.
Blocks: KB2670838
Hi!

We have the same trouble with our products, Kaspersky Internet Security 2013/2014 beta.
After a lot of research we stopped using some WinInet functionality for network state detection, such as
InternetGetConnectedStateEx, InternetQueryOption, etc., on Vista+ platforms and migrated to similar
functionality via INetworkListManager:
NLM_CONNECTIVITY state;
HRESULT hr = networkListManager->GetConnectivity(&state);
etc.

We observed that this resolved the trouble.
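
The `GetConnectivity` call in the fragment above fills `state` with a bitmask of `NLM_CONNECTIVITY` flags. A minimal sketch of interpreting that mask follows. The bit values are mirrored from the `NLM_CONNECTIVITY` enum in `<netlistmgr.h>` so that the sketch builds without the Windows SDK; the helper names are hypothetical, and the COM setup needed to obtain an `INetworkListManager` pointer is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

namespace nlm {
// Mirrors of NLM_CONNECTIVITY_* values from <netlistmgr.h>.
constexpr std::uint32_t kDisconnected     = 0x0000;
constexpr std::uint32_t kIpv4LocalNetwork = 0x0020;
constexpr std::uint32_t kIpv4Internet     = 0x0040;
constexpr std::uint32_t kIpv6LocalNetwork = 0x0200;
constexpr std::uint32_t kIpv6Internet     = 0x0400;
}  // namespace nlm

// True when either address family reports Internet connectivity.
constexpr bool HasInternet(std::uint32_t state) {
  return (state & (nlm::kIpv4Internet | nlm::kIpv6Internet)) != 0;
}

// True when no connectivity bit is set at all.
constexpr bool IsDisconnected(std::uint32_t state) {
  return state == nlm::kDisconnected;
}
```

On Windows the value passed in would be the `state` variable from the fragment, cast to an unsigned integer.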
I found network state checking in
toolkit\system\windowsproxy\nsWindowsSystemProxySettings.cpp

static nsresult ReadInternetOption(uint32_t aOption, uint32_t& aFlags,
                                   nsAString& aValue)
{
    DWORD connFlags = 0;
    WCHAR connName[RAS_MaxEntryName + 1];
    MOZ_SEH_TRY {
        InternetGetConnectedStateExW(&connFlags, connName,
                                     mozilla::ArrayLength(connName), 0);
    } MOZ_SEH_EXCEPT(EXCEPTION_EXECUTE_HANDLER) {
        return NS_ERROR_FAILURE;
    }
...
Mitch, is refactoring to use INetworkListManager as mentioned in comment 14 a reasonable workaround?

I see also bug 817568 and bug 829518 where we've been wrapping windows functions in __try/__except blocks, which is a recipe for disaster.
Flags: needinfo?(mitchell.field)
I just crashed here, bp-064fde15-81cf-4f41-9d29-ef8342130531

I had suspended my laptop at home and then woke it up again at home. I do not use a proxy.

Incidentally, chatzilla (running on XULRunner) also crashed at the exact same time. I didn't get a stack for that though...
This is the #7 topcrash in 21.0 at this time, do we have any plans on what to do here?
I asked rstrong about this a few weeks ago and he said that one of bbondy/jmathies would help coordinate filing a ticket with Microsoft. I'm going to speculatively assign this based on that information.

If we need to work around the issue on our side by switching to INetworkListManager, I don't know who the correct assignee would be. jduell might be able to help.
Assignee: nobody → netzen
Summary: [Win7] crash in [CPrivAlloc::operator delete] with MSIE 10 → [Win7] crash in [CPrivAlloc::operator delete] with MSIE 10 after calling InternetGetConnectedStateExW in nsWindowsSystemProxySettings.cpp
Depends on: 880716
Is there any way to correlate this somehow with what the other threads are doing?
We cannot give that out in general, but we can for employees or volunteers who specifically agree. bent/Christian Riechers/John Volikas, may I give MS your minidumps?
Flags: needinfo?(ferongr)
Flags: needinfo?(chriechers)
Flags: needinfo?(bent.mozilla)
Totally fine for me.
Flags: needinfo?(bent.mozilla)
Oh sorry, responded across bugs. MS would like access to some minidumps of this crash.

jimm, I can certainly do some analysis of what other threads are doing. Do you mean the main thread, or some other thread?
Sure, go ahead.
Flags: needinfo?(ferongr)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #20)
I'm fine with that.
Flags: needinfo?(mitchell.field)
Flags: needinfo?(chriechers)
Do we have anyone who can reproduce this reliably? MS is requesting a complete minidump or, better, a full dump of this crash. They are asking whether we have found a way to reproduce it in our lab.
Jim, I sent you two minidumps for this, right? Full dumps are going to be harder, although if somebody on this bug (Christian/John/bent) can reproduce this semi-reliably, I can give you instructions for collecting a full dump from it.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #26)
> Jim, I sent you two minidumps for this, right? Full dumps are going to be
> harder, although if somebody on this bug (Christian/John/bent) can reproduce
> this semi-reliably, I can give you instructions for collecting a full dump
> from it.

They were hoping we had a full dump. I'll forward the email to you.
It looks like volume of this is dropping on release since MS Patch Tuesday. I will only trust this if the dropping volume persists, though. Seen too much already here.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #28)
> It looks like volume of this is dropping on release since MS Patch Tuesday.
> I will only trust this if the dropping volume persists, though. Seen too
> much already here.

I can't seem to get any historical graphs out of crashstats currently. Do we know when this started to fall off?
Assignee: netzen → jmathies
(In reply to Jim Mathies [:jimm] from comment #29)
> I can't seem to get any historical graphs out of crashstats currently. Do we
> know when this started to fall off?
Use the date field (see https://crash-stats.mozilla.com/report/list?date=2013-07-17&signature=CPrivAlloc%3A%3Aoperator+delete%28void*%29) and go back until the crash volume gets stable. You'll find July 10. Then go to https://technet.microsoft.com/en-us/security/bulletin/ms13-jul

The first correlations are:
    100% (239/239) vs.   5% (5921/109122) netprofm.dll
          0% (0/239) vs.   0% (2/109122) 6.0.6000.16386
          0% (0/239) vs.   0% (8/109122) 6.0.6001.18000
        100% (239/239) vs.   5% (5854/109122) 6.1.7600.16385
          0% (0/239) vs.   0% (6/109122) 6.2.9200.16384
          0% (0/239) vs.   0% (2/109122) 6.2.9200.16518
          0% (0/239) vs.   0% (48/109122) 6.2.9200.16604
          0% (0/239) vs.   0% (1/109122) 6.3.9431.0
    100% (239/239) vs.   6% (6659/109122) npmproxy.dll
          0% (0/239) vs.   0% (13/109122) 6.0.6000.16386
          0% (0/239) vs.   0% (1/109122) 6.1.7100.0
          0% (0/239) vs.   0% (1/109122) 6.1.7600.16384
        100% (239/239) vs.   6% (6546/109122) 6.1.7600.16385
          0% (0/239) vs.   0% (11/109122) 6.2.9200.16384
          0% (0/239) vs.   0% (4/109122) 6.2.9200.16518
          0% (0/239) vs.   0% (82/109122) 6.2.9200.16604
          0% (0/239) vs.   0% (1/109122) 6.3.9431.0
Oddly, these DLLs are the latest versions for Windows 7.
(In reply to Scoobidiver from comment #30)
> (In reply to Jim Mathies [:jimm] from comment #29)
> > I can't seem to get any historical graphs out of crashstats currently. Do we
> > know when this started to fall off?
> Use the date field (see
> https://crash-stats.mozilla.com/report/list?date=2013-07-
> 17&signature=CPrivAlloc%3A%3Aoperator+delete%28void*%29) and go back until
> the crash volume gets stable. You'll find July 10. Then go to
> https://technet.microsoft.com/en-us/security/bulletin/ms13-jul
> 
> The first correlations are:
>     100% (239/239) vs.   5% (5921/109122) netprofm.dll
>           0% (0/239) vs.   0% (2/109122) 6.0.6000.16386
>           0% (0/239) vs.   0% (8/109122) 6.0.6001.18000
>         100% (239/239) vs.   5% (5854/109122) 6.1.7600.16385
>           0% (0/239) vs.   0% (6/109122) 6.2.9200.16384
>           0% (0/239) vs.   0% (2/109122) 6.2.9200.16518
>           0% (0/239) vs.   0% (48/109122) 6.2.9200.16604
>           0% (0/239) vs.   0% (1/109122) 6.3.9431.0
>     100% (239/239) vs.   6% (6659/109122) npmproxy.dll
>           0% (0/239) vs.   0% (13/109122) 6.0.6000.16386
>           0% (0/239) vs.   0% (1/109122) 6.1.7100.0
>           0% (0/239) vs.   0% (1/109122) 6.1.7600.16384
>         100% (239/239) vs.   6% (6546/109122) 6.1.7600.16385
>           0% (0/239) vs.   0% (11/109122) 6.2.9200.16384
>           0% (0/239) vs.   0% (4/109122) 6.2.9200.16518
>           0% (0/239) vs.   0% (82/109122) 6.2.9200.16604
>           0% (0/239) vs.   0% (1/109122) 6.3.9431.0
> Oddly, these DLLs are the latest versions for Windows 7.

Do you know if ms revs version numbers for this type of release? I'm curious if we can confirm that machines with the update on the 9th are not experiencing this crash.
It spikes again.
I'm seeing the same crash in our product, which is not connected to Firefox by any means.
So far, I believe the following stacks are all instances of the same bug:

1) * | wininet!AutoProxyWpadAndResultThread | *
2) * | wininet!NETWORK_MANAGER::ReadGuidsForConnectedNetworks | *
3) * | wininet!InternalReadGuidsForConnectedNetworks | *

In our product, I also have 'CFastBH::Get | *' filtered to this bug.
I've got a full dump for 'AutoProxyWpadAndResultThread' type.
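
The `* | frame | *` patterns above say that a report belongs to this bug if the named frame appears anywhere in its stack. A minimal sketch of that bucketing rule follows; `MatchesWpadCrash` is a hypothetical helper, and it assumes frames are given as `module!symbol` strings without offsets.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Returns true if any frame of `stack` is one of the three wininet frames
// listed in the comment above, regardless of its position in the stack.
inline bool MatchesWpadCrash(const std::vector<std::string>& stack) {
  static const std::array<const char*, 3> kMarkers = {
      "wininet!AutoProxyWpadAndResultThread",
      "wininet!NETWORK_MANAGER::ReadGuidsForConnectedNetworks",
      "wininet!InternalReadGuidsForConnectedNetworks"};
  return std::any_of(stack.begin(), stack.end(), [](const std::string& frame) {
    return std::any_of(kMarkers.begin(), kMarkers.end(),
                       [&](const char* marker) { return frame == marker; });
  });
}
```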

The first report was received on 20mar13.

Here's the grouping of automatic reports for our product by version of wininet.dll:
10.0.9200.16521 - 38 reports  (Initial IE10)
10.0.9200.16540 - 74 reports  (IE10.0.4 KB2817183)
10.0.9200.16576 - 80 reports  (IE10.0.5 KB2829530)
10.0.9200.16611 - 180 reports (probably MS13-047)
10.0.9200.16618 - 12 reports  (IE10.0.6 KB2838727)
10.0.9200.16635 - 92 reports  (IE10.0.7 KB2846071)

As you can see, all reports belong to IE10.
10.0.9200.16635 is the most current as of 29jul13 and it still crashes.
I found a computer in my office which has such crashes almost every day. What's interesting in that computer is that it goes to sleep every 30 minutes during the night (power saving policy) but due to misconfiguration it will always wake a minute later. And, like it was hinted before, the crashes are always a few seconds after the computer wakes! I checked the last 6 crashes or so against the 'Control panel | Event Log | System' and figured there's no need to check the rest.

Also, today I got a support ticket from one of our clients with the same problem, and once again, his computer goes to sleep due to idling, 4 minutes later it's woken up and 27 seconds later the application crashes.
(In reply to Alexander from comment #34)
> I found a computer in my office which has such crashes almost every day.
> What's interesting in that computer is that it goes to sleep every 30
> minutes during the night (power saving policy) but due to misconfiguration
> it will always wake a minute later. And, like it was hinted before, the
> crashes are always a few seconds after the computer wakes! I checked the
> last 6 crashes or so against the 'Control panel | Event Log | System' and
> figured there's no need to check the rest.
> 
> Also, today I got a support ticket from one of our clients with the same
> problem, and once again, his computer goes to sleep due to idling, 4 minutes
> later it's woken up and 27 seconds later the application crashes.

Is the laptop on a wireless network, proxy or no proxy? Win7?

We could probably write a little desktop app to simulate these wakeups on one of our laptops in our lab.
Flags: needinfo?(mozillamarcia.knous)
Attached file wakeonsleep.zip
This test app waits for the suspend event, then fires off a 30 second sleep wake timer. If you take a laptop and set the power options to low values (~2 minutes to sleep or so) and run this app it will continually sleep/wake the system.

Tried this on a win7 laptop with firefox open to a page that regularly refreshes. After about ten cycles I didn't see any crashes though.
Attached file wakeonsleep.cpp
The computer in the office is a Win7 PC. It's not a laptop, but a regular computer with a monitor and a system block. It is not wifi-capable. It has 'Automatically detect settings' in IE connection settings, everything else disabled.

As far as I can tell from the dumps, the problem is in WPAD (Web Proxy Auto-Discovery), the automatic proxy detection mechanism. Google says it's active when 'Automatically detect settings' is enabled, so, somewhat contrary to expectations, it is proxy-settings related.

There are 14 sleep rounds per day (it tries to sleep every 30 minutes for 7 hours in the night). The number of crashes averages 1 per day; sometimes it will have 2 a day, say 1.5 hours apart, and sometimes it will have none.

Browsing various dumps, I concluded the problem is some sort of "restart WPAD on wake-up". Maybe you didn't give it enough time to crash after waking up. Also, I believe that the application has to use the default wininet proxy settings to trigger the problem. Maybe it also has to use InternetSendRequest-like functions. Another observation is that another thread often calls GetAddrInfoW() during the crash.
Firefox does not call InternetSendRequest() at all.
Maybe a plugin does. Maybe there are multiple entry points that trigger WPAD. Maybe that isn't even important. I'm talking about HttpSendRequest (sorry for the wrong API name before) because I can often see it happening at the moment of the crash in MY application.
Now plug-ins are executed out-of-process (except Java). Not using HttpSendRequest, either (except crashreporter).
Attached file Repro application
Okay, I got a repro. Hopefully you will excuse me for throwing both GetAddrInfoW() and HttpSendRequest() without figuring which is more important :)

I started the application and gave it a minute to warm up (the first HttpSendRequest() calls are slower for whatever reason). Then I started pressing the 'sleep' keyboard button, waiting until the fan went silent, and waking by pressing the button again.

The first couple times nothing happened. I decided to check if I'm getting it right and attached WinDBG with
bp wininet!NETWORK_MANAGER::ReadGuidsForConnectedNetworks

On the next wake, the breakpoint was hit. After a few F5's it crashed into WinDBG.
The repro's dump (bonus: gflags +ust was enabled, so you can use !heap -p -a with stacks)
https://docs.google.com/file/d/0BxYgPL6MR_0UVDlvNVhQM2hNUk0/edit?usp=sharing

0682e98c 75d011f9 ole32!CFastBH::Get [d:\w7rtm\com\ole32\common\fastbh.cxx @ 29]
0682e9fc 75d053bd ole32!CRpcResolver::ServerGetReservedID+0x16 [d:\w7rtm\com\ole32\com\dcomrem\resolver.cxx @ 919]
0682eaac 75d052fe ole32!MakeProxyHelper+0x98 [d:\w7rtm\com\ole32\com\dcomrem\resolver.cxx @ 1356]
0682ead0 75d052d4 ole32!MakeSCMProxy+0x1e [d:\w7rtm\com\ole32\com\dcomrem\resolver.cxx @ 1410]
0682eaec 75d05c1e ole32!CRpcResolver::BindToSCMProxy+0x56 [d:\w7rtm\com\ole32\com\dcomrem\resolver.cxx @ 1506]
0682eb50 75d0637b ole32!CRpcResolver::CreateInstance+0x74 [d:\w7rtm\com\ole32\com\dcomrem\resolver.cxx @ 2385]
0682edac 75d13170 ole32!CClientContextActivator::CreateInstance+0x11f [d:\w7rtm\com\ole32\com\objact\actvator.cxx @ 711]
0682edec 75d13098 ole32!ActivationPropertiesIn::DelegateCreateInstance+0x108 [d:\w7rtm\com\ole32\actprops\actprops.cxx @ 1917]
0682f5c8 75d19e25 ole32!ICoCreateInstanceEx+0x404 [d:\w7rtm\com\ole32\com\objact\objact.cxx @ 1334]
0682f628 75d19d86 ole32!CComActivator::DoCreateInstance+0xd9 [d:\w7rtm\com\ole32\com\objact\immact.hxx @ 343]
0682f64c 75d19d3f ole32!CoCreateInstanceEx+0x38 [d:\w7rtm\com\ole32\com\objact\actapi.cxx @ 157]
0682f67c 6d4b2505 ole32!CoCreateInstance+0x37 [d:\w7rtm\com\ole32\com\objact\actapi.cxx @ 110]
WARNING: Frame IP not in any known module. Following frames may be wrong.
0682f6bc 7687bfe8 <Unloaded_netprofm.dll>+0x2505
0682f744 7685de21 wininet!NETWORK_MANAGER::ReadGuidsForConnectedNetworks+0x131
0682f770 7687a4cd wininet!InternalReadGuidsForConnectedNetworks+0x87
0682f790 7687a55d wininet!CSwpadSupport::ReadIdsForConnectedNetworks+0x1d
0682f7f8 7687fb79 wininet!NETWORK_MANAGER::SetWpadDecisionForCurrentNetwork+0x79
0682f88c 76885916 wininet!AutoProxyResolver::UpdateAutoproxyWithCompletedDetection+0x1fe
0682f8d8 7680c9f5 wininet!AutoProxyWpadAndResultThread+0xd1
0682f8e8 76f094d2 wininet!RefCountWorkItemThread+0xe
0682f95c 76ef43e9 ntdll!RtlpTpWorkCallback+0x11d
0682fabc 758033aa ntdll!TppWorkerThread+0x572
0682fac8 76ed9ef2 kernel32!BaseThreadInitThunk+0xe
0682fb08 76ed9ec5 ntdll!__RtlUserThreadStart+0x70
0682fb20 00000000 ntdll!_RtlUserThreadStart+0x1b

netprofm.dll is unloaded in the middle of the stack. I can see where the problem is...
Jim, can you forward this app and dumps to Microsoft as part of our support ticket?
Flags: needinfo?(mozillamarcia.knous) → needinfo?(jmathies)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #44)
> Jim, can you forward this app and dumps to Microsoft as part of our support
> ticket?

yep.
Flags: needinfo?(jmathies)
#3 crash for TB23beta1 for combined signature total
Whiteboard: [tbird topcrash][waiting on Microsoft]
We're not waiting on ms, they are waiting on us for a reproducible test case in our lab. Once we have that they'll want us to turn on some detailed debugging features.
Whiteboard: [tbird topcrash][waiting on Microsoft] → [tbird topcrash]
I have already prepared a repro application in post 42. Why doesn't it fit?
(In reply to Alexander from comment #48)
> I have already prepared a repro application in post 42. Why doesn't it fit?

They were not able to reproduce using this sample.
Can you?
Marcia, I know it was a while ago but in comment 8 you mentioned you were going to try this on a machine in our lab. Did you have any results to share from that testing? Is there anything we can do to farm this out to Softvision?
Flags: needinfo?(mozillamarcia.knous)
(In reply to Anthony Hughes, Mozilla QA (:ashughes) from comment #51)
> Marcia, I know it was a while ago but in comment 8 you mentioned you were
> going to try this on a machine in our lab. Did you have any results to share
> from that testing? Is there anything we can do to farm this out to
> Softvision?

I did try a while back and was unable to reproduce. But now it seems there is a test case so we can try again.
Flags: needinfo?(mozillamarcia.knous)
(In reply to Marcia Knous [:marcia] from comment #52)
> I did try a while back and was unable to reproduce. But now it seems there
> is a test case so we can try again.

Are you able to work on this or should I reassign it?
I would like to focus testing on the repro application from comment 42. Microsoft says that they cannot reproduce using this testcase, and so we need to understand whether only some of our computers see the crash, or whether there are other conditions that will help Microsoft reproduce and fix the issue.
Assignee: jmathies → nobody
I can reliably crash using Alexander's code from comment 42. Jim please let me know if I can help collect data using Microsoft's debugging switches.

The machine is MOZILLA-RD6310. It's an up-to-date Win7 SP1 netbook with proxy set to auto-detect. 
    OriginalFilename: wininet.dll
    FileVersion:      10.00.9200.16660 (win8_gdr_escrow.130725-1505)

Only the HttpSendRequest thread appears to be necessary in the repro app; GetAddrInfoW doesn't seem to contribute.

The repro seems to be much more reliable on a wired connection (nearly every sleep cycle) versus wireless (takes about 10 attempts), I'm guessing because wired lets the app pound on HttpSendRequest more intensely.

I've seen the crash take several forms:
- ole32!CFastBH::Get, deref null
- ole32!CRpcChannelBuffer::SendReceive2+0x309, deref 0xfeeefeee (freed memory sentinel?)
- "survivable" access violation in ntdll, where the debugger shows a first-chance exception but you can continue past it, presumably someone further up does a catch
Wininet is on the stack in all of these crashes. Details in attached file.
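
The first two forms above can be told apart from the faulting address alone. A minimal sketch follows, assuming that 0xfeeefeee is the fill pattern the Windows debug heap writes into freed blocks (which would answer the "freed memory sentinel?" question) and that the first 64 KiB of the address space is never mapped. `ClassifyFaultAddress` and the offset limit are hypothetical choices for illustration, not something taken from the dumps.

```cpp
#include <cassert>
#include <cstdint>

enum class FaultKind { NullDeref, FreedHeapPattern, Unclassified };

// Classifies a 32-bit faulting address by well-known patterns.
constexpr FaultKind ClassifyFaultAddress(std::uint32_t addr) {
  // Reads through a null pointer land in the unmapped first 64 KiB.
  if (addr < 0x10000u) return FaultKind::NullDeref;
  // Reads through a pointer loaded from freed heap memory land at the fill
  // pattern plus a small member offset (assumption: offset below 0x1000).
  if (addr >= 0xFEEEFEEEu && addr - 0xFEEEFEEEu < 0x1000u) {
    return FaultKind::FreedHeapPattern;
  }
  return FaultKind::Unclassified;
}
```

A `FreedHeapPattern` result points toward use-after-free, which is consistent with netprofm.dll being unloaded while wininet still calls into it.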
Flags: needinfo?(jmathies)
It looks like the sample app isn't reproducing the crash we see in firefox, which looks something like this, with a few variations -  

0 	ole32.dll 	CPrivAlloc::operator delete(void *) 	
1 	ole32.dll 	CClientContextActivator::CreateInstance(IUnknown *,IActivationPropertiesIn *,IActivationPropertiesOut * *) 	
2 	ole32.dll 	CComActivator::DoCreateInstance(_GUID const &,IUnknown *,unsigned long,_COSERVERINFO *,unsigned long,tagMULTI_QI *,ActivationPropertiesIn *) 	
3 	ole32.dll 	CoCreateInstanceEx 	
4 	ole32.dll 	CoCreateInstance 	
5 	netprofm.dll 	CPubINetworkListManager::EnsureNLPConnected() 	
6 	netprofm.dll 	CPubINetworkListManager::GetNetworks(NLM_ENUM_NETWORK,IEnumNetworks * *) 	
7 	wininet.dll 	wininet.dll@0xfbfe8 	
8 	wininet.dll 	wininet.dll@0xdde21 	
9 	wininet.dll 	wininet.dll@0xfa4cd 	
10 	wininet.dll 	wininet.dll@0xfa55d 	
11 	wininet.dll 	wininet.dll@0xffb79 	
12 	wininet.dll 	wininet.dll@0x105916 	
13 	wininet.dll 	wininet.dll@0x8c9f5 	
14 	ntdll.dll 	ntdll.dll@0x69512 	
15 	ntdll.dll 	ntdll.dll@0x54429 	
16 	kernel32.dll 	BaseThreadInitThunk 	
17 	ntdll.dll 	ntdll.dll@0x39f72 	
18 	ntdll.dll 	ntdll.dll@0x39f45
Flags: needinfo?(jmathies)
(In reply to Jim Mathies [:jimm] from comment #57)
> It looks like the sample app isn't reproducing the crash we see in firefox,
> which looks something like this, with a few variations -  

I suspect it's the same underlying cause, even if the top stack frames differ. There's always a wininet-netprofm-CoCreateInstance call chain. And with the feeefeee derefs indicating potential memory corruption, you might encounter the badness in a different place depending on which version of ole32 you have. So I wouldn't rely on CPrivAlloc as the only indicator (heck, I can't even find CPrivAlloc in the crash-stats dumps for this bug).
The crash takes multiple forms. Most of them will contain unloaded netprofm.dll in the stack. I'm not sure how to read your crash signature exactly, but naively, it seems to correlate with my own stack in post 43. I believe that giving it a few tries and collecting all stacks will resolve the confusion.
Also, I would like to underline that the repro application is obviously legit, so if it crashes, it has to be fixed anyway. You could save time by sending that to microsoft to have it fixed first, even if you're not confident that it's the bug in question.
(In reply to David Major [:dmajor] from comment #58)
> (In reply to Jim Mathies [:jimm] from comment #57)
> > It looks like the sample app isn't reproducing the crash we see in firefox,
> > which looks something like this, with a few variations -  
> 
> I suspect it's the same underlying cause, even if the top stack frames
> differ. There's always a wininet-netprofm-CoCreateInstance call chain. And
> with the feeefeee derefs indicating potential memory corruption, you might
> encounter the badness in a different place depending on which version of
> ole32 you have. So I wouldn't rely on CPrivAlloc as the only indicator
> (heck, I can't even find CPrivAlloc in the crash-stats dumps for this bug).

Looking through crash stats the most common stack is the one in comment 57 which has CPrivAlloc in the signature.

If it's relatively easy to do this in a sample using a set of repro steps, have we tried using that knowledge to repro in Firefox or a XUL app?

If we want to go with the sample app, I can go back to ms and reopen the request. They will ask us to produce a time travel trace of the crash, so maybe we should try to get that set up and working first.

(In reply to Alexander from comment #60)
> Also, I would like to underline that the repro application is obviously
> legit, so if it crashes, it has to be fixed anyway. You could save time by
> sending that to microsoft to have it fixed first, even if you're not
> confident that it's the bug in question.

We would have to open up a separate support request for this independent of the problem we have in firefox.
(In reply to Jim Mathies [:jimm] from comment #61)
> Looking through crash stats the most common stack is the one in comment 57
> which has CPrivAlloc in the signature.

Right, but when I open the .dmp files, I don't actually see CPrivAlloc. I did some more digging this morning and I think there may be a symbol issue (on the server, or on my end, or both). 

For the sake of a concrete example, here is one dump I am looking at:
https://crash-stats.mozilla.com/report/index/c1359d19-efb5-47ac-a228-566932130906

And this is the stack I see:
ole32!`string'+0x9
ole32!ICoCreateInstanceEx+0x243
ole32!CComActivator::DoCreateInstance+0xd9
ole32!CoCreateInstanceEx+0x38

I want to say I believe my debugger's analysis over the server's. I unassembled DoCreateInstance and it does have a call to ICoCreateInstanceEx, but there are no calls to CClientContextActivator::CreateInstance (nor any indirect calls that might hide it). 

On crash-stats, in the Raw Dump tab, underneath the modules it has stacks with instruction offsets. It claims that it saw CClientContextActivator::CreateInstance+0x115 on the stack. When I ask WinDbg what is at that address, it tells me it's ICoCreateInstanceEx+0x243. (Huh?!) What I think is happening is that there is some aggressive compiler optimization and block reordering going on. The instruction that is physically 0x115 bytes away from CClientContextActivator::CreateInstance is actually logically part of the flow of ICoCreateInstanceEx. 
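
To make the disagreement concrete, here is a minimal sketch of the naive "closest symbol at or below the address" rule. It is not Breakpad's or WinDbg's actual algorithm; `SymbolTable`, `nearestPreceding` and `offsetFrom` are hypothetical names, and any addresses fed to it are made up. Only the two offsets (+0x115 and +0x243) come from the dump above.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical symbol table: function start address -> function name.
using SymbolTable = std::map<std::uint32_t, std::string>;

// Naive rule: attribute an address to the closest symbol that starts at or
// before it. Returns "<unknown>" if no symbol precedes the address.
std::string nearestPreceding(const SymbolTable& symbols, std::uint32_t addr) {
    auto it = symbols.upper_bound(addr);  // first symbol starting after addr
    if (it == symbols.begin()) return "<unknown>";
    --it;
    return it->second;
}

// Offset of addr from a symbol's start, as printed in a stack frame.
std::uint32_t offsetFrom(std::uint32_t symbolStart, std::uint32_t addr) {
    return addr - symbolStart;
}
```

If ICoCreateInstanceEx started at a made-up 0x1000 and CClientContextActivator::CreateInstance at 0x112E, a return address of 0x1243 would be labelled CreateInstance+0x115 by this rule, even though it is also ICoCreateInstanceEx+0x243. A tool that knows the real block ranges of each function can pick the second, logically correct name.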

Given the symbol confusion, I'm even more inclined to believe that Alexander's repro demonstrates the same root cause bug. 

I would say that the sample app is more likely to lead to a fruitful investigation from Microsoft than a full-fledged Firefox repro -- the sample is super reduced, and it calls the network API intensely in a tight loop, which helps the crash happen more readily than it would in the wild.
If you get a better stack out of a Microsoft debugger I would always believe that over Breakpad. It's pretty upsetting that we can't even get the first frame right, though. Can you file a separate bug in Toolkit:Breakpad Integration on figuring out the root cause there?
Whiteboard: [tbird topcrash] → [tbird crash]
Whiteboard: [tbird crash] → [tbird topcrash]
I have been fighting this bug since IE10 was released. I have not been able to reproduce the problem but it has caused a 5x increase in the number of crash dumps from customer machines. I'm amazed that more companies are not reporting the problem and that Microsoft has not resolved this yet.
I was able to capture a trace of the crash and sent it to Jim to pass along to Microsoft.
(In reply to David Major [:dmajor] from comment #65)
> I was able to capture a trace of the crash and sent it to Jim to pass along
> to Microsoft.

Uplifted those to the request and pinged my contact. This was a week or so ago. Haven't heard back yet, will ping them again today.
Major issue for Thunderbird. 13.5% of 24.0.1 crashes - #1, #2 and #9 in topcrash https://crash-stats.mozilla.com/topcrasher/products/Thunderbird/versions/24.0.1
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
> Mitch, is refactoring to use nsINetowkrListManager as mentioned in comment
> 14 a reasonable workaround?
> 
> I see also bug 817568 and bug 829518 where we've been wrapping windows
> functions in __try/__except blocks, which is a recipe for disaster.

I'm not sure. Network List Manager has ways to enumerate network interfaces based on connectivity, but I can't see a way to directly get their associated proxy settings.

If the aim is to completely avoid WinINet, WinHTTP has similar functionality to what we have now, with a few major drawbacks.
http://msdn.microsoft.com/en-us/library/windows/desktop/hh227297%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa384068%28v=vs.85%29.aspx

Without knowing the common case here (for crashes at least), I can't give an assessment of whether it makes sense to use WinHTTP interfaces first and fall back to WinINet as a last resort (in case of unsupported proxy type, etc.).
Keywords: topcrash → topcrash-win
This is #6 and #8 on Firefox 25 release right now.
(In reply to Mitchell Field [:Mitch] (mostly inactive) from comment #68)
> If the aim is to completely avoid WinINet, WinHTTP has similar functionality
> to what we have now, with a few major drawbacks.
> http://msdn.microsoft.com/en-us/library/windows/desktop/hh227297%28v=vs.85%29.aspx
> http://msdn.microsoft.com/en-us/library/windows/desktop/aa384068%28v=vs.85%29.aspx
> 
> Without knowing the common case here (for crashes at least), I can't give an
> assessment of whether it makes sense to use WinHTTP interfaces first and
> fall back to WinINet as a last resort (in case of unsupported proxy type,
> etc.).

Unfortunately WinHttp will regress bug 787757. It has no corresponding option to INTERNET_PER_CONN_FLAGS_UI. Chromium already tried WinHttp and failed.
Why is it still stuck? It has been over 5 months since I presented a reliable repro!
(In reply to Alexander from comment #71)
> Why is it still stuck? It has been over 5 months since I presented a
> reliable repro!

We've been going back and forth with ms on this. Last contact with them was before our holiday break. Generally they seem to be having a hard time isolating the problem.
Does that mean they can't reproduce with the source code from post 42? Can you? I just can't understand what could hold them back after I've got a repro that some of you confirmed.
(In reply to Alexander from comment #73)
> Does that mean they can't reproduce with the source code from post 42? Can
> you? I just can't understand what could hold them back after I've got a
> repro that some of you confirmed.

That's correct, they were not able to repro. However, one of our engineers was able to, so we did a bunch of tracing and logging for them using special tools.
I will try to improve repro to avoid sleep-wake complications and increase the crash rate. Just need to remember to do that...
I don't know if it would be of any use, but I have tens of thousands of crash dumps from users plagued by this problem.
Spent a few hours, but didn't get a more reliable repro. My theory is that there's a ref not added to COM object somewhere, which leads to the COM object and related DLL being unloaded too early, depending on the thread timings. Looks like I'll have to actually debug the problem to figure the way of making a reliable repro.

So far, the problem exists up to the newest IE11. Out of 1521 crash dumps I received, the stats are as follows:
10.0.9200.16521 - 53 reports  (Initial IE10)
10.0.9200.16540 - 89 reports  (IE10.0.4 KB2817183)
10.0.9200.16576 - 86 reports  (IE10.0.5 KB2829530)
10.0.9200.16611 - 189 reports (probably MS13-047)
10.0.9200.16618 - 15 reports  (IE10.0.6 KB2838727)
10.0.9200.16635 - 166 reports (IE10.0.7 KB2846071)
10.0.9200.16660 - 196 reports (IE10.0.8 KB2862772)
10.0.9200.16686 - 177 reports (IE10.0.9 KB2870699)
10.0.9200.16720 - 206 reports (MS13-080)
10.0.9200.16736 - 99 reports  (IE10.0.11)
10.0.9200.16750 - 1 reports   (IE10.0.12 KB2898785)
11.0.9600.16428 - 49 reports  (IE11 RTM KB2841134)
11.0.9600.16476 - 74 reports  (IE11.0.2 KB2898785)
We have the same kind of crash with Avast and have over 1000 dumps I can provide. Wasn't able to repro it locally, though. 

The stacks pretty much always contain 
ole32!CoCreateInstance+37
netprofm!CPubINetworkListManager::EnsureNLPConnected+58
netprofm!CPubINetworkListManager::GetNetworks+39
wininet!NETWORK_MANAGER::ReadGuidsForConnectedNetworks+12d

with netprofm sometimes being unloaded but not always. Majority of dumps have it still loaded but it can be seen it has been unloaded and reloaded many times from the process already (in some dumps even over 20 times).
I can also confirm the problem happening on pretty much any version of wininet.dll but only on one version of netprofm.dll - the one included in Windows 7 x32 SP1.

I've tried to repro with Alexander's solution on a notebook waking up from a sleep state, unfortunately no success so far. 

Let me know if I can help as this crash plagues us just as well as you.
Ondrej,

I suggest that you
1) Configure Application Verifier with Basic\Heaps
2) Wait at least a minute between waking.

At some moment I thought I got a 100% repro, but then something changed and it stopped happening reliably.

I have seen the crash on two different Win7 SP1 x64 machines. Didn't try on Win8 (due to a driver going nuts after waking), will do if I have time
Unfortunately, nothing at all after 10 more sleep/wake cycles with 3 instances of the app, run under Verifier on Windows 7 x64 SP1.

Will try on a Windows 7 x32 machine as well.
Oh well. I really did think my repro was good enough after having it crash more or less reliably, and it was confirmed by someone from Mozilla. Ondrej, can you see netprofm.dll / wininet.dll being loaded/unloaded after you wake your laptop? I noted it will or will not happen depending on something, presumably how long I wait before waking, and it seems the crash only occurs in the scenario where DLLs are loaded/unloaded.
NB: Having multiple copies of repro application running at once is a smart move.
Tested on another Win 7 x32 SP1. About 15 sleep iterations, 4 instances of sample, did not encounter any issue. Dumps tend to have 1 unloaded netprofm.dll per sleep/wake cycle so it does seem to take effect but unfortunately, no erroneous behavior encountered. It seems this is unfortunately dependent on some other state of the OS...
Attached file AutoProxyRepro.cpp
Oh well. Looks like I'm the only one who can save the universe. So be it.

Here's how to repro it with debugger, as far as I can tell that's a 100% repro.
TLDR: I set a breakpoint on the code that will be unloaded, suspend thread on it, wait and resume to have it crashed.

I will use WinDBG. IGNORE > sign in the text boxes!

-- 0 --
File | Symbol file path. Enter the following string to load MS symbols and cache them in C:\Symbols:
> srv*C:\Symbols*http://msdl.microsoft.com/download/symbols

-- 1 -- 
File | Open executable...
Open the repro program. In fact, any other program that does HttpSendRequest() at least once should do.

-- 2 --
Debug | Event filters... | Unload module = output
This is to see when DLL is unloaded.

-- 3 --
Enter command in the bottom "command" box:
> bu netprofm!CPubINetworkListManager::EnsureNLPConnected "k;.printf \"***** Suspend: ~~[%x] n\\n***** Resume:  ~~[%x] m\\n\", @$tid, @$tid"
This is to create a breakpoint on the function that crashes later.

-- 4 --
Press F5 to run program. Whenever it stops on breakpoint, it will print stack, Thread ID, Suspend/Resume commands. When it stops, check stack. IT MUST CONTAIN
> WININET!AutoProxyWpadAndResultThread
If it doesn't, press F5 again. Be patient. When the stack with AutoProxyWpadAndResultThread occurs, grab the suspend command starting with ~~ and run this command. The thread will be suspended.

-- 5 --
Press F5 until you see module unload notifications.

-- 6 --
Debug | Break.

-- 7 --
Run the resume command you had earlier.

-- 8 --
Press F5 and it crashes.
Now for the explanation.

First, the only important thing in the repro program is that it uses HttpSendRequest once. When it does, WININET creates a subscription to network changes.

> WINNSI!NsiRpcRegisterChangeNotification
> IPHLPAPI!InternalRegisterChangeNotification
> IPHLPAPI!NotifyIpInterfaceChange
> WININET!NetworkChangeMonitor::Startup
> WININET!StartGlobalNetworkChangeMonitor
> WININET!WxRegisterForNetworkChangeNotification
> WININET!RegisterForNetworkChangeNotificationInternal
> WININET!GlobalDataInitializeWorkItem
> WININET!FailFastThreadPoolCallback<&GlobalDataInitializeWorkItem>
> ntdll!TppSimplepExecuteCallback
> ntdll!TppWorkerThread
> kernel32!BaseThreadInitThunk
> ntdll!__RtlUserThreadStart
> ntdll!_RtlUserThreadStart

Whenever the number of connected networks changes, a callback is called.

> ntdll!ZwCreateThreadEx [WININET!AutoProxyResolver::AutoProxyThreadStart]
> KERNELBASE!CreateRemoteThreadEx
> kernel32!CreateThreadStub
> WININET!AutoProxyResolver::CheckAutoProxyThreadRunning
> WININET!AutoProxyResolver::QueueAsyncAutoProxyRequest
> WININET!AutoProxyResolver::RefreshProxySettings
> WININET!AutoProxyResolver::OnNetworkChange
> WININET!WxProxyManager::OnNetworkChange
> WININET!OnNetworkChanged
> WININET!NetworkChangeMonitor::TriggerChangeNotification
> WININET!NetworkChangeMonitor::DoNotificationWait
> WININET!NetworkChangeMonitor::NotificationThread
> WININET!RefCountWorkItemThread
> ntdll!RtlpTpWorkCallback
> ntdll!TppWorkerThread
> kernel32!BaseThreadInitThunk
> ntdll!__RtlUserThreadStart
> ntdll!_RtlUserThreadStart

This callback creates a thread to configure proxy. This thread is also started the first time you call HttpSendRequest(). The thread will go down to QueueAndWaitForDetections().

> WININET!AutoProxyResolver::QueueAndWaitForDetections
> WININET!AutoProxyResolver::DetectAndDownloadProxyScript
> WININET!AutoProxyResolver::ProcessRefresh
> WININET!AutoProxyResolver::OnRefresh
> WININET!AutoProxyResolver::ProcessMessages
> WININET!AutoProxyResolver::AutoProxyThread
> WININET!AutoProxyResolver::AutoProxyThreadStart
> kernel32!BaseThreadInitThunk
> ntdll!__RtlUserThreadStart
> ntdll!_RtlUserThreadStart

This function will spawn two tasks and wait for their completion, WININET!SwpadWpad and WININET!IpAddressWpad. Then it will spawn the third task, WININET!AutoProxyWpadAndResultThread, and WILL NOT WAIT for it. It then goes towards the exit, where it will OleUninitialize(), which unloads DLLs.

The WININET!AutoProxyWpadAndResultThread task seems not to hold a reference to the DLL. So the DLL is unloaded while the task is still executing its code, which causes the crashes.
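
The suspected lifetime bug can be modelled in portable C++. This is only a sketch of the pattern, not WinINet's code: `Module`, `makeDetachedTask`, `moduleAliveWhenTaskRuns` and the `holdReference` switch are hypothetical names, and `std::shared_ptr` stands in for the DLL's reference count.

```cpp
#include <functional>
#include <memory>
#include <string>

// Stand-in for a loaded DLL; it is "unloaded" when the last owner lets go.
struct Module {
    std::string name;
};

// Builds a fire-and-forget task that wants to run code from `module` later.
// With holdReference == true the task keeps its own strong reference;
// otherwise it only observes the module, as the buggy path appears to.
std::function<bool()> makeDetachedTask(const std::shared_ptr<Module>& module,
                                       bool holdReference) {
    std::weak_ptr<Module> observed = module;
    std::shared_ptr<Module> owned;
    if (holdReference) owned = module;
    // The task reports whether the module was still loaded when it ran.
    return [observed, owned]() {
        (void)owned;  // kept alive only for its reference count
        return !observed.expired();
    };
}

// Models the spawning thread: create the task, do not wait for it, drop our
// own reference (the OleUninitialize() step), and only then let the task run.
bool moduleAliveWhenTaskRuns(bool holdReference) {
    auto module = std::make_shared<Module>(Module{"netprofm.dll"});
    auto task = makeDetachedTask(module, holdReference);
    module.reset();  // spawner exits and releases its reference
    return task();   // task runs late, after the spawner is gone
}
```

In this model `moduleAliveWhenTaskRuns(false)` finds the module already gone, which corresponds to the crash, while `moduleAliveWhenTaskRuns(true)` finds it still loaded because the task owns a reference of its own.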

Now, to the sleep/wake thingie. In fact, it's not needed. As shown in the previous post, you can easily repro without sleep/wake at all. It merely contributes to the thread timings.
Forgot to mention. Bringing that closer to Firefox. As noticed, you only need HttpSendRequest (that's the one I know, there could be many others) once. That's it. You call it, or your plugin call it, or some hook dll injected into your process call it, and... You're doomed.
Is it possible to work around the crash on our side, then? For example, will adding OleInitialize into _tmain of AutoProxyRepro.cpp work? Or do we have to wait until Microsoft fixes the bug?
I believe multiple workarounds are possible. Still, instead you should have Microsoft fix it at last. You lived with the bug for over a year, what's the hurry now?
We should do both. Even if Microsoft fixes the issue it will still affect existing users for potentially a long time. Jim can you re-contact MS with the details here?

Alexander, I salute your persistence and ability in figuring all this out!

Does anyone here know what the cost would be of just doing OleInitialize at startup and OleUninitialize at shutdown? Or at least OleInitialize before we enter this proxy code and then uninit at shutdown?
Flags: needinfo?(jmathies)
Thanks. Benjamin, I would like to ask you to confirm the repro so I know it isn't going to lay untouched for another year :)

DISCLAIMER 1: I hate dirty fixes and advise against them.
DISCLAIMER 2: I'm not particularly experienced with COM at the moment.

I believe OleInitialize() is the wrong path. As far as I understand the problem in question, OLE is initialized per-thread (in this case), and the AutoProxyResolver::AutoProxyThread is out of your control as it is launched on demand. I can imagine issuing a LoadLibrary("netprofm.dll") without a paired FreeLibrary() could do the trick. It needs to be tested, and I leave it to you. Otherwise, you could try to CoCreateInstanceEx() the class that is used (GUID can be found in disasm). That's better, as it's more natural. That's worse, because having an instance of class could cause side effects.

Again, I'm against dirty fixes.
We already call OleInitialize() in the nsWindow constructor. Maybe we will have to call OleInitialize() in the proxy autoconfig thread. It will explain why this bug appears after making the proxy config out-of-main-thread.
IMO unbalanced LoadLibrary() is more dirty, so I intended to propose the hack if OleInitialize() hack did not work.
Will the proposed workarounds work if you do the repro steps more than once? If not then maybe we should pin the module by calling GetModuleHandleEx with GET_MODULE_HANDLE_EX_FLAG_PIN.
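A minimal sketch of the pinning idea, for discussion only. Nobody in this bug has confirmed that it avoids the crash. `PinLoadedModule` and `LoadAndPinModule` are hypothetical helper names. Loading by bare file name is a simplification; real code would presumably build the full system32 path. The non-Windows branches are stubs so the sketch compiles elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Attempts to pin an already-loaded module so that it stays mapped until the
// process exits, regardless of later FreeLibrary() calls.
// Returns true only if the module was loaded and is now pinned.
// On non-Windows builds this is a stub that returns false.
bool PinLoadedModule(const wchar_t* moduleName) {
#ifdef _WIN32
    HMODULE module = nullptr;
    return GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_PIN, moduleName,
                              &module) != FALSE;
#else
    (void)moduleName;
    return false;
#endif
}

// Loads the module if necessary, then pins it. The LoadLibraryW() reference
// is intentionally never released.
bool LoadAndPinModule(const wchar_t* moduleName) {
#ifdef _WIN32
    if (LoadLibraryW(moduleName) == nullptr) return false;
#endif
    return PinLoadedModule(moduleName);
}
```

Calling `LoadAndPinModule(L"netprofm.dll")` once, before the first WinINet call, would keep the DLL mapped across repeated load/unload cycles. It would not help if the damage is done to ole32 state rather than to the unloaded code itself.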
FWIW my SO got this crash on Firefox 28.0

bp-4722bf30-94f8-41a3-8f4e-f0d252140402	02/04/2014	12:31 p.m.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #89)
> We should do both. Even if Microsoft fixes the issue it will still affect
> existing users for potentially a long time. Jim can you re-contact MS with
> the details here?

Will do. This request is still open, although there's been little activity as ms has basically said - find a reliable repro. We'll see what they come back with.
Flags: needinfo?(jmathies)
The engineer I'm working with is trying to reproduce - he's having trouble hitting the break point. He had a question about versions - 

> -- 4 --
> 
> Press F5 to run program. Whenever it stops on breakpoint,
> it will print stack, Thread ID, Suspend/Resume commands.
> When it stops, check stack. IT MUST CONTAIN
> 
> It doesn’t hit the breakpoint at all in my case and interestingly
> doesn’t load netprofm either. Can you please check what the OS and
> IE version the contributor had while debugging this sample?
Flags: needinfo?(alexandr.miloslavskiy)
I did that on a fully updated Win7 with IE11. I believe that automatic proxy configuration affects the problem. Maybe the engineer has it disabled.

PS: Guys! Guys!!! Did anyone at all try my repro before cheering? Come on, it's not that hard! I'm a little tired of being the only one who possesses arcane knowledge, while the bug is common as dirt in the wild.
Flags: needinfo?(alexandr.miloslavskiy)
I also vaguely remember that I tried it on Win8 to confirm. Both machines are connected to a cheap Trendnet home router running default settings on everything except the wifi password, if that matters. The bug reproduces on both cable and wifi connections.
Just tried a series of repro attempts and it does seem to be a little more elusive than originally thought. Got a cheap laptop with Win7 and wininet version 9.0.8112.16446 and netprofm 6.1.7600.16385.  Even when following your steps exactly, I can never get WININET!AutoProxyWpadAndResultThread to appear on stack.  The only stacks I keep getting are:

ChildEBP RetAddr  
021cf744 6d2833fd netprofm!CPubINetworkListManager::EnsureNLPConnected
021cf758 761e029d netprofm!CPubINetworkListManager::GetNetworks+0x39
021cf7dc 761e0199 WININET!NETWORK_MANAGER::ReadGuidsForConnectedNetworks+0x10f
021cf804 7626c924 WININET!InternalReadGuidsForConnectedNetworks+0x6f
021cf864 7625d7e6 WININET!NETWORK_MANAGER::SetWpadDecisionForCurrentNetwork+0x86
021cf888 762252c3 WININET!AutoProxyResolver::SwpadSetNeedBrowseState+0xc9
021cf8dc 761dcf58 WININET!AutoProxyResolver::DoProxyDetection+0x57a
021cf988 76215341 WININET!AutoProxyResolver::DetectAndDownloadProxyScript+0x165
021cf9bc 76211914 WININET!AutoProxyResolver::ProcessGetProxyForUrl+0xc8
021cf9dc 761dc866 WININET!AutoProxyResolver::OnGetProxyForUrl+0x33
021cfa14 761dc73b WININET!AutoProxyResolver::ProcessMessages+0xbf
021cfbc8 761dc63e WININET!AutoProxyResolver::AutoProxyThread+0x12a
021cfbd4 75ffed6c WININET!AutoProxyResolver::AutoProxyThreadStart+0xd
021cfbe0 777f377b kernel32!BaseThreadInitThunk+0xe
WARNING: Stack unwind information not available. Following frames may be wrong.
021cfc20 777f374e ntdll!RtlInitializeExceptionChain+0xef
021cfc38 00000000 ntdll!RtlInitializeExceptionChain+0xc2
***** Suspend: ~~[c4c] n
***** Resume:  ~~[c4c] m

and 

ChildEBP RetAddr  
021cf89c 6d2833fd netprofm!CPubINetworkListManager::EnsureNLPConnected
021cf8b0 761e029d netprofm!CPubINetworkListManager::GetNetworks+0x39
021cf934 761e0199 WININET!NETWORK_MANAGER::ReadGuidsForConnectedNetworks+0x10f
021cf95c 7626c924 WININET!InternalReadGuidsForConnectedNetworks+0x6f
021cf9bc 7625db09 WININET!NETWORK_MANAGER::SetWpadDecisionForCurrentNetwork+0x86
021cf9dc 76225761 WININET!AutoProxyResolver::OnSetSwpadDecision+0x95
021cfa14 761dc73b WININET!AutoProxyResolver::ProcessMessages+0x83
021cfbc8 761dc63e WININET!AutoProxyResolver::AutoProxyThread+0x12a
021cfbd4 75ffed6c WININET!AutoProxyResolver::AutoProxyThreadStart+0xd
021cfbe0 777f377b kernel32!BaseThreadInitThunk+0xe
WARNING: Stack unwind information not available. Following frames may be wrong.
021cfc20 777f374e ntdll!RtlInitializeExceptionChain+0xef
021cfc38 00000000 ntdll!RtlInitializeExceptionChain+0xc2

although I see it in all our crashes as well, I can never get wininet!AutoProxyResolver::UpdateAutoproxyWithCompletedDetection+1fe to appear...

Can you post the versions of the files you have? I'll try to get closer to your scenario.
(or WININET!AutoProxyWpadAndResultThread for that matter)

Note: Tried both on cable and wi-fi connections. Same behavior.
Finally someone who is trying!

Just reproduced it again on Win7.
wininet: 11.0.9600.17041
netprofm: 6.1.7600.16385

Judging by your wininet version, you're using IE9! That's completely wrong as bug was introduced in IE10. Please install windows updates and let me know the new results.
Excellent, thanks for the info, already was installing it. Will let you know the new results in a bit.
Curiously on my Win8 machine the breakpoint doesn't hit at all, and netprofm doesn't load, exactly like MS engineer reported. Will try to figure why.

Sidenote: despite my deep respect to MS quality of code, this time their way of handling the issue is downright terrible. They have all source code and can easily answer why the component won't load, but somehow I have to do it myself because they're lazy enough to ignore this issue for over a year now.
Good news! Was able to reproduce with 11.0.9600.17041, the AutoProxyWpadAndResultThread now gets called and suspending its thread and waiting for netprofm to be unloaded DOES crash it on resume.  Thanks for your repro, very well done.
I figured why Win8 is different.

There're a few different proxy resolvers in WININET. The list of used resolvers is composed in WININET!WxProxyManager::Initialize(). It will always make a DirectAccessResolver and add it to the list, and call CreateAutoProxyForProcessType() to create another resolver. It will test if it's running on Win8, checking the result of RtlGetVersion() (not affected by compatibility settings!). On Win7, AutoProxyResolver object is created. On Win8, it will be WinHttpClientResolver.
Indeed, for all 3 signatures the only OS is Win7 over the last 28 days.
IE10 is only available for Win7 and further.

Summary:
On Win8 the bugged component exists, but is not used.
The bug first occurs in IE10. 
IE10 is not available on pre-Win7.

This is why Win7 is the only affected OS.
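
The summary can be written down as a small predicate. This is a sketch of the reasoning, not WinINet's code: `OsVersion`, `chooseResolver` and `isAffected` are hypothetical names, and treating everything from 6.2 upward as "Win8" is an assumption (the disassembly only shows a Win8 check). The version numbers themselves are the standard ones: Windows 7 is 6.1, Windows 8 is 6.2.

```cpp
// OS version as reported by RtlGetVersion().
struct OsVersion {
    unsigned majorVersion;
    unsigned minorVersion;
};

enum class Resolver { AutoProxyResolver, WinHttpClientResolver };

// Mirrors the choice described for CreateAutoProxyForProcessType():
// Windows 8 (assumed: and later) gets WinHttpClientResolver, older systems
// get AutoProxyResolver.
Resolver chooseResolver(OsVersion os) {
    bool win8OrLater = os.majorVersion > 6 ||
                       (os.majorVersion == 6 && os.minorVersion >= 2);
    return win8OrLater ? Resolver::WinHttpClientResolver
                       : Resolver::AutoProxyResolver;
}

// The crash needs the AutoProxyResolver path plus WinINet from IE10 or newer.
bool isAffected(OsVersion os, unsigned ieMajorVersion) {
    return chooseResolver(os) == Resolver::AutoProxyResolver &&
           ieMajorVersion >= 10;
}
```

With these rules, `isAffected({6, 1}, 11)` is true for Win7 with IE11, while Win7 with IE9 and Win8 with IE10 both come out false. Since IE10 cannot be installed on anything older than Win7, Win7 is the only combination left.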
We have the same problem in our application, we can provide multiple additional dumps on request. Wanted to ask whether someone already tried the suggested "workarounds" (unbalanced dll load, GET_MODULE_HANDLE_EX_FLAG_PIN, CoInitialize [which I can't see working as this is per thread...])

Thanks!
What is the current status of Microsoft ticket?
(In reply to Alexander from comment #108)
> What is the current status of Microsoft ticket?

They've not been able to reproduce, and are currently waiting on us for more information. The ticket is still open.
Since you told that MS engineer has failed to reproduce, I found the reason why: the problem only occurs on Win7 with IE10 or IE11. Did you send this information?
(In reply to Alexander from comment #110)
> Since you told that MS engineer has failed to reproduce, I found the reason
> why: the problem only occurs on Win7 with IE10 or IE11. Did you send this
> information?

Yes, the os/ie versions were in the original report. I'll confirm with the engineer but I'm confident they've been testing on win7 sp1.
I'm not sure what should I do to have it moving. I have already found a repro that works 100% for me, and another person confirmed he can reproduce as well. To my knowledge, it should be no problem to reproduce it anywhere else on Win7 IE10 / IE11. Still, Microsoft is waiting for more information.

It would be best if you spend a few minutes to confirm repro yourself, then answer anything that is still needed to MS. Otherwise, please state clearly what is required. I really want this issue to be finally fixed.
Another month has passed. Such shame. How do I find someone interested in solving this?
Jim, I'm at a loss as to what the next steps for this bug should be. We've been unable to reproduce this internally, which also seems to be the case for Microsoft. There has to be some unseen variable on Alexander's system that we aren't taking into account. Can you think of anything?
Flags: needinfo?(jmathies)
Saying that you were unable to reproduce, do you mean the steps with winDBG I described? If yes, what exact configuration (OS, IE version, internet type) have you tried? Which specific step in my repro did not work for you?
(In reply to Alexander from comment #115)
> Saying that you were unable to reproduce, do you mean the steps with winDBG
> I described? If yes, what exact configuration (OS, IE version, internet
> type) have you tried? Which specific step in my repro did not work for you?

Assuming you mean comment 84, step 8 fails for me. I press F5 and it does not crash. I personally tested in Windows 7 SP1 with all the latest updates and IE 10 installed connected to a WPA2 Wireless-N network.
Give me a few minutes to check if I can reproduce it myself. Maybe it got fixed already. Don't get lost!
I have successfully reproduced on IE 11.0.9600.17207. There're updates available for IE, installing now and will check again. Meanwhile, some ideas where it could have gone wrong:
1) Are you sure you suspended thread whose stack contained "WININET!AutoProxyWpadAndResultThread" in step 4?
2) Are you sure you had netprofm module unloaded and not loaded again before following to step 6?
3) Are you sure you resumed the same thread you suspended in step 4, not just the last thread you had command printed for?

If you still can't reproduce, can we arrange a TeamViewer session? Can we communicate faster over Skype? Please send me an e-mail to arrange that.
I have reproduced on the very latest IE 11.0.9600.17239.

Some notes to the repro scenario:
a) On step 4, you can have all threads wrong. If this is the case, have your networks changed while still on step 4. You could disable and reenable your wifi / network cable / etc to have networks changed. You could also disable and reenable network adapter in Windows (this doesn't require physical manipulations).
b) On step 5, press F5 until everything calms down. The latest thing you see in debugger console should be module unload notifications. They could appear a few seconds after your last action. You want netprofm.dll unloaded.
PS: IE10 should be just fine for the repro as well.
Has anyone ever managed to reproduce this in firefox?
Flags: needinfo?(jmathies)
I tried again and still can't reproduce using Thunderbird as the watched process. It seems like I'm failing in step 4, in that I'm never hitting a subsequent breakpoint. Unfortunately I'm not educated enough to understand how to use your attached repro files. Please provide clearer, simpler instructions that I can follow.
(In reply to Robert Kaiser (:kairo@mozilla.com, slow reaction due to vacation backlog) from comment #69)
> This is #6 and #8 on Firefox 25 release right now.

Some good news - we're down to #48 in fx31 for [@ CPrivAlloc::operator delete(void*)]

The other two signatures have fallen out of the top 50.
Here is the current topcrash data for the combined signatures in this bug.

=====================================
Firefox 31.0 => #24 w/ 2592 crashes
Firefox 32.0 => #63 w/ 436 crashes
Firefox 33.0 => #10 w/ 145 crashes
Firefox 34.0 => #26 w/ 44 crashes

Thunderbird 31.0 => #37 w/ 70 crashes
Thunderbird 32.0 => #5 w/ 6 crashes
Thunderbird 33.0 => #2 w/ 10 crashes
Thunderbird 34.0 => 0 crashes
=====================================

Based on this data, I suspect this is not a real-world topcrash. Note the ranks on the Thunderbird development branches are likely exaggerated due to the low number of ADIs on those branches.

As such, I'm willing to give this one more attempt given better instruction. However I cannot realistically devote much more time as there are more serious bugs that need my attention.
I tried to reproduce on Firefox. It turned out to be somewhat difficult to make it load this whole AutoProxy thing. One way I found is to download any file, Firefox then calls xul!DownloadPlatform::MapUrlToZone(), which loads a COM object, which seems to cause loading of AutoProxy. However, after such loading, it is never unloaded for some reason. Didn't have enough time to figure.

Anyway, it's a completely probable scenario to have some DLL injected into Firefox which makes required calls. I saw Dropbox's dll loaded into firefox in the first dump for example.
As for Thunderbird, there seems to be the same "problem" here: it doesn't call necessary WinAPI on its own. At least I didn't make it call them clicking here and there [not using Thunderbird on my own]
I tried, but didn't find a good way to have netprofm.dll loaded. Obviously it gets loaded according to crash dumps. Probably some other installed program is needed that will inject DLLs, or some specific feature in Firefox that would load a good COM object. Will try to figure something tomorrow.
For Thunderbird I still consider this to be topcrash, and a serious problem for some users:
- In thunderbird 24.6 the combined crash count of the two signatures still ranks it at #9
- Thunderbird 31 stats are incomplete - 31 is not released (updates are throttled), and it is unclear yet why the crash rate is *currently* lower in version 31. Without knowing a clear reason for the drop my guess is we will see it increase to its former glory. I've been in contact with a few TB24 crash users and so far they indicate it still crashes for them in 31. But it's still early so I don't have a statistically significant sample.

It might be helpful if you can determine a reason for the decrease in Firefox crashes. Perhaps some are leaving or simply avoiding the condition which causes it.
I decided I'm not interested in finding such program that will inject proper DLL or looking through all features. Sorry. It is simply not reasonable for me to do everything to fix a bug in YOUR product, while no programmers at mozilla show reasonable interest.

I've already found a good repro which makes reliable crashes with same symptoms. I believe everyone understands how little is needed to have a rogue InternetOpen() call in your process, with all those injected DLLs, printer drivers, shell extensions, COM calls from within firefox and so on.
There is something real world going on, as this is #5 overall in volume on Firefox33 betas at 1.53% of all crashes (4143 raw volume over 7 days).
Looks like this was exacerbated by the latest round of IE updates. The beta channel shows 8332 crashes in the two weeks after Patch Tuesday, compared to 1546 crashes in the two weeks before.

Alexander's investigations have been most helpful. With his sample app I was able to reproduce this a year ago (comment 55) and sent a debug trace to MS via Jim. Was that not sufficient? Does MS insist on a repro with a real Firefox? 

Anyone who's been unable to reproduce, please try again with the latest IE11.
I was able to reproduce again with the newer sample in comment 84. It took me several attempts to get a thread with WININET!AutoProxyWpadAndResultThread, but eventually I got one. At step 5 I saw netprofm.dll get unloaded even though the suspended thread was still in the middle of it.

Unloading netprofm is definitely bad, and it's clear why it would crash a thread that was running netprofm code, but it's not clear to me why this explains a crash in ole32. Maybe the unload messed up ole32 data structures? Or maybe both issues are just symptoms of something else entirely.

That ought to be enough for MS to diagnose, but in the meantime we still have a crash, so we should look for workarounds.

In the 33 beta, many stacks look like bp-4381a74b-fb72-47db-8c8e-fc6ff2140915, where the CoCreateInstance comes from Nv3DVUtils::Initialize. (Interestingly, I see 98% Intel GPUs.) There's no sign of network stuff on the stack, but maybe the badness happened earlier, or on a different thread.

The sample app could avoid the unload by pinning netprofm (and maybe wininet and npmproxy for good measure). But I'm not sure if such a workaround would help us in Firefox with these crashes coming from graphics. Worth a try if we don't have better ideas?
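As a rough illustration of the pinning idea, here is a minimal sketch. `PinModule` is a hypothetical helper name, not something from the sample app or the Firefox tree, and the non-Windows branch exists only so the snippet compiles elsewhere. It assumes that loading the DLL directly and then pinning it with `GetModuleHandleExW` and `GET_MODULE_HANDLE_EX_FLAG_PIN` is acceptable for these modules. Whether this helps with the crashes seen in Firefox is untested.

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: loads dllName if necessary and pins it so that it
// stays mapped until the process exits. Returns true on success.
bool PinModule(const wchar_t* dllName) {
    assert(dllName != nullptr);
#ifdef _WIN32
    // GetModuleHandleExW only finds modules that are already loaded, so map
    // the DLL first. This reference is deliberately never released.
    if (!LoadLibraryW(dllName)) {
        return false;
    }
    // The PIN flag asks the loader to keep the module until process exit,
    // no matter how many times FreeLibrary is called on it afterwards.
    HMODULE pinned = nullptr;
    return GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_PIN, dllName,
                              &pinned) != FALSE;
#else
    (void)dllName;
    return false;  // Nothing to pin outside Windows.
#endif
}
```

The sample app would call `PinModule(L"netprofm.dll")` once at startup, and possibly the same for `wininet.dll` and `npmproxy.dll`.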
Crash in OLE: this was quite a long time ago, so excuse me if I don't remember things perfectly. Back then I was reproducing "for real", without WinDBG. I managed to capture an xperf log and investigated it. I saw something like one thread entering CreateInstance() while another thread called CoUninitialize(). The first thread blocked on a critical section deep inside OLE, which the other thread held while cleaning up. As soon as the cleanup finished, the first thread acquired the critical section and attempted to proceed with creating the instance, but a variable that held the "server" entity was NULL by that time (I decided it was the server entity by checking its contents in normal circumstances: that object had a vfptr with a reasonable name). So I'd say that because the module was unloaded prematurely, an internal OLE pointer was NULLed, and OLE crashed trying to call a virtual function through that pointer.
Crash spike: without having investigated these new dumps, I'd assume that the Nv3DVUtils::Initialize() crash is a new, unrelated problem, probably caused by a buggy graphics driver. Is it possible to check how much of the recent crash spike is attributed to stacks containing Nv3DVUtils::Initialize?
I split the Nv3DVUtils-related crash into bug 1071783. This bug should continue to only track the wininet-related issue.
Attached file WinDbg notes
Ok, I think I have something. The detailed stacks are in the attachment. The core problem is that wininet has a thread that touches a COM object outside the scope of a CoInitialize.

Some abbreviations:
AutoProxyThread = the thread with base of WININET!AutoProxyResolver::AutoProxyThreadStart
SwpadWpad = the thread with base of WININET!SwpadWpad
ResultThread = the thread with base of WININET!AutoProxyWpadAndResultThread
EnsureNLP = a COM call to netprofm!CPubINetworkListManager::EnsureNLPConnected
count = the number of outstanding CoInitialize calls, aka ole32!g_cProcessInits

Timeline:
1. AutoProxyThread calls CoInitialize (count=1)
2. SwpadWpad calls CoInitialize (count=2)
3. SwpadWpad calls EnsureNLP
4. SwpadWpad calls CoUninitialize (count=1)
5. ResultThread calls CoInitialize (count=2)
6. ResultThread calls CoUninitialize (count=1)
7. ResultThread calls EnsureNLP
  7b. ResultThread may get suspended here!
8. AutoProxyThread calls CoUninitialize (count=0)
  8b. count=0 causes ole32!ProcessUninitialize
  8c. ProcessUninitialize unloads netprofm.dll, because that DLL had been loaded only for the purpose of hosting a COM object, and we don't need COM anymore

If ResultThread gets suspended at 7b and resumed after 8c, we crash! Depending on exactly where ResultThread was suspended, it may crash in unloaded_netprofm, or inside ole32 CoCreate machinery.

From the stacks it's clear that the SwpadWpad thread has a CoInit scope around its entire ThreadProc, but the ResultThread does not have that. The ResultThread only uses CoInit for a small piece of its work, and then it continues to use COM after the CoUninit.

This issue is intermittent because of thread timing and because the EnsureNLP at step #7 does not always happen. I don't know what conditions cause it. For me it happens in about 20% of test runs.

Looking at 8b, I think we can work around this by taking an unbalanced CoInitialize, on any thread. We just have to keep the process-wide counter above zero.
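To make the counter arithmetic concrete, the timeline can be replayed in a small portable model. This is only a sketch: it calls no real COM APIs, all type and function names are made up for illustration, and it assumes the worst case from 7b, where a COM call made outside the calling thread's own CoInitialize scope is still in flight when the counter reaches zero.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Op { CoInit, CoUninit, ComCall };

struct Event {
    std::string thread;
    Op op;
};

struct Replay {
    int finalCount = 0;             // model of ole32!g_cProcessInits at the end
    bool netprofmUnloaded = false;  // counter reached zero (steps 8b/8c)
    bool wouldCrash = false;        // unload happened under an in-flight call
};

// Steps 1-8 of the timeline, in order.
std::vector<Event> BugTimeline() {
    return {
        {"AutoProxyThread", Op::CoInit},    // 1 (count=1)
        {"SwpadWpad", Op::CoInit},          // 2 (count=2)
        {"SwpadWpad", Op::ComCall},         // 3 EnsureNLP, inside its scope
        {"SwpadWpad", Op::CoUninit},        // 4 (count=1)
        {"ResultThread", Op::CoInit},       // 5 (count=2)
        {"ResultThread", Op::CoUninit},     // 6 (count=1)
        {"ResultThread", Op::ComCall},      // 7 EnsureNLP, outside any scope
        {"AutoProxyThread", Op::CoUninit},  // 8 (count=0)
    };
}

// unbalancedInit models the proposed workaround: one extra CoInitialize
// that is never matched by a CoUninitialize.
Replay ReplayTimeline(const std::vector<Event>& events, bool unbalancedInit) {
    Replay r;
    int count = unbalancedInit ? 1 : 0;
    std::map<std::string, int> depth;  // per-thread CoInitialize nesting
    bool unscopedCallInFlight = false;
    for (const Event& e : events) {
        switch (e.op) {
            case Op::CoInit:
                ++count;
                ++depth[e.thread];
                break;
            case Op::CoUninit:
                assert(depth[e.thread] > 0 && "CoUninit without CoInit");
                --count;
                --depth[e.thread];
                if (count == 0) {
                    // Models ProcessUninitialize unloading netprofm.dll.
                    r.netprofmUnloaded = true;
                    r.wouldCrash = r.wouldCrash || unscopedCallInFlight;
                }
                break;
            case Op::ComCall:
                if (depth[e.thread] == 0) {
                    unscopedCallInFlight = true;  // suspended at 7b
                }
                break;
        }
    }
    r.finalCount = count;
    return r;
}
```

Replaying `BugTimeline()` with `unbalancedInit` set to false ends with the counter at zero and the unload happening under ResultThread's call. With it set to true, the counter never drops below one, so the unload is never triggered. In a real process the extra initialization would presumably be a single `CoInitializeEx` call with no matching `CoUninitialize`.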
I would like to remind that it's best to finally have Microsoft fix that instead of making workarounds. I don't understand where the communication barrier is, but the amount of information already gathered about the problem is overwhelming, it looks like MS developers need to spend just minutes to validate and fix it now.
(In reply to Alexander from comment #137)
> I would like to remind that it's best to finally have Microsoft fix that
> instead of making workarounds. I don't understand where the communication
> barrier is, but the amount of information already gathered about the problem
> is overwhelming, it looks like MS developers need to spend just minutes to
> validate and fix it now.

MS recently contacted me to let me know they have another similar crash with better STR, which they are investigating. We can't count on a fix from them, though, so a workaround is preferred.
Whiteboard: [tbird topcrash] → [tbird topcrash][ms-support][113061010501660]
Whiteboard: [tbird topcrash][ms-support][113061010501660] → [tbird topcrash][ms-support][REG:114063011580728]
It has been a long time since we learned that the problem is in WININET!AutoProxyWpadAndResultThread, which lacks a reference to netprofm. That has now been refined: it lacks CoInitialize(). But somehow they still don't know that. Obviously, if they knew, they wouldn't have to "investigate". Can you make sure they know what we know?
From MS today: they've apparently identified this bug in Win7 and they are working on a fix. They will contact me when there's more progress.
Great news! Honestly, I've given up all hope on a fix already.
It appears that the Internet Explorer cumulative security update released yesterday contains the fix for the issue which we were investigating on this case. Could you please let us know if you are still seeing reports of this issue? If you have a repro available with one of your users, you could try again after installing the patch below. 

https://technet.microsoft.com/en-us/library/security/ms14-080.aspx
(In reply to Jim Mathies [:jimm] from comment #142)
> It appears that the Internet Explorer cumulative security update released
> yesterday contains the fix for the issue which we were investigating on this
> case. Could you please let us know if you are still seeing reports of this
> issue?

I don't see any notable change in the volume of those signatures in crash-stats in the last 10 days or so, but at the low volume those are in nowadays (and how generic one of them is), that doesn't necessarily mean anything.
Ditto for Thunderbird. Using both 5- and 10-day intervals to compare "before" 12/10 with 12/20, I also don't see a statistically significant difference.
Crash Signature: , IActivationPropertiesOut**) ] → , IActivationPropertiesOut**) ] [@ CPrivAlloc::operator delete] [@ @0x0 | CClientContextActivator::CreateInstance ] [@ CClientContextActivator::CreateInstance ]
no longer topcrash
Whiteboard: [tbird topcrash][ms-support][REG:114063011580728] → [tbird crash][ms-support][REG:114063011580728]
Keywords: qawanted

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME