Closed Bug 1704500 Opened 3 years ago Closed 1 year ago

Crash in [@ CLegacyPositionRequest::GetLocationFromSession]

Categories

(Core :: Widget: Win32, defect, P2)

All
Windows 10
defect

Tracking

()

RESOLVED FIXED
111 Branch
Tracking Status
firefox-esr91 --- wontfix
firefox-esr102 --- wontfix
firefox101 --- wontfix
firefox102 --- wontfix
firefox103 --- wontfix
firefox109 --- wontfix
firefox110 --- wontfix
firefox111 --- fixed

People

(Reporter: aryx, Assigned: handyman)

References

(Blocks 1 open bug)

Details

(Keywords: crash, Whiteboard: [win:stability])

Crash Data

Attachments

(10 files, 5 obsolete files)

841.49 KB, text/plain
Details
231.82 KB, text/plain
Details
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Not a new crash, 200-400 affected installations per release cycle, all on Windows 10

Most common affected sites are https://tinder.com/ and https://www.olg.ca/ - at least the latter doesn't prompt for one's location (maybe only if one has an account?)

Crash report: https://crash-stats.mozilla.org/report/index/ff288022-6ab4-4fbd-a491-7436f0210412

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 8 frames of crashing thread:

0 locationapi.dll void __thiscall CLegacyPositionRequest::GetLocationFromSession 
1 locationapi.dll static void __stdcall CLegacyPositionRequest::WaitForLocationInformationThreadProc 
2 ntdll.dll TppExecuteWaitCallback 
3 ntdll.dll TppWaitCompletion 
4 ntdll.dll TppWorkerThread 
5 kernel32.dll BaseThreadInitThunk 
6 ntdll.dll _RtlUserThreadStart 
7 ntdll.dll _RtlUserThreadStart 

Does the crash data contain URLs with which this issue is reproducible?

Since the crash is in Windows' libraries, the issue might need to be forwarded to Microsoft.

Severity: -- → S2
Priority: -- → P2

(In reply to Mirko Brodesser (:mbrodesser) from comment #1)

Does the crash data contain URLs with which this issue is reproducible?

Since the crash is in Windows' libraries, the issue might need to be forwarded to Microsoft.

Andrew, do you know which component is the best place for this issue?

Flags: needinfo?(continuation)

I think this is the right component. I looked at a few crashes, and the main thread had mozilla::dom::WindowsLocationProvider::SetHighAccuracy() in the stack for all of them.

Some of these correlations certainly look suspicious:

(100.0% in signature vs 00.39% overall) Module "LocationFrameworkPS.dll" = true
(100.0% in signature vs 00.76% overall) Module "LocationApi.dll" = true
(100.0% in signature vs 01.18% overall) Module "deviceaccess.dll" = true
(100.0% in signature vs 01.31% overall) Module "SensorsUtilsV2.dll" = true
(100.0% in signature vs 01.31% overall) Module "SensorsNativeApi.V2.dll" = true
(100.0% in signature vs 01.50% overall) Module "wlanapi.dll" = true
(100.0% in signature vs 01.98% overall) Module "PortableDeviceTypes.dll" = true
(100.0% in signature vs 03.34% overall) Module "SensorsApi.dll" = true
(100.0% in signature vs 19.33% overall) Module "msvcp110_win.dll" = true [100.0% vs 36.45% if platform_pretty_version = Windows 10]
(100.0% in signature vs 23.60% overall) Module "OneCoreUAPCommonProxyStub.dll" = true [100.0% vs 48.39% if platform_version = 10.0.19043]
(100.0% in signature vs 22.92% overall) Module "textinputframework.dll" = true [100.0% vs 50.17% if platform_version = 10.0.19042]
(17.07% in signature vs 00.75% overall) CPU Info = family 6 model 158 stepping 13

Toshi, is this something you might be able to look at? The crash volume looks pretty low, but maybe there's something we could do to improve the situation without spending too much time on it. Thanks.

Flags: needinfo?(continuation) → needinfo?(tkikuchi)

The crashing thread is a thread to dispatch a location callback registered here. It seems that the registered callback had been unregistered when the thread accessed the registered object.

LocationApi!CLegacyPositionRequest::GetLocationFromSession+0x63:
00007fff`1b59141b 488b8fa0000000  mov     rcx,qword ptr [rdi+0A0h]
  rdi = LocationApi!CLegacyPositionRequest
  rdi+0A0h is set by RegisterForReport and reset by UnregisterForReport
00007fff`1b591422 488b01          mov     rax,qword ptr [rcx] <<<< crash here because rcx was 0
00007fff`1b591425 488d5538        lea     rdx,[rbp+38h]
00007fff`1b591429 488b4040        mov     rax,qword ptr [rax+40h]
00007fff`1b59142d ff15853a0300    call    qword ptr [LocationApi!_guard_dispatch_icall_fptr (00007fff`1b5c4eb8)]

This may happen if we unregister the callback through the disconnect timer here when the worker thread is running CLegacyPositionRequest::GetLocationFromSession. I don't know whether there is a safe way to unregister the callback outside LocationApi.dll.

Currently we register the callback every time JS calls navigator.geolocation.getCurrentPosition. If we register a single callback that caches the latest location for all listeners, we can reduce the probability of this crash (and also improve geolocation performance)?

Flags: needinfo?(tkikuchi)

Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.

Component: DOM: Geolocation → Widget: Win32
Flags: needinfo?(spohl.mozilla.bugs)
Priority: P2 → --

(In reply to Gian-Carlo Pascutto [:gcp] from comment #5)

Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.

Added to list of triaged bugs. The component could probably have remained DOM:Geolocation since that's the component that's involved in this crash, but we'll track it under Win32 and get it assigned.

Flags: needinfo?(spohl.mozilla.bugs)
Priority: -- → P2
Whiteboard: [win:stability]
Assignee: nobody → mvollmer
Status: NEW → ASSIGNED
See Also: → 1770126

I have been dabbling with different tests, where I open and close multiple tabs and windows using geolocation services, to see if there is any issue with timings and such. So far I have been unsuccessful in reproducing the crash. If someone wants to have a look at logs from a short test session with 3 parallel tabs and then another 3 tabs opened a few seconds apart each, I have attached those here. If someone wants to dive into bigger logs from opening thousands of tabs and running geolocation tests for hours, let me know, and I'll upload the logs somewhere.

fyi: One thing I noticed in past tests was that Geolocation objects didn't get garbage collected until all tabs and windows were closed, even though Geolocation objects are per tab/window. However with the fix in bug 1770126 this does not happen anymore, suggesting that this has been fixed and is not related to this crash.

Did some joint debugging. All the crash reports are off the main thread, while we're calling mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool) on the main thread, which can apparently invoke a response off main thread. So that at least explains why, despite all the WindowsLocationProvider calls are on the main thread/main process, we still can race. No real insight into the conditions to trigger this.

Looking at the logs also revealed that we call SetHighAccuracy before mLocation is even initialized, which makes using high-accuracy location requests non-functional in Firefox, unless they're preceded by a normal getlocation() request. So that's at least one bug to fix here, which may or may not be related to the above.

The multiple threads may be a red herring: it looks like Windows invokes a worker thread to wait on the Windows Location Service to do its thing, and blocks our main thread waiting for a response. So even though this is multi-threaded, this should be transparent to us, which is consistent with this being main process and in COM Single Threaded Appartment mode (the crashes precede win32k lockdown, which is what enabled Multithreaded COM, as all documented in ProcessRuntime.cpp).

Which would mean we're back to the hypothesis that it is a use after free. From investigating the code, the SetHighAccuracy is not all that relevant either, as nsGeolocationRequest::Allow() always calls it with the argument appropriately set.

I tried to intentionally break this by doing what comment 4 says the problem is: forcibly unregistering the event listener immediately after registering it, but still calling the SetDesiredAccuracy() even though there's no listeners. Doesn't crash. If that doesn't crash, I'm starting to be unclear what the UAF or threading problem could be. There's no evidence any of this is called off of main thread/main process (see first paragraph), and the ILocationEvents object is reference counted, so even destroying/unregistering everything shouldn't cause that.

The next step would be perhaps to load the exact same build as crashed into the debugger, and look at exactly what is supposed to sit on the offset that crashes (which is 160 bytes off the start of some pointer that Windows is looking for, and presumably would be an ILocationEvents object - unfortunately the minidumps don't have enough context to see this).

While digging through the internals of IUnknown and all that stuff, I did find that the QueryInterface method on LocationEvent has at least 2 bugs:

https://docs.microsoft.com/en-us/windows/win32/api/unknwn/nf-unknwn-iunknown-queryinterface(refiid_void)
https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/implementing-iunknown-in-c-plus-plus
https://devblogs.microsoft.com/oldnewthing/20040326-00/?p=40033

STDMETHODIMP
LocationEvent::QueryInterface(REFIID iid, void** ppv) {
  if (iid == IID_IUnknown) {
    *ppv = static_cast<IUnknown*>(this);
  } else if (iid == IID_ILocationEvents) {
    *ppv = static_cast<ILocationEvents*>(this);
  } else {
    return E_NOINTERFACE;
  }
  AddRef();
  return S_OK;
}

So this neither returns E_POINTER if ppv is null nor does it null out *ppv if we return E_NOINTERFACE. Interestingly, I could imagine that incomplete error checking in LocationAPI.dll could have an issue with these. That said, this seems pretty far fetched, as about half the QueryInterface implementations in our source tree seem to have one of these mistakes.

Still, seems worth a try. The next step would be to inspect that 0xA0 offset in LocationEvent or generally verify that the offending code in GetLocationFromSession is trying to access parts of LocationEvent objects.

It looks like the thing that sits at offset 0xA0 when LocationEvent is casted to ILocationEvents is __vfptr = 0x00007ffd96b50970 {xul.dll!const mozilla::dom::LocationEvent::`vftable'} {0x00007ffd8a1cb470 {xul.dll!mozilla::dom::LocationEvent::QueryInterface(const _GUID &, void * *)},

This would make some sense wrt the crash? The Windows code is trying to invoke the callback but the virtual method table has already been wiped out.

David has some suspicion regarding the reference counting. Looking at the example code: https://docs.microsoft.com/en-us/windows/win32/api/locationapi/nf-locationapi-ilocation-registerforreport
pLocationEvents there is definitely kept alive until the UnregisterForReport is done. But it seems to me that it would have to be a bug for whoever uses ILocationEvents to drop the reference and then use the object later, so barring concurrency this doesn't seem like it can be the problem (but it could very well work around it, or prevent us from hitting an issue that anyone using the reference code would never hit). An easy fix for our use would then to be to move event to be a member mEvent on mLocation and be automatically Released() when mLocation is Released() (see below), though some fiddling might be needed if we ever use multiple callbacks - but I think we only ever have a single global nsGeolocationService that itself handles multiple callbacks, from looking at the code.

Looking at another crash report: https://crash-stats.mozilla.org/report/index/084fde43-9dc5-4732-83f5-7a3580220601#allthreads

Main Thread
1 	user32.dll 	unsigned long RealMsgWaitForMultipleObjectsEx(unsigned long, void* const*, unsigned long, unsigned long, unsigned long) 	None
2 	combase.dll 	CCliModalLoop::BlockFn(void**, unsigned long, unsigned long*) 	onecore\com\combase\dcomrem\callctrl.cxx:2156 
...
13 	xul.dll 	mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool) 	dom/system/windows/WindowsLocationProvider.cpp:270

Worker Thread
5 	ntdll.dll 	RtlEnterCriticalSection 	None
6 	LocationApi.dll 	ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) 	None
7 	LocationApi.dll 	void CLegacyPositionRequest::GetLocationFromSession(bool) 	None
8 	LocationApi.dll 	static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) 	None
9 	ntdll.dll 	TppExecuteWaitCallback 	None 

Another Worker Thread
5 	ntdll.dll 	RtlEnterCriticalSection 	None
6 	LocationApi.dll 	ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) 	None
7 	LocationApi.dll 	void CLegacyPositionRequest::DispatchStatusChanged(enum LOCATION_REPORT_STATUS) 	None
8 	LocationApi.dll 	long CLegacyPositionRequest::StartLocationSession 	None
9 	LocationApi.dll 	long CLegacyPositionRequest::RestartLocationSession 	None
10 	LocationApi.dll 	virtual long CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) 	None
11 	LocationApi.dll 	virtual long CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) 	None
12 	rpcrt4.dll 	Invoke 	None
Ø 

Crashing Thread
0 	locationapi.dll 	void CLegacyPositionRequest::GetLocationFromSession(bool) 		context
1 	locationapi.dll 	static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) 		cfi
2 	ntdll.dll 	TppExecuteWaitCallback 		cfi

Notice how there's 3 LocationAPI.dll calls running on different threads. They're blocked on a COM Lock (waiting for "LocationSession" whatever that is?), but this at least explains how it would be possible for a worker thread to race with the Unregister call even if the latter blocks the main thread.

Unfortunately, if that's the problem, then the ILocationEvents really needs to be a global object with the application lifetime, and itself have threadsafe support for adding the callbacks (and setting mProvider...because if the above happens it could outlive that as well).

tl;dr

  • SetHighAccuracy should set a member flag but not assume mLocation already exists and that it can call stuff on it
  • QueryInterface on LocationEvent should be bugfixed just because
  • The LocationEvent passed into RegisterForReport needs to be a global that either owns mProvider too (yuck, at least there should be only one) or we need to carefully and thread-safely set/unset it as a member (also yuck). Seems unlikely this would fix anything based on current analysis.

I'm still hoping to figure out why this breaks but I've reproduced the exact crash in a test case here. It fails 100% of the time, in a second or two. This is the key:

  // HRESULT hr = spLocation.CoCreateInstance(CLSID_Location); // works
  HRESULT hr = spLocation.CoCreateInstance(CLSID_Location, nullptr, CLSCTX_INPROC_SERVER); // fails

For an as-yet-unknown reason, the Location API doesn't like the inproc_server context we are feeding it. Running the default CLSCTX_ALL has run for hours without crashing.

CLSCTX_ALL is in line with MSDN sample code.

Current hypothesis from looking at the LocationAPI.dll disassembly is that GetLocationFromSession relies on an atomic/spinlock to make sure that whatever is going to give it ILocationInformation objects still exists, but that StartLocationSession has a single ATL::AtlComPtrAssign that is NOT under that lock, and which could wipe/rewrite that object at an in-opportune time.

It's hard to infer the locking from looking at the disassembly, but this matches what David found with memory breakpoints. In any case it seems calling Start/Stop on one thread while another tread is in GetLocationForSession is bad news. I guess the CLSCTX_ALL avoids that?

For more fun, the suspect part of the code is very different between Windows 11 and Windows 10, due to different inlining of CRITICAL_SECTION, I don't see the bug on my Windows 11's copy of the DLL (we do have reports on that OS), and David's testcase doesn't fail.

Pushed by daparks@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bd1520016295
Windows ILocation should be allowed in all COM execution contexts r=Jamie
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 103 Branch

The patch landed in nightly and beta is affected.
:mvollmer, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(mvollmer)

Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!

Beta/Release Uplift Approval Request

  • User impact if declined: Potential low volume crash when using location.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Low risk: changing a single constant in the code.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9279350 - Flags: approval-mozilla-beta?
Flags: needinfo?(mvollmer)

Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!

Approved for 102 beta 5, thanks.

Attachment #9279350 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

So that was a bust. I can still fairly reliably reproduce this issue in my sample and I think I see the core issue now. It's complicated. I had wondered why there were always three background threads involved in the crash when it seemed like a two-thread mutex bug, but it turns out the third thread is crucial.

Fundamentally, the crash is that one thread is halfway through CLegacyPositionRequest::RestartLocationSession, which calls CLegacyPositionRequest::StopLocationSession, which clears a pointer, then it calls CLegacyPositionRequest::StartLocationSession, which would reset the pointer but we don't get that far because a background thread from the CLegacyPositionRequest's threadpool crashes trying to use the cleared pointer.

There would be deadlock issues with holding the object's mutex across the Stop and Start calls (also mutex re-entrancy issues). RestartLocationSession tries to handle this case by running single-threaded -- one of the first things StopLocationSession does is wait for its worker threadpool to empty its current queue (kind of -- it looks like this is part of the bug). It does this while it holds the mutex -- it also clears the relevant pointer. StartLocationSession would then restart the threadpool callbacks once the values are restored. The first problem is that StopLocationSession would call WaitForThreadpoolWaitCallbacks to wait for them to complete, but that call is conditioned on another flag whose value it doesn't like. This is where the third thread comes in. It is running CLegacyPositionRequest::DispatchPositionChanged in a second thread from the threadpool. DispatchPositionChanged grabs the mutex at first, sets the bad value, then releases the mutex, then unsets the bad value. It's not clear why it maintains this value. The value appears to be an atomic boolean indicating whether there is already a callback request running but it is racy. I can't see why WaitForThreadpoolWaitCallbacks is gated on this but, in particular, the gate seems backwards -- it only waits for callbacks if there are no callbacks to wait for.

The second issue is also with WaitForThreadpoolWaitCallbacks -- it passes FALSE for fCancelPendingCallbacks. It's part of the following call sequence:

    SetThreadpoolWait(x, 0, 0);
    if ( !flag )
      WaitForThreadpoolWaitCallbacks(x, FALSE);
    CloseThreadpoolWait(x);

Again, I don't know what the point of the flag is but it's also the case that fixing only this won't keep it from running other callbacks between the end of the WaitForThreadpoolWaitCallbacks call and the end of the CloseThreadpoolWait call. That's apparently enough time for this race to be hit. If the code followed the pattern suggested by the documentation of these APIs:

    SetThreadpoolWait(x, 0, 0);
    WaitForThreadpoolWaitCallbacks(x, TRUE);
    CloseThreadpoolWait(x);

then the race wouldn't be possible. I'm guessing it wasn't written the normal way because of some other offending code I haven't seen.


So here's my theory for the step-by-step process to failure:

  1. A worker thread runs the timer callback. That thread looks like this:
	[...]
 	combase.dll!ObjectStubless() Line 176	Unknown	Non-user code. Symbols loaded.
	LocationApi.dll!CLegacyPositionRequest::DispatchPositionChanged(struct ILatLongReportInternal *)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::GetLocationFromSession(bool)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE *,void *,struct _TP_WAIT *,long)	Unknown	Non-user code. Symbols loaded.

DispatchPositionChanged has set our special value to 1 and exited the mutex:

00007FFF7E421179  mov         eax,1  
00007FFF7E42117E  xchg        eax,dword ptr [rbx+0C0h]  
00007FFF7E421184  prefetchw   [rbx+0C4h]  
00007FFF7E42118B  mov         eax,dword ptr [rbx+0C4h]  
00007FFF7E421191  mov         ecx,eax  
00007FFF7E421193  lock cmpxchg dword ptr [rbx+0C4h],ecx  
00007FFF7E42119B  jne         CLegacyPositionRequest::DispatchPositionChanged+0C9h (07FFF7E421191h)  
00007FFF7E42119D  mov         edi,eax  
00007FFF7E42119F  cmp         byte ptr [rbp-10h],0  
00007FFF7E4211A3  je          CLegacyPositionRequest::DispatchPositionChanged+0EDh (07FFF7E4211B5h)  
00007FFF7E4211A5  mov         rcx,qword ptr [rbp-18h]  
00007FFF7E4211A9  call        qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]  
  1. A second worker runs SetDesiredAccuracy, which begins our RestartLocationSession.
	[...]
 	combase.dll!ObjectStubless() Line 176	Unknown	Non-user code. Symbols loaded.
>	LocationApi.dll!CLegacyPositionRequest::StopLocationSession(bool)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::RestartLocationSession(void)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY)	Unknown	Non-user code. Symbols loaded.

It claims the mutex in StopLocationSession, which must happen after step 1. It then reads the bad special value 1 from step 1:

00007FFF7E42200B  xor         r8d,r8d  
00007FFF7E42200E  xor         edx,edx  
00007FFF7E422010  mov         rcx,rdi  
00007FFF7E422013  call        qword ptr [__imp_SetThreadpoolWait (07FFF7E454BB8h)]  
00007FFF7E42201A  nop         dword ptr [rax+rax]  
00007FFF7E42201F  test        esi,esi  
00007FFF7E422021  jne         CLegacyPositionRequest::StopLocationSession+0C8h (07FFF7E422034h)  
00007FFF7E422023  xor         edx,edx  
00007FFF7E422025  mov         rcx,rdi  
00007FFF7E422028  call        qword ptr [__imp_WaitForThreadpoolWaitCallbacks (07FFF7E454BC8h)]  
00007FFF7E42202F  nop         dword ptr [rax+rax]  
00007FFF7E422034  mov         rcx,rdi  
00007FFF7E422037  call        qword ptr [__imp_CloseThreadpoolWait (07FFF7E454BD0h)]  

The value in %esi is the special value 1, so we skip __imp_WaitForThreadpoolWaitCallbacks. This means we have no guarantees about future spawning of callbacks, even after __imp_CloseThreadpoolWait.

  1. Stop then continues to grab the mutex, clear the pointer that will cause the crash (using ATL::AtlComPtrAssign), and release the mutex:
00007FFF7E42204C  call        ATL::CComCritSecLock<ATL::CComAutoCriticalSection>::CComCritSecLock<ATL::CComAutoCriticalSection> (07FFF7E41FADCh)  
00007FFF7E422051  lea         rdi,[rbx+0A0h]  
00007FFF7E422058  mov         rdx,qword ptr [rdi]  
00007FFF7E42205B  test        rdx,rdx  
00007FFF7E42205E  je          CLegacyPositionRequest::StopLocationSession+0FEh (07FFF7E42206Ah)  
00007FFF7E422060  lea         rcx,[rsp+70h]  
00007FFF7E422065  call        ATL::AtlComPtrAssign (07FFF7E416294h)  
00007FFF7E42206A  cmp         qword ptr [rdi],0  
00007FFF7E42206E  je          CLegacyPositionRequest::StopLocationSession+10Eh (07FFF7E42207Ah)  
00007FFF7E422070  xor         edx,edx  
00007FFF7E422072  mov         rcx,rdi  
00007FFF7E422075  call        ATL::AtlComPtrAssign (07FFF7E416294h)  
00007FFF7E42207A  test        bpl,bpl  
00007FFF7E42207D  jne         CLegacyPositionRequest::StopLocationSession+11Fh (07FFF7E42208Bh)  
00007FFF7E42207F  lea         rcx,[rbx+8Ch]  
00007FFF7E422086  call        ATL::CComGITPtr<ILocationEvents>::Revoke (07FFF7E42067Ch)  
00007FFF7E42208B  lea         rcx,[rbx+0A8h]  
00007FFF7E422092  call        ATL::CHandle::Close (07FFF7E420D14h)  
00007FFF7E422097  mov         byte ptr [rbx+0B0h],0  
00007FFF7E42209E  cmp         byte ptr [rsp+40h],0  
00007FFF7E4220A3  je          CLegacyPositionRequest::StopLocationSession+14Ah (07FFF7E4220B6h)  
00007FFF7E4220A5  mov         rcx,qword ptr [rsp+38h]  
00007FFF7E4220AA  call        qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]  
  1. At any point between then and where Start restores things, we can launch a new callback thread that will crash on the null pointer.

These values can be seen in the debugger. I have full core dumps.

I haven't been able to conclude anything about how this relates to the public APIs, like a relationship between report requests and wait jobs. There still may be something we can do to finesse this. We might also be able to get a fix from Microsoft. But barring that, this is a fine candidate for stuffing into a utility process, where it can crash and be restarted without us caring much.

Assignee: mvollmer → davidp99
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

We'll back this out (bug 1774073). We're going to try the utility process approach.

See Also: → 1773456

Working prototype for doing geolocation in a utility process.

AsyncBlockers are used in a few places, in particular including child
process shutdowns, and they always cause waiting (10 seconds at
process close), even if they are not in use. This skips the wait in
that case.

Adds a "utilproc" log to trace utility process launch and shutdown steps.

Depends on D155016

Adds a new type of utility process that is set up to handle Windows OS objects. We are adding this process type to run Windows geolocation APIs but more services are expected to be included in it. The ILocation APIs have a race condition that would otherwise crash the main process. The ILocation work is in a later patch in the series.

Depends on D155017

Adds UtilitySandboxProperties, which abstract the more universal sandbox properties into a data object so that the various types of utility process can simply list them. This also adds a somewhat weak sandbox for the new "WindowsUtils" utility process type.

Depends on D155018

Previously, the ILocation COM object ran in the main process. Due to a race condition, it would sometimes crash the process so we are moving it to a utility process, where it can crash safely. The old WindowsLocationProvider class has become a proxy that forwards requests to the child process via the PWindowsLocation parent actor. The PWindowsLocationChild executes the requests and sends the results back. Failures in ILocation (like geolocation being off in Windows settings) are sent back as well -- they will cause the WindowsLocationProvider to start the MLS geolocation fallback, as they did before.

Depends on D155019

Attachment #9290524 - Attachment description: WIP: Bug 1704500: Skip AsyncBlockers delay when there are no blockers → Bug 1704500: Skip AsyncBlockers delay when there are no blockers r=#ipc-reviewers!
Attachment #9290525 - Attachment description: WIP: Bug 1704500: Add logging to UtilityProcessHost/Manager → Bug 1704500: Add logging to UtilityProcessHost/Manager r=gerard-majax!
Attachment #9290526 - Attachment description: WIP: Bug 1704500: Add WindowsUtils kind of utility process on Windows → Bug 1704500: Add WindowsUtils kind of utility process on Windows r=#ipc-reviewers!,gerard-majax!
Attachment #9290527 - Attachment description: WIP: Bug 1704500: Use const structs to normalize utility process sandbox code → Bug 1704500: Use const structs to normalize utility process sandbox code r=gerard-majax!,bobowen!
Attachment #9290528 - Attachment description: WIP: Bug 1704500: Run Windows' ILocation provider in a utility process → Bug 1704500: Run Windows' ILocation provider in a utility process r=emilio!
Attachment #9281329 - Attachment is obsolete: true

Setting n-i to not lose track of the patches in this bug.

Flags: needinfo?(davidp99)
Attachment #9290527 - Attachment description: Bug 1704500: Use const structs to normalize utility process sandbox code r=gerard-majax!,bobowen! → Bug 1704500: Use structs to normalize utility process sandbox code r=gerard-majax!,bobowen!
Attachment #9290528 - Attachment description: Bug 1704500: Run Windows' ILocation provider in a utility process r=emilio! → Bug 1704500: Run Windows' ILocation provider in a utility process r=emilio!,cmartin!

Make sure that the geolocation utility process restarts for georequests that arrive after a crash. Note that this tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).

Also generalizes some utility process testing functions for use with all utility process sandbox types.

Depends on D155020

Attachment #9290524 - Attachment is obsolete: true

This eliminates a redundant enum that we had to keep synchronized with another.

Depends on D155020

This eliminates a redundant enum that we had to keep synchronized with another.

Depends on D155020

This eliminates a redundant enum that we had to keep synchronized with another.

Depends on D155020

Also makes the existing utility process test functions a bit more general.

Depends on D162943

Make sure that the geolocation utility process restarts for georequests that arrive after a crash. This tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).

Depends on D162944

Attachment #9303369 - Attachment is obsolete: true
Attachment #9305047 - Attachment is obsolete: true
Attachment #9305048 - Attachment is obsolete: true
Flags: needinfo?(davidp99)

Marking this as blocking the out-of-process file picker, since (per discussion in meetings) some of the infrastructure in this patchset should be useful for the currently-planned implementation thereof.

Blocks: 1677170
Pushed by daparks@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5d62d1a33cf8
Add logging to UtilityProcessHost/Manager r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/e581bfdf0f8e
Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika
https://hg.mozilla.org/integration/autoland/rev/f85e8eed6fed
Use structs to normalize utility process sandbox code r=gerard-majax,bobowen
https://hg.mozilla.org/integration/autoland/rev/93e59d5488dc
Run Windows' ILocation provider in a utility process r=emilio,cmartin
https://hg.mozilla.org/integration/autoland/rev/1c84f8f0796a
Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/817d104b3a3e
Add utility process test helpers r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/8f1b73a96bb0
Test Windows geolocation utility process restart on crash r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/da69da68588d
1704500, 1704500, 1704500, 1704500, 1704500, 1704500: apply code formatting via Lando

Backed out 8 changesets (Bug 1704500) for causing build bustages on UtilityProcessHost.cpp.
Backout link
Push with failures
Failure Log
Also Bd failure Log
Also Failure Log

Flags: needinfo?(davidp99)
Pushed by daparks@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0d1c0dcd7645
Add logging to UtilityProcessHost/Manager r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/341ec2465c7d
Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika
https://hg.mozilla.org/integration/autoland/rev/241be693628d
Use structs to normalize utility process sandbox code r=gerard-majax,bobowen
https://hg.mozilla.org/integration/autoland/rev/d4508f8ba7df
Run Windows' ILocation provider in a utility process r=emilio,cmartin
https://hg.mozilla.org/integration/autoland/rev/c2474d1c0046
Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/897b985f9298
Add utility process test helpers r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/e48723d8d42e
Test Windows geolocation utility process restart on crash r=gerard-majax

Backed out for causing build bustages

Pushed by daparks@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6bfb4d8de1e1
Add logging to UtilityProcessHost/Manager r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/a1d1a97a2ffa
Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika
https://hg.mozilla.org/integration/autoland/rev/90a8d51fc17c
Use structs to normalize utility process sandbox code r=gerard-majax,bobowen
https://hg.mozilla.org/integration/autoland/rev/30632246ee2b
Run Windows' ILocation provider in a utility process r=emilio,cmartin
https://hg.mozilla.org/integration/autoland/rev/313b8770dc51
Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/747d84ff17c7
Add utility process test helpers r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/c3ae64389c26
Test Windows geolocation utility process restart on crash r=gerard-majax

Backed out for causing multiple failures in test_geolocation_reset_accuracy/browser_utility_geolocation_crashed

Backout link: https://hg.mozilla.org/integration/autoland/rev/f115e287797670e7358559d07f525d5bfd6f5eb3

Push with failures

Failure log xpc

Failure log bc

Flags: needinfo?(davidp99)

Oof. These tests pass on Win10 locally, with or without location turned on, so this is some other test environment discrepancy. This will take some effort to track down.

Flags: needinfo?(davidp99)
Pushed by daparks@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d96e810e20ec
Add logging to UtilityProcessHost/Manager r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/88e8015e1ec4
Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika
https://hg.mozilla.org/integration/autoland/rev/1386732459ce
Use structs to normalize utility process sandbox code r=gerard-majax,bobowen
https://hg.mozilla.org/integration/autoland/rev/4be2d58ddf54
Run Windows' ILocation provider in a utility process r=emilio,cmartin
https://hg.mozilla.org/integration/autoland/rev/c0f0a280aaea
Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/1c27f9c36fda
Add utility process test helpers r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/afcc1e8b5ad7
Test Windows geolocation utility process restart on crash r=gerard-majax
Regressions: 1813345

Backed out for causing bc failures on browser_utility_geolocation_crashed.js.

Please also take a look at Bug 1813345 before landing this. The bug was closed as fixed because of this backout.

[task 2023-01-29T00:31:04.892Z] 00:31:04     INFO - TEST-START | ipc/glue/test/browser/browser_utility_geolocation_crashed.js
[task 2023-01-29T00:31:05.493Z] 00:31:05     INFO - GECKO(7340) | *** WIFI GEO: startup called.
[task 2023-01-29T00:31:05.979Z] 00:31:05     INFO - GECKO(7340) | JavaScript error: resource://gre/modules/XULStore.jsm, line 58: Error: Can't find profile directory.
[task 2023-01-29T00:31:07.490Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: onStatus called.wifi-timeout
[task 2023-01-29T00:31:07.496Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: Use request cache:false reason:No cached data
[task 2023-01-29T00:31:07.497Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: Sending request
[task 2023-01-29T00:31:07.497Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: onStatus called.xhr-start
[task 2023-01-29T00:31:07.508Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: geo provider reported: -122.08769:37.41857
[task 2023-01-29T00:31:07.517Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: shutdown called
[task 2023-01-29T00:31:07.528Z] 00:31:07     INFO - TEST-INFO | started process screenshot
[task 2023-01-29T00:31:07.876Z] 00:31:07     INFO - TEST-INFO | screenshot: exit 0
[task 2023-01-29T00:31:07.876Z] 00:31:07     INFO - Buffered messages logged at 00:31:04
[task 2023-01-29T00:31:07.877Z] 00:31:07     INFO - Entering setup bound 
[task 2023-01-29T00:31:07.878Z] 00:31:07     INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07     INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07     INFO - Leaving setup bound 
[task 2023-01-29T00:31:07.879Z] 00:31:07     INFO - Entering test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.879Z] 00:31:07     INFO - Start the Windows utility process
[task 2023-01-29T00:31:07.880Z] 00:31:07     INFO - Requesting geolocation
[task 2023-01-29T00:31:07.880Z] 00:31:07     INFO - Buffered messages logged at 00:31:06
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - Console message: [JavaScript Error: "Error: Can't find profile directory." {file: "resource://gre/modules/XULStore.jsm" line: 58}]
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - load@resource://gre/modules/XULStore.jsm:58:15
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - XULStore@resource://gre/modules/XULStore.jsm:17:10
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - 
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - Buffered messages logged at 00:31:07
[task 2023-01-29T00:31:07.882Z] 00:31:07     INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | geolocation succeeded - 
[task 2023-01-29T00:31:07.882Z] 00:31:07     INFO - Crash the utility process
[task 2023-01-29T00:31:07.884Z] 00:31:07     INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.885Z] 00:31:07     INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils utility process should be found - 
[task 2023-01-29T00:31:07.885Z] 00:31:07     INFO - prune any previous crashes
[task 2023-01-29T00:31:07.885Z] 00:31:07     INFO - Buffered messages finished
[task 2023-01-29T00:31:07.886Z] 00:31:07     INFO - TEST-UNEXPECTED-FAIL | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | Uncaught exception in test - at chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339 - TypeError: can't access property "pruneOldCrashes", crashMan is undefined
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - Stack trace:
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - crashSomeUtility@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339:3
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - crashSomeUtilityActor@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:415:10
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - Leaving test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.888Z] 00:31:07     INFO - Entering test bound testCleanup
[task 2023-01-29T00:31:07.888Z] 00:31:07     INFO - Clean up to avoid confusing future tests
[task 2023-01-29T00:31:07.889Z] 00:31:07     INFO - Kill Utility Process windowsUtils
[task 2023-01-29T00:31:07.891Z] 00:31:07     INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.891Z] 00:31:07     INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils process exists - 
Flags: needinfo?(davidp99)

Test needs to be gated on crashreporter. Bug 1813345 is a test-verify leak that is fairly reliable under that stress test. I'm looking at it now.

Flags: needinfo?(davidp99)
Status: REOPENED → ASSIGNED
Target Milestone: 103 Branch → ---
Pushed by daparks@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/de943230d39a
Add logging to UtilityProcessHost/Manager r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/6d02ad6e0aee
Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika
https://hg.mozilla.org/integration/autoland/rev/dd7e66e5ab25
Use structs to normalize utility process sandbox code r=gerard-majax,bobowen
https://hg.mozilla.org/integration/autoland/rev/77ef9f7cf3cb
Run Windows' ILocation provider in a utility process r=emilio,cmartin
https://hg.mozilla.org/integration/autoland/rev/037ee6c3373f
Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/2047cf0d7192
Add utility process test helpers r=gerard-majax
https://hg.mozilla.org/integration/autoland/rev/a2bec68b9a7a
Test Windows geolocation utility process restart on crash r=gerard-majax
Regressions: 1818465
Regressions: 1820535
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: