Crash in [@ CLegacyPositionRequest::GetLocationFromSession]
Categories
(Core :: Widget: Win32, defect, P2)
Tracking
()
People
(Reporter: aryx, Assigned: handyman)
References
(Blocks 1 open bug)
Details
(Keywords: crash, Whiteboard: [win:stability])
Crash Data
Attachments
(10 files, 5 obsolete files)
841.49 KB,
text/plain
|
Details | |
231.82 KB,
text/plain
|
Details | |
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review | |
48 bytes,
text/x-phabricator-request
|
Details | Review |
Not a new crash, 200-400 affected installations per release cycle, all on Windows 10
Most common affected sites are https://tinder.com/ and https://www.olg.ca/ - at least the latter doesn't prompt for one's location (maybe only if one has an account?)
Crash report: https://crash-stats.mozilla.org/report/index/ff288022-6ab4-4fbd-a491-7436f0210412
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 8 frames of crashing thread:
0 locationapi.dll void __thiscall CLegacyPositionRequest::GetLocationFromSession
1 locationapi.dll static void __stdcall CLegacyPositionRequest::WaitForLocationInformationThreadProc
2 ntdll.dll TppExecuteWaitCallback
3 ntdll.dll TppWaitCompletion
4 ntdll.dll TppWorkerThread
5 kernel32.dll BaseThreadInitThunk
6 ntdll.dll _RtlUserThreadStart
7 ntdll.dll _RtlUserThreadStart
Does the crash data contain URLs with which this issue is reproducible?
Since the crash is in Windows' libraries, the issue might need to be forwarded to Microsoft.
Comment 2•2 years ago
|
||
(In reply to Mirko Brodesser (:mbrodesser) from comment #1)
Does the crash data contain URLs with which this issue is reproducible?
Since the crash is in Windows' libraries, the issue might need to be forwarded to Microsoft.
Andrew, do you know which component is the best place for this issue?
Comment 3•2 years ago
|
||
I think this is the right component. I looked at a few crashes, and the main thread had mozilla::dom::WindowsLocationProvider::SetHighAccuracy() in the stack for all of them.
Some of these correlations certainly look suspicious:
(100.0% in signature vs 00.39% overall) Module "LocationFrameworkPS.dll" = true
(100.0% in signature vs 00.76% overall) Module "LocationApi.dll" = true
(100.0% in signature vs 01.18% overall) Module "deviceaccess.dll" = true
(100.0% in signature vs 01.31% overall) Module "SensorsUtilsV2.dll" = true
(100.0% in signature vs 01.31% overall) Module "SensorsNativeApi.V2.dll" = true
(100.0% in signature vs 01.50% overall) Module "wlanapi.dll" = true
(100.0% in signature vs 01.98% overall) Module "PortableDeviceTypes.dll" = true
(100.0% in signature vs 03.34% overall) Module "SensorsApi.dll" = true
(100.0% in signature vs 19.33% overall) Module "msvcp110_win.dll" = true [100.0% vs 36.45% if platform_pretty_version = Windows 10]
(100.0% in signature vs 23.60% overall) Module "OneCoreUAPCommonProxyStub.dll" = true [100.0% vs 48.39% if platform_version = 10.0.19043]
(100.0% in signature vs 22.92% overall) Module "textinputframework.dll" = true [100.0% vs 50.17% if platform_version = 10.0.19042]
(17.07% in signature vs 00.75% overall) CPU Info = family 6 model 158 stepping 13
Toshi, is this something you might be able to look at? The crash volume looks pretty low, but maybe there's something we could do to improve the situation without spending too much time on it. Thanks.
Comment 4•2 years ago
|
||
The crashing thread is a thread to dispatch a location callback registered here. It seems that the registered callback had been unregistered when the thread accessed the registered object.
LocationApi!CLegacyPositionRequest::GetLocationFromSession+0x63:
00007fff`1b59141b 488b8fa0000000 mov rcx,qword ptr [rdi+0A0h]
rdi = LocationApi!CLegacyPositionRequest
rdi+0A0h is set by RegisterForReport and reset by UnregisterForReport
00007fff`1b591422 488b01 mov rax,qword ptr [rcx] <<<< crash here because rcx was 0
00007fff`1b591425 488d5538 lea rdx,[rbp+38h]
00007fff`1b591429 488b4040 mov rax,qword ptr [rax+40h]
00007fff`1b59142d ff15853a0300 call qword ptr [LocationApi!_guard_dispatch_icall_fptr (00007fff`1b5c4eb8)]
This may happen if we unregister the callback through the disconnect timer here when the worker thread is running CLegacyPositionRequest::GetLocationFromSession
. I don't know whether there is a safe way to unregister the callback outside LocationApi.dll.
Currently we register the callback every time JS calls navigator.geolocation.getCurrentPosition
. If we register a single callback that caches the latest location for all listeners, we can reduce the probability of this crash (and also improve geolocation performance)?
Comment 5•2 years ago
|
||
Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.
Comment 6•2 years ago
|
||
(In reply to Gian-Carlo Pascutto [:gcp] from comment #5)
Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.
Added to list of triaged bugs. The component could probably have remained DOM:Geolocation since that's the component that's involved in this crash, but we'll track it under Win32 and get it assigned.
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Comment 7•2 years ago
|
||
Max found a crash report from 2022-05-28 (https://crash-stats.mozilla.org/search/?release_channel=nightly&signature=%3DCLegacyPositionRequest%3A%3AGetLocationFromSession&product=Firefox&version=102.0a1&date=%3E%3D2022-05-24T00%3A00%3A00.000Z&date=%3C2022-05-30T12%3A23%3A00.000Z&_facets=install_time&_facets=version&_facets=address&_facets=moz_crash_reason&_facets=reason&_facets=build_id&_facets=platform_pretty_version&_facets=signature&_facets=useragent_locale&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports) which means bug 1770126 may not actually have fixed it.
Comment 8•2 years ago
|
||
Comment 9•2 years ago
|
||
Comment 10•2 years ago
|
||
I have been dabbling with different tests, where I open and close multiple tabs and windows using geolocation services, to see if there is any issue with timings and such. So far I have been unsuccessful in reproducing the crash. If someone wants to have a look at logs from a short test session with 3 parallel tabs and then another 3 tabs opened a few seconds apart each, I have attached those here. If someone wants to dive into bigger logs from opening thousands of tabs and running geolocation tests for hours, let me know, and I'll upload the logs somewhere.
fyi: One thing I noticed in past tests was that Geolocation objects didn't get garbage collected until all tabs and windows were closed, even though Geolocation objects are per tab/window. However with the fix in bug 1770126 this does not happen anymore, suggesting that this has been fixed and is not related to this crash.
Comment 11•2 years ago
|
||
Did some joint debugging. All the crash reports are off the main thread, while we're calling mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool)
on the main thread, which can apparently invoke a response off main thread. So that at least explains why, despite all the WindowsLocationProvider calls are on the main thread/main process, we still can race. No real insight into the conditions to trigger this.
Looking at the logs also revealed that we call SetHighAccuracy before mLocation
is even initialized, which makes using high-accuracy location requests non-functional in Firefox, unless they're preceded by a normal getlocation() request. So that's at least one bug to fix here, which may or may not be related to the above.
Comment 12•2 years ago
•
|
||
The multiple threads may be a red herring: it looks like Windows invokes a worker thread to wait on the Windows Location Service to do its thing, and blocks our main thread waiting for a response. So even though this is multi-threaded, this should be transparent to us, which is consistent with this being main process and in COM Single Threaded Appartment mode (the crashes precede win32k lockdown, which is what enabled Multithreaded COM, as all documented in ProcessRuntime.cpp
).
Which would mean we're back to the hypothesis that it is a use after free. From investigating the code, the SetHighAccuracy
is not all that relevant either, as nsGeolocationRequest::Allow()
always calls it with the argument appropriately set.
I tried to intentionally break this by doing what comment 4 says the problem is: forcibly unregistering the event listener immediately after registering it, but still calling the SetDesiredAccuracy()
even though there's no listeners. Doesn't crash. If that doesn't crash, I'm starting to be unclear what the UAF or threading problem could be. There's no evidence any of this is called off of main thread/main process (see first paragraph), and the ILocationEvents
object is reference counted, so even destroying/unregistering everything shouldn't cause that.
The next step would be perhaps to load the exact same build as crashed into the debugger, and look at exactly what is supposed to sit on the offset that crashes (which is 160 bytes off the start of some pointer that Windows is looking for, and presumably would be an ILocationEvents object - unfortunately the minidumps don't have enough context to see this).
While digging through the internals of IUnknown and all that stuff, I did find that the QueryInterface method on LocationEvent has at least 2 bugs:
https://docs.microsoft.com/en-us/windows/win32/api/unknwn/nf-unknwn-iunknown-queryinterface(refiid_void)
https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/implementing-iunknown-in-c-plus-plus
https://devblogs.microsoft.com/oldnewthing/20040326-00/?p=40033
STDMETHODIMP
LocationEvent::QueryInterface(REFIID iid, void** ppv) {
if (iid == IID_IUnknown) {
*ppv = static_cast<IUnknown*>(this);
} else if (iid == IID_ILocationEvents) {
*ppv = static_cast<ILocationEvents*>(this);
} else {
return E_NOINTERFACE;
}
AddRef();
return S_OK;
}
So this neither returns E_POINTER
if ppv is null nor does it null out *ppv
if we return E_NOINTERFACE
. Interestingly, I could imagine that incomplete error checking in LocationAPI.dll could have an issue with these. That said, this seems pretty far fetched, as about half the QueryInterface implementations in our source tree seem to have one of these mistakes.
Still, seems worth a try. The next step would be to inspect that 0xA0
offset in LocationEvent or generally verify that the offending code in GetLocationFromSession
is trying to access parts of LocationEvent objects.
Comment 13•2 years ago
|
||
It looks like the thing that sits at offset 0xA0
when LocationEvent is casted to ILocationEvents is __vfptr = 0x00007ffd96b50970 {xul.dll!const mozilla::dom::LocationEvent::`vftable'} {0x00007ffd8a1cb470 {xul.dll!mozilla::dom::LocationEvent::QueryInterface(const _GUID &, void * *)},
This would make some sense wrt the crash? The Windows code is trying to invoke the callback but the virtual method table has already been wiped out.
Comment 14•2 years ago
•
|
||
David has some suspicion regarding the reference counting. Looking at the example code: https://docs.microsoft.com/en-us/windows/win32/api/locationapi/nf-locationapi-ilocation-registerforreport
pLocationEvents
there is definitely kept alive until the UnregisterForReport
is done. But it seems to me that it would have to be a bug for whoever uses ILocationEvents
to drop the reference and then use the object later, so barring concurrency this doesn't seem like it can be the problem (but it could very well work around it, or prevent us from hitting an issue that anyone using the reference code would never hit). An easy fix for our use would then to be to move (see below), though some fiddling might be needed if we ever use multiple callbacks - but I think we only ever have a single global event
to be a member mEvent
on mLocation
and be automatically Released()
when mLocation
is Released()
nsGeolocationService
that itself handles multiple callbacks, from looking at the code.
Looking at another crash report: https://crash-stats.mozilla.org/report/index/084fde43-9dc5-4732-83f5-7a3580220601#allthreads
Main Thread
1 user32.dll unsigned long RealMsgWaitForMultipleObjectsEx(unsigned long, void* const*, unsigned long, unsigned long, unsigned long) None
2 combase.dll CCliModalLoop::BlockFn(void**, unsigned long, unsigned long*) onecore\com\combase\dcomrem\callctrl.cxx:2156
...
13 xul.dll mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool) dom/system/windows/WindowsLocationProvider.cpp:270
Worker Thread
5 ntdll.dll RtlEnterCriticalSection None
6 LocationApi.dll ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) None
7 LocationApi.dll void CLegacyPositionRequest::GetLocationFromSession(bool) None
8 LocationApi.dll static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) None
9 ntdll.dll TppExecuteWaitCallback None
Another Worker Thread
5 ntdll.dll RtlEnterCriticalSection None
6 LocationApi.dll ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) None
7 LocationApi.dll void CLegacyPositionRequest::DispatchStatusChanged(enum LOCATION_REPORT_STATUS) None
8 LocationApi.dll long CLegacyPositionRequest::StartLocationSession None
9 LocationApi.dll long CLegacyPositionRequest::RestartLocationSession None
10 LocationApi.dll virtual long CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) None
11 LocationApi.dll virtual long CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) None
12 rpcrt4.dll Invoke None
Ø
Crashing Thread
0 locationapi.dll void CLegacyPositionRequest::GetLocationFromSession(bool) context
1 locationapi.dll static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) cfi
2 ntdll.dll TppExecuteWaitCallback cfi
Notice how there's 3 LocationAPI.dll calls running on different threads. They're blocked on a COM Lock (waiting for "LocationSession" whatever that is?), but this at least explains how it would be possible for a worker thread to race with the Unregister
call even if the latter blocks the main thread.
Unfortunately, if that's the problem, then the ILocationEvents
really needs to be a global object with the application lifetime, and itself have threadsafe support for adding the callbacks (and setting mProvider
...because if the above happens it could outlive that as well).
Comment 15•2 years ago
•
|
||
tl;dr
SetHighAccuracy
should set a member flag but not assumemLocation
already exists and that it can call stuff on itQueryInterface
onLocationEvent
should be bugfixed just becauseTheSeems unlikely this would fix anything based on current analysis.LocationEvent
passed intoRegisterForReport
needs to be a global that either ownsmProvider
too (yuck, at least there should be only one) or we need to carefully and thread-safely set/unset it as a member (also yuck).
Assignee | ||
Comment 16•2 years ago
|
||
I'm still hoping to figure out why this breaks but I've reproduced the exact crash in a test case here. It fails 100% of the time, in a second or two. This is the key:
// HRESULT hr = spLocation.CoCreateInstance(CLSID_Location); // works
HRESULT hr = spLocation.CoCreateInstance(CLSID_Location, nullptr, CLSCTX_INPROC_SERVER); // fails
For an as-yet-unknown reason, the Location API doesn't like the inproc_server context we are feeding it. Running the default CLSCTX_ALL
has run for hours without crashing.
Assignee | ||
Comment 17•2 years ago
|
||
CLSCTX_ALL is in line with MSDN sample code.
Comment 18•2 years ago
•
|
||
Current hypothesis from looking at the LocationAPI.dll disassembly is that GetLocationFromSession
relies on an atomic/spinlock to make sure that whatever is going to give it ILocationInformation
objects still exists, but that StartLocationSession
has a single ATL::AtlComPtrAssign
that is NOT under that lock, and which could wipe/rewrite that object at an in-opportune time.
It's hard to infer the locking from looking at the disassembly, but this matches what David found with memory breakpoints. In any case it seems calling Start/Stop on one thread while another tread is in GetLocationForSession
is bad news. I guess the CLSCTX_ALL avoids that?
Comment 19•2 years ago
•
|
||
For more fun, the suspect part of the code is very different between Windows 11 and Windows 10, due to different inlining of CRITICAL_SECTION
, I don't see the bug on my Windows 11's copy of the DLL (we do have reports on that OS), and David's testcase doesn't fail.
Comment 20•2 years ago
|
||
Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bd1520016295 Windows ILocation should be allowed in all COM execution contexts r=Jamie
Comment 21•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Comment 22•2 years ago
|
||
Let's deal with https://bugzilla.mozilla.org/show_bug.cgi?id=1704500#c15 in a follow-up.
Comment 23•2 years ago
|
||
The patch landed in nightly and beta is affected.
:mvollmer, is this bug important enough to require an uplift?
If not please set status_beta
to wontfix
.
For more information, please visit auto_nag documentation.
Comment 24•2 years ago
|
||
Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!
Beta/Release Uplift Approval Request
- User impact if declined: Potential low volume crash when using location.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Low risk: changing a single constant in the code.
- String changes made/needed:
- Is Android affected?: No
Updated•2 years ago
|
Comment 25•2 years ago
|
||
Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!
Approved for 102 beta 5, thanks.
Comment 26•2 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 27•2 years ago
|
||
So that was a bust. I can still fairly reliably reproduce this issue in my sample and I think I see the core issue now. It's complicated. I had wondered why there were always three background threads involved in the crash when it seemed like a two-thread mutex bug, but it turns out the third thread is crucial.
Fundamentally, the crash is that one thread is halfway through CLegacyPositionRequest::RestartLocationSession
, which calls CLegacyPositionRequest::StopLocationSession
, which clears a pointer, then it calls CLegacyPositionRequest::StartLocationSession
, which would reset the pointer but we don't get that far because a background thread from the CLegacyPositionRequest
's threadpool crashes trying to use the cleared pointer.
There would be deadlock issues with holding the object's mutex across the Stop
and Start
calls (also mutex re-entrancy issues). RestartLocationSession
tries to handle this case by running single-threaded -- one of the first things StopLocationSession
does is wait for its worker threadpool to empty its current queue (kind of -- it looks like this is part of the bug). It does this while it holds the mutex -- it also clears the relevant pointer. StartLocationSession
would then restart the threadpool callbacks once the values are restored. The first problem is that StopLocationSession
would call WaitForThreadpoolWaitCallbacks to wait for them to complete, but that call is conditioned on another flag whose value it doesn't like. This is where the third thread comes in. It is running CLegacyPositionRequest::DispatchPositionChanged
in a second thread from the threadpool. DispatchPositionChanged
grabs the mutex at first, sets the bad value, then releases the mutex, then unsets the bad value. It's not clear why it maintains this value. The value appears to be an atomic boolean indicating whether there is already a callback request running but it is racy. I can't see why WaitForThreadpoolWaitCallbacks
is gated on this but, in particular, the gate seems backwards -- it only waits for callbacks if there are no callbacks to wait for.
The second issue is also with WaitForThreadpoolWaitCallbacks
-- it passes FALSE for fCancelPendingCallbacks
. It's part of the following call sequence:
SetThreadpoolWait(x, 0, 0);
if ( !flag )
WaitForThreadpoolWaitCallbacks(x, FALSE);
CloseThreadpoolWait(x);
Again, I don't know what the point of the flag is but it's also the case that fixing only this won't keep it from running other callbacks between the end of the WaitForThreadpoolWaitCallbacks
call and the end of the CloseThreadpoolWait
call. That's apparently enough time for this race to be hit. If the code followed the pattern suggested by the documentation of these APIs:
SetThreadpoolWait(x, 0, 0);
WaitForThreadpoolWaitCallbacks(x, TRUE);
CloseThreadpoolWait(x);
then the race wouldn't be possible. I'm guessing it wasn't written the normal way because of some other offending code I haven't seen.
So here's my theory for the step-by-step process to failure:
- A worker thread runs the timer callback. That thread looks like this:
[...]
combase.dll!ObjectStubless() Line 176 Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::DispatchPositionChanged(struct ILatLongReportInternal *) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::GetLocationFromSession(bool) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE *,void *,struct _TP_WAIT *,long) Unknown Non-user code. Symbols loaded.
DispatchPositionChanged
has set our special value to 1 and exited the mutex:
00007FFF7E421179 mov eax,1
00007FFF7E42117E xchg eax,dword ptr [rbx+0C0h]
00007FFF7E421184 prefetchw [rbx+0C4h]
00007FFF7E42118B mov eax,dword ptr [rbx+0C4h]
00007FFF7E421191 mov ecx,eax
00007FFF7E421193 lock cmpxchg dword ptr [rbx+0C4h],ecx
00007FFF7E42119B jne CLegacyPositionRequest::DispatchPositionChanged+0C9h (07FFF7E421191h)
00007FFF7E42119D mov edi,eax
00007FFF7E42119F cmp byte ptr [rbp-10h],0
00007FFF7E4211A3 je CLegacyPositionRequest::DispatchPositionChanged+0EDh (07FFF7E4211B5h)
00007FFF7E4211A5 mov rcx,qword ptr [rbp-18h]
00007FFF7E4211A9 call qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]
- A second worker runs
SetDesiredAccuracy
, which begins ourRestartLocationSession
.
[...]
combase.dll!ObjectStubless() Line 176 Unknown Non-user code. Symbols loaded.
> LocationApi.dll!CLegacyPositionRequest::StopLocationSession(bool) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::RestartLocationSession(void) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY) Unknown Non-user code. Symbols loaded.
It claims the mutex in StopLocationSession
, which must happen after step 1. It then reads the bad special value 1 from step 1:
00007FFF7E42200B xor r8d,r8d
00007FFF7E42200E xor edx,edx
00007FFF7E422010 mov rcx,rdi
00007FFF7E422013 call qword ptr [__imp_SetThreadpoolWait (07FFF7E454BB8h)]
00007FFF7E42201A nop dword ptr [rax+rax]
00007FFF7E42201F test esi,esi
00007FFF7E422021 jne CLegacyPositionRequest::StopLocationSession+0C8h (07FFF7E422034h)
00007FFF7E422023 xor edx,edx
00007FFF7E422025 mov rcx,rdi
00007FFF7E422028 call qword ptr [__imp_WaitForThreadpoolWaitCallbacks (07FFF7E454BC8h)]
00007FFF7E42202F nop dword ptr [rax+rax]
00007FFF7E422034 mov rcx,rdi
00007FFF7E422037 call qword ptr [__imp_CloseThreadpoolWait (07FFF7E454BD0h)]
The value in %esi
is the special value 1, so we skip __imp_WaitForThreadpoolWaitCallbacks
. This means we have no guarantees about future spawning of callbacks, even after __imp_CloseThreadpoolWait.
Stop
then continues to grab the mutex, clear the pointer that will cause the crash (usingATL::AtlComPtrAssign
), and release the mutex:
00007FFF7E42204C call ATL::CComCritSecLock<ATL::CComAutoCriticalSection>::CComCritSecLock<ATL::CComAutoCriticalSection> (07FFF7E41FADCh)
00007FFF7E422051 lea rdi,[rbx+0A0h]
00007FFF7E422058 mov rdx,qword ptr [rdi]
00007FFF7E42205B test rdx,rdx
00007FFF7E42205E je CLegacyPositionRequest::StopLocationSession+0FEh (07FFF7E42206Ah)
00007FFF7E422060 lea rcx,[rsp+70h]
00007FFF7E422065 call ATL::AtlComPtrAssign (07FFF7E416294h)
00007FFF7E42206A cmp qword ptr [rdi],0
00007FFF7E42206E je CLegacyPositionRequest::StopLocationSession+10Eh (07FFF7E42207Ah)
00007FFF7E422070 xor edx,edx
00007FFF7E422072 mov rcx,rdi
00007FFF7E422075 call ATL::AtlComPtrAssign (07FFF7E416294h)
00007FFF7E42207A test bpl,bpl
00007FFF7E42207D jne CLegacyPositionRequest::StopLocationSession+11Fh (07FFF7E42208Bh)
00007FFF7E42207F lea rcx,[rbx+8Ch]
00007FFF7E422086 call ATL::CComGITPtr<ILocationEvents>::Revoke (07FFF7E42067Ch)
00007FFF7E42208B lea rcx,[rbx+0A8h]
00007FFF7E422092 call ATL::CHandle::Close (07FFF7E420D14h)
00007FFF7E422097 mov byte ptr [rbx+0B0h],0
00007FFF7E42209E cmp byte ptr [rsp+40h],0
00007FFF7E4220A3 je CLegacyPositionRequest::StopLocationSession+14Ah (07FFF7E4220B6h)
00007FFF7E4220A5 mov rcx,qword ptr [rsp+38h]
00007FFF7E4220AA call qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]
- At any point between then and where
Start
restores things, we can launch a new callback thread that will crash on the null pointer.
These values can be seen in the debugger. I have full core dumps.
I haven't been able to conclude anything about how this relates to the public APIs, like a relationship between report requests and wait jobs. There still may be something we can do to finesse this. We might also be able to get a fix from Microsoft. But barring that, this is a fine candidate for stuffing into a utility process, where it can crash and be restarted without us caring much.
Assignee | ||
Comment 28•2 years ago
|
||
We'll back this out (bug 1774073). We're going to try the utility process approach.
Updated•2 years ago
|
Assignee | ||
Comment 29•2 years ago
|
||
Working prototype for doing geolocation in a utility process.
Assignee | ||
Comment 30•2 years ago
|
||
AsyncBlockers are used in a few places, in particular including child
process shutdowns, and they always cause waiting (10 seconds at
process close), even if they are not in use. This skips the wait in
that case.
Assignee | ||
Comment 31•2 years ago
|
||
Adds a "utilproc" log to trace utility process launch and shutdown steps.
Depends on D155016
Assignee | ||
Comment 32•2 years ago
|
||
Adds a new type of utility process that is set up to handle Windows OS objects. We are adding this process type to run Windows geolocation APIs but more services are expected to be included in it. The ILocation APIs have a race condition that would otherwise crash the main process. The ILocation work is in a later patch in the series.
Depends on D155017
Assignee | ||
Comment 33•2 years ago
|
||
Adds UtilitySandboxProperties, which abstract the more universal sandbox properties into a data object so that the various types of utility process can simply list them. This also adds a somewhat weak sandbox for the new "WindowsUtils" utility process type.
Depends on D155018
Assignee | ||
Comment 34•2 years ago
|
||
Previously, the ILocation COM object ran in the main process. Due to a race condition, it would sometimes crash the process so we are moving it to a utility process, where it can crash safely. The old WindowsLocationProvider class has become a proxy that forwards requests to the child process via the PWindowsLocation parent actor. The PWindowsLocationChild executes the requests and sends the results back. Failures in ILocation (like geolocation being off in Windows settings) are sent back as well -- they will cause the WindowsLocationProvider to start the MLS geolocation fallback, as they did before.
Depends on D155019
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Comment 35•2 years ago
|
||
Setting n-i to not lose track of the patches in this bug.
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 36•1 year ago
|
||
Make sure that the geolocation utility process restarts for georequests that arrive after a crash. Note that this tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).
Also generalizes some utility process testing functions for use with all utility process sandbox types.
Depends on D155020
Updated•1 year ago
|
Assignee | ||
Comment 37•1 year ago
|
||
This eliminates a redundant enum that we had to keep synchronized with another.
Depends on D155020
Assignee | ||
Comment 38•1 year ago
|
||
This eliminates a redundant enum that we had to keep synchronized with another.
Depends on D155020
Assignee | ||
Comment 39•1 year ago
|
||
This eliminates a redundant enum that we had to keep synchronized with another.
Depends on D155020
Assignee | ||
Comment 40•1 year ago
|
||
Also makes the existing utility process test functions a bit more general.
Depends on D162943
Assignee | ||
Comment 41•1 year ago
|
||
Make sure that the geolocation utility process restarts for georequests that arrive after a crash. This tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).
Depends on D162944
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Updated•1 year ago
|
Comment 42•1 year ago
|
||
Marking this as blocking the out-of-process file picker, since (per discussion in meetings) some of the infrastructure in this patchset should be useful for the currently-planned implementation thereof.
Comment 43•1 year ago
|
||
Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5d62d1a33cf8 Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/e581bfdf0f8e Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/f85e8eed6fed Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/93e59d5488dc Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/1c84f8f0796a Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/817d104b3a3e Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/8f1b73a96bb0 Test Windows geolocation utility process restart on crash r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/da69da68588d 1704500, 1704500, 1704500, 1704500, 1704500, 1704500: apply code formatting via Lando
Comment 44•1 year ago
•
|
||
Backed out 8 changesets (Bug 1704500) for causing build bustages on UtilityProcessHost.cpp.
Backout link
Push with failures
Failure Log
Also Bd failure Log
Also Failure Log
Comment 45•1 year ago
|
||
Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0d1c0dcd7645 Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/341ec2465c7d Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/241be693628d Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/d4508f8ba7df Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/c2474d1c0046 Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/897b985f9298 Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/e48723d8d42e Test Windows geolocation utility process restart on crash r=gerard-majax
Comment 46•1 year ago
|
||
Backed out for causing build bustages
- Backout link
- Push with failures
- Failure Log
- Failure line: /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:5990:61: error: no member named 'WINDOWS_UTILS' in 'mozilla::ipc::SandboxingKind'
Assignee | ||
Comment 47•1 year ago
|
||
Comment 48•1 year ago
|
||
Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6bfb4d8de1e1 Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/a1d1a97a2ffa Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/90a8d51fc17c Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/30632246ee2b Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/313b8770dc51 Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/747d84ff17c7 Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/c3ae64389c26 Test Windows geolocation utility process restart on crash r=gerard-majax
Comment 49•1 year ago
|
||
Backed out for causing multiple failures in test_geolocation_reset_accuracy/browser_utility_geolocation_crashed
Backout link: https://hg.mozilla.org/integration/autoland/rev/f115e287797670e7358559d07f525d5bfd6f5eb3
Updated•1 year ago
|
Assignee | ||
Comment 50•1 year ago
|
||
Oof. These tests pass on Win10 locally, with or without location turned on, so this is some other test environment discrepancy. This will take some effort to track down.
Assignee | ||
Comment 51•1 year ago
|
||
Comment 52•1 year ago
|
||
Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d96e810e20ec Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/88e8015e1ec4 Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/1386732459ce Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/4be2d58ddf54 Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/c0f0a280aaea Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/1c27f9c36fda Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/afcc1e8b5ad7 Test Windows geolocation utility process restart on crash r=gerard-majax
Comment 53•1 year ago
•
|
||
Backed out for causing bc failures on browser_utility_geolocation_crashed.js.
Please also take a look at Bug 1813345 before landing this. The bug was closed as fixed because of this backout.
[task 2023-01-29T00:31:04.892Z] 00:31:04 INFO - TEST-START | ipc/glue/test/browser/browser_utility_geolocation_crashed.js
[task 2023-01-29T00:31:05.493Z] 00:31:05 INFO - GECKO(7340) | *** WIFI GEO: startup called.
[task 2023-01-29T00:31:05.979Z] 00:31:05 INFO - GECKO(7340) | JavaScript error: resource://gre/modules/XULStore.jsm, line 58: Error: Can't find profile directory.
[task 2023-01-29T00:31:07.490Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: onStatus called.wifi-timeout
[task 2023-01-29T00:31:07.496Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: Use request cache:false reason:No cached data
[task 2023-01-29T00:31:07.497Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: Sending request
[task 2023-01-29T00:31:07.497Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: onStatus called.xhr-start
[task 2023-01-29T00:31:07.508Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: geo provider reported: -122.08769:37.41857
[task 2023-01-29T00:31:07.517Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: shutdown called
[task 2023-01-29T00:31:07.528Z] 00:31:07 INFO - TEST-INFO | started process screenshot
[task 2023-01-29T00:31:07.876Z] 00:31:07 INFO - TEST-INFO | screenshot: exit 0
[task 2023-01-29T00:31:07.876Z] 00:31:07 INFO - Buffered messages logged at 00:31:04
[task 2023-01-29T00:31:07.877Z] 00:31:07 INFO - Entering setup bound
[task 2023-01-29T00:31:07.878Z] 00:31:07 INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07 INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07 INFO - Leaving setup bound
[task 2023-01-29T00:31:07.879Z] 00:31:07 INFO - Entering test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.879Z] 00:31:07 INFO - Start the Windows utility process
[task 2023-01-29T00:31:07.880Z] 00:31:07 INFO - Requesting geolocation
[task 2023-01-29T00:31:07.880Z] 00:31:07 INFO - Buffered messages logged at 00:31:06
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - Console message: [JavaScript Error: "Error: Can't find profile directory." {file: "resource://gre/modules/XULStore.jsm" line: 58}]
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - load@resource://gre/modules/XULStore.jsm:58:15
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - XULStore@resource://gre/modules/XULStore.jsm:17:10
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO -
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - Buffered messages logged at 00:31:07
[task 2023-01-29T00:31:07.882Z] 00:31:07 INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | geolocation succeeded -
[task 2023-01-29T00:31:07.882Z] 00:31:07 INFO - Crash the utility process
[task 2023-01-29T00:31:07.884Z] 00:31:07 INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.885Z] 00:31:07 INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils utility process should be found -
[task 2023-01-29T00:31:07.885Z] 00:31:07 INFO - prune any previous crashes
[task 2023-01-29T00:31:07.885Z] 00:31:07 INFO - Buffered messages finished
[task 2023-01-29T00:31:07.886Z] 00:31:07 INFO - TEST-UNEXPECTED-FAIL | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | Uncaught exception in test - at chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339 - TypeError: can't access property "pruneOldCrashes", crashMan is undefined
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - Stack trace:
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - crashSomeUtility@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339:3
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - crashSomeUtilityActor@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:415:10
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - Leaving test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.888Z] 00:31:07 INFO - Entering test bound testCleanup
[task 2023-01-29T00:31:07.888Z] 00:31:07 INFO - Clean up to avoid confusing future tests
[task 2023-01-29T00:31:07.889Z] 00:31:07 INFO - Kill Utility Process windowsUtils
[task 2023-01-29T00:31:07.891Z] 00:31:07 INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.891Z] 00:31:07 INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils process exists -
Assignee | ||
Comment 54•1 year ago
|
||
Test needs to be gated on crashreporter
. Bug 1813345 is a test-verify
leak that is fairly reliable under that stress test. I'm looking at it now.
Assignee | ||
Comment 55•1 year ago
|
||
Updated•1 year ago
|
Comment 56•1 year ago
|
||
Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/de943230d39a Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/6d02ad6e0aee Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/dd7e66e5ab25 Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/77ef9f7cf3cb Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/037ee6c3373f Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/2047cf0d7192 Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/a2bec68b9a7a Test Windows geolocation utility process restart on crash r=gerard-majax
Comment 57•1 year ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/de943230d39a
https://hg.mozilla.org/mozilla-central/rev/6d02ad6e0aee
https://hg.mozilla.org/mozilla-central/rev/dd7e66e5ab25
https://hg.mozilla.org/mozilla-central/rev/77ef9f7cf3cb
https://hg.mozilla.org/mozilla-central/rev/037ee6c3373f
https://hg.mozilla.org/mozilla-central/rev/2047cf0d7192
https://hg.mozilla.org/mozilla-central/rev/a2bec68b9a7a
Description
•