Crash in [@ CLegacyPositionRequest::GetLocationFromSession]
Categories
(Core :: Widget: Win32, defect, P2)
Tracking
()
People
(Reporter: aryx, Assigned: handyman)
References
(Blocks 1 open bug)
Details
(Keywords: crash, Whiteboard: [win:stability])
Crash Data
Attachments
(10 files, 5 obsolete files)
|
841.49 KB,
text/plain
|
Details | |
|
231.82 KB,
text/plain
|
Details | |
|
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review | |
|
48 bytes,
text/x-phabricator-request
|
Details | Review |
Not a new crash, 200-400 affected installations per release cycle, all on Windows 10
Most common affected sites are https://tinder.com/ and https://www.olg.ca/ - at least the latter doesn't prompt for one's location (maybe only if one has an account?)
Crash report: https://crash-stats.mozilla.org/report/index/ff288022-6ab4-4fbd-a491-7436f0210412
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 8 frames of crashing thread:
0 locationapi.dll void __thiscall CLegacyPositionRequest::GetLocationFromSession
1 locationapi.dll static void __stdcall CLegacyPositionRequest::WaitForLocationInformationThreadProc
2 ntdll.dll TppExecuteWaitCallback
3 ntdll.dll TppWaitCompletion
4 ntdll.dll TppWorkerThread
5 kernel32.dll BaseThreadInitThunk
6 ntdll.dll _RtlUserThreadStart
7 ntdll.dll _RtlUserThreadStart
Does the crash data contain URLs with which this issue is reproducible?
Since the crash is in Windows' libraries, the issue might need to be forwarded to Microsoft.
Comment 2•4 years ago
|
||
(In reply to Mirko Brodesser (:mbrodesser) from comment #1)
Does the crash data contain URLs with which this issue is reproducible?
Since the crash is in Windows' libraries, the issue might need to be forwarded to Microsoft.
Andrew, do you know which component is the best place for this issue?
Comment 3•4 years ago
|
||
I think this is the right component. I looked at a few crashes, and the main thread had mozilla::dom::WindowsLocationProvider::SetHighAccuracy() in the stack for all of them.
Some of these correlations certainly look suspicious:
(100.0% in signature vs 00.39% overall) Module "LocationFrameworkPS.dll" = true
(100.0% in signature vs 00.76% overall) Module "LocationApi.dll" = true
(100.0% in signature vs 01.18% overall) Module "deviceaccess.dll" = true
(100.0% in signature vs 01.31% overall) Module "SensorsUtilsV2.dll" = true
(100.0% in signature vs 01.31% overall) Module "SensorsNativeApi.V2.dll" = true
(100.0% in signature vs 01.50% overall) Module "wlanapi.dll" = true
(100.0% in signature vs 01.98% overall) Module "PortableDeviceTypes.dll" = true
(100.0% in signature vs 03.34% overall) Module "SensorsApi.dll" = true
(100.0% in signature vs 19.33% overall) Module "msvcp110_win.dll" = true [100.0% vs 36.45% if platform_pretty_version = Windows 10]
(100.0% in signature vs 23.60% overall) Module "OneCoreUAPCommonProxyStub.dll" = true [100.0% vs 48.39% if platform_version = 10.0.19043]
(100.0% in signature vs 22.92% overall) Module "textinputframework.dll" = true [100.0% vs 50.17% if platform_version = 10.0.19042]
(17.07% in signature vs 00.75% overall) CPU Info = family 6 model 158 stepping 13
Toshi, is this something you might be able to look at? The crash volume looks pretty low, but maybe there's something we could do to improve the situation without spending too much time on it. Thanks.
Comment 4•4 years ago
|
||
The crashing thread is a thread to dispatch a location callback registered here. It seems that the registered callback had been unregistered when the thread accessed the registered object.
LocationApi!CLegacyPositionRequest::GetLocationFromSession+0x63:
00007fff`1b59141b 488b8fa0000000 mov rcx,qword ptr [rdi+0A0h]
rdi = LocationApi!CLegacyPositionRequest
rdi+0A0h is set by RegisterForReport and reset by UnregisterForReport
00007fff`1b591422 488b01 mov rax,qword ptr [rcx] <<<< crash here because rcx was 0
00007fff`1b591425 488d5538 lea rdx,[rbp+38h]
00007fff`1b591429 488b4040 mov rax,qword ptr [rax+40h]
00007fff`1b59142d ff15853a0300 call qword ptr [LocationApi!_guard_dispatch_icall_fptr (00007fff`1b5c4eb8)]
This may happen if we unregister the callback through the disconnect timer here when the worker thread is running CLegacyPositionRequest::GetLocationFromSession. I don't know whether there is a safe way to unregister the callback outside LocationApi.dll.
Currently we register the callback every time JS calls navigator.geolocation.getCurrentPosition. If we register a single callback that caches the latest location for all listeners, we can reduce the probability of this crash (and also improve geolocation performance)?
Comment 5•4 years ago
|
||
Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.
Comment 6•4 years ago
|
||
(In reply to Gian-Carlo Pascutto [:gcp] from comment #5)
Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.
Added to list of triaged bugs. The component could probably have remained DOM:Geolocation since that's the component that's involved in this crash, but we'll track it under Win32 and get it assigned.
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 7•4 years ago
|
||
Max found a crash report from 2022-05-28 (https://crash-stats.mozilla.org/search/?release_channel=nightly&signature=%3DCLegacyPositionRequest%3A%3AGetLocationFromSession&product=Firefox&version=102.0a1&date=%3E%3D2022-05-24T00%3A00%3A00.000Z&date=%3C2022-05-30T12%3A23%3A00.000Z&_facets=install_time&_facets=version&_facets=address&_facets=moz_crash_reason&_facets=reason&_facets=build_id&_facets=platform_pretty_version&_facets=signature&_facets=useragent_locale&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports) which means bug 1770126 may not actually have fixed it.
Comment 8•4 years ago
|
||
Comment 9•4 years ago
|
||
Comment 10•4 years ago
|
||
I have been dabbling with different tests, where I open and close multiple tabs and windows using geolocation services, to see if there is any issue with timings and such. So far I have been unsuccessful in reproducing the crash. If someone wants to have a look at logs from a short test session with 3 parallel tabs and then another 3 tabs opened a few seconds apart each, I have attached those here. If someone wants to dive into bigger logs from opening thousands of tabs and running geolocation tests for hours, let me know, and I'll upload the logs somewhere.
fyi: One thing I noticed in past tests was that Geolocation objects didn't get garbage collected until all tabs and windows were closed, even though Geolocation objects are per tab/window. However with the fix in bug 1770126 this does not happen anymore, suggesting that this has been fixed and is not related to this crash.
Comment 11•4 years ago
|
||
Did some joint debugging. All the crash reports are off the main thread, while we're calling mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool) on the main thread, which can apparently invoke a response off main thread. So that at least explains why, despite all the WindowsLocationProvider calls are on the main thread/main process, we still can race. No real insight into the conditions to trigger this.
Looking at the logs also revealed that we call SetHighAccuracy before mLocation is even initialized, which makes using high-accuracy location requests non-functional in Firefox, unless they're preceded by a normal getlocation() request. So that's at least one bug to fix here, which may or may not be related to the above.
Comment 12•4 years ago
•
|
||
The multiple threads may be a red herring: it looks like Windows invokes a worker thread to wait on the Windows Location Service to do its thing, and blocks our main thread waiting for a response. So even though this is multi-threaded, this should be transparent to us, which is consistent with this being main process and in COM Single Threaded Appartment mode (the crashes precede win32k lockdown, which is what enabled Multithreaded COM, as all documented in ProcessRuntime.cpp).
Which would mean we're back to the hypothesis that it is a use after free. From investigating the code, the SetHighAccuracy is not all that relevant either, as nsGeolocationRequest::Allow() always calls it with the argument appropriately set.
I tried to intentionally break this by doing what comment 4 says the problem is: forcibly unregistering the event listener immediately after registering it, but still calling the SetDesiredAccuracy() even though there's no listeners. Doesn't crash. If that doesn't crash, I'm starting to be unclear what the UAF or threading problem could be. There's no evidence any of this is called off of main thread/main process (see first paragraph), and the ILocationEvents object is reference counted, so even destroying/unregistering everything shouldn't cause that.
The next step would be perhaps to load the exact same build as crashed into the debugger, and look at exactly what is supposed to sit on the offset that crashes (which is 160 bytes off the start of some pointer that Windows is looking for, and presumably would be an ILocationEvents object - unfortunately the minidumps don't have enough context to see this).
While digging through the internals of IUnknown and all that stuff, I did find that the QueryInterface method on LocationEvent has at least 2 bugs:
https://docs.microsoft.com/en-us/windows/win32/api/unknwn/nf-unknwn-iunknown-queryinterface(refiid_void)
https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/implementing-iunknown-in-c-plus-plus
https://devblogs.microsoft.com/oldnewthing/20040326-00/?p=40033
STDMETHODIMP
LocationEvent::QueryInterface(REFIID iid, void** ppv) {
if (iid == IID_IUnknown) {
*ppv = static_cast<IUnknown*>(this);
} else if (iid == IID_ILocationEvents) {
*ppv = static_cast<ILocationEvents*>(this);
} else {
return E_NOINTERFACE;
}
AddRef();
return S_OK;
}
So this neither returns E_POINTER if ppv is null nor does it null out *ppv if we return E_NOINTERFACE. Interestingly, I could imagine that incomplete error checking in LocationAPI.dll could have an issue with these. That said, this seems pretty far fetched, as about half the QueryInterface implementations in our source tree seem to have one of these mistakes.
Still, seems worth a try. The next step would be to inspect that 0xA0 offset in LocationEvent or generally verify that the offending code in GetLocationFromSession is trying to access parts of LocationEvent objects.
Comment 13•4 years ago
|
||
It looks like the thing that sits at offset 0xA0 when LocationEvent is casted to ILocationEvents is __vfptr = 0x00007ffd96b50970 {xul.dll!const mozilla::dom::LocationEvent::`vftable'} {0x00007ffd8a1cb470 {xul.dll!mozilla::dom::LocationEvent::QueryInterface(const _GUID &, void * *)},
This would make some sense wrt the crash? The Windows code is trying to invoke the callback but the virtual method table has already been wiped out.
Comment 14•4 years ago
•
|
||
David has some suspicion regarding the reference counting. Looking at the example code: https://docs.microsoft.com/en-us/windows/win32/api/locationapi/nf-locationapi-ilocation-registerforreport
pLocationEvents there is definitely kept alive until the UnregisterForReport is done. But it seems to me that it would have to be a bug for whoever uses ILocationEvents to drop the reference and then use the object later, so barring concurrency this doesn't seem like it can be the problem (but it could very well work around it, or prevent us from hitting an issue that anyone using the reference code would never hit). An easy fix for our use would then to be to move (see below), though some fiddling might be needed if we ever use multiple callbacks - but I think we only ever have a single global event to be a member mEvent on mLocation and be automatically Released() when mLocation is Released()nsGeolocationService that itself handles multiple callbacks, from looking at the code.
Looking at another crash report: https://crash-stats.mozilla.org/report/index/084fde43-9dc5-4732-83f5-7a3580220601#allthreads
Main Thread
1 user32.dll unsigned long RealMsgWaitForMultipleObjectsEx(unsigned long, void* const*, unsigned long, unsigned long, unsigned long) None
2 combase.dll CCliModalLoop::BlockFn(void**, unsigned long, unsigned long*) onecore\com\combase\dcomrem\callctrl.cxx:2156
...
13 xul.dll mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool) dom/system/windows/WindowsLocationProvider.cpp:270
Worker Thread
5 ntdll.dll RtlEnterCriticalSection None
6 LocationApi.dll ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) None
7 LocationApi.dll void CLegacyPositionRequest::GetLocationFromSession(bool) None
8 LocationApi.dll static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) None
9 ntdll.dll TppExecuteWaitCallback None
Another Worker Thread
5 ntdll.dll RtlEnterCriticalSection None
6 LocationApi.dll ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) None
7 LocationApi.dll void CLegacyPositionRequest::DispatchStatusChanged(enum LOCATION_REPORT_STATUS) None
8 LocationApi.dll long CLegacyPositionRequest::StartLocationSession None
9 LocationApi.dll long CLegacyPositionRequest::RestartLocationSession None
10 LocationApi.dll virtual long CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) None
11 LocationApi.dll virtual long CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) None
12 rpcrt4.dll Invoke None
Ø
Crashing Thread
0 locationapi.dll void CLegacyPositionRequest::GetLocationFromSession(bool) context
1 locationapi.dll static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) cfi
2 ntdll.dll TppExecuteWaitCallback cfi
Notice how there's 3 LocationAPI.dll calls running on different threads. They're blocked on a COM Lock (waiting for "LocationSession" whatever that is?), but this at least explains how it would be possible for a worker thread to race with the Unregister call even if the latter blocks the main thread.
Unfortunately, if that's the problem, then the ILocationEvents really needs to be a global object with the application lifetime, and itself have threadsafe support for adding the callbacks (and setting mProvider...because if the above happens it could outlive that as well).
Comment 15•4 years ago
•
|
||
tl;dr
SetHighAccuracyshould set a member flag but not assumemLocationalready exists and that it can call stuff on itQueryInterfaceonLocationEventshould be bugfixed just becauseTheSeems unlikely this would fix anything based on current analysis.LocationEventpassed intoRegisterForReportneeds to be a global that either ownsmProvidertoo (yuck, at least there should be only one) or we need to carefully and thread-safely set/unset it as a member (also yuck).
| Assignee | ||
Comment 16•4 years ago
|
||
I'm still hoping to figure out why this breaks but I've reproduced the exact crash in a test case here. It fails 100% of the time, in a second or two. This is the key:
// HRESULT hr = spLocation.CoCreateInstance(CLSID_Location); // works
HRESULT hr = spLocation.CoCreateInstance(CLSID_Location, nullptr, CLSCTX_INPROC_SERVER); // fails
For an as-yet-unknown reason, the Location API doesn't like the inproc_server context we are feeding it. Running the default CLSCTX_ALL has run for hours without crashing.
| Assignee | ||
Comment 17•4 years ago
|
||
CLSCTX_ALL is in line with MSDN sample code.
Comment 18•4 years ago
•
|
||
Current hypothesis from looking at the LocationAPI.dll disassembly is that GetLocationFromSession relies on an atomic/spinlock to make sure that whatever is going to give it ILocationInformation objects still exists, but that StartLocationSession has a single ATL::AtlComPtrAssign that is NOT under that lock, and which could wipe/rewrite that object at an in-opportune time.
It's hard to infer the locking from looking at the disassembly, but this matches what David found with memory breakpoints. In any case it seems calling Start/Stop on one thread while another tread is in GetLocationForSession is bad news. I guess the CLSCTX_ALL avoids that?
Comment 19•4 years ago
•
|
||
For more fun, the suspect part of the code is very different between Windows 11 and Windows 10, due to different inlining of CRITICAL_SECTION, I don't see the bug on my Windows 11's copy of the DLL (we do have reports on that OS), and David's testcase doesn't fail.
Comment 20•4 years ago
|
||
Comment 21•4 years ago
|
||
| bugherder | ||
Updated•4 years ago
|
Comment 22•4 years ago
|
||
Let's deal with https://bugzilla.mozilla.org/show_bug.cgi?id=1704500#c15 in a follow-up.
Comment 23•4 years ago
|
||
The patch landed in nightly and beta is affected.
:mvollmer, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.
For more information, please visit auto_nag documentation.
Comment 24•4 years ago
|
||
Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!
Beta/Release Uplift Approval Request
- User impact if declined: Potential low volume crash when using location.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Low risk: changing a single constant in the code.
- String changes made/needed:
- Is Android affected?: No
Updated•4 years ago
|
Comment 25•4 years ago
|
||
Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!
Approved for 102 beta 5, thanks.
Comment 26•4 years ago
|
||
| bugherder uplift | ||
| Assignee | ||
Comment 27•4 years ago
|
||
So that was a bust. I can still fairly reliably reproduce this issue in my sample and I think I see the core issue now. It's complicated. I had wondered why there were always three background threads involved in the crash when it seemed like a two-thread mutex bug, but it turns out the third thread is crucial.
Fundamentally, the crash is that one thread is halfway through CLegacyPositionRequest::RestartLocationSession, which calls CLegacyPositionRequest::StopLocationSession, which clears a pointer, then it calls CLegacyPositionRequest::StartLocationSession, which would reset the pointer but we don't get that far because a background thread from the CLegacyPositionRequest's threadpool crashes trying to use the cleared pointer.
There would be deadlock issues with holding the object's mutex across the Stop and Start calls (also mutex re-entrancy issues). RestartLocationSession tries to handle this case by running single-threaded -- one of the first things StopLocationSession does is wait for its worker threadpool to empty its current queue (kind of -- it looks like this is part of the bug). It does this while it holds the mutex -- it also clears the relevant pointer. StartLocationSession would then restart the threadpool callbacks once the values are restored. The first problem is that StopLocationSession would call WaitForThreadpoolWaitCallbacks to wait for them to complete, but that call is conditioned on another flag whose value it doesn't like. This is where the third thread comes in. It is running CLegacyPositionRequest::DispatchPositionChanged in a second thread from the threadpool. DispatchPositionChanged grabs the mutex at first, sets the bad value, then releases the mutex, then unsets the bad value. It's not clear why it maintains this value. The value appears to be an atomic boolean indicating whether there is already a callback request running but it is racy. I can't see why WaitForThreadpoolWaitCallbacks is gated on this but, in particular, the gate seems backwards -- it only waits for callbacks if there are no callbacks to wait for.
The second issue is also with WaitForThreadpoolWaitCallbacks -- it passes FALSE for fCancelPendingCallbacks. It's part of the following call sequence:
SetThreadpoolWait(x, 0, 0);
if ( !flag )
WaitForThreadpoolWaitCallbacks(x, FALSE);
CloseThreadpoolWait(x);
Again, I don't know what the point of the flag is but it's also the case that fixing only this won't keep it from running other callbacks between the end of the WaitForThreadpoolWaitCallbacks call and the end of the CloseThreadpoolWait call. That's apparently enough time for this race to be hit. If the code followed the pattern suggested by the documentation of these APIs:
SetThreadpoolWait(x, 0, 0);
WaitForThreadpoolWaitCallbacks(x, TRUE);
CloseThreadpoolWait(x);
then the race wouldn't be possible. I'm guessing it wasn't written the normal way because of some other offending code I haven't seen.
So here's my theory for the step-by-step process to failure:
- A worker thread runs the timer callback. That thread looks like this:
[...]
combase.dll!ObjectStubless() Line 176 Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::DispatchPositionChanged(struct ILatLongReportInternal *) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::GetLocationFromSession(bool) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE *,void *,struct _TP_WAIT *,long) Unknown Non-user code. Symbols loaded.
DispatchPositionChanged has set our special value to 1 and exited the mutex:
00007FFF7E421179 mov eax,1
00007FFF7E42117E xchg eax,dword ptr [rbx+0C0h]
00007FFF7E421184 prefetchw [rbx+0C4h]
00007FFF7E42118B mov eax,dword ptr [rbx+0C4h]
00007FFF7E421191 mov ecx,eax
00007FFF7E421193 lock cmpxchg dword ptr [rbx+0C4h],ecx
00007FFF7E42119B jne CLegacyPositionRequest::DispatchPositionChanged+0C9h (07FFF7E421191h)
00007FFF7E42119D mov edi,eax
00007FFF7E42119F cmp byte ptr [rbp-10h],0
00007FFF7E4211A3 je CLegacyPositionRequest::DispatchPositionChanged+0EDh (07FFF7E4211B5h)
00007FFF7E4211A5 mov rcx,qword ptr [rbp-18h]
00007FFF7E4211A9 call qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]
- A second worker runs
SetDesiredAccuracy, which begins ourRestartLocationSession.
[...]
combase.dll!ObjectStubless() Line 176 Unknown Non-user code. Symbols loaded.
> LocationApi.dll!CLegacyPositionRequest::StopLocationSession(bool) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::RestartLocationSession(void) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY) Unknown Non-user code. Symbols loaded.
LocationApi.dll!CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY) Unknown Non-user code. Symbols loaded.
It claims the mutex in StopLocationSession, which must happen after step 1. It then reads the bad special value 1 from step 1:
00007FFF7E42200B xor r8d,r8d
00007FFF7E42200E xor edx,edx
00007FFF7E422010 mov rcx,rdi
00007FFF7E422013 call qword ptr [__imp_SetThreadpoolWait (07FFF7E454BB8h)]
00007FFF7E42201A nop dword ptr [rax+rax]
00007FFF7E42201F test esi,esi
00007FFF7E422021 jne CLegacyPositionRequest::StopLocationSession+0C8h (07FFF7E422034h)
00007FFF7E422023 xor edx,edx
00007FFF7E422025 mov rcx,rdi
00007FFF7E422028 call qword ptr [__imp_WaitForThreadpoolWaitCallbacks (07FFF7E454BC8h)]
00007FFF7E42202F nop dword ptr [rax+rax]
00007FFF7E422034 mov rcx,rdi
00007FFF7E422037 call qword ptr [__imp_CloseThreadpoolWait (07FFF7E454BD0h)]
The value in %esi is the special value 1, so we skip __imp_WaitForThreadpoolWaitCallbacks. This means we have no guarantees about future spawning of callbacks, even after __imp_CloseThreadpoolWait.
Stopthen continues to grab the mutex, clear the pointer that will cause the crash (usingATL::AtlComPtrAssign), and release the mutex:
00007FFF7E42204C call ATL::CComCritSecLock<ATL::CComAutoCriticalSection>::CComCritSecLock<ATL::CComAutoCriticalSection> (07FFF7E41FADCh)
00007FFF7E422051 lea rdi,[rbx+0A0h]
00007FFF7E422058 mov rdx,qword ptr [rdi]
00007FFF7E42205B test rdx,rdx
00007FFF7E42205E je CLegacyPositionRequest::StopLocationSession+0FEh (07FFF7E42206Ah)
00007FFF7E422060 lea rcx,[rsp+70h]
00007FFF7E422065 call ATL::AtlComPtrAssign (07FFF7E416294h)
00007FFF7E42206A cmp qword ptr [rdi],0
00007FFF7E42206E je CLegacyPositionRequest::StopLocationSession+10Eh (07FFF7E42207Ah)
00007FFF7E422070 xor edx,edx
00007FFF7E422072 mov rcx,rdi
00007FFF7E422075 call ATL::AtlComPtrAssign (07FFF7E416294h)
00007FFF7E42207A test bpl,bpl
00007FFF7E42207D jne CLegacyPositionRequest::StopLocationSession+11Fh (07FFF7E42208Bh)
00007FFF7E42207F lea rcx,[rbx+8Ch]
00007FFF7E422086 call ATL::CComGITPtr<ILocationEvents>::Revoke (07FFF7E42067Ch)
00007FFF7E42208B lea rcx,[rbx+0A8h]
00007FFF7E422092 call ATL::CHandle::Close (07FFF7E420D14h)
00007FFF7E422097 mov byte ptr [rbx+0B0h],0
00007FFF7E42209E cmp byte ptr [rsp+40h],0
00007FFF7E4220A3 je CLegacyPositionRequest::StopLocationSession+14Ah (07FFF7E4220B6h)
00007FFF7E4220A5 mov rcx,qword ptr [rsp+38h]
00007FFF7E4220AA call qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]
- At any point between then and where
Startrestores things, we can launch a new callback thread that will crash on the null pointer.
These values can be seen in the debugger. I have full core dumps.
I haven't been able to conclude anything about how this relates to the public APIs, like a relationship between report requests and wait jobs. There still may be something we can do to finesse this. We might also be able to get a fix from Microsoft. But barring that, this is a fine candidate for stuffing into a utility process, where it can crash and be restarted without us caring much.
| Assignee | ||
Comment 28•4 years ago
|
||
We'll back this out (bug 1774073). We're going to try the utility process approach.
Updated•4 years ago
|
| Assignee | ||
Comment 29•4 years ago
|
||
Working prototype for doing geolocation in a utility process.
| Assignee | ||
Comment 30•3 years ago
|
||
AsyncBlockers are used in a few places, in particular including child
process shutdowns, and they always cause waiting (10 seconds at
process close), even if they are not in use. This skips the wait in
that case.
| Assignee | ||
Comment 31•3 years ago
|
||
Adds a "utilproc" log to trace utility process launch and shutdown steps.
Depends on D155016
| Assignee | ||
Comment 32•3 years ago
|
||
Adds a new type of utility process that is set up to handle Windows OS objects. We are adding this process type to run Windows geolocation APIs but more services are expected to be included in it. The ILocation APIs have a race condition that would otherwise crash the main process. The ILocation work is in a later patch in the series.
Depends on D155017
| Assignee | ||
Comment 33•3 years ago
|
||
Adds UtilitySandboxProperties, which abstract the more universal sandbox properties into a data object so that the various types of utility process can simply list them. This also adds a somewhat weak sandbox for the new "WindowsUtils" utility process type.
Depends on D155018
| Assignee | ||
Comment 34•3 years ago
|
||
Previously, the ILocation COM object ran in the main process. Due to a race condition, it would sometimes crash the process so we are moving it to a utility process, where it can crash safely. The old WindowsLocationProvider class has become a proxy that forwards requests to the child process via the PWindowsLocation parent actor. The PWindowsLocationChild executes the requests and sends the results back. Failures in ILocation (like geolocation being off in Windows settings) are sent back as well -- they will cause the WindowsLocationProvider to start the MLS geolocation fallback, as they did before.
Depends on D155019
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 35•3 years ago
|
||
Setting n-i to not lose track of the patches in this bug.
Updated•3 years ago
|
Updated•3 years ago
|
| Assignee | ||
Comment 36•3 years ago
|
||
Make sure that the geolocation utility process restarts for georequests that arrive after a crash. Note that this tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).
Also generalizes some utility process testing functions for use with all utility process sandbox types.
Depends on D155020
Updated•3 years ago
|
| Assignee | ||
Comment 37•3 years ago
|
||
This eliminates a redundant enum that we had to keep synchronized with another.
Depends on D155020
| Assignee | ||
Comment 38•3 years ago
|
||
This eliminates a redundant enum that we had to keep synchronized with another.
Depends on D155020
| Assignee | ||
Comment 39•3 years ago
|
||
This eliminates a redundant enum that we had to keep synchronized with another.
Depends on D155020
| Assignee | ||
Comment 40•3 years ago
|
||
Also makes the existing utility process test functions a bit more general.
Depends on D162943
| Assignee | ||
Comment 41•3 years ago
|
||
Make sure that the geolocation utility process restarts for georequests that arrive after a crash. This tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).
Depends on D162944
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
| Assignee | ||
Updated•3 years ago
|
Comment 42•3 years ago
|
||
Marking this as blocking the out-of-process file picker, since (per discussion in meetings) some of the infrastructure in this patchset should be useful for the currently-planned implementation thereof.
Comment 43•3 years ago
|
||
Comment 44•3 years ago
•
|
||
Backed out 8 changesets (Bug 1704500) for causing build bustages on UtilityProcessHost.cpp.
Backout link
Push with failures
Failure Log
Also Bd failure Log
Also Failure Log
Comment 45•3 years ago
|
||
Comment 46•3 years ago
|
||
Backed out for causing build bustages
- Backout link
- Push with failures
- Failure Log
- Failure line: /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:5990:61: error: no member named 'WINDOWS_UTILS' in 'mozilla::ipc::SandboxingKind'
| Assignee | ||
Comment 47•3 years ago
|
||
Comment 48•3 years ago
|
||
Comment 49•3 years ago
|
||
Backed out for causing multiple failures in test_geolocation_reset_accuracy/browser_utility_geolocation_crashed
Backout link: https://hg.mozilla.org/integration/autoland/rev/f115e287797670e7358559d07f525d5bfd6f5eb3
Updated•3 years ago
|
| Assignee | ||
Comment 50•3 years ago
|
||
Oof. These tests pass on Win10 locally, with or without location turned on, so this is some other test environment discrepancy. This will take some effort to track down.
| Assignee | ||
Comment 51•3 years ago
|
||
Comment 52•3 years ago
|
||
Comment 53•3 years ago
•
|
||
Backed out for causing bc failures on browser_utility_geolocation_crashed.js.
Please also take a look at Bug 1813345 before landing this. The bug was closed as fixed because of this backout.
[task 2023-01-29T00:31:04.892Z] 00:31:04 INFO - TEST-START | ipc/glue/test/browser/browser_utility_geolocation_crashed.js
[task 2023-01-29T00:31:05.493Z] 00:31:05 INFO - GECKO(7340) | *** WIFI GEO: startup called.
[task 2023-01-29T00:31:05.979Z] 00:31:05 INFO - GECKO(7340) | JavaScript error: resource://gre/modules/XULStore.jsm, line 58: Error: Can't find profile directory.
[task 2023-01-29T00:31:07.490Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: onStatus called.wifi-timeout
[task 2023-01-29T00:31:07.496Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: Use request cache:false reason:No cached data
[task 2023-01-29T00:31:07.497Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: Sending request
[task 2023-01-29T00:31:07.497Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: onStatus called.xhr-start
[task 2023-01-29T00:31:07.508Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: geo provider reported: -122.08769:37.41857
[task 2023-01-29T00:31:07.517Z] 00:31:07 INFO - GECKO(7340) | *** WIFI GEO: shutdown called
[task 2023-01-29T00:31:07.528Z] 00:31:07 INFO - TEST-INFO | started process screenshot
[task 2023-01-29T00:31:07.876Z] 00:31:07 INFO - TEST-INFO | screenshot: exit 0
[task 2023-01-29T00:31:07.876Z] 00:31:07 INFO - Buffered messages logged at 00:31:04
[task 2023-01-29T00:31:07.877Z] 00:31:07 INFO - Entering setup bound
[task 2023-01-29T00:31:07.878Z] 00:31:07 INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07 INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07 INFO - Leaving setup bound
[task 2023-01-29T00:31:07.879Z] 00:31:07 INFO - Entering test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.879Z] 00:31:07 INFO - Start the Windows utility process
[task 2023-01-29T00:31:07.880Z] 00:31:07 INFO - Requesting geolocation
[task 2023-01-29T00:31:07.880Z] 00:31:07 INFO - Buffered messages logged at 00:31:06
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - Console message: [JavaScript Error: "Error: Can't find profile directory." {file: "resource://gre/modules/XULStore.jsm" line: 58}]
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - load@resource://gre/modules/XULStore.jsm:58:15
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - XULStore@resource://gre/modules/XULStore.jsm:17:10
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO -
[task 2023-01-29T00:31:07.881Z] 00:31:07 INFO - Buffered messages logged at 00:31:07
[task 2023-01-29T00:31:07.882Z] 00:31:07 INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | geolocation succeeded -
[task 2023-01-29T00:31:07.882Z] 00:31:07 INFO - Crash the utility process
[task 2023-01-29T00:31:07.884Z] 00:31:07 INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.885Z] 00:31:07 INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils utility process should be found -
[task 2023-01-29T00:31:07.885Z] 00:31:07 INFO - prune any previous crashes
[task 2023-01-29T00:31:07.885Z] 00:31:07 INFO - Buffered messages finished
[task 2023-01-29T00:31:07.886Z] 00:31:07 INFO - TEST-UNEXPECTED-FAIL | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | Uncaught exception in test - at chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339 - TypeError: can't access property "pruneOldCrashes", crashMan is undefined
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - Stack trace:
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - crashSomeUtility@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339:3
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - crashSomeUtilityActor@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:415:10
[task 2023-01-29T00:31:07.887Z] 00:31:07 INFO - Leaving test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.888Z] 00:31:07 INFO - Entering test bound testCleanup
[task 2023-01-29T00:31:07.888Z] 00:31:07 INFO - Clean up to avoid confusing future tests
[task 2023-01-29T00:31:07.889Z] 00:31:07 INFO - Kill Utility Process windowsUtils
[task 2023-01-29T00:31:07.891Z] 00:31:07 INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.891Z] 00:31:07 INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils process exists -
| Assignee | ||
Comment 54•3 years ago
|
||
Test needs to be gated on crashreporter. Bug 1813345 is a test-verify leak that is fairly reliable under that stress test. I'm looking at it now.
| Assignee | ||
Comment 55•3 years ago
|
||
Updated•3 years ago
|
Comment 56•3 years ago
|
||
Comment 57•3 years ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/de943230d39a
https://hg.mozilla.org/mozilla-central/rev/6d02ad6e0aee
https://hg.mozilla.org/mozilla-central/rev/dd7e66e5ab25
https://hg.mozilla.org/mozilla-central/rev/77ef9f7cf3cb
https://hg.mozilla.org/mozilla-central/rev/037ee6c3373f
https://hg.mozilla.org/mozilla-central/rev/2047cf0d7192
https://hg.mozilla.org/mozilla-central/rev/a2bec68b9a7a
Description
•