<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 5

•

4 years ago

Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.

Component: DOM: Geolocation → Widget: Win32

Flags: needinfo?(spohl.mozilla.bugs)

Priority: P2 → --

Stephen A Pohl [:spohl]

Comment 6

•

4 years ago

(In reply to Gian-Carlo Pascutto [:gcp] from comment #5)

Stephen, please take a look. I'm moving the component here because the crashes are win32k specific.

Added to list of triaged bugs. The component could probably have remained DOM:Geolocation since that's the component that's involved in this crash, but we'll track it under Win32 and get it assigned.

Flags: needinfo?(spohl.mozilla.bugs)

Priority: -- → P2

Stephen A Pohl [:spohl]

Updated

•

4 years ago

Whiteboard: [win:stability]

Updated

•

4 years ago

Assignee: nobody → mvollmer

Updated

•

4 years ago

Status: NEW → ASSIGNED

Updated

•

4 years ago

Comment 7

•

4 years ago

Max found a crash report from 2022-05-28 (https://crash-stats.mozilla.org/search/?release_channel=nightly&signature=%3DCLegacyPositionRequest%3A%3AGetLocationFromSession&product=Firefox&version=102.0a1&date=%3E%3D2022-05-24T00%3A00%3A00.000Z&date=%3C2022-05-30T12%3A23%3A00.000Z&_facets=install_time&_facets=version&_facets=address&_facets=moz_crash_reason&_facets=reason&_facets=build_id&_facets=platform_pretty_version&_facets=signature&_facets=useragent_locale&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports) which means bug 1770126 may not actually have fixed it.

Comment 8

•

4 years ago

Attached file locationlog.txt.moz_log — Details

Comment 9

•

4 years ago

Attached file locationlog.txt.child-6.moz_log — Details

Comment 10

•

4 years ago

I have been dabbling with different tests, where I open and close multiple tabs and windows using geolocation services, to see if there is any issue with timings and such. So far I have been unsuccessful in reproducing the crash. If someone wants to have a look at logs from a short test session with 3 parallel tabs and then another 3 tabs opened a few seconds apart each, I have attached those here. If someone wants to dive into bigger logs from opening thousands of tabs and running geolocation tests for hours, let me know, and I'll upload the logs somewhere.

fyi: One thing I noticed in past tests was that Geolocation objects didn't get garbage collected until all tabs and windows were closed, even though Geolocation objects are per tab/window. However with the fix in bug 1770126 this does not happen anymore, suggesting that this has been fixed and is not related to this crash.

Comment 11

•

4 years ago

Did some joint debugging. All the crash reports are off the main thread, while we're calling mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool) on the main thread, which can apparently invoke a response off main thread. So that at least explains why, despite all the WindowsLocationProvider calls are on the main thread/main process, we still can race. No real insight into the conditions to trigger this.

Looking at the logs also revealed that we call SetHighAccuracy before mLocation is even initialized, which makes using high-accuracy location requests non-functional in Firefox, unless they're preceded by a normal getlocation() request. So that's at least one bug to fix here, which may or may not be related to the above.

https://docs.microsoft.com/en-us/windows/win32/api/unknwn/nf-unknwn-iunknown-queryinterface(refiid_void)
https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/implementing-iunknown-in-c-plus-plus
https://devblogs.microsoft.com/oldnewthing/20040326-00/?p=40033

Comment 12

•

4 years ago

•

Edited

The multiple threads may be a red herring: it looks like Windows invokes a worker thread to wait on the Windows Location Service to do its thing, and blocks our main thread waiting for a response. So even though this is multi-threaded, this should be transparent to us, which is consistent with this being main process and in COM Single Threaded Appartment mode (the crashes precede win32k lockdown, which is what enabled Multithreaded COM, as all documented in ProcessRuntime.cpp).

Which would mean we're back to the hypothesis that it is a use after free. From investigating the code, the SetHighAccuracy is not all that relevant either, as nsGeolocationRequest::Allow() always calls it with the argument appropriately set.

I tried to intentionally break this by doing what comment 4 says the problem is: forcibly unregistering the event listener immediately after registering it, but still calling the SetDesiredAccuracy() even though there's no listeners. Doesn't crash. If that doesn't crash, I'm starting to be unclear what the UAF or threading problem could be. There's no evidence any of this is called off of main thread/main process (see first paragraph), and the ILocationEvents object is reference counted, so even destroying/unregistering everything shouldn't cause that.

The next step would be perhaps to load the exact same build as crashed into the debugger, and look at exactly what is supposed to sit on the offset that crashes (which is 160 bytes off the start of some pointer that Windows is looking for, and presumably would be an ILocationEvents object - unfortunately the minidumps don't have enough context to see this).

While digging through the internals of IUnknown and all that stuff, I did find that the QueryInterface method on LocationEvent has at least 2 bugs:

STDMETHODIMP
LocationEvent::QueryInterface(REFIID iid, void** ppv) {
  if (iid == IID_IUnknown) {
    *ppv = static_cast<IUnknown*>(this);
  } else if (iid == IID_ILocationEvents) {
    *ppv = static_cast<ILocationEvents*>(this);
  } else {
    return E_NOINTERFACE;
  }
  AddRef();
  return S_OK;
}

So this neither returns E_POINTER if ppv is null nor does it null out *ppv if we return E_NOINTERFACE. Interestingly, I could imagine that incomplete error checking in LocationAPI.dll could have an issue with these. That said, this seems pretty far fetched, as about half the QueryInterface implementations in our source tree seem to have one of these mistakes.

Still, seems worth a try. The next step would be to inspect that 0xA0 offset in LocationEvent or generally verify that the offending code in GetLocationFromSession is trying to access parts of LocationEvent objects.

Comment 13

•

4 years ago

It looks like the thing that sits at offset 0xA0 when LocationEvent is casted to ILocationEvents is __vfptr = 0x00007ffd96b50970 {xul.dll!const mozilla::dom::LocationEvent::`vftable'} {0x00007ffd8a1cb470 {xul.dll!mozilla::dom::LocationEvent::QueryInterface(const _GUID &, void * *)},

This would make some sense wrt the crash? The Windows code is trying to invoke the callback but the virtual method table has already been wiped out.

Comment 14

•

4 years ago

•

Edited

David has some suspicion regarding the reference counting. Looking at the example code: https://docs.microsoft.com/en-us/windows/win32/api/locationapi/nf-locationapi-ilocation-registerforreport
pLocationEvents there is definitely kept alive until the UnregisterForReport is done. But it seems to me that it would have to be a bug for whoever uses ILocationEvents to drop the reference and then use the object later, so barring concurrency this doesn't seem like it can be the problem (but it could very well work around it, or prevent us from hitting an issue that anyone using the reference code would never hit). ~~An easy fix for our use would then to be to move event to be a member mEvent on mLocation and be automatically Released() when mLocation is Released()~~ (see below), though some fiddling might be needed if we ever use multiple callbacks - but I think we only ever have a single global nsGeolocationService that itself handles multiple callbacks, from looking at the code.

Looking at another crash report: https://crash-stats.mozilla.org/report/index/084fde43-9dc5-4732-83f5-7a3580220601#allthreads

Main Thread
1 	user32.dll 	unsigned long RealMsgWaitForMultipleObjectsEx(unsigned long, void* const*, unsigned long, unsigned long, unsigned long) 	None
2 	combase.dll 	CCliModalLoop::BlockFn(void**, unsigned long, unsigned long*) 	onecore\com\combase\dcomrem\callctrl.cxx:2156 
...
13 	xul.dll 	mozilla::dom::WindowsLocationProvider::SetHighAccuracy(bool) 	dom/system/windows/WindowsLocationProvider.cpp:270

Worker Thread
5 	ntdll.dll 	RtlEnterCriticalSection 	None
6 	LocationApi.dll 	ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) 	None
7 	LocationApi.dll 	void CLegacyPositionRequest::GetLocationFromSession(bool) 	None
8 	LocationApi.dll 	static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) 	None
9 	ntdll.dll 	TppExecuteWaitCallback 	None 

Another Worker Thread
5 	ntdll.dll 	RtlEnterCriticalSection 	None
6 	LocationApi.dll 	ATL::CComCritSecLock<class ATL::CComAutoCriticalSection>::CComCritSecLock<class ATL::CComAutoCriticalSection>(class ATL::CComAutoCriticalSection&, bool) 	None
7 	LocationApi.dll 	void CLegacyPositionRequest::DispatchStatusChanged(enum LOCATION_REPORT_STATUS) 	None
8 	LocationApi.dll 	long CLegacyPositionRequest::StartLocationSession 	None
9 	LocationApi.dll 	long CLegacyPositionRequest::RestartLocationSession 	None
10 	LocationApi.dll 	virtual long CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) 	None
11 	LocationApi.dll 	virtual long CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const&, enum LOCATION_DESIRED_ACCURACY) 	None
12 	rpcrt4.dll 	Invoke 	None
Ø 

Crashing Thread
0 	locationapi.dll 	void CLegacyPositionRequest::GetLocationFromSession(bool) 		context
1 	locationapi.dll 	static void CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE*, void*, struct _TP_WAIT*, long) 		cfi
2 	ntdll.dll 	TppExecuteWaitCallback 		cfi

Notice how there's 3 LocationAPI.dll calls running on different threads. They're blocked on a COM Lock (waiting for "LocationSession" whatever that is?), but this at least explains how it would be possible for a worker thread to race with the Unregister call even if the latter blocks the main thread.

Unfortunately, if that's the problem, then the ILocationEvents really needs to be a global object with the application lifetime, and itself have threadsafe support for adding the callbacks (and setting mProvider...because if the above happens it could outlive that as well).

Comment 15

•

4 years ago

•

Edited

tl;dr

SetHighAccuracy should set a member flag but not assume mLocation already exists and that it can call stuff on it
QueryInterface on LocationEvent should be bugfixed just because
The LocationEvent passed into RegisterForReport needs to be a global that either owns mProvider too (yuck, at least there should be only one) or we need to carefully and thread-safely set/unset it as a member (also yuck). Seems unlikely this would fix anything based on current analysis.

Assignee

Comment 16

•

4 years ago

I'm still hoping to figure out why this breaks but I've reproduced the exact crash in a test case here. It fails 100% of the time, in a second or two. This is the key:

  // HRESULT hr = spLocation.CoCreateInstance(CLSID_Location); // works
  HRESULT hr = spLocation.CoCreateInstance(CLSID_Location, nullptr, CLSCTX_INPROC_SERVER); // fails

For an as-yet-unknown reason, the Location API doesn't like the inproc_server context we are feeding it. Running the default CLSCTX_ALL has run for hours without crashing.

Assignee

Comment 17

•

4 years ago

Attached file Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio! — Details

CLSCTX_ALL is in line with MSDN sample code.

Comment 18

•

4 years ago

•

Edited

Current hypothesis from looking at the LocationAPI.dll disassembly is that GetLocationFromSession relies on an atomic/spinlock to make sure that whatever is going to give it ILocationInformation objects still exists, but that StartLocationSession has a single ATL::AtlComPtrAssign that is NOT under that lock, and which could wipe/rewrite that object at an in-opportune time.

It's hard to infer the locking from looking at the disassembly, but this matches what David found with memory breakpoints. In any case it seems calling Start/Stop on one thread while another tread is in GetLocationForSession is bad news. I guess the CLSCTX_ALL avoids that?

Comment 19

•

4 years ago

•

Edited

For more fun, the suspect part of the code is very different between Windows 11 and Windows 10, due to different inlining of CRITICAL_SECTION, I don't see the bug on my Windows 11's copy of the DLL (we do have reports on that OS), and David's testcase doesn't fail.

https://hg.mozilla.org/mozilla-central/rev/bd1520016295

Comment 20

•

4 years ago

Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bd1520016295 Windows ILocation should be allowed in all COM execution contexts r=Jamie

Norisz Fay [:noriszfay]

Comment 21

•

4 years ago

bugherder

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox103: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 103 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

status-firefox101: --- → wontfix

status-firefox102: --- → affected

status-firefox-esr91: --- → wontfix

BugBot [:suhaib / :marco/ :calixte]

Comment 22

•

4 years ago

Let's deal with https://bugzilla.mozilla.org/show_bug.cgi?id=1704500#c15 in a follow-up.

Comment 23

•

4 years ago

The patch landed in nightly and beta is affected.
:mvollmer, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(mvollmer)

Comment 24

•

4 years ago

Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!

Beta/Release Uplift Approval Request

User impact if declined: Potential low volume crash when using location.
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Low risk: changing a single constant in the code.
String changes made/needed:
Is Android affected?: No

Attachment #9279350 - Flags: approval-mozilla-beta?

Pascal Chevrel (relman team) -> :pascalc

Updated

•

4 years ago

Flags: needinfo?(mvollmer)

Comment 25

•

4 years ago

Comment on attachment 9279350 [details]
Bug 1704500: Windows ILocation should be allowed in all COM execution contexts r=emilio!

Approved for 102 beta 5, thanks.

Attachment #9279350 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Pascal Chevrel (relman team) -> :pascalc

Comment 26

•

4 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/a39b419d4e09

status-firefox102: affected → fixed

Assignee

Comment 27

•

4 years ago

So that was a bust. I can still fairly reliably reproduce this issue in my sample and I think I see the core issue now. It's complicated. I had wondered why there were always three background threads involved in the crash when it seemed like a two-thread mutex bug, but it turns out the third thread is crucial.

Fundamentally, the crash is that one thread is halfway through CLegacyPositionRequest::RestartLocationSession, which calls CLegacyPositionRequest::StopLocationSession, which clears a pointer, then it calls CLegacyPositionRequest::StartLocationSession, which would reset the pointer but we don't get that far because a background thread from the CLegacyPositionRequest's threadpool crashes trying to use the cleared pointer.

There would be deadlock issues with holding the object's mutex across the Stop and Start calls (also mutex re-entrancy issues). RestartLocationSession tries to handle this case by running single-threaded -- one of the first things StopLocationSession does is wait for its worker threadpool to empty its current queue (kind of -- it looks like this is part of the bug). It does this while it holds the mutex -- it also clears the relevant pointer. StartLocationSession would then restart the threadpool callbacks once the values are restored. The first problem is that StopLocationSession would call WaitForThreadpoolWaitCallbacks to wait for them to complete, but that call is conditioned on another flag whose value it doesn't like. This is where the third thread comes in. It is running CLegacyPositionRequest::DispatchPositionChanged in a second thread from the threadpool. DispatchPositionChanged grabs the mutex at first, sets the bad value, then releases the mutex, then unsets the bad value. It's not clear why it maintains this value. The value appears to be an atomic boolean indicating whether there is already a callback request running but it is racy. I can't see why WaitForThreadpoolWaitCallbacks is gated on this but, in particular, the gate seems backwards -- it only waits for callbacks if there are no callbacks to wait for.

The second issue is also with WaitForThreadpoolWaitCallbacks -- it passes FALSE for fCancelPendingCallbacks. It's part of the following call sequence:

    SetThreadpoolWait(x, 0, 0);
    if ( !flag )
      WaitForThreadpoolWaitCallbacks(x, FALSE);
    CloseThreadpoolWait(x);

Again, I don't know what the point of the flag is but it's also the case that fixing only this won't keep it from running other callbacks between the end of the WaitForThreadpoolWaitCallbacks call and the end of the CloseThreadpoolWait call. That's apparently enough time for this race to be hit. If the code followed the pattern suggested by the documentation of these APIs:

    SetThreadpoolWait(x, 0, 0);
    WaitForThreadpoolWaitCallbacks(x, TRUE);
    CloseThreadpoolWait(x);

then the race wouldn't be possible. I'm guessing it wasn't written the normal way because of some other offending code I haven't seen.

So here's my theory for the step-by-step process to failure:

A worker thread runs the timer callback. That thread looks like this:

	[...]
 	combase.dll!ObjectStubless() Line 176	Unknown	Non-user code. Symbols loaded.
	LocationApi.dll!CLegacyPositionRequest::DispatchPositionChanged(struct ILatLongReportInternal *)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::GetLocationFromSession(bool)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::WaitForLocationInformationThreadProc(struct _TP_CALLBACK_INSTANCE *,void *,struct _TP_WAIT *,long)	Unknown	Non-user code. Symbols loaded.

DispatchPositionChanged has set our special value to 1 and exited the mutex:

00007FFF7E421179  mov         eax,1  
00007FFF7E42117E  xchg        eax,dword ptr [rbx+0C0h]  
00007FFF7E421184  prefetchw   [rbx+0C4h]  
00007FFF7E42118B  mov         eax,dword ptr [rbx+0C4h]  
00007FFF7E421191  mov         ecx,eax  
00007FFF7E421193  lock cmpxchg dword ptr [rbx+0C4h],ecx  
00007FFF7E42119B  jne         CLegacyPositionRequest::DispatchPositionChanged+0C9h (07FFF7E421191h)  
00007FFF7E42119D  mov         edi,eax  
00007FFF7E42119F  cmp         byte ptr [rbp-10h],0  
00007FFF7E4211A3  je          CLegacyPositionRequest::DispatchPositionChanged+0EDh (07FFF7E4211B5h)  
00007FFF7E4211A5  mov         rcx,qword ptr [rbp-18h]  
00007FFF7E4211A9  call        qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]

A second worker runs SetDesiredAccuracy, which begins our RestartLocationSession.

	[...]
 	combase.dll!ObjectStubless() Line 176	Unknown	Non-user code. Symbols loaded.
>	LocationApi.dll!CLegacyPositionRequest::StopLocationSession(bool)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::RestartLocationSession(void)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyPositionRequest::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY)	Unknown	Non-user code. Symbols loaded.
 	LocationApi.dll!CLegacyLocationApiImpl::SetDesiredAccuracy(struct _GUID const &,enum LOCATION_DESIRED_ACCURACY)	Unknown	Non-user code. Symbols loaded.

It claims the mutex in StopLocationSession, which must happen after step 1. It then reads the bad special value 1 from step 1:

00007FFF7E42200B  xor         r8d,r8d  
00007FFF7E42200E  xor         edx,edx  
00007FFF7E422010  mov         rcx,rdi  
00007FFF7E422013  call        qword ptr [__imp_SetThreadpoolWait (07FFF7E454BB8h)]  
00007FFF7E42201A  nop         dword ptr [rax+rax]  
00007FFF7E42201F  test        esi,esi  
00007FFF7E422021  jne         CLegacyPositionRequest::StopLocationSession+0C8h (07FFF7E422034h)  
00007FFF7E422023  xor         edx,edx  
00007FFF7E422025  mov         rcx,rdi  
00007FFF7E422028  call        qword ptr [__imp_WaitForThreadpoolWaitCallbacks (07FFF7E454BC8h)]  
00007FFF7E42202F  nop         dword ptr [rax+rax]  
00007FFF7E422034  mov         rcx,rdi  
00007FFF7E422037  call        qword ptr [__imp_CloseThreadpoolWait (07FFF7E454BD0h)]

The value in %esi is the special value 1, so we skip __imp_WaitForThreadpoolWaitCallbacks. This means we have no guarantees about future spawning of callbacks, even after __imp_CloseThreadpoolWait.

Stop then continues to grab the mutex, clear the pointer that will cause the crash (using ATL::AtlComPtrAssign), and release the mutex:

00007FFF7E42204C  call        ATL::CComCritSecLock<ATL::CComAutoCriticalSection>::CComCritSecLock<ATL::CComAutoCriticalSection> (07FFF7E41FADCh)  
00007FFF7E422051  lea         rdi,[rbx+0A0h]  
00007FFF7E422058  mov         rdx,qword ptr [rdi]  
00007FFF7E42205B  test        rdx,rdx  
00007FFF7E42205E  je          CLegacyPositionRequest::StopLocationSession+0FEh (07FFF7E42206Ah)  
00007FFF7E422060  lea         rcx,[rsp+70h]  
00007FFF7E422065  call        ATL::AtlComPtrAssign (07FFF7E416294h)  
00007FFF7E42206A  cmp         qword ptr [rdi],0  
00007FFF7E42206E  je          CLegacyPositionRequest::StopLocationSession+10Eh (07FFF7E42207Ah)  
00007FFF7E422070  xor         edx,edx  
00007FFF7E422072  mov         rcx,rdi  
00007FFF7E422075  call        ATL::AtlComPtrAssign (07FFF7E416294h)  
00007FFF7E42207A  test        bpl,bpl  
00007FFF7E42207D  jne         CLegacyPositionRequest::StopLocationSession+11Fh (07FFF7E42208Bh)  
00007FFF7E42207F  lea         rcx,[rbx+8Ch]  
00007FFF7E422086  call        ATL::CComGITPtr<ILocationEvents>::Revoke (07FFF7E42067Ch)  
00007FFF7E42208B  lea         rcx,[rbx+0A8h]  
00007FFF7E422092  call        ATL::CHandle::Close (07FFF7E420D14h)  
00007FFF7E422097  mov         byte ptr [rbx+0B0h],0  
00007FFF7E42209E  cmp         byte ptr [rsp+40h],0  
00007FFF7E4220A3  je          CLegacyPositionRequest::StopLocationSession+14Ah (07FFF7E4220B6h)  
00007FFF7E4220A5  mov         rcx,qword ptr [rsp+38h]  
00007FFF7E4220AA  call        qword ptr [__imp_LeaveCriticalSection (07FFF7E454AD0h)]

At any point between then and where Start restores things, we can launch a new callback thread that will crash on the null pointer.

These values can be seen in the debugger. I have full core dumps.

I haven't been able to conclude anything about how this relates to the public APIs, like a relationship between report requests and wait jobs. There still may be something we can do to finesse this. We might also be able to get a fix from Microsoft. But barring that, this is a fine candidate for stuffing into a utility process, where it can crash and be restarted without us caring much.

Assignee: mvollmer → davidp99

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Pascal Chevrel (relman team) -> :pascalc

Assignee

Comment 28

•

4 years ago

We'll back this out (bug 1774073). We're going to try the utility process approach.

Updated

•

4 years ago

status-firefox102: fixed → affected

status-firefox103: fixed → affected

Updated

•

4 years ago

Comment 29

•

4 years ago

Attached file WIP: Bug 1704500: Use Utility Process For Geolocation on Windows (obsolete) — Details

Working prototype for doing geolocation in a utility process.

Assignee

Comment 30

•

3 years ago

Attached file Bug 1704500: Skip AsyncBlockers delay when there are no blockers r=#ipc-reviewers! (obsolete) — Details

AsyncBlockers are used in a few places, in particular including child
process shutdowns, and they always cause waiting (10 seconds at
process close), even if they are not in use. This skips the wait in
that case.

Assignee

Comment 31

•

3 years ago

Attached file Bug 1704500: Add logging to UtilityProcessHost/Manager r=gerard-majax! — Details

Adds a "utilproc" log to trace utility process launch and shutdown steps.

Depends on D155016

Assignee

Comment 32

•

3 years ago

Attached file Bug 1704500: Add WindowsUtils kind of utility process on Windows r=#ipc-reviewers!,gerard-majax! — Details

Adds a new type of utility process that is set up to handle Windows OS objects. We are adding this process type to run Windows geolocation APIs but more services are expected to be included in it. The ILocation APIs have a race condition that would otherwise crash the main process. The ILocation work is in a later patch in the series.

Depends on D155017

Assignee

Comment 33

•

3 years ago

Attached file Bug 1704500: Use structs to normalize utility process sandbox code r=gerard-majax!,bobowen! — Details

Adds UtilitySandboxProperties, which abstract the more universal sandbox properties into a data object so that the various types of utility process can simply list them. This also adds a somewhat weak sandbox for the new "WindowsUtils" utility process type.

Depends on D155018

Assignee

Comment 34

•

3 years ago

Attached file Bug 1704500: Run Windows' ILocation provider in a utility process r=emilio!,cmartin! — Details

Previously, the ILocation COM object ran in the main process. Due to a race condition, it would sometimes crash the process so we are moving it to a utility process, where it can crash safely. The old WindowsLocationProvider class has become a proxy that forwards requests to the child process via the PWindowsLocation parent actor. The PWindowsLocationChild executes the requests and sends the results back. Failures in ILocation (like geolocation being off in Windows settings) are sent back as well -- they will cause the WindowsLocationProvider to start the MLS geolocation fallback, as they did before.

Depends on D155019

Updated

•

3 years ago

Attachment #9290524 - Attachment description: WIP: Bug 1704500: Skip AsyncBlockers delay when there are no blockers → Bug 1704500: Skip AsyncBlockers delay when there are no blockers r=#ipc-reviewers!

Updated

•

3 years ago

Attachment #9290525 - Attachment description: WIP: Bug 1704500: Add logging to UtilityProcessHost/Manager → Bug 1704500: Add logging to UtilityProcessHost/Manager r=gerard-majax!

Updated

•

3 years ago

Attachment #9290526 - Attachment description: WIP: Bug 1704500: Add WindowsUtils kind of utility process on Windows → Bug 1704500: Add WindowsUtils kind of utility process on Windows r=#ipc-reviewers!,gerard-majax!

Updated

•

3 years ago

Attachment #9290527 - Attachment description: WIP: Bug 1704500: Use const structs to normalize utility process sandbox code → Bug 1704500: Use const structs to normalize utility process sandbox code r=gerard-majax!,bobowen!

Updated

•

3 years ago

Attachment #9290528 - Attachment description: WIP: Bug 1704500: Run Windows' ILocation provider in a utility process → Bug 1704500: Run Windows' ILocation provider in a utility process r=emilio!

Updated

•

3 years ago

Attachment #9281329 - Attachment is obsolete: true

Stephen A Pohl [:spohl]

Comment 35

•

3 years ago

Setting n-i to not lose track of the patches in this bug.

Flags: needinfo?(davidp99)

Updated

•

3 years ago

Attachment #9290527 - Attachment description: Bug 1704500: Use const structs to normalize utility process sandbox code r=gerard-majax!,bobowen! → Bug 1704500: Use structs to normalize utility process sandbox code r=gerard-majax!,bobowen!

Updated

•

3 years ago

Attachment #9290528 - Attachment description: Bug 1704500: Run Windows' ILocation provider in a utility process r=emilio! → Bug 1704500: Run Windows' ILocation provider in a utility process r=emilio!,cmartin!

Assignee

Comment 36

•

3 years ago

Attached file Bug 1704500: Test Windows geolocation utility process restart on crash r=gerard-majax! (obsolete) — Details

Make sure that the geolocation utility process restarts for georequests that arrive after a crash. Note that this tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).

Also generalizes some utility process testing functions for use with all utility process sandbox types.

Depends on D155020

Updated

•

3 years ago

Attachment #9290524 - Attachment is obsolete: true

Assignee

Comment 37

•

3 years ago

Attached file Bug 1704500: Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax! (obsolete) — Details

This eliminates a redundant enum that we had to keep synchronized with another.

Depends on D155020

Assignee

Comment 38

•

3 years ago

Attached file Bug 1704500: Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax! (obsolete) — Details

This eliminates a redundant enum that we had to keep synchronized with another.

Depends on D155020

Assignee

Comment 39

•

3 years ago

Attached file Bug 1704500: Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax! — Details

This eliminates a redundant enum that we had to keep synchronized with another.

Depends on D155020

Assignee

Comment 40

•

3 years ago

Attached file Bug 1704500: Add utility process test helpers r=gerard-majax! — Details

Also makes the existing utility process test functions a bit more general.

Depends on D162943

Assignee

Comment 41

•

3 years ago

Attached file Bug 1704500: Test Windows geolocation utility process restart on crash r=gerard-majax! — Details

Make sure that the geolocation utility process restarts for georequests that arrive after a crash. This tests process behavior regardless of whether or not the OS is set to allow geolocation (and in automation, it is not).

Depends on D162944

Updated

•

3 years ago

Attachment #9303369 - Attachment is obsolete: true

Updated

•

3 years ago

Attachment #9305047 - Attachment is obsolete: true

Updated

•

3 years ago

Attachment #9305048 - Attachment is obsolete: true

Assignee

Updated

•

3 years ago

Flags: needinfo?(davidp99)

Ray Kraesig [:rkraesig]

Comment 42

•

3 years ago

Marking this as blocking the out-of-process file picker, since (per discussion in meetings) some of the infrastructure in this patchset should be useful for the currently-planned implementation thereof.

Blocks: 1677170

Comment 43

•

3 years ago

Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5d62d1a33cf8 Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/e581bfdf0f8e Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/f85e8eed6fed Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/93e59d5488dc Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/1c84f8f0796a Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/817d104b3a3e Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/8f1b73a96bb0 Test Windows geolocation utility process restart on crash r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/da69da68588d 1704500, 1704500, 1704500, 1704500, 1704500, 1704500: apply code formatting via Lando

Marian-Vasile Laza

Comment 44

•

3 years ago

•

Edited

Backed out 8 changesets (Bug 1704500) for causing build bustages on UtilityProcessHost.cpp.
Backout link
Push with failures
Failure Log
Also Bd failure Log
Also Failure Log

Flags: needinfo?(davidp99)

Comment 45

•

3 years ago

Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0d1c0dcd7645 Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/341ec2465c7d Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/241be693628d Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/d4508f8ba7df Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/c2474d1c0046 Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/897b985f9298 Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/e48723d8d42e Test Windows geolocation utility process restart on crash r=gerard-majax

Cristian Tuns

Comment 46

•

3 years ago

Backed out for causing build bustages

Backout link
Push with failures
Failure Log
Failure line: /builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp:5990:61: error: no member named 'WINDOWS_UTILS' in 'mozilla::ipc::SandboxingKind'

https://treeherder.mozilla.org/jobs?repo=try&revision=1644fe483d0163e8df112d3f7a5e40f66582abcb

Assignee

Comment 47

•

3 years ago

Flags: needinfo?(davidp99)

Comment 48

•

3 years ago

Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6bfb4d8de1e1 Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/a1d1a97a2ffa Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/90a8d51fc17c Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/30632246ee2b Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/313b8770dc51 Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/747d84ff17c7 Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/c3ae64389c26 Test Windows geolocation utility process restart on crash r=gerard-majax

Sandor Molnar[:smolnar]

Comment 49

•

3 years ago

Backed out for causing multiple failures in test_geolocation_reset_accuracy/browser_utility_geolocation_crashed

Backout link: https://hg.mozilla.org/integration/autoland/rev/f115e287797670e7358559d07f525d5bfd6f5eb3

Push with failures

Failure log xpc

Failure log bc

Flags: needinfo?(davidp99)

Donal Meehan [:dmeehan]

Updated

•

3 years ago

status-firefox102: affected → wontfix

status-firefox103: affected → wontfix

status-firefox109: --- → wontfix

status-firefox110: --- → wontfix

status-firefox111: --- → affected

status-firefox-esr102: --- → wontfix

Assignee

Comment 50

•

3 years ago

Oof. These tests pass on Win10 locally, with or without location turned on, so this is some other test environment discrepancy. This will take some effort to track down.

Flags: needinfo?(davidp99)

https://treeherder.mozilla.org/jobs?repo=try&group_state=expanded&selectedTaskRun=JDU5ik7GTG-knIYUAdG1Sw.0&revision=f529af38cd42c404d4caa3f4e28e345ad59e4d27

Assignee

Comment 51

•

3 years ago

Comment 52

•

3 years ago

Pushed by daparks@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d96e810e20ec Add logging to UtilityProcessHost/Manager r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/88e8015e1ec4 Add WindowsUtils kind of utility process on Windows r=gerard-majax,ipc-reviewers,nika https://hg.mozilla.org/integration/autoland/rev/1386732459ce Use structs to normalize utility process sandbox code r=gerard-majax,bobowen https://hg.mozilla.org/integration/autoland/rev/4be2d58ddf54 Run Windows' ILocation provider in a utility process r=emilio,cmartin https://hg.mozilla.org/integration/autoland/rev/c0f0a280aaea Make UtilityActorName an alias for WebIDLUtilityActorName r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/1c27f9c36fda Add utility process test helpers r=gerard-majax https://hg.mozilla.org/integration/autoland/rev/afcc1e8b5ad7 Test Windows geolocation utility process restart on crash r=gerard-majax

Treeherder Bug Filer

Updated

•

3 years ago

Regressions: 1813345

Iulian Moraru

Comment 53

•

3 years ago

•

Edited

Backed out for causing bc failures on browser_utility_geolocation_crashed.js.

Please also take a look at Bug 1813345 before landing this. The bug was closed as fixed because of this backout.

[task 2023-01-29T00:31:04.892Z] 00:31:04     INFO - TEST-START | ipc/glue/test/browser/browser_utility_geolocation_crashed.js
[task 2023-01-29T00:31:05.493Z] 00:31:05     INFO - GECKO(7340) | *** WIFI GEO: startup called.
[task 2023-01-29T00:31:05.979Z] 00:31:05     INFO - GECKO(7340) | JavaScript error: resource://gre/modules/XULStore.jsm, line 58: Error: Can't find profile directory.
[task 2023-01-29T00:31:07.490Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: onStatus called.wifi-timeout
[task 2023-01-29T00:31:07.496Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: Use request cache:false reason:No cached data
[task 2023-01-29T00:31:07.497Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: Sending request
[task 2023-01-29T00:31:07.497Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: onStatus called.xhr-start
[task 2023-01-29T00:31:07.508Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: geo provider reported: -122.08769:37.41857
[task 2023-01-29T00:31:07.517Z] 00:31:07     INFO - GECKO(7340) | *** WIFI GEO: shutdown called
[task 2023-01-29T00:31:07.528Z] 00:31:07     INFO - TEST-INFO | started process screenshot
[task 2023-01-29T00:31:07.876Z] 00:31:07     INFO - TEST-INFO | screenshot: exit 0
[task 2023-01-29T00:31:07.876Z] 00:31:07     INFO - Buffered messages logged at 00:31:04
[task 2023-01-29T00:31:07.877Z] 00:31:07     INFO - Entering setup bound 
[task 2023-01-29T00:31:07.878Z] 00:31:07     INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07     INFO - Console message: [JavaScript Warning: "Unknown category for SetEventRecordingEnabled: page_load"]
[task 2023-01-29T00:31:07.878Z] 00:31:07     INFO - Leaving setup bound 
[task 2023-01-29T00:31:07.879Z] 00:31:07     INFO - Entering test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.879Z] 00:31:07     INFO - Start the Windows utility process
[task 2023-01-29T00:31:07.880Z] 00:31:07     INFO - Requesting geolocation
[task 2023-01-29T00:31:07.880Z] 00:31:07     INFO - Buffered messages logged at 00:31:06
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - Console message: [JavaScript Error: "Error: Can't find profile directory." {file: "resource://gre/modules/XULStore.jsm" line: 58}]
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - load@resource://gre/modules/XULStore.jsm:58:15
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - XULStore@resource://gre/modules/XULStore.jsm:17:10
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - 
[task 2023-01-29T00:31:07.881Z] 00:31:07     INFO - Buffered messages logged at 00:31:07
[task 2023-01-29T00:31:07.882Z] 00:31:07     INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | geolocation succeeded - 
[task 2023-01-29T00:31:07.882Z] 00:31:07     INFO - Crash the utility process
[task 2023-01-29T00:31:07.884Z] 00:31:07     INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.885Z] 00:31:07     INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils utility process should be found - 
[task 2023-01-29T00:31:07.885Z] 00:31:07     INFO - prune any previous crashes
[task 2023-01-29T00:31:07.885Z] 00:31:07     INFO - Buffered messages finished
[task 2023-01-29T00:31:07.886Z] 00:31:07     INFO - TEST-UNEXPECTED-FAIL | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | Uncaught exception in test - at chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339 - TypeError: can't access property "pruneOldCrashes", crashMan is undefined
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - Stack trace:
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - crashSomeUtility@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:339:3
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - crashSomeUtilityActor@chrome://mochitests/content/browser/ipc/glue/test/browser/head.js:415:10
[task 2023-01-29T00:31:07.887Z] 00:31:07     INFO - Leaving test bound testGeolocationProcessCrash
[task 2023-01-29T00:31:07.888Z] 00:31:07     INFO - Entering test bound testCleanup
[task 2023-01-29T00:31:07.888Z] 00:31:07     INFO - Clean up to avoid confusing future tests
[task 2023-01-29T00:31:07.889Z] 00:31:07     INFO - Kill Utility Process windowsUtils
[task 2023-01-29T00:31:07.891Z] 00:31:07     INFO - Utility process infos = [{"childID":0,"cpuCycleCount":932110890,"cpuTime":343750000,"memory":120610816,"origin":"","pid":4244,"threads":[{"cpuCycleCount":798054878,"cpuTime":296875000,"name":"MainThread","tid":4904},{"cpuCycleCount":18965478,"cpuTime":0,"name":"IPC I/O Child","tid":7360},{"cpuCycleCount":9100750,"cpuTime":0,"name":"COM MTA","tid":4416},{"cpuCycleCount":30853086,"cpuTime":15625000,"name":"BackgroundThreadPool #1","tid":4320},{"cpuCycleCount":9419998,"cpuTime":0,"name":"ProfilerChild","tid":3076},{"cpuCycleCount":6455848,"cpuTime":0,"name":"","tid":6464},{"cpuCycleCount":1586894,"cpuTime":0,"name":"","tid":7992},{"cpuCycleCount":611598,"cpuTime":0,"name":"","tid":5540},{"cpuCycleCount":55476366,"cpuTime":31250000,"name":"","tid":3292}],"type":"utility","utilityActors":[{"actorName":"windowsUtils"}],"windows":[]}]
[task 2023-01-29T00:31:07.891Z] 00:31:07     INFO - TEST-PASS | ipc/glue/test/browser/browser_utility_geolocation_crashed.js | exactly one windowsUtils process exists -

Flags: needinfo?(davidp99)

Assignee

Comment 54

•

3 years ago

Test needs to be gated on crashreporter. Bug 1813345 is a test-verify leak that is fairly reliable under that stress test. I'm looking at it now.

Flags: needinfo?(davidp99)

https://treeherder.mozilla.org/jobs?repo=try&revision=127f0a12cc0c07ac7858ea0dad6c6935564f3fea

Assignee

Comment 55

•

3 years ago

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

Status: REOPENED → ASSIGNED

Target Milestone: 103 Branch → ---