Open Bug 1964731 Opened 1 year ago Updated 9 months ago

Retrieving geolocation coordinates for WebDriver tests in CI takes more than 500ms on ASAN builds

Categories

(Core :: DOM: Geolocation, defect)

defect

Tracking

()

People

(Reporter: Sasha, Unassigned, NeedInfo)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file)

After bug 1960651 was fixed, the retrieving geolocation coordinates for WebDriver tests in CI became fast, but we started to get failures on ASAN builds because it would still take more than 500 ms to get them (this is a timeout for this API call which we have), see more details in bug 1964346. We started to use a timeout multiplier to increase the timeout for slow builds. But maybe something can be improved here.

Here an excerpt from a log of an ASAN jobs:
https://treeherder.mozilla.org/logviewer?job_id=506831595&repo=mozilla-central&lineNumber=45515-45524

[task 2025-05-06T15:22:48.215Z] 15:22:48     INFO - PID 17377 | 1746544968212	RemoteAgent	DEBUG	WebDriverBiDiConnection 4a593ff8-e804-46b4-8547-62ae645355b7 -> {"id":10,"method":"script.callFunction","params":{"functionDeclaration":"(multiplier) =>\n                new Promise(\n                    resolve => window.navigator.geolocation.getCurrentPosition(\n ... de: error.code, message: error.message}),\n                        {timeout: 500 * multiplier}\n                ))\n            ","target":{"context":"970a14c5-5737-406b-9e38-3c33c311ade1"},"awaitPromise":true,"arguments":[{"type":"number","value":4}]}}
[task 2025-05-06T15:22:48.222Z] 15:22:48     INFO - PID 17377 | 1746544968214	RemoteAgent	TRACE	Received command script.callFunction for destination ROOT
[task 2025-05-06T15:22:48.269Z] 15:22:48     INFO - PID 17377 | 1746544968258	RemoteAgent	TRACE	Created MessageHandler WINDOW_GLOBAL for session 86a902c1-061c-4eac-8b27-7ff9aa4492f1
[task 2025-05-06T15:22:48.271Z] 15:22:48     INFO - PID 17377 | 1746544968258	RemoteAgent	TRACE	Received command script.callFunctionDeclaration for destination WINDOW_GLOBAL
[task 2025-05-06T15:22:48.273Z] 15:22:48     INFO - PID 17377 | 1746544968272	RemoteAgent	TRACE	WebDriverProcessData actor created for PID 17634
[task 2025-05-06T15:22:48.317Z] 15:22:48     INFO - PID 17377 | *** WIFI GEO: startup called.
[task 2025-05-06T15:22:48.324Z] 15:22:48     INFO - PID 17377 | *** WIFI GEO: watch called
[task 2025-05-06T15:22:48.325Z] 15:22:48     INFO - PID 17377 | *** WIFI GEO: onStatus called.wifi-timeout
[task 2025-05-06T15:22:48.325Z] 15:22:48     INFO - PID 17377 | *** WIFI GEO: Use request cache:true reason:New req. is GeoIP.
[task 2025-05-06T15:22:48.328Z] 15:22:48     INFO - PID 17377 | 1746544968325	RemoteAgent	DEBUG	WebDriverBiDiConnection 4a593ff8-e804-46b4-8547-62ae645355b7 <- {"type":"success","id":10,"result":{"realm":"8352c3fd-d3c5-4c95-a13f-3c3ddb49967b","type":"success","result":{"type":"object","value":[["latitude",{"type":"number","value":37.41857}],["longitude",{"type":"number","value":-122.08769}],["altitude",{"type":"null"}],["accuracy",{"type":"number","value":42}],["altitudeAccuracy",{"type":"null"}],["heading",{"type":"null"}],["speed",{"type":"null"}]]}}}

The delay of about 400ms happens before the WIFI GEO: startup called log output.

Kagami, maybe you have an idea what this could be? If not would a log for Geolocation:5 help?

Blocks: 1886512
Flags: needinfo?(krosylight)

The delay of about 400ms happens before the WIFI GEO: startup called log output.

No idea yet, but that's 40ms, not 400ms.

I agree that there may still be room for enhancement, but why can we make it so strict in WPT? I don't think such strict timeout should apply to all implementations. It'll also fail for any local run if one depends on a remote server and internet connection is slow.

Sorry I had a wrong reference. Here is a failing test for ASAN:
https://treeherder.mozilla.org/logviewer?job_id=506785799&repo=autoland&lineNumber=45541-45551

[task 2025-05-06T09:02:29.772Z] 09:02:29     INFO - PID 17426 | 1746522149766	RemoteAgent	TRACE	Received command script.callFunctionDeclaration for destination WINDOW_GLOBAL
[task 2025-05-06T09:02:30.436Z] 09:02:30     INFO - PID 17426 | *** WIFI GEO: startup called.
[task 2025-05-06T09:02:30.439Z] 09:02:30     INFO - PID 17426 | *** WIFI GEO: watch called
[task 2025-05-06T09:02:30.439Z] 09:02:30     INFO - PID 17426 | *** WIFI GEO: onStatus called.wifi-timeout
[task 2025-05-06T09:02:30.441Z] 09:02:30     INFO - PID 17426 | *** WIFI GEO: Use request cache:false reason:No cached data
[task 2025-05-06T09:02:30.444Z] 09:02:30     INFO - PID 17426 | *** WIFI GEO: Sending request
[task 2025-05-06T09:02:30.444Z] 09:02:30     INFO - PID 17426 | *** WIFI GEO: onStatus called.xhr-start
ModemManager[17802]: <info>  ModemManager (version 1.10.0) starting in system bus...
[task 2025-05-06T09:02:30.461Z] 09:02:30     INFO - PID 17426 | 1746522150460	RemoteAgent	DEBUG	WebDriverBiDiConnection 8ee18d15-89f8-4e8c-b973-a28ee622c7d9 <- {"type":"success","id":8,"result":{"realm":"fd89d63b-dc8f-42dd-868e-f3fef11441cf","type":"success","result":{"type":"object","value":[["code",{"type":"number","value":3}],["message",{"type":"string","value":"Position acquisition timed out"}]]}}}

The delay - in the above case around 650ms - is happening before startup is called, so it cannot be related to network latency, right?

Network latency can't be a problem for our CI run, as it should always use the mock server. 🤔🤔

I pushed a try build with the Geolocation:5 logs enabled. Lets see if that might give an indication.

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=69eb0a2213bd3abcd209b6e3fb3a5e41beac789b

It doesn't look like it's helpful that much:
https://treeherder.mozilla.org/logviewer?job_id=506971728&repo=try&lineNumber=45438-45464

In some cases it works fine but in others (the 2nd case for the above link) it's taking longer. I wonder if this might be related to switching between tabs and activation?

Those lines also happened within 10ms, so not a good reference...

https://treeherder.mozilla.org/logviewer?job_id=506971728&repo=try&lineNumber=45438-45443

Here I see the fallback from geoclue took 140ms, we could force it to try network provider and skip geoclue by geo.provider.use_geoclue=false and geo.provider.use_mls=true.

Flags: needinfo?(krosylight)

Sorry, looks like I pasted again something invalid. Not sure why the clipboard didn't had the updated URL. Here is the one that I wanted to refer:

https://treeherder.mozilla.org/logviewer?job_id=506971728&repo=try&lineNumber=46993-47000

There are 320ms and maybe this is related to called.wifi-timeout? But there are other cases when this comes in pretty quickly. So not sure if that is related.

(In reply to Kagami Rosylight [:saschanaz] (they/them) from comment #8)

Here I see the fallback from geoclue took 140ms, we could force it to try network provider and skip geoclue by geo.provider.use_geoclue=false and geo.provider.use_mls=true.

We don't disable these in automation yet. Which side-effect would it have? I assume we should not disable for WebDriver users?

Those lines have no geoclue logs so seems not related... And there's no other log during the delay 🤔🤔🤔

The side effect would be that the geolocation API literally won't use geoclue. But that would deviate from the usual behavior, yes.

I assume that Geolocation::GetCurrentPosition is the entry point? Maybe we could add a log entry there to see when exactly the geolocation provider is called? Maybe there is some kind of delay before? Could it maybe related to permission checks etc?

Severity: -- → S3
No longer blocks: 1964346
See Also: → 1964346

I finally got around and pushed a try build with the extra logging enabled and the multiplier removed so it should be easier to see failing tests. I already tried without removing the multiplier but didn't notice any longer delays anymore. Lets see if that helps now.

https://treeherder.mozilla.org/jobs?repo=try&revision=d8c7b4b3e8c6726c7b178508742759e31b2bc4c6

I'm no longer able to see this delay. I assume that the move to the Linux docker worker made this more stable. As such I would suggest that we are removing the multiplier when retrieving the geolocation in wdspec tests, so that we have 500ms for any build, which should be totally enough.

Component: DOM: Geolocation → WebDriver BiDi
Product: Core → Remote Protocol
Summary: Retrieving geolocation coordinates for WebDriver tests in CI takes more than 500ms on ASAN builds → [wdspec] Remove usage of multiplier when retrieving the geolocation position
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Points: --- → 1
Priority: -- → P3
Whiteboard: [webdriver:m16]

This is actually still a problem as the retriggered jobs from the try build show. Here an excerpt and maybe it helps to identify where we are hanging:

https://treeherder.mozilla.org/logviewer?job_id=512776610&repo=try&lineNumber=44799-44803

[task 2025-06-12T11:42:34.619+00:00] 11:42:34     INFO - PID 17464 | 1749728554614	RemoteAgent	TRACE	Received command script.callFunctionDeclaration for destination WINDOW_GLOBAL
[task 2025-06-12T11:42:34.622+00:00] 11:42:34     INFO - PID 17464 | [Child 17620: Main Thread]: I/Geolocation Geolocation::GetCurrentPosition
[task 2025-06-12T11:42:34.729+00:00] 11:42:34     INFO - PID 17464 | [Parent 17481: Main Thread]: D/Geolocation Checking GeoclueLocationProvider
[task 2025-06-12T11:42:35.182+00:00] 11:42:35     INFO - PID 17464 | [Child 17620: Main Thread]: D/Geolocation nsGeolocationRequest::Notify
[task 2025-06-12T11:42:35.184+00:00] 11:42:35     INFO - PID 17464 | [Child 17620: Main Thread]: D/Geolocation nsGeolocationRequest::NotifyErrorAndShutdown with error code: 3
[task 2025-06-12T11:42:35.186+00:00] 11:42:35     INFO - PID 17464 | [Child 17620: Main Thread]: D/Geolocation nsGeolocationRequest::NotifyError with error code: 3
[task 2025-06-12T11:42:35.187+00:00] 11:42:35     INFO - PID 17464 | [Child 17620: Main Thread]: D/Geolocation GeolocationPositionError::NotifyCallback
[task 2025-06-12T11:42:35.320+00:00] 11:42:35     INFO - PID 17464 | [Parent 17481: Main Thread]: D/Geolocation Selected GeoclueLocationProvider
ModemManager[17841]: <info>  ModemManager (version 1.10.0) starting in system bus...
[task 2025-06-12T11:42:35.323+00:00] 11:42:35     INFO - PID 17464 | *** WIFI GEO: startup called.

This can be seen as well for opt builds which fail with code: 3 (bug 1966071).

Blocks: 1966071
Points: 1 → ---
Component: WebDriver BiDi → DOM: Geolocation
Flags: needinfo?(krosylight)
Priority: P3 → --
Product: Remote Protocol → Core
Summary: [wdspec] Remove usage of multiplier when retrieving the geolocation position → Retrieving geolocation coordinates for WebDriver tests in CI takes more than 500ms on ASAN builds
Whiteboard: [webdriver:m16]

I triggered pernosco, let's see what it finds out.

Flags: needinfo?(krosylight)

Hi Kagami, I wanted to check back if you were able to figure out something from the Pernosco recording.

Flags: needinfo?(krosylight)

I did not get any response from Pernosco. Pinged Pernosco channel for support.

Flags: needinfo?(krosylight)

I wanted to check back if you got some support. If not shall we push a new try build?

No response. Perhaps you can try triggering pernosco and see if it responds to you, I feel like it just hates me 🤔

I never got Pernosco working for me due to authentication issues that I was not able to resolve. :(

Assignee: hskupin → nobody
Status: ASSIGNED → NEW

Update: Pernosco is broken and the right people are on it

We don't do anything special when starting Firefox. Lets continue the discussion on Matrix to figure out a bit more. I actually wonder if other test suites experience a similar issue or not.

I've asked on the pernosco channel but it makes it hard to track. Lets see if Kyle can help us further. TIA

Flags: needinfo?(khuey)

There is an r+ patch which didn't land and no activity in this bug for 2 weeks.
:whimboo, could you have a look please?
If you still have some work to do, you can add an action "Plan Changes" in Phabricator.
For more information, please visit BugBot documentation.

Flags: needinfo?(jdescottes)
Flags: needinfo?(hskupin)

I think this is still pending investigations.

Flags: needinfo?(jdescottes)

I was on vacation earlier this month and this is low priority because of how niche it is but I expect to investigate this this weekend.

Kyle, in case you need newer builds I'm happy to push a new try build. Just let me know. Thanks.

Flags: needinfo?(hskupin)

Yeah, that would be great, thanks.

Here a new try job with a failing wdspec test:

https://treeherder.mozilla.org/jobs?repo=try&revision=8d0e5105e709f966b17d5ab639ff51900eb9022c&selectedTaskRun=aRepHMeGTdKFWUr6NQ1TvA.0

This are the related lines in the log where the delay can be seen:
https://treeherder.mozilla.org/logviewer?job_id=522035289&repo=try&lineNumber=46630-46643

[task 2025-08-12T09:03:47.105+00:00] 09:03:47     INFO - PID 18994 | 1754989427102	RemoteAgent	TRACE	Received command script.callFunctionDeclaration for destination WINDOW_GLOBAL
[task 2025-08-12T09:03:47.186+00:00] 09:03:47     INFO - PID 18994 | [Parent 19024: Main Thread]: D/Geolocation Checking GeoclueLocationProvider
[task 2025-08-12T09:03:47.654+00:00] 09:03:47     INFO - PID 18994 | [Child 19220: Main Thread]: D/Geolocation nsGeolocationRequest::Notify
[task 2025-08-12T09:03:47.656+00:00] 09:03:47     INFO - PID 18994 | [Child 19220: Main Thread]: D/Geolocation nsGeolocationRequest::NotifyErrorAndShutdown with error code: 3
[task 2025-08-12T09:03:47.657+00:00] 09:03:47     INFO - PID 18994 | [Child 19220: Main Thread]: D/Geolocation nsGeolocationRequest::NotifyError with error code: 3
[task 2025-08-12T09:03:47.658+00:00] 09:03:47     INFO - PID 18994 | [Child 19220: Main Thread]: D/Geolocation GeolocationPositionError::NotifyCallback
[task 2025-08-12T09:03:47.737+00:00] 09:03:47     INFO - PID 18994 | [Parent 19024: Main Thread]: D/Geolocation Selected GeoclueLocationProvider
[task 2025-08-12T09:03:47.742+00:00] 09:03:47     INFO - PID 18994 | *** WIFI GEO: startup called.
[task 2025-08-12T09:03:47.744+00:00] 09:03:47     INFO - PID 18994 | *** WIFI GEO: watch called
[task 2025-08-12T09:03:47.746+00:00] 09:03:47     INFO - PID 18994 | *** WIFI GEO: onStatus called.wifi-timeout
[task 2025-08-12T09:03:47.750+00:00] 09:03:47     INFO - PID 18994 | *** WIFI GEO: Use request cache:false reason:No cached data
[task 2025-08-12T09:03:47.753+00:00] 09:03:47     INFO - PID 18994 | *** WIFI GEO: Sending request
[task 2025-08-12T09:03:47.753+00:00] 09:03:47     INFO - PID 18994 | *** WIFI GEO: onStatus called.xhr-start
[task 2025-08-12T09:03:47.754+00:00] 09:03:47     INFO - PID 18994 | 1754989427748	RemoteAgent	DEBUG	WebDriverBiDiConnection 3d8b3857-3114-4950-8202-d934ff5edf4c <- {"type":"success","id":8,"result":{"realm":"f2aa0f81-81b0-4e12-8c91-ff37942e406b","type":"success","result":{"type":"object","value":[["code",{"type":"number","value":3}]]}}}

Interesting is that we do not have such delays for the Mochitests which test the Geolocation feature.

Hi Kyle, did the above help or won't Pernosco work with webdriver tests? I'm not sure. Thanks.

I've taken a look at recent jobs and as it looks like a longer delay seem to only happen for the very first time the Geolocation API is used after starting Firefox. For following calls the timing looks fine.

https://treeherder.mozilla.org/logviewer?job_id=524193153&repo=mozilla-central&task=E0FE6ZRoRo6D9ahoJrubOA.0&lineNumber=46715-46716

The failures that we get classified once in a while for bug 1966071 are all for esr140. So maybe something was fixed post the branching?

Kagami, are you aware of some landing

Flags: needinfo?(krosylight)

No, I think it's still the geoclue responding late, given the 500ms delay happens synchronously: https://searchfox.org/firefox-main/rev/856a307913c2b73765b4e88d32cf15ed05549cae/dom/geolocation/Geolocation.cpp#746-758

We are calling dbus synchronously which is not great: https://searchfox.org/firefox-main/rev/856a307913c2b73765b4e88d32cf15ed05549cae/dom/system/linux/GeoclueLocationProvider.cpp#1004-1006

But other than that I think my opinion in comment #3 still holds: this is WPT not mochitest, we can't apply such strict limitation to all implementations.

Flags: needinfo?(krosylight)

:whimboo this is still on my todo list I just keep getting sidetracked by other things :/

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: