Closed Bug 1627553 Opened 4 years ago Closed 4 years ago

Investigate Geocoding

Categories

(Firefox :: Search, task, P3)

task
Points:
5

Tracking

()

RESOLVED FIXED
Iteration:
77.2 - Apr 20 - May 3

People

(Reporter: daleharvey, Assigned: daleharvey)

References

(Blocks 1 open bug)

Details

Attachments

(1 obsolete file)

No description provided.
Priority: -- → P3

Hey Mike

As part of some upcoming region detection improvements, we are looking to do some geocoding within Firefox. The source file is ~1.1MB (https://gist.github.com/jwhitlock/7a38b0d745729aa3519ae6ae176c089f#file-regions_buffer-geojson), and it could potentially be a memory- and CPU-expensive operation.

While investigating the implementation, I wanted to check what I should be looking out for to make sure this doesn't cause any performance issues. It's an async operation that can be done when idle or in the background; it's not going to be within any UI hot paths.

Cheers

Flags: needinfo?(mconley)
Blocks: 1627560
Assignee: nobody → dharvey
Summary: Implement Geocoding → Investigate Geocoding

Hey Dale!

Thanks for tagging me in. I think the only thing I'd worry about here is jank caused by attempting to move this quantity of data between processes or threads. If you, for example, fetch it asynchronously, then it's read off of the main thread, but if you then attempt to access it on the main thread, it needs to be copied / structure-cloned into the main thread's memory, and that can cause jank.

I wonder if this would make sense running inside of a DOM Worker? The read of the file can occur off of the main thread then, and any processing with that data could also be done off of the main thread. So then you send a message to the Worker asking it to compute the region given a point, and the Worker's job would be to return the matching regions.

That's what I'd suggest as a model, anyhow.

I'd also suggest that this work occur outside of the startup path, certainly, where the disk (even off-main-thread) accesses should be considered expensive. When would you need to retrieve this information?

Flags: needinfo?(mconley)

It's something we are going to start updating once every few weeks or so, and we have no reason to have it be part of the startup path. A DOM Worker sounds like a good idea; I will keep you in the loop as we develop this. Cheers loads.

Next up, the actual geocoding

Hi John, I have a very basic proof of concept of local geocoding up at https://phabricator.services.mozilla.com/D70112. It uses a basic point-in-polygon calculation, which completes in ~30ms (faster than I was expecting). I have been looking over https://github.com/mozilla/ichnaea/blob/874e8284f0dfa1868e79aae64e14707eed660efe/ichnaea/geocode.py#L114 and am looking to port over the code that picks between multiple buffered regions.

I was wondering if you could point out some of the gotchas, and why the buffered regions are needed rather than a straight "in polygon" calculation. I am also trying to form a plan for how we are going to test that the local geocoding is accurate.

Cheers
Dale
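
For reference, a "basic point in polygon calculation" can be sketched with the ray-casting rule. This is only an illustration: the thread does not say which algorithm the proof of concept uses, and the names `point_in_ring` and `regions_containing`, as well as the one-ring-per-region data layout, are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (lon, lat); hypothetical layout for this sketch
Ring = List[Point]           # a simple closed polygon, last vertex joins the first


def point_in_ring(point: Point, ring: Ring) -> bool:
    """Ray casting: count the ring edges crossed by a ray going in +x from the point."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def regions_containing(point: Point, regions: Dict[str, Ring]) -> List[str]:
    """Return the codes of every region whose ring contains the point."""
    return [code for code, ring in regions.items() if point_in_ring(point, ring)]
```

An odd number of crossings means the point is inside. Real region data would be multi-polygons with holes, which this sketch does not handle.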

Flags: needinfo?(jwhitlock)

I'm not sure why the buffered regions were used. The closest tracking bug for the work is https://github.com/mozilla/ichnaea/issues/355. Here's what I can determine from the code history:

We started caring about region in November 2014, with the Yahoo! search deal and a different search provider based on the user's country. Several things were tried in December, including OS sources like the city used in the timezone picker. By March 2015, a GeoIP-based country service was provided by MLS and used in production Firefox. 2015 was also when we learned that "region" is a politically safer term than "country".

In 2015, Hanno was working to keep up with the volume of API calls and stumbler submissions, and experimenting with different ways to determine the region from radio code. For example, cell IDs imply a region of installation. In September 2015, he worked to bring in shapefiles and GeoJSON. The first shipping version used the unbuffered shapefiles.

These were also used to validate submitted data. If the GPS coordinates are in the middle of the ocean, then the data was rejected. However, due to GPS and shape file precision, some coordinates were just outside of the precise regions. It appears the first reason for the buffered regions was to avoid rejecting submitted data just over region borders (January 2016). It might also be a little faster, since the buffered regions absorb some detailed borders, for instance around archipelagos.

The logic didn't change much after this point. This was around the shift from FirefoxOS to Connected Devices, and some of the new focus was deriving useful data from location queries.

See geocode.py for details, but here's the quick version of the logic to determine a region from a position:

  • Look up the regions from the buffered region data.
    • If no regions found, return None
    • If exactly one region found, return it
    • Otherwise, we have two or more, need to figure out which one. Repeat the lookup in the precise region data.
      • If exactly one region is found, return it
      • If no regions are found, return the closest region from the precise region data
      • If multiple regions are found, return the precise region it is most inside of (furthest from the border)
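
The steps above can be sketched as follows. This is a minimal illustration and not the ichnaea code: the function names, the one-ring-per-region dictionaries, and the planar distance on raw lon/lat values are all assumptions made for the example. It also assumes every buffered region has a precise counterpart under the same code, and it only re-checks the precise shapes of the buffered candidates.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (lon, lat); hypothetical layout
Ring = List[Point]           # one simple polygon per region, for illustration only


def contains(pt: Point, ring: Ring) -> bool:
    """Ray-casting point-in-polygon test."""
    x, y = pt
    inside = False
    for i in range(len(ring)):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def border_distance(pt: Point, ring: Ring) -> float:
    """Smallest planar distance from pt to any edge of the ring."""
    px, py = pt
    best = math.inf
    for i in range(len(ring)):
        (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % len(ring)]
        dx, dy = x2 - x1, y2 - y1
        length_sq = dx * dx + dy * dy
        # Project pt onto the edge and clamp the projection to the segment.
        t = 0.0 if length_sq == 0 else ((px - x1) * dx + (py - y1) * dy) / length_sq
        t = max(0.0, min(1.0, t))
        best = min(best, math.hypot(px - (x1 + t * dx), py - (y1 + t * dy)))
    return best


def pick_region(pt: Point, buffered: Dict[str, Ring], precise: Dict[str, Ring]) -> Optional[str]:
    candidates = [code for code, ring in buffered.items() if contains(pt, ring)]
    if not candidates:
        return None                # no buffered region matched
    if len(candidates) == 1:
        return candidates[0]       # unambiguous
    exact = [code for code in candidates if contains(pt, precise[code])]
    if len(exact) == 1:
        return exact[0]
    if not exact:
        # Inside several buffers but no precise shape: take the closest precise border.
        return min(candidates, key=lambda c: border_distance(pt, precise[c]))
    # Inside several precise shapes: take the one the point is deepest inside.
    return max(exact, key=lambda c: border_distance(pt, precise[c]))
```

Treating degrees as planar units distorts distances away from the equator, so a real implementation would likely want a proper geodesic or projected distance.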

You may be able to just use the precise region data, with the risk of users near borders or the ocean falling outside of any precise region. You could then do math to find the closest region. The buffered region helps precalculate the "closest region" with expanded borders.

Flags: needinfo?(jwhitlock)

Also, the ichnaea shapefiles (version 2.0.0) have not been updated since January 2016, since Hanno Schlichting, the primary MLS developer, left in September 2017. They are derived from the Natural Earth project, which was last updated in May 2018 (version 4.0.0, see release notes).

I would expect long times between shape updates, and an insignificant impact from using old releases of shapes.

Edit 1: Added version numbers

Points: --- → 5
Iteration: --- → 77.2 - Apr 20 - May 3

OK, so I ran both the ichnaea geocoding and the local geocoding (https://phabricator.services.mozilla.com/) against a dataset of 17k lat/lon pairs downloaded from https://location.services.mozilla.com/downloads. The results matched in 16533 cases and differed in 597, so the local geocoding is around 96% right. Also worth noting: I could run 17k local geocodings in 2 seconds at most (which included a whole bunch of random IO stuff). This is without taking into account the buffered regions etc.

I think for our implementation, to avoid having multiple region maps, we will use the buffered region map and, in the case of multiple results, go straight to the "furthest from the border" check. That's mostly a guess, and I will investigate the differences more while doing that. This bug was mostly so I could validate that we can do local geocoding and that it wouldn't be too slow, so I am happy with that for now. I will move to working on the actual implementation in https://bugzilla.mozilla.org/show_bug.cgi?id=1627560
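
As a quick check of the arithmetic, the comparison can be tallied like this. It is a small sketch only: `agreement` is a hypothetical helper, not one of the linked test scripts, and it assumes both geocoders return one region code (or `None`) per input point, in the same order.

```python
from typing import Optional, Sequence, Tuple


def agreement(reference: Sequence[Optional[str]],
              local: Sequence[Optional[str]]) -> Tuple[int, int, float]:
    """Return (matches, mismatches, match rate) for two aligned result lists."""
    if len(reference) != len(local):
        raise ValueError("result lists must be the same length")
    matches = sum(1 for ref, loc in zip(reference, local) if ref == loc)
    total = len(reference)
    return matches, total - matches, (matches / total if total else 0.0)


# The counts reported in this comment: 16533 matches and 597 differences.
reported_rate = 16533 / (16533 + 597)
print(f"{reported_rate:.1%}")
```

With the counts above the rate comes to roughly 96.5%, over 17130 pairs in total.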

Also here are the test scripts so I can work on them again later - https://gist.github.com/daleharvey/01d1780f7343251f6d9c5bd69212e8a7

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Attachment #9138985 - Attachment is obsolete: true