Closed Bug 1448467 Opened 7 years ago Closed 7 years ago

Data review for Add state/province to parsed pings output

Categories

(Data Platform and Tools :: General, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kparlante, Unassigned)

Attachments

(1 file)

Request for data collection review

> What questions will you answer with this data?
From bug 1447648: Marketing would like to look at the performance of a local campaign, but needs to be able to distinguish between data from various cities in the US with the same name (specifically, Portland OR and Madison WI).

> Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
Mozilla needs to answer these questions to evaluate the effectiveness of marketing spend. More generally, any question that requires the "City" field might run into the same problem.

> What alternative methods did you consider to answer these questions? Why were they not sufficient?
We tried to use existing Country and City information, but on their own they are not sufficient because there are multiple cities in the US with the same name, and we can't tell them apart without further info to disambiguate.

> Can current instrumentation answer these questions?
No, we do not currently store any geolocation information that would help distinguish cities with the same name inside a given country.

> List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
We propose to add a unique "city" identifier to disambiguate cities of the same name. This is a category 1 measure, as far as I can tell.

> How long will this data be collected?
I want to permanently monitor this data.

> What populations will you measure?
Firefox users with Telemetry enabled

> Which release channels?
All

> Which countries?
All

> Which locales?
All

> Any other filters? Please describe in detail below.
No

> If this data collection is default on, what is the opt-out mechanism for users?
Opt-out is done via the usual Telemetry mechanism in Firefox preferences

> Please provide a general description of how you will analyze this data.
This data will be used to filter Telemetry data so that it can be limited to a specific geographic city.

> Where do you intend to share the results of your analysis?
Within Mozilla
Flags: needinfo?(francois)
(In reply to Katie Parlante from comment #0)
> > List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
> We propose to add a unique "city" identifier to disambiguate cities of the
> same name. This is a category 1 measure, as far as I can tell.

What exactly are we adding here? A unique numerical ID for each city in the US so that we can use that instead of "Portland" or "Madison"?

Is there any bucketing of the data so that we don't keep track of very small towns? For very small towns (e.g. < 1000 people), the city field would become very similar to a unique identifier.
Flags: needinfo?(francois)
We are adding a geoname_id for city. It's a unique numerical ID for every city in the geoname database.

The geonames exports include subsets for only cities with more than 1000, 5000, or 15000 people (city1000.zip, city5000.zip, and city15000.zip respectively). It may be possible to read those exports and drop the geoname_id if it isn't in the city1000 set (we would probably also need to drop the geoCity field). city1000.zip is 7.2MB, so if we just extracted all of the geoname_ids, the result would be small enough that the check might be doable in Hindsight at the same time that we look up geoname_id.
Alternatively, we could add the most specific subdivision ISO code (and not geoname_id), which in the US maps to State.
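
The filtering idea in the comment above (keep city-level fields only when the geoname_id appears in a population-limited export) could be sketched roughly as follows. This is an illustration in Python, not the Lua code from the pipeline. It assumes the export is a tab-separated GeoNames-style dump with the ID in column 0 and the population in column 14; `load_allowed_ids` and `scrub_city` are hypothetical helper names, and only the `geoname_id` and `geoCity` field names come from this bug.

```python
# Assumed layout of a GeoNames-style dump: one tab-separated row per city,
# geonameid in column 0 and population in column 14.
GEONAME_ID_COL = 0
POPULATION_COL = 14

# City-level fields named in this bug that should be dropped for small cities.
CITY_FIELDS = ("geoname_id", "geoCity")


def load_allowed_ids(lines, min_population=1000):
    """Collect the geoname IDs of cities at or above min_population."""
    allowed = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= POPULATION_COL:
            continue  # skip rows that are too short to parse
        try:
            geoname_id = int(cols[GEONAME_ID_COL])
            population = int(cols[POPULATION_COL])
        except ValueError:
            continue  # skip rows with non-numeric ID or population
        if population >= min_population:
            allowed.add(geoname_id)
    return allowed


def scrub_city(geo, allowed_ids):
    """Return a copy of geo without city-level fields unless the city is allowed."""
    if geo.get("geoname_id") in allowed_ids:
        return dict(geo)
    return {k: v for k, v in geo.items() if k not in CITY_FIELDS}
```

Holding only the IDs in a set keeps the lookup cheap enough to run next to the geoname_id lookup itself, which is the point made above about the small size of the export.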
(In reply to Daniel Thorn [:relud] from comment #2)
> It may be possible to read
> those exports and drop the geoname_id if it isn't in the city1000 set (and
> would probably also need to drop geoCity field too). city1000.zip is 7.2MB,
> so if we just extracted all of the geoname_id's, it would be small enough
> that this might be doable in hindsight at the same time that we look up
> geoname_id.

That sounds like a good way to ensure that we stay away from Category 3, both for the geoname_id and the geoCity fields.

Marshall, in your recent discussions on similar topics, did you come up with a standard population size (1000? 5000?) below which we roll users up into a larger grouping?
Flags: needinfo?(merwin)
We may instead add subdivision1 and subdivision2, which are described at https://dev.maxmind.com/geoip/geoip2/whats-new-in-geoip2/

> We also provide multiple levels of country subdivision data. The subdivisions these provide correspond to the subdivisions which have been given ISO 3166-2 codes. For example, in the United States, we only provide a single level of subdivision data, corresponding to US states. But for the United Kingdom, we may provide two levels. The first level is the country (England, Scotland, Wales) or province (Northern Ireland). The second level may be a county, a London borough, a unitary authority, council area, etc.

We could still drop some geo data if a city weren't in a particular cityN000.zip, but that work will take longer, and may risk the 4/1 deadline for needing this data.
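
To illustrate why a subdivision code is enough to separate same-named cities, here is a small Python sketch that groups pings by country, first-level subdivision, and city. The record layout (`country`, `subdivision1`, `city`) and the sample values are assumptions made for the example; only the subdivision1 name comes from this bug, and the parsed output may use different field names.

```python
from collections import Counter


def city_key(geo):
    """Build a grouping key that tells same-named cities apart by subdivision."""
    return (geo.get("country"), geo.get("subdivision1"), geo.get("city"))


# Hypothetical parsed geo records for three pings.
pings = [
    {"country": "US", "subdivision1": "OR", "city": "Portland"},
    {"country": "US", "subdivision1": "ME", "city": "Portland"},
    {"country": "US", "subdivision1": "OR", "city": "Portland"},
]

# Grouping on the city name alone merges both Portlands into one bucket.
by_city_only = Counter(p["city"] for p in pings)

# Adding the subdivision keeps them separate.
by_city_and_subdivision = Counter(city_key(p) for p in pings)
```

With the city name alone, all three pings land in a single "Portland" bucket; with the subdivision in the key, the Oregon and Maine pings are counted separately.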
> that work will take longer, and may risk the 4/1 deadline for needing this data.

Never mind, I've implemented it: https://github.com/mozilla-services/lua_sandbox_extensions/pull/286/commits/82baac0ca8ac00cab534dc75590f13f9681e84f1
No, we did not come up with a specific population size. The question is what we really need the data for; that should inform the population size decision. We probably don't need data for cities smaller than 15000, so I'd strongly suggest going with that.
Flags: needinfo?(merwin)
The code to add this information is here: https://github.com/mozilla-services/lua_sandbox_extensions/pull/286 Daniel has added a minimum city size cutoff, so we can omit city info for cities with populations under 15000. Francois, can you please data-review this code?
Flags: needinfo?(francois)
Thanks Daniel and Mike. Can you fill out a new data review form and attach it to the bug as a text file? If you r? me on it, I will be happy to review it before I leave today.

Essentially, it will consist of what's in comment 0 but with more details in question 5:
- what the new data looks like (first two sentences of comment 2)
- the fact that it's category 1 because we are filtering out small cities
Flags: needinfo?(mreid)
Flags: needinfo?(francois)
Flags: needinfo?(dthorn)
Attached file data_review.txt
Flags: needinfo?(mreid)
Flags: needinfo?(dthorn)
Attachment #8963764 - Flags: review?(francois)
Comment on attachment 8963764 [details] data_review.txt

1) Is there or will there be **documentation** that describes the schema for the ultimate data set available publicly, complete and accurate?
It will be added to https://docs.telemetry.mozilla.org/datasets/batch_view/main_summary/reference.html#schema and https://docs.telemetry.mozilla.org/datasets/mozetl/clients_daily/reference.html#schema.

2) Is there a control mechanism that allows the user to turn the data collection on and off?
Yes, telemetry setting.

3) If the request is for permanent data collection, is there someone who will monitor the data over time?
Permanent: kparlante.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1.

5) Is the data collection request for default-on or default-off?
Default on, all channels.

6) Does the instrumentation include the addition of **any *new* identifiers** (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
No, since a minimum city size is enforced.

7) Is the data collection covered by the existing Firefox privacy notice?
Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?
No, permanent.
Attachment #8963764 - Flags: review?(francois) → review+
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Pipeline Ingestion → General