Add direct-to-parquet for new "update" ping

Status

enhancement
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: Dexter, Assigned: Dexter)

Tracking

Details

(Whiteboard: [measurement:client])

Attachments

(1 attachment)

Once bug 1120372 lands, we need to make the "update" ping data available for analysis in re:dash and other data tools.
Blocks: 1351397
Whiteboard: [measurement:client]
Before unblocking this item, we need to understand the minimum set of fields from the environment section of the ping that we need to save in parquet. Please note that although it's possible to change the schema and backfill the data in the dataset (see bug 1360256 comment 20), we won't be able to perform schema evolution on this until bug 1352521 is fixed.

Here's the minimum set I would go for:

build.applicationName
build.architecture
build.version
build.buildId
build.vendor
build.hotfixVersion

settings.isDefaultBrowser
settings.defaultSearchEngine
settings.defaultSearchEngineData.*
settings.telemetryEnabled
settings.locale
settings.attribution.*
settings.update.*

profile.creationDate

partner.*

os.name
os.version
os.locale

@Ben & @Benjamin, can you think of any other valuable fields in the environment section [1] that should make it into the parquet table (in addition to the other ping data outside of the environment)?

[1] - https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html
Flags: needinfo?(bhearsum)
Flags: needinfo?(benjamin)
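For illustration, here is a minimal sketch of how the proposed field list could be applied to an environment object. The helper name `select_fields` and the convention that a trailing `.*` keeps the whole subtree are assumptions made for this example; they are not part of the pipeline or the schemas repo.

```python
from typing import Any, Dict, List

# Dotted paths copied from the proposal above.
# Assumption: a trailing ".*" means "keep the entire subtree".
MINIMUM_FIELDS = [
    "build.applicationName", "build.architecture", "build.version",
    "build.buildId", "build.vendor", "build.hotfixVersion",
    "settings.isDefaultBrowser", "settings.defaultSearchEngine",
    "settings.defaultSearchEngineData.*", "settings.telemetryEnabled",
    "settings.locale", "settings.attribution.*", "settings.update.*",
    "profile.creationDate",
    "partner.*",
    "os.name", "os.version", "os.locale",
]

_MISSING = object()  # distinguishes "absent" from an explicit null


def select_fields(environment: Dict[str, Any], paths: List[str]) -> Dict[str, Any]:
    """Return a copy of `environment` holding only the listed paths (hypothetical helper)."""
    selected: Dict[str, Any] = {}
    for path in paths:
        keys = path.split(".")
        if keys[-1] == "*":
            keys = keys[:-1]
        # Walk down to the requested node, stopping if any key is absent.
        node: Any = environment
        for key in keys:
            if not isinstance(node, dict) or key not in node:
                node = _MISSING
                break
            node = node[key]
        if node is _MISSING:
            continue
        # Recreate the parent objects in the output, then store the value.
        target = selected
        for key in keys[:-1]:
            target = target.setdefault(key, {})
        target[keys[-1]] = node
    return selected
```

Fields that are present but null (for example a null `profile.creationDate`) are kept as null rather than dropped, so the caller can still see that the client sent them.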
Product, channel, and the version of the update that's just been applied should be all I need. It looks like the channel is in settings.update.channel. If build.* is information about the new build (not the currently running one), I think I'm good.
Flags: needinfo?(bhearsum)
build.* is about the existing build, not the new build. The data about the new build is in the payload section described at

https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/update-ping.html

  payload: {
    reason: <string>, // "ready"
    targetChannel: <string>, // "nightly"
    targetVersion: <string>, // "56.0a1"
    targetBuildId: <string>, // "20080811053724"
  }

I have no particular opinion about what needs to be in this dataset, given that I won't be a primary user.
Flags: needinfo?(benjamin)
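To make the distinction concrete, the following sketch pulls the product, the current channel, and the target build details out of a single ping. It assumes the ping is a dict with top-level `environment` and `payload` objects laid out as in the docs linked above; the function name `summarize_update_ping` and the output key names are made up for this example.

```python
from typing import Any, Dict


def summarize_update_ping(ping: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields discussed above from one "update" ping (hypothetical helper)."""
    env = ping.get("environment", {})
    build = env.get("build", {})
    update_settings = env.get("settings", {}).get("update", {})
    payload = ping.get("payload", {})
    return {
        # environment.build.* describes the build that is currently running.
        "product": build.get("applicationName"),
        "current_version": build.get("version"),
        "current_channel": update_settings.get("channel"),
        # payload.target* describes the build the update will move to.
        "target_channel": payload.get("targetChannel"),
        "target_version": payload.get("targetVersion"),
        "target_build_id": payload.get("targetBuildId"),
        "reason": payload.get("reason"),
    }
```

Missing sections simply yield `None` values, which keeps the sketch usable on partial pings.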
Assignee: nobody → alessio.placitelli
Status: NEW → ASSIGNED
With both the parquet and the update ping schema landed [1], what else do I need to enable direct to parquet here? Any other PR I should file against some repo?

[1] - https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/dev/templates/telemetry/update/update.4.parquetmr.txt
Flags: needinfo?(whd)
No other PRs are currently needed. I'm going to add a monitor with a 1% ingestion error threshold that emails minimally myself and :Dexter. At some point including a monitor config will become a requirement when adding a new ping type to the schemas repo.

This will be packaged and deployed some time next week (Monday is a US holiday). As there are substantial unrelated changes in this sprint it may take longer than usual before this is deployed.
Flags: needinfo?(whd)
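As a rough illustration of what a 1% ingestion error threshold means, the check below compares the share of failed records against the threshold. This is only a sketch: the function name, the inputs, and the choice to alert strictly above 1% are assumptions, and it does not reflect the actual monitor configuration.

```python
def exceeds_error_threshold(error_count: int, total_count: int,
                            threshold: float = 0.01) -> bool:
    """Return True when the ingestion error rate is above `threshold` (hypothetical check)."""
    if total_count == 0:
        # No traffic means there is no error rate to alert on.
        return False
    return error_count / total_count > threshold
```

With these assumptions, 20 errors out of 1000 pings would trigger an alert, while 5 out of 1000 would not.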
(In reply to Wesley Dawson [:whd] from comment #6)
> No other PRs are currently needed. I'm going to add a monitor with a 1%
> ingestion error threshold that emails minimally myself and :Dexter. At some
> point including a monitor config will become a requirement when adding a new
> ping type to the schemas repo.
> 
> This will be packaged and deployed some time next week (Monday is a US
> holiday). As there are substantial unrelated changes in this sprint it may
> take longer than usual before this is deployed.

Awesome, thanks!
This data should now be available in the presto data source as "telemetry_update_parquet". The only schema errors I'm seeing in production are that profile.creationDate is sometimes null, not required by the json schema, and required in parquet.
(In reply to Wesley Dawson [:whd] from comment #8)
> This data should now be available in the presto data source as
> "telemetry_update_parquet". The only schema errors I'm seeing in production
> are that profile.creationDate is sometimes null, not required by the json
> schema, and required in parquet.

Thanks Wesley! Do I need to tweak the parquet schema to make that field optional and make the error go away?
Blocks: 1397132
(In reply to Alessio Placitelli [:Dexter] from comment #9)
> 
> Thanks Wesley! Do I need to tweak the parquet schema to make that field
> optional and make the error go away?

If you want the data where creationDate is null to be available in parquet, then yes, you should make it optional. Otherwise you'll only be able to access it via ATMO and the dataset APIs.
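The effect of marking a field required versus optional can be sketched as follows. The schema dicts and flattened field names here are hypothetical stand-ins, not the contents of the real parquet schema file; the sketch only shows why a null creationDate is rejected under "required" and accepted under "optional".

```python
from typing import Any, Dict, List

REQUIRED = "required"
OPTIONAL = "optional"


def missing_required(record: Dict[str, Any], schema: Dict[str, str]) -> List[str]:
    """List the required fields that are null or absent in `record` (hypothetical check)."""
    return [
        name
        for name, repetition in schema.items()
        if repetition == REQUIRED and record.get(name) is None
    ]


# Hypothetical, simplified schemas: before and after relaxing the field.
strict_schema = {"creation_date": REQUIRED, "target_version": REQUIRED}
relaxed_schema = {"creation_date": OPTIONAL, "target_version": REQUIRED}

record = {"creation_date": None, "target_version": "56.0a1"}
```

Under `strict_schema` this record reports `creation_date` as a violation and would be excluded from the parquet output; under `relaxed_schema` it passes with a null value.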
(In reply to Wesley Dawson [:whd] from comment #10)
> (In reply to Alessio Placitelli [:Dexter] from comment #9)
> > 
> > Thanks Wesley! Do I need to tweak the parquet schema to make that field
> > optional and make the error go away?
> 
> If you want the data where creationDate is null to be available in parquet,
> then yes, you should make it optional. Otherwise you'll only be able to
> access it via ATMO and the dataset APIs.

Thanks. Created this PR to address the problem: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/82
Closing this as the parquet dataset was created.
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Depends on: 1400921
I've done an out-of-band deploy for https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/82 in case that improves the situation for bug #1400921.