FxA Metrics ETL: Bulk export of send tab metrics to Amplitude
Categories: Data Platform and Tools :: General, task, P1
Tracking: Not tracked
People: Reporter: klukas; Assignee: Unassigned
Attachments: 3 files
See description in https://docs.google.com/document/d/1UNulczGgBSepy0V1YISAfEuczbtvXaR59HB7xwTCFCk/edit#heading=h.yku9rl5thbdk
We need to export "tab sent" and "tab received" events to Amplitude based on sync telemetry.
The existing schema for the Amplitude events is:
Scope   Field        Type
User    fxa_uid      string
Event   device_id    string
User    ua_browser   string
User    ua_version   string
Event   flow_id      string
In BigQuery, we have the telemetry.sync ping view, which includes the fields payload.uid (which should be the hashed fxa_uid), payload.device_id, and user agent info parsed under metadata.user_agent, though we may want to check the format of those fields and whether they match previous expectations. See https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/data/sync-ping.html for more details on the ping structure.
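A spot-check along these lines would let us verify those formats (a sketch; field paths are taken from the view described above):

-- Inspect the identifier and user agent fields for one day of sync
-- pings to confirm they match previous expectations.
SELECT
  payload.uid,
  payload.device_id,
  metadata.user_agent.browser,
  metadata.user_agent.version,
  metadata.user_agent.os
FROM
  `moz-fx-data-shared-prod.telemetry.sync`
WHERE
  DATE(submission_timestamp) = "2020-01-09"
LIMIT
  20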
The flow_id will be extracted from payload.events, where it appears in the event map under the key "flowID".
This may be a fairly straightforward task of creating a query to write out these events 1:1 to a derived table with schema matching the above, and then tying that table into the same nightly bulk export machinery we use for other Amplitude sends.
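As a rough sketch, the derived table could be defined along these lines (the dataset and table names here are placeholders, not the committed names):

-- Hypothetical derived table matching the Amplitude schema above;
-- the dataset and table names are placeholders.
CREATE TABLE IF NOT EXISTS
  `moz-fx-data-shared-prod.telemetry_derived.sendtab_events_v1` (
    submission_timestamp TIMESTAMP,
    fxa_uid STRING,     -- hashed, from payload.uid
    device_id STRING,   -- from payload.device_id
    event_type STRING,  -- 'tab_sent' or 'tab_received'
    ua_browser STRING,
    ua_version STRING,
    flow_id STRING      -- from the "flowID" key in payload.events
  )
PARTITION BY
  DATE(submission_timestamp)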
Send tab metrics are currently broken, so it's not clear at this point how we can validate the output.
Comment 1 • Reporter • 5 years ago
The following query prototype parses out most of what we need:
WITH
  events AS (
  SELECT
    *,
    -- Flatten each entry of payload.events into named columns such as
    -- event_method, event_object, and event_map_values.
    udf.deanonymize_event(event).*
  FROM
    `moz-fx-data-shared-prod.telemetry_stable.sync_v4`
  CROSS JOIN
    UNNEST(payload.events) AS event
  WHERE
    DATE(submission_timestamp) = "2020-01-09" )
SELECT
  payload.uid,
  --payload.device_id,
  --insert_id,
  -- Map the telemetry event object to the Amplitude event type.
  CASE event_object
    WHEN 'processcommand' THEN 'tab_received'
    WHEN 'sendcommand' THEN 'tab_sent'
  END AS event_type,
  metadata.user_agent.browser AS ua_browser,
  metadata.user_agent.version AS ua_version,
  event_object,
  udf.get_key(event_map_values, 'flowID') AS flow_id,
  submission_timestamp,
  -- serverTime is an epoch string; the first 10 characters are whole seconds.
  TIMESTAMP_SECONDS(SAFE_CAST(SUBSTR(udf.get_key(event_map_values, 'serverTime'), 1, 10) AS INT64)) AS server_time
FROM
  events
WHERE
  event_method = 'displayURI'
LIMIT
  1000
To finish this off, we need to include some more metadata fields, and we'll need to put the user_properties and event_properties into JSON-ified form. These changes should be very similar to the Fennec iOS events view.
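For the JSON-ification, something like the following should work when run against the events CTE from the prototype above (a sketch; it assumes the TO_JSON_STRING(STRUCT(...)) pattern, which may differ in detail from what the Fennec iOS view does):

-- Pack user-scoped and event-scoped fields into JSON strings,
-- following the User/Event split in the schema above.
SELECT
  payload.uid AS user_id,
  TO_JSON_STRING(STRUCT(
      metadata.user_agent.browser AS ua_browser,
      metadata.user_agent.version AS ua_version)) AS user_properties,
  TO_JSON_STRING(STRUCT(
      payload.device_id AS device_id,
      udf.get_key(event_map_values, 'flowID') AS flow_id)) AS event_properties
FROM
  events
WHERE
  event_method = 'displayURI'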
Comment 2 • Reporter • 5 years ago
Comment 3 • Reporter • 5 years ago
I'm checking with Rob about the Amplitude event budget.
Comment 4 • 5 years ago

Comment 5 • 5 years ago
Comment 6 • Reporter • 5 years ago
The view that defines the schema here is now committed. It samples at 10% for now in order to save on Amplitude budget.
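For reference, a minimal sketch of what the 10% sampling could look like (the hashing scheme here is an assumption, not necessarily what the committed view does):

SELECT
  payload.uid,
  payload.device_id
FROM
  `moz-fx-data-shared-prod.telemetry_stable.sync_v4`
WHERE
  DATE(submission_timestamp) = "2020-01-09"
  -- Hypothetical 10% sample keyed on the hashed FxA UID, so a given
  -- user is consistently in or out of the sample.
  AND MOD(ABS(FARM_FINGERPRINT(payload.uid)), 100) < 10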
We have an Airflow PR in review to schedule daily dumps of send tab data to AWS S3. At this point, we need to figure out where in Amplitude this data should go. If we have an appropriate dev project, we should try that first so we can verify the schema, etc. before pushing to the prod project.
Comment 7 • Reporter • 5 years ago
Events are now flowing to the dev project. I'm awaiting validation from FxA folks to see if we need to make any tweaks to the format before we go ahead and start populating prod.
Questions to evaluate:
- Are fields being populated as expected?
- Is volume approximately as expected (understanding that we're only sending a 10% sample)?
- Do we need to include messages lacking the payload.uid and payload.deviceID fields?
- Do the hashed fxa UIDs match up with the hashed UIDs coming from other sources? (One possible check is sketched below.)
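A rough overlap check for that last question could look like the following (both table names are placeholders standing in for the send tab output and whichever other source of hashed UIDs we compare against):

-- Hypothetical overlap check between hashed UIDs in the send tab
-- output and another source; both table names are placeholders.
SELECT
  COUNT(DISTINCT s.fxa_uid) AS sendtab_uids,
  COUNT(DISTINCT o.fxa_uid) AS matched_uids
FROM
  `moz-fx-data-shared-prod.telemetry_derived.sendtab_events_v1` AS s
LEFT JOIN
  `moz-fx-data-shared-prod.firefox_accounts_derived.other_hashed_uids` AS o
ON
  s.fxa_uid = o.fxa_uid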
Comment 8 • Reporter • 4 years ago
Validation with DS happened in https://jira.mozilla.com/browse/DS-642, and we were able to resolve the outstanding questions.
These events are now flowing to the prod FxA project in Amplitude.