FxA Metrics ETL: Activity and flow metrics
Categories
(Data Platform and Tools :: General, task, P1)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: klukas)
References
Details
Attachments
(2 files)
We need to create queries in bigquery-etl that read from the imported FxA stackdriver log tables and write out derived tables that match existing ETL in FxA's Redshift cluster.
See specification in https://docs.google.com/document/d/1UNulczGgBSepy0V1YISAfEuczbtvXaR59HB7xwTCFCk/edit#heading=h.u1lhte908iqc
Assignee | ||
Comment 1•5 years ago
|
||
Per discussion with :jhirsch yesterday, I'm moving this from P2 to P3. There looks to be a significant amount of work here to understand the full existing pipeline, so we want to do a round of checking in with consumers of this data to understand ROI for this work. It may be that the raw Stackdriver logs as presented in BQ are now sufficient for some of the deep-dive bespoke analysis that previously required the processed activity and flow metrics in Redshift.
Assignee | ||
Comment 2•5 years ago
|
||
See example query in https://bugzilla.mozilla.org/show_bug.cgi?id=1573359#c4
SELECT
jsonPayload.fields.uid,
jsonPayload.fields.service,
MIN(timestamp) AS ts
FROM
`moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20190820`
WHERE
jsonPayload.type = 'activityEvent'
AND jsonPayload.fields.event = 'account.signed'
GROUP BY 1, 2
ORDER BY 1, 2;
If we need to go forward with this work, it's looking likely that the relevant source data is all (or mostly?) in the auth
table as captured above. Next steps would be to look at a recent day in this table and list out the available values of jsonPayload.type and jsonPayload.fields.event and cross reference that with values in the activity and flow metrics tables in Redshift (through redash). If we see those generally matching up, then we can look at counts per event type on a target day. If those counts match up between BQ and redash, then we may have a clear path to recreating these tables in BQ.
It's still not clear to me how to handle the smaller, more aggregated tables that exist in Redshift, since the existing logic seems spread out through the FxA codebase.
Assignee | ||
Comment 3•5 years ago
|
||
Met with :loines and Kimmy today about this. Kimmy is going to work on developing queries for the relevant events we need and that work/conversation will happen in https://jira.mozilla.com/projects/DS/issues/DS-643
Assignee | ||
Comment 4•5 years ago
|
||
?ni :jbuck We've identified that we want to get the 10% sample data from Redshift copied into BQ, since that's the only longer-term event data available. :loines indicated that Redshift fetches that data from CSVs somewhere in S3. Do you know the location and would it be possible for you to provision me access to those files or copy them into a GCS location so that I can load these to BQ or at least assess the size of the data?
Comment 5•4 years ago
|
||
I am currently unloading the 10% sample data as parquet data files to s3://fxa-admin-activity-metrics-export/ in us-west-2 prod. Do you have access to AWS prod?
Assignee | ||
Comment 6•4 years ago
|
||
Do you have access to AWS prod?
I have AWS creds, but I think they're for dev rather than prod. I get ListObjectsV2 operation: Access Denied
when trying to do aws s3 ls s3://fxa-admin-activity-metrics-export/
.
Assignee | ||
Comment 7•4 years ago
|
||
:jbuck I'm open to suggestions as to whether it's easier for you to figure out permissions in AWS or if you want to copy the files into GCS (perhaps under gs://moz-fx-data-prod-analysis/klukas/fxa-admin-activity-metrics-export/
).
Comment 8•4 years ago
|
||
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #7)
:jbuck I'm open to suggestions as to whether it's easier for you to figure out permissions in AWS or if you want to copy the files into GCS (perhaps under
gs://moz-fx-data-prod-analysis/klukas/fxa-admin-activity-metrics-export/
).
I think I figured out permissions for your dev account - give it a try now?
Assignee | ||
Comment 9•4 years ago
|
||
I think I figured out permissions for your dev account - give it a try now?
Looks great. I was able to list objects in the bucket and also tested that I could view content of one of the objects. Thanks so much.
Assignee | ||
Comment 10•4 years ago
|
||
The following 10% sample tables now exist in BQ:
'moz-fx-data-shared-prod:firefox_accounts_derived.flow_events_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.activity_events_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.daily_activity_per_device_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.daily_multi_device_users_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.email_events_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.flow_metadata_sampled_10'
:loines - Can you verify these look reasonable? We can then clean up the exported data in S3 and GCS.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 12•4 years ago
|
||
I have deleted the staged data in GCS.
jbuck - Feel free to tear down the bucket in S3. Thanks so much for your help here.
Comment 13•4 years ago
|
||
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #12)
I have deleted the staged data in GCS.
jbuck - Feel free to tear down the bucket in S3. Thanks so much for your help here.
Done
Assignee | ||
Comment 14•4 years ago
|
||
Status update:
Further work here is currently waiting on https://jira.mozilla.com/projects/DS/issues/DS-643
Once those draft queries are created, I will productionize and we can call this bug fixed.
Assignee | ||
Updated•4 years ago
|
Comment 15•4 years ago
|
||
Comment 16•4 years ago
|
||
Assignee | ||
Comment 17•4 years ago
|
||
These tables are created now and the PRs are in review.
Assignee | ||
Comment 18•4 years ago
|
||
The tables are now being populated nightly.
Description
•