Closed Bug 1628708 Opened 5 years ago Closed 4 years ago

FxA Metrics ETL: Activity and flow metrics

Categories

(Data Platform and Tools :: General, task, P1)

task
Points:
5

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

References

Details

Attachments

(2 files)

We need to create queries in bigquery-etl that read from the imported FxA stackdriver log tables and write out derived tables that match existing ETL in FxA's Redshift cluster.

See specification in https://docs.google.com/document/d/1UNulczGgBSepy0V1YISAfEuczbtvXaR59HB7xwTCFCk/edit#heading=h.u1lhte908iqc

Per discussion with :jhirsch yesterday, I'm moving this from P2 to P3. There looks to be a significant amount of work here to understand the full existing pipeline, so we want to do a round of checking in with consumers of this data to understand ROI for this work. It may be that the raw Stackdriver logs as presented in BQ are now sufficient for some of the deep-dive bespoke analysis that previously required the processed activity and flow metrics in Redshift.

Points: 3 → 5
Priority: P2 → P3

See example query in https://bugzilla.mozilla.org/show_bug.cgi?id=1573359#c4

SELECT
  jsonPayload.fields.uid,
  jsonPayload.fields.service,
  MIN(timestamp) AS ts
FROM
  `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20190820`
WHERE
  jsonPayload.type = 'activityEvent'
  AND jsonPayload.fields.event = 'account.signed'
GROUP BY 1, 2
ORDER BY 1, 2;

If we need to go forward with this work, it's looking likely that the relevant source data is all (or mostly?) in the auth table as captured above. Next steps would be to look at a recent day in this table and list out the available values of jsonPayload.type and jsonPayload.fields.event and cross reference that with values in the activity and flow metrics tables in Redshift (through redash). If we see those generally matching up, then we can look at counts per event type on a target day. If those counts match up between BQ and redash, then we may have a clear path to recreating these tables in BQ.

It's still not clear to me how to handle the smaller, more aggregated tables that exist in Redshift, since the existing logic seems spread out through the FxA codebase.

Met with :loines and Kimmy today about this. Kimmy is going to work on developing queries for the relevant events we need and that work/conversation will happen in https://jira.mozilla.com/projects/DS/issues/DS-643

?ni :jbuck We've identified that we want to get the 10% sample data from Redshift copied into BQ, since that's the only longer-term event data available. :loines indicated that Redshift fetches that data from CSVs somewhere in S3. Do you know the location and would it be possible for you to provision me access to those files or copy them into a GCS location so that I can load these to BQ or at least assess the size of the data?

Flags: needinfo?(jbuckley)

I am currently unloading the 10% sample data as parquet data files to s3://fxa-admin-activity-metrics-export/ in us-west-2 prod. Do you have access to AWS prod?

Flags: needinfo?(jbuckley) → needinfo?(jklukas)

Do you have access to AWS prod?

I have AWS creds, but I think they're for dev rather than prod. I get ListObjectsV2 operation: Access Denied when trying to do aws s3 ls s3://fxa-admin-activity-metrics-export/.

Flags: needinfo?(jklukas)

:jbuck I'm open to suggestions as to whether it's easier for you to figure out permissions in AWS or if you want to copy the files into GCS (perhaps under gs://moz-fx-data-prod-analysis/klukas/fxa-admin-activity-metrics-export/).

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #7)

:jbuck I'm open to suggestions as to whether it's easier for you to figure out permissions in AWS or if you want to copy the files into GCS (perhaps under gs://moz-fx-data-prod-analysis/klukas/fxa-admin-activity-metrics-export/).

I think I figured out permissions for your dev account - give it a try now?

I think I figured out permissions for your dev account - give it a try now?

Looks great. I was able to list objects in the bucket and also tested that I could view content of one of the objects. Thanks so much.

The following 10% sample tables now exist in BQ:

'moz-fx-data-shared-prod:firefox_accounts_derived.flow_events_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.activity_events_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.daily_activity_per_device_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.daily_multi_device_users_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.email_events_sampled_10'
'moz-fx-data-shared-prod:firefox_accounts_derived.flow_metadata_sampled_10'

:loines - Can you verify these look reasonable? We can then clean up the exported data in S3 and GCS.

Flags: needinfo?(loines)

LGTM thanks Jeff

Flags: needinfo?(loines)

I have deleted the staged data in GCS.

jbuck - Feel free to tear down the bucket in S3. Thanks so much for your help here.

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #12)

I have deleted the staged data in GCS.

jbuck - Feel free to tear down the bucket in S3. Thanks so much for your help here.

Done

Status update:

Further work here is currently waiting on https://jira.mozilla.com/projects/DS/issues/DS-643

Once those draft queries are created, I will productionize and we can call this bug fixed.

Assignee: nobody → jklukas
Priority: P3 → P1

These tables are created now and the PRs are in review.

The tables are now being populated nightly.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: