Import Pocket OKR data to shared-prod
Categories: Data Platform and Tools Graveyard :: Operations (task)
Tracking: Not tracked
People: Reporter: klukas; Assigned: klukas
Attachments: 1 file
Kenny at Pocket has raised a need for getting Pocket OKR data (rolling MAU) into shared-prod so that the Revenue Data working group can set up reporting on it.
The data lives in Snowflake, and Kenny has indicated it would be easy for them to unload to S3 after the nightly modeling job completes.
I'm looking for SRE advice about what the most maintainable option would be. Have we used the BigQuery S3 transfer service? Or would we want to set up an Airflow DAG to handle the relevant copies?
Comment 1•5 years ago
Snowflake can unload data directly to GCS; is it possible we can use that instead? [1]
We would need to provision a new GCS bucket and service account in shared-prod, and share the service account credentials with Kenny.
[1] https://docs.snowflake.com/en/user-guide/data-unload-gcs.html#
Comment 2•5 years ago
Comment 3•5 years ago (Assignee)
I chatted with Kenny; he says this sounds good, so I think we can proceed with the path of Snowflake unloading to GCS.
Steps would be:
- Data SRE provisions GCS bucket and service account, and provides those details to Pocket
- Data SRE provisions a new pocket_external dataset in BigQuery
- Pocket does a one-time UNLOAD of historical data and sets up the recurring nightly UNLOAD
- Klukas sends a PR for a new Airflow DAG to populate tables in BQ from the GCS bucket
Does that all sound correct? Would a new external dataset match how we've handled other recent data imports?
Comment 4•5 years ago
The above sounds correct, but we should clarify the dataset names and access groups associated with the data before proceeding with the SRE pieces. Current spec is:
pocket_external: workgroup:mozilla-confidential
(no user-facing dataset)
pocket_external would usually imply a user-facing dataset pocket, but that may not be appropriate here. If this is entirely revenue-specific there's also revenue_external. I think we should probably avoid direct analysis access to shared-prod _external datasets, but perhaps it's warranted in this case. The tmp dataset also exists and is restricted to engineering workloads such as imports, so it may be possible to leverage it in some cases (though probably not here).
For the SA and GCS bucket, we're considering provisioning them in the pocket bq teams project instead of shared-prod, with appropriate access granted to ["workgroup:dataops-managed/airflow", "workgroup:platform-infra"].
Comment 5•5 years ago (Assignee)
I think I'd be fine with using moz-fx-data-shared-prod:tmp for the import and then having the Airflow DAG also include a query step that populates a table under pocket_derived (which already exists), potentially doing some cleanup in the query. That does seem simpler, and we can fall back to the data in GCS for backfill if we accidentally delete data from the table in pocket_derived.
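For illustration, a rough sketch of the shape of DAG I have in mind (task, table, and schedule names below are placeholders, not final; it assumes the standard Google providers in our Airflow environment):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="pocket_import",          # placeholder name
    start_date=datetime(2021, 3, 1),
    schedule_interval="@daily",
) as dag:
    # Stage the unloaded Parquet files into a temporary table.
    load_to_tmp = GCSToBigQueryOperator(
        task_id="load_pocket_mau_to_tmp",
        bucket="moz-fx-data-prod-external-pocket-data",
        source_objects=["rolling_monthly_active_user_counts/{{ ds }}/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="moz-fx-data-shared-prod.tmp.pocket_mau_{{ ds_nodash }}",
        write_disposition="WRITE_TRUNCATE",
    )

    # Clean up and publish into the user-facing table under pocket_derived.
    publish_derived = BigQueryInsertJobOperator(
        task_id="publish_pocket_mau_derived",
        configuration={
            "query": {
                "query": "SELECT * FROM `moz-fx-data-shared-prod.tmp.pocket_mau_{{ ds_nodash }}`",
                "destinationTable": {
                    "projectId": "moz-fx-data-shared-prod",
                    "datasetId": "pocket_derived",
                    "tableId": "rolling_mau_v1",  # placeholder table name
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_to_tmp >> publish_derived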
Comment 6•5 years ago
I will configure a GCS bucket in the shared projects, with the service account provisioned in the pocket teams project.
I'm assuming we want to provision Data Loading Only permissions for snowflake access but that can be upgraded if desired.
Comment 7•5 years ago
https://github.com/mozilla-services/cloudops-infra/pull/2925 contains the infrastructure provisioning logic, which has been applied.
Bucket: gs://moz-fx-data-prod-external-pocket-data
Service Account: external-snowflake@moz-fx-data-bq-pocket.iam.gserviceaccount.com
We'll need to convey a SA key for that account to Pocket.
Comment 8•5 years ago (Assignee)
We'll need to convey a SA key for that account to Pocket.
?ni :kenny Will it be you setting up the Snowflake export or someone else? Who should we coordinate with to deliver the service account key?
:whd - Will I have access to read from this bucket so that I can verify formats while setting up an import job?
Comment 9•5 years ago
:whd - Will I have access to read from this bucket so that I can verify formats while setting up an import job?
Data Platform (and airflow) have been granted storage.objectAdmin, but I can downgrade that access to read only.
Comment 10•5 years ago
Hi Jeff, Wesley,
For the Snowflake setup we are using the Snowflake Integration for GCS to set up access for the COPY/UNLOAD (https://docs.snowflake.com/en/user-guide/data-load-gcs-config.html#step-3-grant-the-service-account-permissions-to-access-bucket-objects).
Snowflake has allocated a GCP service account for use with the GCS integration. This service account is unique to our Snowflake account.
The GCS Service Account (provided by Snowflake): yeevlwcxfc@va2-10fd.iam.gserviceaccount.com
The permissions required for UNLOAD are: storage.buckets.get, storage.objects.create, storage.objects.delete, storage.objects.get, storage.objects.list
(In reply to Wesley Dawson [:whd] from comment #7)
https://github.com/mozilla-services/cloudops-infra/pull/2925 contains the infrastructure provisioning logic, which has been applied.
Bucket: gs://moz-fx-data-prod-external-pocket-data
Service Account: external-snowflake@moz-fx-data-bq-pocket.iam.gserviceaccount.com
We'll need to convey a SA key for that account to Pocket.
I have created a Snowflake integration for accessing the bucket: gcs://moz-fx-data-prod-external-pocket-data/
Note: with this GCS integration set up, we do not need the additional service account external-snowflake@moz-fx-data-bq-pocket.iam.gserviceaccount.com.
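For reference, a rough sketch of the Snowflake-side flow this integration enables, via snowflake-connector-python (connection details, integration name, unload prefix, and source table below are placeholders, not our real ones):

import snowflake.connector

conn = snowflake.connector.connect(
    account="pocket_account",  # placeholder
    user="etl_user",           # placeholder
    password="...",            # placeholder; real jobs use stored credentials
)
cur = conn.cursor()

# One-time setup: a storage integration scoped to the shared GCS bucket.
cur.execute("""
    CREATE STORAGE INTEGRATION IF NOT EXISTS pocket_gcs_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = GCS
      ENABLED = TRUE
      STORAGE_ALLOWED_LOCATIONS = ('gcs://moz-fx-data-prod-external-pocket-data/')
""")

# Nightly: unload the modeled table as Parquet into the bucket
# (prefix and table name are placeholders).
cur.execute("""
    COPY INTO 'gcs://moz-fx-data-prod-external-pocket-data/unload/'
    FROM analytics.rolling_monthly_active_user_counts
    STORAGE_INTEGRATION = pocket_gcs_int
    FILE_FORMAT = (TYPE = PARQUET)
""")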
Comment 11•5 years ago
(In reply to Wesley Dawson [:whd] from comment #6)
I will configure a GCS bucket in the shared projects, with the service account provisioned in the pocket teams project.
I'm assuming we want to provision Data Loading Only permissions for snowflake access but that can be upgraded if desired.
I believe Snowflake needs Data Loading and Unloading permissions in order to UNLOAD data to the GCS bucket.
Comment 12•5 years ago
We provisioned access based on the assumption that we would be managing the service account.
I've filed https://github.com/mozilla-services/cloudops-infra/pull/2933/files for instead using the service account mentioned in comment #10, which will likely be deployed tomorrow. That PR elevates the permissions granted to the SA to include UNLOAD and also removes our managed service account.
Comment 13•5 years ago
(In reply to Wesley Dawson [:whd] from comment #12)
We provisioned access based on the assumption that we would be managing the service account.
I've filed https://github.com/mozilla-services/cloudops-infra/pull/2933/files for instead using the service account mentioned in comment #10, which will likely be deployed tomorrow. That PR elevates the permissions granted to the SA to include UNLOAD and also removes our managed service account.
Thank you!
Comment 14•5 years ago
The change has been deployed and yeevlwcxfc@va2-10fd.iam.gserviceaccount.com should consequently be able to write to gs://moz-fx-data-prod-external-pocket-data.
Comment 15•5 years ago
(In reply to Wesley Dawson [:whd] from comment #14)
The change has been deployed and yeevlwcxfc@va2-10fd.iam.gserviceaccount.com should consequently be able to write to gs://moz-fx-data-prod-external-pocket-data.
I have tested the LOAD and UNLOAD from Snowflake and it works perfectly! Thank you!
Comment 16•5 years ago (Assignee)
This is great progress here! I'm able to read the contents of the bucket and see the test files.
:gaurang - Let me know when you have UNLOADs of real data happening there. I will be able to start working on the telemetry-side Airflow DAG once I can see the structure of the unloaded data.
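(For anyone else verifying, a quick way to list what has landed in the bucket, assuming google-cloud-storage and credentials with read access, would be something like:)

from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("moz-fx-data-prod-external-pocket-data"):
    print(blob.name, blob.size)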
Comment 17•5 years ago
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #16)
This is great progress here! I'm able to read the contents of the bucket and see the test files.
:gaurang - Let me know when you have UNLOADs of real data happening there. I will be able to start working on the telemetry-side Airflow DAG once I can see the structure of the unloaded data.
Kirill (kirill@getpocket.com) will be doing the data UNLOADs, I believe next week.
Comment 18•5 years ago
That's correct. I will try to get some testing done today and provide further details on file structure/frequency.
Comment 19•5 years ago
Jeff, do you have preferences as far as file formats and directory structure?
I can make reasonable assumptions, but if you have specific wishes as far as any of these items, it would be useful for me to know them from the start:
- file format (CSV, TSV, Parquet, etc.)
- file headers
- overwriting the same file with each subsequent UNLOAD vs. creating dated files for each UNLOAD
Comment 20•5 years ago (Assignee)
do you have preferences as far as file formats and directory structure
For file formats, the more structured the better. Either Parquet or newline-delimited JSON files would be easy to import.
For directory structure, I would love to have files organized into directories by date. I'm imagining paths something like: gcs://moz-fx-data-prod-external-pocket-data/pocket_mau/2021-01-01/
Comment 21•5 years ago
There is an initial set of sample Parquet data in the following directory:
gcs://moz-fx-data-prod-external-pocket-data/rolling_monthly_active_user_counts/2021-03-08/
Let me know if that works for you!
I am trying to set up an approach that we can use more generally, if we want to share further data with you in the future, so the name of the data set is a bit more generic (rolling_monthly_active_user_counts vs. pocket_mau). I figured the GCS bucket itself provides the necessary context to tell that this data comes from Pocket.
Comment 22•5 years ago (Assignee)
I was able to successfully run a one-off import of these parquet files:
bq load --source_format=PARQUET mozdata:analysis.klukas_pocket1 'gs://moz-fx-data-prod-external-pocket-data/rolling_monthly_active_user_counts/2021-03-08*.parquet'
The one change I'd like to see is for the date part of the path to be a directory. It looks like it ended up as part of the file name:
gs://moz-fx-data-prod-external-pocket-data/rolling_monthly_active_user_counts/2021-03-08_0_0_0.snappy.parquet
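(For reference, roughly the same one-off load could be scripted with the BigQuery Python client; the table and URI mirror the bq invocation above:)

from google.cloud import bigquery

client = bigquery.Client(project="mozdata")
job = client.load_table_from_uri(
    "gs://moz-fx-data-prod-external-pocket-data/rolling_monthly_active_user_counts/2021-03-08*.parquet",
    "mozdata.analysis.klukas_pocket1",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET),
)
job.result()  # wait for the load job to finish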
Comment 23•5 years ago (Assignee)
Also, it looks like each row appears twice. There are 812 rows representing 406 days (2020-01-01 onward).
Comment 24•5 years ago
|
||
Whoops, good catch on the date being part of the file name instead of the directory. I fixed that and unloaded more records to:
gcs://moz-fx-data-prod-external-pocket-data/rolling_monthly_active_user_counts/2021-03-09/data_*
Jeff, can you check if there is a single copy of each record? I must have accidentally copied data twice while testing yesterday.
Comment 25•5 years ago (Assignee)
I don't see any change in the contents of the bucket. In particular, I get:
Not found: Uris gs://moz-fx-data-prod-external-pocket-data/rolling_monthly_active_user_counts/2021-03-09/*
Comment 26•5 years ago
Sorry, can you check again? The fix I made to ensure records only get unloaded from prod (and never from the dev environment) was a little too effective, haha.
Comment 27•5 years ago (Assignee)
Looks good this time. No duplicates, and the paths match my expectations. If you can set up a nightly export that continues publishing to dated paths like this (and let me know when to expect the exports to complete), then I have everything I need to import this to BQ and make it available for reporting. I'll plan to have the main derived table for reporting reflect just the most recent day of exported data, but it will be good to have the history of previous exports available to help with debugging if we see any issues.
Comment 28•5 years ago
Woohoo!
How tight of an estimate do you need on when fresh data will be available in GCS each night? This UNLOAD will be run as part of our nightly refresh of all data models, so it may vary a bit day-to-day depending on how long everything upstream took to run.
Comment 29•5 years ago (Assignee)
How tight of an estimate do you need on when fresh data will be available in GCS each night?
If you can give us the time by which you have ~90% confidence it'll be done, that should be fine. I will plan to set up the job so that it will fail if no data exists for the new day, and then will retry after 30 minutes, so we'll have some resilience to lateness.
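Something along these lines, for example (the sensor is the standard Google provider one; the exact prefix and wiring are illustrative):

from airflow.providers.google.cloud.sensors.gcs import GCSObjectsWithPrefixExistenceSensor

wait_for_export = GCSObjectsWithPrefixExistenceSensor(
    task_id="wait_for_pocket_export",
    bucket="moz-fx-data-prod-external-pocket-data",
    prefix="rolling_monthly_active_user_counts/{{ ds }}/",
    poke_interval=30 * 60,  # re-check every 30 minutes if nothing is there yet
    timeout=6 * 60 * 60,    # give up after 6 hours
    mode="reschedule",      # free the worker slot between checks
)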
Comment 30•5 years ago
Sounds good. Let's use 12:00:00 UTC each day.
Comment 31•5 years ago
Comment 32•5 years ago (Assignee)
PR for scheduling is reviewed and ready to merge, but I've realized one final detail.
Per Airflow convention and existing practice in our telemetry datasets, we are usually operating based on the most recent complete day, so we are executing jobs for 2021-03-08 data on 2021-03-09, etc. The Snowflake export, however, is exporting data to a path that represents the date on which the export is run.
:kirill - Can you modify the export so that the date in the path reflects the most recent measured_at date rather than the date on which the export is initiated? If that's difficult on your end, no worries; we can adjust the DAG logic to account for it.
Comment 33•5 years ago
Good point!
We are using dbt to update this aggregate table and copy the data into GCS every night. Unlike Airflow, dbt convention is not to process particular dated "chunks" of events in relation to its run/execution date. dbt convention is usually to process any new records that are not already part of a model (without referencing the execution date).
Thus, when I update the MAU counts aggregate table on 2021-03-11, it will already have a partial count from 2021-03-10 in it. dbt will then finalize the count for 2021-03-10 and add a partial 2021-03-11. This all will get dumped into a directory dated 2021-03-11 (the execution date).
It is not straightforward for me to change the directory to the most recent complete measured_at date, but I can hard-code it to be one day behind the date the export is initiated. Most of the time, these two should be the same. Let me know if that works for you! The situation in which it would fail is if we're experiencing some sort of pipeline issue and there was in fact no new data processed on day X, but the date in the directory would still advance by 1 day.
Separately, for your processing needs, does it matter if there is an incomplete, leading day of data in each export? I could also hard-code the nightly update for this table to assume that last day is incomplete and remove it from the export.
Comment 34•5 years ago (Assignee)
For your processing needs, does it matter if there is an incomplete, leading day of data in each export?
I definitely want to make sure we aren't exposing unfinished days in what we publish to BQ, but we can probably handle that on either side. I need to absorb this a bit more. Thanks so much for all the context on what's happening Pocket-side.
Comment 35•5 years ago (Assignee)
I've updated the logic in https://github.com/mozilla/bigquery-etl/pull/1883 to account for the 1 day offset and to drop any partial data for the current day, along with copious comments to explain the situation. So we should be good from the Pocket side. I'll merge this today and check on status tomorrow to confirm that it was able to run correctly.
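For context, the offset/partial-day handling amounts to roughly the sketch below (illustrative only; the real logic lives in the PR above, and measured_at is the date column discussed earlier in this thread):

from datetime import date, timedelta

def export_prefix(logical_date: date) -> str:
    # The export run on day D lands under a directory dated D but is only
    # complete through D-1, so the DAG run for logical date D-1 reads
    # the directory dated D.
    export_dir = logical_date + timedelta(days=1)
    return f"rolling_monthly_active_user_counts/{export_dir.isoformat()}/"

# The downstream query then keeps only fully-measured days, e.g.
# WHERE measured_at <= DATE('{{ ds }}')  -- drops the partial leading day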
Comment 36•5 years ago
Perfect! Thank you, Jeff!
Comment 37•5 years ago (Assignee)
The Airflow job was able to run successfully. We'll call this fixed and I'll follow up with data folks to let them know that it's available.