Changes to Pocket tile ID types import job
Categories
(Data Platform and Tools :: General, task, P1)
Tracking
(Not tracked)
People
(Reporter: kdemtchouk, Assigned: klukas)
References
Details
Attachments
(1 file)
For a long time, we’ve been copying some data to S3 that is then picked up by the Mozilla data pipeline from this directory:
s3://pocket-mozilla-shared-spoc-analytics/pocket-tile-id-types-csv/tile_id_types_
Specifically, what this data does is distinguish sponsored Pocket tiles in new tab from organic recommendation tiles. It is used in the creation of the activity_stream.impression_stats_flat BQ table.
The export procedure to S3 is still being triggered from our legacy data warehouse and I would like to move it to our new Snowflake warehouse ahead of our legacy warehouse being decommissioned next year.
For reasons we can get into if you’re interested, this would ideally require two changes on the Mozilla side:
Logic change to assume that the data we share with you contains only sponsored tile IDs and that anything else is an organic rec. (We do not yet have organic tile IDs available in Snowflake).
That we shift the directory from S3 to an existing GCS directory we’ve been using to share data with Mozilla this year in Parquet format:
gcs://moz-fx-data-prod-external-pocket-data/
Do you know how much effort would be involved in making these changes? And as always, please let me know if I should direct my questions elsewhere.
Assignee | ||
Comment 1•3 years ago
|
||
The spot where we read the tile ID types is here: https://github.com/mozilla/bigquery-etl/blob/9bca48821a8a0d40b1700cc14ecd8068d132ed06/sql/moz-fx-data-shared-prod/activity_stream/tile_id_types/view.sql
That references a table pocket-tiles.pocket_tiles_data.tile_id_types
that is owned by Pocket. So I don't think we have anything Mozilla data-side that is reading directly from the S3 source. What is the status of that table?
Reporter | ||
Comment 2•3 years ago
|
||
Hmmm, I'm not sure. I know that at some point we changed how Pocket-related resources are administered in Google Cloud to give the Pocket team control over said resources. This may be something that was initially owned by Mozilla, but then got moved over to Pocket.
Let me inquire with some internal data eng/ops folks.
Reporter | ||
Comment 3•3 years ago
|
||
Unfortunately, no one at Pocket seems to know how pocket-tiles.pocket_tiles_data.tile_id_types
is populated :grimace: It is not a table that our Airflow repo works with at all.
Are you sure there are no references to this table in the Mozilla data codebase(s) that might provide clues as to how it is populated?
I could have sworn we set up moz-fx-data-shared-prod.activity_stream.tile_id_types
before Pocket was doing anything in GCP, so it's strange to me that this view is somehow referencing a Pocket-owned BQ table. I also dimly remember working with Nan, as well as you, on it.
Comment 4•3 years ago
|
||
From https://github.com/mozilla/bigquery-etl/blame/9bca48821a8a0d40b1700cc14ecd8068d132ed06/sql/moz-fx-data-shared-prod/activity_stream/tile_id_types/view.sql, I think bug #1604643 is the bug you're looking for with context on pocket-tiles
. Per https://bugzilla.mozilla.org/show_bug.cgi?id=1604643#c2, maybe Hristo knows?
Comment 5•3 years ago
|
||
Data population for pocket-tiles.pocket_tiles_data.tile_id_types
is performed by a service on our end that runs under an ec2 instance. It copies data from s3://pocket-mozilla-shared-spoc-analytics/pocket-tile-id-types-csv/tile_id_types_000 to GoogleBigQuery table pocket-tiles.pocket_tiles_data.tile_id_types
every 2 minutes.
@kirill this service needs to be moved away from the ec2. Ping me so we can resolve this.
Reporter | ||
Comment 7•3 years ago
|
||
Hristo is working on updating the CRON job to populate a new BigQuery table. The dataset has changed slightly (it is now a list of only sponsored tile IDs, anything not in this last can be assumed to be an organic recommendation).
Once that is done, I think we'll need to change the join logic against this new table in the activity stream tables where the old table (pocket-tiles.pocket_tiles_data.tile_id_types) was used.
Assignee | ||
Comment 8•3 years ago
|
||
Kirill has set up an export to the moz-fx-data-prod-external-pocket-data
bucket and it should start getting populated tonight under a spoc_tile_ids
path.
We should be able to copy the two definitions under https://github.com/mozilla/bigquery-etl/tree/main/sql/moz-fx-data-shared-prod/pocket_derived and modify to fit this new use case.
Once the import to BQ is established and working, we'll coordinate with Kirill to get all references pointing to the new table.
Assignee | ||
Updated•3 years ago
|
Comment 9•3 years ago
|
||
Assignee | ||
Comment 10•3 years ago
|
||
PR up for review: https://github.com/mozilla/bigquery-etl/pull/2661
I've run these locally as well to get the tables set up and initial pull in place. So you should be able to validate the content now:
select * from mozdata.pocket.spoc_tile_ids
Kirill - Can you give the PR and the data a look over to verify if it looks like you'd expect?
Reporter | ||
Comment 11•3 years ago
|
||
Amazing, thank you, Jeff! I approved the PR and confirmed all of the records on the Pocket side are making it through to the mozdata.pocket.spoc_tile_ids table.
Assignee | ||
Comment 12•3 years ago
|
||
The PR is merged now, so this table should get updated daily from the new content in GCS now.
Kirill - I'll leave this open to capture a subsequent PR where we update uses of the legacy tile_id_types table to use the new one. It sounds like you were willing to take a crack at that.
Reporter | ||
Comment 13•3 years ago
|
||
Yes, I can do that! To do so, would I just look for references to the old table in that some repo?
Assignee | ||
Comment 14•3 years ago
|
||
(In reply to Kirill Demtchouk from comment #13)
Yes, I can do that! To do so, would I just look for references to the old table in that some repo?
Exactly. This is what I see:
› rg tile_id_types sql/
sql/moz-fx-data-shared-prod/activity_stream/impression_stats_by_experiment/view.sql
6: tile_id_types.type AS tile_type,
11: `moz-fx-data-shared-prod.activity_stream.tile_id_types` AS tile_id_types
sql/moz-fx-data-shared-prod/activity_stream/tile_id_types/view.sql
5: `moz-fx-data-shared-prod.activity_stream.tile_id_types`
10: `pocket-tiles.pocket_tiles_data.tile_id_types`
sql/moz-fx-data-shared-prod/activity_stream/impression_stats_flat/view.sql
6: tile_id_types.type AS tile_type,
11: `moz-fx-data-shared-prod.activity_stream.tile_id_types` AS tile_id_types
Reporter | ||
Comment 15•3 years ago
|
||
How does this look?
https://github.com/mozilla/bigquery-etl/pull/2683
Reporter | ||
Comment 16•3 years ago
|
||
Also, is it easy for you to check whether the new spoc_tile_ids
table and the old tile_id_types
tables have the same level of access?
Specifically, Pocket uses Airflow to run a few queries against Mozilla BigQuery and I want to be sure they don't break when I switch them to the new table.
Airflow is blanking out the connection details when I try to view them, so I'm not 100% certain of the name of the service account that Pocket's Airflow instance is using.
Assignee | ||
Comment 17•3 years ago
|
||
The old tile_id_types
is an authorized view that points into a pocket-owned project. If your existing ETL is referencing the pocket-side pocket-tiles.pocket_tiles_data.tile_id_types
then it might be a problem transitioning to the new view.
We do have a special grant on the activity_stream_bi dataset for the workgroup pocket/external
which may be relevant here.
Specifically, Pocket uses Airflow to run a few queries against Mozilla BigQuery
What tables/views are those existing queries hitting? That could help me determine what needs to be in place here.
Reporter | ||
Comment 18•3 years ago
|
||
Our ETL is referencing the the Mozilla-side moz-fx-data-shared-prod.activity_stream.tile_id_types
.
The other table being referenced is mostly moz-fx-data-shared-prod.activity_stream_live.impression_stats_v1
.
Assignee | ||
Comment 19•3 years ago
|
||
Kirill wrote up https://github.com/mozilla/bigquery-etl/pull/2683 to change the references, and that is now merged.
If the current state looks good to you and we aren't having any access issues from the Pocket side, then we can call this bug done.
Reporter | ||
Comment 20•3 years ago
|
||
Awesome, thank you! I was playing with the tables in Redash this morning and things look good. I also deployed the changes to our ETL jobs, but will need to give them 24 hours to see if there are any errors. I'll report back tomorrow morning!
Reporter | ||
Comment 21•3 years ago
|
||
Actually, I was able to trigger the ETL jobs early and verify everything works! Please close out this bug.
Assignee | ||
Comment 22•3 years ago
|
||
Excellent!
Description
•