Closed Bug 1743493 Opened 3 years ago Closed 3 years ago

Changes to Pocket tile ID types import job

Categories

(Data Platform and Tools :: General, task, P1)

task
Points:
3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kdemtchouk, Assigned: klukas)

References

Details

Attachments

(1 file)

For a long time, we’ve been copying some data to S3 that is then picked up by the Mozilla data pipeline from this directory:
s3://pocket-mozilla-shared-spoc-analytics/pocket-tile-id-types-csv/tile_id_types_

Specifically, what this data does is distinguish sponsored Pocket tiles in new tab from organic recommendation tiles. It is used in the creation of the activity_stream.impression_stats_flat BQ table.

The export procedure to S3 is still being triggered from our legacy data warehouse and I would like to move it to our new Snowflake warehouse ahead of our legacy warehouse being decommissioned next year.

For reasons we can get into if you’re interested, this would ideally require two changes on the Mozilla side:

Logic change to assume that the data we share with you contains only sponsored tile IDs and that anything else is an organic rec. (We do not yet have organic tile IDs available in Snowflake).

That we shift the directory from S3 to an existing GCS directory we’ve been using to share data with Mozilla this year in Parquet format:
gcs://moz-fx-data-prod-external-pocket-data/

Do you know how much effort would be involved in making these changes? And as always, please let me know if I should direct my questions elsewhere.

Flags: needinfo?(jklukas)

The spot where we read the tile ID types is here: https://github.com/mozilla/bigquery-etl/blob/9bca48821a8a0d40b1700cc14ecd8068d132ed06/sql/moz-fx-data-shared-prod/activity_stream/tile_id_types/view.sql

That references a table pocket-tiles.pocket_tiles_data.tile_id_types that is owned by Pocket. So I don't think we have anything Mozilla data-side that is reading directly from the S3 source. What is the status of that table?

Flags: needinfo?(jklukas)

Hmmm, I'm not sure. I know that at some point we changed how Pocket-related resources are administered in Google Cloud to give the Pocket team control over said resources. This may be something that was initially owned by Mozilla, but then got moved over to Pocket.

Let me inquire with some internal data eng/ops folks.

Unfortunately, no one at Pocket seems to know how pocket-tiles.pocket_tiles_data.tile_id_types is populated :grimace: It is not a table that our Airflow repo works with at all.

Are you sure there are no references to this table in the Mozilla data codebase(s) that might provide clues as to how it is populated?

I could have sworn we set up moz-fx-data-shared-prod.activity_stream.tile_id_types before Pocket was doing anything in GCP, so it's strange to me that this view is somehow referencing a Pocket-owned BQ table. I also dimly remember working with Nan, as well as you, on it.

Flags: needinfo?(jklukas)
Flags: needinfo?(jklukas) → needinfo?(hganchev)

Data population for pocket-tiles.pocket_tiles_data.tile_id_types is performed by a service on our end that runs under an ec2 instance. It copies data from s3://pocket-mozilla-shared-spoc-analytics/pocket-tile-id-types-csv/tile_id_types_000 to GoogleBigQuery table pocket-tiles.pocket_tiles_data.tile_id_types every 2 minutes.

@kirill this service needs to be moved away from the ec2. Ping me so we can resolve this.

Flags: needinfo?(hganchev)

:kirill - Has this been resolved?

Flags: needinfo?(kirill)

Hristo is working on updating the CRON job to populate a new BigQuery table. The dataset has changed slightly (it is now a list of only sponsored tile IDs, anything not in this last can be assumed to be an organic recommendation).

Once that is done, I think we'll need to change the join logic against this new table in the activity stream tables where the old table (pocket-tiles.pocket_tiles_data.tile_id_types) was used.

Flags: needinfo?(kirill)

Kirill has set up an export to the moz-fx-data-prod-external-pocket-data bucket and it should start getting populated tonight under a spoc_tile_ids path.

We should be able to copy the two definitions under https://github.com/mozilla/bigquery-etl/tree/main/sql/moz-fx-data-shared-prod/pocket_derived and modify to fit this new use case.

Once the import to BQ is established and working, we'll coordinate with Kirill to get all references pointing to the new table.

Assignee: nobody → jklukas
Points: --- → 3
Priority: -- → P1

PR up for review: https://github.com/mozilla/bigquery-etl/pull/2661

I've run these locally as well to get the tables set up and initial pull in place. So you should be able to validate the content now:

select * from mozdata.pocket.spoc_tile_ids

Kirill - Can you give the PR and the data a look over to verify if it looks like you'd expect?

Amazing, thank you, Jeff! I approved the PR and confirmed all of the records on the Pocket side are making it through to the mozdata.pocket.spoc_tile_ids table.

The PR is merged now, so this table should get updated daily from the new content in GCS now.

Kirill - I'll leave this open to capture a subsequent PR where we update uses of the legacy tile_id_types table to use the new one. It sounds like you were willing to take a crack at that.

Yes, I can do that! To do so, would I just look for references to the old table in that some repo?

(In reply to Kirill Demtchouk from comment #13)

Yes, I can do that! To do so, would I just look for references to the old table in that some repo?

Exactly. This is what I see:

› rg tile_id_types sql/
sql/moz-fx-data-shared-prod/activity_stream/impression_stats_by_experiment/view.sql
6:  tile_id_types.type AS tile_type,
11:  `moz-fx-data-shared-prod.activity_stream.tile_id_types` AS tile_id_types

sql/moz-fx-data-shared-prod/activity_stream/tile_id_types/view.sql
5:  `moz-fx-data-shared-prod.activity_stream.tile_id_types`
10:  `pocket-tiles.pocket_tiles_data.tile_id_types`

sql/moz-fx-data-shared-prod/activity_stream/impression_stats_flat/view.sql
6:  tile_id_types.type AS tile_type,
11:  `moz-fx-data-shared-prod.activity_stream.tile_id_types` AS tile_id_types
Flags: needinfo?(jklukas)

Also, is it easy for you to check whether the new spoc_tile_ids table and the old tile_id_types tables have the same level of access?

Specifically, Pocket uses Airflow to run a few queries against Mozilla BigQuery and I want to be sure they don't break when I switch them to the new table.

Airflow is blanking out the connection details when I try to view them, so I'm not 100% certain of the name of the service account that Pocket's Airflow instance is using.

The old tile_id_types is an authorized view that points into a pocket-owned project. If your existing ETL is referencing the pocket-side pocket-tiles.pocket_tiles_data.tile_id_types then it might be a problem transitioning to the new view.

We do have a special grant on the activity_stream_bi dataset for the workgroup pocket/external which may be relevant here.

Specifically, Pocket uses Airflow to run a few queries against Mozilla BigQuery

What tables/views are those existing queries hitting? That could help me determine what needs to be in place here.

Flags: needinfo?(jklukas)

Our ETL is referencing the the Mozilla-side moz-fx-data-shared-prod.activity_stream.tile_id_types.

The other table being referenced is mostly moz-fx-data-shared-prod.activity_stream_live.impression_stats_v1.

Kirill wrote up https://github.com/mozilla/bigquery-etl/pull/2683 to change the references, and that is now merged.

If the current state looks good to you and we aren't having any access issues from the Pocket side, then we can call this bug done.

Awesome, thank you! I was playing with the tables in Redash this morning and things look good. I also deployed the changes to our ETL jobs, but will need to give them 24 hours to see if there are any errors. I'll report back tomorrow morning!

Actually, I was able to trigger the ETL jobs early and verify everything works! Please close out this bug.

Excellent!

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
See Also: → 1761790
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: