Sanitize search engine values in historical search data
Categories: Data Platform and Tools :: General, task, P1
Tracking: Not tracked
People: Reporter: klukas; Assigned: klukas
Attachments: 5 files, 1 obsolete file
This is a follow-up to https://bugzilla.mozilla.org/show_bug.cgi?id=1751753 where we implemented sanitization on ingestion. We will need to apply similar logic to historical data.
The unexpected values propagate all the way to search_aggregates, so we need to consider a whole chain of derived tables:
main_v4
clients_daily_v6
clients_daily_joined_v1
clients_last_seen_v1
clients_last_seen_joined_v1
search_clients_daily
search_clients_last_seen
search_aggregates
The most immediate value would come from first rewriting the most aggregated tables, since those are where most real analysis of search activity is done. I need to think some more about the implications there, because we may end up violating some data contracts; for example, if search_clients_daily has a contract of one row per (client_id, engine, source), sanitization may leave us with several rows that have the same sanitized value for engine, violating the contract. Likewise, the aggregation in clients_daily would produce different results if the input were already sanitized.
So the most stable approach may be to sanitize the rawest data first, and then do a full reprocessing of downstream data so that the sanitized values propagate.
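For illustration, a minimal sketch of what re-establishing that contract could look like. Here sanitize_engine stands in for the cleaning expression from bug 1751753, and the table and metric column names are placeholders, not the actual schema:

  -- Hypothetical sketch: if two raw engine values collapse to the same
  -- sanitized value, the rows must be re-aggregated to restore one row
  -- per (client_id, engine, source).
  SELECT
    submission_date,
    client_id,
    sanitize_engine(engine) AS engine,  -- placeholder for the real cleaning logic
    source,
    SUM(sap) AS sap                     -- every other metric column needs the same treatment
  FROM
    search_clients_daily
  GROUP BY
    submission_date,
    client_id,
    engine,
    source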
Comment 1 • 3 years ago (Assignee)
ni? :relud
Am I correct in remembering that the shredder process for main_v4 uses DELETEs and also uses on-demand query slots? Do you have benchmarks for performance of overwriting partitions using SELECT vs. DELETE? Do you have a good estimate of the rate at which we'd be able to rewrite partitions of main_v4 using a SELECT so that we could sanitize some fields?
Comment 2 • 3 years ago
I don't have benchmarks, but the last time I ran tests, SELECT took about twice as much slot time as DELETE on main_v4. However, DML can only modify 2 partitions at a time, while SELECT has no such limit. Shredder takes over 28 days to process main_v4 with DELETE statements, and I put up https://github.com/mozilla/bigquery-etl/pull/2686 to allow shredder to use SELECT statements instead so that we can increase parallelism. We can deploy that change and compare query performance for a more up-to-date benchmark.
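For reference, the rough shapes of the two kinds of statement being compared here. The project/dataset names and the deletion-request source are placeholders; the real shredder SQL is generated by bigquery-etl:

  -- DML shape: remove matching rows in place. Concurrent DML on the same
  -- table is limited, which is what caps shredder's parallelism here.
  DELETE FROM
    `project.dataset.main_v4`
  WHERE
    DATE(submission_timestamp) = '2022-01-01'
    AND client_id IN (SELECT client_id FROM `project.dataset.deletion_requests`);

  -- SELECT shape: compute the rows to keep and write the result back over
  -- the partition (destination main_v4$20220101 with WRITE_TRUNCATE). There
  -- is no DML concurrency limit, but it costs roughly twice the slot time.
  SELECT
    *
  FROM
    `project.dataset.main_v4`
  WHERE
    DATE(submission_timestamp) = '2022-01-01'
    AND client_id NOT IN (SELECT client_id FROM `project.dataset.deletion_requests`);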
Comment 3 • 3 years ago (Assignee)
Another option here would be to restrict access to the underlying tables and turn telemetry.main into an authorized view that does this cleaning. I don't think that's worth pursuing, though, due to performance impact. We already have a problem with queries on telemetry.main sometimes being rejected due to the query complexity limit, and the modifications needed for this bug's cleaning will significantly increase the base complexity of the query.
Comment 4 • 3 years ago (Assignee)
I've started a draft bigquery-backfill PR to capture the evolving design for running this and to start drafting a desktop query to run the sanitization.
Comment 5 • 3 years ago
Shredder ran overnight for main_v4 with SELECT statements, and 100% of them timed out at 6 hours. Equivalent DELETE statements take about 2 hours each.
Comment 6 • 3 years ago (Assignee)
I've opened support case 29411159 with GCP to help determine an efficient path forward for this effort, and I am coordinating with JJ, our TAM, as well.
Comment 7 • 3 years ago (Assignee)
It appears that we have found a solution using SELECT to replace partitions (rather than using DML like MERGE or UPDATE) and that we will be able to integrate this into the existing shredder machinery. I am aiming to get that integrated and running on main_v4 by EOW (Feb 11). I will need to do some more estimation to figure out when we can have this finished, but it will be several weeks.
Comment 8 • 3 years ago
Comment 9 • 3 years ago (Assignee)
main_v4
We are updating the Shredder config to include sanitization for main_v4, and we are going to pursue running 10 queries concurrently so that the next run takes potentially 1 week rather than 28 days. The main priority right now is getting all those pieces coordinated so that main_v4 processing is running.
main_summary
The next step will be to decide on a plan for main_summary. That table includes a search_counts field derived from the keyed histogram of that name, but doesn't include any of the search scalars. main_summary is deprecated, so we should have no dependencies on it other than existing user-created Redash queries, etc. It would be fine to completely delete this field.
Note that we can't just rely on rerunning the main_summary query, since the table contains more history than what we have in main_v4. We'll have to modify the contents of the table directly. I see problematic values starting in June of 2017, so we'll need to process from at least 2017-06-01, which is nearly the full history of the table.
I'm going to plan to hook this into Shredder like we did for main_v4.
clients_daily and downstream
clients_daily has search data populated back to 2018-10-30, which is the first day of main_v4, so we don't need to worry about sanitizing dates before that. We could rely on running the full query from main_v4 per normal backfill procedures.
Comment 10 • 3 years ago (Assignee)
It looks like we have an existing need to do a backfill of clients_daily and downstream which we can roll this in with: https://bugzilla.mozilla.org/show_bug.cgi?id=1737674
Comment 11 • 3 years ago
Comment 12 • 3 years ago (Assignee)
I kicked off a job this morning to recalculate clients_daily_v6 from 2019-11-22 to present. This should take a few days using on-demand slots. I'll then plan to start reprocessing downstream tables, using reserved slots as much as possible.
Comment 13 • 3 years ago (Assignee)
Plan for desktop derived tables
We have turned off the all task in the Shredder DAG so that the full slot reservation will be used for processing main_summary_v4 for now. We actually won't need to apply shredder to clients_daily and downstream tables for 2019-11-22 and onward, since we're recreating those partitions via query jobs from source data that has already been shredded, so allowing shredder to run on them would be wasted resources.
Once clients_daily and downstream queries are complete for 2019-11-22 onward, we will still need to sanitize earlier dates. We'll add logic to shredder to sanitize this set of tables as well. Once main_summary_v4 is complete, we'll turn the all shredder task back on to process those earlier dates. We'll need to add a partition filter to the shredder logic to skip the dates beyond 2019-11-22 (although we also need to consider the 28-day gap for last_seen-type tables, so we'll need to make sure to address that detail as well).
New logic needed for shredder
We have a consistent search_counts REPEATED engine/source/count field and consistent search_content_*_sum key/value fields in the clients_daily and clients_last_seen-type tables, so we can have one logic branch for those.
The search_clients_daily_v8 and search_aggregates_v8 tables have top-level engine and source fields. This can be another branch of logic.
The search_clients_last_seen_v1 table has normalize_search_engine logic applied, so it does not appear to contain any problematic values and can be out of scope here.
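A minimal sketch of what those two branches could look like. sanitize_engine is a placeholder for the real cleaning expression, search_content_urlbar_sum stands in for the family of search_content_*_sum fields, and table names are illustrative; the actual implementation is the shredder change in bigquery-etl:

  -- Branch 1 (clients_daily / clients_last_seen shape): rewrite the engine
  -- inside the repeated search_counts struct and the keys of the
  -- search_content_*_sum key/value fields.
  SELECT
    * REPLACE (
      ARRAY(
        SELECT AS STRUCT sc.* REPLACE (sanitize_engine(sc.engine) AS engine)
        FROM UNNEST(search_counts) AS sc
      ) AS search_counts,
      ARRAY(
        SELECT AS STRUCT kv.* REPLACE (sanitize_engine(kv.key) AS key)
        FROM UNNEST(search_content_urlbar_sum) AS kv
      ) AS search_content_urlbar_sum
    )
  FROM
    clients_daily_v6
  WHERE
    submission_date = @submission_date;

  -- Branch 2 (search_clients_daily_v8 / search_aggregates_v8 shape): the
  -- engine is a top-level column, so a simple column replacement suffices.
  SELECT
    * REPLACE (sanitize_engine(engine) AS engine)
  FROM
    search_clients_daily_v8
  WHERE
    submission_date = @submission_date;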
Comment 14 • 3 years ago
:whd, or any data SRE, would you kindly run the attached query to remove shredder state for non-sanitizing main summary tasks in the current run? This will allow shredder to run again on those dates, this time with sanitizing logic in place.
Non-sanitizing tasks are detected as those where statement_type for the associated job_id was DELETE.
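The attached query is authoritative; as a rough sketch of the idea only, it has roughly this shape. The task_id filter and the join of job_id against INFORMATION_SCHEMA are assumptions about formats, not the actual statement:

  -- Hypothetical shape of the cleanup: drop state rows whose recorded job was
  -- a plain DELETE so shredder will redo those dates with sanitizing SELECTs.
  DELETE FROM
    `moz-fx-data-shredder.shredder_state.shredder_state`
  WHERE
    task_id LIKE '%main_summary_v4%'
    AND job_id IN (
      SELECT job_id
      FROM `moz-fx-data-shredder.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
      WHERE statement_type = 'DELETE'
    )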
Comment 15 • 3 years ago
Updated the query to also remove state for clients_daily_v6, clients_daily_joined_v1, clients_last_seen_v6, clients_last_seen_joined_v1, and search_clients_daily_v8.
Comment 17 • 3 years ago (Assignee)
https://github.com/mozilla/bigquery-etl/pull/2731 is now merged, so we have all the necessary logic hooked up for the shredder_all task to do the scrubbing we need when we turn it back on.
Comment 18 • 3 years ago (Assignee)
Status as of 2022-02-23
I continue to update https://github.com/mozilla/bigquery-backfill/pull/17 with status as this progresses, but wanted to document it in this bug as well. I'll be on PTO for the rest of this week, so I will next update on the 28th.
We have fully sanitized all source tables, for both desktop and mobile.
We have sanitized derived tables for historical desktop data from 2020-01-01 onward. The one exception to this is main_summary, which is still running but should complete in the next few days.
We have sanitized mobile derived tables for all history, with the one exception of mobile_search_clients_last_seen, which is currently processing.
Comment 19 • 3 years ago
The delete_state.sql query swapped task_id and job_id in the table.
This is because SELECT * FROM shredder_state LEFT JOIN (SELECT job_id FROM job_ids) USING (job_id) moves the join key to the top of the column list, and CREATE TABLE ... LIKE statements only use column order and ignore names.
The column order for the table is task_id, job_id, ..., but for the query it was job_id, task_id, ..., and the types still matched.
:whd, would you kindly run the attached update statement to restore the values to their correct columns?
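For context, a plausible shape of that fix (the attached statement is authoritative): standard SQL evaluates the SET expressions against the pre-update row, so the two columns can be swapped directly, and WHERE TRUE satisfies BigQuery's requirement that an UPDATE carry a WHERE clause.

  -- Sketch only; the attached update statement is what was actually run.
  UPDATE
    `moz-fx-data-shredder.shredder_state.shredder_state`
  SET
    task_id = job_id,
    job_id = task_id
  WHERE
    TRUE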
Comment 20 • 3 years ago
Ran the query.
Query complete (15.7 sec elapsed, 120.7 MB processed)
This statement modified 799,352 rows in moz-fx-data-shredder:shredder_state.shredder_state.
Comment 21 • 3 years ago
main_summary is now sanitized back to 2017-04-29, and shredder_all has been started.
Comment 22 • 3 years ago (Assignee)
(In reply to Daniel Thorn [:relud] from comment #21)
> main_summary is now sanitized back to 2017-04-29, and shredder_all has been started.
Thank you so much!
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #18)
> We have sanitized mobile derived tables for all history, with the one exception of mobile_search_clients_last_seen, which is currently processing.
This is now complete and verified, and I've just copied it into place.
This means that all data for 2020-01-01 and on is now fully sanitized, including downstream datasets. The shredder_all task will handle the remaining sanitization for earlier dates and should be complete in approximately one week.
Comment 23 • 3 years ago (Assignee)
Once this is complete, note that we need to make sure to clean up the backfill project, revert temporary access changes, revert temporary shredder code, and commit the bigquery-backfill PR.
Comment 24 • 3 years ago (Assignee)
This is complete. Working on cleanup now.
Comment 25 • 3 years ago
Comment 26 • 3 years ago (Assignee)
Staging BQ tables have been cleaned up, the final backfill documentation is merged at https://github.com/mozilla/bigquery-backfill/tree/main/backfill/2022-01-27-sanitize-search-engines, and removal of the temporary shredder code is proposed in https://github.com/mozilla/bigquery-etl/pull/2783.
I will follow up with :whd about backfill project permissions.