Bug 1751979 · Opened 3 years ago · Closed 3 years ago

Sanitize search engine values in historical search data

Categories

(Data Platform and Tools :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

References

Details

Attachments

(5 files, 1 obsolete file)

This is a follow-up to https://bugzilla.mozilla.org/show_bug.cgi?id=1751753 where we implemented sanitization on ingestion. We will need to apply similar logic to historical data.

The unexpected values propagate all the way to search_aggregates, so we need to consider a whole chain of derived tables:

  • main_v4
  • clients_daily_v6
  • clients_daily_joined_v1
  • clients_last_seen_v1
  • clients_last_seen_joined_v1
  • search_clients_daily
  • search_clients_last_seen
  • search_aggregates

We would deliver the most value soonest by first rewriting the most aggregated tables, which are where most real analysis of search activity happens. I need to think some more about the implications there, because we may end up violating some data contracts; for example, if search_clients_daily has a contract of one row per (client_id, engine, source), sanitization may leave us with several rows that share the same sanitized value for engine, violating the contract. Likewise, the aggregation in clients_daily would produce different results if the input were already sanitized.
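To make that concrete, restoring the contract after sanitization would force a re-aggregation, roughly like this (a minimal sketch; the regex condition, the sap column, and the table path are illustrative, not the actual implementation):

WITH the following shape in mind:

-- Sketch: collapse rows that map to the same sanitized engine so the
-- one-row-per-(client_id, engine, source) contract still holds. The regex
-- and the 'other' fallback stand in for the real sanitization logic.
SELECT
  submission_date,
  client_id,
  IF(REGEXP_CONTAINS(engine, r'^[a-zA-Z0-9._-]+$'), engine, 'other') AS engine,
  source,
  SUM(sap) AS sap  -- one of several count columns that would need summing
FROM
  search_derived.search_clients_daily_v8
GROUP BY
  submission_date, client_id, engine, source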

So the most stable approach may be to sanitize the rawest data first, and then do a full reprocessing of downstream data in order to have the sanitized values propagate.

Assignee: nobody → jklukas
Priority: -- → P1
Depends on: 1751753

ni? :relud

Am I correct in remembering that the shredder process for main_v4 uses DELETEs and also uses on-demand query slots? Do you have benchmarks for performance of overwriting partitions using SELECT vs. DELETE? Do you have a good estimate of the rate at which we'd be able to rewrite partitions of main_v4 using a SELECT so that we could sanitize some fields?

Flags: needinfo?(dthorn)

I don't have benchmarks, but the last time I ran tests, SELECT took about twice as much slot time as DELETE on main_v4; on the other hand, DML can only modify 2 partitions at a time, while SELECT has no such limit. Shredder takes over 28 days to process main_v4 with DELETE statements, so I put up https://github.com/mozilla/bigquery-etl/pull/2686 to allow shredder to use SELECT statements instead and increase parallelism. We can deploy that change and compare query performance for a more up-to-date benchmark.
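For reference, the two shapes look roughly like this (a sketch; the pending_deletions subquery is illustrative, and the stable-table path is assumed):

-- DML shape shredder uses today: delete matching rows in place.
DELETE FROM
  `moz-fx-data-shared-prod.telemetry_stable.main_v4`
WHERE
  DATE(submission_timestamp) = '2022-01-01'
  AND client_id IN (SELECT client_id FROM pending_deletions);  -- illustrative

-- SELECT shape: keep the complement of the deleted rows and write the
-- result back over the same partition. Plain SQL can't name a partition
-- as a destination, so main_v4$20220101 with WRITE_TRUNCATE is set in
-- the job configuration.
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_stable.main_v4`
WHERE
  DATE(submission_timestamp) = '2022-01-01'
  AND client_id NOT IN (SELECT client_id FROM pending_deletions);  -- illustrative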

Flags: needinfo?(dthorn)

Another option here would be to restrict access to the underlying tables and turn telemetry.main into an authorized view that does this cleaning. I don't think that's worth pursuing, though, due to the performance impact: we already have a problem with queries on telemetry.main sometimes being rejected due to the query complexity limit, and the modifications needed for this bug's cleaning would significantly increase the base complexity of the query.
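For the record, that rejected option would look roughly like this (a sketch only; search_engine is a placeholder top-level column, whereas the real cleaning would target keyed histograms and scalars nested inside payload, which is exactly what would blow up the view's complexity):

-- Sketch of the rejected authorized-view option: sanitize on read.
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.main` AS
SELECT
  * REPLACE (
    -- placeholder: the real fields are nested and far more numerous
    IF(REGEXP_CONTAINS(search_engine, r'^[a-zA-Z0-9._-]+$'),
       search_engine, 'other') AS search_engine
  )
FROM
  `moz-fx-data-shared-prod.telemetry_stable.main_v4`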

Group: mozilla-employee-confidential

I've started a draft bigquery-backfill PR to capture the evolving design for running this and to start drafting a desktop query to run the sanitization:

https://github.com/mozilla/bigquery-backfill/pull/17

Shredder ran overnight for main_v4 with SELECT statements, and 100% of them timed out at 6 hours. Equivalent DELETE statements take about 2 hours each.

I've opened support case 29411159 with GCP to help determine an efficient path forward for this effort, and I am coordinating with JJ, our TAM, as well.

It appears that we have found a solution using SELECT to replace partitions (rather than using DML like MERGE or UPDATE) and that we will be able to integrate it into the existing shredder machinery. I am aiming to have that integrated and running on main_v4 by EOW (Feb 11). I will need to do some more estimation to figure out when we can have this finished, but it will be several weeks.

main_v4

We are updating the Shredder config to include sanitization for main_v4, and we are going to run 10 queries concurrently so that the next run takes potentially 1 week rather than 28 days. The main priority right now is coordinating all those pieces to get main_v4 processing running.

main_summary

Next step will be to decide on a plan for main_summary. That table includes a search_counts field derived from the keyed histogram of the same name, but doesn't include any of the search scalars. main_summary is deprecated, so we should have no dependencies on it other than existing user-created Redash queries, etc. It would be fine to delete this field entirely.

Note that we can't just rerun the main_summary query, since the table contains more history than we have in main_v4. We'll have to modify the contents of the table directly. I see problematic values starting in June 2017, so we'll need to process from at least 2017-06-01, which is nearly the full history of the table.
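If we go the route of deleting the field outright, the in-place modification could be as simple as this sketch (table path and partition column name assumed; shredder would scope it per partition):

-- Sketch: clear the search_counts payloads in place for the affected
-- history. We set the repeated field to [] rather than NULL because
-- BigQuery stores array columns as empty rather than NULL.
UPDATE
  `moz-fx-data-shared-prod.telemetry_derived.main_summary_v4`
SET
  search_counts = []
WHERE
  submission_date >= '2017-06-01'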

I'm going to plan to hook this into Shredder like we did main_v4.

clients_daily and downstream

clients_daily has search data populated back to 2018-10-30, which is the first day of main_v4, so we don't need to worry about sanitizing dates before that. We could rely on running the full query from main_v4 per normal backfill procedures.

Depends on: 1737674

It looks like we have an existing need to do a backfill of clients_daily and downstream which we can roll this in with: https://bugzilla.mozilla.org/show_bug.cgi?id=1737674

I kicked off a job this morning to recalculate clients_daily_v6 from 2019-11-22 to present. This should take a few days using on-demand slots. I'll then plan to start reprocessing downstream tables using reserved slots as much as possible.

Plan for desktop derived tables

We have turned off the all task in the Shredder DAG so that the full slot reservation is used for processing main_summary_v4 for now. We won't actually need to apply shredder to clients_daily and downstream tables for 2019-11-22 onward, since we're recreating those partitions via query jobs from source data that has already been shredded; letting shredder run on them would waste resources.

Once the clients_daily and downstream queries are complete for 2019-11-22 onward, we will still need to sanitize earlier dates. We'll add logic to shredder to sanitize this set of tables as well. Once main_summary_v4 is complete, we'll turn the all shredder task back on to process those earlier dates. We'll need to add a partition filter to the shredder logic that skips dates from 2019-11-22 onward (though last_seen-type tables carry values forward for up to 28 days, so their cutoff must account for that gap; see the sketch below).
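The cutoff arithmetic, roughly (illustrative only; the exact handling belongs in the shredder config):

-- Partitions from 2019-11-22 on are already rebuilt from sanitized data,
-- but a last_seen partition can carry a value forward from up to 28 days
-- earlier, so its safe cutoff lands 28 days later.
DECLARE backfill_start DATE DEFAULT '2019-11-22';

SELECT
  backfill_start AS daily_table_cutoff,
  DATE_ADD(backfill_start, INTERVAL 28 DAY) AS last_seen_table_cutoff;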

New logic needed for shredder

The clients_daily and clients_last_seen-type tables have a consistent search_counts REPEATED engine/source/count field and consistent search_content_*_sum key/value fields, so we can use one logic branch for those.

The search_clients_daily_v8 and search_aggregates_v8 tables have top-level engine and source fields. This can be another branch of logic.
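A sketch of the two branches (the regex condition and table paths are illustrative, not the merged shredder implementation):

-- Branch 1 (clients_daily / clients_last_seen shape): rewrite the engine
-- inside the REPEATED search_counts field.
SELECT
  * REPLACE (
    ARRAY(
      SELECT AS STRUCT
        IF(REGEXP_CONTAINS(s.engine, r'^[a-zA-Z0-9._-]+$'),
           s.engine, 'other') AS engine,
        s.source,
        s.count
      FROM UNNEST(search_counts) AS s
    ) AS search_counts
  )
FROM
  `moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6`
WHERE
  submission_date < '2019-11-22';

-- Branch 2 (search_clients_daily_v8 / search_aggregates_v8 shape): engine
-- is a top-level column, so a plain REPLACE suffices.
SELECT
  * REPLACE (
    IF(REGEXP_CONTAINS(engine, r'^[a-zA-Z0-9._-]+$'),
       engine, 'other') AS engine
  )
FROM
  `moz-fx-data-shared-prod.search_derived.search_clients_daily_v8`
WHERE
  submission_date < '2019-11-22';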

The search_clients_last_seen_v1 table has normalize_search_engine logic applied, so does not appear to contain any problematic values and can be out of scope here.

Attached file delete_state.sql (obsolete)

:whd, or any data SRE, would you kindly run the attached query to remove shredder state for the non-sanitizing main_summary tasks in the current run? This will allow shredder to run again on those dates, this time with the sanitizing logic in place.

Non-sanitizing tasks are detected as those where the statement_type for the associated job_id was DELETE.
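The detection amounts to something like this (a sketch; the state-table schema and the joinability of its job_id values are assumed):

-- Sketch: join shredder state to the BigQuery jobs metadata and keep rows
-- whose job ran a plain DELETE, i.e. predates the sanitizing logic.
SELECT
  state.task_id,
  state.job_id
FROM
  `moz-fx-data-shredder.shredder_state.shredder_state` AS state
JOIN
  `moz-fx-data-shredder.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT` AS jobs
ON
  state.job_id = jobs.job_id
WHERE
  jobs.statement_type = 'DELETE'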

Flags: needinfo?(whd)
Attached file delete_state.sql

Updated the query to also remove state for clients_daily_v6, clients_daily_joined_v1, clients_last_seen_v6, clients_last_seen_joined_v1, and search_clients_daily_v8.

Attachment #9265035 - Attachment is obsolete: true

I ran the SQL from comment #15, which succeeded.

Flags: needinfo?(whd)

https://github.com/mozilla/bigquery-etl/pull/2731 is now merged, so we have all necessary logic hooked up now for the shredder_all task to do the scrubbing we need when we turn it back on.

Status as of 2022-02-23

I continue to update https://github.com/mozilla/bigquery-backfill/pull/17 with status as this progresses, but wanted to document it in this bug as well. I'll be on PTO for the rest of this week, so my next update will be on the 28th.

We have fully sanitized all source tables, for both desktop and mobile.

We have sanitized derived tables for historical desktop data from 2020-01-01 onward. The one exception is main_summary, which is still running but should complete in the next few days.

We have sanitized mobile derived tables for all history with the one exception of mobile_search_clients_last_seen, which is currently processing.

Attached file fix_column

The delete_state.sql query swapped task_id and job_id in the table.

This is because SELECT * FROM shredder_state LEFT JOIN (SELECT job_id FROM job_ids) USING (job_id) moves the join key to the top of the column list, and CREATE TABLE ... LIKE statements only use column order, ignoring names. The column order for the table is (task_id, job_id, ...), but for the query it was (job_id, task_id, ...), and the types still matched.
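A minimal demonstration of that pitfall:

-- USING (job_id) hoists the join key to the front of SELECT *, so the
-- result columns come out as (job_id, task_id), not (task_id, job_id).
WITH
  shredder_state AS (SELECT 'task-a' AS task_id, 'job-1' AS job_id),
  job_ids AS (SELECT 'job-1' AS job_id)
SELECT *
FROM shredder_state
LEFT JOIN job_ids USING (job_id)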

:whd, would you kindly run the attached update statement to restore the values to their correct columns?

Flags: needinfo?(whd)

Ran the query.
Query complete (15.7 sec elapsed, 120.7 MB processed)
This statement modified 799,352 rows in moz-fx-data-shredder:shredder_state.shredder_state.

Flags: needinfo?(whd)

main_summary is now sanitized back to 2017-04-29, and shredder_all has been started.

(In reply to Daniel Thorn [:relud] from comment #21)

> main_summary is now sanitized back to 2017-04-29, and shredder_all has been started.

Thank you so much!

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #18)

> We have sanitized mobile derived tables for all history with the one exception of mobile_search_clients_last_seen, which is currently processing.

This is now complete and verified and I've just copied it into place.

This means that all data from 2020-01-01 onward is now fully sanitized, including downstream datasets. The shredder_all task will handle the remaining sanitization for earlier dates and should complete in approximately one week.

Blocks: 1752317

Once this is complete, note that we need to clean up the backfill project, revert temporary access changes, revert temporary shredder code, and merge the bigquery-backfill PR.

This is complete. Working on cleanup now.

Staging BQ tables have been cleaned up, the final backfill documentation is merged at https://github.com/mozilla/bigquery-backfill/tree/main/backfill/2022-01-27-sanitize-search-engines, and removal of the shredder code is proposed in https://github.com/mozilla/bigquery-etl/pull/2783.

Will follow up with :whd about backfill project permissions.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED