Sanitize search engine values in historical search data
Categories: Data Platform and Tools :: General, task, P1
Tracking: Not tracked
People: Reporter: klukas; Assigned: klukas
Attachments: 5 files, 1 obsolete file
This is a follow-up to https://bugzilla.mozilla.org/show_bug.cgi?id=1751753 where we implemented sanitization on ingestion. We will need to apply similar logic to historical data.
The unexpected values propagate all the way to search_aggregates, so we need to consider a whole chain of derived tables:
main_v4
clients_daily_v6
clients_daily_joined_v1
clients_last_seen_v1
clients_last_seen_joined_v1
search_clients_daily
search_clients_last_seen
search_aggregates
The most immediate value would come from first rewriting the most aggregated tables, since those are where most real analysis of search activity is done. I need to think some more about the implications there, because we may end up violating some data contracts; for example, if search_clients_daily has a contract of one row per (client_id, engine, source), sanitization may leave us with several rows that have the same sanitized value for engine, violating the contract. Likewise, the aggregation in clients_daily would produce different results if the input were already sanitized.
So the most stable approach may be to sanitize the rawest data first, and then do a full reprocessing of downstream data so that the sanitized values propagate.
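For illustration, a minimal sketch of what re-establishing that contract could look like. Here sanitize_engine stands in for the cleaning expression from bug 1751753, and the table and metric column names are placeholders, not the actual schema:

  -- Hypothetical sketch: if two raw engine values collapse to the same
  -- sanitized value, the rows must be re-aggregated to restore one row
  -- per (client_id, engine, source).
  SELECT
    submission_date,
    client_id,
    sanitize_engine(engine) AS engine,  -- placeholder for the real cleaning logic
    source,
    SUM(sap) AS sap                     -- every other metric column needs the same treatment
  FROM
    search_clients_daily
  GROUP BY
    submission_date,
    client_id,
    engine,
    source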
Comment 1 • 3 years ago (Assignee)
ni? :relud
Am I correct in remembering that the shredder process for main_v4 uses DELETEs and also uses on-demand query slots? Do you have benchmarks for performance of overwriting partitions using SELECT vs. DELETE? Do you have a good estimate of the rate at which we'd be able to rewrite partitions of main_v4 using a SELECT so that we could sanitize some fields?
Comment 2 • 3 years ago
I don't have benchmarks, but the last time I ran tests, SELECT took about twice as much slot time as DELETE on main_v4. However, DML can only modify 2 partitions at a time, while SELECT has no such limit. Shredder takes over 28 days to process main_v4 with DELETE statements, and I put up https://github.com/mozilla/bigquery-etl/pull/2686 to allow shredder to use SELECT statements instead so that we can increase parallelism. We can deploy that change and compare query performance for a more up-to-date benchmark.
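For reference, the rough shapes of the two kinds of statement being compared here. The project/dataset names and the deletion-request source are placeholders; the real shredder SQL is generated by bigquery-etl:

  -- DML shape: remove matching rows in place. Concurrent DML on the same
  -- table is limited, which is what caps shredder's parallelism here.
  DELETE FROM
    `project.dataset.main_v4`
  WHERE
    DATE(submission_timestamp) = '2022-01-01'
    AND client_id IN (SELECT client_id FROM `project.dataset.deletion_requests`);

  -- SELECT shape: compute the rows to keep and write the result back over
  -- the partition (destination main_v4$20220101 with WRITE_TRUNCATE). There
  -- is no DML concurrency limit, but it costs roughly twice the slot time.
  SELECT
    *
  FROM
    `project.dataset.main_v4`
  WHERE
    DATE(submission_timestamp) = '2022-01-01'
    AND client_id NOT IN (SELECT client_id FROM `project.dataset.deletion_requests`);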
Comment 3 • 3 years ago (Assignee)
Another option here would be to restrict access to the underlying tables and turn telemetry.main into an authorized view that does this cleaning. I don't think that's worth pursuing, though, due to performance impact. We already have a problem with queries on telemetry.main sometimes being rejected due to the query complexity limit, and the modifications needed for this bug's cleaning will significantly increase the base complexity of the query.
Comment 4 • 3 years ago (Assignee)
I've started a draft bigquery-backfill PR to capture the evolving design for running this and to start drafting a desktop query to run the sanitization.
Comment 5 • 3 years ago
Shredder ran overnight for main_v4 with SELECT statements, and 100% of them timed out at 6 hours. Equivalent DELETE statements take about 2 hours each.
Comment 6 • 3 years ago (Assignee)
I've opened support case 29411159 with GCP to help determine an efficient path forward for this effort, and I am coordinating with JJ, our TAM, as well.
Comment 7 • 3 years ago (Assignee)
It appears that we have found a solution using SELECT to replace partitions (rather than using DML like MERGE or UPDATE) and that we will be able to integrate this into the existing shredder machinery. I am aiming to get that integrated and running on main_v4 by EOW (Feb 11). I will need to do some more estimation to figure out when we can have this finished, but it will be several weeks.
Comment 8 • 3 years ago
Comment 9 • 3 years ago (Assignee)
main_v4
We are updating the Shredder config to include sanitization for main_v4, and we are going to pursue running 10 queries concurrently so that the next run takes potentially 1 week rather than 28 days. The main priority right now is getting all those pieces coordinated so that main_v4 processing is running.
main_summary
The next step will be to decide on a plan for main_summary. That table includes a search_counts field derived from the keyed histogram of that name, but doesn't include any of the search scalars. main_summary is deprecated, so we should have no dependencies on it other than existing user-created Redash queries, etc. It would be fine to completely delete this field.
Note that we can't just rely on rerunning the main_summary query, since the table contains more history than what we have in main_v4. We'll have to modify the contents of the table directly. I see problematic values starting in June of 2017, so we'll need to process from at least 2017-06-01, which is nearly the full history of the table.
I'm going to plan to hook this into Shredder like we did for main_v4.
clients_daily and downstream
clients_daily has search data populated back to 2018-10-30, which is the first day of main_v4, so we don't need to worry about sanitizing dates before that. We could rely on running the full query from main_v4 per normal backfill procedures.
Comment 10 • 3 years ago (Assignee)
It looks like we have an existing need to do a backfill of clients_daily and downstream which we can roll this in with: https://bugzilla.mozilla.org/show_bug.cgi?id=1737674
Comment 11 • 3 years ago
Comment 12 • 3 years ago (Assignee)
I kicked off a job this morning to recalculate clients_daily_v6 from 2019-11-22 to present. This should take a few days using on-demand slots. I'll then plan to start reprocessing downstream tables, using reserved slots as much as possible.
Comment 13 • 3 years ago (Assignee)
Plan for desktop derived tables
We have turned off the all task in the Shredder DAG so that the full slot reservation will be used for processing main_summary_v4 for now. We actually won't need to apply shredder to clients_daily and downstream tables for 2019-11-22 and onward, since we're recreating those partitions via query jobs from source data that has already been shredded, so allowing shredder to run on them would be wasted resources.
Once clients_daily and downstream queries are complete for 2019-11-22 onward, we will still need to sanitize earlier dates. We'll add logic to shredder to sanitize this set of tables as well. Once main_summary_v4 is complete, we'll turn the all shredder task back on to process those earlier dates. We'll need to add a partition filter to the shredder logic to skip the dates beyond 2019-11-22 (although we also need to consider the 28-day gap for last_seen-type tables, so we'll need to make sure to address that detail as well).
New logic needed for shredder
We have a consistent search_counts REPEATED engine/source/count field and consistent search_content_*_sum key/value fields in the clients_daily and clients_last_seen-type tables, so we can have one logic branch for those.
The search_clients_daily_v8 and search_aggregates_v8 tables have top-level engine and source fields. This can be another branch of logic.
The search_clients_last_seen_v1 table has normalize_search_engine logic applied, so it does not appear to contain any problematic values and can be out of scope here.
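A minimal sketch of what those two branches could look like. sanitize_engine is a placeholder for the real cleaning expression, search_content_urlbar_sum stands in for the family of search_content_*_sum fields, and table names are illustrative; the actual implementation is the shredder change in bigquery-etl:

  -- Branch 1 (clients_daily / clients_last_seen shape): rewrite the engine
  -- inside the repeated search_counts struct and the keys of the
  -- search_content_*_sum key/value fields.
  SELECT
    * REPLACE (
      ARRAY(
        SELECT AS STRUCT sc.* REPLACE (sanitize_engine(sc.engine) AS engine)
        FROM UNNEST(search_counts) AS sc
      ) AS search_counts,
      ARRAY(
        SELECT AS STRUCT kv.* REPLACE (sanitize_engine(kv.key) AS key)
        FROM UNNEST(search_content_urlbar_sum) AS kv
      ) AS search_content_urlbar_sum
    )
  FROM
    clients_daily_v6
  WHERE
    submission_date = @submission_date;

  -- Branch 2 (search_clients_daily_v8 / search_aggregates_v8 shape): the
  -- engine is a top-level column, so a simple column replacement suffices.
  SELECT
    * REPLACE (sanitize_engine(engine) AS engine)
  FROM
    search_clients_daily_v8
  WHERE
    submission_date = @submission_date;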
Comment 14 • 3 years ago
:whd, or any data SRE, would you kindly run the attached query to remove shredder state for non-sanitizing main summary tasks in the current run? This will allow shredder to run again on those dates, this time with sanitizing logic in place.
Non-sanitizing tasks are detected as those where statement_type for the associated job_id was DELETE.
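The attached query is authoritative; as a rough sketch of the idea only, it has roughly this shape. The task_id filter and the join of job_id against INFORMATION_SCHEMA are assumptions about formats, not the actual statement:

  -- Hypothetical shape of the cleanup: drop state rows whose recorded job was
  -- a plain DELETE so shredder will redo those dates with sanitizing SELECTs.
  DELETE FROM
    `moz-fx-data-shredder.shredder_state.shredder_state`
  WHERE
    task_id LIKE '%main_summary_v4%'
    AND job_id IN (
      SELECT job_id
      FROM `moz-fx-data-shredder.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
      WHERE statement_type = 'DELETE'
    )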
Comment 15 • 3 years ago
Updated the query to also remove state for clients_daily_v6, clients_daily_joined_v1, clients_last_seen_v6, clients_last_seen_joined_v1, and search_clients_daily_v8.
Comment 17 • 3 years ago (Assignee)
https://github.com/mozilla/bigquery-etl/pull/2731 is now merged, so we have all the necessary logic hooked up for the shredder_all task to do the scrubbing we need when we turn it back on.
Comment 18 • 3 years ago (Assignee)
Status as of 2022-02-23
I continue to update https://github.com/mozilla/bigquery-backfill/pull/17 with status as this progresses, but wanted to document it in this bug as well. I'll be on PTO for the rest of this week, so I will next update on the 28th.
We have fully sanitized all source tables, for both desktop and mobile.
We have sanitized derived tables for historical desktop data from 2020-01-01 onward. The one exception to this is main_summary, which is still running but should complete in the next few days.
We have sanitized mobile derived tables for all history, with the one exception of mobile_search_clients_last_seen, which is currently processing.
Comment 19 • 3 years ago
The delete_state.sql query swapped task_id and job_id in the table.
This is because SELECT * FROM shredder_state LEFT JOIN (SELECT job_id FROM job_ids) USING (job_id) moves the join key to the top of the column list, and CREATE TABLE ... LIKE statements only use column order and ignore names.
The column order for the table is task_id, job_id, ..., but for the query it was job_id, task_id, ..., and the types still matched.
:whd, would you kindly run the attached update statement to restore the values to their correct columns?
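For context, a plausible shape of that fix (the attached statement is authoritative): standard SQL evaluates the SET expressions against the pre-update row, so the two columns can be swapped directly, and WHERE TRUE satisfies BigQuery's requirement that an UPDATE carry a WHERE clause.

  -- Sketch only; the attached update statement is what was actually run.
  UPDATE
    `moz-fx-data-shredder.shredder_state.shredder_state`
  SET
    task_id = job_id,
    job_id = task_id
  WHERE
    TRUE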
Comment 20 • 3 years ago
Ran the query.
Query complete (15.7 sec elapsed, 120.7 MB processed)
This statement modified 799,352 rows in moz-fx-data-shredder:shredder_state.shredder_state.
Comment 21 • 3 years ago
main_summary is now sanitized back to 2017-04-29, and shredder_all has been started.
Comment 22 • 3 years ago (Assignee)
(In reply to Daniel Thorn [:relud] from comment #21)
> main_summary is now sanitized back to 2017-04-29, and shredder_all has been started.
Thank you so much!
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #18)
> We have sanitized mobile derived tables for all history, with the one exception of mobile_search_clients_last_seen, which is currently processing.
This is now complete and verified, and I've just copied it into place.
This means that all data for 2020-01-01 and on is now fully sanitized, including downstream datasets. The shredder_all task will handle the remaining sanitization for earlier dates and should be complete in approximately one week.
Comment 23 • 3 years ago (Assignee)
Once this is complete, note that we need to make sure to clean up the backfill project, revert temporary access changes, revert temporary shredder code, and commit the bigquery-backfill PR.
Comment 24 • 3 years ago (Assignee)
This is complete. Working on cleanup now.
Comment 25 • 3 years ago
Comment 26 • 3 years ago (Assignee)
Staging BQ tables have been cleaned up, the final backfill documentation is merged at https://github.com/mozilla/bigquery-backfill/tree/main/backfill/2022-01-27-sanitize-search-engines, and removal of the temporary shredder code is proposed in https://github.com/mozilla/bigquery-etl/pull/2783.
I will follow up with :whd about backfill project permissions.