I've taken a closer look at this, now 4 months after the initial filing of the bug. There are 4.1 million documents in total to be backfilled; however, these documents no longer exist in the live table, which has a 30-day retention policy.
Reading through the 2020-01-23 sync ping backfill, it looks like a few modifications are needed for this to go through successfully.
From the sync backfill:
- Process from errors into a live table in a backfill project.
- Append live tables in the backfill project to the live tables in the shared prod project.
- Run copy_deduplicate from the live tables into the stable tables.
This assumes that the live table exists in full. Reading through copy_deduplicate, it looks like it does a WRITE_TRUNCATE. It seems like this could lead to data loss if the live table only held documents from the backfilled errors, since the truncate would replace what is already in the stable table with just those documents.
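To make the concern concrete, here is a toy sketch in Python. It is not the real copy_deduplicate code: the `write_truncate` helper and the dict-based "tables" are hypothetical stand-ins for one day's partition, used only to show what a truncating write does when the source holds nothing but backfilled documents.

```python
# Toy model (hypothetical): a partition is a dict of document_id -> row.
def write_truncate(destination, source_rows):
    """Replace the destination partition with the deduplicated source rows."""
    destination.clear()  # the truncate step: existing rows are dropped
    for row in source_rows:
        # keep the first row seen for each document_id (deduplication)
        destination.setdefault(row["document_id"], row)
    return destination


# Stable partition as it exists today, populated before the live data expired.
stable = {doc_id: {"document_id": doc_id} for doc_id in ("a", "b", "c")}
before = set(stable)

# The live table now holds only the documents recovered from errors.
live_backfill_only = [{"document_id": "d"}, {"document_id": "d"}]

write_truncate(stable, live_backfill_only)
lost = before - set(stable)
print("remaining:", sorted(stable), "lost:", sorted(lost))
```

In this model the stable partition ends up with only the backfilled document, and everything that was there before is gone.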
I can imagine a process that appends directly to the stable table with the following pseudo-SQL as the source data:
declare target_date DATE;
set target_date = date "2020-08-20";
select *
from <live table in the backfill project>
-- documents should be deduplicated in the live table too...
where date(submission_timestamp) = target_date
and document_id not in (
select distinct document_id
from <stable table>
where date(submission_timestamp) = target_date
)
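As a rough, runnable illustration of the append described by the pseudo-SQL above, the sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery. The table names `stable` and `backfill_live`, the two-column schema, and the `append_missing` helper are assumptions made for this example, not the real tables or tooling.

```python
import sqlite3


def append_missing(conn, target_date):
    """Append backfilled documents for target_date that the stable table lacks."""
    conn.execute(
        """
        INSERT INTO stable (document_id, submission_timestamp)
        SELECT document_id, MIN(submission_timestamp)  -- dedupe within the backfill
        FROM backfill_live
        WHERE date(submission_timestamp) = :d
          AND document_id NOT IN (
            SELECT document_id FROM stable
            WHERE date(submission_timestamp) = :d
          )
        GROUP BY document_id
        """,
        {"d": target_date},
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
for table in ("stable", "backfill_live"):
    conn.execute(f"CREATE TABLE {table} (document_id TEXT, submission_timestamp TEXT)")
conn.executemany(
    "INSERT INTO stable VALUES (?, ?)",
    [("a", "2020-08-20 01:00:00"), ("b", "2020-08-20 02:00:00")],
)
conn.executemany(
    "INSERT INTO backfill_live VALUES (?, ?)",
    [
        ("b", "2020-08-20 02:00:00"),  # already present in stable
        ("c", "2020-08-20 03:00:00"),  # new document
        ("c", "2020-08-20 03:00:05"),  # duplicate of the new document
        ("d", "2020-08-21 00:10:00"),  # different day, not part of this run
    ],
)

append_missing(conn, "2020-08-20")
stable_ids = [r[0] for r in conn.execute("SELECT document_id FROM stable ORDER BY document_id")]
print(stable_ids)
```

Existing rows in the stable table are left untouched, and each missing document for the target day is appended exactly once, which is the property the append-only approach relies on.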
One thing to note is that this may reintroduce client ids that have already gone through the shredder process into these tables. Presumably, these dates will be reprocessed using the entire deletion-request table, so this should not be an issue.
Is the modification above, skipping the copy_deduplicate stage and appending directly to the stable table, reasonable?