Closed Bug 1661565 Opened 1 year ago Closed 11 months ago

Backfill pings for `#/environment/system/gfx/adapters/N/GPUActive` from 2020-07-04 to 2020-08-20


(Data Platform and Tools :: General, task, P1)



(Not tracked)



(Reporter: amiyaguchi, Assigned: amiyaguchi)



(Whiteboard: [data-quality])


(2 files)

Bug 1657142 fixed schema errors related to GPUActive fields. The extent of the errors goes from 2020-07-04 until it was fixed on 2020-08-20. These dates can be backfilled from the errors table.

Let me know if you want to talk through setup for the backfill. It looks like it's actually been quite a while since we've done a backfill from the errors table:

There's a more recent backfill done here, but it'd certainly be a good idea to talk through the setup when the time comes around:

Anthony, are you going to take this?

Flags: needinfo?(amiyaguchi)

Yes, I'll take this, but I'll leave it as a P3.

Assignee: nobody → amiyaguchi
Flags: needinfo?(amiyaguchi)
Priority: P3 → P1

I've taken a closer look at this, now 4 months after the initial filing of the bug. There are 4.1 million documents in total to be backfilled; however, the live table has a 30-day retention policy, so it no longer holds the documents for these dates.

Reading through the 2020-01-23 sync ping backfill, it looks like a few modifications are needed for this to go through successfully.

From the sync backfill:

  • Process from errors into a live table in a backfill project.
  • Append live tables in the backfill project to the live tables in the shared prod project.
  • Run copy_deduplicate from the live tables into the stable tables.

This assumes that the live table exists in full. Reading through copy_deduplicate, it looks like it does a WRITE_TRUNCATE. It seems like this could lead to data loss if the live table only held documents from the backfilled errors.
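To make the data-loss concern concrete, here is a minimal Python sketch that models one day's stable partition as a list of document IDs. It is a toy model, not the actual copy_deduplicate code; the function names and document IDs are made up for illustration.

```python
# Toy model of one day's partition in a stable table.
# Function names and document IDs are illustrative only.

def write_truncate(partition, new_rows):
    """Replace the partition contents, as a WRITE_TRUNCATE job would."""
    return list(new_rows)


def append_missing(partition, new_rows):
    """Keep existing rows and add only document IDs not already present."""
    existing = set(partition)
    return list(partition) + [doc for doc in new_rows if doc not in existing]


# The stable partition already holds documents that passed validation.
stable = ["doc-a", "doc-b", "doc-c"]
# The backfill live table holds only documents recovered from the errors table.
backfill_live = ["doc-x", "doc-y"]

truncated = write_truncate(stable, backfill_live)
appended = append_missing(stable, backfill_live)

print(truncated)  # the three original documents are gone
print(appended)   # original documents plus the recovered ones
```

With a live table that only holds the recovered documents, the truncating write drops everything that was already in the stable partition, while the append keeps it.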

I can imagine a process that appends directly to the stable table with the following pseudo-SQL as the source data:

declare target_date date;
set target_date = date "2020-08-20";

-- documents should be deduplicated in the live table too...
select *
from backfill.telemetry_live.bhr_v4
where date(submission_timestamp) = target_date
  -- keep only documents that are not already in the stable table
  and document_id not in (
    select distinct document_id
    from shared_prod.telemetry_stable.bhr_v4
    where date(submission_timestamp) = target_date
  );
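
The same selection logic can be sketched in plain Python to show what the query is meant to return. This is only an illustration under assumed inputs: rows are dictionaries with a `document_id` key, and the helper name `rows_to_append` is hypothetical.

```python
def rows_to_append(backfill_rows, stable_rows):
    """Return backfill rows absent from stable, one row per document_id."""
    seen = {row["document_id"] for row in stable_rows}
    result = []
    for row in backfill_rows:
        doc_id = row["document_id"]
        if doc_id in seen:
            # Already in stable, or a duplicate within the backfill itself.
            continue
        seen.add(doc_id)
        result.append(row)
    return result


stable_rows = [{"document_id": "a"}, {"document_id": "b"}]
backfill_rows = [
    {"document_id": "b"},  # already in stable, skipped
    {"document_id": "c"},
    {"document_id": "c"},  # duplicate within the backfill, skipped
    {"document_id": "d"},
]

selected = rows_to_append(backfill_rows, stable_rows)
print([row["document_id"] for row in selected])
```

Unlike the pseudo-SQL, this sketch also drops duplicates inside the backfill set, which is the deduplication the SQL comment says is still needed.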

One note: client IDs that have already gone through the shredder process may be reintroduced into these tables. Presumably these dates will be reprocessed against the entire deletion-request table, so this should not be an issue.

Is the modification above of skipping the copy deduplicate stage reasonable?

Flags: needinfo?(jklukas)

I had a quick conversation with :klukas to go through the plan. We can append directly from a stable table in the backfill project into the shared prod project without having to go through copy-deduplicate.

The process will look something like this:

  • Filter out the set of documents directly from payload bytes format. Apply de-duplication and remove relevant client ids from deletion requests as necessary.
  • Run the beam job to populate stable tables in the backfill project
  • Run a bq cp from backfill to shared prod.
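
As a rough illustration of the deletion-request part of the first step, the sketch below drops documents whose client ID appears in a set of deletion requests. The field names and the helper `drop_deleted_clients` are assumptions for the example; this is not how shredder itself is implemented.

```python
def drop_deleted_clients(documents, deletion_requests):
    """Remove documents whose client_id has a deletion request on file."""
    deleted = {request["client_id"] for request in deletion_requests}
    return [doc for doc in documents if doc.get("client_id") not in deleted]


documents = [
    {"document_id": "1", "client_id": "client-a"},
    {"document_id": "2", "client_id": "client-b"},
    {"document_id": "3", "client_id": "client-a"},
]
deletion_requests = [{"client_id": "client-a"}]

kept = drop_deleted_clients(documents, deletion_requests)
print([doc["document_id"] for doc in kept])
```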
Flags: needinfo?(jklukas)

I've created the set of stable tables that can be copied into the prod project now. The procedure is something like this:

  • Mirror tables from the production project for the live and stable tables.
  • Copy the subset of payload bytes errors into the backfill project, then run the beam decoder job on it to populate the live table in the backfill project.
  • Optionally prune the set of empty tables from the live dataset, then run copy_deduplicate over the live dataset.
  • Prune the set of empty tables from the stable dataset, then run shredder_delete on the stable dataset.
  • Append backfill stable tables from backfill to shared prod.
  • Delete old error set from shared prod and append errors from backfill into shared prod.
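
One possible way to script the append step is to build one `bq cp --append_table` command per day partition. The sketch below only constructs the commands and does not run them. The project IDs are placeholders, the table is the `bhr_v4` example used earlier in this bug, and treating 2020-08-20 as inclusive is an assumption.

```python
from datetime import date, timedelta

# Placeholder project IDs; substitute the real backfill and prod projects.
BACKFILL_PROJECT = "backfill-project"
PROD_PROJECT = "shared-prod-project"


def backfill_dates(start, end):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def append_command(dataset, table, day):
    """Build a `bq cp --append_table` command for a single day partition."""
    partition = f"{dataset}.{table}${day:%Y%m%d}"
    return [
        "bq", "cp", "--append_table",
        f"{BACKFILL_PROJECT}:{partition}",
        f"{PROD_PROJECT}:{partition}",
    ]


commands = [
    append_command("telemetry_stable", "bhr_v4", day)
    for day in backfill_dates(date(2020, 7, 4), date(2020, 8, 20))
]

print(len(commands))
print(" ".join(commands[0]))
```

Keeping each command as an argument list means it can be passed to `subprocess.run` without a shell, which avoids having to quote the `$` in the partition decorator.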

Running copy_deduplicate and shredder_delete inside the backfill project is pretty reasonable. The options need to be inspected closely and there are some caveats to usage, but it's mostly straightforward.

The backfill is complete: the stable tables and errors have been appended to prod. There was a slight issue with the table clustering being incorrect on the stable tables in the backfill project, so I had to mirror them in a separate project to cluster the data before the append could be done in prod (notes in the PR).

Closed: 11 months ago
Resolution: --- → FIXED