Delete all Parquet files from S3
Categories
(Data Platform and Tools Graveyard :: Operations, task, P1)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: robotblake)
This is part of the overall H2 effort towards retiring the AWS telemetry ingestion pipeline and data lake.
We will not be able to delete the data lake in AWS until the AWS pipeline is shut down and all relevant data has been imported to GCP and is available in BigQuery. At the time of filing this bug, our target is late November.
Reporter
Comment 1•5 years ago
Note that we are nearly done with the backfill of heka data to BQ, so we will be deleting heka files from AWS in the near future. After that's done, we'll turn our attention more seriously to deleting Parquet data. We may start altering BQ views next week to avoid using imported Parquet data, so I believe we're still on track to turn off Parquet datasets by end of November.
Comment 2•5 years ago
Are we planning on moving parquet files to a GCP filestore or are parquet files that are not migrated to BQ going to be lost? For clarity, both of these options seem reasonable to me!
Thanks!
Reporter
Comment 3•5 years ago
Generally, the Parquet datasets are now available as BQ tables. We are considering the BQ tables the migration path and we have no plan to store the Parquet files in GCS.
Do let us know if there is any specific dataset you're worried about.
Comment 4•5 years ago
I'm currently working on an analysis that depends on telemetry_ip_privacy_parquet. I was hoping I could dispatch the analysis this week, but it looks like it's going to continue for a while. I just found this table so I haven't talked to anyone about preserving it. Is it something we can make available in GCP?
Reporter
Comment 5•5 years ago
If we haven't already copied that over and made it available in BQ, it's on the list to be made available. More info will be coming in this bug.
Reporter
Comment 6•5 years ago
Looks like it already exists as moz-fx-data-derived-datasets.telemetry_derived.telemetry_ip_privacy_parquet_v1 and should be accessible as telemetry.telemetry_ip_privacy_parquet_v1 in Redash, etc.
Comment 7•5 years ago
Awesome! Thanks!
Reporter
Comment 8•5 years ago
There appears to be some significant confusion about the timeline for retiring Parquet data. I am going to send out a message to fx-data-dev and ask for feedback in this bug about particular use cases that would be severely impacted by that timeline.
Reporter
Comment 9•5 years ago
There was no new feedback about critical Parquet use cases, though folks did appreciate having more clarity on the timeline. I checked in with :jason and :robotblake and sent out additional communication today to indicate we'll remove access to all Parquet data on 12/11 and then hard delete on 12/18.
Reporter
Comment 10•5 years ago
?ni :robotblake Are you the one to shut off access permissions to telemetry-parquet? Can you clear the needsinfo when that's done?
Assignee
Comment 11•5 years ago
Yep, I can take care of that.
Assignee
Comment 12•5 years ago
So we're keeping taar/* and telemetry-ml/* prefixes temporarily because TAAR is still using them. Also socorro_crash/v2/* is still being written to by Airflow and is potentially still being used by Athena.
Any others?
Assignee
Comment 13•5 years ago
Turns out socorro_crash/v2/* isn't actually needed.
Access to all data has been removed with the exception of TAAR which still has access to the previously mentioned prefixes.
Assignee
Comment 14•5 years ago
All the data besides the previously mentioned prefixes has been removed!
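For anyone repeating a cleanup like this, the "delete everything except a few prefixes" rule can be sketched as below. This is only an illustration, not the procedure actually used in this bug: the function names and the sample keys are hypothetical, the retained prefixes are the taar/ and telemetry-ml/ ones mentioned in comment 12, and the sketch only sorts key names without calling S3 at all.

```python
# Hypothetical sketch: split object keys into "keep" and "delete" lists
# based on a set of retained prefixes. No S3 calls are made here.

# Prefixes named in comment 12 as still in use by TAAR.
RETAINED_PREFIXES = ("taar/", "telemetry-ml/")


def is_retained(key, retained=RETAINED_PREFIXES):
    """Return True if the key falls under any retained prefix."""
    return key.startswith(tuple(retained))


def partition_keys(keys, retained=RETAINED_PREFIXES):
    """Return (keep, delete) lists, preserving the input order."""
    keep, delete = [], []
    for key in keys:
        if is_retained(key, retained):
            keep.append(key)
        else:
            delete.append(key)
    return keep, delete


if __name__ == "__main__":
    # Made-up key names, for illustration only.
    sample = [
        "taar/example/part-0.parquet",
        "socorro_crash/v2/part-0.parquet",
        "telemetry-ml/example/model.bin",
        "some_dataset/v1/part-0.parquet",
    ]
    keep, delete = partition_keys(sample)
    print("keep:", keep)
    print("delete:", delete)
```

Reviewing the "delete" list before acting on it is the point of keeping the partition step separate from any actual deletion.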