Closed Bug 1579441 Opened 5 years ago Closed 5 years ago

Delete all Parquet files from S3

Categories

(Data Platform and Tools Graveyard :: Operations, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: robotblake)

This is part of the overall H2 effort towards retiring the AWS telemetry ingestion pipeline and data lake.

We will not be able to delete the data lake in AWS until the AWS pipeline is shut down and all relevant data is imported to GCP and available in BigQuery. At the time of creating this bug, our target is late November.

Note that we are nearly done with the backfill of heka data to BQ, so we will be deleting heka files from AWS in the near future. After that's done, we'll turn our attention more seriously to deleting Parquet data. We may start altering BQ views next week to avoid using imported Parquet data, so I believe we're still on track to turn off Parquet datasets by end of November.

Are we planning on moving Parquet files to a GCP filestore, or are Parquet files that are not migrated to BQ going to be lost? For clarity, both of these options seem reasonable to me!

Thanks!

Generally, the Parquet datasets are now available as BQ tables. We are considering the BQ tables the migration path and we have no plan to store the Parquet files in GCS.

Do let us know if there is any specific dataset you're worried about.

Depends on: 1594823

I'm currently working on an analysis that depends on telemetry_ip_privacy_parquet. I was hoping I could dispatch the analysis this week, but it looks like it's going to continue for a while. I just found this table so I haven't talked to anyone about preserving it. Is it something we can make available in GCP?

Flags: needinfo?(jklukas)

If we haven't already copied that over and made it available in BQ, it's on the list to be made available. More info will be coming in this bug.

Flags: needinfo?(jklukas)

Looks like it already exists as moz-fx-data-derived-datasets.telemetry_derived.telemetry_ip_privacy_parquet_v1 and should be accessible as telemetry.telemetry_ip_privacy_parquet_v1 in Redash, etc.

Awesome! Thanks!

Depends on: 1596552

There appears to be some significant confusion about the timeline for retiring Parquet data. I am going to send out a message to fx-data-dev and ask for feedback in this bug about particular use cases that would be severely impacted by that timeline.

There was no new feedback about critical Parquet use cases, though folks did appreciate having more clarity on the timeline. I checked in with :jason and :robotblake and sent out additional communication today to indicate we'll remove access to all Parquet data on 12/11 and then hard delete on 12/18.

ni? :robotblake Are you the one to shut off access permissions to telemetry-parquet? Can you clear the needinfo when that's done?

Flags: needinfo?(bimsland)

Yep, I can take care of that.

Flags: needinfo?(bimsland)
Priority: -- → P1

So we're keeping taar/* and telemetry-ml/* prefixes temporarily because TAAR is still using them. Also socorro_crash/v2/* is still being written to by Airflow and is potentially still being used by Athena.

Any others?

Assignee: nobody → bimsland

Turns out socorro_crash/v2/* isn't actually needed.

Access to all data has been removed, with the exception of TAAR, which still has access to the previously mentioned prefixes.

All the data besides the previously mentioned prefixes has been removed!
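
For anyone doing a similar cleanup, here is a minimal sketch of the "delete everything except a few prefixes" selection described above. It is not the tooling actually used for this bug. The function names, the sample keys, and the trailing slash on each prefix are assumptions for illustration; only the taar and telemetry-ml prefixes come from this thread. The batch size of 1000 reflects the per-request key limit of the S3 DeleteObjects API.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

# Prefixes this thread says are kept for TAAR.
# Assumption: each prefix is treated as a "directory" ending in a slash.
KEEP_PREFIXES: Tuple[str, ...] = ("taar/", "telemetry-ml/")


def partition_keys(
    keys: Iterable[str], keep_prefixes: Sequence[str] = KEEP_PREFIXES
) -> Tuple[List[str], List[str]]:
    """Split object keys into (keep, delete) lists by prefix match."""
    prefixes = tuple(keep_prefixes)
    keep: List[str] = []
    delete: List[str] = []
    for key in keys:
        # str.startswith accepts a tuple and matches any of its entries.
        if key.startswith(prefixes):
            keep.append(key)
        else:
            delete.append(key)
    return keep, delete


def batches(items: Sequence[str], size: int = 1000) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items (one delete request each)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical keys, for illustration only.
sample_keys = [
    "taar/example/part-0.parquet",
    "telemetry-ml/example/part-0.parquet",
    "socorro_crash/v2/example/part-0.parquet",
    "main_summary/example/part-0.parquet",
]
kept, to_delete = partition_keys(sample_keys)
```

Each batch from `batches(to_delete)` could then be handed to whatever S3 client is in use. Listing what would be deleted and reviewing it before issuing any delete request is the safer order of operations.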

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard