Closed Bug 1357250: Opened 8 years ago, Closed 6 years ago

Evaluate zstandard performance with telemetry data

Categories

(Data Platform and Tools :: General, enhancement, P3)

Points:
2

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: whd, Unassigned)

References

Details

(Whiteboard: [DataOps])

For the ingestion pipeline, we know gzip on upload is the current bottleneck. We should verify that moving to zstandard improves both the size and the performance of our S3 uploads. If we can run fewer machines on ingestion, we will produce fewer total objects (each ingestion node writes its own output objects, and downstream readers pay a per-object overhead), which should improve Dataset API performance. For those downstream analysis tools, when we moved to the new ingestion infra (and gzip), we measured a 10-15% performance decrease relative to our previous per-record snappy compression format (the cost of having smaller object sizes). With zstandard, we should expect a significant performance increase (> 15%) in our analysis.

The first step is to generate a day's worth of data from landfill into test data sets in the canonical bucket (-zstd and -gzip), and compare both the compute required to do so and the resulting object sizes; a rough sketch of that comparison follows below. :mreid did similar work a long time ago when we were choosing compression formats.

The second step is to run some Spark analysis (probably counts) on the data via the Scala and Python bindings, and compare the performance of the -zstd and -gzip data sets; a sketch of that is below as well.
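A minimal sketch of the step-one comparison, in Python. The sample path is hypothetical (a real run would cover a full day of landfill data); it uses the python-zstandard package and the standard-library gzip module:

import gzip
import time

import zstandard as zstd  # pip install zstandard

SAMPLE_PATH = "landfill-sample.ndjson"  # hypothetical local sample of landfill data

with open(SAMPLE_PATH, "rb") as f:
    raw = f.read()

def bench(name, compress):
    # Time a one-shot compression and report size relative to the input.
    start = time.perf_counter()
    out = compress(raw)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s, {len(out)} bytes ({len(out) / len(raw):.1%} of original)")

# gzip at a mid-level setting, roughly what the pipeline uses today.
bench("gzip", lambda data: gzip.compress(data, compresslevel=6))

# zstandard at its default level (3); higher levels trade CPU for size.
bench("zstd", zstd.ZstdCompressor(level=3).compress)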
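And a minimal sketch of the step-two timing, assuming the test data sets are readable by Spark directly as text (the real comparison would go through the telemetry Scala/Python bindings) and that the cluster's Hadoop build has the zstd codec available. The bucket name and prefix here are hypothetical stand-ins for the canonical bucket layout:

import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zstd-vs-gzip-count").getOrCreate()

BUCKET = "telemetry-test"  # hypothetical stand-in for the canonical bucket

def timed_count(path):
    # Read the data set, count records, and return the elapsed wall time.
    start = time.perf_counter()
    n = spark.read.text(path).count()
    return n, time.perf_counter() - start

for suffix in ("gzip", "zstd"):
    # Hypothetical layout: one day of test data per compression format.
    path = f"s3a://{BUCKET}-{suffix}/one-day-sample/"
    n, secs = timed_count(path)
    print(f"{suffix}: {n} records in {secs:.1f}s")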
Blocks: 1357253
Blocks: 1357254
Blocks: 1357255
Priority: -- → P3
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Whiteboard: [SvcOps] → [DataOps]
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Component: Pipeline Ingestion → General