Closed Bug 1357250 Opened 8 years ago Closed 6 years ago
Evaluate zstandard performance with telemetry data
Categories: Data Platform and Tools :: General, enhancement, P3
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: whd; Assignee: Unassigned
Whiteboard: [DataOps]
For the ingestion pipeline, we know gzip on upload is the current bottleneck. We should verify that moving to zstandard improves both the size and the performance of our s3 uploads. If zstandard lets us run fewer machines on ingestion, we will produce fewer total objects, which should improve downstream Dataset API performance.
For those downstream analysis tools, when we moved to the new ingestion infra (and gzip), we measured a 10-15% performance decrease relative to our previous per-record snappy compression format (the cost of having smaller object sizes). With zstandard, we expect a significant performance improvement (>15%) in our analysis.
The first step is to generate a day's worth of data from landfill into test data sets in the canonical bucket (-zstd and -gzip), and compare both the compute required to produce them and the resultant object sizes. :mreid did similar work a long time ago when we were originally choosing compression formats.
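A minimal sketch of the step-one comparison: compress the same payload with gzip (stdlib) and, where available, the third-party zstandard library, then report compute time and output size. The sample payload, the compression levels, and the zstandard import are assumptions for illustration; a real run would use a day of landfill records.

```python
import gzip
import time

def benchmark(name, compress, data):
    """Time a compression callable and report the resulting object size."""
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    return {"codec": name, "seconds": elapsed, "bytes": len(out),
            "ratio": len(data) / len(out)}

data = b"telemetry-record\n" * 100_000  # stand-in for a landfill extract

results = [benchmark("gzip", lambda d: gzip.compress(d, compresslevel=6), data)]
try:
    import zstandard  # third-party: pip install zstandard
    cctx = zstandard.ZstdCompressor(level=3)
    results.append(benchmark("zstd", cctx.compress, data))
except ImportError:
    pass  # zstd numbers are only available where the library is installed

for r in results:
    print(f"{r['codec']}: {r['bytes']} bytes, ratio {r['ratio']:.1f}, "
          f"{r['seconds'] * 1000:.1f} ms")
```

On real telemetry payloads the interesting numbers are the compute time per object (the upload bottleneck) and the resulting object sizes, not the ratio on synthetic data.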
The second step is to run some Spark analysis (probably counts) on the data via the Scala and Python bindings, and compare the performance of the -zstd and -gzip data sets.
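The step-two comparison can be driven by a small timing harness like the sketch below. The pyspark calls and the dataset paths shown in comments are assumptions (the real bucket layout is not specified here); the harness itself is plain Python, and the same pattern applies to the Scala bindings.

```python
import time

def timed_count(count_fn):
    """Run a count callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    n = count_fn()
    return n, time.perf_counter() - start

# With pyspark available, the callables would be built from the two test
# data sets, e.g. (hypothetical paths):
#   spark.read.text("s3://<canonical-bucket>-gzip/...").count
#   spark.read.text("s3://<canonical-bucket>-zstd/...").count
# Here we exercise the harness with a stand-in callable.
n, secs = timed_count(lambda: sum(1 for _ in range(1000)))
print(n, f"{secs:.4f}s")
```

Running the same count over both data sets several times and comparing the elapsed times gives the gzip-vs-zstd read-performance number we are after.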
Updated•8 years ago
Priority: -- → P3
Updated•8 years ago
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Updated•7 years ago
Whiteboard: [SvcOps] → [DataOps]
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Updated•2 years ago
Component: Pipeline Ingestion → General