Closed Bug 1401630 Opened 7 years ago Closed 5 years ago

Build a tool to estimate the on-disk storage size for a given Spark DataFrame/RDD/Dataset

Categories

(Data Platform and Tools :: General, enhancement)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mreid, Unassigned)

Details

This would help us automatically decide how many partitions to use when storing data, particularly in Parquet format, instead of choosing partition counts by manual inspection and trial and error.
See an example investigation of data size at: https://gist.github.com/sunahsuh/ed0c7148b80963abe8f0030e74578d35
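The investigation linked above measures a written sample and extrapolates. A minimal sketch of that idea, assuming a hypothetical workflow where a small sample of the data has already been written as Parquet and measured (function names, the sample figures, and the 128 MB target are illustrative assumptions, not anything from the bug):

```python
# Hypothetical sketch: pick a partition count from an estimated on-disk size.
# Approach assumed: write a small sample as Parquet, measure its size on disk,
# extrapolate to the full row count, then divide by a target file size.

def estimate_total_bytes(sample_bytes: int, sample_rows: int, total_rows: int) -> int:
    """Extrapolate total on-disk size from a measured Parquet sample."""
    if sample_rows <= 0:
        raise ValueError("sample must contain at least one row")
    return int(sample_bytes * (total_rows / sample_rows))

def partition_count(total_bytes: int, target_bytes: int = 128 * 1024 * 1024) -> int:
    """Partitions needed so each output file lands near target_bytes.

    128 MB is a common target, matching the typical Parquet row-group /
    HDFS block size; this is an assumption, not a value from the bug.
    """
    return max(1, -(-total_bytes // target_bytes))  # ceiling division

# Example: a 10,000-row sample wrote 2 MB of Parquet; the full set has 50M rows.
total = estimate_total_bytes(2 * 1024 * 1024, 10_000, 50_000_000)
parts = partition_count(total)
```

In Spark the resulting count would then feed something like `df.repartition(parts).write.parquet(...)`. Compression and encoding efficiency vary with row-group size, so the extrapolation is only approximate.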
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Telemetry APIs for Analysis → General