Closed Bug 1450079 Opened 7 years ago Closed 7 years ago

Add option to align number of input partitions to default parallelism

Categories

(Data Platform and Tools :: General, enhancement, P1)

enhancement
Points:
1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

References

Details

Attachments

(1 file)

Currently the default behavior is to read partitions in block sizes of 256 MB. This does not work well when processing a small amount of data on a large number of workers. For example, processing a dataset that is under several GB will leave most of the available workers with no work. To get around this, we should add an `--alignment` option to MainSummary, which will set the number of input partitions to a multiple of `sc.defaultParallelism`. The performance of aligning the number of partitions to default parallelism was explored on Jan 16 in the `Message.toJValue` PR in moztelemetry.[1] Over the course of a single day with the same number of workers, the aligned jar took 3 hours as opposed to 3.4 hours. This change has a minimal gain for large datasets, but this is more efficient for smaller datasets. Using this option is strictly better the naive partitioning method if tuned correctly due to higher worker utilization. [1] https://github.com/mozilla/moztelemetry/pull/15
Correction: the default is 268 MB, which is 2^31 bits.
See Also: → 1451103
This ended up being implemented as two options. > --read-mode (fixed | aligned) > --input-partition-multiplier <int> `fixed` is the current behavior, reading in files until partitions are filled. `aligned` will use evenly-spaced partitions aligned to the default parallelism.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Datasets: Main Summary → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: