Bug 1304693 (Closed)
Opened 8 years ago
Closed 8 years ago
Dataset should handle small files
Categories
(Cloud Services :: Metrics: Dashboard, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Unassigned)
Details
User Story
The current implementation of Dataset for Python doesn't parallelize the load well when there is a large number of small files, since it partitions the RDD based only on the cumulative size of the files. We should make sure that the number of partitions of the RDD returned by Dataset.records() is at least N times `sc.defaultParallelism`. For example,

    Dataset.from_source('telemetry') \
        .where(docType='heartbeat') \
        .where(submissionDate=lambda x: x >= "20160801" and x <= "20160920") \
        .where(appName='Firefox') \
        .where(appUpdateChannel='beta') \
        .records(sc) \
        .getNumPartitions()

returns 1, so the whole load runs on a single task.
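The fix described above can be sketched as a small helper: keep the existing size-based partitioning, but enforce a floor of N * defaultParallelism partitions. This is a minimal illustration, not the actual python_moztelemetry code; the function name, the `target_partition_bytes` parameter, and the factor of 2 are all hypothetical.

    # Hypothetical sketch of the partitioning rule proposed in this bug.
    def choose_num_partitions(total_bytes, target_partition_bytes,
                              default_parallelism, min_factor=2):
        """Partition by cumulative input size, but never return fewer than
        min_factor * default_parallelism partitions, so many small files
        still spread across the cluster."""
        # Size-based count: how many chunks of ~target_partition_bytes fit.
        size_based = max(1, total_bytes // target_partition_bytes)
        # Floor: keep at least min_factor tasks per available core.
        floor = min_factor * default_parallelism
        return max(size_based, floor)

With many small files (say 10 MB total, 256 MB target chunks, 16 cores) the size-based count collapses to 1, and the floor of 32 partitions takes over; for large inputs the size-based count wins unchanged.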
Attachments
(1 file)
No description provided.
Comment 1•8 years ago
Reporter
Updated•8 years ago
User Story: (updated)
Comment 2•8 years ago
This is now fixed. The execution time of the query in the description went down from 4h to 4 minutes.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED