Closed Bug 1304693 Opened 8 years ago Closed 8 years ago

Dataset should handle small files

Categories

(Cloud Services :: Metrics: Dashboard, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

References

Details

User Story

The current Python implementation of Dataset doesn't parallelize loads well when there is a large number of small files, since it partitions the RDD based only on the cumulative size of the files. We should make sure that the number of partitions of the RDD returned by Dataset.records() is at least N times `sc.defaultParallelism`.

For example,

Dataset.from_source('telemetry') \
       .where(docType='heartbeat') \
       .where(submissionDate=lambda x: "20160801" <= x <= "20160920") \
       .where(appName='Firefox') \
       .where(appUpdateChannel='beta') \
       .records(sc) \
       .getNumPartitions()

returns 1
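
A minimal sketch of the kind of fix the user story asks for, in plain Python so it runs without a Spark cluster. The helper name `pick_partition_count` and its parameters are hypothetical, not part of the actual Dataset implementation; the idea is simply that the size-based partition count gets a floor of N times the default parallelism:

```python
def pick_partition_count(file_sizes, default_parallelism, min_factor=4,
                         target_partition_bytes=256 * 1024 * 1024):
    """Choose a partition count for an RDD built from a list of files.

    Partitioning purely by cumulative size collapses many small files
    into a single partition; enforcing a floor of
    min_factor * default_parallelism keeps all executors busy.
    (min_factor plays the role of N from the user story.)
    """
    # Size-based count: one partition per target_partition_bytes of data.
    by_size = sum(file_sizes) // target_partition_bytes + 1
    # Parallelism floor: at least min_factor partitions per default slot.
    floor = min_factor * default_parallelism
    return max(by_size, floor)

# 10,000 files of 1 KiB each: the size-based count would be 1,
# but the parallelism floor raises it to 4 * 16 = 64.
print(pick_partition_count([1024] * 10000, default_parallelism=16))  # 64
```

With this floor in place, the heartbeat query above would be spread over at least 4 × `sc.defaultParallelism` partitions instead of 1.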

Attachments

(1 file)

No description provided.
Blocks: 1255748
Blocks: 1295359
User Story: (updated)
This is now fixed. The execution time of the query in the description went down from 4h to 4 minutes.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
No longer blocks: 1295359
