Closed Bug 1304693 Opened 8 years ago Closed 8 years ago

Dataset should handle small files

Categories

(Cloud Services :: Metrics: Dashboard, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Unassigned)

References

Details

User Story

The current Python implementation of Dataset doesn't parallelize loads well when there is a large number of small files, since it partitions the RDD based only on the cumulative size of the files. We should make sure that the number of partitions of the RDD returned by Dataset.records() is at least N times `sc.defaultParallelism`.

For example,

Dataset.from_source('telemetry') \
       .where(docType='heartbeat') \
       .where(submissionDate=lambda x: "20160801" <= x <= "20160920") \
       .where(appName='Firefox') \
       .where(appUpdateChannel='beta') \
       .records(sc) \
       .getNumPartitions()

returns 1
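
A minimal sketch of the kind of fix the user story asks for, in plain Python so it runs without a Spark cluster. The helper name `pick_partition_count` and its parameters are hypothetical, not part of the actual Dataset implementation; the idea is simply that the size-based partition count gets a floor of N times the default parallelism:

```python
def pick_partition_count(file_sizes, default_parallelism, min_factor=4,
                         target_partition_bytes=256 * 1024 * 1024):
    """Choose a partition count for an RDD built from a list of files.

    Partitioning purely by cumulative size collapses many small files
    into a single partition; enforcing a floor of
    min_factor * default_parallelism keeps all executors busy.
    (min_factor plays the role of N from the user story.)
    """
    # Size-based count: one partition per target_partition_bytes of data.
    by_size = sum(file_sizes) // target_partition_bytes + 1
    # Parallelism floor: at least min_factor partitions per default slot.
    floor = min_factor * default_parallelism
    return max(by_size, floor)

# 10,000 files of 1 KiB each: the size-based count would be 1,
# but the parallelism floor raises it to 4 * 16 = 64.
print(pick_partition_count([1024] * 10000, default_parallelism=16))  # 64
```

With this floor in place, the heartbeat query above would be spread over at least 4 × `sc.defaultParallelism` partitions instead of 1.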

Attachments

(1 file)

No description provided.
Blocks: 1255748
Blocks: 1295359
User Story: (updated)
This is now fixed. The execution time of the query in the description went down from 4h to 4 minutes.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
No longer blocks: 1295359
