The current implementation of Dataset for Python doesn't parallelize the load well when there is a large number of small files, since it partitions the RDD based on the cumulative size of the files. We should make sure that the number of partitions of the RDD returned by Dataset.records() is at least N times `sc.defaultParallelism()`. For example, the following query returns 1:

    Dataset.from_source('telemetry') \
        .where(docType='heartbeat') \
        .where(submissionDate=lambda x: x >= "20160801" and x <= "20160920") \
        .where(appName='Firefox') \
        .where(appUpdateChannel='beta') \
        .records(sc) \
        .getNumPartitions()
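One way to guarantee a minimum partition count is to cap the cumulative size per group so that the file list splits into at least `target_groups` bins. The sketch below illustrates that idea only; `group_by_size` is a hypothetical helper, not the actual python_moztelemetry code, and a single file larger than the cap can still yield fewer groups than requested.

```python
def group_by_size(files, target_groups):
    """Greedily bin (name, size) pairs into roughly `target_groups` groups.

    Hypothetical sketch: caps each group's cumulative size at
    total_size / target_groups so many small files spread across
    at least `target_groups` partitions instead of collapsing into one.
    """
    total = sum(size for _, size in files)
    cap = max(1, total // target_groups)
    groups, current, current_size = [], [], 0
    # Largest files first, so big files don't overflow nearly-full groups.
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > cap:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

In the Dataset case, `target_groups` would be derived from `N * sc.defaultParallelism()`, and each group would become one partition of the RDD returned by `Dataset.records()`.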
Created attachment 8793738: [python_moztelemetry] mozilla:bug-1304693-small-files > mozilla:master
This is now fixed. The execution time of the query in the description went down from 4 hours to 4 minutes.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED