Closed Bug 1349608 Opened 8 years ago Closed 5 years ago

Improve Dataset performance when downloading many small files

Categories

(Data Platform and Tools :: General, defect, P2)

defect
Points:
2

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mdoglio, Assigned: mdoglio)

Details

As reported by :ekr, certain queries like this
> ds = Dataset.from_source('telemetry').where(docType='OTHER')
> rec = ds.records(sc)
> rec.count()
take ages to run, even though the total amount of data downloaded is not huge.

:rvitillo confirmed the issue:
> I have run a similar query on a 10 node cluster and connected to a worker.
> I noticed that the worker was downloading data only at about 15 MB/s.
> Furthermore, there were like half of cores of the worker just sitting idle when I got to 562/640.
> It looks to me that for situations like this (tons of small files, see also Bug 1304693),
> having a higher degree of parallelism would be extremely beneficial.
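To illustrate why a higher degree of parallelism helps with tons of small files, here is a minimal sketch in which per-request latency dominates and a thread pool overlaps the waits. It is not how Dataset actually downloads data: `fetch`, `fetch_all`, the simulated latency and the `max_workers` value are all hypothetical names and numbers chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(key, latency=0.01):
    """Hypothetical stand-in for downloading one small object."""
    # For small files the fixed per-request latency outweighs transfer time,
    # so it is simulated here with a sleep instead of a real network call.
    time.sleep(latency)
    return key, len(key)


def fetch_all(keys, max_workers=32):
    """Fetch many small objects concurrently, preserving the order of `keys`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map overlaps the waits of up to `max_workers` requests.
        return list(pool.map(fetch, keys))


if __name__ == "__main__":
    keys = ["telemetry/OTHER/part-%04d" % i for i in range(200)]
    start = time.time()
    results = fetch_all(keys)
    print("fetched %d objects in %.2fs" % (len(results), time.time() - start))
```

Because the work is I/O bound, threads are enough to keep several requests in flight per core. A suitable `max_workers` would have to be measured on a real worker.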
This was possibly exacerbated by the move to the new infra because the average "small file" object size is smaller than it used to be (see https://bugzilla.mozilla.org/show_bug.cgi?id=1302264#c7). Additionally there have been at least two occasions where the OTHER ping files became very large due to the introduction of a new ping type to all release users via system addons (I think one of them is https://bugzilla.mozilla.org/show_bug.cgi?id=1307568 and the other was disableSHA1rollout) that were bucketed into OTHER for a few days. If there are ping types that are in OTHER but are often used in analysis, we can also backfill them so that they are available by their docType.
I improved the situation in bug 1318681. The Spark partitions are now more balanced, resulting in a 30% speedup. This alone doesn't make things much better, because you still have to wait hours to get 100% of the data. Let's keep this open to track any further work.
Sorry, the previous link was wrong. The patch for this can be found at https://github.com/mozilla/python_moztelemetry/pull/127
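For readers wondering what "more balanced" partitions can mean in practice, one common approach is to assign files to partitions greedily by size, largest first. The sketch below shows that general technique only. It is not taken from the linked pull request, and `balance_by_size` and the `(key, size)` input format are assumptions made for illustration.

```python
import heapq


def balance_by_size(files, n_partitions):
    """Greedily assign (key, size) pairs to the currently lightest partition.

    Returns a list of (total_bytes, keys) tuples, one per partition.
    """
    # Heap entries are (total_bytes, partition_index, keys); the unique index
    # breaks ties, so the key lists themselves are never compared.
    heap = [(0, i, []) for i in range(n_partitions)]
    heapq.heapify(heap)
    # Placing the largest files first keeps the final totals close together.
    for key, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, idx, keys = heapq.heappop(heap)
        keys.append(key)
        heapq.heappush(heap, (total + size, idx, keys))
    # Report partitions in index order.
    return [(total, keys) for total, _, keys in sorted(heap, key=lambda e: e[1])]


if __name__ == "__main__":
    files = [("a", 8), ("b", 7), ("c", 6), ("d", 5), ("e", 4)]
    for total, keys in balance_by_size(files, 2):
        print(total, keys)
```

Balancing by bytes evens out the work per task, but it does not reduce the number of requests, which fits the observation that this alone does not make things much better.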
Component: Metrics: Pipeline → Telemetry APIs for Analysis
Product: Cloud Services → Data Platform and Tools
Priority: P1 → P2
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Telemetry APIs for Analysis → General