Closed
Bug 1349608
Opened 8 years ago
Closed 5 years ago
Improve Dataset performance when downloading many small files
Categories
(Data Platform and Tools :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: mdoglio, Assigned: mdoglio)
Details
As reported by :ekr, certain queries like this one
> ds = Dataset.from_source('telemetry').where(docType='OTHER')
> rec = ds.records(sc)
> rec.count()
take ages to run, even though the total amount of data downloaded is not huge.
:rvitillo confirmed the issue:
> I have run a similar query on a 10 node cluster and connected to a worker.
> I noticed that the worker was downloading data only at about 15 MB/s.
> Furthermore, there were like half of cores of the worker just sitting idle when I got to 562/640.
> It looks to me that for situations like this (tons of small files, see also Bug 1304693),
> having a higher degree of parallelism would be extremely beneficial.
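To illustrate the point about parallelism: fetching many small objects is usually bound by per-request latency rather than bandwidth, so issuing many requests at once per worker tends to help. The following is only a minimal sketch of that idea, not the python_moztelemetry implementation. `fetch_all`, `fetch_one` and the `max_workers` default are hypothetical names and values chosen for illustration; `fetch_one` stands in for whatever call retrieves a single object.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keys, fetch_one, max_workers=32):
    """Fetch many small objects concurrently.

    `fetch_one` is a hypothetical callable (key -> bytes), for example a
    thin wrapper around a single object GET. Results are returned in the
    same order as `keys`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; an exception raised by
        # fetch_one is re-raised when its result is consumed here.
        return list(pool.map(fetch_one, keys))


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without network access.
    def fake_fetch(key):
        return key.encode("utf-8")

    print(fetch_all(["a", "b", "c"], fake_fetch, max_workers=2))
```

Threads are used here because the work is I/O-bound; the right `max_workers` depends on the storage backend and would need to be measured.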
Comment 1•8 years ago
This was possibly exacerbated by the move to the new infrastructure, because the average "small file" object size is smaller than it used to be (see https://bugzilla.mozilla.org/show_bug.cgi?id=1302264#c7). Additionally, there have been at least two occasions where the OTHER ping files became very large: a new ping type was introduced to all release users via system add-ons and was bucketed into OTHER for a few days (I think one of them is https://bugzilla.mozilla.org/show_bug.cgi?id=1307568 and the other was disableSHA1rollout).
If there are ping types that are in OTHER but are often used in analysis, we can also backfill them so that they are available by their docType.
Assignee
Comment 2•8 years ago
I improved the situation in bug 1318681. The Spark partitions are now more balanced, resulting in a 30% speedup. This alone doesn't make things much better, because you still have to wait hours to get 100% of the data. Let's keep this open to track any further work.
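For readers unfamiliar with what balancing partitions means here: when file sizes vary a lot, assigning files to partitions by count alone leaves some tasks with far more bytes than others. The sketch below shows one common way to even this out, a greedy "largest file first" assignment to the currently lightest group. This is an assumption for illustration only; the comment does not say how the patch balances partitions, and `balance_by_size` is a hypothetical helper, not part of python_moztelemetry.

```python
import heapq


def balance_by_size(files, n_groups):
    """Split (key, size) pairs into n_groups lists of keys with similar
    total size, using greedy largest-first assignment."""
    # Min-heap of (total_bytes_so_far, group_index).
    heap = [(0, i) for i in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    # Place the largest files first, each into the lightest group so far.
    for key, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(key)
        heapq.heappush(heap, (total + size, idx))
    return groups


if __name__ == "__main__":
    files = [("a", 10), ("b", 9), ("c", 1), ("d", 1), ("e", 1)]
    print(balance_by_size(files, 2))
```

Each resulting group could then be handed to one task, so that tasks finish at roughly the same time instead of a few stragglers holding up the job.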
Assignee
Comment 3•8 years ago
Sorry, the previous link was wrong. The patch for this can be found at https://github.com/mozilla/python_moztelemetry/pull/127
Updated•8 years ago
Component: Metrics: Pipeline → Telemetry APIs for Analysis
Product: Cloud Services → Data Platform and Tools
Assignee
Updated•8 years ago
Priority: P1 → P2
Updated•5 years ago
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Updated•3 years ago
Component: Telemetry APIs for Analysis → General