Closed
Bug 1401626
Opened 8 years ago
Closed 4 years ago
Verify Optimal Parquet File Size
Categories
(Data Platform and Tools :: General, enhancement, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: frank, Unassigned)
Details
Currently we assume 200-300MB is the "correct" Parquet file size, but Databricks has recommended 1GB files [0]. We should run some tests on S3 to verify empirically which size (around N MB) actually performs best, since our current understanding is just from cargo culting.
[0] https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
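To make the comparison concrete, here is a rough sketch of how many output files each candidate size would imply for a dataset of a given total size. The helper names (`files_for_target`, `plan_file_counts`) and the 100 GiB example are illustrative assumptions, not anything from our pipeline.

```python
MB = 1024 * 1024

# Candidate target sizes from the description: the 200-300MB we assume
# today and the 1GB that Databricks recommends.
CANDIDATE_SIZES_MB = [200, 300, 1024]


def files_for_target(total_bytes: int, target_file_bytes: int) -> int:
    """Return how many files are needed so that no file exceeds the target size."""
    if total_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding on large inputs.
    return -(-total_bytes // target_file_bytes)


def plan_file_counts(total_bytes: int) -> dict:
    """Map each candidate size (in MB) to the file count it implies."""
    return {
        size_mb: files_for_target(total_bytes, size_mb * MB)
        for size_mb in CANDIDATE_SIZES_MB
    }


if __name__ == "__main__":
    # Hypothetical 100 GiB dataset, used only as an example input.
    for size_mb, count in plan_file_counts(100 * 1024 * MB).items():
        print(f"{size_mb} MB target -> {count} files")
```

The resulting count is the sort of number one would pass as a partition count when rewriting the data, assuming the writer produces roughly one file per partition.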
Comment 1•8 years ago
Some things to test:
- Querying the resulting data from Presto
- Querying the resulting data from Athena
- Querying the resulting data from Spark clusters with varying number of nodes
- Reading all columns vs. reading few columns
- GROUP BY queries
There are some relatively recent performance recommendations for Athena at [1], which recommend using files at least 128MB in size, and "not too big".
[1] https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
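The test dimensions above could be enumerated as a matrix so that no combination gets skipped. This is only a sketch: the engine labels, the specific file sizes, and the `BenchmarkCase` type are assumptions chosen for illustration.

```python
from itertools import product
from typing import List, NamedTuple


class BenchmarkCase(NamedTuple):
    engine: str        # query engine, or Spark cluster size label
    file_size_mb: int  # target Parquet file size under test
    columns: str       # "all" columns vs. only a "few"
    query_shape: str   # plain scan vs. GROUP BY


# Hypothetical labels: two Spark entries stand in for "varying number of nodes".
ENGINES = ["presto", "athena", "spark-small", "spark-large"]
# Assumed sizes spanning Athena's 128MB minimum up to the 1GB recommendation.
FILE_SIZES_MB = [128, 256, 512, 1024]
COLUMN_SETS = ["all", "few"]
QUERY_SHAPES = ["scan", "group_by"]


def build_matrix() -> List[BenchmarkCase]:
    """Return every combination of engine, file size, column set, and query shape."""
    return [
        BenchmarkCase(*combo)
        for combo in product(ENGINES, FILE_SIZES_MB, COLUMN_SETS, QUERY_SHAPES)
    ]


if __name__ == "__main__":
    cases = build_matrix()
    print(f"{len(cases)} benchmark cases")
    for case in cases[:4]:
        print(case)
```

A full cross product grows quickly, so in practice it may make sense to fix some dimensions and vary only file size at first.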
Comment 2•8 years ago
Mark, any idea who should take this? Seems related to the architecture review.
Points: --- → 1
Flags: needinfo?(mreid)
Priority: -- → P1
Updated•8 years ago
Assignee: nobody → mreid
Comment 3•8 years ago
This is a "nice to have"; I don't think it's urgent that it be worked on this quarter.
Assignee: mreid → nobody
Points: 1 → 2
Flags: needinfo?(mreid)
Priority: P1 → P3
Comment 4•8 years ago (Reporter)
One thought we had: instead of using Mozilla data and queries for this, we could use standardized benchmarking tools. TPC-H [0] could be a good candidate, as well as TPC-DS [1].
Basically, with everything else held the same, how does Parquet file size affect performance?
This will require:
- Setting up data in parquet format
- Running TPC queries in Presto/Spark on that data
- Benchmarking and comparing the different file sizes
Following this, we can write up a blog post with our findings.
[0] http://www.tpc.org/tpch/default.asp
[1] http://www.tpc.org/tpcds/default.asp
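The benchmarking step could look roughly like the harness below: run the same query against copies of the data written at each file size and compare median wall-clock times. The `run_query` callables are placeholders (assumptions) for whatever actually submits a TPC query to Presto or Spark; nothing here talks to a real engine.

```python
import statistics
import time
from typing import Callable, Dict


def time_query(run_query: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock seconds over `repeats` runs of one query."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare_file_sizes(
    runners: Dict[int, Callable[[], object]], repeats: int = 5
) -> Dict[int, float]:
    """Time one query per file size; keys are file sizes in MB."""
    return {
        size_mb: time_query(run, repeats) for size_mb, run in runners.items()
    }


def fastest(results: Dict[int, float]) -> int:
    """Return the file size (MB) with the lowest median time."""
    return min(results, key=results.get)


if __name__ == "__main__":
    # Stand-in runners that just sleep; a real run would execute the same
    # TPC query against the dataset written at each file size.
    runners = {
        256: lambda: time.sleep(0.02),
        1024: lambda: time.sleep(0.01),
    }
    results = compare_file_sizes(runners, repeats=3)
    print(results, "fastest:", fastest(results))
```

Using the median rather than the mean is a design choice to reduce the effect of one-off slow runs, which seem likely on shared clusters and S3.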
Comment 5•4 years ago (Reporter)
We don't use parquet anymore.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Updated•3 years ago (Assignee)
Component: Datasets: General → General