Closed Bug 1401626 Opened 8 years ago Closed 4 years ago

Verify Optimal Parquet File Size

Categories

(Data Platform and Tools :: General, enhancement, P3)

enhancement
Points: 2

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: frank, Unassigned)

Details

Currently we assume 200-300MB is the "correct" size. Databricks has recommended 1GB files [0]. We should run some tests on S3 to determine empirically which file size (N MB) performs best, since our current figure comes from cargo culting rather than measurement.
[0] https://forums.databricks.com/questions/101/what-is-an-optimal-size-for-file-partitions-using.html
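As a rough aid for setting up such tests, the sketch below computes how many output files a dataset would need at each candidate file size. The 50 GiB dataset size is a made-up example, `files_for_target_size` is a hypothetical helper name, and "MB" is treated as MiB here; none of these come from the bug itself.

```python
import math

MB = 1024 ** 2  # assumption: treat "MB" as MiB for this illustration


def files_for_target_size(total_bytes: int, target_file_bytes: int) -> int:
    """Return how many files are needed so each is at most the target size."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    # Always write at least one file, even for an empty or tiny dataset.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical dataset of 50 GiB (on-disk, compressed size).
total = 50 * 1024 * MB

# Candidate sizes taken from the figures mentioned in this bug.
for target_mb in (128, 256, 1024):
    count = files_for_target_size(total, target_mb * MB)
    print(f"{target_mb} MB target -> {count} files")
```

The resulting count could then be used as the partition count when rewriting the dataset (for example via Spark's `DataFrame.repartition`), though actual file sizes will vary with compression and data skew.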
Some things to test:
- Querying the resulting data from Presto
- Querying the resulting data from Athena
- Querying the resulting data from Spark clusters with varying number of nodes
- Reading all columns vs. reading few columns
- GROUP BY queries
There are some relatively recent performance recommendations for Athena at [1], which recommend using files at least 128MB in size, and "not too big".
[1] https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
Mark, any idea who should take this? Seems related to the architecture review.
Points: --- → 1
Flags: needinfo?(mreid)
Priority: -- → P1
Assignee: nobody → mreid
This is a "nice to have"; I don't think it's urgent that it be worked on this quarter.
Assignee: mreid → nobody
Points: 1 → 2
Flags: needinfo?(mreid)
Priority: P1 → P3
One thought we had was to use standardized benchmarking tools instead of Mozilla data and queries. TPC-H [0] could be a good candidate, as well as TPC-DS [1]. Basically, with everything else held the same, how does Parquet file size affect performance?
This will require:
- Setting up data in Parquet format
- Running TPC queries in Presto/Spark on that data
- Benchmarking and comparing different types
Following this we can write up a blog post with our findings.
[0] http://www.tpc.org/tpch/default.asp
[1] http://www.tpc.org/tpcds/default.asp
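The benchmarking step could be sketched roughly as follows: run each query several times against each file-size variant of the same data and record the median wall-clock time. This is only a minimal harness under stated assumptions. `run_query` is a hypothetical callable that would submit a query to Presto or Spark, and the variant and query labels are placeholders, not real table or query names.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Tuple


def benchmark(
    run_query: Callable[[str, str], None],
    variants: Iterable[str],
    queries: Iterable[str],
    repeats: int = 3,
) -> Dict[Tuple[str, str], float]:
    """Return the median wall-clock seconds for each (variant, query) pair."""
    queries = list(queries)  # allow reuse across variants
    results: Dict[Tuple[str, str], float] = {}
    for variant in variants:
        for query in queries:
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                run_query(variant, query)
                timings.append(time.perf_counter() - start)
            results[(variant, query)] = statistics.median(timings)
    return results


# Stand-in runner (assumption): a real one would execute the query against
# the copy of the dataset written at the given file size.
calls = []


def fake_run_query(variant: str, query: str) -> None:
    calls.append((variant, query))


results = benchmark(
    fake_run_query,
    variants=("128MB", "256MB", "1GB"),  # placeholder file-size variants
    queries=("q1", "q6"),                # placeholder query labels
    repeats=3,
)
for (variant, query), seconds in sorted(results.items()):
    print(f"{variant:>6} {query}: {seconds:.6f}s")
```

Using the median rather than the mean reduces the effect of one-off outliers such as a cold cache on the first run.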

We don't use Parquet anymore.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Component: Datasets: General → General