Closed
Bug 1389247
Opened 8 years ago
Closed 5 years ago
Make parquet merging job
Categories
(Data Platform and Tools :: General, enhancement, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: frank, Unassigned)
Between spark-streaming and direct-to-parquet, some of our parquet datasets are becoming very sharded, making reading them a bottleneck. Nightly, we should merge those datasets into well-sized (~300 MB) parquet files to facilitate reading them.
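The core sizing decision for such a nightly job is how many output files to write for a given dataset. A minimal sketch of that calculation (the helper name and the example figures are illustrative, not from this bug), assuming the ~300 MB target above:

```python
# Hypothetical helper: pick how many output files a compaction job
# should write so each lands near the ~300 MB target.

TARGET_FILE_BYTES = 300 * 1024 * 1024  # ~300 MB per parquet file

def target_file_count(total_bytes: int) -> int:
    """Number of output files needed to keep each near TARGET_FILE_BYTES."""
    if total_bytes <= 0:
        return 1
    # Ceiling division, so no file overshoots the target by much.
    return max(1, -(-total_bytes // TARGET_FILE_BYTES))

# e.g. a 3 GB dataset sharded into thousands of tiny files
# would be rewritten as 11 files of just under 300 MB each.
```

A Spark job could feed this number straight into a `coalesce`/`repartition` call before writing the merged dataset back out.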
Reporter
Comment 1 • 8 years ago
Whd, I feel like mreid mentioned that you had some code written with this capability already. Any thoughts?
Flags: needinfo?(whd)
Comment 2 • 8 years ago
(In reply to Frank Bertsch [:frank] from comment #1)
> Between spark-streaming and direct-to-parquet, some of our parquet datasets are becoming very sharded, making reading them a bottleneck.
Which datasets, and how much of a bottleneck? We've discussed doing this before (mostly for the main store where sharding is even worse) but never committed to measuring it.
> Whd, I feel like mreid mentioned that you had some code written with this
> capability already. Any thoughts?
I remember testing the parquet-tools merge utility for this purpose, but I never hooked it up in any production capacity. If we were to set this up it would likely be a wrapper around the official tool.
Flags: needinfo?(whd)
From Bug 1330047: The new direct-to-parquet writer will likely produce files smaller than optimal for parquet in many cases. We'll need a way to consolidate them into larger files after they're written, probably as a nightly ETL.
I found one standalone project that does this already (Herringbone, from Stripe: https://github.com/stripe/herringbone), but it has a bunch of dependencies that aren't managed in the project itself (e.g. Thrift). It wouldn't be *that* difficult to write a batch view that does this generically, so I'm inclined to go with a batch view and avoid introducing new processes/dependencies in the pipeline.
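A generic batch view along those lines would essentially bin-pack the existing small files into ~300 MB merge groups and rewrite each group as one file. A rough sketch of the grouping step (a greedy first-fit pass; the function name and sizes are made up for illustration, not taken from any existing batch view):

```python
# Hypothetical grouping step for a generic compaction batch view:
# greedily pack small parquet files into groups of at most ~300 MB,
# so each group can be rewritten as a single well-sized output file.

TARGET_GROUP_BYTES = 300 * 1024 * 1024

def plan_merge_groups(file_sizes: dict[str, int]) -> list[list[str]]:
    """Return lists of file paths whose combined size stays near the target."""
    groups: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    # Largest-first keeps each group close to the target size.
    for path, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_bytes + size > TARGET_GROUP_BYTES:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        groups.append(current)
    return groups
```

Each group would then be read and rewritten as one parquet file (e.g. by a Spark job), with the originals replaced only after the rewrite succeeds.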
Comment 5 • 8 years ago
To reiterate: I would like the bottleneck these small files are generating to be quantified, so that this work can be prioritized appropriately. As bug #1330047 makes apparent, we have so far been able to live with the current less-than-optimal state.
Updated • 8 years ago
Priority: -- → P3
Updated • 5 years ago
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Assignee
Updated • 3 years ago
Component: Datasets: General → General