Closed Bug 1389247 Opened 8 years ago Closed 5 years ago

Make parquet merging job

Categories

(Data Platform and Tools :: General, enhancement, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: frank, Unassigned)

References

Details

Between spark-streaming and direct-to-parquet, some of our parquet datasets are becoming very sharded, making reading them a bottleneck. We should run a nightly job that merges those datasets into well-sized (~300MB) parquet files to speed up reads.
Whd, I feel like mreid mentioned that you had some code written with this capability already. Any thoughts?
Flags: needinfo?(whd)
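For reference, the nightly merge could be a small Spark job that rewrites each dataset with a file count derived from its total on-disk size, targeting the ~300MB figure above. A minimal sketch, not production code; the class name and argument handling are placeholders:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object CompactParquet {
  // Target output file size from the description above (~300MB).
  val TargetFileBytes: Long = 300L * 1024 * 1024

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()
    val inputPath = args(0)   // e.g. one day's partition of a dataset
    val outputPath = args(1)  // staging location; swap into place after writing

    // Sum the on-disk size of the input to pick a file count that yields
    // roughly 300MB outputs (compression makes this approximate).
    val in = new Path(inputPath)
    val fs = in.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(in).getLength
    val numFiles = math.max(1L, (totalBytes + TargetFileBytes - 1) / TargetFileBytes).toInt

    // coalesce avoids a full shuffle, since we are only reducing file count.
    spark.read.parquet(inputPath)
      .coalesce(numFiles)
      .write.parquet(outputPath)

    spark.stop()
  }
}

Something like this would run per dataset (and per date partition) out of the nightly ETL scheduler.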
(In reply to Frank Bertsch [:frank] from comment #1)
> Between spark-streaming and direct-to-parquet, some of our parquet datasets
> are becoming very sharded, making reading them a bottleneck.

Which datasets, and how much of a bottleneck? We've discussed doing this before (mostly for the main store, where sharding is even worse) but never committed to measuring it.

> Whd, I feel like mreid mentioned that you had some code written with this
> capability already. Any thoughts?

I remember testing the parquet-tools merge utility for this purpose, but I never hooked it up in any production capacity. If we were to set this up, it would likely be a wrapper around the official tool.
Flags: needinfo?(whd)
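For context, a wrapper around the official tool would mostly be shelling out to the parquet-tools jar's merge command per dataset partition. A hedged sketch, assuming merge takes input files followed by the output file (check the tool's usage text) and that the jar path below, a placeholder, is valid:

import scala.sys.process._

object ParquetMergeWrapper {
  // Placeholder location for the official parquet-tools jar.
  val ParquetToolsJar = "/path/to/parquet-tools.jar"

  // Shell out to `hadoop jar parquet-tools.jar merge <inputs...> <output>`
  // and return the exit code (non-zero means the merge failed).
  def merge(inputs: Seq[String], output: String): Int = {
    val cmd = Seq("hadoop", "jar", ParquetToolsJar, "merge") ++ inputs :+ output
    cmd.!
  }
}

One caveat: as far as I know, merge concatenates row groups rather than rewriting them, so the small row groups survive and only the file count drops; a Spark rewrite like the sketch above may help read performance more.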
From bug 1330047: The new direct-to-parquet writer will likely produce files smaller than optimal for parquet in many cases. We'll need a way to consolidate them into larger files after they're written, probably as a nightly ETL. I found one standalone project that does this already (Herringbone, from Stripe: https://github.com/stripe/herringbone), but it has a bunch of dependencies that aren't managed in the project itself (e.g. Thrift), and it wouldn't be *that* difficult to write a batch view that does this generically. So I'm inclined to go with a batch view, to avoid introducing new processes/dependencies into the pipeline.
To reiterate: I would like the bottleneck these small files are creating to be quantified, so that this work can be prioritized appropriately. As bug #1330047 makes apparent, we have so far been able to live with the current less-than-optimal state.
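As a starting point for quantifying it, one could report file count and mean file size per dataset prefix and compare against the ~300MB target. A sketch using the Hadoop FileSystem API, where the path argument is whatever prefix we want to measure:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.mutable.ArrayBuffer

object ShardReport {
  def main(args: Array[String]): Unit = {
    val root = new Path(args(0))
    val fs = root.getFileSystem(new Configuration())
    val sizes = ArrayBuffer.empty[Long]

    // Recursively list every file under the prefix, keeping parquet files.
    val it = fs.listFiles(root, true)
    while (it.hasNext) {
      val status = it.next()
      if (status.getPath.getName.endsWith(".parquet")) sizes += status.getLen
    }

    val meanMB = if (sizes.isEmpty) 0.0 else sizes.sum.toDouble / sizes.size / (1024 * 1024)
    println(f"${sizes.size}%d parquet files, mean size $meanMB%.1f MB")
  }
}

Timing a representative query before and after a test compaction would give the actual bottleneck number asked for here.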
Depends on: 1389290
Priority: -- → P3
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Datasets: General → General