Closed
Bug 1380673
Opened 8 years ago
Closed 8 years ago
Investigate performance/viability of large-scale joins with parquet data
Categories
(Data Platform and Tools :: General, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mreid, Unassigned)
Details
There are several cases where it would be convenient to be able to join datasets by client_id + day.
We should see how such joins perform on real data (using Spark, Presto, and Athena).
This will inform some upcoming design decisions about derived datasets.
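For concreteness, the shape of the join in question can be sketched in SQL. The example below is only an illustration: it uses Python's built-in sqlite3 module rather than Spark, Presto, or Athena, and the table names (daily_usage, daily_crashes) and their columns are hypothetical, not real datasets. It shows an equi-join on the composite key (client_id, day).

```python
import sqlite3

# Hypothetical tables and columns, used only to illustrate the join shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_usage (client_id TEXT, day TEXT, active_hours REAL);
    CREATE TABLE daily_crashes (client_id TEXT, day TEXT, crash_count INTEGER);
""")
conn.executemany("INSERT INTO daily_usage VALUES (?, ?, ?)", [
    ("a", "2017-07-01", 1.5),
    ("a", "2017-07-02", 0.5),
    ("b", "2017-07-01", 2.0),
])
conn.executemany("INSERT INTO daily_crashes VALUES (?, ?, ?)", [
    ("a", "2017-07-01", 2),
    ("b", "2017-07-02", 1),
])

# Equi-join on the composite key (client_id, day): a row is produced only
# when both the client and the day match in the two tables.
joined = conn.execute("""
    SELECT u.client_id, u.day, u.active_hours, c.crash_count
    FROM daily_usage AS u
    JOIN daily_crashes AS c
      ON u.client_id = c.client_id AND u.day = c.day
    ORDER BY u.client_id, u.day
""").fetchall()
print(joined)
```

Only client "a" on 2017-07-01 appears in both tables, so that is the single row returned. How the same query performs at scale depends on the engine and data layout, which is what this bug sets out to measure.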
Comment 1 (Reporter) • 8 years ago
I've tested joining about 2.5 months of main_summary data against a set of about 400k client IDs, and it worked fine on a cluster of 12 machines. I think we can safely say that this approach will work going forward.
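One reason a join like this can be cheap is that a set of about 400k client IDs is small enough to hold in memory, so the large side only needs a single pass with a membership check. The sketch below shows that idea in plain Python; it is not the code used for the test above, and the function name, row layout, and sample values are assumptions made for illustration.

```python
def filter_by_client_ids(rows, client_ids):
    """Keep only rows whose client_id is in the (small) set of wanted IDs.

    rows: iterable of dicts with a "client_id" key (hypothetical layout).
    client_ids: iterable of IDs small enough to fit in memory.
    """
    wanted = set(client_ids)  # hash set gives O(1) average lookups
    return [row for row in rows if row["client_id"] in wanted]


# Tiny stand-in for the large dataset; values are made up.
sample_rows = [
    {"client_id": "a", "day": "2017-07-01"},
    {"client_id": "b", "day": "2017-07-01"},
    {"client_id": "c", "day": "2017-07-02"},
    {"client_id": "a", "day": "2017-07-02"},
]
matched = filter_by_client_ids(sample_rows, ["a", "c"])
print(len(matched))
```

Distributed engines can apply a similar strategy by shipping the small side to every worker instead of shuffling the large side, though whether that happened in the test above is not stated here.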
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED