Closed Bug 1380673 Opened 8 years ago Closed 8 years ago

Investigate performance/viability of large-scale joins with parquet data

Categories

(Data Platform and Tools :: General, enhancement)

x86
macOS
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Unassigned)

Details

There are several cases where it would be convenient to be able to join datasets by client_id + day. We should see how such joins perform on real data (using Spark, Presto, and Athena). This will inform some upcoming design decisions about derived datasets.
I tested joining about 2.5 months of main_summary data against a set of about 400k client ids, and it ran fine on a cluster of 12 machines. I think we can safely say that this approach will work going forward.
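For reference, here is a minimal sketch of the join semantics under discussion, written in plain Python so it is self-contained. It is not the code used for the test above. The column names `day`, `active_ticks`, and `crashes`, the sample rows, and both helper functions are hypothetical and only there for illustration; in Spark the equivalent would be a join on both key columns.

```python
from collections import defaultdict


def join_on_client_day(left_rows, right_rows):
    """Inner-join two lists of dicts on the composite key (client_id, day)."""
    # Build a hash index over the right-hand side, keyed by (client_id, day).
    index = defaultdict(list)
    for row in right_rows:
        index[(row["client_id"], row["day"])].append(row)

    joined = []
    for row in left_rows:
        # .get() avoids creating empty entries for keys with no match.
        for match in index.get((row["client_id"], row["day"]), []):
            merged = dict(match)
            merged.update(row)  # left-hand values win on column-name clashes
            joined.append(merged)
    return joined


def filter_by_client_ids(rows, client_ids):
    """Keep only the rows whose client_id is in the given (small) set."""
    wanted = set(client_ids)
    return [row for row in rows if row["client_id"] in wanted]


# Hypothetical sample data standing in for main_summary and a derived dataset.
main_summary = [
    {"client_id": "a", "day": "20170701", "active_ticks": 10},
    {"client_id": "a", "day": "20170702", "active_ticks": 4},
    {"client_id": "b", "day": "20170701", "active_ticks": 7},
]
derived = [
    {"client_id": "a", "day": "20170701", "crashes": 1},
    {"client_id": "b", "day": "20170702", "crashes": 2},
]

joined = join_on_client_day(main_summary, derived)
subset = filter_by_client_ids(main_summary, {"b"})
```

Only rows that agree on both client_id and day survive the join, so `joined` holds a single row for client "a" on 20170701. The second helper mirrors the shape of the test above, where a large table is restricted to a comparatively small set of client ids.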
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED