Comments from bug triage:
- Hopefully done by the end of the week
- First pass will use dummy tables (headers/JSON blobs)
- Next step: use the FHR schema

Risks:
- The output is synchronous, which could jam up Heka.
The Redshift output was tested and running on Feb 6 against a basic message/table schema. Bulk-loading speed was reasonable from my home machine (about 10K inserts/sec) and should be much better from machines within AWS. Synchronous individual inserts were painfully slow, about 10 per second. Katie: any ETA on the real schema?
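For reference, the difference between the two paths looks roughly like the sketch below. This is not the actual output plugin, just a minimal illustration: it assumes psycopg2, a hypothetical "messages" table with (uuid, payload) columns, and placeholder connection parameters.

import json
import psycopg2
from psycopg2.extras import execute_values

# Placeholder connection details; Redshift speaks the Postgres wire protocol.
conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="pipeline", user="loader",
                        password="...")
cur = conn.cursor()

rows = [("uuid-%d" % i, json.dumps({"seq": i})) for i in range(10000)]

# Slow path: one round trip and one commit per row (roughly the 10
# inserts/sec seen from outside AWS).
# for uuid, payload in rows:
#     cur.execute("INSERT INTO messages (uuid, payload) VALUES (%s, %s)",
#                 (uuid, payload))
#     conn.commit()

# Fast path: multi-row VALUES batches with a single commit (roughly the
# 10K inserts/sec seen when bulk loading).
execute_values(cur, "INSERT INTO messages (uuid, payload) VALUES %s",
               rows, page_size=1000)
conn.commit()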
The FHR data is ingested by Bagheera and stored in HDFS initially. It is then processed by bcolloran's de-orphaning script. Saptarshi's code creates a set of samples from the de-orphaned data, which are loaded into Vertica.

Here's the full Vertica schema (includes ADI and other tables): https://mana.mozilla.org/wiki/download/attachments/43724740/vertica_tables.txt

And more info about the rollup & Vertica import scripts: https://mana.mozilla.org/wiki/display/BIDW/FHR+rollups
So what do we actually need here?
- Something to read the de-orphaned results out of HDFS and put them in Redshift instead? (A rough sketch follows below.)
- Perform the de-orphaning in the pipeline data stream and populate Redshift directly, avoiding HDFS?
- ??

The generic Redshift output is done, so I am closing this. Please open a bug (or bugs) for the implementation of the specific FHR use cases.
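For whoever picks up the first option: a rough sketch of what it could look like, assuming the de-orphaned records land in HDFS as newline-delimited JSON. All paths, bucket names, table names, and credentials below are hypothetical; the one firm point is that Redshift's bulk path is COPY from S3, which is what gives the bulk-load speeds mentioned above.

import subprocess
import boto3
import psycopg2

HDFS_PATH = "/fhr/deorphaned/part-*"          # hypothetical location
S3_BUCKET, S3_KEY = "fhr-staging", "deorphaned/batch-0001.json"

# 1. Pull the de-orphaned output out of HDFS (the HDFS shell expands
#    the glob itself).
data = subprocess.check_output(["hdfs", "dfs", "-cat", HDFS_PATH])

# 2. Stage it in S3, since Redshift's COPY loads from S3, not HDFS.
boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=S3_KEY, Body=data)

# 3. Bulk-load into Redshift with COPY, which is orders of magnitude
#    faster than row-by-row INSERTs.
conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="fhr", user="loader",
                        password="...")
cur = conn.cursor()
cur.execute("""
    COPY fhr_payloads FROM 's3://%s/%s'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
    FORMAT AS JSON 'auto';
""" % (S3_BUCKET, S3_KEY))
conn.commit()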
Status: ASSIGNED → RESOLVED
Resolution: --- → FIXED