To export the historic data from mango to AWS, we need to implement a map/reduce version of the code that processes incoming submissions in the AWS-based Telemetry data pipeline. To summarize:
- Read JSON payload
- Convert to validated format
- Write to partitioned directory structure
- lzma/xz compress
- Export to S3
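For illustration, here's a minimal sketch of the per-ping steps above. The partition keys (reason/appName/appUpdateChannel/appVersion) are an assumption about the ping schema, and the validation is just a placeholder; the real pipeline's rules live in telemetry-server.

```python
import json
import lzma
import os

def process_ping(raw_line, out_root):
    """Sketch: parse one raw telemetry ping, do a token validation,
    write it into a partitioned directory tree as xz-compressed
    newline-delimited JSON. Partition fields are assumed, not verified."""
    ping = json.loads(raw_line)
    info = ping.get("info", {})
    if not info:
        # placeholder validation; real validation is far stricter
        raise ValueError("ping has no info section")
    partition = os.path.join(
        out_root,
        info.get("reason", "unknown"),
        info.get("appName", "unknown"),
        info.get("appUpdateChannel", "unknown"),
        info.get("appVersion", "unknown"),
    )
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "pings.json.xz")
    # append a compressed record; a real job would batch before compressing
    with lzma.open(path, "at", encoding="utf-8") as f:
        f.write(json.dumps(ping) + "\n")
    return path
```

The final S3 export step isn't shown; with boto3 it would be something like `boto3.client("s3").upload_file(path, bucket, key)` run over the partition tree.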
hmm... Is this something we want to do? tmary and I arrived at two possible solutions:

A) Aggregate historic telemetry data on mango using a map/reduce job. This produces output to be displayed on the telemetry-dashboard.

B) Export raw telemetry pings to S3, process them server-side with the incoming server from telemetry-server, and publish them to a bucket once converted and validated. Then we can do whatever analysis we might want to do...

Strategy (A) is implemented; the job finished last night, so I'll be verifying the data today and tomorrow, and hopefully merging it into a data file usable by the dashboard.

Strategy (B) is fairly easy to implement on the mango side and could be fired up in an hour or so. Once everything is in S3, we can launch incoming-process servers on spot nodes as if the exported data had been collected from HTTP nodes. A few hacks to the incoming-process servers would be required; these are already implemented (I have them lying around somewhere).

If (A) works, I'm not sure we want to do (B); that's what I'm asking. Note, this bug suggests (C): conversion and validation of telemetry pings on mango. In terms of work required, I think it'll be faster to do (B).

-----
Anyways, do we want to export, if (A) works?
Just for reference, the script doing (A) on mango is `process_seq_file-fast.py`, available here: https://github.com/jonasfj/telemetry-dashboard/tree/hadoop-extraction-script
Pretty sure we did this... then sat on the data so long that we no longer care to import it anyway :)