Write Hadoop job to convert and export historic Telemetry data to AWS format

RESOLVED FIXED in Unreviewed

Status

Mozilla Metrics
Hadoop/HBase Operations
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: mreid, Assigned: jonasfj)

Tracking

unspecified
Unreviewed
x86_64
Linux

Details

(Reporter)

Description

4 years ago
In order to export the historic data from mango to AWS, we need to implement a map/reduce version of the code that processes incoming submissions in the AWS-based Telemetry data pipeline.

To summarize:
- Read json payload
- Convert to validated format
- Write to partitioned directory structure
- lzma/xz compress
- Export to S3
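
The first four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual converter: the `info` field names used for partitioning (`appName`, `appUpdateChannel`, `appVersion`), the "validation" check, and the `part-00000.xz` file name are all hypothetical. The S3 export step is left out because it depends on credentials and the client library in use.

```python
import json
import lzma
import os
import re


def convert(raw_line):
    """Parse one JSON submission; return (partition, record) or None if invalid."""
    try:
        payload = json.loads(raw_line)
    except ValueError:
        return None
    info = payload.get("info") if isinstance(payload, dict) else None
    if not isinstance(info, dict):
        return None
    # Hypothetical partition dimensions; the real schema may differ.
    dims = [info.get(k) for k in ("appName", "appUpdateChannel", "appVersion")]
    if not all(isinstance(d, str) and d for d in dims):
        return None
    # Keep directory names safe by replacing anything unusual with "_".
    partition = tuple(re.sub(r"[^A-Za-z0-9_-]", "_", d) for d in dims)
    return partition, json.dumps(payload, sort_keys=True)


def write_partitioned(lines, out_dir):
    """Group valid records by partition and write one xz file per partition."""
    groups = {}
    for line in lines:
        converted = convert(line)
        if converted is None:
            continue  # skip submissions that fail the (assumed) validation
        partition, record = converted
        groups.setdefault(partition, []).append(record)

    written = []
    for partition, records in sorted(groups.items()):
        directory = os.path.join(out_dir, *partition)
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, "part-00000.xz")
        with lzma.open(path, "wt", encoding="utf-8") as fh:
            for record in records:
                fh.write(record + "\n")
        written.append(path)
    return written
```

In a map/reduce setting, `convert` would roughly correspond to the map step (emitting the partition as the key) and the per-partition write to the reduce step.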
(Reporter)

Updated

4 years ago
Assignee: tmeyarivan → mreid
(Reporter)

Updated

4 years ago
Group: metrics-private
(Reporter)

Updated

4 years ago
Blocks: 921116
(Reporter)

Updated

4 years ago
Assignee: mreid → jopsen
(Assignee)

Comment 1

4 years ago
hmm... Is this something we want to do?

So tmary and I arrived at two possible solutions:

A)
Aggregate historic telemetry data on mango using a map/reduce job.
This produces output to be displayed on the telemetry-dashboard.

B)
Export raw telemetry pings to S3, process them server side with the telemetry
incoming server from telemetry-server, and publish them to a bucket once converted and validated.
Then we can do whatever analysis we might want to do...
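
For concreteness, the kind of aggregation meant by (A) could look roughly like the sketch below: a map step that emits one key per histogram and a reduce step that sums bucket counts. This is a minimal illustration only. The ping layout (`info`, `histograms`, `values`) and the choice of key are assumptions, and it is not the job that actually ran on mango.

```python
from collections import defaultdict


def map_ping(ping):
    """Yield (key, bucket_counts) pairs for each histogram in one ping."""
    info = ping.get("info", {})
    for name, hist in ping.get("histograms", {}).items():
        # Hypothetical key: channel, version, histogram name.
        key = (info.get("appUpdateChannel"), info.get("appVersion"), name)
        yield key, hist["values"]


def reduce_counts(pairs):
    """Sum bucket counts across all pairs that share a key."""
    totals = defaultdict(lambda: defaultdict(int))
    for key, values in pairs:
        for bucket, count in values.items():
            totals[key][bucket] += count
    return {key: dict(buckets) for key, buckets in totals.items()}


def aggregate(pings):
    """Run the map step over every ping, then reduce the emitted pairs."""
    return reduce_counts(pair for ping in pings for pair in map_ping(ping))
```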

Strategy (A) is implemented. The job finished last night, and I'll be verifying the data today and tomorrow.
Hopefully I can then merge it into a data file usable by the dashboard.

Strategy (B) is fairly easy to implement on the mango-side, and can be fired up in an hour or so.
Once everything is in S3, we can launch incoming-process servers on spot nodes as if the exported data was collected from HTTP nodes. A few hacks to the incoming-process servers would be required; these are already implemented (I have them lying around somewhere).


If (A) works, I'm not sure we want to do (B); that's what I'm asking.
Note that this bug suggests (C): conversion and validation of telemetry pings on mango. In terms of work required, I think it'll be faster to do (B).

-----
Anyways, do we want to export, if (A) works?
(Assignee)

Comment 2

4 years ago
Just for reference, the script doing (A) on mango is `process_seq_file-fast.py`, available here:
https://github.com/jonasfj/telemetry-dashboard/tree/hadoop-extraction-script
(Assignee)

Comment 3

3 years ago
Pretty sure we did this... Then sat on the data so long that we no longer care to import it anyway :)
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED