Closed
Bug 1180051
Opened 9 years ago
Closed 6 years ago
Create Redshift table for raw FxOS FTU data
Categories
(Cloud Services :: Metrics: Product Metrics, defect, P5)
Cloud Services
Metrics: Product Metrics
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: dzeber, Unassigned)
References
Details
We'd like to make the entire Firefox OS FTU dataset available in a SQL query-able table. Currently the raw data is stored with telemetry in S3, and we run daily scripts to sanitize and flatten it into rows, which are outputted as CSV snapshots of the dataset. We want to replace the CSV output with a Redshift table that will more convenient for us to maintain and for the consumers of the data to access.
Comment 1•9 years ago
|
||
(In reply to Dave Zeber [:dzeber] from comment #0) > We'd like to make the entire Firefox OS FTU dataset available in a SQL > query-able table. > > Currently the raw data is stored with telemetry in S3, and we run daily > scripts to sanitize and flatten it into rows, which are outputted as CSV > snapshots of the dataset. We want to replace the CSV output with a Redshift > table that will more convenient for us to maintain and for the consumers of > the data to access. Can you attach or link to the sanitize/flatten scripts?
Flags: needinfo?(dzeber)
Reporter | ||
Comment 2•9 years ago
|
||
The code for sanitizing the raw FTU payloads is here: https://github.com/dzeber/fxos-metrics/blob/master/awsjobs/dump_format_ftu.py. It is currently run daily as a mapreduce job with the only filter conditions {reason == "ftu", appName == "FirefoxOS", and submission_date in a relevant date range}. The filter rules called by this script are located in https://github.com/dzeber/fxos-metrics/tree/master/utils. The list of fields that are outputted by this script (to be considered as columns) is at https://github.com/dzeber/fxos-metrics/blob/master/utils/dump_schema.py#L11. These field names are either those used in the raw payload or those added or modified in the sanitization script. If any of these fields is absent from the payload, it is recorded as missing. I can send a sample output CSV over email.
Flags: needinfo?(dzeber)
Comment 3•9 years ago
|
||
Yes, please do send a sample.
Reporter | ||
Comment 4•9 years ago
|
||
Additional details: the output of the mapreduce job is (list of field values, # of payloads with the same set of values). The output table is basically occurrence counts of the segments defined by the column values.
Comment 6•8 years ago
|
||
Let me reply to this NI after meeting with Katie and Georg next week, please. Thanks.
Comment 7•8 years ago
|
||
Hi Thomas, how much of a pain would it be to implement these? Reason why I am asking is that the current answer would be 'no', but I can see us coming to a stage where different Connected Devices trains will start submitting activation data and then, these tables would actually make it very comfortable for me to quickly do analysis by simply being able to connect R with the Redshift tables. Dominik
Flags: needinfo?(dstrohmeier)
Comment 8•6 years ago
|
||
I don't think we are going to work on this anymore. Closing.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•