Closed Bug 1180051 Opened 9 years ago Closed 6 years ago

Create Redshift table for raw FxOS FTU data

Categories

(Cloud Services :: Metrics: Product Metrics, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dzeber, Unassigned)

References

Details

We'd like to make the entire Firefox OS FTU dataset available in a SQL query-able table. 

Currently the raw data is stored with telemetry in S3, and we run daily scripts to sanitize and flatten it into rows, which are outputted as CSV snapshots of the dataset. We want to replace the CSV output with a Redshift table that will more convenient for us to maintain and for the consumers of the data to access.
See Also: → 1180058
(In reply to Dave Zeber [:dzeber] from comment #0)
> We'd like to make the entire Firefox OS FTU dataset available in a SQL
> query-able table. 
> 
> Currently the raw data is stored with telemetry in S3, and we run daily
> scripts to sanitize and flatten it into rows, which are outputted as CSV
> snapshots of the dataset. We want to replace the CSV output with a Redshift
> table that will more convenient for us to maintain and for the consumers of
> the data to access.

Can you attach or link to the sanitize/flatten scripts?
Flags: needinfo?(dzeber)
Depends on: 1204869
The code for sanitizing the raw FTU payloads is here: https://github.com/dzeber/fxos-metrics/blob/master/awsjobs/dump_format_ftu.py. It is currently run daily as a mapreduce job with the only filter conditions {reason == "ftu", appName == "FirefoxOS", and submission_date in a relevant date range}. 

The filter rules called by this script are located in https://github.com/dzeber/fxos-metrics/tree/master/utils. 

The list of fields that are outputted by this script (to be considered as columns) is at https://github.com/dzeber/fxos-metrics/blob/master/utils/dump_schema.py#L11. These field names are either those used in the raw payload or those added or modified in the sanitization script. If any of these fields is absent from the payload, it is recorded as missing.

I can send a sample output CSV over email.
Flags: needinfo?(dzeber)
Yes, please do send a sample.
Additional details: the output of the mapreduce job is (list of field values, # of payloads with the same set of values). The output table is basically occurrence counts of the segments defined by the column values.
Hello, is this still needed?
Flags: needinfo?(dstrohmeier)
Priority: -- → P5
Let me reply to this NI after meeting with Katie and Georg next week, please.
Thanks.
Hi Thomas,
how much of a pain would it be to implement these?

Reason why I am asking is that the current answer would be 'no', but I can see us coming to a stage where different Connected Devices trains will start submitting activation data and then, these tables would actually make it very comfortable for me to quickly do analysis by simply being able to connect R with the Redshift tables.

Dominik
Flags: needinfo?(dstrohmeier)
I don't think we are going to work on this anymore. Closing.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.