Closed Bug 1250941 Opened 9 years ago Closed 9 years ago

Create a derived dataset for the unified-urlbar experiment

Categories

(Cloud Services :: Metrics: Product Metrics, defect, P4)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mak, Assigned: mak)

The unified-urlbar experiment just finished on the Beta channel, and we'd like to be able to analyze the collected data. Could you please make us a dataset limited to only users who ran the experiment? We would like to query the following telemetry values:
- environment data
- UITelemetry (search, search-oneoff, click-builtin-item, environment, toolbars). This is part of the Simple Measurements.
- FX_URLBAR_SELECTED_RESULT_TYPE and SEARCH_COUNTS histograms
- wherever the experiment branch is stored (no idea)
If the data is too large, we'd be fine with retaining 15k random users per branch (the experiment had 3 branches).
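For illustration only, the "15k random users per branch" cap could be sketched as below. This is a minimal pure-Python sketch, not the job that was actually run: the function name `sample_clients_per_branch`, its parameters, and the `(client_id, branch)` input shape are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_clients_per_branch(client_branches, max_per_branch=15000, seed=0):
    """Pick at most `max_per_branch` distinct client ids for each branch.

    `client_branches` is an iterable of (client_id, branch) pairs; this
    input shape is hypothetical and only serves the example.
    """
    by_branch = defaultdict(set)
    for client_id, branch in client_branches:
        # A client can submit many pings, so deduplicate ids per branch.
        by_branch[branch].add(client_id)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = {}
    for branch, clients in by_branch.items():
        ordered = sorted(clients)  # stable order before sampling
        k = min(max_per_branch, len(ordered))
        sampled[branch] = sorted(rng.sample(ordered, k))
    return sampled
```

Sampling client ids rather than pings keeps every ping of a retained user, which matters if per-user questions are asked later.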
Roberto, are you assuming that this needs to be a longitudinal dataset? I don't see that in the requirements here. We're clearly not going to get a strict schema out of UITelemetry in the short term. Can you just teach mak how to create a subset of get_pings data for this experiment and save it to S3?
Flags: needinfo?(rvitillo)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #2)
> Roberto, are you assuming that this needs to be a longitudinal dataset? I
> don't see that in the requirements here. We're clearly not going to get a
> strict schema out of UITelemetry in the short term.

It depends on what kind of questions Marco intends to answer with the experimental data. Creating a longitudinal Parquet dataset per experiment is something I would like to deal with at some point instead of writing individual ETL jobs for each experiment.

> Can you just teach mak how to create a subset of get_pings data for this
> experiment and save it to S3?

Certainly, this is what I proposed to Marco in our e-mail thread before updating the bug.
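As a rough sketch of the "subset for this experiment" step, the filtering could look like the following on pings that have already been loaded as Python dicts. Several things here are assumptions rather than facts from this bug: that the branch lives under `environment.addons.activeExperiment` with `id` and `branch` keys, the helper names, and the experiment id used by the caller. Fetching the pings and uploading the result to S3 are deliberately left out.

```python
import json


def experiment_branch(ping, experiment_id):
    """Return the branch if the ping belongs to `experiment_id`, else None.

    Assumes (not confirmed in this bug) that the active experiment is
    recorded at environment.addons.activeExperiment as {id, branch}.
    """
    env = ping.get("environment") or {}
    addons = env.get("addons") or {}
    active = addons.get("activeExperiment") or {}
    if active.get("id") == experiment_id:
        return active.get("branch")
    return None


def subset_for_experiment(pings, experiment_id):
    """Yield a reduced record for every ping enrolled in the experiment."""
    for ping in pings:
        branch = experiment_branch(ping, experiment_id)
        if branch is not None:
            yield {
                "clientId": ping.get("clientId"),
                "branch": branch,
                "payload": ping.get("payload") or {},
            }


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, ready to be stored."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)
```

Keeping only the fields needed for the analysis makes the stored subset much smaller than the raw pings.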
Flags: needinfo?(rvitillo)
Component: Metrics: Pipeline → Metrics: Product Metrics
Priority: -- → P4
Since I plan to do the analysis on Spark myself, I'm assigning the bug to myself for now.
Assignee: nobody → mak77
First version is at https://github.com/mak77/telemetry_analysis/blob/master/unified-urlbar.ipynb This is pretty much restricted, so I could run it on a single node. Roberto, could you please take a look at it and tell me if I'm doing something very dumb? For the broader version, I will split it into extraction and analysis and store on S3, as you suggested. I'm not sure which values I should aim at for the extraction though. I was thinking to fetch from 20160112 to 20160212, and sample 10% of the data... Is that too much? How much may it take using more clusters?
Flags: needinfo?(rvitillo)
(In reply to Marco Bonardo [::mak] from comment #5)
> first version is at
> https://github.com/mak77/telemetry_analysis/blob/master/unified-urlbar.ipynb
>
> This is pretty much restricted, so I could run it on a single node.
> Roberto, could you please take a look at it and tell me if I'm doing
> something very dumb?

Your analysis is well written.

> For the broader version, I will split it into extraction and analysis and
> store on S3, as you suggested. I'm not sure which values I should aim at for
> the extraction though, I was thinking to fetch from 20160112 to 20160212,
> and sample 10% of the data... is that too much? how much may it take using
> more clusters?

10% might be OK. You could compute confidence intervals for your results, keep increasing the percentage of pings considered until the results stabilize, or both. Feel free to spawn a larger cluster once you have the final version of your analysis.
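Both suggestions can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the Wilson score interval for a proportion (one common choice, not necessarily what was used in the notebook), and `first_stable_fraction` with its `tolerance` parameter is a hypothetical helper for the "increase until stable" idea.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def first_stable_fraction(estimates, tolerance=0.01):
    """Return the first sampling fraction whose estimate moved by at most
    `tolerance` compared with the previous, smaller fraction.

    `estimates` is a list of (fraction, value) pairs in increasing order
    of fraction. Returns None if the estimates never settle.
    """
    for (_, prev), (fraction, value) in zip(estimates, estimates[1:]):
        if abs(value - prev) <= tolerance:
            return fraction
    return None
```

If the interval at 10% is already narrow enough for the question being asked, there is little to gain from processing more pings.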
Flags: needinfo?(rvitillo)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED