Bug 1286573: Create cross-sectional parquet dataset
Status: Closed (RESOLVED FIXED)
Opened 9 years ago · Closed 8 years ago
Categories: Cloud Services Graveyard :: Metrics: Pipeline (defect, P1)
Tracking: (Not tracked)
People: Reporter: rvitillo · Assigned: harter
User Story
The distribution viewer will be based on a cross-sectional dataset derived from our longitudinal dataset. Brendan created a prototype of the dataset [1] [2] some time ago. Ideally the dataset should be generated by a Scala job and scheduled with Airflow to run after the longitudinal dataset has been updated. Note that not all columns will be used by the viewer.

[1] https://github.com/mozilla/bcolloran_spark/blob/master/CrossSectionalExploration/CrossSectionalExploration%20-%20single%20variable%20distributions%20-%20raw%20diagnostic%20data.ipynb
[2] s3n://net-mozaws-prod-us-west-2-pipeline-analysis/bcolloran/crossSectionalFrame_20160317.parquet
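As an illustration of the shape of the job the user story describes, here is a minimal Scala/Spark sketch; the object name, paths, and selected column are placeholders, not the real values:

```scala
// Illustrative sketch only: a Scala job shaped like the one described above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CrossSectionalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CrossSectionalView").getOrCreate()
    // Each longitudinal row already summarizes a single client_id, so this
    // job mainly selects and derives per-client summary columns.
    val longitudinal = spark.read.parquet("s3://example-bucket/longitudinal")
    val crossSectional = longitudinal.select(col("client_id") /* plus derived columns */)
    crossSectional.write.mode("overwrite").parquet("s3://example-bucket/cross_sectional")
    spark.stop()
  }
}
```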
Description: No description provided.
Reporter
Updated • 9 years ago
User Story: (updated)
Updated • 9 years ago
Points: --- → 3
Priority: -- → P2
Reporter
Updated • 9 years ago
Assignee: nobody → rharter

brendan c
Comment 1 • 9 years ago
Roberto, in the meeting the other day you asked if there were any array-valued columns; at the time I said no, but I've just spoken with Martin Lopatka (Rebecca's contractor, who I think you've met), and to support the work he's doing I'm going to add a column containing a list of all the addon names that have been observed for a profile, so there will be one column containing a list of strings. Hopefully this does not make things vastly more complicated.
Reporter
Comment 2 • 9 years ago
(In reply to brendan c from comment #1)
> Roberto, in the meeting the other day you asked if there were any
> array-valued columns; at the time I said no, but I've just spoken with
> Martin Lopatka (Rebecca's contractor, who I think you've met), and to
> support the work he's doing I'm going to add a column containing a list of
> all the addon names that have been observed for a profile, so there will be
> one column containing a list of strings. Hopefully this does not make things
> vastly more complicated.
That's not an issue. That said, I spoke with him yesterday and we tentatively agreed that, for the time being, he'll use the longitudinal dataset for his work, since it already contains data about add-ons.
Comment 3 • 9 years ago
How often will this dataset be updated? I'm asking so I can plan how often to run the distribution viewer job, which will use this dataset as its source.
Assignee
Comment 4 • 9 years ago
I can't speak authoritatively since I'm just getting ramped up on this project, but I'd expect the cross-sectional dataset to be updated weekly with the longitudinal dataset.
Poking Roberto to confirm.
Flags: needinfo?(rvitillo)
Assignee
Updated • 9 years ago
Priority: P2 → P1
Reporter
Comment 5 • 9 years ago
(In reply to Ryan Harter [:harter] from comment #4)
> I can't speak authoritatively since I'm just getting ramped up on this
> project, but I'd expect the cross-sectional dataset to be updated weekly
> with the longitudinal dataset.
>
> Poking Roberto to confirm.
That is correct.
Flags: needinfo?(rvitillo)
Assignee
Updated • 9 years ago
Status: NEW → ASSIGNED
Assignee
Updated • 9 years ago
Points: 3 → ---
Assignee
Comment 6 • 8 years ago
I have an example dataset based on one shard of the longitudinal data (~0.002%), stored here: s3://telemetry-test-bucket/cross_sectional/single_shard. The dataset is still in alpha and does not yet contain most of the fields. For reference, I'm periodically checking the code into the batch-view repository here: https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala
Brendan, I want your thoughts on the following:
**Null Values**
In the example notebook, if a ping has at least one missing value the entire ping is discarded. This is harder to replicate in the new dataset, since it is derived from the longitudinal dataset, where rows correspond to client_ids rather than pings. I propose we do one of the following:
* Plan A: Exclude a client_id only when one or more fields is missing from all of that client_id's pings (the examples linked below explain this better, and a filtering sketch follows them). In other words, a longitudinal row is excluded iff one or more of its direct members is NULL; the row is not excluded if an array merely contains one or more NULL values (indirect members).
* Plan B: Exclude no pings
Here are some examples to illustrate: https://docs.google.com/spreadsheets/d/1s4b01nrHNqwktu_yup-lkaFd07o4O2CsBpW8_xyrG-I/edit#gid=0
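For concreteness, a minimal sketch of Plan A against the Spark DataFrame API; `excludeIncompleteClients` is a hypothetical helper, and the real job may need to treat map/struct columns specially:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.ArrayType

// Drops a row iff any direct (top-level, non-array) column is NULL;
// NULLs inside array columns (indirect members) do not disqualify a row.
def excludeIncompleteClients(longitudinal: DataFrame): DataFrame = {
  val directColumns = longitudinal.schema.fields
    .filterNot(_.dataType.isInstanceOf[ArrayType]) // arrays are indirect members
    .map(f => col(f.name))
  directColumns.foldLeft(longitudinal)((df, c) => df.filter(c.isNotNull))
}
```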
**Column name changes**
I propose the following column name changes:
* Change column names to snake_case
* For example: clientId -> client_id, addonCountForeign_Avg -> addon_count_foreign_avg
* I expect we'll want to read this table into Presto so it can be accessible in re:dash, and Presto is case insensitive.
* Remove day of week index from variable names
* For example: activeHours_0Mon -> active_hours_mon
* In java.util.Calendar, Sunday is day 1 and Monday is day 2, while in Python, Monday is day 0. If we choose either convention, someone will be confused later. The three-letter abbreviation is clear on its own, though this breaks lexicographic sorting.
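A quick REPL-style sketch of the snake_case rename proposed above, in plain Scala; `toSnakeCase` is a hypothetical helper, and runs of consecutive capitals would need extra handling:

```scala
// Hypothetical helper; not the actual CrossSectionalView code.
def toSnakeCase(name: String): String =
  name.replaceAll("([a-z0-9])([A-Z])", "$1_$2")
    .replaceAll("_+", "_")
    .toLowerCase

toSnakeCase("clientId")              // "client_id"
toSnakeCase("addonCountForeign_Avg") // "addon_count_foreign_avg"
```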
Flags: needinfo?(bcolloran)
brendan c
Comment 7 • 8 years ago

> **Null Values**
>
> In the example notebook, if a ping has at least one missing value the entire
> ping is discarded.
I don't think that's right. I do a lot to try to clean pings rather than discarding them (see the function cleanSubsessPingRows in https://github.com/mozilla/bcolloran_spark/blob/master/perClientDataPrep/cross-sectional%20data%20table%20%282016-07-14%29%20-%20deduped%20clientIds.ipynb ).
Can you indicate where you think I'm dropping pings? I've reviewed my notebook and I can't find the spot you're indicating.
I believe my approach was in fact "Plan B: Exclude no pings", but I'm concerned that you seemed to think otherwise. It's quite important that your results match mine, so if your implementation diverges anywhere we need to figure that out...
> **Column name changes**
> I propose the following column name changes:
> * Change column names to snake_case
> * For example: clientId -> client_id, addonCountForeign_Avg ->
> addon_count_foreign_avg
> * I expect we'll want to read this table into Presto so it can be
> accessible in re:dash, and Presto is case insensitive.
Yep this works for me.
> * Remove day of week index from variable names
> * For example: activeHours_0Mon -> active_hours_mon
> * In java.util.Calendar, Sunday is day 1 and Monday is day 2, while in
> Python, Monday is day 0. If we choose either convention, someone will be
> confused later. The three-letter abbreviation is clear on its own, though
> this breaks lexicographic sorting.
The lexicographic sorting turns out to be quite handy. Let's leave that one as-is.
Flags: needinfo?(bcolloran)
Assignee
Comment 8 • 8 years ago
(In reply to brendan c from comment #7)
> > **Null Values**
> >
> > In the example notebook, if a ping has at least one missing value the entire
> > ping is discarded.
>
> I don't think that's right. I do a lot to try to clean pings rather than
> discarding them (see the function cleanSubsessPingRows in
> https://github.com/mozilla/bcolloran_spark/blob/master/perClientDataPrep/
> cross-sectional%20data%20table%20%282016-07-14%29%20-%20deduped%20clientIds.
> ipynb ).
>
> Can you indicate where you think I'm dropping pings? I've reviewed my
> notebook and I can't find the spot you're indicating.
>
> I believe my approach was in fact "Plan B: Exclude no pings", but I'm
> concerned that you seemed to think otherwise. It's quite important that your
> results match mine, so if your implementation diverges anywhere we need to
> figure that out...
D'oh, I misread `getDictPathsAndRename` and thought that function was discarding '__missing' results. You're correct, thanks for the clarification.
> > **Column name changes**
> > I propose the following column name changes:
> > * Change column names to snake_case
> > * For example: clientId -> client_id, addonCountForeign_Avg ->
> > addon_count_foreign_avg
> > * I expect we'll want to read this table into Presto so it can be
> > accessible in re:dash, and Presto is case insensitive.
>
> Yep this works for me.
>
> > * Remove day of week index from variable names
> > * For example: activeHours_0Mon -> active_hours_mon
> > * In java.util.Calendar, Sunday is day 1 and Monday is day 2, while in
> > Python, Monday is day 0. If we choose either convention, someone will be
> > confused later. The three-letter abbreviation is clear on its own, though
> > this breaks lexicographic sorting.
>
> The lexicographic sorting turns out to be quite handy. Let's leave that one
> as-is.
Sounds good, I'll update the dataset.
Assignee
Comment 9 • 8 years ago
I just uploaded a copy of the cross-sectional dataset [0], run over the complete longitudinal dataset. Soon this dataset will be created weekly via Airflow [1]. I've addressed the confusion around handling null values, so the results should now be consistent. The code is still hosted here [2].
For the rest of the week I'll be filling out the column coverage. Next week I'll be on PTO.
[0] s3://telemetry-test-bucket/cross_sectional/alpha_v20160920/
[1] https://github.com/mozilla/telemetry-airflow/pull/44
[2] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala
Comment 10 • 8 years ago
Thanks Ryan, looking good. What's your ETA for a dataset with coverage matching the prototype? I'm just trying to figure out how this fits into other work that's going on: the distribution viewer, other exploratory analysis, etc.
Assignee
Comment 11 • 8 years ago
As discussed in today's stand up, my best point-estimate for parity with the prototype is 2016-10-14. I'll keep this thread updated.
Assignee
Comment 12 • 8 years ago
I've uploaded an example dataset with column parity with Brendan's prototype at s3://telemetry-test-bucket/cross_sectional/alpha/20161012/. The PRs are still under review but should land this week. This should unblock the rest of the team.
I have a couple of bugfixes to clean up, so these numbers may shift slightly over the next couple of days, but the structure should be consistent.
A couple of notes / differences:
* The cross-sectional dataset is based on the longitudinal dataset, which only summarizes main pings. Since we do not have deletion or crash pings, the cross-sectional dataset does not include crashes_numPings, deletionPings, or mainPingsNumTotal.
* The longitudinal dataset guarantees the start-date string format, so I did not include badStartDateStrings_numPings.
Comment 13 • 8 years ago
Glad to see the progress here! Just to have it on your radar, I think next steps will be:
1) Add a number of very simple columns to the table that are the result of simple mathematical operations on the existing columns. For example, we'll want things like "active_hours_per_active_day", which is just active_hours_total / days_active.
There are a bunch of things like this, and even though they are super simple operations, we need a number of these rate metrics for the distribution viewer, and we don't want to push that logic onto that team. More generally, it's preferable to have these measures pre-populated in one place for all consumers of the cross-sectional table rather than making people roll their own. (A sketch of what this might look like follows this list.)
2) Create a second set of tables that look at the crash and deletion pings, create the relevant columns, and join them into this table by clientId. (Crashes are much more important than deletions.)
3) Add a couple of rate columns for crashes, as in (1).
4) Add some cross-sectional metrics related to the new scalar engagement measures.
The fourth of those will be a bit down the road because I haven't even looked at the new engagement measures yet, but (2) can be started any time, and I can give you the formulas for (1) and (3) as soon as you are ready.
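As an illustration, a minimal sketch of one such derived rate column in Spark Scala; `addRateColumns` and the input column names are assumptions based on the example in (1), not the job's actual API:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Adds active_hours_per_active_day. With ANSI mode off (the historical
// default), Spark SQL's `/` returns NULL when days_active is 0 or NULL,
// which is reasonable behavior for a rate metric.
def addRateColumns(xsec: DataFrame): DataFrame =
  xsec.withColumn(
    "active_hours_per_active_day",
    col("active_hours_total") / col("days_active"))
```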
Thanks Ryan!
Comment 14 • 8 years ago
> 1) add a number of very simple columns to the table that are the result of
> simple mathematical operation on the existing columns.
Oh, I just noticed that we'll want to add a few that aren't pure math. For each of these columns,
|-- start_date_oldest: string (nullable = true)
|-- start_date_newest: string (nullable = true)
|-- prof_creation_date: string (nullable = true)
we'll want a numeric measure of how many days have passed since that date (as of the time the cross-sectional aggregation runs). So that would be (see the sketch below):
|-- start_date_oldest_days_since: long (nullable = true)
|-- start_date_newest_days_since: long (nullable = true)
|-- prof_creation_date_days_since: long (nullable = true)
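A minimal sketch of how these might be derived in Spark Scala, assuming the date strings parse as yyyy-MM-dd; `addDaysSince` is a hypothetical helper name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_date, datediff, to_date}

// Appends a <name>_days_since long column for each date-string column.
def addDaysSince(xsec: DataFrame, dateCols: Seq[String]): DataFrame =
  dateCols.foldLeft(xsec) { (df, c) =>
    df.withColumn(
      s"${c}_days_since",
      datediff(current_date(), to_date(col(c))).cast("long"))
  }

// e.g. addDaysSince(xsec, Seq("start_date_oldest", "start_date_newest", "prof_creation_date"))
```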
Thanks!
Assignee
Comment 15 • 8 years ago
Sounds good, Brendan. Can you file bugs for each of these points?
Comment 16 • 8 years ago
Ok Ryan, four bugs filed (dependencies directly above). Thanks!
Assignee
Comment 17 • 8 years ago
Thanks, Brendan. I've made these bugs blockers of Bug 1312134, a new meta bug for the cross-sectional table. I'd like to close this bug once fixes for the current blockers are submitted.
Let me know if you'd like to discuss!
Comment 18 • 8 years ago
(as mentioned on IRC)
I just noticed an error I made in naming some of the columns: all of the "active_hours_*" columns should be "session_hours_*". There is a distinction between "activeTicks" and "sessionLength" in Unified Telemetry, and though we don't yet summarize activeTicks in the cross-sectional data, we will likely want to add it at some point, so we should change this nomenclature now. (A rename sketch follows below.)
My bad, so sorry about this!
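For illustration only, a minimal sketch of such a rename in Spark Scala; the helper name is hypothetical, and the actual fix is the batch-view PR referenced in the next comment:

```scala
import org.apache.spark.sql.DataFrame

// Renames every active_hours* column to session_hours*; illustrative only.
def renameActiveToSession(df: DataFrame): DataFrame =
  df.columns.filter(_.startsWith("active_hours")).foldLeft(df) { (d, c) =>
    d.withColumnRenamed(c, "session_hours" + c.stripPrefix("active_hours"))
  }
```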
Assignee
Comment 19 • 8 years ago
PR 143 fixes the column naming. Closing this bug and moving on to the bugs blocking Bug 1312134.
https://github.com/mozilla/telemetry-batch-view/pull/143
Assignee
Updated • 8 years ago
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Product: Cloud Services → Cloud Services Graveyard