Closed Bug 1286573 Opened 8 years ago Closed 8 years ago

Create cross-sectional parquet dataset

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: harter)

References

Details

User Story

The distribution viewer will be based on a cross-sectional dataset derived from our longitudinal dataset. Brendan created a prototype of the dataset [1] [2] some time ago. 

Ideally the dataset should be generated by a Scala job and scheduled with Airflow to run after the longitudinal dataset has been updated.

Note that not all columns will be used by the viewer.

[1] https://github.com/mozilla/bcolloran_spark/blob/master/CrossSectionalExploration/CrossSectionalExploration%20-%20single%20variable%20distributions%20-%20raw%20diagnostic%20data.ipynb
[2] s3n://net-mozaws-prod-us-west-2-pipeline-analysis/bcolloran/crossSectionalFrame_20160317.parquet
User Story: (updated)
Points: --- → 3
Priority: -- → P2
Assignee: nobody → rharter
Roberto, in the meeting the other day you asked if there were any array-valued columns; at the time I said no, but I've just spoken with Martin Lopatka (Rebecca's contractor, who I think you've met), and to support the work he's doing I'm going to add a column containing a list of all the addon names that have been observed for a profile, so there will be one column containing a list of strings. Hopefully this does not make things vastly more complicated.
(In reply to brendan c from comment #1)
> Roberto, in the meeting the other day you asked if there were any
> array-valued columns; at the time I said no, but I've just spoken with
> Martin Lopatka (Rebecca's contractor, who I think you've met), and to
> support the work he's doing I'm going to add a column containing a list of
> all the addon names that have been observed for a profile, so there will be
> one column containing a list of strings. Hopefully this does not make things
> vastly more complicated.

That's not an issue. That said, I spoke with him yesterday and we tentatively agreed that, for the time being, his work will use the longitudinal dataset, which already contains data about add-ons.
How often will this dataset be updated? I'm asking so I can plan how often the distribution viewer job, which will use this data as its source, needs to run.
I can't speak authoritatively since I'm just getting ramped up on this project, but I'd expect the cross-sectional dataset to be updated weekly with the longitudinal dataset. 

Poking Roberto to confirm.
Flags: needinfo?(rvitillo)
Priority: P2 → P1
(In reply to Ryan Harter [:harter] from comment #4)
> I can't speak authoritatively since I'm just getting ramped up on this
> project, but I'd expect the cross-sectional dataset to be updated weekly
> with the longitudinal dataset. 
> 
> Poking Roberto to confirm.

That is correct.
Flags: needinfo?(rvitillo)
Status: NEW → ASSIGNED
Depends on: 1298118
Depends on: 1298119
Depends on: 1298120
Depends on: 1298123
Depends on: 1298126
Depends on: 1298127
Points: 3 → ---
Depends on: 1301037
I have an example dataset, based on one shard of the longitudinal data (~0.002%), stored here: s3://telemetry-test-bucket/cross_sectional/single_shard. The dataset is still in alpha and does not yet contain most of the fields. For reference, I'm periodically checking the code into the batch-view repository here: https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala

Brendan, I want your thoughts on the following:

**Null Values**

In the example notebook, if a ping has at least one missing value, the entire ping is discarded. This is more difficult to do in the new dataset since it is derived from the longitudinal dataset, where rows correspond to client_ids rather than pings. I propose we do one of the following:

* Plan A: Exclude a client_id only when one or more fields is missing from every ping for that client_id (the linked examples below illustrate this). This means a longitudinal row is excluded iff it has NULL values for one or more direct members; the row is not excluded if an array member merely contains one or more NULL values (indirect members).
* Plan B: Exclude no pings

Here are some examples to illustrate: https://docs.google.com/spreadsheets/d/1s4b01nrHNqwktu_yup-lkaFd07o4O2CsBpW8_xyrG-I/edit#gid=0
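
For illustration, here's a minimal Spark sketch of what Plan A could look like against the longitudinal table (the direct-member list is a hypothetical placeholder, not the final column set):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Plan A sketch: drop a longitudinal row only when a *direct* (scalar) member
// is NULL; rows whose array members merely contain NULLs are kept.
// `directMembers` is a hypothetical placeholder for whichever columns the
// viewer actually depends on.
def filterPlanA(longitudinal: DataFrame, directMembers: Seq[String]): DataFrame =
  directMembers.foldLeft(longitudinal) { (df, name) =>
    df.filter(col(name).isNotNull)
  }

// e.g. filterPlanA(longitudinalDf, Seq("client_id", "os", "active_hours_total"))
```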

**Column name changes**
I propose the following column name changes:
* Change column names to snake_case
  * For example: clientId -> client_id, addonCountForeign_Avg -> addon_count_foreign_avg
  * I expect we'll want to read this table into Presto so it can be accessible in re:dash, and Presto is case insensitive.
* Remove day of week index from variable names
  * For example: activeHours_0Mon -> active_hours_mon
  * In Java/Scala, Sunday is the 0th day and Monday is 1. In python, Monday is 0. If we choose either convention, someone will be confused later. The three letter description is clear on its own, though this breaks lexicographic sorting.
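
For illustration, a minimal sketch of the kind of snake_case conversion this implies (not the actual CrossSectionalView code; the regex handling is an assumption):

```scala
// Convert camelCase names to snake_case so Presto (which is case-insensitive)
// sees distinct, predictable column names, e.g.
//   clientId              -> client_id
//   addonCountForeign_Avg -> addon_count_foreign_avg
def toSnakeCase(name: String): String =
  name
    .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // insert "_" at lower/upper boundaries
    .replaceAll("_+", "_")                    // collapse any doubled underscores
    .toLowerCase
```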
Flags: needinfo?(bcolloran)
> **Null Values**
> 
> In the example notebook, if a ping has at least one missing value the entire
> ping is discarded.

I don't think that's right. I do a lot to try to clean pings rather than discarding them (see the function cleanSubsessPingRows in https://github.com/mozilla/bcolloran_spark/blob/master/perClientDataPrep/cross-sectional%20data%20table%20%282016-07-14%29%20-%20deduped%20clientIds.ipynb ).

Can you indicate where you think I'm dropping pings? I've reviewed my notebook and I can't find the spot you're indicating.

I believe my approach was in fact "Plan B: Exclude no pings", but I'm concerned that you seemed to think otherwise. It's quite important that your results match mine, so if your implementation diverges anywhere we need to figure that out...

> **Column name changes**
> I propose the following column name changes:
> * Change column names to snake_case
>   * For example: clientId -> client_id, addonCountForeign_Avg ->
> addon_count_foreign_avg
>   * I expect we'll want to read this table into Presto so it can be
> accessible in re:dash, and Presto is case insensitive.

Yep this works for me.

> * Remove day of week index from variable names
>   * For example: activeHours_0Mon -> active_hours_mon
>   * In Java/Scala, Sunday is the 0th day and Monday is 1. In python, Monday
> is 0. If we choose either convention, someone will be confused later. The
> three letter description is clear on its own, though this breaks
> lexicographic sorting.

The lexicographic sorting turns out to be quite handy. Let's leave that one as-is.
Flags: needinfo?(bcolloran)
(In reply to brendan c from comment #7)
> > **Null Values**
> > 
> > In the example notebook, if a ping has at least one missing value the entire
> > ping is discarded.
> 
> I don't think that's right. I do a lot to try to clean pings rather than
> discarding them (see the function cleanSubsessPingRows in
> https://github.com/mozilla/bcolloran_spark/blob/master/perClientDataPrep/
> cross-sectional%20data%20table%20%282016-07-14%29%20-%20deduped%20clientIds.
> ipynb ).
> 
> Can you indicate where you think I'm dropping pings? I've reviewed my
> notebook and I can't find the spot you're indicating.
> 
> I believe my approach was in fact "Plan B: Exclude no pings", but I'm
> concerned that you seemed to think otherwise. It's quite important that your
> results match mine, so if your implementation diverges anywhere we need to
> figure that out...

Doh, I misread `getDictPathsAndRename` and thought this function was discarding '__missing' results. You're correct, thanks for the clarification.

> > **Column name changes**
> > I propose the following column name changes:
> > * Change column names to snake_case
> >   * For example: clientId -> client_id, addonCountForeign_Avg ->
> > addon_count_foreign_avg
> >   * I expect we'll want to read this table into Presto so it can be
> > accessible in re:dash, and Presto is case insensitive.
> 
> Yep this works for me.
> 
> > * Remove day of week index from variable names
> >   * For example: activeHours_0Mon -> active_hours_mon
> >   * In Java/Scala, Sunday is the 0th day and Monday is 1. In python, Monday
> > is 0. If we choose either convention, someone will be confused later. The
> > three letter description is clear on its own, though this breaks
> > lexicographic sorting.
> 
> The lexicographic sorting turns out to be quite handy. Let's leave that one
> as-is.

SG, I'll update the dataset.
I just uploaded a copy of the cross-sectional dataset [0], run over the complete longitudinal dataset. Soon this dataset will be created weekly via Airflow [1]. I've addressed the confusion around handling null values; the results should now be consistent. The code is still hosted here [2].

For the rest of the week I'll be filling out the column coverage. Next week I'll be on PTO.

[0] s3://telemetry-test-bucket/cross_sectional/alpha_v20160920/
[1] https://github.com/mozilla/telemetry-airflow/pull/44
[2] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala
Thanks Ryan, looking good. I'm curious what you think the ETA is for a dataset with coverage matching the prototype. Just trying to figure out how this fits into other work that is going on -- distribution viewer, other exploratory analysis, etc.
As discussed in today's stand up, my best point-estimate for parity with the prototype is 2016-10-14. I'll keep this thread updated.
I've uploaded an example dataset with column parity to Brendan's prototype at s3://telemetry-test-bucket/cross_sectional/alpha/20161012/. The PRs are still under review but should be closed out this week. This should unblock the rest of the team.

I have a couple of bugfixes to clean up, so these numbers may show some small shifts over the next couple of days, but the structure should be consistent.

A couple of notes / differences:
* The cross-sectional dataset is based on the longitudinal dataset, which only summarizes main pings. Since we do not have deletion or crash pings, the cross-sectional dataset does not include crashes_numPings, deletionPings, or mainPingsNumTotal.
* The longitudinal dataset guarantees the start date string format, so I did not include badStartDateStrings_numPings.
Depends on: 1310325
Depends on: 1310326
Depends on: 1310327
Glad to see the progress here! Just to have it on your radar, I think next steps will be:

1) Add a number of very simple columns to the table that are the result of simple mathematical operations on the existing columns. For example, we'll want things like "active_hours_per_active_day", which is just active_hours_total / days_active (a rough sketch follows below in this comment).
There are a bunch of things like this, and even though they are super simple operations, we need a number of these rate metrics for the distribution viewer, and we don't want to push that logic onto that team (and more generally, it's preferable to have these measures pre-populated in one place for all consumers of the cross-sectional table rather than making people roll their own).

2) Create a second table that looks at the crash and deletion pings, creates the relevant columns, and joins them into this table by clientId. (Crashes are much more important than deletions.)

3) Add a couple of rate columns for crashes, as in (1).

4) Add some cross-sectional metrics related to the new scalar engagement measures.

The fourth of those will be a bit down the road b/c I haven't even looked at the new engagement measures yet, but (2) can be started any time and I can give you the formulas for (1) and (3) as soon as you are ready.
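
For illustration, a rough Spark sketch of what (1) and (2) could look like (column and table names are hypothetical; the per-client crash summary doesn't exist yet):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// (1) A derived rate column computed from existing cross-sectional columns.
// (2) A hypothetical per-client crash summary joined in by client_id
//     (left join so clients without crash pings keep their row).
def addRatesAndCrashes(crossSectional: DataFrame, crashSummary: DataFrame): DataFrame =
  crossSectional
    .withColumn("active_hours_per_active_day",
      col("active_hours_total") / col("days_active"))
    .join(crashSummary, Seq("client_id"), "left")
```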

Thanks Ryan!
> 1) add a number of very simple columns to the table that are the result of
> simple mathematical operation on the existing columns.

Oh, I just noticed that we'll want to add a few that aren't pure math -- for each of these columns,
|-- start_date_oldest: string (nullable = true)
|-- start_date_newest: string (nullable = true)
|-- prof_creation_date: string (nullable = true)
we'll want a numeric measure of how many days it has been since that date (as of the time the cross sectional aggregation is run). So that would be:

|-- start_date_oldest_days_since: long (nullable = true)
|-- start_date_newest_days_since: long (nullable = true)
|-- prof_creation_date_days_since: long (nullable = true)
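
For illustration, a sketch of how these could be derived at aggregation time (this assumes the date strings parse with Spark's default to_date format, which may not hold for all of them):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_date, datediff, to_date}

// Days elapsed, as of the run date, since each stored date string.
// Assumes the strings are in a format to_date can parse (e.g. "yyyy-MM-dd").
def addDaysSince(crossSectional: DataFrame): DataFrame =
  Seq("start_date_oldest", "start_date_newest", "prof_creation_date")
    .foldLeft(crossSectional) { (df, name) =>
      df.withColumn(s"${name}_days_since",
        datediff(current_date(), to_date(col(name))))
    }
```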

Thanks!
Sounds good, Brendan. Can you file bugs for each of these points?
Depends on: 1311484
Depends on: 1311483
Depends on: 1311485
Depends on: 1311486
Ok Ryan, four bugs filed (dependencies directly above). Thanks!
No longer depends on: 1311483
No longer depends on: 1311484
No longer depends on: 1311485
No longer depends on: 1311486
No longer depends on: 1310327
Thanks, Brendan. I've moved these bugs to block Bug 1312134, a new meta bug for the cross-sectional table. I'd like to close this bug once fixes for the current blockers are submitted.

Let me know if you'd like to discuss!
(as mentioned on IRC)
I just noticed an error that I made in naming some of the columns -- all of the "active_hours_*" columns should be "session_hours_*" -- there is a distinction between "activeTicks" and "sessionLength" in UT, and though we don't have a summary of activeTicks in the x-sectional data yet, we will likely wish to add it at some point, so we should change this nomenclature now.

My bad, so sorry about this!
PR 143 fixes the column naming. Closing this bug and moving on to the bugs blocking Bug 1312134.

https://github.com/mozilla/telemetry-batch-view/pull/143
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard