Closed Bug 1286573 Opened 8 years ago Closed 8 years ago

Create cross-sectional parquet dataset

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: harter)

References

Details

User Story

The distribution viewer will be based on a cross-sectional dataset derived from our longitudinal dataset. Brendan created a prototype of the dataset [1] [2] some time ago. 

Ideally the dataset should be generated by a Scala job and scheduled with Airflow to run after the longitudinal dataset has been updated.

Note that not all columns will be used by the viewer.

[1] https://github.com/mozilla/bcolloran_spark/blob/master/CrossSectionalExploration/CrossSectionalExploration%20-%20single%20variable%20distributions%20-%20raw%20diagnostic%20data.ipynb
[2] s3n://net-mozaws-prod-us-west-2-pipeline-analysis/bcolloran/crossSectionalFrame_20160317.parquet
User Story: (updated)
Points: --- → 3
Priority: -- → P2
Assignee: nobody → rharter
Roberto, in the meeting the other day you asked if there were any array-valued columns; at the time I said no, but I've just spoken with Martin Lopatka (Rebecca's contractor, who I think you've met), and to support the work he's doing I'm going to add a column containing a list of all the addon names that have been observed for a profile, so there will be one column containing a list of strings. Hopefully this does not make things vastly more complicated.
(In reply to brendan c from comment #1)
> Roberto, in the meeting the other day you asked if there were any
> array-valued columns; at the time I said no, but I've just spoken with
> Martin Lopatka (Rebecca's contractor, who I think you've met), and to
> support the work he's doing I'm going to add a column containing a list of
> all the addon names that have been observed for a profile, so there will be
> one column containing a list of strings. Hopefully this does not make things
> vastly more complicated.

That's not an issue. That said, I spoke with him yesterday and we tentatively agreed that, for the time being, his work will use the longitudinal dataset, which already contains data about add-ons.
How often will this dataset be updated? I'm asking so I can plan how often the distribution viewer job, which will use this data as its source, needs to run.
I can't speak authoritatively since I'm just getting ramped up on this project, but I'd expect the cross-sectional dataset to be updated weekly with the longitudinal dataset. 

Poking Roberto to confirm.
Flags: needinfo?(rvitillo)
Priority: P2 → P1
(In reply to Ryan Harter [:harter] from comment #4)
> I can't speak authoritatively since I'm just getting ramped up on this
> project, but I'd expect the cross-sectional dataset to be updated weekly
> with the longitudinal dataset. 
> 
> Poking Roberto to confirm.

That is correct.
Flags: needinfo?(rvitillo)
Status: NEW → ASSIGNED
Depends on: 1298118
Depends on: 1298119
Depends on: 1298120
Depends on: 1298123
Depends on: 1298126
Depends on: 1298127
Points: 3 → ---
Depends on: 1301037
I have an example dataset, based on one shard of the longitudinal data (~0.002%), stored here: s3://telemetry-test-bucket/cross_sectional/single_shard. The dataset is still in alpha and does not yet contain most of the fields. For reference, I'm periodically checking the code into the batch-view repository here: https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala

Brendan, I want your thoughts on the following:

**Null Values**

In the example notebook, if a ping has at least one missing value, the entire ping is discarded. This is more difficult to do in the new dataset since it is derived from the longitudinal dataset, where rows correspond to client_ids rather than pings. I propose we do one of the following:

* Plan A: Exclude a client_id only when one or more fields is missing from every ping for that client_id (the linked examples below illustrate this). This means a longitudinal row is excluded iff it has NULL values for one or more direct members; the row is not excluded if an array member merely contains one or more NULL values (indirect members).
* Plan B: Exclude no pings

Here are some examples to illustrate: https://docs.google.com/spreadsheets/d/1s4b01nrHNqwktu_yup-lkaFd07o4O2CsBpW8_xyrG-I/edit#gid=0
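
For illustration, here's a minimal Spark sketch of what Plan A could look like against the longitudinal table (the direct-member list is a hypothetical placeholder, not the final column set):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Plan A sketch: drop a longitudinal row only when a *direct* (scalar) member
// is NULL; rows whose array members merely contain NULLs are kept.
// `directMembers` is a hypothetical placeholder for whichever columns the
// viewer actually depends on.
def filterPlanA(longitudinal: DataFrame, directMembers: Seq[String]): DataFrame =
  directMembers.foldLeft(longitudinal) { (df, name) =>
    df.filter(col(name).isNotNull)
  }

// e.g. filterPlanA(longitudinalDf, Seq("client_id", "os", "active_hours_total"))
```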

**Column name changes**
I propose the following column name changes:
* Change column names to snake_case
  * For example: clientId -> client_id, addonCountForeign_Avg -> addon_count_foreign_avg
  * I expect we'll want to read this table into Presto so it can be accessible in re:dash, and Presto is case insensitive.
* Remove day of week index from variable names
  * For example: activeHours_0Mon -> active_hours_mon
  * In Java/Scala, Sunday is the 0th day and Monday is 1. In python, Monday is 0. If we choose either convention, someone will be confused later. The three letter description is clear on its own, though this breaks lexicographic sorting.
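
For illustration, a minimal sketch of the kind of snake_case conversion this implies (not the actual CrossSectionalView code; the regex handling is an assumption):

```scala
// Convert camelCase names to snake_case so Presto (which is case-insensitive)
// sees distinct, predictable column names, e.g.
//   clientId              -> client_id
//   addonCountForeign_Avg -> addon_count_foreign_avg
def toSnakeCase(name: String): String =
  name
    .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // insert "_" at lower/upper boundaries
    .replaceAll("_+", "_")                    // collapse any doubled underscores
    .toLowerCase
```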
Flags: needinfo?(bcolloran)
> **Null Values**
> 
> In the example notebook, if a ping has at least one missing value the entire
> ping is discarded.

I don't think that's right. I do a lot to try to clean pings rather than discarding them (see the function cleanSubsessPingRows in https://github.com/mozilla/bcolloran_spark/blob/master/perClientDataPrep/cross-sectional%20data%20table%20%282016-07-14%29%20-%20deduped%20clientIds.ipynb ).

Can you indicate where you think I'm dropping pings? I've reviewed my notebook and I can't find the spot you're indicating.

I believe my approach was in fact "Plan B: Exclude no pings", but I'm concerned that you seemed to think otherwise. It's quite important that your results match mine, so if your implementation diverges anywhere we need to figure that out...

> **Column name changes**
> I propose the following column name changes:
> * Change column names to snake_case
>   * For example: clientId -> client_id, addonCountForeign_Avg ->
> addon_count_foreign_avg
>   * I expect we'll want to read this table into Presto so it can be
> accessible in re:dash, and Presto is case insensitive.

Yep this works for me.

> * Remove day of week index from variable names
>   * For example: activeHours_0Mon -> active_hours_mon
>   * In Java/Scala, Sunday is the 0th day and Monday is 1. In python, Monday
> is 0. If we choose either convention, someone will be confused later. The
> three letter description is clear on its own, though this breaks
> lexicographic sorting.

The lexicographic sorting turns out to be quite handy. Let's leave that one as-is.
Flags: needinfo?(bcolloran)
(In reply to brendan c from comment #7)
> > **Null Values**
> > 
> > In the example notebook, if a ping has at least one missing value the entire
> > ping is discarded.
> 
> I don't think that's right. I do a lot to try to clean pings rather than
> discarding them (see the function cleanSubsessPingRows in
> https://github.com/mozilla/bcolloran_spark/blob/master/perClientDataPrep/
> cross-sectional%20data%20table%20%282016-07-14%29%20-%20deduped%20clientIds.
> ipynb ).
> 
> Can you indicate where you think I'm dropping pings? I've reviewed my
> notebook and I can't find the spot you're indicating.
> 
> I believe my approach was in fact "Plan B: Exclude no pings", but I'm
> concerned that you seemed to think otherwise. It's quite important that your
> results match mine, so if your implementation diverges anywhere we need to
> figure that out...

Doh, I misread `getDictPathsAndRename` and thought this function was discarding '__missing' results. You're correct, thanks for the clarification.

> > **Column name changes**
> > I propose the following column name changes:
> > * Change column names to snake_case
> >   * For example: clientId -> client_id, addonCountForeign_Avg ->
> > addon_count_foreign_avg
> >   * I expect we'll want to read this table into Presto so it can be
> > accessible in re:dash, and Presto is case insensitive.
> 
> Yep this works for me.
> 
> > * Remove day of week index from variable names
> >   * For example: activeHours_0Mon -> active_hours_mon
> >   * In Java/Scala, Sunday is the 0th day and Monday is 1. In python, Monday
> > is 0. If we choose either convention, someone will be confused later. The
> > three letter description is clear on its own, though this breaks
> > lexicographic sorting.
> 
> The lexicographic sorting turns out to be quite handy. Let's leave that one
> as-is.

SG, I'll update the dataset.
I just uploaded a copy of the cross-sectional dataset [0], run over the complete longitudinal dataset. Soon this dataset will be created weekly via Airflow [1]. I've addressed the confusion around handling null values; the results should now be consistent. The code is still hosted here [2].

For the rest of the week I'll be filling out the column coverage. Next week I'll be on PTO.

[0] s3://telemetry-test-bucket/cross_sectional/alpha_v20160920/
[1] https://github.com/mozilla/telemetry-airflow/pull/44
[2] https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala
Thanks Ryan, looking good. I'm curious what you think the ETA is for a dataset with coverage matching the prototype. Just trying to figure out how this fits into other work that is going on -- distribution viewer, other exploratory analysis, etc.
As discussed in today's stand up, my best point-estimate for parity with the prototype is 2016-10-14. I'll keep this thread updated.
I've uploaded an example dataset with column parity to Brendan's prototype at s3://telemetry-test-bucket/cross_sectional/alpha/20161012/. The PRs are still under review but should be closed out this week. This should unblock the rest of the team.

I have a couple of bugfixes to clean up, so these numbers may show some small shifts over the next couple of days, but the structure should be consistent.

A couple of notes / differences:
* The cross-sectional dataset is based on the longitudinal dataset, which only summarizes main pings. Since we do not have deletion or crash pings, the cross-sectional dataset does not include crashes_numPings, deletionPings, or mainPingsNumTotal.
* The longitudinal dataset guarantees the start date string format, so I did not include badStartDateStrings_numPings.
Depends on: 1310325
Depends on: 1310326
Depends on: 1310327
Glad to see the progress here! Just to have it on your radar, I think next steps will be:

1) Add a number of very simple columns to the table that are the result of simple mathematical operations on the existing columns. For example, we'll want things like "active_hours_per_active_day", which is just active_hours_total / days_active (a rough sketch follows below in this comment).
There are a bunch of things like this, and even though they are super simple operations, we need a number of these rate metrics for the distribution viewer, and we don't want to push that logic onto that team (and more generally, it's preferable to have these measures pre-populated in one place for all consumers of the cross-sectional table rather than making people roll their own).

2) Create a second table that looks at the crash and deletion pings, creates the relevant columns, and joins them into this table by clientId. (Crashes are much more important than deletions.)

3) Add a couple of rate columns for crashes, as in (1).

4) Add some cross-sectional metrics related to the new scalar engagement measures.

The fourth of those will be a bit down the road b/c I haven't even looked at the new engagement measures yet, but (2) can be started any time and I can give you the formulas for (1) and (3) as soon as you are ready.
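
For illustration, a rough Spark sketch of what (1) and (2) could look like (column and table names are hypothetical; the per-client crash summary doesn't exist yet):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// (1) A derived rate column computed from existing cross-sectional columns.
// (2) A hypothetical per-client crash summary joined in by client_id
//     (left join so clients without crash pings keep their row).
def addRatesAndCrashes(crossSectional: DataFrame, crashSummary: DataFrame): DataFrame =
  crossSectional
    .withColumn("active_hours_per_active_day",
      col("active_hours_total") / col("days_active"))
    .join(crashSummary, Seq("client_id"), "left")
```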

Thanks Ryan!
> 1) add a number of very simple columns to the table that are the result of
> simple mathematical operation on the existing columns.

Oh, I just noticed that we'll want to add a few that aren't pure math -- for each of these columns,
|-- start_date_oldest: string (nullable = true)
|-- start_date_newest: string (nullable = true)
|-- prof_creation_date: string (nullable = true)
we'll want a numeric measure of how many days it has been since that date (as of the time the cross sectional aggregation is run). So that would be:

|-- start_date_oldest_days_since: long (nullable = true)
|-- start_date_newest_days_since: long (nullable = true)
|-- prof_creation_date_days_since: long (nullable = true)
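
For illustration, a sketch of how these could be derived at aggregation time (this assumes the date strings parse with Spark's default to_date format, which may not hold for all of them):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_date, datediff, to_date}

// Days elapsed, as of the run date, since each stored date string.
// Assumes the strings are in a format to_date can parse (e.g. "yyyy-MM-dd").
def addDaysSince(crossSectional: DataFrame): DataFrame =
  Seq("start_date_oldest", "start_date_newest", "prof_creation_date")
    .foldLeft(crossSectional) { (df, name) =>
      df.withColumn(s"${name}_days_since",
        datediff(current_date(), to_date(col(name))))
    }
```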

Thanks!
Sounds good, Brendan. Can you file bugs for each of these points?
Depends on: 1311484
Depends on: 1311483
Depends on: 1311485
Depends on: 1311486
Ok Ryan, four bugs filed (dependencies directly above). Thanks!
No longer depends on: 1311483
No longer depends on: 1311484
No longer depends on: 1311485
No longer depends on: 1311486
No longer depends on: 1310327
Thanks, Brendan. I've moved these bugs to block Bug 1312134, a new meta bug for the cross-sectional table. I'd like to close this bug once fixes for the current blockers are submitted.

Let me know if you'd like to discuss!
(as mentioned on IRC)
I just noticed an error that I made in naming some of the columns -- all of the "active_hours_*" columns should be "session_hours_*" -- there is a distinction between "activeTicks" and "sessionLength" in UT, and though we don't have a summary of activeTicks in the x-sectional data yet, we will likely wish to add it at some point, so we should change this nomenclature now.

My bad, so sorry about this!
PR 143 fixes the column naming. Closing this bug and moving on to the bugs blocking Bug 1312134.

https://github.com/mozilla/telemetry-batch-view/pull/143
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard