Move app_update_out_of_date job out of Longitudinal
Categories
(Data Platform and Tools :: General, task, P1)
Tracking
(Not tracked)
People
(Reporter: akomar, Assigned: akomar)
References
Details
Attachments
(3 files)
We can generate a dataset matching the subset of longitudinal used in this job via SQL and Spark transformations.
I have a PoC here: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/163676/command/168444 - it needs some refinement around histograms and simple measures, but overall it looks straightforward.
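To illustrate the shape of that transformation: the job needs one row per client with measures collected into submission-date-ordered arrays, which is the subset of longitudinal it consumes. This is a minimal pure-Python sketch of that grouping (the real job uses SQL/Spark; the field names like `update_check_code_notify` are illustrative, not the exact schema):

```python
from collections import defaultdict

def build_longitudinal_subset(pings):
    """Group per-submission "main" pings into one row per client, with
    measures collected into newest-first arrays - the same shape the
    longitudinal dataset exposes. Field names are hypothetical."""
    by_client = defaultdict(list)
    for ping in pings:
        by_client[ping["client_id"]].append(ping)

    rows = []
    for client_id, client_pings in by_client.items():
        # Longitudinal orders each client's history newest-first.
        client_pings.sort(key=lambda p: p["submission_date"], reverse=True)
        rows.append({
            "client_id": client_id,
            "submission_date": [p["submission_date"] for p in client_pings],
            "update_check_code_notify": [p["update_check_code_notify"]
                                         for p in client_pings],
        })
    return rows

# Toy input: two pings for client "a", one for client "b".
pings = [
    {"client_id": "a", "submission_date": "20190901", "update_check_code_notify": 0},
    {"client_id": "a", "submission_date": "20190902", "update_check_code_notify": 12},
    {"client_id": "b", "submission_date": "20190901", "update_check_code_notify": 0},
]
rows = build_longitudinal_subset(pings)
```

In Spark the same shape falls out of a `GROUP BY client_id` with the measures aggregated into sorted arrays.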
Comment 1•6 years ago
:rstrong, :frank - let me know if you see any disadvantages of this solution.
Comment 2•6 years ago
I doubt I have the expertise to be able to properly weigh in on this but I think this would be great.
Sometimes we query old longitudinal datasets when investigating. Would this still be possible with this approach? If not, it wouldn't be a showstopper.
Comment 3•6 years ago
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #2)
> I doubt I have the expertise to be able to properly weigh in on this but I think this would be great.
> Sometimes we query old longitudinal datasets when investigating. Would this still be possible with this approach? If not, it wouldn't be a showstopper.
No worries, I can handle this but probably will need your help with validation at some later stage.
You'll be able to query the intermediate table. It will be similar, but not identical, to longitudinal. That should be perfectly fine for debugging.
Comment 4•6 years ago
:akomar, that plan sounds great to me, and seems to be the most straightforward. Thanks for putting this together.
:rstrong, except for (as Arkadiusz mentioned) debugging, we won't recommend you use that interim-longitudinal for analysis. Can you give a bit more info about your use cases so we can direct you to an alternative? It may be a case of helping you all get familiar with a different dataset that has the info you need.
Comment 5•6 years ago
Hi Frank, I've done this in the past for one-off queries when I needed to dive deeper into the data behind the dashboard. I've used other data sources besides longitudinal for this at times, so it isn't a showstopper. I was just curious whether I would always need to use a different data source. Thanks!
Comment 6•6 years ago
Comment 7•6 years ago
I have converted the original notebook to a Python script and added shim code to read data from the main ping BigQuery table [1].
We have to wait until BigQuery is backfilled before proceeding further, but the first run's result [2] looks fine (judging just by the proportions between the various measures, as the sample is still much smaller than the real longitudinal dataset).
[1] https://github.com/mozilla/telemetry-airflow/pull/613
[2] https://moz-update-orphaning-test-output.storage.googleapis.com/20190915.json
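Since the sample is much smaller than the full dataset, comparing raw counts is meaningless; comparing proportions of the total works. A minimal sketch of that sanity check (the JSON layout and measure names here are hypothetical, not the job's real output format):

```python
import json

def measure_proportions(results):
    """Turn raw per-measure counts into proportions of the total, so runs
    with very different sample sizes can still be compared. The input
    layout (flat measure -> count mapping) is an assumption."""
    total = sum(results.values())
    return {name: count / total for name, count in results.items()}

# Toy stand-in for a run's output file.
sample = json.loads('{"up_to_date": 900, "out_of_date": 100}')
props = measure_proportions(sample)
```

Two runs whose proportions roughly agree suggest the transformation is sound even before the backfill completes.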
Comment 8•6 years ago
I ran the script on the 1% backfill table; the results match what we get on AWS.
AWS run: s3://telemetry-public-analysis-2/app-update/data/out-of-date/20190901.json
GCP run: gs://moz-update-orphaning-test-output/20190901.json
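For validating the GCP run against the AWS one, an exact byte-for-byte match isn't required; per-measure counts just need to agree within a small tolerance. A sketch of such a comparison (the keys and flat-counts structure are hypothetical, not the actual file format):

```python
import json
import math

def outputs_match(a, b, rel_tol=0.01):
    """Compare two runs' per-measure counts, allowing a small relative
    difference (1% by default). Assumes both outputs are flat
    measure -> count mappings, which is an illustrative simplification."""
    if a.keys() != b.keys():
        return False
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol) for k in a)

# Toy stand-ins for the two output files listed above.
aws = json.loads('{"out_of_date": 1000, "up_to_date": 99000}')
gcp = json.loads('{"out_of_date": 1003, "up_to_date": 98990}')
```

Small count drift between the two platforms is expected (different sampling and ingestion timing), which is why a relative tolerance beats strict equality here.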
Comment 9•6 years ago
:rstrong - I can't tag you on GitHub, but FYI we're close to landing the job linked in the PR. I want to do a test run this week, and hopefully by next week we'll be able to switch it to write to the prod S3 location.
Comment 11•6 years ago
Comment 12•6 years ago