Closed Bug 1577460 Opened 6 years ago Closed 5 years ago

Move app_update_out_of_date job out of Longitudinal

Categories

(Data Platform and Tools :: General, task, P1)

task
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: akomar, Assigned: akomar)

References

Details

Attachments

(3 files)

We can generate a dataset matching the subset of longitudinal used in this job via SQL and Spark transformations.
I have a PoC here: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/163676/command/168444 - it needs some refinements around histograms and simple measures, but overall it looks straightforward.
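For illustration only, here is a minimal sketch of the reshaping idea in plain Python rather than Spark: flat main-ping rows are grouped into one record per client, with each selected field turned into a list. This is not the PoC notebook's code. The field names (`client_id`, `submission_date`, `version`), the helper `to_longitudinal`, and the newest-first ordering are all assumptions made for the example.

```python
from collections import defaultdict


def to_longitudinal(pings, fields):
    """Group flat ping rows into one record per client (hypothetical helper).

    Each requested field becomes a list with one entry per ping.
    Newest-first ordering is an assumption for this sketch.
    """
    by_client = defaultdict(list)
    for ping in pings:
        by_client[ping["client_id"]].append(ping)

    rows = []
    for client_id, client_pings in sorted(by_client.items()):
        # YYYYMMDD strings sort correctly as plain text.
        client_pings.sort(key=lambda p: p["submission_date"], reverse=True)
        row = {"client_id": client_id}
        for field in fields:
            row[field] = [p.get(field) for p in client_pings]
        rows.append(row)
    return rows


# Made-up sample rows; real main pings carry many more fields.
sample_pings = [
    {"client_id": "a", "submission_date": "20190901", "version": "68.0"},
    {"client_id": "b", "submission_date": "20190902", "version": "69.0"},
    {"client_id": "a", "submission_date": "20190903", "version": "69.0"},
]

longitudinal_rows = to_longitudinal(sample_pings, ["submission_date", "version"])
```

Selecting only the fields the job reads keeps the intermediate table small, which is the main point of replacing the full longitudinal dataset for this job.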

:rstrong, :frank - let me know if you see any disadvantages of this solution.

Flags: needinfo?(robert.strong.bugs)
Flags: needinfo?(fbertsch)
Blocks: 1572033
Points: --- → 2

I doubt I have the expertise to be able to properly weigh in on this but I think this would be great.

Sometimes we query old longitudinal datasets when investigating. Would this still be possible with this approach? If not, it wouldn't be a showstopper.

Flags: needinfo?(robert.strong.bugs)

(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #2)

No worries, I can handle this but probably will need your help with validation at some later stage.

You'll be able to query the intermediate table. It will be similar, but not identical, to longitudinal. That should be perfectly fine for debugging.

:akomar, that plan sounds great to me, and seems to be the most straightforward. Thanks for putting this together.

:rstrong, except for debugging (as Arkadiusz mentioned), we won't recommend that you use that interim longitudinal table for analysis. Can you give a bit more info about your use cases so we can direct you to an alternative? It may be a case of helping you all get familiar with a different dataset that has the info you need.

Flags: needinfo?(fbertsch) → needinfo?(robert.strong.bugs)

Hi Frank, I've done this in the past for one-off queries when I need to dive deeper into the data from the dashboard. I've used other datasources besides longitudinal for this at times in the past so it isn't a showstopper. I was just curious whether I would always need to use a different datasource. Thanks!

Flags: needinfo?(robert.strong.bugs)
Status: NEW → ASSIGNED

I have converted the original notebook to a Python script and added shim code to read data from the main ping BigQuery table [1].
We have to wait until BigQuery is backfilled before proceeding further, but the first run's result [2] seems fine (judging only by the proportions between the various measures, as the sample is still much smaller than the real longitudinal).

[1] https://github.com/mozilla/telemetry-airflow/pull/613
[2] https://moz-update-orphaning-test-output.storage.googleapis.com/20190915.json

Depends on: 1577833
Priority: -- → P1

I ran the script on the 1% backfill table; the results match what we get on AWS.
AWS run: s3://telemetry-public-analysis-2/app-update/data/out-of-date/20190901.json
GCP run: gs://moz-update-orphaning-test-output/20190901.json
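Because the backfill table is a sample, a check like this has to compare shares rather than raw counts. A minimal sketch of such a comparison follows. The measure names, the flat count layout, the helper names, and the 1% tolerance are all assumptions for illustration; the real JSON output is structured differently and is not reproduced here.

```python
def proportions(counts):
    """Convert raw counts into shares of the total."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


def compare_runs(reference, candidate, tolerance=0.01):
    """Return the measures whose share differs by more than `tolerance`."""
    ref_share = proportions(reference)
    cand_share = proportions(candidate)
    mismatches = {}
    for key in sorted(set(ref_share) | set(cand_share)):
        ref = ref_share.get(key, 0.0)
        cand = cand_share.get(key, 0.0)
        if abs(ref - cand) > tolerance:
            mismatches[key] = (ref, cand)
    return mismatches


# Made-up numbers: a full run and a 1% sample with the same proportions.
aws_counts = {"up_to_date": 9000, "out_of_date": 1000}
gcp_counts = {"up_to_date": 90, "out_of_date": 10}
skewed_counts = {"up_to_date": 80, "out_of_date": 20}

matching = compare_runs(aws_counts, gcp_counts)
differing = compare_runs(aws_counts, skewed_counts)
```

An empty result means every measure's share agrees within the tolerance, even though the absolute counts differ by two orders of magnitude.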

:rstrong - I can't tag you on GitHub, but FYI we're close to landing the job linked in the PR. I want to do a test run this week, and hopefully by next week we'll be able to switch it to write to the prod S3 location.

Flags: needinfo?(robert.strong.bugs)

Thanks for the update

Flags: needinfo?(robert.strong.bugs)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Datasets: General → General