Closed Bug 1592447 Opened 5 years ago Closed 5 years ago

Add `addons_daily` to GCP / BigQuery

Categories

(Data Platform and Tools :: General, task, P2)

task
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bmiroglio, Assigned: bmiroglio)

References

Details

The addons_daily dataset is currently only accessible via Athena. I'd like it to be migrated to GCP and made accessible via BigQuery.

See Also: → 1596190

Hey Ben: the ETL for this dataset is currently running on AWS, and we'll need to move it to GCP. Have you thought about this by any chance?

I took a look at the code, and it looks like it uses the main_summary, search_daily, and events datasets, as well as the Dataset API (which is going away). I think this shouldn't be too hard to express as a SQL query.
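(Very roughly, the derived-table side might look something like the sketch below; the project, table, and column names, and the date filter, are just assumptions for illustration, not what the job actually does.)

from google.cloud import bigquery

# Sketch only: a per-addon aggregation over main_summary in BigQuery.
# The table/column names and the hard-coded date are assumptions, not the
# actual addons_daily logic.
ADDON_CLIENTS_SQL = """
SELECT
  addon.addon_id,
  COUNT(DISTINCT client_id) AS client_count
FROM
  `moz-fx-data-shared-prod.telemetry.main_summary`
CROSS JOIN
  UNNEST(active_addons) AS addon
WHERE
  submission_date = DATE '2019-11-01'
GROUP BY
  addon.addon_id
"""

for row in bigquery.Client().query(ADDON_CLIENTS_SQL).result():
    print(row.addon_id, row.client_count)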

Flags: needinfo?(bmiroglio)

I was hoping to keep as much of this as-is as possible, with the caveat that we need a SQL query to replace the raw ping queries that use the Dataset API. The job uses the PySpark API for most other things; is that something we can keep, or do you recommend migrating all of it to pure SQL?

Flags: needinfo?(bmiroglio) → needinfo?(akomarzewski)

Yes, we can keep it if that's easier. We've already migrated a few jobs this way.

I'd pull out only the relevant histograms aggregated by addon from BigQuery (so we don't need to do that in Spark).
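Something roughly like the sketch below; the histogram name, schema path, and date filter are placeholders for illustration, not the actual addons_daily logic:

from google.cloud import bigquery

# Sketch only: aggregate one addon-keyed histogram directly in BigQuery,
# so Spark never has to touch the raw pings.
KEYED_HIST_SQL = """
SELECT
  kh.key AS addon_id,
  SUM(CAST(JSON_EXTRACT_SCALAR(kh.value, '$.sum') AS INT64)) AS histogram_sum
FROM
  `moz-fx-data-shared-prod.telemetry.main`
CROSS JOIN
  UNNEST(payload.keyed_histograms.webext_background_page_load_ms_by_addonid) AS kh
WHERE
  DATE(submission_timestamp) = DATE '2019-11-01'
GROUP BY
  addon_id
"""

keyed_hist_rows = bigquery.Client().query(KEYED_HIST_SQL).result()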

Flags: needinfo?(akomarzewski)

Per a Slack conversation with :bmiroglio, if this isn't migrated before the Dataset API is shut down, we will disable it until it is migrated, because "the table is most useful for the search related fields but it isn't fueling a specific highly visible dashboard afaik".

I've prototyped the raw ping portion in BigQuery [0]. Assuming we keep the rest of the ETL in Spark, this would need to be joined in separately once that job (in Airflow?) runs. I'm unsure how to proceed, so I thought I'd start here.

[0] https://console.cloud.google.com/bigquery?sq=630180991450:f34d026c1642458eac8e2f603127b196
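For illustration, a rough sketch of how that join could look on the Spark side, assuming the prototype query is materialized to a table; the table name, join key, and `spark_side_aggregates` (standing in for whatever DataFrame the existing Spark ETL produces) are all hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the materialized prototype output via the spark-bigquery connector
# (the table name here is hypothetical).
keyed_hist = (
    spark.read.format("bigquery")
    .option("table", "some-project.analysis.addons_daily_keyed_hist")
    .load()
)

# Joining on addon_id is an assumption about the schema on both sides.
addons_daily = spark_side_aggregates.join(keyed_hist, on="addon_id", how="left")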

ni-ing :relud for visibility. What do you propose as the next steps to get this migrated?

Flags: needinfo?(dthorn)

I would suggest modifying load_keyed_hist to read the results of your query next. Something like:

from google.cloud import bigquery

...

def load_keyed_hist(spark):
    bq = bigquery.Client()
    # Run the keyed-histogram query; with no explicit destination, BigQuery
    # writes the results to an anonymous temporary table.
    query_job = bq.query("""
    ...
    """)
    # Block until the query finishes so the destination table exists.
    query_job.result()
    # Read that destination table back into Spark via the spark-bigquery connector.
    return (
        spark.read.format("bigquery")
        .option("table", bq.get_table(query_job.destination).full_table_id)
        .load()
    )
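
The call site in the existing job could then stay a one-liner; the schema check below is just a hypothetical sanity check:

keyed_hist_df = load_keyed_hist(spark)
keyed_hist_df.printSchema()  # confirm what comes back from BigQuery before joining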
Flags: needinfo?(dthorn)
Assignee: nobody → bmiroglio
Points: --- → 2
Priority: -- → P2
Component: Datasets: General → General