Add `addons_daily` to GCP / BigQuery
Categories
(Data Platform and Tools :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: bmiroglio, Assigned: bmiroglio)
References
Details
The addons_daily dataset is currently only accessible via Athena. I'd like it to be migrated to GCP and made accessible via BigQuery.
Comment 2•5 years ago
Hey Ben: the ETL for this dataset is currently running on AWS, so we'll need to move it to GCP. Have you thought about this by any chance?
I took a look at the code and it looks like it uses the main_summary, search_daily, and events datasets as well as the Dataset API (which is going away). I think this shouldn't be too hard to express as a SQL query.
Assignee
Comment 3•5 years ago
I was hoping to keep as much of this as-is as possible, with the caveat that we need a SQL query to replace the raw ping queries that use the Dataset API. The job uses the PySpark API for most other things; is that something we can keep, or do you recommend migrating it all to pure SQL?
Comment 4•5 years ago
Yes, we can keep it if that's easier. We've already migrated a few jobs this way.
I'd pull only the relevant histograms, aggregated by addon, out of BigQuery (so we don't need to do that in Spark).
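A hedged sketch of that idea follows. The table path, the keyed_histograms field, and the assumption that histogram values are JSON strings with a "sum" field are illustrative guesses about the live ping tables, not the actual addons_daily query:

# Minimal sketch of aggregating one addon-keyed histogram per addon directly
# in BigQuery; names and layout are assumptions for illustration only. The
# real query would cover every histogram the job needs.
from google.cloud import bigquery

bq = bigquery.Client()
job = bq.query("""
    SELECT
      entry.key AS addon_id,
      SUM(CAST(JSON_EXTRACT_SCALAR(entry.value, '$.sum') AS INT64)) AS total,
      COUNT(DISTINCT client_id) AS clients
    FROM
      `moz-fx-data-shared-prod.telemetry.main`,
      UNNEST(payload.keyed_histograms.webext_background_page_load_ms) AS entry
    WHERE
      DATE(submission_timestamp) = '2019-12-01'
    GROUP BY
      addon_id
""")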
Comment 5•5 years ago
Per a Slack conversation with :bmiroglio, if this isn't migrated before the Dataset API is shut down, then we will disable it until it is migrated, because "the table is most useful for the search related fields but it isn't fueling a specific highly visible dashboard afaik".
Assignee
Comment 6•5 years ago
I've prototyped the raw ping portion in BigQuery [0]. Assuming we keep the rest of the ETL in Spark, this would need to be joined separately once that job (in Airflow?) runs. I'm unsure how to proceed, so I thought I'd start here.
[0] https://console.cloud.google.com/bigquery?sq=630180991450:f34d026c1642458eac8e2f603127b196
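For reference, a sketch of how the BigQuery piece and the existing Spark output might be stitched together after both run. The function names and the join key below are placeholders, not the actual job code:

# Hypothetical glue code: build_addons_daily and load_raw_ping_aggregates are
# placeholder names for the existing PySpark ETL and a reader for the
# BigQuery prototype's output; the addon_id join key is an assumption.
def build_final_addons_daily(spark, submission_date):
    spark_part = build_addons_daily(spark, submission_date)
    bq_part = load_raw_ping_aggregates(spark, submission_date)
    return spark_part.join(bq_part, on="addon_id", how="left")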
Assignee
Comment 7•5 years ago
ni-ing relud for visibility. What do you propose as the next steps to get this migrated?
Comment 8•5 years ago
I would suggest modifying load_keyed_hist to read from your query next, something like:
from google.cloud import bigquery
...
def load_keyed_hist(spark):
    # Run the query in BigQuery and wait for it to finish; the results land
    # in the job's (anonymous) destination table.
    bq = bigquery.Client()
    query_job = bq.query("""
    ...
    """)
    query_job.result()
    # Read the destination table back into Spark via the spark-bigquery
    # connector; the trailing .load() is needed to return a DataFrame
    # rather than a DataFrameReader.
    return (
        spark.read.format("bigquery")
        .option("table", bq.get_table(query_job.destination).full_table_id)
        .load()
    )
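A possible call site, as a sketch; the variable names and the join are illustrative, and it assumes the cluster has the spark-bigquery connector available (e.g. Google's spark-bigquery-with-dependencies package), which is a deployment detail not specified here:

# Hypothetical usage inside the existing ETL; names are placeholders.
keyed_hist = load_keyed_hist(spark)
addons_daily = addons_daily.join(keyed_hist, on="addon_id", how="left")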
Updated•5 years ago
Comment 9•5 years ago
This was resolved in https://github.com/mozilla/telemetry-airflow/commit/857f0da825a7704ea7819fa6823f38fc48aa2113#diff-fba63aad158073b9c8bcb5c37ea33c11
Updated•3 years ago