Closed Bug 1892284 Opened 1 year ago Closed 1 year ago

firefox_desktop_derived.events_stream_v1 query is very expensive

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benwu, Assigned: benwu)

References

Details

(Whiteboard: [dataplatform])

Attachments

(1 file)

The new events_stream_v1 query for Firefox Desktop (airflow task) is a very heavy query and has caused a couple of ETL delays in the past couple of weeks when it ran concurrently with copy_deduplicate due to a backfill or a run for a past execution date.

The slot usage is particularly high for the query I looked at (42 days of slot time with 900 GB scanned). This may be due to all the JSON and string manipulation, and there may be some optimizations to be made. I think it's worth putting some time into seeing whether this can be improved to make the ETL more scalable. It's also possible that nothing can be done, and that would be a valid outcome.

In the data platform meeting, :whd suggested explicitly running this query with on-demand pricing in another project, which would reduce cost and slot contention.

Via https://github.com/mozilla-services/cloudops-infra/pull/5568/files#diff-9dc73d59e695135b3280d689a48895a532ff9fcdf6614985c8a954ffc72f8a84R141 I've set moz-fx-data-backfill-2 to explicitly use the on-demand billing model. workgroup:dataops-managed/airflow should already have jobUser on that project, so you should be able to run the query there and observe the cost/performance characteristics.

I ran the query for yesterday in backfill-2 and it took 1h07m with 44 days of slot time (avg ~950 slots), compared to the ETL run, which took 2h58m with 39 days 13 hr of slot time (~316 avg slots). Data scanned was 887 GB. So this works as expected, should be inexpensive, and would free up slots in the main reservation for three hours per night.

We need to add support for setting the billing project for queries through bqetl in order to get the ETL to use on-demand slots. From a cursory look, this isn't trivial for any query that reads from more than one dataset. I'll see if I can start on this this week.
Related open tickets: https://mozilla-hub.atlassian.net/browse/DENG-845 https://github.com/mozilla/bigquery-etl/issues/3515
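
To make the multi-dataset wrinkle concrete: a query job submitted from moz-fx-data-backfill-2 resolves unqualified dataset references against that project rather than moz-fx-data-shared-prod, so every dataset the query reads needs its project spelled out. A minimal sketch (the source table here is only an example, not necessarily what events_stream_v1 reads):

-- Billed to the on-demand project, but reading from shared-prod; without the
-- project prefix the dataset would be looked up in moz-fx-data-backfill-2.
SELECT
  COUNT(*) AS row_count
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.events_v1`
WHERE
  DATE(submission_timestamp) = "2024-05-10"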

Assignee: nobody → bewu

Additional related note: there are other daily queries with high slot usage relative to bytes scanned, but nothing close to firefox_desktop_derived.events_stream_v1, e.g. monitoring_derived.event_monitoring_aggregates_v1 (9 days 10 hr of slot time, 600 GB), telemetry_derived.clients_daily_v6, and telemetry_derived.firefox_desktop_usage_v1.

I would prefer to keep the number of special cases to a minimum, but if we continue seeing contention issues, we can consider running other queries with the on-demand model. If I'm understanding the monitoring graph correctly, the events_stream query doesn't saturate the 2000 on-demand slots, so there's some room there.

Something we could consider doing is automating which queries to run on-demand based on empirical usage, e.g. using the comparison in https://mozilla-hub.atlassian.net/browse/DENG-492?focusedCommentId=864234 and some threshold like 2x, 5x, etc. This is actually an argument in favor of keeping all the backfill projects we have provisioned: we could use however many of them we wanted as 2000-slot on-demand projects.
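
A rough sketch of what that comparison could look like against the jobs metadata table used elsewhere in this bug; the pricing constants and the 5x cutoff are illustrative assumptions, not official rates:

-- Flag destination tables whose reservation-equivalent cost is several times the
-- on-demand estimate; those are the candidates for moving to an on-demand project.
SELECT
  destination_table.dataset_id AS dataset_id,
  SPLIT(destination_table.table_id, '$')[OFFSET(0)] AS table_id,  -- strip partition decorators
  SUM(total_slot_ms) / 1000 / 60 / 60 * 0.05 AS approx_reservation_usd,  -- assume ~$0.05 per slot hour
  SUM(total_bytes_billed) / POW(1024, 4) * 5 AS approx_on_demand_usd,    -- assume ~$5 per TiB scanned
  SAFE_DIVIDE(
    SUM(total_slot_ms) / 1000 / 60 / 60 * 0.05,
    SUM(total_bytes_billed) / POW(1024, 4) * 5
  ) AS cost_ratio
FROM
  `moz-fx-data-shared-prod.monitoring_derived.jobs_by_organization_v1`
WHERE
  DATE(creation_time) BETWEEN "2024-04-15" AND "2024-05-13"
  AND error_result.reason IS NULL
GROUP BY
  dataset_id,
  table_id
HAVING
  cost_ratio > 5
ORDER BY
  cost_ratio DESC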

The numbers we're dealing with here appear to be about a 10x cost reduction ($50/day vs. $5/day), but the more useful side effect is definitely the reduction in slot contention.
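
As a back-of-envelope check on that figure, using the same illustrative rates as the sketch above (~42 days of slot time vs. ~900 GB scanned per run):

SELECT
  42 * 24 * 0.05 AS approx_reservation_usd_per_day,  -- about $50
  900 / 1024 * 5 AS approx_on_demand_usd_per_day     -- about $4.40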

The events_stream queries (for all Glean apps) have been running correctly in moz-fx-data-backfill-2 with on-demand billing for the past few days, scanning 1.8 TB and using 2207 slot hours for 2024-05-10 (Friday), so I'm resolving this.

Query used to get these numbers (per-job and total bytes billed and slot hours for the events_stream_v1 jobs that wrote the 2024-05-10 partitions from backfill-2):

SELECT
  creation_time,
  end_time,
  destination_table,
  total_bytes_billed / 1024 / 1024 / 1024 / 1024 AS tb_billed,
  total_slot_ms / 1000 / 60 / 60 AS slot_hours,
  SUM(total_bytes_billed) OVER () / 1024 / 1024 / 1024 / 1024 AS total_tb,
  SUM(total_slot_ms) OVER () / 1000 / 60 / 60 AS total_slot_hours,
FROM
  `moz-fx-data-shared-prod.monitoring_derived.jobs_by_organization_v1`
WHERE
  DATE(creation_time) = "2024-05-11"
  AND destination_table.table_id LIKE 'events_stream_v1$20240510'
  AND project_id = 'moz-fx-data-backfill-2'
ORDER BY total_bytes_processed DESC

One thing to note is that the Firefox Desktop query has been getting heavier over the last month. It was about 1000 slot hours when I created this bug, and over the past week it ranged from around 1500 to 2000. It was 2044 slot hours for 2024-05-10 and managed to complete in 2 hours (and 30 seconds), so there's a lot of headroom before it hits the 6-hour time limit, but it's still somewhat concerning that it increased by that much.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

The bytes scanned also increased by about 15-30% over the last month. The query below lists bytes billed and slot hours per events_stream_v1 job over that period:

SELECT
  creation_time,
  end_time,
  destination_table,
  job_id,
  reservation_id,
  total_bytes_billed / 1024 / 1024 / 1024 / 1024 AS tb_billed,
  total_slot_ms / 1000 / 60 / 60 AS slot_hours,
FROM
  `moz-fx-data-shared-prod.monitoring_derived.jobs_by_organization_v1`
WHERE
  DATE(creation_time) BETWEEN "2024-04-15" AND "2024-05-13"
  AND destination_table.project_id = 'moz-fx-data-shared-prod'
  AND destination_table.dataset_id = 'firefox_desktop_derived'
  AND destination_table.table_id LIKE 'events_stream_v1%'
  AND error_result.reason IS NULL
  AND end_time IS NOT NULL
  AND job_id LIKE 'bqjob_%'
ORDER BY creation_time DESC
