Open Bug 1887755 Opened 1 year ago Updated 10 months ago

`firefox_desktop_stable.metrics_v1` table increasing in size

Categories: Data Platform and Tools :: General, task

Tracking: Not tracked

People: Reporter: ascholtz, Unassigned

Whiteboard: [dataquality]

firefox_desktop_stable.metrics_v1 has almost doubled in storage size over the last few days. The increase seems to come from metrics under metrics.timing_distribution, but more investigation is needed.

This is in fact due to timing distributions that landed in the latest Firefox release on 2024-03-20. The table size is continuing to increase, and this may be causing issues such as copy_deduplicate taking 3x as long to complete (Slack thread: https://mozilla.slack.com/archives/C01E8GDG80N/p1712320531088849). It is not fully confirmed that this is the cause, though.

This may continue to cause other problems so we should check on it again next week to see if the size levels off.

This is potentially caused by a large number of distributions with a lot of buckets, e.g.

SELECT
  DATE(submission_timestamp) AS submission_date,
  normalized_channel,
  ARRAY_LENGTH(metrics.timing_distribution.network_dns_start.values) AS bucket_count,
  COUNT(*) AS ping_count,
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) = '2024-04-01'
  AND metrics.timing_distribution.network_dns_start IS NOT NULL
  AND sample_id = 1
GROUP BY
  submission_date,
  normalized_channel,
  bucket_count
ORDER BY
  bucket_count DESC
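
To compare channels at a glance, the rows returned by the query above could be collapsed into a ping-weighted mean bucket count per channel. The following is only a rough sketch: the function name and the example rows are made up for illustration and are not actual query results.

```python
from collections import defaultdict


def mean_buckets_per_channel(rows):
    """Ping-weighted mean bucket count per channel.

    rows: iterable of (normalized_channel, bucket_count, ping_count)
    tuples, i.e. the non-date columns of the query above.
    Rows with a NULL (None) bucket_count are skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # channel -> [bucket sum, ping sum]
    for channel, bucket_count, ping_count in rows:
        if bucket_count is None:
            continue
        totals[channel][0] += bucket_count * ping_count
        totals[channel][1] += ping_count
    return {
        channel: buckets / pings
        for channel, (buckets, pings) in totals.items()
        if pings > 0
    }


# Hypothetical rows, NOT real data from metrics_v1.
example_rows = [
    ("release", 40, 10),
    ("release", 10, 30),
    ("beta", 5, 4),
    ("beta", None, 2),
]

print(mean_buckets_per_channel(example_rows))
```

A mean that is much higher on one channel than on the others would suggest where the large distributions are coming from, though it says nothing about the other timing distributions in the table.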

release-drivers indicates 100% rollout of Fx Desktop 124.0.2 on Apr. 3 so hopefully that corresponds to a leveling off in size. Given the ETL issues we're seeing and associated cost increase we may need to make infrastructure changes to deal with this.

firefox_desktop_derived__events_stream__v1 inside bqetl_glean_usage failed (exec_date: 2024-04-05) with:

[2024-04-06, 15:26:48 UTC] {pod_manager.py:466} INFO - [base] BigQuery error in query operation: Error processing job 'moz-fx-data-shared-
[2024-04-06, 15:26:48 UTC] {pod_manager.py:466} INFO - [base] prod:bqjob_r5acdbb4d638ede16_0000018eb2bb33b4_1': Operation timed out after 6.0
[2024-04-06, 15:26:48 UTC] {pod_manager.py:466} INFO - [base] hours. Consider reducing the amount of work performed by your operation so that

Link to the task: https://workflow.telemetry.mozilla.org/dags/bqetl_glean_usage/grid?execution_date=2024-04-05T04%3A15%3A00%2B00%3A00&dag_run_id=scheduled__2024-04-04T02%3A00%3A00%2B00%3A00&task_id=firefox_desktop.firefox_desktop_derived__events_stream__v1&tab=logs

The firefox_desktop_derived__events_stream__v1 failure may have been due to running concurrently with a publish_new_tables run that, according to Anna, ended up taking 20 hours because it was doing backfills. The copy_deduplicate_all run for the next day, which ran at the same time, also ran over 6 hours but succeeded on retry. So this may have just been a one-time hiccup due to resource contention, not specifically related to firefox_desktop_stable.metrics_v1, but it is definitely worth looking again tomorrow since that run will process a non-weekend day of data.

This does raise a question of how we should be doing the init backfills but that's out of scope for this bug.

firefox_desktop_stable.metrics_v1 partition size seems to have stopped increasing for now at ~5.3 TB per weekday based on this dashboard https://mozilla.cloud.looker.com/dashboards/387?Submission+Date=90+day+ago+for+90+day

copy_deduplicate runtimes also look to have stopped increasing and were between 3 and 4 hours last week.

See Also: Bug 1898336