`firefox_desktop_stable.metrics_v1` table increasing in size
Categories: Data Platform and Tools :: General, task
Tracking: Not tracked
People: Reporter: ascholtz, Unassigned
Whiteboard: [dataquality]
`firefox_desktop_stable.metrics_v1` has almost doubled in storage size over the last few days. The increase seems to come from metrics under `metrics.timing_distribution`, but more investigation is needed.
Comment 1•1 year ago
This is in fact due to timing distributions that landed in the latest Firefox release on 2024-03-20. The table size is continuing to increase, and this may be causing issues such as copy_deduplicate taking 3x as long to complete (Slack thread: https://mozilla.slack.com/archives/C01E8GDG80N/p1712320531088849). It is not fully confirmed that this is the cause, though.
This may continue to cause other problems, so we should check on it again next week to see if the size levels off.
This is potentially caused by a large number of distributions with a lot of buckets, e.g.
SELECT
  DATE(submission_timestamp) AS submission_date,
  normalized_channel,
  ARRAY_LENGTH(metrics.timing_distribution.network_dns_start.values) AS bucket_count,
  COUNT(*) AS ping_count,
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) = '2024-04-01'
  AND metrics.timing_distribution.network_dns_start IS NOT NULL
  AND sample_id = 1
GROUP BY
  submission_date,
  normalized_channel,
  bucket_count
ORDER BY
  bucket_count DESC
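To illustrate why wide-ranging timing samples turn into many buckets per ping, here is a minimal Python sketch of exponential bucketing. It assumes the scheme documented for Glean timing distributions (log base 2 with 8 buckets per power of two). The constants and function names are illustrative assumptions, not taken from this bug, and the sketch ignores any upper cap on sample values.

```python
import math

# Assumed bucketing parameters (per the Glean timing distribution docs).
LOG_BASE = 2.0
BUCKETS_PER_MAGNITUDE = 8.0
EXPONENT = LOG_BASE ** (1.0 / BUCKETS_PER_MAGNITUDE)


def bucket_index(sample: int) -> int:
    """Return the exponential bucket index for a non-negative sample."""
    return math.floor(math.log(sample + 1) / math.log(EXPONENT))


def distinct_buckets(samples) -> int:
    """Count how many distinct buckets a collection of samples occupies."""
    return len({bucket_index(s) for s in samples})
```

Identical samples collapse into a single bucket, while samples spread across several orders of magnitude each land in their own bucket, so a metric with a wide value range carries many more key/value entries per ping.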
Comment 2•1 year ago
release-drivers indicates 100% rollout of Fx Desktop 124.0.2 on Apr. 3, so hopefully that corresponds to a leveling off in size. Given the ETL issues we're seeing and the associated cost increase, we may need to make infrastructure changes to deal with this.
Comment 3•1 year ago
`firefox_desktop_derived__events_stream__v1` inside `bqetl_glean_usage` failed (exec_date: 2024-04-05) with:
[2024-04-06, 15:26:48 UTC] {pod_manager.py:466} INFO - [base] BigQuery error in query operation: Error processing job 'moz-fx-data-shared-
[2024-04-06, 15:26:48 UTC] {pod_manager.py:466} INFO - [base] prod:bqjob_r5acdbb4d638ede16_0000018eb2bb33b4_1': Operation timed out after 6.0
[2024-04-06, 15:26:48 UTC] {pod_manager.py:466} INFO - [base] hours. Consider reducing the amount of work performed by your operation so that
Comment 4•1 year ago
The `firefox_desktop_derived__events_stream__v1` failure may have been due to running concurrently with a `publish_new_tables` run that ended up taking 20 hours because it was doing backfills, according to Anna. The copy_deduplicate_all run for the next day, which ran at the same time, also ran over 6 hours but succeeded on retry. So this may have just been a one-time hiccup due to resource contention, not specifically related to `firefox_desktop_stable.metrics_v1`, but it is definitely worth looking again tomorrow since it will process a non-weekend day of data.
This does raise a question of how we should be doing the init backfills, but that's out of scope for this bug.
Comment 5•11 months ago
`firefox_desktop_stable.metrics_v1` partition size seems to have stopped increasing for now at ~5.3 TB per weekday, based on this dashboard: https://mozilla.cloud.looker.com/dashboards/387?Submission+Date=90+day+ago+for+90+day
copy_deduplicate runtimes also look to have stopped increasing and were between 3 and 4 hours last week.
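One way to make the "has it leveled off" check repeatable is sketched below in Python. The function name, the 5-day window, and the 5% tolerance are all assumptions chosen for illustration, and the sample sizes are made up rather than read from the dashboard.

```python
def has_leveled_off(sizes_tb, window=5, tolerance=0.05):
    """Return True if the last `window` sizes vary by at most `tolerance`.

    Variation is measured as (max - min) / max over the recent window.
    """
    if window < 2 or len(sizes_tb) < window:
        return False  # not enough data points to judge
    recent = sizes_tb[-window:]
    highest = max(recent)
    if highest <= 0:
        return False
    return (highest - min(recent)) / highest <= tolerance


# Hypothetical weekday partition sizes in TB (illustrative only).
growing = [2.8, 3.4, 4.1, 4.7, 5.3]
flat = [5.2, 5.3, 5.3, 5.25, 5.3]

print(has_leveled_off(growing))
print(has_leveled_off(flat))
```

Comparing only weekday partitions keeps the lower weekend volume from looking like a drop in size.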