The Airflow task prerelease_telemetry_aggregates.prerelease_telemetry_aggregate_view_dataproc failed on 2022-06-27, 02:00:00
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: kik, Assigned: linh)
Details
(Whiteboard: [airflow-triage])
Attachments
(1 file)
73.06 KB, image/png
The Airflow task prerelease_telemetry_aggregates.prerelease_telemetry_aggregate_view_dataproc failed on 2022-06-27, 02:00:00
Investigating the Dataproc job's logs, the error appears to be caused by the database running out of disk space:
psycopg2.OperationalError: could not write to file "base/pgsql_tmp/pgsql_tmp32306.194": No space left on device
CONTEXT: SQL statement "with merge as (update build_id_beta_89_20210422 as dest
set histogram = aggregate_histogram_arrays(dest.histogram, src.histogram)
from staging_build_id_beta_89_20210422 as src
where dest.dimensions = src.dimensions
returning dest.*)
delete from staging_build_id_beta_89_20210422 as stage
using merge
where stage.dimensions = merge.dimensions"
PL/pgSQL function merge_table(text,text,text,text,regclass) line 18 at EXECUTE
SSL SYSCALL error: EOF detected
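For context, the statement that the PL/pgSQL merge_table function generates follows Postgres's data-modifying-CTE pattern: the UPDATE merges staging histograms into the destination table, and the DELETE removes the staging rows that were successfully merged. A simplified sketch of that generated SQL (table and function names taken from the log above):

```sql
-- Sketch of the generated statement, reconstructed from the error log above.
-- The UPDATE joins staging rows to destination rows on `dimensions` and
-- combines their histograms; the CTE's RETURNING clause feeds the DELETE,
-- which then clears only the staging rows that actually matched.
WITH merge AS (
    UPDATE build_id_beta_89_20210422 AS dest
       SET histogram = aggregate_histogram_arrays(dest.histogram, src.histogram)
      FROM staging_build_id_beta_89_20210422 AS src
     WHERE dest.dimensions = src.dimensions
 RETURNING dest.*
)
DELETE FROM staging_build_id_beta_89_20210422 AS stage
 USING merge
 WHERE stage.dimensions = merge.dimensions;
```

The join is what fills base/pgsql_tmp: when the hash or sort needed to match dest.dimensions against src.dimensions exceeds work_mem, Postgres spills it to temporary files on disk, which is where the "No space left on device" error surfaced.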
Link to the dataproc job:
https://console.cloud.google.com/dataproc/jobs/prerelease_aggregates_089bd3b0/monitoring?region=us-west1&project=airflow-dataproc
Comment 1•3 years ago
Comment 2•3 years ago
Found the database with storage complaints (telemetry-aggregates-lean). To address the current failed job, I doubled the storage allocated to both the replica and the main database; this is processing right now, and I'll ping here when it's ready for the job to retry.
Beyond that short-term fix: storage looks to have been used up on June 27th (this past Monday), and the database hasn't been able to recover since. I'm sharing the Free Storage Space (mb/sec) metric from RDS monitoring here, in case anyone has thoughts on whether this is a regular and/or understandable operational pattern for this particular database/system. My no-context read is that this looks like normal, slowly increasing storage usage over time that simply hit rock bottom on Monday and couldn't recover. I'll open SRE tickets to perhaps 1. monitor this kind of impending issue better; 2. turn on storage autoscaling for this particular database.
Super open to other interpretations, context, or ideas here though.
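One way to check whether the merge job's temporary-file churn lines up with the storage drop (a sketch, assuming direct SQL access to the instance; column names are from Postgres's built-in pg_stat_database view):

```sql
-- Per-database temporary-file usage since the last stats reset. Steadily
-- growing temp_bytes alongside shrinking free storage would support the
-- "normal growth that finally hit rock bottom" read above.
SELECT datname,
       temp_files,                                -- count of temp files created
       pg_size_pretty(temp_bytes)  AS temp_used,  -- total bytes written to them
       pg_size_pretty(pg_database_size(datname)) AS db_size
  FROM pg_stat_database
 WHERE datname IS NOT NULL
 ORDER BY temp_bytes DESC;
```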
Comment 3•3 years ago
Comment 4•3 years ago
Storage resizing of the database and its replica has been completed, and the storage optimization stages appear done. Feel free to retry and let me know what happens.
I'll also put together the SRE tickets for this.