Bug 1777244 (Closed) · Opened 3 years ago · Closed 3 years ago

The Airflow task prerelease_telemetry_aggregates.prerelease_telemetry_aggregate_view_dataproc failed on 2022-06-27, 02:00:00

Categories

(Data Platform and Tools :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kik, Assigned: linh)

Details

(Whiteboard: [airflow-triage])

Attachments

(1 file)


Investigating the Dataproc job's logs, it appears this failure was caused by the database running out of disk space.

psycopg2.OperationalError: could not write to file "base/pgsql_tmp/pgsql_tmp32306.194": No space left on device
CONTEXT:  SQL statement "with merge as (update build_id_beta_89_20210422 as dest
                            set histogram = aggregate_histogram_arrays(dest.histogram, src.histogram)
                            from staging_build_id_beta_89_20210422 as src
                            where dest.dimensions = src.dimensions
                            returning dest.*)
                  delete from staging_build_id_beta_89_20210422 as stage
                  using merge
                  where stage.dimensions = merge.dimensions"
PL/pgSQL function merge_table(text,text,text,text,regclass) line 18 at EXECUTE
SSL SYSCALL error: EOF detected
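For context on the traceback: the failing statement is a data-modifying CTE inside the merge_table PL/pgSQL function. It updates matching rows in the destination table (aggregating histograms), then deletes the merged rows from staging; the work spilled to pgsql_tmp, which is where the disk filled up. A "no space left" error won't clear on its own, unlike a transient connection drop. A minimal sketch of how the Airflow task could tell these apart (the function and marker list are hypothetical, not existing code in this repo):

```python
# Hypothetical sketch: decide whether a failed merge is worth an automatic
# retry. "No space left on device" needs operator intervention (resize the
# volume) before a retry can succeed, while a transient connection drop
# (e.g. "SSL SYSCALL error: EOF detected") is safe to retry.
NON_RETRYABLE_MARKERS = (
    "No space left on device",
    "could not write to file",
)

def is_retryable(error_message: str) -> bool:
    """Return False for errors that require manual intervention first."""
    return not any(marker in error_message for marker in NON_RETRYABLE_MARKERS)
```

This would sit in the task's `except psycopg2.OperationalError` handler, gating whether Airflow re-queues the job or pages someone.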

Link to the dataproc job:
https://console.cloud.google.com/dataproc/jobs/prerelease_aggregates_089bd3b0/monitoring?region=us-west1&project=airflow-dataproc

Found the database reporting storage complaints, telemetry-aggregates-lean. Just to address the current failed job, I doubled the storage allocated to both the main db and its replica (the resize is processing right now; I'll ping here when it's ready for the job to retry).

Beyond the short-term fix: I can see where the storage was used up on June 27th (this past Monday), and it hasn't been able to recover since. I'm sharing the Free Storage Space (mb/sec) metric from RDS monitoring here in case others have thoughts on whether this is a regular and/or understandable operational pattern for this particular db/system. My no-context read is that this looks like normal, slowly increasing storage usage that simply hit rock bottom on Monday and couldn't recover. I'll open SRE tickets to 1. monitor this kind of impending issue better; 2. turn on storage autoscaling for this particular database.
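The "slowly decreasing until it hits zero" read can be sanity-checked numerically. A hypothetical sketch (the function name and sample values are illustrative, not the actual telemetry-aggregates-lean metrics) that fits a line to free-storage samples and projects when it crosses zero:

```python
# Hypothetical sketch: project when free storage hits zero by fitting a
# straight line to (day, free_gb) samples pulled from the monitoring metric.
def days_until_full(samples):
    """samples: list of (day, free_gb) pairs. Returns the projected day
    free space reaches zero, or None if usage is flat or shrinking."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    # Ordinary least-squares slope of free space vs. time.
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = num / den
    if slope >= 0:
        return None  # free space is not decreasing
    intercept = mean_y - slope * mean_x
    return -intercept / slope  # x where the fitted line crosses y = 0

# Illustrative numbers: losing ~5 GB/day from 20 GB free hits zero on day 4.
print(days_until_full([(0, 20.0), (1, 15.0), (2, 10.0)]))  # -> 4.0
```

Alerting on this projection (rather than on an absolute free-space threshold) is one way the SRE monitoring ticket could frame it.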

Super open to other interpretations, context or ideas here though.

Attached image freestorage2wks.png

Storage resizing of the database & its replica has been completed, and the storage optimization stage appears done. Feel free to retry and let me know what happens.

I'll put together SRE tickets for this as well.

Component: Datasets: General → General
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
