Closed Bug 1865187 Opened 2 years ago Closed 2 years ago

Airflow DAG `glam` is running much longer starting with exec_date 2023-11-14

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: srose, Assigned: efilho)

Details

(Whiteboard: [airflow-triage])

The 2023-11-14 DAG run has been running for over 41 hours, and its client_scalar_probe_counts task failed due to the query hitting BigQuery's 6-hour limit.

The 2023-11-15 DAG run has been running for over 17 hours and looks like it will do the same thing.

Task link:
https://workflow.telemetry.mozilla.org/dags/glam/grid?dag_run_id=scheduled__2023-11-14T02%3A00%3A00%2B00%3A00&task_id=client_scalar_probe_counts&tab=logs

Log extract:

google.api_core.exceptions.InternalServerError: 500 Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit.
Assignee: nobody → efilho
Status: NEW → ASSIGNED

The client_scalar_probe_counts task failed again for exec_date 2023-11-18.

The slowdown was apparently caused by DAG runs for different execution dates running at the same time and competing for resources.
Fortunately, the GLAM ETL has checkpoint steps (client_histogram_aggregates, client_scalar_aggregates): once a run has passed them, a subsequent run backfills the previous days of data, as long as the gap isn't too big. Since all of the stuck runs had already reached the checkpoint, I used this mechanism to skip a day, and the next run finished within the expected time because it was no longer competing for resources.
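
To illustrate the checkpoint-and-backfill idea described above, here is a minimal Python sketch. It is not the actual GLAM ETL code: the function name `dates_to_process`, the constant `MAX_BACKFILL_DAYS`, and its value of 3 are hypothetical, since the bug does not say how large a gap the real pipeline tolerates. The dates in the usage note are likewise illustrative.

```python
from __future__ import annotations

from datetime import date, timedelta

# Hypothetical limit: the bug only says the gap must not be "too big".
MAX_BACKFILL_DAYS = 3


def dates_to_process(
    last_checkpoint: date,
    exec_date: date,
    max_gap_days: int = MAX_BACKFILL_DAYS,
) -> list[date]:
    """Return the days a run for exec_date must cover, oldest first.

    last_checkpoint is the most recent day already folded into the
    aggregates. Every day after it, up to and including exec_date, is
    picked up by the next run, so a skipped day is backfilled.
    """
    gap = (exec_date - last_checkpoint).days
    if gap <= 0:
        # exec_date is already covered by the checkpoint.
        return []
    if gap > max_gap_days:
        raise ValueError(
            f"gap of {gap} days exceeds the backfill limit of {max_gap_days}"
        )
    return [last_checkpoint + timedelta(days=i) for i in range(1, gap + 1)]
```

For example, if the checkpoint covers 2023-01-10 and the 2023-01-11 run is skipped, calling `dates_to_process(date(2023, 1, 10), date(2023, 1, 12))` yields both 2023-01-11 and 2023-01-12, so the single later run covers the skipped day as well.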

Regarding client_scalar_probe_counts, that task also comes after the checkpoint, so the next run picked up where the failed one left off.
The last run of this DAG succeeded end to end, so I'm closing this.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED