Closed Bug 1865187 Opened 2 years ago Closed 2 years ago

Airflow DAG `glam` is running much longer starting with exec_date 2023-11-14

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: srose, Assigned: efilho)

Details

(Whiteboard: [airflow-triage])

Sean Rose [:srose]

Reporter

Description

•

2 years ago

The 2023-11-14 DAG run has been running for over 41 hours, and its client_scalar_probe_counts task failed due to the query hitting BigQuery's 6-hour limit.

The 2023-11-15 DAG run has been running for over 17 hours and looks like it will do the same thing.

Task link:
https://workflow.telemetry.mozilla.org/dags/glam/grid?dag_run_id=scheduled__2023-11-14T02%3A00%3A00%2B00%3A00&task_id=client_scalar_probe_counts&tab=logs

Log extract:

google.api_core.exceptions.InternalServerError: 500 Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit.

Sean Rose [:srose]

Reporter

Updated

•

2 years ago

Assignee: nobody → efilho

Status: NEW → ASSIGNED

Leli [:Leli]

Comment 1

•

2 years ago

The DAG run for exec_date 2023-11-18 failed again at the client_scalar_probe_counts task.

Eduardo Filho [:efilho]

Assignee

Comment 2

•

2 years ago

The slowdown was apparently caused by DAG runs for different execution dates running at the same time.
Fortunately, the glam ETL has checkpoint steps (client_histogram_aggregates, client_scalar_aggregates), after which a subsequent execution will backfill the previous days of data, as long as the gap isn't too big. I took advantage of this mechanism to skip a day, since all the previously stuck executions had reached the checkpoint, and the next execution ran within the expected time because it was no longer competing for resources.

Regarding client_scalar_probe_counts, that task also runs after the checkpoint, so the next execution picked up where it left off.
The last execution of this DAG was successful end to end, so I'm closing this.
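
For readers unfamiliar with the checkpoint behaviour described above, here is a rough sketch of how a run might decide which days to backfill. It is not the glam ETL code: the function name `dates_to_process`, the `MAX_GAP_DAYS` limit, and the example dates are all hypothetical, and the real threshold for "the gap isn't too big" is not stated in this bug.

```python
from __future__ import annotations

from datetime import date, timedelta

# Hypothetical limit on how far back a single run may backfill; the real
# glam ETL threshold is not given in this bug.
MAX_GAP_DAYS = 3


def dates_to_process(last_checkpoint: date, exec_date: date,
                     max_gap_days: int = MAX_GAP_DAYS) -> list[date]:
    """Return the dates a run must cover, oldest first.

    Covers every day after last_checkpoint up to and including exec_date.
    Raises ValueError if the gap is too large to backfill in one run.
    """
    gap = (exec_date - last_checkpoint).days
    if gap <= 0:
        return []  # nothing new since the last checkpoint
    if gap > max_gap_days:
        raise ValueError(f"gap of {gap} days exceeds limit of {max_gap_days}")
    return [last_checkpoint + timedelta(days=n) for n in range(1, gap + 1)]


# Skipping one day: the next run covers the skipped day and its own date.
print(dates_to_process(date(2023, 11, 17), date(2023, 11, 19)))
```

Under these assumptions, skipping a single stuck day is safe because the following run picks up both days, while a gap larger than the limit is rejected rather than silently truncated.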

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

Resolution: --- → FIXED


Categories

(Data Platform and Tools :: General, defect)
