Airflow DAG `glam` is running much longer starting with exec_date 2023-11-14
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: srose, Assigned: efilho)
Details
(Whiteboard: [airflow-triage])
The 2023-11-14 DAG run has been running for over 41 hours, and its client_scalar_probe_counts task failed due to the query hitting BigQuery's 6-hour limit.
The 2023-11-15 DAG run has been running for over 17 hours and looks like it will do the same thing.
Log extract:
google.api_core.exceptions.InternalServerError: 500 Operation timed out after 6.0 hours. Consider reducing the amount of work performed by your operation so that it can complete within this limit.
Reporter
Updated • 2 years ago
Comment 1 • 2 years ago
The client_scalar_probe_counts task failed again for exec_date 2023-11-18.
Assignee
Comment 2 • 2 years ago
The slowdown was apparently caused by DAG runs for different execution dates running at the same time and competing for resources.
Fortunately, the GLAM ETL has checkpoint steps (client_histogram_aggregates, client_scalar_aggregates): once a run has passed them, a subsequent execution will backfill the previous days of data, as long as the gap isn't too big. Since all of the stuck executions had already reached the checkpoint, I used this mechanism to skip a day. The next execution then ran within the expected time because it was no longer competing for resources.
The client_scalar_probe_counts failure also occurred after the checkpoint, so the next execution picked up where it left off.
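For illustration only, the skip decision described above could be sketched roughly as follows. This is not code from the GLAM ETL: the function name, the task-state dictionary, and especially `MAX_GAP_DAYS` are hypothetical, since the actual gap limit is not stated in this bug. Only the two checkpoint task names come from the comment.

```python
from datetime import date

# Checkpoint task names as given in the comment above.
CHECKPOINT_TASKS = ("client_histogram_aggregates", "client_scalar_aggregates")

# Hypothetical value: the bug only says the gap must not be "too big".
MAX_GAP_DAYS = 3


def can_skip_run(task_states: dict[str, str], stuck_date: date, next_date: date) -> bool:
    """Return True if a stuck run could be skipped and left to the next run's backfill.

    task_states maps task names to a state string such as "success" or "failed"
    (an assumed representation, not an Airflow API).
    """
    # Every checkpoint task must have finished successfully in the stuck run.
    reached_checkpoint = all(
        task_states.get(task) == "success" for task in CHECKPOINT_TASKS
    )
    # The next run must come later, and within the assumed backfill window.
    gap_days = (next_date - stuck_date).days
    return reached_checkpoint and 0 < gap_days <= MAX_GAP_DAYS
```

Under these assumptions, a run whose client_scalar_probe_counts task failed can still be skipped, because that task sits after the checkpoint.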
The last execution of this DAG was successful end to end, so I'm closing this.