Airflow task taar_weekly.dataflow_import_avro_to_bigtable run for 2023-02-05 appears stuck
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: kik, Assigned: lucia-vargas-a)
Details
(Whiteboard: [airflow-triage])
Airflow logs show that the pod has not been started:
[2023-02-12, 00:01:58 UTC] {{kubernetes_pod.py:564}} INFO - Creating pod dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326 with labels: {'dag_id': 'taar_weekly', 'task_id': 'dataflow_import_avro_to_bigtable', 'run_id': 'scheduled__2023-02-05T0000000000-64592e701', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2023-02-12, 00:02:00 UTC] {{pod_manager.py:178}} WARNING - Pod not yet started: dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326
[2023-02-12, 00:02:01 UTC] {{pod_manager.py:178}} WARNING - Pod not yet started: dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326
[2023-02-12, 00:02:02 UTC] {{pod_manager.py:178}} WARNING - Pod not yet started: dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326
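These warnings come from the operator polling the Kubernetes API until the pod leaves the `Pending` phase. A simplified, hypothetical sketch of that wait loop follows (this is not Airflow's actual implementation; `KubernetesPodOperator` exposes a comparable bound through its `startup_timeout_seconds` parameter, after which the task fails instead of waiting indefinitely):

```python
import time


def wait_for_pod_start(get_phase, startup_timeout_s=120, poll_interval_s=1.0, sleep=time.sleep):
    """Poll the pod phase until it leaves 'Pending'; raise if it never starts.

    `get_phase` is a stand-in for a Kubernetes API call. The timeout mirrors
    the KubernetesPodOperator's `startup_timeout_seconds` knob.
    """
    waited = 0.0
    while get_phase() == "Pending":
        if waited >= startup_timeout_s:
            raise TimeoutError("Pod not started within startup timeout")
        print("WARNING - Pod not yet started")
        sleep(poll_interval_s)
        waited += poll_interval_s
    return get_phase()


# Simulated phases: the pod stays Pending twice, then starts running.
phases = iter(["Pending", "Pending", "Running", "Running"])
print(wait_for_pod_start(lambda: next(phases), sleep=lambda s: None))
```

In this run, the pod did eventually start (the container exists under GCP workloads), so the stall happened inside the application rather than at pod scheduling.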
When checking the pod under "Workloads" in the GCP console, we see that the container exists; however, the application appears to be stuck and has not produced any logs for more than 24 hours now. Last logs:
2023-02-12 01:02:12.935 CET
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
2023-02-12 01:02:14.947 CET
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--iso-date=20230205', '--gcp-project=moz-fx-data-taar-pr-prod-e0f7', '--avro-gcs-bucket=moz-fx-data-taar-pr-prod-e0f7-prod-etl', '--bigtable-instance-id=taar-prod-202006', '--sample-rate=1.0', '--dataflow-service-account=taar-prod-dataflow@moz-fx-data-taar-pr-prod-e0f7.iam.gserviceaccount.com', '--gcs-to-bigtable']
2023-02-12 01:02:14.951 CET
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--iso-date=20230205', '--gcp-project=moz-fx-data-taar-pr-prod-e0f7', '--avro-gcs-bucket=moz-fx-data-taar-pr-prod-e0f7-prod-etl', '--bigtable-instance-id=taar-prod-202006', '--sample-rate=1.0', '--dataflow-service-account=taar-prod-dataflow@moz-fx-data-taar-pr-prod-e0f7.iam.gserviceaccount.com', '--gcs-to-bigtable']
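The "Discarding unparseable args" warnings are generally benign: Beam's `PipelineOptions` warns about any flags it does not itself recognize, and jobs typically parse their custom flags (such as `--iso-date` here) with their own parser via `parse_known_args`, forwarding the remainder to Beam. A minimal sketch of that pattern, using argument names taken from the log above (the parser itself is a hypothetical illustration, not the actual TAAR job code):

```python
import argparse


def parse_job_args(argv):
    """Split job-specific flags from those meant for Beam/Dataflow."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--iso-date")
    parser.add_argument("--gcp-project")
    parser.add_argument("--sample-rate", type=float, default=1.0)
    parser.add_argument("--gcs-to-bigtable", action="store_true")
    # parse_known_args leaves unknown flags (Beam's own options) intact
    known, beam_args = parser.parse_known_args(argv)
    return known, beam_args


known, beam_args = parse_job_args(
    ["--iso-date=20230205", "--sample-rate=1.0", "--gcs-to-bigtable", "--runner=DataflowRunner"]
)
print(known.iso_date)  # prints 20230205
print(beam_args)       # prints ['--runner=DataflowRunner']
```

Since the warnings repeat and then the job goes silent, the hang is more likely after option parsing, during pipeline construction or submission.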
Comment 1 (Assignee) • 3 years ago
Since there is no error message in the logs and the application is not failing, we'll first try restarting the DAG. For this we need SRE to delete the existing pod and cancel the Dataflow job before re-triggering the DAG.
Comment 2 (Assignee) • 3 years ago
However, the re-triggered taar_weekly run has now been going for 16 hours (it usually takes approximately 2 hours), so the re-run did not solve the issue.
We'll continue the investigation.
Comment 4 (Reporter) • 3 years ago
Yes; however, it is still unclear why the task gets stuck when it runs on the automated schedule...
Here's a dashboard used for taar that Evgeny linked me:
https://sql.telemetry.mozilla.org/dashboard/taar-production?p_end_date=2023-02-27&p_start_date=2023-02-01&p_w64127_end_date=2023-02-16&p_w64127_start_date=2022-02-01
Data appears to be there...
Updated (Reporter) • 3 years ago
Comment 5 (Reporter) • 3 years ago
It appears the most recent scheduled run succeeded. Marking this as resolved.
Updated • 2 years ago