Closed Bug 1816446 Opened 3 years ago Closed 3 years ago

Airflow task taar_weekly.dataflow_import_avro_to_bigtable run for 2023-02-05 appears stuck

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kik, Assigned: lucia-vargas-a)

Details

(Whiteboard: [airflow-triage])

Airflow logs show that the pod has not been started:

[2023-02-12, 00:01:58 UTC] {{kubernetes_pod.py:564}} INFO - Creating pod dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326 with labels: {'dag_id': 'taar_weekly', 'task_id': 'dataflow_import_avro_to_bigtable', 'run_id': 'scheduled__2023-02-05T0000000000-64592e701', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2023-02-12, 00:02:00 UTC] {{pod_manager.py:178}} WARNING - Pod not yet started: dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326
[2023-02-12, 00:02:01 UTC] {{pod_manager.py:178}} WARNING - Pod not yet started: dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326
[2023-02-12, 00:02:02 UTC] {{pod_manager.py:178}} WARNING - Pod not yet started: dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326
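
The `pod_manager` warnings above come from a poll loop: the operator repeatedly checks the pod's phase until it leaves `Pending` or a startup timeout expires. A minimal sketch of that pattern (simplified, not Airflow's actual `pod_manager` code; the phase strings are the standard Kubernetes ones):

```python
import time

def wait_for_pod_start(get_phase, timeout_s=120.0, poll_s=1.0):
    """Poll a pod's phase until it leaves 'Pending' or the timeout expires.

    get_phase: callable returning the pod's current phase string.
    Returns the first non-Pending phase, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase != "Pending":
            return phase
        print(f"WARNING - Pod not yet started (phase={phase})")
        time.sleep(poll_s)
    raise TimeoutError("pod did not start before the timeout")

# Simulated pod that stays Pending for two polls, then starts running.
phases = iter(["Pending", "Pending", "Running"])
print(wait_for_pod_start(lambda: next(phases), timeout_s=5.0, poll_s=0.01))
```

In this bug the pod did eventually start (the container logs below exist), so the warnings stopped; the task then hung later, inside the application itself.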

Checking the pod under "Workloads" in the GCP console shows the container exists; however, the application appears stuck and has not produced any logs for more than 24 hours. Last logs:

2023-02-12 01:02:12.935 CET
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
2023-02-12 01:02:14.947 CET
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--iso-date=20230205', '--gcp-project=moz-fx-data-taar-pr-prod-e0f7', '--avro-gcs-bucket=moz-fx-data-taar-pr-prod-e0f7-prod-etl', '--bigtable-instance-id=taar-prod-202006', '--sample-rate=1.0', '--dataflow-service-account=taar-prod-dataflow@moz-fx-data-taar-pr-prod-e0f7.iam.gserviceaccount.com', '--gcs-to-bigtable']
2023-02-12 01:02:14.951 CET
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--iso-date=20230205', '--gcp-project=moz-fx-data-taar-pr-prod-e0f7', '--avro-gcs-bucket=moz-fx-data-taar-pr-prod-e0f7-prod-etl', '--bigtable-instance-id=taar-prod-202006', '--sample-rate=1.0', '--dataflow-service-account=taar-prod-dataflow@moz-fx-data-taar-pr-prod-e0f7.iam.gserviceaccount.com', '--gcs-to-bigtable']
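
The "Discarding unparseable args" warnings are expected Beam behavior rather than an error: `PipelineOptions` is built on `argparse` and only warns about flags it does not recognize, leaving them for the application's own parser. A minimal illustration of the underlying `argparse` mechanism (flag names taken from the log above):

```python
import argparse

# Beam-style parsing: collect the options the pipeline knows about and
# return everything else untouched instead of erroring out.
parser = argparse.ArgumentParser()
parser.add_argument("--runner")
known, unknown = parser.parse_known_args(
    ["--runner=DataflowRunner", "--iso-date=20230205", "--gcs-to-bigtable"]
)
print(known.runner)  # the recognized flag
print(unknown)       # the "unparseable" application-specific flags
```

So these warnings don't explain the hang; the notable symptom is that no log lines at all follow them.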

https://console.cloud.google.com/kubernetes/pod/us-west1/workloads-prod-v1/default/dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326/details?project=moz-fx-data-airflow-gke-prod&pageState=(%22savedViews%22:(%22i%22:%2288c77fdb0e024497ad0c5ff75f746848%22,%22c%22:%5B%22gke%2Fus-west1%2Fworkloads-prod-v1%22%5D,%22n%22:%5B%5D))

Since there is nothing in the logs and the application is not failing on its own, we'll first try restarting the DAG. For this we need SRE to delete the existing pod and cancel the Dataflow job before re-triggering the DAG.
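
The cleanup steps can be sketched as the commands below. This is a hypothetical dry-run sketch: it only assembles and prints the `kubectl`/`gcloud` command lines (the pod name is from the log above; the namespace, region, and job-ID lookup are assumptions), since actually executing them requires cluster and GCP credentials.

```python
# Hypothetical dry-run of the SRE cleanup steps; commands are printed, not run.
POD = "dataflow-import-avro-to-bigtab-df1e603f15af4a748a2e1f8205a1c326"
NAMESPACE = "default"  # assumption: the operator launches pods in "default"
REGION = "us-west1"    # assumption: Dataflow region matches the GKE region

# 1. Delete the stuck KubernetesPodOperator pod so the task can be re-run.
delete_pod = ["kubectl", "delete", "pod", POD, "--namespace", NAMESPACE]

# 2. Find the still-active Dataflow job, then cancel it by ID.
list_jobs = ["gcloud", "dataflow", "jobs", "list",
             "--region", REGION, "--status", "active"]

def cancel_job(job_id):
    return ["gcloud", "dataflow", "jobs", "cancel", job_id, "--region", REGION]

for cmd in (delete_pod, list_jobs, cancel_job("<JOB_ID>")):
    print(" ".join(cmd))
```

Only after both the pod and the orphaned Dataflow job are gone is it safe to clear the task and let the DAG re-trigger.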

However, taar_weekly has now been running for 16 hours (it usually takes approx. 2 hours), so the re-run did not solve the issue.
We'll continue the investigation.

This seems to have been fixed?

Flags: needinfo?(alvargasa)

Yes, however, it is still unclear why the task gets stuck when it is run on the automated schedule...

Here's a dashboard used for taar that Evgeny linked me:
https://sql.telemetry.mozilla.org/dashboard/taar-production?p_end_date=2023-02-27&p_start_date=2023-02-01&p_w64127_end_date=2023-02-16&p_w64127_start_date=2022-02-01

Data appears to be there...

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

It appears the most recent scheduled run succeeded. Marking this as resolved.

Status: REOPENED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Flags: needinfo?(alvargasa)