Open Bug 1852266 Opened 2 years ago Updated 1 year ago

GCP Airflow tasks are failing with common error

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

People

(Reporter: frank, Assigned: frank)

Details

This looks to be the result of a recent Airflow upgrade. We suspect that the Google-provided packages have a bug.

Copy dedupe logs: https://workflow.telemetry.mozilla.org/dags/copy_deduplicate/grid?search=copy_deduplicate&dag_run_id=scheduled__2023-09-07T01%3A00%3A00%2B00%3A00&task_id=copy_deduplicate_all&tab=logs

Error:

[2023-09-08, 12:11:47 UTC] {pod.py:907} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 550, in execute_sync
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 492, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels dag_id=copy_deduplicate,kubernetes_pod_operator=True,run_id=scheduled__2023-09-07T0100000000-51fa1e10e,task_id=copy_deduplicate_all,already_checked!=True,!airflow-worker
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 751, in patch_already_checked
    name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'
Assignee: nobody → mducharme
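
To make the selector in the exception above easier to read, here is a minimal sketch of how a label selector of that shape decides which pods count as matches. This is not the Kubernetes or Airflow implementation: `matches_selector` and `RUN_LABELS` are hypothetical names, and only the three term forms that appear in this log (`key=value`, `key!=value`, `!key`) are handled.

```python
# Selector string copied from the AirflowException in the log above.
SELECTOR = (
    "dag_id=copy_deduplicate,kubernetes_pod_operator=True,"
    "run_id=scheduled__2023-09-07T0100000000-51fa1e10e,"
    "task_id=copy_deduplicate_all,already_checked!=True,!airflow-worker"
)

# Hypothetical labels for a pod launched by this task run.
RUN_LABELS = {
    "dag_id": "copy_deduplicate",
    "kubernetes_pod_operator": "True",
    "run_id": "scheduled__2023-09-07T0100000000-51fa1e10e",
    "task_id": "copy_deduplicate_all",
}


def matches_selector(labels, selector):
    """Return True if `labels` satisfies every comma-separated selector term."""
    for term in selector.split(","):
        term = term.strip()
        if term.startswith("!"):
            # "!key": the label must be absent.
            if term[1:] in labels:
                return False
        elif "!=" in term:
            # "key!=value": absent keys and other values both pass.
            key, value = term.split("!=", 1)
            if labels.get(key) == value:
                return False
        else:
            # "key=value": the label must be present with exactly this value.
            key, value = term.split("=", 1)
            if labels.get(key) != value:
                return False
    return True


# Two pods from the same task run carry identical labels, so both match.
pods = [dict(RUN_LABELS), dict(RUN_LABELS)]
duplicates = [p for p in pods if matches_selector(p, SELECTOR)]
print(len(duplicates))
```

Under this reading, a pod labelled `already_checked=True` would no longer match, so a second match means an earlier pod from the same run is still present and unmarked.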

The error in the first log is a 401 Unauthorized when retrieving the pod logs:

[2023-09-08, 02:02:40 UTC] {pod.py:907} ERROR - (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'd32bdcdf-6f7d-4c4b-b6b3-740e1becd5e5', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 08 Sep 2023 02:02:40 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 368, in consume_logs
    logs = self.read_pod_logs(
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 325, in iter
    raise retry_exc.reraise()
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 158, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 494, in read_pod_logs
    logs = self._client.read_namespaced_pod_log(
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23747, in read_namespaced_pod_log
    return self.read_namespaced_pod_log_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 23866, in read_namespaced_pod_log_with_http_info
    return self.api_client.call_api(
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)

Because the 401 happens on the Airflow side, the task fails but the pod keeps running. When the next attempt spins up a new pod, we get the duplicate-labels error, which is accurate: two pods with the same labels really are running.
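
The sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions, not the provider's code: `Pod`, `TaskError`, `find_pod`, `cleanup`, and `run_attempt` are simplified stand-ins, the cluster is a plain list, and the model assumes the first pod is left running and unmarked after the 401.

```python
from dataclasses import dataclass


class TaskError(Exception):
    """Stand-in for AirflowException in this sketch."""


@dataclass
class Pod:
    metadata: dict  # {"name": ..., "labels": {...}}
    running: bool = True


def find_pod(cluster, labels):
    """Return the single running pod with these labels, or raise on duplicates."""
    matches = [p for p in cluster if p.running and p.metadata["labels"] == labels]
    if len(matches) > 1:
        raise TaskError(f"More than one pod running with labels {labels}")
    return matches[0] if matches else None


def cleanup(pod):
    """Toy cleanup step: it needs the pod name, so it fails when pod is None."""
    return pod.metadata["name"]


def run_attempt(cluster, labels, pod_name, log_read_ok=True):
    """One task attempt: launch a pod, look it up by labels, read its logs."""
    cluster.append(Pod({"name": pod_name, "labels": dict(labels)}))
    remote_pod = None
    try:
        remote_pod = find_pod(cluster, labels)  # raises before assignment on duplicates
        if not log_read_ok:
            raise TaskError("(401) Unauthorized")  # fails on the Airflow side only
        remote_pod.running = False  # normal completion
        return "success"
    except TaskError as exc:
        try:
            cleanup(remote_pod)  # remote_pod is still None if find_pod raised
        except AttributeError as cleanup_exc:
            return f"failed: {exc}; cleanup: {cleanup_exc}"
        return f"failed: {exc}"  # assumption: the pod keeps running, unmarked


cluster = []
labels = {"dag_id": "copy_deduplicate", "task_id": "copy_deduplicate_all"}
first = run_attempt(cluster, labels, "pod-a", log_read_ok=False)
second = run_attempt(cluster, labels, "pod-b")
print(first)
print(second)
```

In this model the first attempt fails on the log read alone, and the second attempt fails twice over: once on the duplicate lookup and again when cleanup is handed `None`, which mirrors the chained traceback in the description.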

We're rolling back to Airflow 2.5.3.

Airflow downgrade is complete. Restarting jobs now.

Taking over this ticket to backfill affected DAGs.

Assignee: mducharme → fbertsch