Closed Bug 1648327 Opened 5 years ago Closed 5 years ago

Airflow times out while monitoring Cloud Dataflow jobs

Categories

(Data Platform and Tools Graveyard :: Operations, defect)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: vng, Unassigned)

Details

Attachments

(1 file)

Attached image dataflow_job.png

I'm getting a timeout in airflow/contrib/kubernetes/pod_launcher.py when monitoring a Dataflow job in Airflow:


[2020-06-25 01:51:37,212] {logging_mixin.py:112} INFO - [2020-06-25 01:51:37,212] {pod_launcher.py:125} INFO - WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--iso-date=20200614', '--gcp-project=moz-fx-data-taar-nonprod-48b6', '--avro-gcs-bucket=moz-fx-data-taar-nonprod-48b6-stage-etl', '--bigtable-instance-id=taar-stage-202006', '--gcs-to-bigtable']

[2020-06-25 01:56:44,686] {taskinstance.py:1088} ERROR - ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/airflow/models/taskinstance.py", line 955, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/app/dags/operators/gcp_container_operator.py", line 96, in execute
    result = super(UpstreamGKEPodOperator, self).execute(context) # Moz specific
  File "/app/dags/operators/backport/kubernetes_pod_operator_1_10_7.py", line 251, in execute
    get_logs=self.get_logs)
  File "/usr/local/lib/python2.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 117, in run_pod
    return self._monitor_pod(pod, get_logs)
  File "/usr/local/lib/python2.7/site-packages/airflow/contrib/kubernetes/pod_launcher.py", line 124, in _monitor_pod
    for line in logs:
  File "/usr/local/lib/python2.7/site-packages/urllib3/response.py", line 808, in __iter__
    for chunk in self.stream(decode_content=True):
  File "/usr/local/lib/python2.7/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python2.7/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
[2020-06-25 01:56:44,697] {taskinstance.py:1119} INFO - Marking task as FAILED.
[2020-06-25 01:56:44,712] {logging_mixin.py:112} INFO - [2020-06-25 01:56:44,712] {log_email_backend.py:54} INFO - 
Content-Type: multipart/mixed; boundary="===============5139698362747261446=="
MIME-Version: 1.0
Subject: Airflow alert: <TaskInstance:
 taar_weekly.dataflow_import_avro_to_bigtable 2020-06-14T00:00:00+00:00
 [failed]>

At the time the task failed, I checked that the Cloud Dataflow job taar-profile-load-20200614 (job id: 2020-06-24_18_51_37-17541505251849162564) was still running.

Retrying the job isn't appropriate as the job has not actually failed - it is still executing.
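The traceback shows the failure happens while `_monitor_pod` iterates the pod's log stream, not in the Dataflow job itself: urllib3 raises `ProtocolError("Connection broken: IncompleteRead(...)")` when the chunked log read dies. A minimal sketch of a workaround (not part of Airflow; `stream_with_resume` and `make_stream` are hypothetical names) would be to reopen the log stream on a broken connection instead of letting the exception fail the task:

```python
# Hypothetical sketch: resume pod-log streaming after a broken chunked read
# (the urllib3 ProtocolError / IncompleteRead seen in the traceback) instead
# of failing the Airflow task while the Dataflow job is still running.
try:
    from urllib3.exceptions import ProtocolError
except ImportError:  # fallback so the sketch stays self-contained
    class ProtocolError(Exception):
        pass


def stream_with_resume(make_stream, max_retries=3):
    """Yield lines from make_stream(), reopening the stream when the
    connection breaks mid-read.

    make_stream is assumed to return a fresh iterator of log lines
    (e.g. a new read_namespaced_pod_log(..., follow=True) response).
    """
    retries = 0
    while True:
        try:
            for line in make_stream():
                yield line
            return  # stream ended cleanly
        except ProtocolError:
            retries += 1
            if retries > max_retries:
                raise  # give up after repeated broken connections
```

This only papers over the symptom; resumed streams may replay or drop lines depending on how the new stream is opened, so it is a sketch of the idea rather than a drop-in fix for pod_launcher.py.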

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID

Yup, unfortunately this is the way things are right now. To be clear, this is a GKEPodOperator issue, not a Dataflow one.

Product: Data Platform and Tools → Data Platform and Tools Graveyard
