Closed Bug 1495453 Opened 2 years ago Closed 2 years ago

Airflow task fails with no EMR runs and no emails

Categories

(Data Platform and Tools :: Datasets: General, enhancement, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: frank, Assigned: frank)

Details

(Whiteboard: [DataOps])

This has happened twice to telemetry aggregates. The operator is null, the task is set as failed, but there are no associated runs in EMR.
The only log entry for that task on that day is the following. All previous task runs seemed to be just fine. The release telemetry aggregates for that day also ran correctly.

56627 	09-22T00:00:20.878267 	telemetry_aggregates 	prerelease_telemetry_aggregate_view 	failed 	09-21T00:00:00 	frank@mozilla.com

:whd, are you aware of any other logs or info we can pull out of Airflow to investigate this? It seems like something internally went wrong.
Flags: needinfo?(whd)
Whiteboard: [DataPlatform]
(In reply to Frank Bertsch [:frank] from comment #1)

> :whd, are you aware of any other logs or info we can pull out of Airflow to
> investigate this? It seems like something internally went wrong.

Aside from perhaps the application logs (which I believe are not routed anywhere presently), I am not. If this happens again, we can try to investigate those logs.
Flags: needinfo?(whd)
Whiteboard: [DataPlatform] → [DataOps]
Hey Wesley, looks like this happened again :/ can we capture those logs?
Flags: needinfo?(whd)
Harold, will the Airflow instance on GCP be able to run this job? What's the timeline of when that might be available? I'm thinking it might be better to just make that switch and see if it still occurs.
Flags: needinfo?(hwoo)
Composer on GCP is ready, if we go with the approach of using AWS keys as environment variables. We would have to change all the DAG start dates so that there is a clean cutoff between wtmo and Composer. Composer will still try to execute the job on EMR, but failures may be easier to diagnose from the logs in Stackdriver.

If the telemetry aggregates job is failing anyway, we can give it a shot. We can sync on vidyo for details/next steps.
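
To illustrate what a clean cutoff could look like, here is a rough sketch. It assumes a daily schedule, and that the old instance's DAGs would get an `end_date` one day before the cutoff while the Composer DAGs get a `start_date` equal to the cutoff (Airflow DAGs accept both parameters). The helper names `cutover_windows` and `owner` and the example date are made up for illustration, not taken from any existing DAG.

```python
from datetime import date, timedelta


def cutover_windows(cutoff):
    """Return (old_end_date, new_start_date) for a daily DAG.

    Assumption: the old instance stops at the day before the cutoff and
    the new instance starts on the cutoff, so no execution date is
    scheduled by both.
    """
    return cutoff - timedelta(days=1), cutoff


def owner(execution_date, cutoff):
    """Say which instance ("old" or "new") schedules a given execution date."""
    old_end, _new_start = cutover_windows(cutoff)
    return "old" if execution_date <= old_end else "new"


# Hypothetical cutoff date, purely for illustration.
example_cutoff = date(2000, 1, 10)
old_end, new_start = cutover_windows(example_cutoff)
```

With these values, every execution date up to `old_end` belongs to the old instance and every date from `new_start` onward belongs to Composer, with no overlap and no gap.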
Flags: needinfo?(hwoo)
(In reply to Frank Bertsch [:frank] from comment #3)
> Hey Wesley, looks like this happened again :/ can we capture those logs?

I've copied the available worker and scheduler logs to s3://telemetry-airflow/logs/bug_1495453/, but haven't looked at them.
Flags: needinfo?(whd)
For now I've added this DB to re:dash and set up an alert there to let us know if Airflow fails again. That should keep this dataset afloat until we come up with a long-term solution.
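
A minimal sketch of the kind of query such an alert could run, assuming the failure signature from the original report (task marked failed, operator null) and the `task_instance` table of the Airflow metadata DB. The actual re:dash query is not shown in this bug, so the SQL below is a guess at its shape. The in-memory SQLite table, the function name `find_silent_failures`, and the sample rows are stand-ins so the example runs on its own.

```python
import sqlite3

# Stand-in for the Airflow metadata DB: only the columns this sketch needs.
SCHEMA = """
CREATE TABLE task_instance (
    task_id TEXT,
    dag_id TEXT,
    execution_date TEXT,
    state TEXT,
    operator TEXT
)
"""

# Assumed alert condition: failed task instances with no operator recorded.
ALERT_QUERY = """
SELECT dag_id, task_id, execution_date
FROM task_instance
WHERE state = 'failed' AND operator IS NULL
ORDER BY execution_date
"""


def find_silent_failures(conn):
    """Return (dag_id, task_id, execution_date) rows matching the alert condition."""
    return conn.execute(ALERT_QUERY).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up sample rows; the dates and operator name are placeholders.
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?, ?)",
    [
        ("prerelease_telemetry_aggregate_view", "telemetry_aggregates",
         "2000-01-01T00:00:00", "failed", None),
        ("release_telemetry_aggregate_view", "telemetry_aggregates",
         "2000-01-01T00:00:00", "success", "ExampleOperator"),
    ],
)
silent_failures = find_silent_failures(conn)
```

An alert would then fire whenever the query returns at least one row.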
Plan is the following:

1. Monitor the existing Airflow job for failures using the re:dash alert (already set up and working)
2. Move this job to Cloud Composer on GCP
3. Deprecate this Airflow instance

I'll close this as wontfix, and if we run into the same issue on Cloud Composer we'll reopen.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX