Closed Bug 1751206 Opened 3 years ago Closed 3 years ago

Airflow task prerelease_telemetry_aggregates mozaggregator2bq_extract failing on 2022-01-20

Categories

(Data Platform and Tools :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Alekhya, Assigned: linh)

Details

(Whiteboard: [airflow-triage])

The Airflow DAG prerelease_telemetry_aggregates failed on Jan 20, 2022 at the task mozaggregator2bq_extract.

Link to the error: https://workflow.telemetry.mozilla.org/log?dag_id=prerelease_telemetry_aggregates&task_id=mozaggregator2bq_extract&execution_date=2022-01-19T00%3A00%3A00%2B00%3A00

Assignee: nobody → linh

hmm, so I was trying to check if this small change would fix the problem:
https://github.com/mozilla/telemetry-airflow/pull/1461

This attempt was inspired by this post:
https://stackoverflow.com/questions/49143271/invalid-spark-url-in-local-spark-session

However, I re-ran the task before the change was available in Airflow and it still succeeded. Since the change does not appear to break anything, I'd leave it in the code. The tasks that had not executed yet because of this failure are currently catching up, with no failures so far.
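For reference, the workaround usually suggested in that Stack Overflow thread is to pin the driver's hostname so local-mode Spark does not assemble an invalid spark:// URL from the container's hostname (pod names can contain characters that are not valid in a URL host). A minimal PySpark sketch of that idea — this is my guess at the shape of the fix, not necessarily what the PR actually changed:

```python
from pyspark.sql import SparkSession

# Pin the driver's advertised host and bind address so the local-mode
# RPC endpoint gets a URL-safe hostname regardless of the pod's name.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.host", "localhost")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .getOrCreate()
)
```

The same two settings can also be passed on the command line via `--conf`, which would fit our existing spark-submit invocation without touching the job code.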

:linh would you be able to validate that the data this DAG is producing looks fine?


Brain dump:

I'm still puzzled as to why it worked this time when I re-ran it, when previously it was consistently failing. I don't see any obvious changes to the environment or configuration, and the failing task (a Spark job) appears to run locally on the container.

For completeness' sake, I compared the YAML definition of the pod that completed successfully against attempt 4 (a failure) with the same execution date and did not find any obvious differences, which confirms the environments should be identical.

The main difference I've noticed so far in the Airflow logs is that the sparkDriver service comes up on a different port every time. I wonder if that is why we see the issue intermittently. I'm not a Spark expert, though, so this is just a guess:
https://stackoverflow.com/questions/58216831/sparksubmit-can-run-locally
https://stackoverflow.com/questions/32356143/what-does-setmaster-local-mean-in-spark
https://spark.apache.org/docs/1.6.1/submitting-applications.html#master-urls

The command we use to kick off Spark locally:
spark-submit --master 'local[*]' --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=16 bin/pg_dump_to_parquet.py --input-dir data/submission_date/20220122 --output-dir data/parquet/submission_date/20220122
^^ I do not see any potential issues here either.

Example:
attempt 4: INFO Utils: Successfully started service 'sparkDriver' on port 46557
attempt 5: INFO Utils: Successfully started service 'sparkDriver' on port 44195

Shouldn't this always be using port 7077?
https://spark.apache.org/docs/latest/security.html#standalone-mode
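To answer my own question: as far as I can tell, 7077 is only the default listen port of a standalone Spark *master*, and local[*] mode never starts a master at all. The driver's RPC endpoint uses spark.driver.port, which defaults to 0, i.e. "let the OS pick a free ephemeral port" — which would explain a different port on every attempt. A quick stdlib illustration of the port-0 mechanism (helper name is mine, just for illustration):

```python
import socket

def ephemeral_port() -> int:
    """Bind to port 0 and let the kernel pick a free ephemeral port,
    the same mechanism behind Spark's default spark.driver.port=0."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

# Each call typically yields a different OS-chosen port, matching the
# varying sparkDriver ports seen across task attempts.
print(ephemeral_port(), ephemeral_port())
```

If the changing port ever turns out to matter, it could be pinned with `--conf spark.driver.port=<some fixed port>`, but in local mode it should be harmless.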

Flags: needinfo?(linh)
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(linh)
Resolution: --- → FIXED
Component: Datasets: General → General