Airflow task prerelease_telemetry_aggregates mozaggregator2bq_extract failing on 2022-01-20
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: Alekhya, Assigned: linh)
Details
(Whiteboard: [airflow-triage])
The Airflow DAG prerelease_telemetry_aggregates failed on Jan 20, 2022 at the task mozaggregator2bq_extract.
Link to the error: https://workflow.telemetry.mozilla.org/log?dag_id=prerelease_telemetry_aggregates&task_id=mozaggregator2bq_extract&execution_date=2022-01-19T00%3A00%3A00%2B00%3A00
Comment 1•3 years ago
hmm, so I was trying to check if this small change would fix the problem:
https://github.com/mozilla/telemetry-airflow/pull/1461
This attempt was inspired by this post:
https://stackoverflow.com/questions/49143271/invalid-spark-url-in-local-spark-session
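For reference, the usual workaround in that thread is to pin the driver hostname so Spark does not build its driver URL from the pod's hostname. Roughly like the sketch below, using standard Spark settings (purely an illustration, not necessarily what the PR actually changes):
export SPARK_LOCAL_HOSTNAME=localhost
# or, equivalently, pass the configs explicitly:
spark-submit --master 'local[*]' \
  --conf spark.driver.host=localhost \
  --conf spark.driver.bindAddress=127.0.0.1 \
  bin/pg_dump_to_parquet.py ...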
However, I re-ran the task before the change was available in Airflow and it succeeded anyway. Since the change does not seem to break anything, I'd leave it in the code. The tasks that had not yet executed because of this failure are currently catching up, with no failures so far.
:linh would you be able to validate that the data this DAG is producing looks fine?
Brain dump:
I'm still really puzzled as to why it worked this time when I re-ran it, when it had previously been failing consistently. I don't see any obvious changes to the environment or configuration, and the task (a Spark job) that was failing runs locally inside the container.
For completeness' sake, I compared the YAML definition of the pod that completed successfully against attempt 4 (the failure) for the same execution date and did not find any obvious differences, which suggests the environments really are the same.
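In case anyone wants to repeat that comparison, it boils down to dumping both pod specs and diffing them, roughly like this (pod names and namespace are placeholders; if the failed pod has already been cleaned up, the rendered spec in the Airflow task log is the fallback):
kubectl get pod <attempt-4-failed-pod> -n <namespace> -o yaml > attempt-4.yaml
kubectl get pod <successful-pod> -n <namespace> -o yaml > attempt-5.yaml
diff attempt-4.yaml attempt-5.yaml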
The only notable difference I've spotted so far in the Airflow logs is that the sparkDriver service comes up on a different port every time. I wonder if that's why we only see the issue sometimes? Then again, I am not a Spark expert, so this is just a wild guess:
https://stackoverflow.com/questions/58216831/sparksubmit-can-run-locally
https://stackoverflow.com/questions/32356143/what-does-setmaster-local-mean-in-spark
https://spark.apache.org/docs/1.6.1/submitting-applications.html#master-urls
The command we use to kick off Spark locally:
spark-submit --master 'local[*]' --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=16 bin/pg_dump_to_parquet.py --input-dir data/submission_date/20220122 --output-dir data/parquet/submission_date/20220122
^^ I do not see any potential issues here either.
Example:
attempt 4: INFO Utils: Successfully started service 'sparkDriver' on port 46557
attempt 5: INFO Utils: Successfully started service 'sparkDriver' on port 44195
Shouldn't this always be using port 7077?
https://spark.apache.org/docs/latest/security.html#standalone-mode
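If I'm reading the docs right, the answer is no: 7077 is the default port of the standalone cluster master, and with --master 'local[*]' there is no standalone master at all; the in-process sparkDriver RPC service binds to a random ephemeral port unless spark.driver.port is set. To rule the changing port out as a factor, one quick (hypothetical) experiment would be to pin it to an arbitrary free port:
spark-submit --master 'local[*]' \
  --conf spark.driver.port=40000 \
  --conf spark.driver.memory=8g \
  --conf spark.sql.shuffle.partitions=16 \
  bin/pg_dump_to_parquet.py --input-dir data/submission_date/20220122 --output-dir data/parquet/submission_date/20220122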