Airflow dag bhr_collection failed for exec_date 2024-04-17
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: benwu, Assigned: benwu)
Details
(Whiteboard: [airflow-triage])
Attachments
(2 files)
Description
The Airflow DAG bhr_collection is failing for exec_date 2024-04-17 due to an out-of-memory error. The log says 10.6 GB of 10 GB physical memory was used, but the nodes in the cluster should have 26 GB usable, so it might be fixable with a Spark config change.
This follows this change: https://github.com/mozilla/python_mozetl/pull/400. From what I can tell, the first run with "thread_filter": "Gecko" is succeeding but the Gecko_Child run is failing.
Log extract:
24/04/18 07:17:14 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container killed by YARN for exceeding memory limits. 10.6 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
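For context, the 10 GB limit lines up with the current settings: 9309m of executor heap plus the default YARN overhead of max(384 MB, 10% of executor memory) comes to roughly 10240 MB. A minimal sketch of the kind of property change the warning points at, assuming made-up values (this is not the DAG's actual config, and how the properties reach the Dataproc job is glossed over here):
    # Sketch only: the property name comes from the YARN warning above; the
    # overhead value and the way these reach the job are assumptions.
    spark_properties = {
        "spark.executor.memory": "9309m",               # current executor heap
        # Explicitly raise the off-heap allowance so the PySpark worker processes
        # get more headroom inside the container instead of the 10% default:
        "spark.yarn.executor.memoryOverhead": "2048m",
    }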
Updated•1 year ago
Comment 1•1 year ago
Comment 2•1 year ago
I increased the memory allocation in the PR above from spark.driver.memory=7680m and spark.executor.memory=9309m to spark.driver.memory=12g and spark.executor.memory=15g, but it's still failing with:
Container killed by YARN for exceeding memory limits. 17.1 GB of 17 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
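For what it's worth, assuming the default 10% overhead is still in effect, 15g of heap plus about 1.5g of overhead comes to roughly 16.5 GB, which YARN appears to round up to the 17 GB limit in this log. So raising spark.executor.memory alone scales the container ceiling proportionally without giving the off-heap PySpark worker processes any more relative headroom.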
I'm not sure what the exact issue is, but we can try adding a thread_filter argument to the script and splitting the job into two runs, one for each thread_filter. Then we can see whether the OOM is caused by memory not being freed between the two runs, or whether the Gecko_Child run is just more intensive.
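Something along these lines, as a rough sketch only (the argument name, dataframe, and column are placeholders, not the actual bhr_collection.py code):
    import argparse

    def parse_args():
        # Hypothetical CLI argument; the real name and plumbing are up to the PR.
        parser = argparse.ArgumentParser()
        parser.add_argument(
            "--thread-filter",
            default="Gecko",
            help="Only process hangs for this thread, e.g. Gecko or Gecko_Child",
        )
        return parser.parse_args()

    def filter_hangs(hangs_df, thread_filter):
        # Filter as early as possible so each run only holds one thread's data
        # in memory; 'thread_name' is a placeholder column name.
        return hangs_df.where(hangs_df.thread_name == thread_filter)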
:alexical, are you able to make the changes to bhr_collection.py? Then I can review them and create a new Airflow task.
Comment 3•1 year ago
Comment 4•1 year ago
Splitting it into two jobs and reducing the sample size for the child process job to 0.1 seems to have worked. PRs: https://github.com/mozilla/python_mozetl/pull/401 and https://github.com/mozilla/telemetry-airflow/pull/1975
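For anyone skimming, the shape of the split is roughly this (an illustration only; the helper and parameter names are made up, and only the Gecko_Child sample size of 0.1 comes from the PRs):
    def submit_bhr_collection_job(thread_filter, sample_size=None):
        # Placeholder for however the DAG actually submits each Dataproc job.
        print(f"bhr_collection run: thread_filter={thread_filter}, sample_size={sample_size}")

    # One job per thread_filter, with a reduced sample for the heavier child-process run.
    submit_bhr_collection_job(thread_filter="Gecko")                        # default sample size
    submit_bhr_collection_job(thread_filter="Gecko_Child", sample_size=0.1)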
It's now succeeding, but the first run I tried with Gecko_Child failed with what might have been an OOM error, though a different one from before:
Driver received SIGTERM/SIGKILL signal and exited with 137 code, which potentially signifies a memory pressure.
This seems to be the driver rather than a worker. Surprisingly, it succeeded on the automatic retry. There may be some config change that can address this, but it's low priority enough that I don't think I'll be able to look into it. I'll check on this next week, but as long as it's not consistently failing, I'll consider this resolved.
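If the driver-side kill does recur, the driver counterparts of the earlier settings would be the first thing to try; a sketch, assuming the job runs in YARN cluster mode (which I haven't confirmed):
    # Driver-side counterparts to the executor settings above; values are guesses.
    driver_properties = {
        "spark.driver.memory": "12g",            # current setting from the PR above
        "spark.driver.memoryOverhead": "2048m",  # extra off-heap headroom for the driver container
    }
In client mode the driver isn't in a YARN container, so a larger Dataproc master machine type would be the equivalent lever.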