Airflow dag bhr_collection failed for exec_date 2024-04-17
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: benwu, Assigned: benwu)
Details
(Whiteboard: [airflow-triage])
Attachments
(2 files)
Description
The Airflow DAG bhr_collection is failing for exec_date 2024-04-17 due to an out-of-memory error. The log says 10.6 GB of 10 GB physical memory was used, but the nodes in the cluster should have 26 GB usable, so it might be fixable with a Spark config change.
This follows this change: https://github.com/mozilla/python_mozetl/pull/400. From what I can tell, the first run with "thread_filter": "Gecko" is succeeding but the Gecko_Child run is failing.
Log extract:
24/04/18 07:17:14 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container killed by YARN for exceeding memory limits. 10.6 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
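For context, the 10 GB limit lines up with the current settings: 9309m of executor heap plus the default YARN overhead of max(384 MB, 10% of executor memory) comes to roughly 10240 MB. A minimal sketch of the kind of property change the warning points at, assuming made-up values (this is not the DAG's actual config, and how the properties reach the Dataproc job is glossed over here):
    # Sketch only: the property name comes from the YARN warning above; the
    # overhead value and the way these reach the job are assumptions.
    spark_properties = {
        "spark.executor.memory": "9309m",               # current executor heap
        # Explicitly raise the off-heap allowance so the PySpark worker processes
        # get more headroom inside the container instead of the 10% default:
        "spark.yarn.executor.memoryOverhead": "2048m",
    }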
Updated•1 year ago
Comment 1•1 year ago
Comment 2•1 year ago
I increased the memory allocation in the PR above from spark.driver.memory=7680m and spark.executor.memory=9309m to spark.driver.memory=12g and spark.executor.memory=15g, but it's still failing with:
Container killed by YARN for exceeding memory limits. 17.1 GB of 17 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
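For what it's worth, assuming the default 10% overhead is still in effect, 15g of heap plus about 1.5g of overhead comes to roughly 16.5 GB, which YARN appears to round up to the 17 GB limit in this log. So raising spark.executor.memory alone scales the container ceiling proportionally without giving the off-heap PySpark worker processes any more relative headroom.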
I'm not sure what the exact issue is, but we can try adding a thread_filter argument to the script and splitting the job into two runs, one for each thread_filter. Then we can see whether the OOM is caused by memory not being freed between the two runs, or whether the Gecko_Child run is just more intensive.
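Something along these lines, as a rough sketch only (the argument name, dataframe, and column are placeholders, not the actual bhr_collection.py code):
    import argparse

    def parse_args():
        # Hypothetical CLI argument; the real name and plumbing are up to the PR.
        parser = argparse.ArgumentParser()
        parser.add_argument(
            "--thread-filter",
            default="Gecko",
            help="Only process hangs for this thread, e.g. Gecko or Gecko_Child",
        )
        return parser.parse_args()

    def filter_hangs(hangs_df, thread_filter):
        # Filter as early as possible so each run only holds one thread's data
        # in memory; 'thread_name' is a placeholder column name.
        return hangs_df.where(hangs_df.thread_name == thread_filter)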
:alexical, are you able to make the changes to bhr_collection.py? Then I can review them and create a new Airflow task.
Comment 3•1 year ago
Comment 4•1 year ago
Splitting it into two jobs and reducing the sample size for the child process job to 0.1 seems to have worked. PRs: https://github.com/mozilla/python_mozetl/pull/401 and https://github.com/mozilla/telemetry-airflow/pull/1975
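For anyone skimming, the shape of the split is roughly this (an illustration only; the helper and parameter names are made up, and only the Gecko_Child sample size of 0.1 comes from the PRs):
    def submit_bhr_collection_job(thread_filter, sample_size=None):
        # Placeholder for however the DAG actually submits each Dataproc job.
        print(f"bhr_collection run: thread_filter={thread_filter}, sample_size={sample_size}")

    # One job per thread_filter, with a reduced sample for the heavier child-process run.
    submit_bhr_collection_job(thread_filter="Gecko")                        # default sample size
    submit_bhr_collection_job(thread_filter="Gecko_Child", sample_size=0.1)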
It's now succeeding, but the first run I tried with Gecko_Child failed with what might have been an OOM error, though a different one from before:
Driver received SIGTERM/SIGKILL signal and exited with 137 code, which potentially signifies a memory pressure.
This seems to be the driver rather than a worker. Surprisingly, it succeeded on the automatic retry. There may be some config change that can address this, but it's low priority enough that I don't think I'll be able to look into it. I'll check on this next week, but as long as it's not consistently failing, I'll consider this resolved.
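If the driver-side kill does recur, the driver counterparts of the earlier settings would be the first thing to try; a sketch, assuming the job runs in YARN cluster mode (which I haven't confirmed):
    # Driver-side counterparts to the executor settings above; values are guesses.
    driver_properties = {
        "spark.driver.memory": "12g",            # current setting from the PR above
        "spark.driver.memoryOverhead": "2048m",  # extra off-heap headroom for the driver container
    }
In client mode the driver isn't in a YARN container, so a larger Dataproc master machine type would be the equivalent lever.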