Closed Bug 1182499 Opened 9 years ago Closed 9 years ago

Reduce Spark jobs memory pressure

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: rvitillo)

Details

(Whiteboard: spark [unifiedTelemetry])

PySpark jobs can run out of memory when the JVM and the Python worker processes don't play nicely together.
Priority: -- → P1
Whiteboard: spark [unifiedTelemetry]
As the JVM doesn't release unused memory back to the OS even after a GC run, I had to reduce the maximum heap size to avoid starving the rest of the system.
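
For reference, this is roughly the kind of executor sizing involved; the values and the SparkConf-based setup below are illustrative, not the exact settings deployed on the cluster. The idea is to cap the JVM heap via spark.executor.memory so each node keeps headroom for the Python workers and the OS:

from pyspark import SparkConf, SparkContext

# Illustrative values only: shrink the JVM heap per executor so the Python
# workers and the rest of the system keep enough free memory on each node.
conf = (SparkConf()
        .set("spark.executor.memory", "6g")                 # max JVM heap per executor (reduced)
        .set("spark.yarn.executor.memoryOverhead", "2048")  # off-heap headroom per executor, in MB
        .set("spark.python.worker.memory", "1g"))           # per Python worker before spilling to disk
sc = SparkContext(conf=conf)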

I have also added some configuration parameters to YARN so that it doesn't kill an application that consumes more virtual or physical memory than it's supposed to. Since we run a single application on the YARN cluster, this is safe to do.
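
The parameters referred to are most likely the NodeManager memory enforcement checks. A sketch of what the overrides could look like, expressed as an EMR-style yarn-site classification (the launch-time structure is hypothetical; only the two yarn.nodemanager.* property names are standard YARN settings):

# Hypothetical: how the overrides could be passed at cluster launch time.
yarn_site_overrides = {
    "Classification": "yarn-site",
    "Properties": {
        # Don't kill containers that exceed their virtual memory allocation.
        "yarn.nodemanager.vmem-check-enabled": "false",
        # Don't kill containers that exceed their physical memory allocation.
        "yarn.nodemanager.pmem-check-enabled": "false",
    },
}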

Furthermore, thanks to the reduced memory pressure, I was able to increase the chunk size for partial reads from S3 to 100 MB, which considerably speeds up the initial phase of fetching and parsing submissions.
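
For illustration, a minimal sketch of what 100 MB ranged reads from S3 look like with boto3; the bucket/key handling and the read loop are placeholders, not the pipeline's actual reader:

import boto3

CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB per ranged GET

def read_in_chunks(bucket, key):
    # Yield the object's bytes in CHUNK_SIZE-sized ranged GETs, so a large
    # file can be fetched and parsed incrementally instead of all at once.
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    start = 0
    while start < size:
        end = min(start + CHUNK_SIZE, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range="bytes={}-{}".format(start, end))
        yield resp["Body"].read()
        start = end + 1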

Finally, I tested the new settings with the v4 aggregation job. Previously we couldn't run the aggregator over a month of data without hitting OOM errors; now it runs flawlessly.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard