Experiments summary exploder job is failing

Status: RESOLVED FIXED
Product: Data Platform and Tools
Component: Datasets: Experiments
Reporter: sunahsuh
Assignee: Unassigned
Opened: 10 months ago
Last modified: 10 months ago

Description (Reporter, 10 months ago)
Sigh. experiments_main_summary is *also* failing intermittently.

The relevant part of the stack trace:
java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@6b032d27 : No space left on device

This is apparently a disk space issue, which suggests Spark is filling up the 10 GB EBS dir that we use on our cluster machines with tmp output.
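
For context, the "spill" in that stack trace goes to Spark's scratch directories, so where those point determines which volume fills up. A minimal sketch, assuming a Spark 2.x Scala job (the app name and paths here are illustrative, not the job's actual config):

    import org.apache.spark.sql.SparkSession

    // Hypothetical setup: point Spark's scratch space (sort spills, shuffle
    // files, other tmp output) at the large instance-store volumes instead of
    // a small EBS root. spark.local.dir takes a comma-separated list of dirs
    // and must be set before the SparkContext starts.
    val spark = SparkSession.builder()
      .appName("experiments_main_summary")
      .config("spark.local.dir", "/mnt,/mnt1")
      .getOrCreate()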
Comment 1 (Reporter, 10 months ago)
Checked the Spark configs we're using: spark.local.dir currently points at /mnt and /mnt1, which have 145 GB and 153 GB free, respectively, on a new cluster I just spun up.
Comment 2 (Reporter, 10 months ago)
Actually, since we run our jobs via YARN, it's the YARN configs that take precedence, but those still point to the same place.
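
A quick way to double-check which directories actually apply, sketched as a spark-shell session on the cluster (the config keys are standard Spark/YARN; the rest is illustrative):

    // Under YARN, executor scratch space comes from the NodeManager's
    // yarn.nodemanager.local-dirs setting, which overrides spark.local.dir.
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    println(spark.conf.getOption("spark.local.dir"))        // what Spark would use outside YARN
    println(hadoopConf.get("yarn.nodemanager.local-dirs"))  // what actually wins under YARN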

More notes:
Looking closer at the 3 failures for run date 20170731, there are in fact two different exceptions -- attempts 2 and 3 failed with the message above, while the first attempt failed with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 554 in stage 4.0 failed 1 times, most recent failure: Lost task 554.0 in stage 4.0 (TID 892, localhost): java.io.IOException: No space left on device

I re-ran the job on an ad hoc cluster and the master node went down to 203 MB of free memory in the instance logs, so yeah, probably a memory issue and not disk. The other nodes have ~24 GB of memory free throughout the job run, so this doesn't seem to be something we can fix with a cluster size increase.
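
If it's the master (driver) node that's running out of memory, the usual knob is the driver heap rather than cluster size. A hedged sketch: spark.driver.memory has to be set at submit time (via --driver-memory or spark-defaults.conf), since the driver JVM is already running by the time job code executes, so from inside the job you can only inspect it:

    // Sketch: report what the driver heap is actually set to. Raising it
    // means changing the submit-time config, not the application code.
    println(spark.conf.get("spark.driver.memory", "not set (defaults to 1g)"))
    println(s"driver max heap: ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")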
Comment 3 (Reporter, 10 months ago)
Fixed with https://github.com/mozilla/telemetry-batch-view/pull/274
Status: NEW → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → FIXED