Sigh. experiments_main_summary is *also* failing intermittently. The relevant part of the stack trace:

java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@6b032d27 : No space left on device

This is apparently a disk space issue, which suggests Spark is filling up the 10GB EBS dir that we use on our cluster machines with tmp output.
Checked the Spark configs we're using: spark.local.dir currently points at /mnt and /mnt1, which have 145 GB and 153 GB free respectively on a new cluster I just spun up.
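For reference, a quick sketch of how I'd verify this from inside a job (the session setup and default fallback path are assumptions, not our actual config):

```python
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# What Spark thinks its scratch space is. Note: under YARN this
# setting is ignored in favor of YARN's own local dirs.
local_dirs = spark.sparkContext.getConf().get("spark.local.dir", "/tmp")
for d in local_dirs.split(","):
    total, used, free = shutil.disk_usage(d)
    print(f"{d}: {free / 2**30:.0f} GiB free of {total / 2**30:.0f} GiB")
```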
Actually, since we run our jobs via YARN, it's the YARN configs that take precedence, but those still point to the same place.

More notes: looking closer at the 3 failures for run date 20170731, there are in fact two different exceptions. Attempts 2 and 3 failed with the message above, while the first attempt failed with:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 554 in stage 4.0 failed 1 times, most recent failure: Lost task 554.0 in stage 4.0 (TID 892, localhost): java.io.IOException: No space left on device

I re-ran the job on an adhoc cluster and the instance logs show the master node dropping to 203 MB of free memory, so yeah, this is probably a memory issue on the master rather than disk. The other nodes hold steady at ~24 GB of free memory throughout the job run, so this doesn't seem to be something we can fix with a cluster size increase.
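To fully rule out the disk theory, here's a hedged sketch for checking which scratch dirs the executors actually get under YARN. YARN exposes its container local dirs (yarn.nodemanager.local-dirs) to each container through the LOCAL_DIRS environment variable, which Spark uses instead of spark.local.dir; the partition counts below are arbitrary:

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-dirs-check").getOrCreate()
sc = spark.sparkContext

# Collect the distinct LOCAL_DIRS values seen across executor containers.
dirs = (
    sc.parallelize(range(100), 20)
      .map(lambda _: os.environ.get("LOCAL_DIRS", "<unset>"))
      .distinct()
      .collect()
)
for d in dirs:
    print(d)
```

If these all come back as /mnt and /mnt1, the executors have plenty of spill space and the memory pressure on the master is the more likely culprit.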
Status: NEW → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → FIXED