Closed Bug 1458256 Opened 7 years ago Closed 5 years ago

Investigate pioneer d2p dataset size oddities

Categories

(Data Platform and Tools :: General, enhancement, P3)

Points:
1

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bugzilla, Unassigned)

Details

After running into constant OOM issues in the online news log entry deduplication job, I broke the job into phases: the first phase explodes the logs and writes them out to HDFS, instead of trying to do everything at once. This dramatically reduced the size of the data in memory. Previously, ~60GB persisted after exploding in memory; reading the data back after writing it to HDFS, on the other hand, dropped the in-memory size to ~15GB. This is a very odd result that's inconsistent with my mental model of how Spark caches datasets, and the implications seem important enough to investigate further. My initial thought is that this might be related to snappy compression applied when writing the intermediate to disk.
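As a toy illustration of the compression hypothesis (not the actual job, and using zlib rather than snappy): exploded log data is highly repetitive, so a compressed on-disk intermediate can be far smaller than the raw serialized rows. The field names and sizes below are hypothetical.

```python
import json
import zlib

# Hypothetical stand-in for an "exploded" log dataset: one base entry
# fanned out into many near-identical rows, as an explode step would produce.
base = {"url": "https://example.com/article", "source": "newswire",
        "timestamp": "2018-05-01T00:00:00Z"}
exploded = [dict(base, visit_id=i) for i in range(10_000)]

raw = json.dumps(exploded).encode("utf-8")
compressed = zlib.compress(raw)  # snappy behaves similarly in spirit

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.1f}x")
```

This only explains the on-disk footprint, though; the reported ~15GB figure is the in-memory size after reading the data back, so columnar re-encoding on the Parquet round trip may also be a factor.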
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Datasets: General → General