Closed
Bug 1458256
Opened 7 years ago
Closed 5 years ago
Investigate pioneer d2p dataset size oddities
Categories
(Data Platform and Tools :: General, enhancement, P3)
Data Platform and Tools
General
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: bugzilla, Unassigned)
Details
After running into constant OOM issues in the online news log entry deduplication job, I broke the job into phases: the first phase explodes the logs and writes them to HDFS, instead of trying to do everything in one pass. This dramatically reduced the size of that data in memory: previously, the exploded dataset persisted in memory at ~60 GB, whereas reading the same data back after writing it to HDFS dropped its in-memory size to ~15 GB.
This result seems very odd and inconsistent with my mental model of how Spark caches datasets, and the implications seem important enough to investigate further. My initial suspicion is that snappy compression, applied when writing that intermediate dataset to disk, is involved.
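A minimal stand-alone sketch of the compression hypothesis (plain Python, using the stdlib's zlib as a stand-in for snappy, with made-up field names and row counts): exploded log entries are highly redundant, so their compressed on-disk form can be an order of magnitude smaller than the deserialized rows held in memory.

```python
import pickle
import zlib

# Simulate "exploded" log entries: one row per (client, url) pair, with
# heavy redundancy across rows -- typical shape for an exploded dataset.
# All names and counts here are illustrative, not taken from the real job.
rows = [
    {"client_id": f"client-{i % 100:03d}",
     "branch": "online-news-study",
     "url": f"https://example.com/article/{i % 500}"}
    for i in range(50_000)
]

serialized = pickle.dumps(rows)            # rough proxy for rows held in memory
compressed = zlib.compress(serialized, 6)  # proxy for compressed data written to disk

ratio = len(serialized) / len(compressed)
print(f"serialized: {len(serialized):,} bytes")
print(f"compressed: {len(compressed):,} bytes")
print(f"ratio: {ratio:.1f}x")

# Redundant exploded data compresses dramatically, which is consistent
# with a large gap between in-memory and round-tripped sizes.
assert ratio > 4
```

This doesn't prove the Spark-specific behavior (cached size after a disk round trip also depends on Spark's internal cached representation), but it shows the kind of size gap that compression alone can account for on this sort of data.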
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Updated 3 years ago
Component: Datasets: General → General