Telemetry aggregation works like this:
1a. Map each ping to a set of dimensions and a set of metrics
2a. Start a dictionary that maps dimensions + metric -> aggregates
3a. Keep adding new aggregates from pings to this dictionary

At some point, the job fails with `SystemError: error return without exception set`. The full stack trace is linked below. The error is coming from pickle: Spark takes the interim dictionary from (2a.), pickles it, and then transmits it over the wire to the other executors. This happens on every shuffle.

Per the pickle bug report linked below, this error indicates one of two things:
1b. There is an element in the dictionary that exceeds the 32-bit limit
2b. The dictionary itself exceeds 2 GB

To resolve this bug, we first need to figure out which of these is causing it. To do that we can run on a later Python 2.7 release (2.7.12, presumably), which seems to give a more informative error.

There are some alternatives:
1c. This seems to have been fixed to allow for 64-bit in Python 3.2 (per the same pickle bug report). If we update this job to Python 3, that may fix it.
2c. If the error is from (2b.), then instead of creating a single dictionary with all aggregates, we can change the job to keep the aggregates partitioned as an RDD and run the database loading directly from each partition.
3c. If the error is from (1b.), we should really be doing some sanity checks. E.g. for release, if only one client reported a certain key for a probe, we probably shouldn't aggregate it.

Generally I'm leaning towards 1c., as it would be good code hygiene to move this job to Python 3 anyway.

Full stacktrace: https://pastebin.com/T3APmkRh
Bug report for pickle error: https://bugs.python.org/issue11564
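
One rough way to tell (1b.) from (2b.) without reproducing the whole job might be to measure pickled sizes on a sample of the interim dictionary. The following is only a sketch under assumptions: `pickled_size_report`, `LIMIT`, and the toy `sample` data are hypothetical names for illustration and are not part of the aggregator code, and the total is an approximation obtained by summing per-entry sizes so that the full dictionary never has to be pickled in one go.

```python
import pickle

# Largest length a signed 32-bit size field can hold (about 2 GB).
LIMIT = 2**31 - 1


def pickled_size_report(aggregates, protocol=2, top=3):
    """Return (approx_total_bytes, largest_entries) for a dict of aggregates.

    Each (key, value) pair is pickled on its own, so a single oversized
    element shows up at the head of largest_entries, while a large
    approx_total_bytes with small entries points at the dictionary as a whole.
    """
    sizes = [
        (key, len(pickle.dumps((key, value), protocol)))
        for key, value in aggregates.items()
    ]
    # Largest entries first.
    sizes.sort(key=lambda item: item[1], reverse=True)
    # Approximation: per-entry pickles repeat a little framing overhead.
    approx_total = sum(size for _, size in sizes)
    return approx_total, sizes[:top]


# Toy stand-in for the (dimensions, metric) -> aggregates dictionary.
sample = {
    (("channel", "release"), "GC_MS"): [0] * 10,
    (("channel", "nightly"), "GC_MS"): [0] * 1000,
}

approx_total, largest = pickled_size_report(sample)
print("approx total pickled bytes:", approx_total)
print("over 32-bit limit:", approx_total > LIMIT)
for key, size in largest:
    print(key, size)
```

If one entry dominates the report, that suggests (1b.) and the sanity checks in 3c.; if the entries are all small but the total is near `LIMIT`, that suggests (2b.) and the partitioned loading in 2c.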
Assignee: nobody → fbertsch
Points: --- → 3
Priority: -- → P1