Open Bug 1517018 Opened 2 years ago Updated 5 months ago

Release Telemetry aggregates failing due to too large of an object to serialize

Categories

(Data Platform and Tools :: Datasets: Telemetry Aggregates, enhancement, P3)

Points:
3

Tracking

(Not tracked)

People

(Reporter: frank, Assigned: frank)

References

Details

The telemetry aggregates job works like this (sketched in code after the list):

1a. Map each ping to a set of dimensions and a set of metrics
2a. Start a dictionary that maps dimensions + metric -> aggregates
3a. Keep adding new aggregates from pings to this dictionary
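In PySpark terms, the shape is roughly the sketch below. This is illustrative only, not the real job: extract(), combine(), and pings_rdd are hypothetical stand-ins.

    def seq_op(acc, ping):
        # 1a./3a.: fold each ping's metrics into the running dictionary;
        # dimensions is assumed to be a tuple here
        for dimensions, metric, value in extract(ping):
            key = dimensions + (metric,)
            acc[key] = combine(acc.get(key), value)
        return acc

    def comb_op(left, right):
        # merge partial dictionaries built on different executors
        for key, value in right.items():
            left[key] = combine(left.get(key), value)
        return left

    # 2a.: a single dictionary mapping dimensions + metric -> aggregates
    aggregates = pings_rdd.aggregate({}, seq_op, comb_op)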

At some point it fails with `SystemError: error return without exception set`; the full stack trace is available at [0]. The error comes from pickle.

Spark takes that interim dictionary from (2a.), pickles it, and then transmits it over the wire to the other executors. This happens on every shuffle. This error indicates one of two things (per [1]):

1b. A single element in the dictionary exceeds the 32-bit size limit
2b. The dictionary itself exceeds 2 GB when pickled
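To make those size limits concrete, here is a tiny standalone reproduction. It is illustrative only, not taken from the job, and needs several GiB of RAM to actually run:

    import pickle

    big = b"x" * (2**32 + 1)  # a single element just past the 4 GiB boundary
    try:
        # older PySpark versions typically shuffle with pickle protocol 2
        pickle.dumps(big, protocol=2)
    except Exception as exc:
        # Python 3 reports an explicit error (an OverflowError) here; on
        # Python 2, cPickle is where the opaque SystemError above comes from
        print(type(exc).__name__, exc)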

To resolve this bug, we first need to figure out which piece is causing it. To do that we could run the job on a newer Python version, which seems to produce a better, more informative error. There are some alternatives:

1c. This seems to have been fixed to allow for 64-bit in Python 3.2. If we update this job for Python 3 that may fix it (per [1]).
2c. If the error is from (2b.), then instead of building a single dictionary with all aggregates, we can keep the aggregates partitioned as an RDD and run the database loading directly from each partition (sketched after this list).
3c. If the error is from (1b.), we should really be doing some sanity checks. E.g. for release, if only one client reported a certain key for a probe, we probably shouldn't aggregate it.
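For (2c.), the load step would look roughly like the sketch below. Again illustrative only: to_keyed_aggregates(), combine(), get_db_connection(), and upsert_aggregate() are hypothetical helpers, not the real loader.

    def load_partition(rows):
        # one connection per partition, opened on the executor that owns the data
        conn = get_db_connection()
        try:
            for (dimensions, metric), aggregate in rows:
                upsert_aggregate(conn, dimensions, metric, aggregate)
        finally:
            conn.close()

    aggregates_rdd = (pings_rdd
                      .flatMap(to_keyed_aggregates)   # ping -> ((dimensions, metric), aggregate) pairs
                      .reduceByKey(combine))

    # no giant driver-side dictionary: each partition is written out directly
    aggregates_rdd.foreachPartition(load_partition)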

Generally I'm leaning towards 1c. as it would just be good code hygiene to move this to Python 3.

[0] Full stacktrace: https://pastebin.com/T3APmkRh 
[1] Bug report for pickle error: https://bugs.python.org/issue11564
Assignee: nobody → fbertsch
Points: --- → 3
Priority: -- → P1

Hi Frank, has there been any progress here? It says P1 and assigned to you but there's been no update for 7 months.

Flags: needinfo?(fbertsch)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #1)

Hi Frank, has there been any progress here? It says P1 and assigned to you but there's been no update for 7 months.

Hi Gian-Carlo, thanks for the ping. There was work on this earlier in the year, but no fix was in sight; I should have downgraded it to P3 at that point. If this is blocking you, we need to:

  • Re-prioritize this work for this Q
  • Come up with an alternative to help you in the meantime

Let me know.

Flags: needinfo?(fbertsch) → needinfo?(gpascutto)
Priority: P1 → P3

I stumbled upon this bug because I was looking at the current state of histograms like this, also on release: https://bugzilla.mozilla.org/show_bug.cgi?id=1542162#c6

I assume there's a way to get the data out of https://sql.telemetry.mozilla.org/ but the histograms are obviously faster and easier.

Flags: needinfo?(gpascutto)

Okay, let me know if there's anything I can do to help with that ask.

See Also: → 1572115
See Also: → 1572112

I'm trying to track down a bug and was surprised to find that there's no telemetry for release builds. Do you know when this will be fixed? In the meantime, is there a way I can get this data?

Flags: needinfo?(fbertsch)

Hey there:

  • Release will be available on MDV2. We were hoping it might "just work" on GCP, but that doesn't seem to have been the case.
  • In the meantime, you can access data on STMO. Ping me if you need specific help formulating a query.
Flags: needinfo?(fbertsch)