Bug 1913980 (Open): opened 6 months ago, updated 6 months ago

Some Glean crash pings from firefox-desktop dropped at ingestion due to too large size

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

People

(Reporter: akomar, Unassigned)

References

Details

(Whiteboard: [dataquality])

Around July 21st we started getting ~150 ping size validation failures per day for firefox-desktop.crash: https://sql.telemetry.mozilla.org/queries/101823/source#250861

Our ping size limit is set to 8MB: https://github.com/mozilla/gcp-ingestion/blob/35b03c2bce9701f8aaa1a61e574a30ac84278b5b/ingestion-beam/src/main/java/com/mozilla/telemetry/Decoder.java#L108-L109

We will probably need to pull some of these pings out of the error tables for investigation; in the meantime, /cc :chutten for visibility.

Adding in :afranchuk and :gsvelto.

ni?afranchuk -- Isn't this unusually large? Will you need a sample of those Very Big Pings to run these down?

Flags: needinfo?(afranchuk)

We sometimes have stack overflows, infinite stack recursion, etc. Those can generate very large stacks; we truncate them on Socorro, but maybe we're not doing that in crash pings? We could add a cut-off point, for example keep the first 50 frames and drop everything else. That's assuming I'm right about the extra size, of course.
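
For illustration only (this isn't the actual crash reporter code), a minimal Python sketch of that kind of cut-off; the frame structure and the 50-frame limit are just assumptions taken from this discussion:

# Hypothetical sketch: cap the number of frames kept per thread before
# serializing a crash ping. Field names here are made up, not the real schema.
MAX_FRAMES = 50

def truncate_frames(frames):
    """Keep the first MAX_FRAMES frames and record how many were dropped."""
    if len(frames) <= MAX_FRAMES:
        return frames
    kept = list(frames[:MAX_FRAMES])
    # Note the truncation so analysis knows the stack was cut off.
    kept.append({"truncated": len(frames) - MAX_FRAMES})
    return kept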

An interesting piece of context: Glean caps pings at 1MB (compressed) because that's what we were told was the biggest we should ever send to ingestion. (Incidentally, we instrument how often that cap is hit, which we really ought to monitor -- bug 1830754.) So either these pings are coming from non-SDK sources, or they are compressing at better than 8:1.

Also, I was today years old when I learned there was an 8MB uncompressed limit on ingestion. Maybe we should forbid ping uploads in the SDK if they exceed that limit under the same reasoning that we forbid > 1MB compressed?
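
Purely as a sketch of the idea (this is not the actual Glean SDK API; the two limits are just the numbers mentioned in this bug):

import gzip

MAX_COMPRESSED = 1 * 1024 * 1024    # 1MB compressed cap the SDK already enforces
MAX_UNCOMPRESSED = 8 * 1024 * 1024  # 8MB uncompressed ingestion limit

def ping_within_limits(ping_body: bytes) -> bool:
    """Reject pings that ingestion would drop anyway."""
    if len(ping_body) > MAX_UNCOMPRESSED:
        return False
    return len(gzip.compress(ping_body)) <= MAX_COMPRESSED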

As far as I'm aware we didn't enforce any limits on the legacy telemetry pings, but we definitely should have some now. I think the first 50 frames is probably enough. The size is more than likely due to excessive stack frames, but we should try to confirm that as well.

Flags: needinfo?(afranchuk)

(In reply to Chris H-C :chutten|PTO (back August 26) from comment #3)

> so either these are coming from non-SDK sources, or they are compressing at better than 8:1.

If we're generating full stack traces of stack overflows then they'll easily compress 40:1. It's a bunch of identical stack frames repeated over and over and over.
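
That's easy to convince yourself of with a toy example (the frame string below is made up; the point is only the repetition):

import gzip

# Fake "stack overflow" trace: one frame repeated tens of thousands of times.
frame = b"0x00007ffee4b2c000  mozilla::SomeRecursiveFunction()  xul.dll\n"
trace = frame * 100_000

compressed = gzip.compress(trace)
print(len(trace), len(compressed), round(len(trace) / len(compressed)))
# Repetitive input like this compresses far better than 40:1.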

I filed https://mozilla-hub.atlassian.net/browse/DSRE-1737 to get a sample of these pings.

We have a sample of errored pings in moz-fx-data-shared-prod:analysis.dsre_1737. We're interested in the payload field, which is compressed, so we need to access it with udf_js.gunzip(payload).

To make this easier I ran this query:

SELECT
  * EXCEPT (payload),
  udf_js.gunzip(payload) AS payload
FROM
  `moz-fx-data-shared-prod.analysis.dsre_1737`
LIMIT
  10

and exported 10 rows to https://drive.google.com/file/d/1rRbxZwx5hCYNQI98lcdqAApV4HNqYnp3/view?usp=sharing.

tail -n1 bq-results-20240826-154916-1724687416857.json | jq '.payload' does indeed look like a stack trace.

ni?afranchuk - FYI my comment above; it looks like limiting the number of frames would be a fix here.

Flags: needinfo?(afranchuk)

No, I don't think it's a native stack trace; it's the async_shutdown_timeout annotation, which contains a JSON structure with mixed SQL queries and JS stack traces. This one is ridiculously large, so maybe it gets out of hand sometimes. There are a huge number of references to PlacesUtils.sys.mjs, History.sys.mjs, and Sqlite.sys.mjs, so I'd start looking there.
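
One way to check that against the exported sample, assuming each line of the export is a JSON row whose payload column holds the decompressed ping JSON (that's what the query above produces; the exact nesting of the annotations is an assumption):

import json

# Rank top-level payload fields by serialized size for each sampled ping,
# to confirm async_shutdown_timeout is what blows up the size.
with open("bq-results-20240826-154916-1724687416857.json") as f:
    for line in f:
        row = json.loads(line)
        payload = json.loads(row["payload"])
        sizes = {key: len(json.dumps(value)) for key, value in payload.items()}
        print(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:5])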

Agree with gsvelto: the async_shutdown_timeout is massive!

Flags: needinfo?(afranchuk)