Bug 1913980 (Open): opened 6 months ago, updated 6 months ago

Some Glean crash pings from firefox-desktop dropped at ingestion due to too large size

Categories

(Data Platform and Tools :: General, defect)

Tracking

(Not tracked)

People

(Reporter: akomar, Unassigned)

References

Details

(Whiteboard: [dataquality])

Around July 21st we started getting ~150 ping size validation failures per day for firefox-desktop.crash: https://sql.telemetry.mozilla.org/queries/101823/source#250861

Our ping size limit is set to 8MB: https://github.com/mozilla/gcp-ingestion/blob/35b03c2bce9701f8aaa1a61e574a30ac84278b5b/ingestion-beam/src/main/java/com/mozilla/telemetry/Decoder.java#L108-L109

We will probably need to pull some of these pings out of the error tables for investigation; in the meantime, /cc :chutten for visibility.

Adding in :afranchuk and :gsvelto.

ni?afranchuk -- Isn't this unusually large? Will you need a sample of those Very Big Pings to run these down?

Flags: needinfo?(afranchuk)

We sometimes have stack overflows, infinite stack recursion, etc. Those can generate very large stacks; we truncate them on Socorro, but maybe we're not doing that in crash pings? We could add a cut-off point, for example keep the first 50 frames and drop everything else. That's assuming I'm right about the extra size, of course.
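
For illustration only (this isn't the actual crash reporter code), a minimal Python sketch of that kind of cut-off; the frame structure and the 50-frame limit are just assumptions taken from this discussion:

# Hypothetical sketch: cap the number of frames kept per thread before
# serializing a crash ping. Field names here are made up, not the real schema.
MAX_FRAMES = 50

def truncate_frames(frames):
    """Keep the first MAX_FRAMES frames and record how many were dropped."""
    if len(frames) <= MAX_FRAMES:
        return frames
    kept = list(frames[:MAX_FRAMES])
    # Note the truncation so analysis knows the stack was cut off.
    kept.append({"truncated": len(frames) - MAX_FRAMES})
    return kept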

An interesting piece of context: Glean caps pings at 1MB (compressed) because that's what we were told was the biggest we should ever send to ingestion. (Incidentally, we instrument how often that cap is hit, which we really ought to monitor -- bug 1830754.) So either these pings are coming from non-SDK sources, or they are compressing at better than 8:1.

Also, I was today years old when I learned there was an 8MB uncompressed limit on ingestion. Maybe we should forbid ping uploads in the SDK if they exceed that limit under the same reasoning that we forbid > 1MB compressed?
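
Purely as a sketch of the idea (this is not the actual Glean SDK API; the two limits are just the numbers mentioned in this bug):

import gzip

MAX_COMPRESSED = 1 * 1024 * 1024    # 1MB compressed cap the SDK already enforces
MAX_UNCOMPRESSED = 8 * 1024 * 1024  # 8MB uncompressed ingestion limit

def ping_within_limits(ping_body: bytes) -> bool:
    """Reject pings that ingestion would drop anyway."""
    if len(ping_body) > MAX_UNCOMPRESSED:
        return False
    return len(gzip.compress(ping_body)) <= MAX_COMPRESSED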

As far as I'm aware we didn't enforce any limits on the legacy telemetry pings, but we definitely should have some now. I think the first 50 frames is probably enough. The size is more than likely due to excessive stack frames, but we should try to confirm that as well.

Flags: needinfo?(afranchuk)

(In reply to Chris H-C :chutten|PTO (back August 26) from comment #3)

> so either these are coming from non-SDK sources, or they are compressing at better than 8:1.

If we're generating full stack traces of stack overflows then they'll easily compress 40:1. It's a bunch of identical stack frames repeated over and over and over.
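
That's easy to convince yourself of with a toy example (the frame string below is made up; the point is only the repetition):

import gzip

# Fake "stack overflow" trace: one frame repeated tens of thousands of times.
frame = b"0x00007ffee4b2c000  mozilla::SomeRecursiveFunction()  xul.dll\n"
trace = frame * 100_000

compressed = gzip.compress(trace)
print(len(trace), len(compressed), round(len(trace) / len(compressed)))
# Repetitive input like this compresses far better than 40:1.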

I filed https://mozilla-hub.atlassian.net/browse/DSRE-1737 to get a sample of these pings.

We have a sample of errored pings in moz-fx-data-shared-prod:analysis.dsre_1737. We're interested in the payload field, which is compressed, so we need to access it with udf_js.gunzip(payload).

To make this easier I ran this query:

SELECT
  * EXCEPT (payload),
  udf_js.gunzip(payload) AS payload
FROM
  `moz-fx-data-shared-prod.analysis.dsre_1737`
LIMIT
  10

and exported 10 rows to https://drive.google.com/file/d/1rRbxZwx5hCYNQI98lcdqAApV4HNqYnp3/view?usp=sharing.

tail -n1 bq-results-20240826-154916-1724687416857.json | jq '.payload' does indeed look like a stack trace.

ni?afranchuk - FYI my comment above; it looks like limiting the number of frames would be a fix here.

Flags: needinfo?(afranchuk)

No, I don't think it's a native stack trace; it's the async_shutdown_timeout annotation, which contains a JSON structure with mixed SQL queries and JS stack traces. This one is ridiculously large, so maybe it gets out of hand sometimes. There are a huge number of references to PlacesUtils.sys.mjs, History.sys.mjs, and Sqlite.sys.mjs, so I'd start looking there.
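
One way to check that against the exported sample, assuming each line of the export is a JSON row whose payload column holds the decompressed ping JSON (that's what the query above produces; the exact nesting of the annotations is an assumption):

import json

# Rank top-level payload fields by serialized size for each sampled ping,
# to confirm async_shutdown_timeout is what blows up the size.
with open("bq-results-20240826-154916-1724687416857.json") as f:
    for line in f:
        row = json.loads(line)
        payload = json.loads(row["payload"])
        sizes = {key: len(json.dumps(value)) for key, value in payload.items()}
        print(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:5])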

Agree with gsvelto: the async_shutdown_timeout is massive!

Flags: needinfo?(afranchuk)