Some Glean crash pings from firefox-desktop dropped at ingestion due to excessive size
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: akomar, Unassigned)
References
Details
(Whiteboard: [dataquality])
Around July 21st we started getting ~150 ping size validation failures per day for firefox-desktop.crash: https://sql.telemetry.mozilla.org/queries/101823/source#250861
Our ping size limit is set to 8MB: https://github.com/mozilla/gcp-ingestion/blob/35b03c2bce9701f8aaa1a61e574a30ac84278b5b/ingestion-beam/src/main/java/com/mozilla/telemetry/Decoder.java#L108-L109
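For illustration, a minimal sketch of the kind of size check ingestion applies (the real logic lives in the Decoder.java link above; the function and constant names here are hypothetical):

MAX_UNCOMPRESSED_BYTES = 8 * 1024 * 1024  # 8MB ingestion limit

def validate_ping_size(payload: bytes) -> None:
    # Oversized pings are rejected and end up in the error tables.
    if len(payload) > MAX_UNCOMPRESSED_BYTES:
        raise ValueError(
            f"payload is {len(payload)} bytes, exceeds {MAX_UNCOMPRESSED_BYTES}"
        )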
We will probably need to pull some of these pings out of the error tables for investigation; in the meantime, /cc :chutten for visibility.
Comment 1•6 months ago
Adding in :afranchuk and :gsvelto.
ni?afranchuk -- Isn't this unusually large? Will you need a sample of those Very Big Pings to run these down?
Comment 2•6 months ago
We sometimes have stack overflows, infinite stack recursion, etc. Those can generate very large stacks; we truncate them on Socorro, but maybe we're not doing the same in crash pings? We could add a cut-off point: for example, keep the first 50 frames and drop everything else. That's assuming I'm right about the source of the extra size, of course.
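To make the idea concrete, a hypothetical sketch of that cut-off (the frame structure, names and 50-frame limit are assumptions from this discussion, not existing crash reporter code):

MAX_FRAMES = 50  # proposed cut-off

def truncate_stack(frames, max_frames=MAX_FRAMES):
    # Keep only the first max_frames frames and note how many were dropped.
    if len(frames) <= max_frames:
        return frames
    kept = list(frames[:max_frames])
    kept.append({"truncated_frames": len(frames) - max_frames})
    return kept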
Comment 3•6 months ago
An interesting piece of context: Glean caps pings at 1MB (compressed) because that's the largest we were told we should ever send to ingestion (incidentally, we instrument how often that cap is hit, which we really ought to monitor (bug 1830754)). So either these pings are coming from non-SDK sources, or they are compressing at better than 8:1.
Also, I was today years old when I learned there is an 8MB uncompressed limit at ingestion. Maybe the SDK should refuse to upload pings that exceed that limit, under the same reasoning we use to refuse pings over 1MB compressed?
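A rough sketch of what such a pre-upload guard could look like (this is not Glean SDK code; the 1MB compressed limit is the existing SDK behaviour, the 8MB uncompressed check is the suggestion above):

import gzip

MAX_COMPRESSED_BYTES = 1 * 1024 * 1024    # existing SDK cap (compressed)
MAX_UNCOMPRESSED_BYTES = 8 * 1024 * 1024  # ingestion cap (uncompressed)

def should_upload(ping_body: bytes) -> bool:
    # Skip pings that ingestion would reject anyway.
    if len(ping_body) > MAX_UNCOMPRESSED_BYTES:
        return False
    return len(gzip.compress(ping_body)) <= MAX_COMPRESSED_BYTES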
Comment 4•6 months ago
As far as I'm aware we didn't enforce any limits in the legacy telemetry pings, but we definitely should have some now. I think the first 50 frames are probably enough. The size is more than likely due to excessive stack frames, but we should confirm that as well.
Comment 5•6 months ago
(In reply to Chris H-C :chutten|PTO (back August 26) from comment #3)
> so either these are coming from non-SDK sources, or they are compressing at better than 8:1.
If we're generating full stack traces of stack overflows then they'll easily compress 40:1. It's a bunch of identical stack frames repeated over and over and over.
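A quick way to see why (a toy Python example, not real crash data; the frame contents are made up):

import gzip, json

frame = {"module": "xul.dll", "function": "recurse", "offset": "0x1234"}
trace = json.dumps({"frames": [frame] * 100_000}).encode()
compressed = gzip.compress(trace)
print(len(trace), len(compressed), round(len(trace) / len(compressed)))
# Highly repetitive traces like this compress far better than 8:1.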
Reporter
Comment 6•6 months ago
I filed https://mozilla-hub.atlassian.net/browse/DSRE-1737 to get a sample of these pings.
Reporter
Comment 7•6 months ago
We have a sample of errored pings in moz-fx-data-shared-prod:analysis.dsre_1737. We're interested in the payload field, which is compressed, so we need to access it with udf_js.gunzip(payload).
To make this easier I ran this query:
SELECT
  * EXCEPT (payload),
  udf_js.gunzip(payload) AS payload
FROM
  `moz-fx-data-shared-prod.analysis.dsre_1737`
LIMIT
  10
and exported 10 rows to https://drive.google.com/file/d/1rRbxZwx5hCYNQI98lcdqAApV4HNqYnp3/view?usp=sharing.
tail -n1 bq-results-20240826-154916-1724687416857.json | jq ".payload"
does indeed look like a stack trace.
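For anyone else digging into the sample, a hedged sketch of how to see which top-level payload fields dominate the size (assumes the NDJSON export above; the payload layout is whatever gunzip returns, not a documented schema):

import json

with open("bq-results-20240826-154916-1724687416857.json") as f:
    for line in f:
        payload = json.loads(json.loads(line)["payload"])
        sizes = {k: len(json.dumps(v)) for k, v in payload.items()}
        # Print the five largest fields for each sampled ping.
        for key, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:5]:
            print(key, size)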
Reporter
Comment 8•6 months ago
ni?afranchuk - FYI, see my comment above; it looks like limiting the number of frames would be a fix here.
Comment 9•6 months ago
No, I don't think it's a native stack trace; it's the async_shutdown_timeout annotation, which contains a JSON structure with mixed SQL queries and JS stack traces. This one is ridiculously large, so maybe it gets out of hand sometimes. There are a huge number of references to PlacesUtils.sys.mjs, History.sys.mjs and Sqlite.sys.mjs, so I'd start looking there.
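To confirm that, a rough sketch of measuring how much of each sampled payload the async_shutdown_timeout annotation accounts for (the key name comes from this comment; the traversal makes no assumptions about where it sits in the ping):

import json

def annotation_share(payload_json, name="async_shutdown_timeout"):
    # Fraction of the serialized payload taken up by values under matching keys.
    total = 0

    def walk(node):
        nonlocal total
        if isinstance(node, dict):
            for key, value in node.items():
                if name in key:
                    total += len(json.dumps(value))
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(payload_json))
    return total / len(payload_json)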
Comment 10•6 months ago
Agree with gsvelto; the async_shutdown_timeout is massive!