Open Bug 1938151 Opened 2 months ago Updated 1 month ago

Make queries randomly sample the LIMIT

Categories

(Crash Pings :: Post-Processing, enhancement)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: afranchuk, Assigned: afranchuk)

References

Details

When a platform such as Android has a ton of crash pings, the LIMIT without an ordering seems to return only those pings (with the UNION ALL that's currently done in the query), crowding out the other platforms. We need to change the query to randomly sample among all results.
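
For context, the current query has roughly this shape (table names here are illustrative, not the actual query):

    -- Illustrative sketch only. With no ORDER BY, the LIMIT over the combined
    -- result tends to be filled entirely by whichever branch produces the most
    -- rows (e.g. Android), so the other platforms never make it into the output.
    SELECT * FROM android_crash_pings
    UNION ALL
    SELECT * FROM desktop_crash_pings
    LIMIT 5000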

Alternatively, we could facet on OS to guarantee results for each platform (if any exist). This may not accurately depict the top crashers across all platforms (if the volume is saturated relative to the limit, the relative volumes will disappear); however, we are typically interested in per-OS statistics anyway.
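
A rough sketch of the faceting idea (hypothetical table/column names; it caps the row count per OS with a window function rather than globally):

    -- Illustrative sketch: keep up to N rows per OS so low-volume platforms
    -- still show up, at the cost of distorting cross-platform relative volumes.
    SELECT * EXCEPT (rn)
    FROM (
      SELECT p.*, ROW_NUMBER() OVER (PARTITION BY os ORDER BY RAND()) AS rn
      FROM crash_pings AS p
    )
    WHERE rn <= 1000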

With some hassle, I've managed to come up with a query which randomly samples the results. However, it is necessarily less efficient than LIMIT, since LIMIT can stop scanning data when the limit is reached. I've optimized this as much as possible (no temporary values/tables created; it's just an additional WHERE clause).
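
The full query is in the linked Redash query below, but the gist of the sampling predicate is along these lines (the exact predicate in the real query may differ; the rate here is just for illustration):

    -- Illustrative: a Bernoulli-style filter keeps each row with some probability.
    -- Unlike LIMIT, the scan can't stop early, so every partition is still read,
    -- but there's no ordering and no temporary table.
    SELECT *
    FROM crash_pings
    WHERE RAND() < 0.01  -- sampling rate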

For release 133 main-process crashes over the past week, this resulted in roughly 3.5 times as much data scanned (8.5 GB vs. 31 GB) and about 3 times as much time spent (13 s vs. 36 s).

Since our goal is to eventually process all nightly and beta pings (approximately 9k pings a day), it might be more efficient to make a separate query for release which uses TABLESAMPLE to reduce the scan to a fixed percentage. Unfortunately TABLESAMPLE only supports a percentage, so we'd have to manually adjust it if, for example, there were a sudden increase in pings. Ideally our processing would be fast enough that it would take an order of magnitude more pings to have detrimental effects, in which case we could reasonably leave it alone without worrying.
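
For reference, the TABLESAMPLE form would look roughly like this (BigQuery syntax; the percentage is the knob we'd have to tune by hand):

    -- TABLESAMPLE reads only a subset of the table's blocks, so it reduces
    -- bytes scanned as well as rows returned.
    SELECT *
    FROM crash_pings TABLESAMPLE SYSTEM (10 PERCENT)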

I just tried TABLESAMPLE and it errors, saying that it can only be applied to base tables; I guess these tables are all views over the live tables? So that may not work.

If we had an index over document_id, we could efficiently order by that column (which is effectively random) and use a LIMIT. I just tried that; however, there seems to be no such index, as it scanned all the data.
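
The attempt was essentially the following (assuming document_id is uniformly distributed, so ordering by it approximates a random sample):

    -- Without an index or clustering on document_id, this still has to scan
    -- (and sort) the whole table, which is what I observed.
    SELECT *
    FROM crash_pings
    ORDER BY document_id
    LIMIT 5000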

Of course, the other thing to note is that the query still isn't too slow (and run daily it would be pretty fast, with ~5 GB of data to scan for release per query), so we could just live with it. However, if there's a monetary cost associated with scanning tables, it might be worth optimizing this. If it's a drop in the bucket, it may be a low priority, even when running daily.

This is implemented in https://sql.telemetry.mozilla.org/queries/104407, which is used by https://github.com/mozilla/crash-ping-ingest. Once CI is set up and we move the UI over to using this new ingestion pipeline, we can close this.

I think it's reasonable to scan all the data, if only because the new code will process each day exactly once, so the new crash ping ingestion pipeline ends up loading each ping's data just once. That being said, we should still be able to save some memory by sampling using an indexed column if we want.

Depends on: 1940195