Closed Bug 1679787 Opened 4 years ago Closed 4 years ago

Consider raising the maximum payload size we accept (round 2)

Categories

(Socorro :: Antenna, task, P2)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: willkg, Unassigned)

References

Details

(This bug was initially created as a clone of Bug #1648387)

In https://bugzilla.mozilla.org/show_bug.cgi?id=1679733#c1, Gabriele says users are sending him crash reports by email because the collector is rejecting them, probably because they're too big.

In bug #1648387, we raised the max payload size to 25MB. We should look at doing that again.

What's happening is odd. I tried manually uploading one of the dumps I received from a user using my nightly build, and it succeeded. It's here:

https://crash-stats.mozilla.org/report/index/e28ae95a-3731-465d-8545-e4bda0201130

I expected that: the minidump is 14144691 bytes (~14MiB) and the .extra file is 237607 bytes (~232KiB), so the payload should be well below the limit. However, it took quite a while. Upload speed was a little below 1 Mb/s, so I wonder if users aren't just becoming annoyed because the crash does not appear to have been submitted quickly, assume it failed, and close the crashed tab or the crash reporter before we finish uploading. In the meantime I think we can close this until someone sends me a minidump that's larger than 25 MiB.

I have some tools for taking a JSON-encoded set of crash annotations and one or more minidumps and building the HTTP POST payload:

https://github.com/mozilla-services/antenna/blob/main/testlib/mini_poster.py

I use that for testing the collector. They're janky--I could make them part of crashstats-tools and wrap them in a CLI if that helps.

The payloads are in multipart/form-data, which adds a bunch of extra data. Maybe that pushes some of the crash reports over the threshold?
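
For illustration, here's a minimal sketch of what building and sending one of these payloads could look like with requests. This is not mini_poster.py itself: the collector URL is a placeholder and the "upload_file_minidump" field name follows the usual Breakpad/Socorro convention.

    # Minimal sketch (not mini_poster.py itself) of building the crash report POST.
    # Each annotation becomes its own form field; the minidump becomes a file part.
    import json
    import requests

    def submit_crash(collector_url, annotations_path, minidump_path):
        with open(annotations_path) as fp:
            annotations = json.load(fp)

        # filename=None makes requests emit a plain form field for each annotation
        fields = {key: (None, str(value)) for key, value in annotations.items()}

        with open(minidump_path, "rb") as dump_fp:
            fields["upload_file_minidump"] = (
                "upload_file_minidump", dump_fp, "application/octet-stream"
            )
            # requests generates the multipart/form-data body; the per-part
            # headers and boundaries are the extra data mentioned above
            resp = requests.post(collector_url, files=fields)

        resp.raise_for_status()
        return resp.text  # the collector responds with "CrashID=bp-..."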

Maybe we should think about compressing the desktop crash reports. I'm pretty sure they're not compressed. I don't know how much space that'd save, but it might help some cases.
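
As a rough sketch of what client-side compression could look like, the whole multipart body can be gzipped before POSTing. This assumes the Content-Encoding: gzip mechanism the collector already accepts from the mobile products mentioned below; annotations here is a plain dict of string names to string values.

    # Sketch: compress the whole multipart body before submission. Assumes the
    # collector accepts Content-Encoding: gzip for the request body.
    import gzip
    import requests
    from urllib3.filepost import encode_multipart_formdata

    def submit_crash_compressed(collector_url, annotations, minidump_bytes):
        fields = dict(annotations)  # annotation name -> string value
        fields["upload_file_minidump"] = ("upload_file_minidump", minidump_bytes)

        # Build the multipart body ourselves so we can compress all of it
        body, content_type = encode_multipart_formdata(fields)
        compressed = gzip.compress(body)

        headers = {
            "Content-Type": content_type,   # keeps the multipart boundary intact
            "Content-Encoding": "gzip",
        }
        return requests.post(collector_url, data=compressed, headers=headers)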

Here are the weekly HTTP 413 (payload too large) counts for October and November:

week count
2020-40 1237
2020-41 1182
2020-42 1220
2020-43 1221
2020-44 1236
2020-45 1305
2020-46 1352
2020-47 1350

It looks pretty steady. If there were a spike in stack overflow crashes whose reports couldn't be submitted because the payloads were too large, I would expect to see a spike in HTTP 413s.

For reference, here are the counts for some weeks before we raised the limit to 25MB:

week count
2020-31 1999
2020-32 1854
2020-33 1677

We don't know the payload sizes of the crash reports we're rejecting, and other than stack overflow errors, we don't know what kinds of crash reports are getting rejected.

One consequence of raising the max payload size is that it takes longer for the payload to be transmitted. I suspect at some point, we're going to cross over from "payload was too large so crash report failed to submit" to "crash report took too long to submit and failed". I'm going to look into that next.

I just found the payload sizes of rejected crash reports in the nginx error logs. I think I can figure out how to extract those so we can see what a week of rejected report sizes looks like.
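
For what it's worth, here's a rough sketch of how to pull those sizes out. It assumes nginx's standard "client intended to send too large body: N bytes" error message and a placeholder log path.

    # Rough sketch: extract rejected payload sizes from nginx error logs.
    import re

    TOO_LARGE_RE = re.compile(r"client intended to send too large body: (\d+) bytes")

    def rejected_sizes(log_path):
        sizes = []
        with open(log_path, errors="replace") as fp:
            for line in fp:
                match = TOO_LARGE_RE.search(line)
                if match:
                    sizes.append(int(match.group(1)))
        return sizes

    sizes = rejected_sizes("/var/log/nginx/error.log")
    print(len(sizes), "rejections; largest payload:", max(sizes, default=0), "bytes")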

I also figured out how to count how many connection timeouts we get.

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #3)

> One consequence of raising the max payload size is that it takes longer for the payload to be transmitted. I suspect at some point, we're going to cross over from "payload was too large so crash report failed to submit" to "crash report took too long to submit and failed". I'm going to look into that next.

Yes, it looks like we're already very close to that. Wiring up the crash reporter to send gzip-compressed crash reports shouldn't be too hard; what about Socorro's side? Would it be beneficial beyond handling large crash reports, e.g. by reducing ingress bandwidth?

I looked at the last 24 hours and picked the busiest hour, 2020-12-01 11:00:00. During that busy hour, we got 47,227 crash reports, of which 4,964 were compressed. (Grafana link)

The crash reports we get that are compressed come from Fennec and Fenix. No other product compresses crash reports.

Changing Firefox desktop to compress crash reports will dramatically increase the number of compressed crash reports.

The 47,227 number includes all crash reports, some of which get accepted and the rest rejected. So while Socorro got 47,227 crash reports in that hour, it only saved and processed 12,088 of them. We look at all crash reports because in order to accept or reject a crash report, the collector has to decompress it and look at the crash annotations.
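
As a rough illustration of that flow (this is not Antenna's actual code; parse_annotations and throttle are hypothetical stand-ins):

    import gzip

    def parse_annotations(body, content_type):
        """Hypothetical stand-in for the collector's multipart parsing."""
        ...

    def throttle(annotations):
        """Hypothetical stand-in for the accept/reject rules."""
        ...

    def handle_crash_report(headers, raw_body):
        # The accept/reject decision needs the crash annotations, which live
        # inside the (possibly gzip-encoded) multipart body, so every incoming
        # report gets decompressed -- including the ones that end up rejected.
        if headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(raw_body)
        else:
            body = raw_body

        annotations = parse_annotations(body, headers["Content-Type"])
        return throttle(annotations)  # accepted reports get saved and processed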

I'm working on:

  • estimating the change in collector CPU usage due to the increased decompression work (see the timing sketch after this list)
  • estimating the change in ingress bandwidth used
  • building a graph of connection timeouts per day to see how that changed the last time we increased the max payload size
  • building a graph of rejected payload sizes
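
For the first item, this is the kind of back-of-the-envelope measurement I mean (a sketch using a ~500KB synthetic payload, not the collector's real instrumentation):

    # Time gzip decompression of a synthetic ~500KB payload as a rough proxy
    # for the per-report decompression cost.
    import gzip
    import statistics
    import time

    payload = b"CrashAnnotationName=SomeAnnotationValue\n" * 13000  # ~520KB stand-in
    compressed = gzip.compress(payload)

    timings = []
    for _ in range(100):
        start = time.perf_counter()
        gzip.decompress(compressed)
        timings.append(time.perf_counter() - start)

    print(f"compressed size: {len(compressed)} bytes")
    print(f"mean decompress time: {statistics.mean(timings) * 1000:.3f} ms")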

I think the urgency behind this has dropped, so I'll do this work over the next couple of weeks. We've got an upcoming change freeze and I have some other things I need to do, so we're probably not going to change the max payload size in 2020.

Is that ok?

I grabbed 1,000 crash reports, converted them back into multipart/form-data payloads, and used those to estimate compression savings and the resulting change in ingress bandwidth.

label value
uncompressed mean 530298.267 bytes
compressed mean 76789.564 bytes
savings mean 83.6%
savings median 84.4%
savings best 93.1%

"Savings" is the ratio of compressed / uncompressed--I couldn't think of a better name.

For the busy hour mentioned in comment #6 and using some hand-wavy leaps of "deduction", we go from roughly 22GB [1] to 3.6GB [2] in ingress bandwidth for that one busy hour.

  1. ((47,227 - 4,964) * 530,000) + (4,964 * 77,000) = 22,781,618,000 bytes ≈ 22GB
  2. 47,227 * 77,000 = 3,636,479,000 bytes ≈ 3.6GB

Two things to point out about that:

  1. That's an estimate based on 1,000 Firefox desktop crash reports applied to a single busy hour in a 24-hour span. I hand-wavingly think we could use it as a rough upper bound; reality will probably be smaller than that.
  2. Since we're making this change in Firefox desktop, it'll land and ride the trains to nightly, beta, and release. It probably won't go to ESR, so it won't affect those users. We also have users stuck on certain versions. So the change in ingress bandwidth won't happen all at once but probably will happen in stages.

This is great data. The reduction would be pretty dramatic; would it have an impact on the bottom line? It will most certainly help users, as we'd be less likely to time out when sending a crash report and the like. As for the timeline, I'd say there's no hurry. I'll open a bug under Toolkit > Crash Reporting, considering that compression is already handled on the Socorro side.

I pushed out my decompression timer. Brian and I don't think that CPU usage will change in any meaningful way when we change Firefox desktop crash reports to be compressed. Further, while ingress bandwidth will drop, it won't meaningfully affect costs, either.

Given that, I think we should:

  1. change the crash reporter to compress Firefox desktop crash reports
  2. push off the rest of the work I was thinking of doing in comment #6 and mark this as WONTFIX
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX

We already had bug 781630 on file for the desktop part, I'll follow up there. Thanks!

See Also: → 1686864