Closed Bug 1337688 Opened 8 years ago Closed 8 years ago

Remove NIGHTLY_BUILD wrapping if the increased size from adding unloaded modules and process/thread data to minidumps is acceptable

Categories

(Toolkit :: Crash Reporting, defect)

Unspecified
Windows
defect
Not set
normal

Tracking

RESOLVED WONTFIX
mozilla55
Tracking Status
firefox54 --- affected
firefox55 --- fixed

People

(Reporter: ting, Unassigned)

This is a followup for bug 1334027.

Once bug 1334027 has landed, we will know how the size of minidumps is affected by adding unloaded modules and process/thread data. If the increased size is acceptable, we can let the change propagate to the other channels.
Marco, I am not sure whom to ask, but do you know how to check the increased size of minidumps from bug 1334027 that we received recently?
Flags: needinfo?(mcastelluccio)
:adrian or :lonnen can probably help (or redirect to somebody that can help).
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(chris.lonnen)
Flags: needinfo?(adrian)
Adrian told me Will can probably help.
Flags: needinfo?(willkg)
Flags: needinfo?(chris.lonnen)
Flags: needinfo?(adrian)
Reading through this bug and bug #1334027, it sounds like you want to know the change in median/95%/max crash report sizes between before bug #1334027 landed and today for crashes from Firefox nightly. Is that correct?
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #4)
> Reading through this bug and bug #1334027, it sounds like you want to know
> the change in median/95%/max crash report sizes between before bug #1334027
> landed and today for crashes from Firefox nightly. Is that correct?

Correct.
We have metrics for the overall median/95%/max of crash report sizes, where a crash report is the entire Breakpad crash report, not just the minidump.

Given that this requires data about nightly only, I think I'm going to have to do it by hand. I'll think about how to do that. Maybe build a list of crash ids using a Socorro SuperSearch and then capture the sizes of the upload_file_minidump files for those crashes.

When do you need this data by?
Flags: needinfo?(willkg) → needinfo?(janus926)
Marco said this on IRC:

<marco> it would be more precise by build ID, as in the AFTER period there might be crash reports both from Nightly builds that contain the change and Nightly builds that don't contain the change
<marco> the first build ID with the change that might have increased the size of the minidump is 20170209030214
<marco> so you can compare all crash reports with build IDs < 20170209030214 vs all crash reports with build ID >= 20170209030214

I'll search by build id.
Also! Note that the change only affects minidumps from Windows clients, so you should restrict your query to Windows.
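For illustration, the before/after split described above could be sketched as follows. This is not the query that was actually run; the record layout (dicts with "uuid", "build_id" and "platform" keys) and the function name are assumptions made up for this example. Only the cutoff build ID comes from the IRC quote.

```python
# Hypothetical record layout: each crash is a dict with "uuid", "build_id"
# and "platform" keys. These field names are assumptions for illustration.
FIRST_BUILD_WITH_CHANGE = "20170209030214"  # cutoff quoted from IRC above


def split_by_build_id(crashes, threshold=FIRST_BUILD_WITH_CHANGE):
    """Return (before, after) lists of crash IDs, Windows crashes only."""
    before, after = [], []
    for crash in crashes:
        if not crash["platform"].startswith("Windows"):
            continue  # the change only affected Windows minidumps
        # Build IDs are 14-digit timestamps, so comparing them as integers
        # matches chronological order.
        if int(crash["build_id"]) < int(threshold):
            before.append(crash["uuid"])
        else:
            after.append(crash["uuid"])
    return before, after


sample = [
    {"uuid": "a", "build_id": "20170208030204", "platform": "Windows NT"},
    {"uuid": "b", "build_id": "20170209030214", "platform": "Windows NT"},
    {"uuid": "c", "build_id": "20170210030227", "platform": "Linux"},
]
before_ids, after_ids = split_by_build_id(sample)
print(before_ids, after_ids)
```

Splitting on build ID rather than crash date keeps reports from older Nightly builds that arrive after the landing date out of the "after" set.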
Low priority, so do it when you have time.
Flags: needinfo?(janus926)
I used the following SuperSearch query:

https://gist.github.com/willkg/25e28570fd8c95537dbd7f9e2855c7c8#file-analysis_1337688-py-L129

https://gist.github.com/willkg/25e28570fd8c95537dbd7f9e2855c7c8#file-analysis_1337688-py-L149


Here's the script:

https://gist.github.com/willkg/25e28570fd8c95537dbd7f9e2855c7c8

It runs one SuperSearch query per day for build ids before the cutoff and one for build ids at or after it. Then for each crash id, I pulled down the dump. Then I looked at the "before" set and the "after" set and here's the summary:

./before
   Number of files:        920
   Average size:        383092
   Median size:         348876
   95% size:            747782
   Max size:           3149706
./after
   Number of files:       1001
   Average size:        764912
   Median size:         662433
   95% size:           1516906
   Max size:          16634443
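
A summary like the one above could be computed along these lines. This is a minimal sketch, not the code from the gist: the function names are invented here, and the nearest-rank percentile is an assumption, since the thread does not say how the 95% figure was calculated.

```python
import math
import os
import statistics


def percentile(sorted_sizes, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_sizes)))
    return sorted_sizes[rank - 1]


def summarize(sizes):
    """Count, average, median, 95th percentile and max of a list of sizes."""
    sizes = sorted(sizes)
    return {
        "count": len(sizes),
        "average": int(statistics.mean(sizes)),
        "median": int(statistics.median(sizes)),
        "p95": percentile(sizes, 95),
        "max": sizes[-1],
    }


def dump_sizes(directory):
    """Sizes in bytes of the regular files directly inside `directory`."""
    return [
        entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    ]


# Small made-up input; real use would be summarize(dump_sizes("./before")).
print(summarize([100, 200, 300, 400, 1000]))
```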


Please let me know if there are changes in how I did it that you want to see and/or if I messed up the SuperSearch query.

Hope this helps!
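
To put the two sets side by side, the ratios can be worked out directly from the numbers reported above. The snippet only restates that arithmetic; the dictionary names are chosen for this example.

```python
# Figures copied from the ./before and ./after summaries in this bug.
before = {"average": 383092, "median": 348876, "95%": 747782, "max": 3149706}
after = {"average": 764912, "median": 662433, "95%": 1516906, "max": 16634443}

# Growth factor of each statistic: after divided by before.
ratios = {name: after[name] / before[name] for name in before}
for name, ratio in ratios.items():
    print(f"{name:>8}: {ratio:.2f}x")
```

The average, median and 95% sizes all come out at roughly 2x, while the maximum grows by more than 5x.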
Do we have criteria for 'acceptable'?

From the perspective of the crash reporter, increasing the size of the minidump (1) increases the risk of a network disconnect during transmission and (2) results in additional load on the collectors. 

(2) We scale well in our current infrastructure, and Antenna (the collector rewrite) appears to be constrained on network throughput, so I'm not worried about load on them.

(1) I don't have ideas for quantifying the risk of a disconnect that can be validated in under a week. This may be less important in non-release channels because we can retry or prompt with the doorhanger.
The HTTP POST payload from a breakpad crash report from Windows is uncompressed. Maybe compressing the payloads from Windows can alleviate the concerns?
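As a rough way to gauge what compression might save, one could compare the raw and gzip-compressed sizes of a payload. The sketch below uses a synthetic, highly repetitive byte string, so it says nothing about how well real minidumps compress, and it does not address whether the collector accepts compressed bodies. The function name is made up for this example.

```python
import gzip


def compressed_size(payload: bytes) -> int:
    """Length in bytes of the gzip-compressed payload."""
    return len(gzip.compress(payload))


# Synthetic stand-in for a dump; a real test would read actual minidump files.
payload = bytes(range(256)) * 2000

raw = len(payload)
packed = compressed_size(payload)
print(f"raw={raw} gzip={packed} ratio={packed / raw:.3f}")
```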
I think it had better stay nightly-only, given that the size increases ~2x. Ted, what do you think?
Flags: needinfo?(ted)
I agree, it seems like a bit too much to ship in release.
Flags: needinfo?(ted)
Pushed by tchou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/52512f137a66
Update a TODO comment according to the experimental data. r=me
https://hg.mozilla.org/mozilla-central/rev/52512f137a66
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Resolution: FIXED → WONTFIX
Will, thanks for collecting the numbers. :)