Closed Bug 1291068 Opened 9 years ago Closed 8 years ago

Large-scale analysis of OOM crash reports with ContainsMemoryReport=1

Categories

(Core :: General, defect)

Tracking

RESOLVED FIXED

People

(Reporter: n.nethercote, Unassigned)

Attachments

(2 files)

I would like to request access to a large set of OOM crash reports that have ContainsMemoryReport=1 set. I want to analyze these memory reports for interesting patterns, and possibly also write a processor that pulls out interesting things for presentation in crash-stats.
In the past 7 days we've had ~5000 "OOM | small" crashes in 47.0.1 where ContainsMemoryReport=1 is present, as this search shows:
https://crash-stats.mozilla.com/search/?product=Firefox&version=47.0.1&signature=%3DOOM%20%7C%20small&contains_memory_report=%21__null__&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Would it be possible for me to download copies of all of these crash reports? Crash reports look to be about 100-200 KiB each when gzip'd, or about half that when bzip2'd, so I think it would be roughly 1 GiB of compressed data or less.
I understand that crash reports are privacy-sensitive and I would follow the usual rules about handling them, esp. deleting them once I've finished. Thank you.
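For reference, the size estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures quoted in the request (~5000 reports, 100-200 KiB each when gzip'd, about half that when bzip2'd); the helper name `estimate_gib` is made up for illustration.

```python
KIB = 1024
GIB = 1024 ** 3


def estimate_gib(num_reports, kib_per_report):
    """Total size in GiB for num_reports reports of kib_per_report KiB each."""
    return num_reports * kib_per_report * KIB / GIB


num_reports = 5000  # approximate count from the search above
gzip_low = estimate_gib(num_reports, 100)
gzip_high = estimate_gib(num_reports, 200)

print(f"gzip:  {gzip_low:.2f}-{gzip_high:.2f} GiB")
print(f"bzip2: {gzip_low / 2:.2f}-{gzip_high / 2:.2f} GiB")
```

Both ranges stay under 1 GiB, which agrees with the "roughly 1 GiB or less" figure.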
Blocks: 1291173
Blocks: 1291174
You can simply do this yourself. Start with
curl "https://crash-stats.mozilla.com/api/SuperSearch/?product=Firefox&version=47.0.1&signature=%3DOOM%20%7C%20small&contains_memory_report=%21__null__&_columns=uuid&_results_number=500"
and then use _results_offset=500, _results_offset=1000, etc. Once you have the UUIDs you can use
curl -H "Auth-token: yourapitoken" "https://crash-stats.mozilla.com/api/UnredactedCrash/?crash_id={UUID}"
You'll need some patience. This is the kind of work we're hoping you can do with re-dash instead.
> You can simply do this yourself.
It works! Thank you. For completeness: I had to visit https://crash-stats.mozilla.com/api/tokens/ to get an Auth-token.
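The paging procedure described above could be scripted roughly as follows. This is a minimal sketch, not an official client: the endpoints, query parameters and the Auth-token header are taken from the curl commands in this bug, while the function names and the assumption that the SuperSearch response is JSON with a "hits" list of objects carrying a "uuid" key are mine.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://crash-stats.mozilla.com/api"


def supersearch_url(offset, page_size=500):
    """Build the SuperSearch URL for one page of crash UUIDs."""
    params = [
        ("product", "Firefox"),
        ("version", "47.0.1"),
        ("signature", "=OOM | small"),
        ("contains_memory_report", "!__null__"),
        ("_columns", "uuid"),
        ("_results_number", page_size),
        ("_results_offset", offset),
    ]
    # quote_via=quote encodes spaces as %20, matching the URLs in this bug.
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    return f"{API_ROOT}/SuperSearch/?{query}"


def unredacted_crash_url(uuid):
    """Build the UnredactedCrash URL for a single crash."""
    return f"{API_ROOT}/UnredactedCrash/?" + urllib.parse.urlencode({"crash_id": uuid})


def get_json(url, token=None):
    """Fetch a URL and decode the JSON body, sending the API token if given."""
    headers = {"Auth-Token": token} if token else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_uuids(fetch=get_json, page_size=500):
    """Page through SuperSearch until a short page signals the end."""
    uuids, offset = [], 0
    while True:
        # Assumption: results are listed under a "hits" key.
        hits = fetch(supersearch_url(offset, page_size)).get("hits", [])
        uuids.extend(hit["uuid"] for hit in hits)
        if len(hits) < page_size:
            return uuids
        offset += page_size
```

Each report would then be fetched with something like `get_json(unredacted_crash_url(uuid), token="yourapitoken")`. Passing `fetch` as a parameter keeps the paging logic testable without network access.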
Now that I have the crash reports, I'll move this out of the Socorro component.
Product: Socorro → Core
Attached file Analysis
Here is a first go at an analysis.
See also bug 965936 comment 7, bug 1299747, and bug 1123465 for a possible explanation of the system-heap-allocated value.
The most common cause of top(none)/detached that I've seen is chrome window leaks. Some addons leak windows when you open and close a window, and some users open and close a lot of windows. Bug 1276366 should deal with that problem, but it has some bad interactions with session store and devtools.
FYI the "all on windows" is expected, as we only write memory reports for OOM crashes on Windows (last I checked).
It would be interesting to see if you can find any correlations against installed addons or other information for some of those outlier measurements like top-none-detached, system-heap-allocated and ghost-windows.
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #6)
> See also bug 965936 comment 7 and bug 1299747 and bug 1123465 for possible
> explanation of the system-heap-allocated
I don't think those are related. They are about VirtualAlloc() calls, but system-heap-allocated is about the heap, i.e. malloc and friends. So the problem is that we are failing to route some calls to malloc (or a related heap allocation function) through jemalloc.
If we defined interesting intervals for these values, I could add them to the correlation tool (once the Socorro data is in Telemetry, otherwise we'd need to download too much data from Socorro every day).
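As a rough illustration of what such a correlation might look like, here is a minimal sketch. It assumes each crash has already been reduced to a dict holding a numeric measurement and a list of addon IDs; the keys `ghost_windows` and `addons`, the threshold, and the function names are all hypothetical, and a real tool would likely use several intervals and a proper statistical test rather than a single cutoff.

```python
from collections import Counter


def addon_share(crashes):
    """Fraction of the given crashes in which each addon is installed."""
    total = len(crashes)
    if total == 0:
        return {}
    counts = Counter(addon for crash in crashes for addon in set(crash["addons"]))
    return {addon: count / total for addon, count in counts.items()}


def correlate(crashes, key, threshold):
    """Rank addons by how much more common they are in outlier crashes.

    A crash is an outlier when its value for `key` is >= threshold.
    Returns (addon, share_in_outliers - share_in_rest) pairs, largest first.
    """
    outliers = [c for c in crashes if c.get(key, 0) >= threshold]
    rest = [c for c in crashes if c.get(key, 0) < threshold]
    outlier_share = addon_share(outliers)
    rest_share = addon_share(rest)
    diffs = {a: share - rest_share.get(a, 0.0) for a, share in outlier_share.items()}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


sample = [
    {"ghost_windows": 50, "addons": ["x", "y"]},
    {"ghost_windows": 40, "addons": ["x"]},
    {"ghost_windows": 0, "addons": ["y"]},
    {"ghost_windows": 1, "addons": []},
]
ranking = correlate(sample, "ghost_windows", threshold=10)
```

In this toy data, addon "x" appears in every outlier crash and in none of the others, so it ranks first.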
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> It would be interesting to see if you can find any correlations against
> installed addons or other information for some of those outlier measurements
> like top-none-detached, system-heap-allocated and ghost-windows.
Indeed it would. I've just sketched out a design to support doing this automatically within crash-stats, in bug 1291173 comment 3.
(In reply to Marco Castelluccio [:marco] from comment #11)
> If we defined interesting intervals for these values, I could add them to
> the correlation tool (once the Socorro data is in Telemetry, otherwise we'd
> need to download too much data from Socorro every day).
What does "once the Socorro data is in Telemetry" mean? I'm not sure what that refers to...
Flags: needinfo?(mcastelluccio)
That's bug 1273657, which I mentioned in the last uptime meeting.
Flags: needinfo?(mcastelluccio)
(In reply to Nicholas Nethercote [:njn] from comment #13)
> (In reply to Marco Castelluccio [:marco] from comment #11)
> > If we defined interesting intervals for these values, I could add them to
> > the correlation tool (once the Socorro data is in Telemetry, otherwise we'd
> > need to download too much data from Socorro every day).
>
> What does "once the Socorro data is in Telemetry" mean? I'm not sure what
> that refers to...
For the record, there is no current plan to append the memory info to the crash when we send the crash over to Telemetry. This would certainly be doable but it's not planned and the work would require some smarts because doing so would add a fair amount of extra network I/O (ie. the additional S3 lookup).
(In reply to Peter Bengtsson [:peterbe] from comment #15)
> For the record, there is no current plan to append the memory info to the
> crash when we send the crash over to Telemetry.
> This would certainly be doable but it's not planned and the work would
> require some smarts because doing so would add a fair amount of extra
> network I/O (ie. the additional S3 lookup).
I mentioned this on IRC, but peterbe and I were looking into bug 1061371 when we were in Portland, and I noticed that we already have a processor rule that takes a memory report (if present) and puts the parsed JSON contents into a memory_report key on the processed JSON output. peterbe says the crash_report.json schema is what we're sending to telemetry, and it includes that key, so I think this will Just Work once that pipeline is enabled:
https://github.com/mozilla/socorro/blob/f345b4beec58ba2b9440901d0706d00b8d2356d9/socorro/schemas/crash_report.json#L269
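To make the idea concrete, a rule of that kind could look roughly like the sketch below. This is not the actual Socorro processor rule; it only illustrates the behaviour described above. It assumes the memory report arrives as gzip-compressed JSON and that it contains a "reports" list whose entries have "path" and "amount" fields. Only the `memory_report` key name comes from this bug; the function names are hypothetical.

```python
import gzip
import json


def attach_memory_report(processed_crash, raw_report_gz):
    """Parse a gzipped JSON memory report into processed_crash["memory_report"].

    If no memory report is present (None), the crash is returned unchanged.
    """
    if raw_report_gz is None:
        return processed_crash
    processed_crash["memory_report"] = json.loads(gzip.decompress(raw_report_gz))
    return processed_crash


def measurement(memory_report, path):
    """Sum the amounts of all entries whose path matches exactly.

    Assumption: entries live under "reports" with "path" and "amount" keys.
    """
    return sum(
        entry["amount"]
        for entry in memory_report.get("reports", [])
        if entry.get("path") == path
    )


# Hypothetical two-entry report, compressed the way it is assumed to arrive.
example = {
    "reports": [
        {"process": "Main Process", "path": "system-heap-allocated", "amount": 1024},
        {"process": "Main Process", "path": "ghost-windows", "amount": 3},
    ]
}
crash = attach_memory_report({"uuid": "example"}, gzip.compress(json.dumps(example).encode()))
heap_bytes = measurement(crash["memory_report"], "system-heap-allocated")
```

With the parsed report attached to the processed crash, outlier measurements such as system-heap-allocated or ghost-windows can be read without a separate lookup.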
I'm going to mark this as complete. Bug 1291173 is going to carry through to get this kind of information available in production.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED