Closed Bug 1248587 Opened 8 years ago Closed 8 years ago

firefox producing huge memory dump files on OOM crashes

Categories

(Toolkit :: about:memory, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: lars, Unassigned)

References

Details

Attachments

(3 files)

In an OOM crash, FF submits a "memory dump" json file to Socorro. Some of these gzip'd memory dumps expand to over a gigabyte in size.  This is wreaking havoc on our long term storage in Postgres and Elastic Search.  

In Socorro we're going to start detecting and drop these huge files on the floor.  

It is suggested that these pathologically large json files are a bug in FF...

If appropriate, I can attach an example of one of these large files (gzip'd at 17M) as soon as someone can let me know the PII implications of doing so...
njn: These memory reports seem really busted. lars gave me a download link for one and I loaded it into Python on my local machine, and it looks like we've got a lot of duplicate entries in the "reports" key:
>>> data = json.load(open('cd4594cb-69b0-46fa-8d69-659132160210.memory_report.dump.json', 'rb'))
>>> len(data['reports'])
5530912
>>> len(set(json.dumps(r) for r in data['reports']))
32845

5.5M entries in "reports", but only 32k unique entries (I just serialized them to JSON strings and put them into a set to count uniqueness).
Component: Developer Tools: Memory → about:memory
Product: Firefox → Toolkit
Version: unspecified → Trunk
Lars, can we find out 1) when us getting such large reports started, and 2) which Firefox versions this happens with?
Flags: needinfo?(lars)
Can someone email me a link to an example report?
Flags: needinfo?(lars)
Lars emailed me an example. I can see two problems.

- On Windows, memory reports include info about the address space layout. The way these are represented isn't that efficient, and if there are lots of alternating segments of different kinds then this could bloat the report badly.

- Some of the other reports ("resident", "private", etc.) are duplicated 263 times. I'm not sure what might cause this, though I'll try to work it out.

Lars, are these bad files all occurring on Windows?
Flags: needinfo?(lars)
(In reply to Nicholas Nethercote [:njn] from comment #5)
> Lars, are these bad files all occurring on Windows?

Now that I saw the attachment Lars made above, we can easily do searches for this, see e.g. this search for platforms with version:
https://crash-stats.mozilla.com/search/?product=Firefox&processor_notes=memory+info+too+large&_facets=signature&_facets=platform_pretty_version#facet-platform_pretty_version

They are all on Windows, with a pretty typical split between Windows versions.
Flags: needinfo?(lars)
Depends on: 1249503
(In reply to Nicholas Nethercote [:njn] from comment #5)
> 
> - On Windows, memory reports include info about the address space layout.
> The way these are represented isn't that efficient, and if there are lots of
> alternating segments of different kinds then this could bloat the report
> badly.
>
> - Some of the other reports ("resident", "private", etc.) are duplicated 263
> times. I'm not sure what might cause this, though I'll try to work it out.

I've determined that both problems caused the huge size of the example file I have. There should have been 20,888 entries relating to address space layout, which is a bit excessive but not completely ridiculous. But somehow they got duplicated 263 times each and we end up with 5.5 million entries in the file.

Bug 1249503 has a patch to address the first problem. It reduces the number of address space entries in the file from 1000s to 10s, which should reduce the uncompressed file size from 1.1 GiB to something in the range 1--10 MiB. That should be enough to stop the machines that received the crash reports from choking.

I haven't investigated the second problem yet, though I will, because it means that the data in these files is at least partly bogus, which reduces their usefulness and trustworthiness.
(In reply to Nicholas Nethercote [:njn] from comment #7)
> Bug 1249503 has a patch to address the first problem. It reduces the number
> of address space entries in the file from 1000s to 10s, which should reduce
> the uncompressed file size from 1.1 GiB to something in the range 1--10 MiB.
> That should be enough to stop the machines that received the crash reports
> from choking.

Yes, that is awesome!
> I haven't investigated the second problem yet, though I will, because it
> means that the data in these files is at least partly bogus, which reduces
> their usefulness and trustworthiness.

All the reports that are duplicated 263 times are from the reporters that are registered in nsMemoryReporterManager::Init(). This suggests that nsMemoryReporterManager::Init() is called 263 times, which is strange and definitely not what is supposed to happen, nor what normally happens.

The only way I can think that Init() would be called multiple times on a single nsMemoryReporterManager instance is if an extension does it. (Thanks to dbaron for pointing me in this direction.) I simulated this locally and saw duplicates of the relevant reports just like those in the oversized crash reports. This seems weird, but I guess anything's possible in extension-land. It would also explain why this only seems to be a problem on a (very?) small fraction of crash reports. If this is the problem, I think we can simply remove Init() from the interface to nsMemoryReporterManager; I can't see any reason for it to be public.

Lars, KaiRo: is it possible to see if there is any correlation between the production of these oversized memory report files and the presence of particular extensions?
Flags: needinfo?(lars)
Flags: needinfo?(kairo)
this is a shot in the dark.  I ran my new version of the daily correlation report app, feeding it only crashes that experienced the "memory dump too large" error.  I don't know if this answers your question or is even useful information.
Flags: needinfo?(lars)
What you did attach is the modules part, we'd be interested in the add-ons part here, though.
Flags: needinfo?(kairo) → needinfo?(lars)
Flags: needinfo?(lars)
(In reply to Nicholas Nethercote [:njn] from comment #9)
> > I haven't investigated the second problem yet, though I will, because it
> > means that the data in these files is at least partly bogus, which reduces
> > their usefulness and trustworthiness.
> 
> All the reports that are duplicated 263 times are from the reporters that
> are registered in nsMemoryReporterManager::Init(). This suggests that
> nsMemoryReporterManager::Init() is called 263 times, which is strange and
> definitely not what is supposed to happen, nor what normally happens.
> 
> The only way I can think that Init() would be called multiple times on a
> single nsMemoryReporterManager instance is if an extension does it. (Thanks
> to dbaron for pointing me in this direction.) I simulated this locally and
> saw duplicates of the relevant reports just like those in the oversized
> crash reports. This seems weird, but I guess anything's possible in
> extension-land. It would also explain why this only seems to be a problem on
> a (very?) small fraction of crash reports. If this is the problem, I think
> we can simply remove Init() from the interface to nsMemoryReporterManager; I
> can't see any reason for it to be public.

Removing seems reasonable, but I suppose that would break those add-ons. Could we just add code to make multi-init a noop?
Comment on attachment 8722580 [details]
the addons correlations for memory-dump-too-large

In short, it does not look like we have any significant correlation with any specific add-on here.
Depends on: 1251127
> All the reports that are duplicated 263 times are from the reporters that
> are registered in nsMemoryReporterManager::Init(). This suggests that
> nsMemoryReporterManager::Init() is called 263 times, which is strange and
> definitely not what is supposed to happen, nor what normally happens.
> 
> The only way I can think that Init() would be called multiple times on a
> single nsMemoryReporterManager instance is if an extension does it. (Thanks
> to dbaron for pointing me in this direction.)

In bug 1251127 I added protection against Init() being called multiple times, and just landed the patch on inbound. In combination with bug 1249503's patch having landed, I think I can declare victory here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: