Collect about:memory reports for OOM analysis

RESOLVED FIXED in mozilla34

Status

()

defect
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: dmajor, Assigned: dmajor)

Tracking

unspecified
mozilla34
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(7 attachments, 2 obsolete attachments)

Out-of-memory failures are a major source of crashes, but we can't make much progress with our current crash data. This bug is a proposal to collect some form of about:memory report automatically when the browser is running low on memory. That information could then be sent along with the user's crash report. We will likely need to sanitize the contents for privacy (remove URLs).
How do you determine if the browser is running low on memory?
Whiteboard: [MemShrink]

Comment 2

5 years ago
How much time and memory does it take to collect an about:memory report and store it in memory? Could we just do that periodically (every minute) if the browser isn't idle?

Otherwise we have to monitor things like available page file and available VM and make educated guesses.

Updated

5 years ago
Flags: needinfo?(n.nethercote)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> How much time and memory does it take to collect an about:memory report and
> store it in memory? Could we just do that periodically (every minute) if the
> browser isn't idle?

Nope. We used to do something like this in telemetry pings and eventually removed it because it was causing long hangs for some people. That's because it touches most of the significant data structures in Gecko, and touches every live thing on the JS GC heap, and it does all of this on the main thread. It's really designed to be done in response to user action, so that the user is aware of what's happening and will hopefully tolerate some delay.
Flags: needinfo?(n.nethercote)
What about collecting a report at the point of allocation failure? Could the reporting fit within the Breakpad reservation (currently 36MB), dump the results to a file, and return enough memory for Breakpad to do its thing?
Flags: needinfo?(n.nethercote)
Dumping to file doesn't take that much memory. There's one aspect -- notable string detection -- that can take MiB, but we could disable that. With that in place, I *think* the memory usage would be < 1 MiB, with gzip buffers being the major part.

However, I'm not sure what you mean by "fitting within the Breakpad reservation" -- would all the allocations done during memory report gathering have to occur within a special area? That sounds hard. E.g. each memory reporter (of which there are dozens) does some nsString operations to construct paths, and possibly other stuff.
Flags: needinfo?(n.nethercote)
A privacy sensitive about:memory is certainly not going to be able to report anything much about notable strings anyways.
(In reply to Nicholas Nethercote [:njn] from comment #5)
> However, I'm not sure what you mean by "fitting within the Breakpad
> reservation" -- would all the allocations done during memory report
> gathering have to occur within a special area?

The total allocation and overhead of reporting would have to be less than the reservation size, since that's the only memory available when we're in an OOM situation.
If the question is just "does dumping memory reports to file require less than 36 MiB?", then the answer is "yes" :) As I said, with the notable string detection off, it was less than 1 MiB when I last measured, assuming my measurements were correct.
Good to know. The other concern would be whether that memory could be completely given back to the system after the reporting is completed. If jemalloc holds on to that address space after the free(), then Breakpad may not have enough contiguous memory to collect the minidump. That said, if it's only 1MiB then it may be easier to just have a separate reservation for memory reporting.
> Nope. We used to do something like this in telemetry pings and eventually
> removed it because it was causing long hangs for some people.
How long were those hangs? It may still be a problem even if we get a report at crash time. I would be pretty unhappy to sit through a long hang only to be rewarded with a crash dialog.
Whiteboard: [MemShrink] → [MemShrink:P2]
Depends on: 1010064
In discussion Benjamin pointed out that it's generally not safe to do work in the crash handler since there may be locks held by suspended code. I guess that leaves the option of guessing based on memory availability.

Most out-of-physical-memory failures happen when AvailablePageFile falls under 6MB. Out-of-virtual-memory failures have high variance, partially due to bug 1005844 and bug 1005849. Once those are fixed then I expect we'll have a tighter cluster of failures around 100MB, but there will still be some variance from fragmentation. Maybe we could get a notification from the last-ditch allocators, or perhaps keep an eye on the max-contiguous size (that would require a lot of VirtualQuery calls, unknown cost).
We already compute the largest available block here: http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#215

Can you profile and see how long that takes in good and bad circumstances? I'd be happy to put it that metric on a background thread and only even start computing it if available-vm goes below some larger threshold.

dmajor can you take this?
Assignee: nobody → dmajor
I took some basic measurements with QueryPerformanceCounter. The VirtualQuery walk scales up linearly with memory usage. At the point of failure, it was taking up to 75ms on my fast development machine. GlobalMemoryStatus took 5-7ms independent of memory usage.

At a minimum, the slower measurement should be gated by the faster one. Although, I wonder if VirtualQuery is worth its cost at all. It's possible that GlobalMemoryStatus could be a pretty good predictor by itself. (I still want to see how the VM-waste patches affect the variance numbers)
dmajor, are you still waiting for the jemalloc patch to land to get the final data on when to trigger this?
Flags: needinfo?(dmajor)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #14)
> dmajor, are you still waiting for the jemalloc patch to land to get the
> final data on when to trigger this?

Yes.
Flags: needinfo?(dmajor)

Updated

5 years ago
Depends on: 1026059
Bug 1005844 is going through some iteration, so in the interest of not waiting too long, let's take an educated guess at what the threshold values might look like.

See bug 1001760 comment 4 for some sample crashes. I'm pretty convinced that the jemalloc fix will to drive the Usable numbers down to zero. The Other group is a measurement artifact, so really we just need to worry about Tiny+Misaligned.

Eyeballing it I'd say around 100MB free VA. Perhaps 80 to be conservative, 150 to cast a wide net.

Updated

5 years ago
Assignee: dmajor → benjamin
I have a windows prototype of this which does the following things:

* Adds and implements a method nsICrashReporter.saveMemoryReport()
* copy/move that file to the right place on crash
* submits it in the native crash reporter client

Here is an example crash with the patch:
https://crash-stats.mozilla.com/report/index/6779a0ef-fff2-4f90-9655-c68932140703

The following things still need to happen:
* periodically check for low memory and call saveMemoryReport() as appropriate
* make CrashSubmit.jsm know about the new <uuid>.memoryreport.json.gz file and submit it like the native client does
* Implement the mac and linux-specific bits of the HTTP upload and crashreporter_* code
* Separate out the google-breakpad changes and submit them as an upstream patch
* At least a basic test

dmajor I'm going to hand this back to you for completion; I've run out of time.
Assignee: benjamin → dmajor
Posted patch crash-memoryreport (obsolete) — Splinter Review
Comment on attachment 8450464 [details] [diff] [review]
crash-memoryreport

Review of attachment 8450464 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/xre/nsAppRunner.cpp
@@ +1144,5 @@
> +                                       getter_AddRefs(file));
> +  if (NS_WARN_IF(NS_FAILED(rv))) {
> +    return rv;
> +  }
> +  file->AppendNative(NS_LITERAL_CSTRING("memoryreport.json.gz"));

Super-nit: the default filename when saving from about:memory is memory-report.json.gz. Might as well use the same name (i.e. insert a '-') here.
> Here is an example crash with the patch:
> https://crash-stats.mozilla.com/report/index/6779a0ef-fff2-4f90-9655-
> c68932140703

I looked through all the tabs but I can't see the memory report data. Where is it?
Flags: needinfo?(benjamin)
Sorry, that example just shows that landing this patch won't harm crash reporting. We still need to do the server-side work to save/show the data, which is bug 1026059.
Flags: needinfo?(benjamin)
Comment on attachment 8450464 [details] [diff] [review]
crash-memoryreport

Review of attachment 8450464 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +476,5 @@
> +  }
> +
> +  bool ok = false;
> +  int fdto;
> +  do {

Was this supposed to end with a "while" condition?
Yes probably a "while (!ok)" condition. But that code was entirely untested because on Windows I could just use CopyFile.
Hardware: x86_64 → All
Depends on: 1048091
Is this a scenario that we need to worry about?

* (Whatever trigger we decide on) saves an anonymized memory report in the profile
* User saves non-anonymized memory report in the same location
* Crash reporter submits the non-anonymized report
Getting a head start on review for this piece, because it's completely separate. The other parts are still in progress.
Attachment #8468199 - Flags: review?(ted)
For the test, I assume I should model after crashreport/test/unit/*.js?

I tried to add a new type of crash to nsTestCrasher.cpp, that does this before crashing:
>    nsCOMPtr<nsICrashReporter> cr = do_GetService("@mozilla.org/toolkit/crash-reporter;1");
>    cr->SaveMemoryReport();
but nsXULAppInfo::SaveMemoryReport is failing to NS_GetSpecialDirectory(NS_APP_PROFILE_DIR_STARTUP).

Any idea what I can do here (and/or should I be using some other approach to test this)?
Flags: needinfo?(benjamin)
Meant to include a comment on the attachment: Patch originally from bsmedberg with tweaks by me. This is the XPCOM plumbing to add a saveMemoryReport method. We don't yet do anything with the saved file. The more interesting bits will be in part 2, hopefully tomorrow.
Comment on attachment 8468297 [details] [diff] [review]
Part 1: Add a saveMemoryReport method to nsICrashReporter

Review of attachment 8468297 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty straightforward.
Attachment #8468297 - Flags: review?(nfroyd) → review+
(In reply to David Major [:dmajor] from comment #26)
> For the test, I assume I should model after crashreport/test/unit/*.js?
> 
> I tried to add a new type of crash to nsTestCrasher.cpp, that does this
> before crashing:
> >    nsCOMPtr<nsICrashReporter> cr = do_GetService("@mozilla.org/toolkit/crash-reporter;1");
> >    cr->SaveMemoryReport();
> but nsXULAppInfo::SaveMemoryReport is failing to
> NS_GetSpecialDirectory(NS_APP_PROFILE_DIR_STARTUP).
> 
> Any idea what I can do here (and/or should I be using some other approach to
> test this)?

xpcshell tests don't have a profile directory by default. You can call do_get_profile() in the test to have one available.
Flags: needinfo?(benjamin)
Comment on attachment 8468199 [details] [diff] [review]
Part N: Support memory files in CrashSubmit.jsm

Review of attachment 8468199 [details] [diff] [review]:
-----------------------------------------------------------------

You could write a browser-chrome test for this. The easiest way would be to just jam it into one of the existing tests, like:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/browser/browser_aboutCrashesResubmit.js

::: toolkit/crashreporter/CrashSubmit.jsm
@@ +166,5 @@
>          dump.remove(false);
> +
> +        let memory = extra.clone();
> +        memory.leafName = matches[1] + '.memory.json.gz';
> +        if (memory.exists())

nit: here and elsewhere, brace all your conditionals even if they're just one line.
Attachment #8468199 - Flags: review?(ted) → review+
UIShowCrashUI parameter plumbing.

(This could have just as well been part 2 of bug 1048091, but I'm trying to keep that one Breakpad-only)
Attachment #8469045 - Flags: review?(ted)
So the only missing piece is to have someone actually call SaveMemoryReport. bsmedberg suggests doing this periodically in the main event loop, which conveniently guarantees that we're on the main thread. I'll look into that next...
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #30)
> xpcshell tests don't have a profile directory by default. You can call
> do_get_profile() in the test to have one available.

Is do_get_profile() supposed to be available in the crasher subprocess? I tried inserting a call at [1] but I got an error that it's not defined. (And if I put the call outside do_crash, then the subprocess doesn't get the memo)

[1] http://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/unit/test_crash_oom.js#10
Oh, no, it's not, sorry, that comes from head.js:
http://hg.mozilla.org/mozilla-central/annotate/afcb3af79d09/testing/xpcshell/head.js#l1033

I guess you could copy and paste all that, which sort of sucks. I'm not sure if loading all of head.js into the crasher subprocess will have unintended side effects.
Posted patch xpcshell testSplinter Review
So I made a copy and ripped out everything whose absence didn't lead to an immediate failure (don't actually know what I'm doing). Conveniently that let me get rid of things like |runningInParent| that weren't in scope.

Let me know if this is too atrocious. 

This patch also contains a drive-by fix to re-enable test_crash_oom.js on debug builds since we now use jemalloc: https://hg.mozilla.org/try/rev/99d443ba97c1
Attachment #8469770 - Flags: review?(ted)
Comment on attachment 8469770 [details] [diff] [review]
xpcshell test

Review of attachment 8469770 [details] [diff] [review]:
-----------------------------------------------------------------

I will add some .gz cleanup to handleMinidump of head_crashreporter.js next time I'm in this patch.
Having trouble finding my way around the message loop code. Can you provide some pointers on a good place to add code? Are there any existing facilities that could be used (perhaps DoDelayedWork/DoIdleWork)?
Flags: needinfo?(benjamin)
Comment on attachment 8469770 [details] [diff] [review]
xpcshell test

Review of attachment 8469770 [details] [diff] [review]:
-----------------------------------------------------------------

It's not fantastic, but I don't have a better idea.
Attachment #8469770 - Flags: review?(ted) → review+
The best place in the message loop is probably at http://hg.mozilla.org/mozilla-central/annotate/96a566fa1599/xpcom/threads/nsThread.cpp#l711

delayed/idle work are chromium messageloop things that probably don't work correctly in Mozilla code.
Flags: needinfo?(benjamin)
Attachment #8471409 - Flags: review?(benjamin)
Comment on attachment 8471409 [details] [diff] [review]
Part 4: Periodically save memory reports

3 minutes seems a bit long to me for just the check. CheckForLowMemory itself without actually collecting low-memory should be very fast. I guess the question is what happens once we are in a low-memory situation: we don't want to collect about:memory every 30 seconds, so 3 minutes makes a lot more sense there. Do we need two separate timers, one for checking and one for collecting? Or maybe a simple timestamp plus a flag "collected" or "checked".

PR_Now is subject to clock-skew; this may not be a big deal, and it has the advantage that we can use a simple static PRTime. mozilla::TimeStamp solves the clock skew issues but I don't know whether you're allowed to use a static variable. glandium do you know if static TimeStamp is ok?
Attachment #8471409 - Flags: feedback?(mh+mozilla)
I split the interval into two, depending on the result of the lowmem check.

Bumped the virtual memory threshold since I am losing faith that bug 1005844 will land anytime soon.

Looked through the TimeStamp source, other usage, and disassembly; static use seems OK but keeping f?glandium.
Attachment #8471409 - Attachment is obsolete: true
Attachment #8471409 - Flags: review?(benjamin)
Attachment #8471409 - Flags: feedback?(mh+mozilla)
Attachment #8472036 - Flags: review?(benjamin)
Attachment #8472036 - Flags: feedback?(mh+mozilla)

Updated

5 years ago
Attachment #8472036 - Flags: review?(benjamin) → review+
Comment on attachment 8472036 [details] [diff] [review]
Part 4: Periodically save memory reports

Review of attachment 8472036 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/nsThread.cpp
@@ +769,5 @@
> +    // but save memory reports (expensive, ~75ms) less frequently.
> +    const size_t LOW_MEMORY_CHECK_SECONDS = 30;
> +    const size_t LOW_MEMORY_SAVE_SECONDS = 3 * 60;
> +
> +    static TimeStamp nextCheck = TimeStamp::NowLoRes()

static variables in a function scope are always fine.
Attachment #8472036 - Flags: feedback?(mh+mozilla) → feedback+
OS: Windows 7 → All
Comment on attachment 8469045 [details] [diff] [review]
Part 2: Support for multiple upload files in the crash reporter clients

Review of attachment 8469045 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the review lag. Thanks for splitting these patches up into logical chunks, it makes reviewing much easier.

::: toolkit/crashreporter/client/crashreporter_win.cpp
@@ +1305,5 @@
> +       i != files.end();
> +       i++) {
> +    gSendData.files[UTF8ToWide(i->first)] = UTF8ToWide(i->second);
> +  }
> +  

nit: whitespace
Attachment #8469045 - Flags: review?(ted) → review+
Comment on attachment 8469046 [details] [diff] [review]
Part 3: Submit about:memory data from the native crash client

Review of attachment 8469046 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with one thing fixed, the sys_ wrappers are important.

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +155,5 @@
>  static const XP_CHAR extraFileExtension[] = {'.', 'e', 'x', 't',
>                                               'r', 'a', '\0'}; // .extra
>  
> +static const XP_CHAR memoryReportExtension[] = {'.', 'm', 'e', 'm', 'o', 'r',
> +  'y', '.', 'j', 's', 'o', 'n', '.', 'g', 'z', '\0'}; // .memory.json.gz

You know, maybe we should define a macro like Windows' _T() and use it for these...

@@ +473,5 @@
> +// Like Windows CopyFile for *nix
> +bool copy_file(const char* from, const char* to)
> +{
> +  const int kBufSize = 4096;
> +  int fdfrom = open(from, O_RDONLY);

It's important to use the sys_ wrappers for all of these libc functions since we call these in a potentially compromised process and don't want to invoke the dynamic linker. (They're #defined to the usual names to work on non-Linux up the top of this file.)

@@ +576,5 @@
> +  if (memoryReportPath) {
> +#ifdef XP_WIN
> +    CopyFile(memoryReportPath, memoryReportLocalPath, false);
> +#else
> +    copy_file(memoryReportPath, memoryReportLocalPath);

Generally I've been liberal with the use of #defines up the top of the file (XP_whatever) to avoid ifdefs in the source, but you can use your judgement since this is only one case.
Attachment #8469046 - Flags: review?(ted) → review+
Depends on: 1069165
You need to log in before you can comment on or make changes to this bug.