Closed
Bug 1007534
Opened 11 years ago
Closed 10 years ago
Collect about:memory reports for OOM analysis
Categories
(Toolkit :: about:memory, defect)
Toolkit
about:memory
Tracking
()
RESOLVED
FIXED
mozilla34
People
(Reporter: away, Assigned: away)
References
Details
(Whiteboard: [MemShrink:P2])
Attachments
(7 files, 2 obsolete files)
5.62 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
6.67 KB,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
11.35 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
11.84 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
3.97 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
3.38 KB,
patch
|
benjamin
:
review+
glandium
:
feedback+
|
Details | Diff | Splinter Review |
278.87 KB,
application/x-gzip
|
Details |
Out-of-memory failures are a major source of crashes, but we can't make much progress with our current crash data. This bug is a proposal to collect some form of about:memory report automatically when the browser is running low on memory. That information could then be sent along with the user's crash report. We will likely need to sanitize the contents for privacy (remove URLs).
Comment 1•11 years ago
|
||
How do you determine if the browser is running low on memory?
Whiteboard: [MemShrink]
Comment 2•11 years ago
|
||
How much time and memory does it take to collect an about:memory report and store it in memory? Could we just do that periodically (every minute) if the browser isn't idle?
Otherwise we have to monitor things like available page file and available VM and make educated guesses.
Updated•11 years ago
|
Flags: needinfo?(n.nethercote)
Comment 3•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #2)
> How much time and memory does it take to collect an about:memory report and
> store it in memory? Could we just do that periodically (every minute) if the
> browser isn't idle?
Nope. We used to do something like this in telemetry pings and eventually removed it because it was causing long hangs for some people. That's because it touches most of the significant data structures in Gecko, and touches every live thing on the JS GC heap, and it does all of this on the main thread. It's really designed to be done in response to user action, so that the user is aware of what's happening and will hopefully tolerate some delay.
Flags: needinfo?(n.nethercote)
What about collecting a report at the point of allocation failure? Could the reporting fit within the Breakpad reservation (currently 36MB), dump the results to a file, and return enough memory for Breakpad to do its thing?
Flags: needinfo?(n.nethercote)
Comment 5•11 years ago
|
||
Dumping to file doesn't take that much memory. There's one aspect -- notable string detection -- that can take MiB, but we could disable that. With that in place, I *think* the memory usage would be < 1 MiB, with gzip buffers being the major part.
However, I'm not sure what you mean by "fitting within the Breakpad reservation" -- would all the allocations done during memory report gathering have to occur within a special area? That sounds hard. E.g. each memory reporter (of which there are dozens) does some nsString operations to construct paths, and possibly other stuff.
Flags: needinfo?(n.nethercote)
Comment 6•11 years ago
|
||
A privacy sensitive about:memory is certainly not going to be able to report anything much about notable strings anyways.
(In reply to Nicholas Nethercote [:njn] from comment #5)
> However, I'm not sure what you mean by "fitting within the Breakpad
> reservation" -- would all the allocations done during memory report
> gathering have to occur within a special area?
The total allocation and overhead of reporting would have to be less than the reservation size, since that's the only memory available when we're in an OOM situation.
Comment 8•11 years ago
|
||
If the question is just "does dumping memory reports to file require less than 36 MiB?", then the answer is "yes" :) As I said, with the notable string detection off, it was less than 1 MiB when I last measured, assuming my measurements were correct.
Good to know. The other concern would be whether that memory could be completely given back to the system after the reporting is completed. If jemalloc holds on to that address space after the free(), then Breakpad may not have enough contiguous memory to collect the minidump. That said, if it's only 1MiB then it may be easier to just have a separate reservation for memory reporting.
Assignee | ||
Comment 10•11 years ago
|
||
> Nope. We used to do something like this in telemetry pings and eventually
> removed it because it was causing long hangs for some people.
How long were those hangs? It may still be a problem even if we get a report at crash time. I would be pretty unhappy to sit through a long hang only to be rewarded with a crash dialog.
Updated•10 years ago
|
Whiteboard: [MemShrink] → [MemShrink:P2]
Assignee | ||
Comment 11•10 years ago
|
||
In discussion Benjamin pointed out that it's generally not safe to do work in the crash handler since there may be locks held by suspended code. I guess that leaves the option of guessing based on memory availability.
Most out-of-physical-memory failures happen when AvailablePageFile falls under 6MB. Out-of-virtual-memory failures have high variance, partially due to bug 1005844 and bug 1005849. Once those are fixed then I expect we'll have a tighter cluster of failures around 100MB, but there will still be some variance from fragmentation. Maybe we could get a notification from the last-ditch allocators, or perhaps keep an eye on the max-contiguous size (that would require a lot of VirtualQuery calls, unknown cost).
Comment 12•10 years ago
|
||
We already compute the largest available block here: http://mxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#215
Can you profile and see how long that takes in good and bad circumstances? I'd be happy to put it that metric on a background thread and only even start computing it if available-vm goes below some larger threshold.
dmajor can you take this?
Assignee: nobody → dmajor
Assignee | ||
Comment 13•10 years ago
|
||
I took some basic measurements with QueryPerformanceCounter. The VirtualQuery walk scales up linearly with memory usage. At the point of failure, it was taking up to 75ms on my fast development machine. GlobalMemoryStatus took 5-7ms independent of memory usage.
At a minimum, the slower measurement should be gated by the faster one. Although, I wonder if VirtualQuery is worth its cost at all. It's possible that GlobalMemoryStatus could be a pretty good predictor by itself. (I still want to see how the VM-waste patches affect the variance numbers)
Comment 14•10 years ago
|
||
dmajor, are you still waiting for the jemalloc patch to land to get the final data on when to trigger this?
Flags: needinfo?(dmajor)
Assignee | ||
Comment 15•10 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #14)
> dmajor, are you still waiting for the jemalloc patch to land to get the
> final data on when to trigger this?
Yes.
Flags: needinfo?(dmajor)
Assignee | ||
Comment 16•10 years ago
|
||
Bug 1005844 is going through some iteration, so in the interest of not waiting too long, let's take an educated guess at what the threshold values might look like.
See bug 1001760 comment 4 for some sample crashes. I'm pretty convinced that the jemalloc fix will to drive the Usable numbers down to zero. The Other group is a measurement artifact, so really we just need to worry about Tiny+Misaligned.
Eyeballing it I'd say around 100MB free VA. Perhaps 80 to be conservative, 150 to cast a wide net.
Updated•10 years ago
|
Assignee: dmajor → benjamin
Comment 17•10 years ago
|
||
I have a windows prototype of this which does the following things:
* Adds and implements a method nsICrashReporter.saveMemoryReport()
* copy/move that file to the right place on crash
* submits it in the native crash reporter client
Here is an example crash with the patch:
https://crash-stats.mozilla.com/report/index/6779a0ef-fff2-4f90-9655-c68932140703
The following things still need to happen:
* periodically check for low memory and call saveMemoryReport() as appropriate
* make CrashSubmit.jsm know about the new <uuid>.memoryreport.json.gz file and submit it like the native client does
* Implement the mac and linux-specific bits of the HTTP upload and crashreporter_* code
* Separate out the google-breakpad changes and submit them as an upstream patch
* At least a basic test
dmajor I'm going to hand this back to you for completion; I've run out of time.
Assignee: benjamin → dmajor
Comment 18•10 years ago
|
||
Comment 19•10 years ago
|
||
Comment on attachment 8450464 [details] [diff] [review]
crash-memoryreport
Review of attachment 8450464 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/xre/nsAppRunner.cpp
@@ +1144,5 @@
> + getter_AddRefs(file));
> + if (NS_WARN_IF(NS_FAILED(rv))) {
> + return rv;
> + }
> + file->AppendNative(NS_LITERAL_CSTRING("memoryreport.json.gz"));
Super-nit: the default filename when saving from about:memory is memory-report.json.gz. Might as well use the same name (i.e. insert a '-') here.
Comment 20•10 years ago
|
||
> Here is an example crash with the patch:
> https://crash-stats.mozilla.com/report/index/6779a0ef-fff2-4f90-9655-
> c68932140703
I looked through all the tabs but I can't see the memory report data. Where is it?
Flags: needinfo?(benjamin)
Comment 21•10 years ago
|
||
Sorry, that example just shows that landing this patch won't harm crash reporting. We still need to do the server-side work to save/show the data, which is bug 1026059.
Flags: needinfo?(benjamin)
Assignee | ||
Comment 22•10 years ago
|
||
Comment on attachment 8450464 [details] [diff] [review]
crash-memoryreport
Review of attachment 8450464 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +476,5 @@
> + }
> +
> + bool ok = false;
> + int fdto;
> + do {
Was this supposed to end with a "while" condition?
Comment 23•10 years ago
|
||
Yes probably a "while (!ok)" condition. But that code was entirely untested because on Windows I could just use CopyFile.
Updated•10 years ago
|
Hardware: x86_64 → All
Assignee | ||
Comment 24•10 years ago
|
||
Is this a scenario that we need to worry about?
* (Whatever trigger we decide on) saves an anonymized memory report in the profile
* User saves non-anonymized memory report in the same location
* Crash reporter submits the non-anonymized report
Assignee | ||
Comment 25•10 years ago
|
||
Getting a head start on review for this piece, because it's completely separate. The other parts are still in progress.
Attachment #8468199 -
Flags: review?(ted)
Assignee | ||
Comment 26•10 years ago
|
||
For the test, I assume I should model after crashreport/test/unit/*.js?
I tried to add a new type of crash to nsTestCrasher.cpp, that does this before crashing:
> nsCOMPtr<nsICrashReporter> cr = do_GetService("@mozilla.org/toolkit/crash-reporter;1");
> cr->SaveMemoryReport();
but nsXULAppInfo::SaveMemoryReport is failing to NS_GetSpecialDirectory(NS_APP_PROFILE_DIR_STARTUP).
Any idea what I can do here (and/or should I be using some other approach to test this)?
Flags: needinfo?(benjamin)
Assignee | ||
Comment 27•10 years ago
|
||
Attachment #8468297 -
Flags: review?(nfroyd)
Assignee | ||
Comment 28•10 years ago
|
||
Meant to include a comment on the attachment: Patch originally from bsmedberg with tweaks by me. This is the XPCOM plumbing to add a saveMemoryReport method. We don't yet do anything with the saved file. The more interesting bits will be in part 2, hopefully tomorrow.
Comment 29•10 years ago
|
||
Comment on attachment 8468297 [details] [diff] [review]
Part 1: Add a saveMemoryReport method to nsICrashReporter
Review of attachment 8468297 [details] [diff] [review]:
-----------------------------------------------------------------
This looks pretty straightforward.
Attachment #8468297 -
Flags: review?(nfroyd) → review+
Comment 30•10 years ago
|
||
(In reply to David Major [:dmajor] from comment #26)
> For the test, I assume I should model after crashreport/test/unit/*.js?
>
> I tried to add a new type of crash to nsTestCrasher.cpp, that does this
> before crashing:
> > nsCOMPtr<nsICrashReporter> cr = do_GetService("@mozilla.org/toolkit/crash-reporter;1");
> > cr->SaveMemoryReport();
> but nsXULAppInfo::SaveMemoryReport is failing to
> NS_GetSpecialDirectory(NS_APP_PROFILE_DIR_STARTUP).
>
> Any idea what I can do here (and/or should I be using some other approach to
> test this)?
xpcshell tests don't have a profile directory by default. You can call do_get_profile() in the test to have one available.
Flags: needinfo?(benjamin)
Comment 31•10 years ago
|
||
Comment on attachment 8468199 [details] [diff] [review]
Part N: Support memory files in CrashSubmit.jsm
Review of attachment 8468199 [details] [diff] [review]:
-----------------------------------------------------------------
You could write a browser-chrome test for this. The easiest way would be to just jam it into one of the existing tests, like:
http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/browser/browser_aboutCrashesResubmit.js
::: toolkit/crashreporter/CrashSubmit.jsm
@@ +166,5 @@
> dump.remove(false);
> +
> + let memory = extra.clone();
> + memory.leafName = matches[1] + '.memory.json.gz';
> + if (memory.exists())
nit: here and elsewhere, brace all your conditionals even if they're just one line.
Attachment #8468199 -
Flags: review?(ted) → review+
Assignee | ||
Comment 32•10 years ago
|
||
UIShowCrashUI parameter plumbing.
(This could have just as well been part 2 of bug 1048091, but I'm trying to keep that one Breakpad-only)
Attachment #8469045 -
Flags: review?(ted)
Assignee | ||
Comment 33•10 years ago
|
||
Attachment #8450464 -
Attachment is obsolete: true
Attachment #8469046 -
Flags: review?(ted)
Assignee | ||
Comment 34•10 years ago
|
||
So the only missing piece is to have someone actually call SaveMemoryReport. bsmedberg suggests doing this periodically in the main event loop, which conveniently guarantees that we're on the main thread. I'll look into that next...
Assignee | ||
Comment 35•10 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #30)
> xpcshell tests don't have a profile directory by default. You can call
> do_get_profile() in the test to have one available.
Is do_get_profile() supposed to be available in the crasher subprocess? I tried inserting a call at [1] but I got an error that it's not defined. (And if I put the call outside do_crash, then the subprocess doesn't get the memo)
[1] http://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/test/unit/test_crash_oom.js#10
Comment 36•10 years ago
|
||
Oh, no, it's not, sorry, that comes from head.js:
http://hg.mozilla.org/mozilla-central/annotate/afcb3af79d09/testing/xpcshell/head.js#l1033
I guess you could copy and paste all that, which sort of sucks. I'm not sure if loading all of head.js into the crasher subprocess will have unintended side effects.
Assignee | ||
Comment 37•10 years ago
|
||
So I made a copy and ripped out everything whose absence didn't lead to an immediate failure (don't actually know what I'm doing). Conveniently that let me get rid of things like |runningInParent| that weren't in scope.
Let me know if this is too atrocious.
This patch also contains a drive-by fix to re-enable test_crash_oom.js on debug builds since we now use jemalloc: https://hg.mozilla.org/try/rev/99d443ba97c1
Attachment #8469770 -
Flags: review?(ted)
Assignee | ||
Comment 38•10 years ago
|
||
Comment on attachment 8469770 [details] [diff] [review]
xpcshell test
Review of attachment 8469770 [details] [diff] [review]:
-----------------------------------------------------------------
I will add some .gz cleanup to handleMinidump of head_crashreporter.js next time I'm in this patch.
Assignee | ||
Comment 39•10 years ago
|
||
Having trouble finding my way around the message loop code. Can you provide some pointers on a good place to add code? Are there any existing facilities that could be used (perhaps DoDelayedWork/DoIdleWork)?
Flags: needinfo?(benjamin)
Comment 40•10 years ago
|
||
Comment on attachment 8469770 [details] [diff] [review]
xpcshell test
Review of attachment 8469770 [details] [diff] [review]:
-----------------------------------------------------------------
It's not fantastic, but I don't have a better idea.
Attachment #8469770 -
Flags: review?(ted) → review+
Comment 41•10 years ago
|
||
The best place in the message loop is probably at http://hg.mozilla.org/mozilla-central/annotate/96a566fa1599/xpcom/threads/nsThread.cpp#l711
delayed/idle work are chromium messageloop things that probably don't work correctly in Mozilla code.
Flags: needinfo?(benjamin)
Assignee | ||
Comment 42•10 years ago
|
||
Attachment #8471409 -
Flags: review?(benjamin)
Comment 43•10 years ago
|
||
Comment on attachment 8471409 [details] [diff] [review]
Part 4: Periodically save memory reports
3 minutes seems a bit long to me for just the check. CheckForLowMemory itself without actually collecting low-memory should be very fast. I guess the question is what happens once we are in a low-memory situation: we don't want to collect about:memory every 30 seconds, so 3 minutes makes a lot more sense there. Do we need two separate timers, one for checking and one for collecting? Or maybe a simple timestamp plus a flag "collected" or "checked".
PR_Now is subject to clock-skew; this may not be a big deal, and it has the advantage that we can use a simple static PRTime. mozilla::TimeStamp solves the clock skew issues but I don't know whether you're allowed to use a static variable. glandium do you know if static TimeStamp is ok?
Attachment #8471409 -
Flags: feedback?(mh+mozilla)
Assignee | ||
Comment 44•10 years ago
|
||
I split the interval into two, depending on the result of the lowmem check.
Bumped the virtual memory threshold since I am losing faith that bug 1005844 will land anytime soon.
Looked through the TimeStamp source, other usage, and disassembly; static use seems OK but keeping f?glandium.
Attachment #8471409 -
Attachment is obsolete: true
Attachment #8471409 -
Flags: review?(benjamin)
Attachment #8471409 -
Flags: feedback?(mh+mozilla)
Attachment #8472036 -
Flags: review?(benjamin)
Attachment #8472036 -
Flags: feedback?(mh+mozilla)
Updated•10 years ago
|
Attachment #8472036 -
Flags: review?(benjamin) → review+
Comment 45•10 years ago
|
||
Comment on attachment 8472036 [details] [diff] [review]
Part 4: Periodically save memory reports
Review of attachment 8472036 [details] [diff] [review]:
-----------------------------------------------------------------
::: xpcom/threads/nsThread.cpp
@@ +769,5 @@
> + // but save memory reports (expensive, ~75ms) less frequently.
> + const size_t LOW_MEMORY_CHECK_SECONDS = 30;
> + const size_t LOW_MEMORY_SAVE_SECONDS = 3 * 60;
> +
> + static TimeStamp nextCheck = TimeStamp::NowLoRes()
static variables in a function scope are always fine.
Attachment #8472036 -
Flags: feedback?(mh+mozilla) → feedback+
Comment 46•10 years ago
|
||
Example
Updated•10 years ago
|
OS: Windows 7 → All
Comment 47•10 years ago
|
||
Comment on attachment 8469045 [details] [diff] [review]
Part 2: Support for multiple upload files in the crash reporter clients
Review of attachment 8469045 [details] [diff] [review]:
-----------------------------------------------------------------
Sorry for the review lag. Thanks for splitting these patches up into logical chunks, it makes reviewing much easier.
::: toolkit/crashreporter/client/crashreporter_win.cpp
@@ +1305,5 @@
> + i != files.end();
> + i++) {
> + gSendData.files[UTF8ToWide(i->first)] = UTF8ToWide(i->second);
> + }
> +
nit: whitespace
Attachment #8469045 -
Flags: review?(ted) → review+
Comment 48•10 years ago
|
||
Comment on attachment 8469046 [details] [diff] [review]
Part 3: Submit about:memory data from the native crash client
Review of attachment 8469046 [details] [diff] [review]:
-----------------------------------------------------------------
r=me with one thing fixed, the sys_ wrappers are important.
::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +155,5 @@
> static const XP_CHAR extraFileExtension[] = {'.', 'e', 'x', 't',
> 'r', 'a', '\0'}; // .extra
>
> +static const XP_CHAR memoryReportExtension[] = {'.', 'm', 'e', 'm', 'o', 'r',
> + 'y', '.', 'j', 's', 'o', 'n', '.', 'g', 'z', '\0'}; // .memory.json.gz
You know, maybe we should define a macro like Windows' _T() and use it for these...
@@ +473,5 @@
> +// Like Windows CopyFile for *nix
> +bool copy_file(const char* from, const char* to)
> +{
> + const int kBufSize = 4096;
> + int fdfrom = open(from, O_RDONLY);
It's important to use the sys_ wrappers for all of these libc functions since we call these in a potentially compromised process and don't want to invoke the dynamic linker. (They're #defined to the usual names to work on non-Linux up the top of this file.)
@@ +576,5 @@
> + if (memoryReportPath) {
> +#ifdef XP_WIN
> + CopyFile(memoryReportPath, memoryReportLocalPath, false);
> +#else
> + copy_file(memoryReportPath, memoryReportLocalPath);
Generally I've been liberal with the use of #defines up the top of the file (XP_whatever) to avoid ifdefs in the source, but you can use your judgement since this is only one case.
Attachment #8469046 -
Flags: review?(ted) → review+
Assignee | ||
Comment 49•10 years ago
|
||
Added the sys_ wrappers and macro'd the silly char arrays.
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8d2fb4b0279
https://hg.mozilla.org/integration/mozilla-inbound/rev/f95219244e18
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc77de51d4b6
https://hg.mozilla.org/integration/mozilla-inbound/rev/e714eaf1ae9d
https://hg.mozilla.org/integration/mozilla-inbound/rev/bc7c6f1deaf1
https://hg.mozilla.org/integration/mozilla-inbound/rev/bf7861340733
Comment 50•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/a8d2fb4b0279
https://hg.mozilla.org/mozilla-central/rev/f95219244e18
https://hg.mozilla.org/mozilla-central/rev/fc77de51d4b6
https://hg.mozilla.org/mozilla-central/rev/e714eaf1ae9d
https://hg.mozilla.org/mozilla-central/rev/bc7c6f1deaf1
https://hg.mozilla.org/mozilla-central/rev/bf7861340733
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
You need to log in
before you can comment on or make changes to this bug.
Description
•