Closed Bug 1587722 Opened 2 years ago Closed 2 years ago

Add memory statistics to crashes on Linux

Categories

(Toolkit :: Crash Reporting, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla74
Tracking Status
firefox74 --- fixed

People

(Reporter: gsvelto, Assigned: Yoric)

References

Details

Attachments

(4 files, 1 obsolete file)

We currently write out memory statistics upon a crash only on Windows. This bug is for adding Linux support. We'll need to parse /proc/vmstat, /proc/meminfo or both to gather the required information. Documentation on both files is available in Linux proc manpage.

Note that the contents of both files is kernel-dependent so the parsing code will need to be sufficiently robust to handle that.

Assignee: nobody → dteller

(currently installing a Linux VM for testing this)

Thanks David, note that on Linux we have this type of gunk to avoid using libc functions:

https://searchfox.org/mozilla-central/rev/073b138dcba41cd3f858522e5f0a9ee73e39afa0/toolkit/crashreporter/nsExceptionHandler.cpp#155-156

It's not pretty, in fact it's downright ugly but there's no way around it for now.

Thanks! That is ugly indeed.

Submitting a first draft for feedback, I haven't had the opportunity to test it on Linux yet, my VM crashes on me during install – and I still haven't received my Linux/Windows machine.

Attached file data-review.md (obsolete) —

Ok, let's try and do this correctly this time :)

Attachment #9113718 - Flags: data-review?(chutten)
Comment on attachment 9113718 [details]
data-review.md

Looks like this review request is missing answers to questions 6 and onwards. It'll be permanent collection, :Yoric is responsible, all channels, the Telemetry "crash" ping part of the collection can be opted out via Firefox's Preferences, we need an answer for analysis questions, and you're using the Mozilla backend.
Attachment #9113718 - Flags: data-review?(chutten) → data-review-
Attached file data-review.md

Oops, added missing questions/responses.

Attachment #9113718 - Attachment is obsolete: true
Attachment #9114084 - Flags: data-review?(chutten)

Depends on bug 1563403 <-- was that a typo?

Sorry, that should have been bug 1601872 (which was caused by a regression in bug 1563403, so I'm not entirely off :) ).

Depends on: 1601872
No longer depends on: 1563403

Error while running ./mach build
ERROR: Rust compiler 1.36.0 is too old.
0:03.68 To compile Rust language sources please install at least
0:03.68 version 1.37.0 of the 'rustc' toolchain and make sure it is
0:03.68 first in your path.
0:03.68 You can verify this by typing 'rustc --version'.
0:03.68 If you have the 'rustup' tool installed you can upgrade
0:03.68 to the latest release by typing 'rustup update'. The
0:03.68 installer is available from https://rustup.rs/
0:03.72 *** Fix above errors and then restart with
0:03.72 "./mach build"
0:03.72 client.mk:111: recipe for target 'configure' failed
0:03.72 make: *** [configure] Error 1

I tried upgrading it but does not work
I tried installing rustup but says

info: downloading installer
error: it looks like you have an existing installation of Rust at:
error: /usr/bin
error: rustup cannot be installed alongside Rust. Please uninstall first
error: if this is what you want, restart the installation with `-y'
error: cannot install while Rust is installed

Comment on attachment 9114084 [details]
data-review.md

DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. This collection is documented in its definitions file [CrashAnnotations.yaml](https://searchfox.org/mozilla-central/source/toolkit/crashreporter/CrashAnnotations.yaml).

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection can be controlled through Firefox's Preferences. Because it is sent opt-out by Telemetry and opt-in in crash reports, the user should pay special attention to the settings.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, :Yoric is responsible.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

    Is the data collection request for default-on or default-off?

The annotations in the "crash" ping are default on for all channels. The aspects in crash reports are opt-in.

    Does the instrumentation include the addition of any new identifiers?

No.

    Is the data collection covered by the existing Firefox privacy notice?

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.

---
Result: datareview+
Attachment #9114084 - Flags: data-review?(chutten) → data-review+
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/19ffd117ebbd
Memory statistics for Linux crashes;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/c8e7456abc5b
Tests for memory statistics on crashes;r=gsvelto

So, for some reason, my patch seems to remove the dumpID under Linux (or at least BrowserTestUtils won't find it).

I'm digging to understand what's going on.

Flags: needinfo?(dteller)

I can't reproduce locally, but I can 100% reproduce on try.

According to my debug-by-try attempts, the following line kills the crashreporter:

int fd = sys_open("/proc/meminfo", O_RDONLY, /* chmod */ 0);

Except it crashes only in mochitests (works nicely on xpcshell) and so far only on try.

Gabriele, do you know if there could be some sandbox restriction at work here?

Flags: needinfo?(gsvelto)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #16)

I can't reproduce locally, but I can 100% reproduce on try.

According to my debug-by-try attempts, the following line kills the crashreporter:

int fd = sys_open("/proc/meminfo", O_RDONLY, /* chmod */ 0);

Except it crashes only in mochitests (works nicely on xpcshell) and so far only on try.

Gabriele, do you know if there could be some sandbox restriction at work here?

Oh yes, that might be the problem. This is happening in the exception handler so if it's a content process that's crashing it's running in the content process context which would be sandboxed. Fortunately there should be a relatively simple fix for this. Right now we're writing out the memory information annotations in the content process exception handler:

https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/toolkit/crashreporter/nsExceptionHandler.cpp#1352

But there's no need for that, these annotations are global values so the main process could read them and add them to the crashed process' annotations when we add all the other ones here:

https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/toolkit/crashreporter/nsExceptionHandler.cpp#2919

This runs in the context of the main process so it wouldn't run afoul of the sandbox.

One small issue with this is that the WriteMemoryStatus() function was designed to write to a file and in this case we wouldn't need to do that because we would have access to the crash annotations table directly so some refactoring might be needed.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #17)

But there's no need for that, these annotations are global values so the main process could read them and add them to the crashed process' annotations when we add all the other ones here:

Won't we get the memory information after cleanup instead of the memory information before cleanup?

Flags: needinfo?(gsvelto)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #18)

Won't we get the memory information after cleanup instead of the memory information before cleanup?

IIRC we don't actually kill the content process until we're done with crash reporting. The ReadExceptionTimeAnnotations() function is called from within Breakpad's child-crash callback and I'm fairly sure that the process doesn't get killed until after it returns because I've seen cases where we hit a second exception in the same content process (but on a different thread) while we were processing the first one in that function.

So in theory the memory scenario should be pretty much the same. The only difference I can think of is on Windows the AvailableVirtualMemory information is process-specific; all the other fields should be global measurements.

Flags: needinfo?(gsvelto)

Ok, I have a prototype. It works on bc tests but oranges test_oom_annotations, though. I still need to find where to hook it so that it works for xpcshell tests.

Gabriele, I can't find any reason for it to fail on test_oom_annotations. The codepaths you gave me above cover both content processes and parent processes, right?

Flags: needinfo?(gsvelto)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #21)

Gabriele, I can't find any reason for it to fail on test_oom_annotations. The codepaths you gave me above cover both content processes and parent processes, right?

Yes, there's only those two. Can you show me how it's failing?

Flags: needinfo?(gsvelto)

It's the X6 here. I have annotated AnnotateMemoryStatus to print YORIC> and some stuff whenever we enter/exit the function. The printouts show up in bc1 but not in X6.

Flags: needinfo?(gsvelto)

I'm looking into this, sorry if it took a while but I'm swamped with reviews this week.

I ran this locally and it was running just fine so I was utterly stumped until I noticed that the AvailablePageFile annotation is written conditionally depending on the presence of the CommitLimit and Committed_AS fields in /proc/meminfo. As it turns out those are optional in Linux so it's likely that our automation machines aren't using overcommit accounting and thus those values aren't populated.

You'll have to adjust the test, maybe making it so that it's OK if AvailablePageFile is present but we don't fail if it isn't.

Flags: needinfo?(gsvelto)

Well, reading the logs, I conclude that the memory statistics collection code doesn't seem to run at all. I'll double-check that I've made the change you requested, but here's the same test without AvailablePageFile: this time, AvailablePhysicalMemory isn't present, either.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=937e75ec2daa780d138c7381c4bfa21c345be455&selectedJob=281883731

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #26)

Well, reading the logs, I conclude that the memory statistics collection code doesn't seem to run at all. I'll double-check that I've made the change you requested, but here's the same test without AvailablePageFile: this time, AvailablePhysicalMemory isn't present, either.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=937e75ec2daa780d138c7381c4bfa21c345be455&selectedJob=281883731

Ouch, seems like this is a hard nut to crack. I'll double-check too and see what happens.

So, I think the problem is that function do_crash() reads the subprocess dump without going through the main process. If I replace this with do_content_crash(), it seems to work nicely.

Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c9dbcfa6a938
Memory statistics for Linux crashes;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/7d664635c1dc
Tests for memory statistics on crashes;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/6aad53d9e447
Move crash memory statistics to the parent process;r=gsvelto

Investigating.

Flags: needinfo?(dteller)
Pushed by dteller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c8f1f5a0fe57
Memory statistics for Linux crashes;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/00dfa64114e6
Tests for memory statistics on crashes;r=gsvelto
https://hg.mozilla.org/integration/autoland/rev/262a7dd60d12
Move crash memory statistics to the parent process;r=gsvelto
Regressions: 1649132
You need to log in before you can comment on or make changes to this bug.