Closed Bug 1135543 Opened 7 years ago Closed 7 years ago

Report OOM crashes in FHR

Categories

(Toolkit :: Telemetry, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: away, Unassigned)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(3 files, 3 obsolete files)

This will tell us whether users are leaving due to OOMs.
Attached patch Report OOM crashes in telemetry (obsolete) — Splinter Review
Attachment #8567746 - Flags: review?(benjamin)
Whiteboard: [MemShrink] → [MemShrink:P2]
Comment on attachment 8567746 [details] [diff] [review]
Report OOM crashes in telemetry

I apologize it's taken me a while to get to this. I'm concerned about the overall data design and the related lack of a useful commit message or comments.

My understanding of what this patch does is if we have an OOM allocation size, it will write a marker file to disk (where exactly? It appears to not be in the profile but outside somewhere). Then on startup, we check for the existence of the marker and record a single CRASH_OUT_OF_MEMORY flag in the first session of the new launch.

I'm concerned about this for the following combination of reasons:
* We need to have a system for recording app crashes in general in telemetry (bug 1121013 covers this). I expect that we should coordinate those.
* We already have the CrashManager system for recording crash events and processing them near startup. Rather than using a separate marker, it sounds better to me to use that existing system to add the additional OOM marker. See http://hg.mozilla.org/mozilla-central/annotate/a27dd304348d/toolkit/crashreporter/nsExceptionHandler.cpp#l704 for where we write the crash event, and http://hg.mozilla.org/mozilla-central/annotate/a27dd304348d/toolkit/components/crashes/CrashManager.jsm#l515 for where we read it.

I suspect that what we want to do is centralize both this recording and the recording from bug 1121013 in the crash manager. This will avoid additional main-thread I/O and avoid creating yet another system.
Attachment #8567746 - Flags: review?(benjamin) → review-
I was trying to follow the existing pattern by putting the OOM marker in the same directory as the "LastCrash" and "LastCrashFileName" files.

I will look into the CrashManager event code. At first glance it does look like it could be a better approach.

ni?: It's not clear to me what you are suggesting regarding bug 1121013. (I don't know any of its backstory) Must that bug be resolved before we can proceed here?
Flags: needinfo?(benjamin)
Bug 1121013 is a necessary part of the FHR/telemetry unification program and our team needs to figure it out in the short term. It doesn't necessarily block this one, but having a unified design for it seem prudent given the close relationship. If you want to take it, I'd be happy, but I was going to probably take it myself.
Flags: needinfo?(benjamin)
I have no desire to take it. All yours!
Attached patch crash events WIP (obsolete) — Splinter Review
Is this at all on the right track? I really don't know what I'm doing here and I don't want to waste time on the wrong thing.
Attachment #8567746 - Attachment is obsolete: true
Attachment #8574466 - Flags: feedback?(benjamin)
Comment on attachment 8574466 [details] [diff] [review]
crash events WIP

This is along the right track, but I think you should approach the crash manager differently. The fact that a particular crash is an OOM doesn't mean that it's not a crash: it's metadata that we associate with a crash.

Also in the near future, I expect that we'll want to store all sorts of metadata about a crash, not just the OOM state. So instead of hard-coding all the possibilities, I think we should have an optional set of key/value metadata about each crash record. So the nsICrashManager API would be something like:

addCrash(processType, crashType, id, metadata);

The only real problem with my plan is that this isn't just a JS API, it's also an XPIDL API, and XPIDL doesn't have a dictionary datatype. So we'll have to figure out something with the nsICrashManager.addCrash method.

That way, you can record {"OOMAllocationSize": "23546"} in the metadata, and then use that to drive the final data collection. Soon I'd like that to include most of the metadata that we collect for crashes except the personal data (URLs).

What this means technically for this patch is that the format of the crash event file should be something like this:

crash.main.2
<crashid>
Name=Value
Name2=Value2
...

And you can use the same block of code to process crash.main.1 and crash.main.2 to avoid duplication. You won't need the CRASH_TYPE_OOM constant at all.

Other questions you have in the code:
You don't need to modify CrashReporterParent yet, unless you need to record the OOM state for subprocesses (content/plugin/gmp). I don't think we can do that yet, but it might be worth a followup.

// xxx where does this call go?!
I'm pretty sure it ends up here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/crashes/CrashManager.jsm#1106

You should always change IIDs. Not because of binary-compat concerns, but because changing the IID is what triggers our build system to rebuild IDL files.
Attachment #8574466 - Flags: feedback?(benjamin)
Thank you for the pointers!
So there has been some massive confusion on my end, and this was actually supposed to be FHR. (Fortunately it doesn't make *too* much of a difference from a code perspective; I still plan to use the Crash Manager)
Summary: Report OOM crashes in telemetry → Report OOM crashes in FHR
Attachment #8574466 - Attachment is obsolete: true
Attachment #8576414 - Flags: review?(benjamin)
Attachment #8576416 - Flags: review?(benjamin)
Comment on attachment 8576414 [details] [diff] [review]
Part 1: Add a |metadata| parameter to nsICrashService::addCrash

In Mozilla-COM You don't need to/shouldn't revise the contractID or CID. So please undo the crashservice;2 change as well as the new {0cb49ca4-8c96-4f31-b22a-09f3082d250d} CID.

I wish we had a better datatype for the metadata, but I suppose the hacky way is good enough for now.

Are you sure we need to make this change at all?
Attachment #8576414 - Flags: review?(benjamin) → review+
Comment on attachment 8576416 [details] [diff] [review]
Part 2: Add a |metadata| parameter to CrashManager.addCrash

It feels like you ought to have tests which exercize the keyvalue parsing mechanism, if we actually need it. My impression is that we don't need it (just skip part 1 entirely). So I'm going to mark r+ on this patch but the parsing bits shouldn't land without a separately reviewed test.
Attachment #8576416 - Flags: review?(benjamin) → review+
Comment on attachment 8576417 [details] [diff] [review]
Part 3: Create events file format crash.main.2 which contains metadata

+There should be ``UUID.dmp`` and ``UUID.extra`` files on disk, saved by
+Breakpad.

I don't think this comment belongs here. If the crash has already been submitted by the user, the .dmp/extra files will already have been deleted.

r=me on the rest.
Attachment #8576417 - Flags: review?(benjamin) → review+
Attachment #8576418 - Flags: review?(benjamin) → review+
> Are you sure we need to make this change at all?
Nope!
> My impression is that we don't need it (just skip part 1 entirely). 
No objection here.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
Regarding the tests, I assume it's a moot point if we decide not to add that parsing code?

So just to confirm, if I:
* Drop part 1
* Remove CrashService.js from part 2
* Clean up .rst comments in part 3

...then this stack is ready to land?
Flags: needinfo?(benjamin)
I think that's correct.
Flags: needinfo?(benjamin)
Attachment #8576414 - Attachment is obsolete: true
Is this ineffective on 39 due to the unification? Should it go up to 38?
Flags: needinfo?(vdjeric)
(In reply to David Major [:dmajor] (UTC+13) from comment #22)
> Is this ineffective on 39 due to the unification? Should it go up to 38?

We do need this patch on 39. We still have the old FHR enabled on 39, and we won't turn it off until we've confirmed that unified Telemetry is reporting the same numbers as the old FHR.
Flags: needinfo?(vdjeric)
You need to log in before you can comment on or make changes to this bug.