Write a log of crashes for FHR consumption

VERIFIED FIXED in Firefox 30

Status

()

defect
P1
normal
VERIFIED FIXED
6 years ago
5 years ago

People

(Reporter: ted, Assigned: gps)

Tracking

(Blocks 2 bugs)

unspecified
mozilla30
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox29- affected, firefox30 verified)

Details

(URL)

Attachments

(10 attachments, 10 obsolete attachments)

6.52 KB, patch
benjamin
: review+
Details | Diff | Splinter Review
12.43 KB, patch
ted
: review+
Yoric
: review+
Details | Diff | Splinter Review
37 bytes, text/x-review-board-request
gps
: review+
Details
37 bytes, text/x-review-board-request
Yoric
: review+
Yoric
: feedback+
Details
4.68 KB, patch
gps
: review+
Details | Diff | Splinter Review
37 bytes, text/x-review-board-request
gps
: review+
Details
37 bytes, text/x-review-board-request
Details
37 bytes, text/x-review-board-request
Yoric
: review+
Details
37 bytes, text/x-review-board-request
rnewman
: review+
Details
37 bytes, text/x-review-board-request
Details
gps and I talked about this today. Currently FHR is trying to read the tea leaves in Crash Reports/{pending,submitted} to determine how many crashes a user experienced. This kind of sucks because those locations weren't designed for data collection so it's hard to get useful numbers out of them. We can't distinguish plugin and browser crashes, nor can we tell if the user crashed but declined to submit the crash report (because we simply delete it in that case).

We settled on having the exception handler callback write a simple log of actions, something like:
<timestamp> <local crash uuid> <action>

Examples might be:
1369351425 8355c849-2688-4697-afa9-1164b27cdd47 browser-crash
1369351456 c7dffee9-621e-44de-853a-8697a1d710e8 plugin-crash
1369351521 8355c849-2688-4697-afa9-1164b27cdd47 crash-submit
1369351538 c7dffee9-621e-44de-853a-8697a1d710e8 crash-delete

We should write a log entry every time the browser crashes, a plugin crashes, or a content process crashes. We should write an entry when we successfully submit a crash, or when the user declines to submit a crash and we delete it (clicking quit or restart in the crash reporter with the "submit" checkbox unchecked, for example).

We discussed some related issues that FHR could measure, like uncaught crashes. FHR could compare the crash log against the clean-shutdown tracking mechanism to see when unclean shutdowns happened that were not recorded as browser crashes.
I forgot to mention, but we'll want to truncate this log periodically to make sure it doesn't get ridiculously long. We'll want to make sure we're doing that in a non-racy way.
After discussion, we might want to write this log in potentially two separate places:
1) In the Crash Reports directory
2) In the profile directory

If we crash before we get a profile we can write it in the former. If we crash after we get a profile we can write it in the latter.
(Assignee)

Comment 3

6 years ago
Daniel: Since we're talking about changing how crashes are recorded, are there any more pieces of data Metrics would like from FHR?
Flags: needinfo?(deinspanjer)

Comment 4

6 years ago
What is the proposed aggregation for sending this crash data back via FHR? I'm not sure sending the local crash UUIDs is appropriate, but perhaps, using your example, the number of browser-crash, plugin-crash, crash-submit, crash-delete per day or session.
(Assignee)

Comment 5

6 years ago
Presumably FHR would read the log from the last recorded offset and store new data as new per-day counts (just like it does today). The big difference from today is the underlying crash data would be much more reliable.

Comment 6

6 years ago
FHR should for sure not send any crash IDs. The client also does not know about what category (browser or plugin, etc.) a crash belongs to, I think, at least not without putting more analysis in. I think what FHR should send is only the count of encountered and submitted crashes per day.

Comment 7

6 years ago
The client can and should know about browser/content/plugin crashes. That's one of the points of this bug.
From a user-facing feature POV we'll want to be able to correlate those crashes to plugin+plugin version and app+app version.  The reason is so that we can provide accurate diagnostics to users (if the previous version of a plugin was crashing when a user sneezed, but the new version isn't crashing, we shouldn't tell them to update/disable that plugin, for example).
(Assignee)

Updated

6 years ago
Blocks: 907354
(Assignee)

Comment 9

6 years ago
Ted: Didn't we brainstorm this on an etherpad? I seem to have lost the link.

Also, this came up at the crashkill and stability work week last week. A lot of people would like to see richer crash data that this crash "log" will provide. Hopefully we can finalize a technical plan in the next week or two and then start to move forward on implementation. I volunteer myself to write the JS API. C++ is still up in the air.
Flags: needinfo?(ted)
I don't think we ever had an etherpad, I think we just brainstormed in a text editor and I pasted the contents we wound up with in comment 0.
Flags: needinfo?(ted)
For Session Restore, we find ourselves needing a simple mechanism to determine whether the previous session has crashed. We have been working on bug 888373 for that purpose. It seems that the FHR crash log would be a superset of what we have been doing, so if the ETA is close enough, we should be able to make use of this crash log instead of landing our component.

Which brings me to my question: do we have an ETA?
Flags: needinfo?(ted)
We don't have an ETA, this was just something gps and I discussed wanting for FHR. I'd be a little worried about relying on this for detecting unclean shutdowns, because we know there are still crashes that Breakpad doesn't catch, so they wouldn't show up in this log. In fact I mentioned that in comment 0:

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #0)
> We discussed some related issues that FHR could measure, like uncaught
> crashes. FHR could compare the crash log against the clean-shutdown tracking
> mechanism to see when unclean shutdowns happened that were not recorded as
> browser crashes.
Flags: needinfo?(ted)
(Assignee)

Updated

6 years ago
Flags: needinfo?(deinspanjer)
Ben, I am CC'ing you so you can be aware of the current limitations around counting crashes in FHR as well as potentially providing more feedback on the new special.

Comment 15

6 years ago
Who is going to work on this? Does anyone have this in their near-term planning?
(Assignee)

Comment 16

6 years ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #15)
> Who is going to work on this? Does anyone have this in their near-term
> planning?

This was my action item. However, I'll need help with the C++ bits. I can do the JS side and the technical work up front, however.

First thing's first: we need a technical design.

I believe bsmedberg has contributed to https://etherpad.mozilla.org/crash-log.

Ted: Could you please fill in your thoughts?
Flags: needinfo?(ted)
I've already added my comments to that etherpad (in green). The design is sensible, it just needs an implementation.
Flags: needinfo?(ted)
(Assignee)

Comment 18

6 years ago
I talked with bsmedberg about this yesterday. To make the implementation easier, we propose having the crash reporter write individual "event" files into well-defined locations and having an XPCOM service in Gecko slurp these up and write them into a unified crash log. This makes things like locking much more manageable. Plus, it means we only need to write 1 version of the code for interfacing with the crash log files.

I've updated https://etherpad.mozilla.org/crash-log. If this design is accepted, I can start working on the XPCOM service (which I prefer to implement in JavaScript).

Also, now seems like the right time to get a preliminary perf review of the design. CC vladan.
Flags: needinfo?(vdjeric)
Flags: needinfo?(ted)
Flags: needinfo?(benjamin)
Bear in mind that said XPCOM service (a) would need to run on a timer, 'cos it won't be touched by FHR on Android and thus can't lazily do the work, (b) not run on a timer, 'cos Gecko dies all the time on Android, so the timer would never get to run, and (c) would need to be designed to not do IO on Gecko startup, 'cos that'll kill perf on Android even worse than on desktop.

Those are tricky requirements to meet, so like most other lazy/background activities on Android, most likely we'd reimplement this in Java and just not ship the Gecko component -- Gecko is an engine for fetching and displaying web pages, not a cross-platform service layer.

If you want to make a call on this now and factor it into your design, I'd be happy to sign off.
(Assignee)

Comment 20

6 years ago
Why does Fennec always have to rain on the parade?

If Fennec wanted to reimplement the crash log aggregating and writing in Java, power to them.

At the end of the day, we need an API for crash data that is accessible from the FHR backend (for submitting data) and from "health report"/"doctor" features in the application (such as about:healthreport) so said data can be consumed to make intelligent choices. Who knows, perhaps things like the plugin manager also need this data so they can choose to automatically disable crashy plugins or change the CtP notification appropriately, etc. I /think/ there will be a requirement for at least one component in Gecko to access this data, so a read API is a requirement.

At the end of the day, I think we just need to agree on the XPCOM surface area that placates all customers. Maybe one globally available interface for reading the log and another optional interface for managing/updating the log?

On a related front, this feature would be a candidate for the not-yet-invented "background Gecko service" people talk about every so often.
(In reply to Gregory Szorc [:gps] from comment #20)
> Why does Fennec always have to rain on the parade?

Not rain, just illustrating the enormously varied worlds in which this stuff might have to work! It's very easy to sketch out a solution that works great when Gecko is running on modern hardware for several minutes, but doesn't work well when Gecko only runs for 30 seconds at a time, and you have a few hundred KB of memory available while it does so.

 
> If Fennec wanted to reimplement the crash log aggregating and writing in
> Java, power to them.

"them" == "us", to be fair. And if we don't reimplement it, we're essentially WONTFIXing Bug 891597.

 
> plugin manager also need this data so they can choose to automatically
> disable crashy plugins or change the CtP notification appropriately, etc. I
> /think/ there will be a requirement for at least one component in Gecko to
> access this data, so a read API is a requirement.

Does it have to run within Gecko (e.g., for extensibility reasons), or is it enough to communicate via notifications/messages/calls?

Does the read API need to provide up-to-date data (and thus need to trigger a rollup)?


> At the end of the day, I think we just need to agree on the XPCOM surface
> area that placates all customers. Maybe one globally available interface for
> reading the log and another optional interface for managing/updating the log?

So long as the update part is hidden behind an interface, and its behavior is well-specified, we can pretty easily swap that out for a Java service, just as we do for FHR document generation itself. There would be no need to reimplement the log reader.


> On a related front, this feature would be a candidate for the
> not-yet-invented "background Gecko service" people talk about every so often.

It would, as soon as more than a minority of phones on the market can run Gecko without swapping out every other app on the device... for the foreseeable future, Gecko is a foreground feature, and we can reasonably expect to implement most new non-foreground functionality in Java to ship on Android.
I took a quick look at the Etherpad with the proposed design, I have only a couple of comments so far:

- I think it's ok to put crash event files from different profiles in the same directory. You could include the profile name (if available) as another column. When crash data is shown in FHR, it could be broken down by profile or only shown for the current profile.
- We should also include information about the version of the app that crashed in the event logs. If Firefox starts crashing frequently after updating to version X, we might want to make note of it in the FHR dash
- When exactly will the crash event files be processed into crash logs? On-demand or on every startup?
Flags: needinfo?(vdjeric)
At this point I don't think we should fret too much about separating crashes from profiles. Perhaps as a later step we should consider keeping crashes in the profile instead of the shared location (I forget exactly why we chose the shared location). Then we'd only have to deal with the shared location for crashes that happen before we've picked a profile.

Otherwise if we need to do this in the Firefox and Android frontend separately, we should split out a separate bug for Android and at least get the desktop bits landed.
Flags: needinfo?(benjamin)
We used the shared location just because we wanted to handle startup crashes that happened before a profile was chosen, and I didn't bother writing it to have a separate code path for the "have a profile" case.
Flags: needinfo?(ted)
Comment [1] mentions two types of crashes: browser-crash and plugin-crash.

Q1. Are there other possible categories for desktop crashes? Will we include them?
Q2. What are the relevant categories for Fennec crashes?

thanks
There are content process crashes, we're not shipping any e10s on desktop yet, but when we do we'll need to handle those. We have them in B2G, but I'm not sure if that's in scope yet.

Fennec only has browser crashes, since there's no e10s there at all.
We are shipping e10s on desktop 26 and maybe event 25 for thumbnails, and I do want the FHR stats to include these.

Comment 28

6 years ago
Yes, e10s is already in desktop builds, it's disabled by default for tabs (but a simple pref flip away) and even enabled and used by default for thumbnails generation (we don't have any UI hooked up for sending crashes there but IMHO we should strive to record them in that log for FHR). And in any case, we should make that mechanism ready to be able to handle those content process crashes so we don't run into needing a redesign early after having this in place.
(Assignee)

Comment 29

6 years ago
Posted patch Part 1: Crash reporter docs (obsolete) — Splinter Review
I started diving into this bug on the plane yesterday. I was unable to
find decent comprehensive docs on how the crash reporter works and how
all the components play together, so I decided to write some! The docs
aren't super detailed, but I think they are helpful enough to newbies
wanting to learn how everything is hooked up.

This patch applies on top of bug 939367 so Sphinx docs can come from
multiple parts of the tree. I still need to improve bug 939367 to not
require changes to tools/docs when adding new docs sources. And, I
wouldn't be surprised if there were pushback to using moz.build files
for declaring documentation sources. So please limit your review to just
the index.rst file. And, since this is all NPOTB and since you,
Benjamin, are a subject matter expert, perhaps you should just change
the docs to be accurate instead of leaving review notes?
Attachment #833354 - Flags: review?(benjamin)
(Assignee)

Comment 30

6 years ago
This is what I have in mind.

It doesn't compile because I'm still trying to wrap my head around how
strings work. I probably need to expand the API a bit because eventually
we'll need to know both directories where crash event data can be
located.

Questions:

* Is it acceptable for me to have a statically allocated string or
  should it be initialized at run-time? The string should be allocated
  very early in process lifetime, so perf concerns over static
  constructors may not apply?

* Is passing the path to the crash reporter client as an argument
  acceptable? A crash annotation or note didn't feel right to me.
  And environment variable manipulation would require global state
  or memory mucking. Both seem bad.

* I'm not sure where the best place for the "paths to events files"
  API is. It needs to be accessible via JS. Directory service? Or some
  XPCOM service?

* What about directory creation? Should I create these directories at
  app startup time? That seems bad for perf. Is it safe to create them at
  crash time? The system may be unstable. But I suppose we're already
  doing I/O in the crash reporter client... Should I create directories
  at profile creation time?

Most of my questions are probably obvious to you. But this is my first
C++ patch for Gecko and I know practically nothing abount the underlying
systems.
Attachment #8334201 - Flags: feedback?(benjamin)
(Assignee)

Updated

6 years ago
Assignee: nobody → gps
Status: NEW → ASSIGNED
Comment on attachment 833354 [details] [diff] [review]
Part 1: Crash reporter docs

This doc looks correct and fine in terms of firefox.exe crashes. It does not cover plugin-process crashes at all, nor does it mention that crash report submissions can also occur from about:crashes

child-process crashes are handled by the dom/ipc/CrashReproterParent class, and then into the Firefox frontend for submission via CrashSubmit.jsm

Updated

6 years ago
Attachment #8334201 - Flags: feedback?(benjamin) → feedback+
> * Is it acceptable for me to have a statically allocated string or
>   should it be initialized at run-time? The string should be allocated
>   very early in process lifetime, so perf concerns over static
>   constructors may not apply?

Static constructors are still bad. Use "nsString* gFoo";

> * Is passing the path to the crash reporter client as an argument
>   acceptable?

Yes.

> * I'm not sure where the best place for the "paths to events files"
>   API is. It needs to be accessible via JS. Directory service? Or some
>   XPCOM service?

The directory service sucks. nsIXULRuntime or nsICrashReporter would both be ok places to hang this.

> * What about directory creation? Should I create these directories at
>   app startup time? That seems bad for perf. Is it safe to create them at
>   crash time?

Yes, you can make syscalls safely at crash time.

Oh also, rereading the patch: you're using nsIFile.getNativePath, which is a lossy method on Windows. We need to use unicode paths/nsAString in order to capture things properly.
(Assignee)

Comment 33

6 years ago
The tree doesn't have a robust and reusable API for interfacing with
crash data. This patch is the start of a new API.

In this patch, I created the CrashManager type. It has APIs for
retrieving the lists of files related to crash dumps. In subsequent
patches, I will convert existing code in the tree that does similar
things to the new API. I will also build the events/timeline API onto
this type.

I made CrashManager generic because I hate, hate, hate singletons and
global variables. Allowing it to be instantiated multiple times with
different options (instead of say binding a global instance to ProfD)
makes the testing story much, much nicer. That is reason enough, IMO. In
a subsequent patch, I'll add an XPCOM service that instantiates the
"global" instance of CrashManager with the appropriate options.

It was tempting to add this code into the existing CrashReports.jsm.
However, this file does not import cleanly in xpcshell tests and I
didn't want to bloat scope to include fixing that file... yet.
CrashReports.jsm is using synchronous I/O. So, depending on how
adventerous I feel, I may replace consumers of CrashReports.jsm with the
new CrashManager.jsm, remove CrashReports.jsm, and eliminate another
source of synchronous I/O in the tree.

There is a lot that will build on this patch. Things should be clearer a
few patches in.
Attachment #8334977 - Flags: review?(benjamin)
(Assignee)

Updated

6 years ago
Attachment #8334201 - Attachment is obsolete: true
(Assignee)

Updated

6 years ago
Attachment #8334201 - Attachment is obsolete: false
(Assignee)

Comment 34

6 years ago
Oh, somehow I missed the existence of nsICrashReporter! I can only image what hysterical raisin has it sitting in xpcom/system and its implementation in toolkit/xre/nsAppRunner.cpp instead of toolkit/crashreporter. I didn't even realize nsXULAppInfo inherited from so many interfaces!

I would like to hang a lot of things off nsICrashReporter. Is it possible to implement part of that interface in JavaScript and part in C++? If so, could you please point me to an interface that is implemented in both languages so I have can have a clue what's going on. Or, will I need to write an interface just for JS (and possibly hang proxy methods off nsICrashReporter/nsXULAppInfo)?
> I would like to hang a lot of things off nsICrashReporter. Is it possible to
> implement part of that interface in JavaScript and part in C++?

No.
(In reply to Gregory Szorc [:gps] from comment #34)
> Oh, somehow I missed the existence of nsICrashReporter! I can only image
> what hysterical raisin has it sitting in xpcom/system and its implementation
> in toolkit/xre/nsAppRunner.cpp instead of toolkit/crashreporter. I didn't
> even realize nsXULAppInfo inherited from so many interfaces!

It's...complicated.
Comment on attachment 833354 [details] [diff] [review]
Part 1: Crash reporter docs

r- based on the previous comment.
Attachment #833354 - Flags: review?(benjamin) → review-
Comment on attachment 8334977 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data

Use for...of instead of for...in Iterator

I'm a little worried about the regexes: on nightly/aurora we're still using bp-hr- IDs, and I think we should relax that to bp-*.txt.

I'm not confident reviewing the async JS code: perhaps yoric should review this also...

Finally, there are some pieces of data that are important which we might want to roll up in this API. Specifically the crash process type (browser/plugin/content) and whether this is a hang or a true crash. That information is in the .ini file for pending crashes, but isn't anywhere currently for the saved crashes; we may need to add it to the submitted .txt file or log it some other way.
Attachment #8334977 - Flags: review?(benjamin) → feedback+
(Assignee)

Comment 39

6 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #38)
> Comment on attachment 8334977 [details] [diff] [review]
> Part 2: Create CrashManager API for managing crash data
> 
> Use for...of instead of for...in Iterator

js> for (let k of {foo: true, bar: false}) { print(k); }
typein:1:0 TypeError: (intermediate value)['@@iterator'] is not a function

js> for (let [k, v] in Iterator({foo: true, bar: false})) { print(k); }
foo
bar

> I'm a little worried about the regexes: on nightly/aurora we're still using
> bp-hr- IDs, and I think we should relax that to bp-*.txt.

Will do.

> Finally, there are some pieces of data that are important which we might
> want to roll up in this API. Specifically the crash process type
> (browser/plugin/content) and whether this is a hang or a true crash. That
> information is in the .ini file for pending crashes, but isn't anywhere
> currently for the saved crashes; we may need to add it to the submitted .txt
> file or log it some other way.

Noted. I may roll these features in incrementally.
(Assignee)

Updated

6 years ago
Depends on: 946604
(Assignee)

Comment 40

6 years ago
I moved the code from toolkit/crashreporter to
toolkit/components/crashes. I think it logically makes sense to create a
wall between toolkit/crashreporter and the code that consumes it. If
nothing else, there's already a lot happening in toolkit/crashreporter
and I didn't want to contribute to the confusion. Subsequent patches in
the series will build upon this patch and make the establishment of a
new component more easily justified.
Attachment #8342971 - Flags: review?(dteller)
(Assignee)

Updated

6 years ago
Attachment #8334201 - Attachment is obsolete: true
(Assignee)

Updated

6 years ago
Attachment #8334977 - Attachment is obsolete: true
(Assignee)

Comment 41

6 years ago
This patch now works. I also added an IDL interface so we do things
properly. The service is mostly a stub now. Will build features in
subsequent patches. I'm authoring patches so every patch is standalone
and should result in a green tree. I can squash patches down before
landing if desired.
Attachment #8342972 - Flags: review?(benjamin)
(Assignee)

Updated

6 years ago
Blocks: 946645
Comment on attachment 8342972 [details] [diff] [review]
Part 3: XPCOM service for managing crash data

It's not clear from the nsICrashesManager IDL whether it is intended to be used as a service (getService or createInstance). This is more confusing because of the name CrashesService.js. Can we make it CrashesManager.js for name consistency and add an .idl comment explaining that this is intended as a service?

Does this service really need to be created at startup, or only when somebody starts to use it? In any case, the profile-after-change dance doesn't seem to be necessary, since you aren't using the profile directory anywhere; and if you were, you could just use ProfDS to get the directory before the profile is official "started".

Ted wanted to see this, so with those changes please mark him for second review on the new patch.
Attachment #8342972 - Flags: review?(benjamin)
(Assignee)

Comment 43

6 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #42)
> Comment on attachment 8342972 [details] [diff] [review]
> Part 3: XPCOM service for managing crash data
> 
> It's not clear from the nsICrashesManager IDL whether it is intended to be
> used as a service (getService or createInstance). This is more confusing
> because of the name CrashesService.js. Can we make it CrashesManager.js for
> name consistency and add an .idl comment explaining that this is intended as
> a service?

While the underlying JS type can be instantiated multiple times (to facilitate testing), there likely will only ever be one instance inside Gecko and that instance can live as a service. Perhaps I should expose a "service" IDL that exposes a single "manager" attribute. That seems like over-engineering.

> Does this service really need to be created at startup, or only when
> somebody starts to use it? In any case, the profile-after-change dance
> doesn't seem to be necessary, since you aren't using the profile directory
> anywhere; and if you were, you could just use ProfDS to get the directory
> before the profile is official "started".

I'll eventually hang a timer to aggregate event files on there. I totally agree that lazy loading is preferred for perf reasons - that's why .manager is a getter.
> Gecko and that instance can live as a service. Perhaps I should expose a
> "service" IDL that exposes a single "manager" attribute. That seems like
> over-engineering.

Indeed. Basically I just want the IDL and the JS to match naming, and a comment explaining that this is supposed to be used with .getService()

> I'll eventually hang a timer to aggregate event files on there. I totally
> agree that lazy loading is preferred for perf reasons - that's why .manager
> is a getter.

ok, then you don't need the profile stuff at all. If you want it created on startup continue with the app-startup category.
(Assignee)

Comment 45

6 years ago
Initial stab at support for "event files" in the JS API. Needs a lot of
love in the "robust" department (recovering from corruption, etc). But,
there's enough here to start asking for feedback.

I've gone around the block a number of times on what storage format to
use. I've considered append-only files, files grouped by ID prefix,
files grouped by date of crash. I decided on the easiest implementation:
a single soon-to-be-gzipped JSON file. We currently have an unbound data
problem. Definitely need to add pruning in here. Storage is abstracted
from the public API, so we can swap things out easily enough. Although
once the cat is out of the bag, we'll need to write migration scripts.
Considerations for forwards compatibility fall under the "robust"
department. Hopefully the next iteration of the patch.

I still need to hook the XPCOM service up to this new code. Future
patch.
Attachment #8343749 - Flags: feedback?(benjamin)
(Assignee)

Updated

5 years ago
Depends on: 947694
(Assignee)

Comment 46

5 years ago
Start of the CrashManager API. Just supports dump file management for
now.
Attachment #8344319 - Flags: review?(ted)
Attachment #8344319 - Flags: review?(dteller)
(Assignee)

Updated

5 years ago
Attachment #8342971 - Attachment is obsolete: true
Attachment #8342971 - Flags: review?(dteller)
(Assignee)

Comment 47

5 years ago
Here is the start of the XPCOM service so our shiny new CrashManager is
instantiated at app startup. I changed the service name and
initialization logic since the last patch.
Attachment #8344320 - Flags: review?(ted)
(Assignee)

Updated

5 years ago
Attachment #8342972 - Attachment is obsolete: true
(Assignee)

Comment 48

5 years ago
This patch adds support for "crash events" in the CrashManager.

It could use more in the way of tests, but I think I hit all the
high-level desires for an initial landing.

Ted and Benjamin can fight over code review.
Vladan gets perf review.

Some of the design decisions were documented in the attached .rst file.
The choice of a single JSON file for event storage is probably what
scares me the most. It's very easy to come up with scenarios where that
doesn't scale. But without real data from clients in the wild, I have no
clue how realistic those concerns are.
Attachment #8344322 - Flags: review?(ted)
Attachment #8344322 - Flags: review?(benjamin)
Attachment #8344322 - Flags: feedback?(vdjeric)
Comment on attachment 8344322 [details] [diff] [review]
Part 4: Add Support for crash event files to CrashManager

> /**
>  * Provides an interface for crash data.
>  *
>  * This type is generic and can be instantiated any number of times.
>  * However, most applications will typically only have one instance
>  * instantiated and that instance will point to profile and user appdata
>  * directories.
>  *
>  * Instances are created by passing an object with properties.
>  * Recognized properties are:
>  *
>  *   pendingDumpsDir (string)
>  *     Where dump files that haven't been uploaded are located.
>  *
>  *   submittedDumpsDir (string)
>  *     Where records up uploaded dumps are located.
>+ *
>+ *   eventsDirs (array)
>+ *     Directories (defined as strings) where events files are written. This
>+ *     instance will collects events from files in the directories specified.

It's not clear to me whether this will delete the submitted dumps after it has loaded the submitted dumps directory. Will it (in this patch or a later one)?

>+  _processEventFile: function (entry) {
>+    return Task.spawn(function () {
>+      let data;
>+      try {
>+        data = yield OS.File.read(entry.path);
>+      } catch (ex) {
>+        // TODO log
>+        throw new Task.Result(this.EVENT_FILE_ERROR_IO);
>+      }
>+
>+      let store = yield this._getStore();
>+
>+      let decoder = new TextDecoder();
>+      data = decoder.decode(data);
>+
>+      let sepIndex = data.indexOf("\n");
>+      if (sepIndex == -1) {
>+        throw new Task.Result(this.EVENT_FILE_ERROR_MALFORMED);
>+      }
>+
>+      let type = data.substring(0, sepIndex);
>+      let payload = data.substring(sepIndex + 1);
>+
>+      if (type == "crash.gecko.1" || type == "crash.plugin.1") {

.gecko seems confusing, can this be .chrome? This seems to be missing content-process crashes; these are already being used for thumbnails. I suspect that we should paramaterize this rather than having separate .addGeckoCrash and .addPluginCrash methods.

Also how does this handle plugin hangs?

I don't understand what the store expiration timer is for. Why is there an expiration at all? Is that because of memory usage?

>diff --git a/toolkit/components/crashes/crash-events.rst b/toolkit/components/crashes/crash-events.rst

>+crash.gecko.1

As noted above, I'm wary of having separate events for crash.gecko and crash.plugin. Also it seems to me that what we really want is to include all of the crash metadata in the event file: this will naturally include the ProcessType= annotation as well as any Hang= annotation.

>+crashreporter.submit.decline.1
>+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>+
>+This event is produced by the crash reporter when the user declines to submit
>+a crash. The crash reporter will likely delete the crash data files from disk.

It will!? I'm pretty sure that the crash remains in the pending/ directory in this case.

Also this doc doesn't indicate whether this event is written by both the in-product crash reporting UI (plugin/content crash submission) or just the binary crashreporter. I think this is an important detail.

Finally, it doesn't need to be in this patch, but we should probably limit the crash data by total count as well as the 180-day limit. There have been people who crash Flash hundreds of times a day (it basically crashes on plugin startup every time), and we want to apply a sane limit. Maybe 5000 crashes total?
Attachment #8344322 - Flags: review?(benjamin)
(Assignee)

Comment 50

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> Comment on attachment 8344322 [details] [diff] [review]
> Part 4: Add Support for crash event files to CrashManager

Excellent feedback, Benjamin.

> It's not clear to me whether this will delete the submitted dumps after it
> has loaded the submitted dumps directory. Will it (in this patch or a later
> one)?

I think eventually we'll get there. Maybe not in this patch set, but definitely a follow-up. The reason is, once the event timeline can be trusted as canonical, there should be no need for the "submitted" directory to exist at all - we can just have the crash submission mechanism write the appropriate event file or call into the "timeline API" from inside Gecko. I didn't want to bloat scope too much. But I do anticipate rewriting FHR, about:crashes, any others.

> >+      if (type == "crash.gecko.1" || type == "crash.plugin.1") {
> 
> .gecko seems confusing, can this be .chrome? This seems to be missing
> content-process crashes; these are already being used for thumbnails. I
> suspect that we should paramaterize this rather than having separate
> .addGeckoCrash and .addPluginCrash methods.
> 
> Also how does this handle plugin hangs?
>
> As noted above, I'm wary of having separate events for crash.gecko and
> crash.plugin. Also it seems to me that what we really want is to include all
> of the crash metadata in the event file: this will naturally include the
> ProcessType= annotation as well as any Hang= annotation.

Bikeshedding on the message names and their payloads is very much welcome. I'm no expert on the various types of crashes we experience and the metadata associated with them. My decisions for message types and content are likely very suboptimal. Can you please brain dump to https://etherpad.mozilla.org/crash-log?

> I don't understand what the store expiration timer is for. Why is there an
> expiration at all? Is that because of memory usage?

Yes, it's because of memory usage. Chances are we load the crash data once near startup then never use it again. I don't like idle/useless memory sitting around like that. I like the TTL pattern.

> 
> >+crashreporter.submit.decline.1
> >+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >+
> >+This event is produced by the crash reporter when the user declines to submit
> >+a crash. The crash reporter will likely delete the crash data files from disk.
> 
> It will!? I'm pretty sure that the crash remains in the pending/ directory
> in this case.

I copied that from https://etherpad.mozilla.org/crash-log without validating it!
 
> Also this doc doesn't indicate whether this event is written by both the
> in-product crash reporting UI (plugin/content crash submission) or just the
> binary crashreporter. I think this is an important detail.

In theory, both can happen. Although, there are performance implications. I'm not sure if we want in-product store writing happening directly because patterns like plugins crashing every 60s could have significant performance implications compared to writing events files and aggregating much less frequently.

> Finally, it doesn't need to be in this patch, but we should probably limit
> the crash data by total count as well as the 180-day limit. There have been
> people who crash Flash hundreds of times a day (it basically crashes on
> plugin startup every time), and we want to apply a sane limit. Maybe 5000
> crashes total?

Sounds reasonable. I do worry about loss of data (to FHR) when a high water mark is reached. i.e. is the difference between 5000 and 50000 worth reporting? Could we just store a "high water mark reached" flag when we exceed the daily limit? What if lots of plugin crashes mask (more important) Gecko crashes or other important events?

Comment 51

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> .gecko seems confusing, can this be .chrome?

I wonder if .chrome is less confusing on B2G, where this would be when the "b2g" process (i.e. "the whole phone") crashes. In crash-stats, we call it "browser" process due to legacy, but I tend to call it "main process" nowadays and that would work cross-product.

Comment 52

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> Comment on attachment 8344322 [details] [diff] [review]
> >+This event is produced by the crash reporter when the user declines to submit
> >+a crash. The crash reporter will likely delete the crash data files from disk.
> 
> It will!? I'm pretty sure that the crash remains in the pending/ directory
> in this case.


Actually, AFAIK, that comment might very well be correct, as I was told we delete crash reports even locally when people decline to submit them. I know for a fact we do that on B2G and IIRC when we fixed that on B2G to be like that, ted said this would be the correct thing as it would match what breakpad UI does.

Comment 53

5 years ago
(In reply to Gregory Szorc [:gps] (in southeast Asia until Dec 14) from comment #50)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> > It's not clear to me whether this will delete the submitted dumps after it
> > has loaded the submitted dumps directory. Will it (in this patch or a later
> > one)?
> 
> I think eventually we'll get there. Maybe not in this patch set, but
> definitely a follow-up. The reason is, once the event timeline can be
> trusted as canonical, there should be no need for the "submitted" directory
> to exist at all - we can just have the crash submission mechanism write the
> appropriate event file or call into the "timeline API" from inside Gecko. I
> didn't want to bloat scope too much. But I do anticipate rewriting FHR,
> about:crashes, any others.

Note that right now, an |ls -l submitted/| is the only way to get your crash IDs at all. So when we end up deleting those, we better end up having a better way in place to get to those (bugs are filed but have not really been moving forward). But as you said, probably a different bug anyhow where this happens. ;-)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #52)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> > Comment on attachment 8344322 [details] [diff] [review]
> > >+This event is produced by the crash reporter when the user declines to submit
> > >+a crash. The crash reporter will likely delete the crash data files from disk.
> > 
> > It will!? I'm pretty sure that the crash remains in the pending/ directory
> > in this case.
> 
> 
> Actually, AFAIK, that comment might very well be correct, as I was told we
> delete crash reports even locally when people decline to submit them. I know
> for a fact we do that on B2G and IIRC when we fixed that on B2G to be like
> that, ted said this would be the correct thing as it would match what
> breakpad UI does.

If the user explicitly declines to submit a crash report (for example, if the browser crashes, the user does not have "Tell Mozilla about this crash" checked, and they click quit or restart in the crash reporter UI), then we'll delete the pending crash.

If the crash submission simply fails for some reason (network etc) we'll leave it in pending.

Plugin crashes are a little weird in that it's really easy for the user to not submit but not explicitly decline submission, so we sometimes wind up leaving them in pending.
Comment on attachment 8343749 [details] [diff] [review]
Part 6: Support events files in CrashManager

I feel like this patch and part 4 are versions of the same thing, or am I confused?
Attachment #8343749 - Flags: feedback?(benjamin)
(Assignee)

Updated

5 years ago
Attachment #8343749 - Attachment is obsolete: true
Comment on attachment 8344319 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data

Review of attachment 8344319 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/crashes/CrashManager.jsm
@@ +6,5 @@
> +
> +const {utils: Cu} = Components;
> +
> +Cu.import("resource://gre/modules/osfile.jsm")
> +Cu.import("resource://gre/modules/Task.jsm");

Shouldn't we import into |this|, these days?

@@ +13,5 @@
> +  "CrashManager",
> +];
> +
> +/**
> + * Provides an interface for crash data.

I'm not a big fan of this docstring. This makes the module look like a InstanceFactoryImplFactoryFactory, which it isn't :)
Any way to make it more friendly?

Also, so far, it's an interface for *accessing* crash data.

@@ +27,5 @@
> + *   pendingDumpsDir (string)
> + *     Where dump files that haven't been uploaded are located.
> + *
> + *   submittedDumpsDir (string)
> + *     Where records up uploaded dumps are located.

Could you make it clear that these properties (which are really labelled args) are required?

@@ +100,5 @@
> +                                     this.SUBMITTED_REGEX);
> +  },
> +
> +  _getDirectoryEntries: function (path, re) {
> +    return Task.spawn(function () {

Here too, I'd suggest function* and return.

@@ +106,5 @@
> +        yield OS.File.stat(path);
> +      } catch (ex if ex instanceof OS.File.Error) {
> +        if (ex.becauseNoSuchFile) {
> +          throw new Task.Result([]);
> +        }

I'd go for

 } catch (ex if ex instanceof OS.File.Error && ex.becauseNoSuchFile) {

Also, these days, it's probably simpler to make it a |function*()| and use |return| instead of |throw new Task.Result|.

@@ +119,5 @@
> +        while (true) {
> +          let entry;
> +          try {
> +            entry = yield it.next();
> +          } catch (ex if ex == StopIteration) {

Unfortunately, this is going to break with bug 938704. I'd suggest calling it.forEach.

@@ +130,5 @@
> +
> +          let match = re.exec(entry.name);
> +          if (!match) {
> +            continue;
> +          }

You could also use a winPattern to optimize the code for Windows.

::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
@@ +21,5 @@
> +    f.create(Ci.nsIFile.DIRECTORY_TYPE, dirMode);
> +  }
> +
> +  const dirMode = OS.Constants.libc.S_IRWXU;
> +  let baseFile = do_get_tempdir();

Not a request, but if you want, OS.Constants.Path.tmpDir does the trick, too.

@@ +63,5 @@
> +    file.lastModifiedTime = date.getTime();
> +    dump("Created fake crash: " + file.path + "\n");
> +
> +    return uuid;
> +  };

Wouldn't it be possible to use the actual Crash Reporter to generate crashes? There are xpcshell tests that call it in a subprocess, so that would probably be a much better test, wouldn't it?

@@ +91,5 @@
> +  let m = new CrashManager({
> +    pendingDumpsDir: "/foo",
> +    submittedDumpsDir: "/bar",
> +  });
> +  Assert.ok(m);

Assert? Where does this come from? I want some :)

@@ +94,5 @@
> +  });
> +  Assert.ok(m);
> +
> +  run_next_test();
> +});

Nit: Please use add_task these days.

@@ +104,5 @@
> +
> +  run_next_test();
> +});
> +
> +add_test(function test_get_manager() {

Could you document what each of your test ensures? Looking at the source code, I'm often unsure.
Attachment #8344319 - Flags: review?(dteller) → feedback+
Generators have now been standardized and should now be written with |function*|, see http://dxr.mozilla.org/mozilla-central/source/toolkit/modules/Task.jsm .
Attachment #8344322 - Flags: feedback?(vdjeric) → feedback?(dteller)
Comment on attachment 8344322 [details] [diff] [review]
Part 4: Add Support for crash event files to CrashManager

Review of attachment 8344322 [details] [diff] [review]:
-----------------------------------------------------------------

Overall, looks good to me, if a bit lacking on in-code documentation.

::: toolkit/components/crashes/CrashManager.jsm
@@ +48,2 @@
>   */
>  this.CrashManager = function (options) {

Wouldn't it be simpler to store a shallow copy of |options| as |this._options| and use this everywhere?

@@ +79,5 @@
>        default:
>          throw new Error("Unknown property in options: " + k);
>      }
>    }
> +

Could you please document these fields?

@@ +105,5 @@
> +
> +  EVENT_FILE_SUCCESS: "ok",
> +  EVENT_FILE_ERROR_IO: "io",
> +  EVENT_FILE_ERROR_MALFORMED: "malformed",
> +  EVENT_FILE_ERROR_UNKNOWN_EVENT: "unknown-event",

These look like global constants rather than prototype fodder. Do you have any reason to put these things in the prototype?

@@ +166,5 @@
> +    }.bind(this));
> +  },
> +
> +  /**
> +   * Aggregates "loose" events files into the unified "database."

Could you document the effect on files?

@@ +177,5 @@
> +  aggregateEventsFiles: function () {
> +    return Task.spawn(function () {
> +      if (this._aggregateInProgress) {
> +        throw new Error("Event aggregation already in progress. Try again " +
> +                        "later.");

Shouldn't you return the promise of the aggregation in progress?

@@ +198,5 @@
> +              case this.EVENT_FILE_SUCCESS:
> +                needsSave = true;
> +                // Fall through.
> +
> +              case this.EVENT_FILE_ERROR_MALFORMED:

Looks like it should be an exception.

@@ +304,5 @@
> +
> +      let decoder = new TextDecoder();
> +      data = decoder.decode(data);
> +
> +      let sepIndex = data.indexOf("\n");

Please reference crash-event.rst from a comment.

@@ +427,5 @@
> +
> +      this._storeTimer.cancel();
> +
> +      let timerCB = function () {
> +        if (this._storeProtected) {

Please document both _storeProtected and the role if this callback.

@@ +475,5 @@
> +  this._storeDir = storeDir;
> +  this._telemetryKey = telemetryKey;
> +
> +  this._storePath = OS.Path.join(storeDir, "store.json");
> +  this._data = null;

Could you document all these fields/args?

@@ +480,5 @@
> +}
> +
> +CrashStore.prototype = Object.freeze({
> +  TYPE_GECKO: 1,
> +  TYPE_PLUGIN: 2,

As above, this doesn't look like it should be in the prototype.

@@ +502,5 @@
> +        if (data.corruptDate) {
> +          this._data.corruptDate = new Date(data.corruptDate);
> +        }
> +
> +        for (let [id, crash] of Iterator(data.crashes)) {

I hear that |Iterator| is deprecated and that the way to go is now Object.keys.

@@ +511,5 @@
> +          }
> +
> +          this._data.crashes[id] = unnormalized;
> +        }
> +      } catch (ex if ex instanceof OS.File.Error && ex.becauseNoSuchFile) {

// Ignore
(I assume)

@@ +518,5 @@
> +        // and swallow the error.
> +        //
> +        // The marking of a corrupted file is intentionally not persisted to
> +        // disk yet. Instead, we wait until the next save(). This is to give
> +        // non-permanent failures the opportunity to recover on their own.

What kind of failures are expected?

@@ +553,5 @@
> +        normalized.crashes[id] = c;
> +      }
> +
> +      let method = this._storePath.endsWith(".gz") ? "writeGzippedFile"
> +                                                   : "writeFile";

As mentioned in the other patch, I suspect that we should normalize upon lz4 for the time being.

@@ +600,5 @@
> +
> +    return normalized;
> +  },
> +
> +  _unnormalize: function (o) {

That's a very weird method name. It really deserves some documentation.

@@ +627,5 @@
> +    this._data.crashes[id].type = this.TYPE_GECKO;
> +  },
> +
> +  /**
> +   * Record the declined submission of a particular crash.

You should explain what this means.

@@ +764,5 @@
> +/**
> + * Represents an individual crash with metadata.
> + *
> + * This is a wrapper around the low-level anonymous JS objects that define
> + * crashes. It exposes a consistent and helpful API.

Please document the argument.

@@ +767,5 @@
> + * This is a wrapper around the low-level anonymous JS objects that define
> + * crashes. It exposes a consistent and helpful API.
> + */
> +function CrashRecord(o) {
> +  this._o = o;

Nice smiley :)
However, _o is a very awkward name.

::: toolkit/components/crashes/CrashService.js
@@ +8,5 @@
>  
>  Cu.import("resource://gre/modules/Services.jsm");
>  Cu.import("resource://gre/modules/XPCOMUtils.jsm");
>  
> +const AGGREGATE_STARTUP_DELAY_MS = 60000;

Please document.
Also, I realize that the maintenance tasks will probably be quite fast in most cases, but since we're delaying them, I'd rather we put stuff in idle or on idle-daily, rather than after a completely arbitrary amount of time.

::: toolkit/components/crashes/crash-events.rst
@@ +7,5 @@
> +
> +When an event worthy of recording occurs, a file containing that event's
> +information is written to a well-defined location on the filesystem. The Gecko
> +process periodically scans for produced files and consolidates information
> +into a more unified and effecient filesystem format.

"efficient"

@@ +22,5 @@
> +The filename of the event file is not relevant. However, producers need
> +to choose a filename intelligently to avoid name collisions and race
> +conditions. Since file locking is potentially dangerous at crash time,
> +the convention of generating a UUID and using it as a filename has been
> +adopted.

If data can be in UAppData, I believe that we should insist upon a recognizable file extension (say .mozcrash).

@@ +53,5 @@
> +
> +crash.gecko.1
> +^^^^^^^^^^^^^
> +
> +This event is produced when the Gecko process crashes.

In the context of e10s, does this include both the parent process and children processes?

@@ +57,5 @@
> +This event is produced when the Gecko process crashes.
> +
> +The payload of this event is a single line: the UUID of the crash that was
> +produced and saved to disk by Breakpad. There should exist *<UUID>*.dmp and
> +*<UUID>*.extra files on disk.

Where?

@@ +80,5 @@
> +crashreporter.submit.attempt.1
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +This event is produced by the crash reporter when it attempts to upload a
> +crash (likely to Mozilla). Crashes may also be uploaded in product. This

Nit: "in product" might not be clear for all readers.

@@ +83,5 @@
> +This event is produced by the crash reporter when it attempts to upload a
> +crash (likely to Mozilla). Crashes may also be uploaded in product. This
> +event is only meant for use by the crash reporter.
> +
> +The payload consists of 3 lines::

Single ":", I guess.

@@ +93,5 @@
> +The submission ID is an ID that uniquely identifies this submission attempt.
> +It is used to correlate results against the request that initiated them.
> +
> +Time of submission is the number of seconds since UNIX epoch expressed as
> +an integer.

Number of seconds as an integer? That's unusual. Why not the result of Date.now()?

::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
@@ +377,5 @@
> +  yield m.aggregateEventsFiles();
> +
> +  let crashes = yield m.getCrashes();
> +  Assert.equal(crashes.length, 1);
> +  let c = crashes[0];

Despite the comment, I don't understand what you have tested here.

::: toolkit/components/crashes/tests/xpcshell/test_crash_service.js
@@ +32,5 @@
>  
>    run_next_test();
>  });
> +
> +add_task(function test_aggregate_events_files() {

What does this test, exactly?

::: toolkit/components/crashes/tests/xpcshell/test_crash_store.js
@@ +34,5 @@
> +  let s = new CrashStore("/some/path");
> +  Assert.ok(s instanceof CrashStore);
> +
> +  run_next_test();
> +});

Is there a reason to combine add_test and add_task?
Attachment #8344322 - Flags: feedback?(dteller)
Attachment #8344322 - Flags: feedback+
Comment on attachment 8344319 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data

Review of attachment 8344319 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/crashes/CrashManager.jsm
@@ +27,5 @@
> + *   pendingDumpsDir (string)
> + *     Where dump files that haven't been uploaded are located.
> + *
> + *   submittedDumpsDir (string)
> + *     Where records up uploaded dumps are located.

I assume this was meant to read "Where records of uploaded dumps are located."

@@ +99,5 @@
> +    return this._getDirectoryEntries(this._submittedDumpsDir,
> +                                     this.SUBMITTED_REGEX);
> +  },
> +
> +  _getDirectoryEntries: function (path, re) {

None of these functions mentions that the array of objects is sorted by date, FWIW.

::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
@@ +21,5 @@
> +    f.create(Ci.nsIFile.DIRECTORY_TYPE, dirMode);
> +  }
> +
> +  const dirMode = OS.Constants.libc.S_IRWXU;
> +  let baseFile = do_get_tempdir();

Actually do_get_tempdir is a directory that the Python harness will clean up after the test is run, so it's better suited for use in an xpcshell test.

@@ +63,5 @@
> +    file.lastModifiedTime = date.getTime();
> +    dump("Created fake crash: " + file.path + "\n");
> +
> +    return uuid;
> +  };

It's possible, but I don't think it adds much to this test.

@@ +91,5 @@
> +  let m = new CrashManager({
> +    pendingDumpsDir: "/foo",
> +    submittedDumpsDir: "/bar",
> +  });
> +  Assert.ok(m);

This was bug 873126 FYI.
Attachment #8344319 - Flags: review?(ted) → review+
Comment on attachment 8344320 [details] [diff] [review]
Part 3: XPCOM service for managing crash data

Review of attachment 8344320 [details] [diff] [review]:
-----------------------------------------------------------------

I wish we had a standard way for defining chrome interfaces in WebIDL. :-/

::: toolkit/components/crashes/CrashService.js
@@ +34,5 @@
> +    reportsDir.append("Crash Reports");
> +    let pendingDir = reportsDir.clone();
> +    let submittedDir = reportsDir.clone();
> +    pendingDir.append("pending");
> +    submittedDir.append("submitted");

I wish we had a better JS API for path management. Do we not have that in OS.File yet?

::: toolkit/components/crashes/nsICrashes.idl
@@ +45,5 @@
> +   *  - path
> +   *  - date
> +   *
> +   * This interface could be changed to expose richer types if consumers
> +   * require it. Please file a bug.

This comment doesn't really need to be here.
Attachment #8344320 - Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #61)
> Comment on attachment 8344320 [details] [diff] [review]
> Part 3: XPCOM service for managing crash data
> 
> Review of attachment 8344320 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I wish we had a standard way for defining chrome interfaces in WebIDL. :-/
> 
> ::: toolkit/components/crashes/CrashService.js
> @@ +34,5 @@
> > +    reportsDir.append("Crash Reports");
> > +    let pendingDir = reportsDir.clone();
> > +    let submittedDir = reportsDir.clone();
> > +    pendingDir.append("pending");
> > +    submittedDir.append("submitted");
> 
> I wish we had a better JS API for path management. Do we not have that in
> OS.File yet?

We do.

Hint, to make it readable:
let Path = OS.Path;
let Paths = OS.Constants.Path;
(Assignee)

Comment 63

5 years ago
Incorporated review feedback. The plugin and child process crash section
could be improved. Perfect is the enemy of done. I'd much prefer to
commit something and let a subject matter expert fill it in later.
Attachment #8356347 - Flags: review?(benjamin)
(Assignee)

Comment 64

5 years ago
function* is now used. Review feedback incorporated (except for the
comments I didn't grok).
Attachment #8356364 - Flags: review?(ted)
(Assignee)

Updated

5 years ago
Attachment #8344319 - Attachment is obsolete: true
(Assignee)

Comment 65

5 years ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #60)
> ::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
> @@ +21,5 @@
> > +    f.create(Ci.nsIFile.DIRECTORY_TYPE, dirMode);
> > +  }
> > +
> > +  const dirMode = OS.Constants.libc.S_IRWXU;
> > +  let baseFile = do_get_tempdir();
> 
> Actually do_get_tempdir is a directory that the Python harness will clean up
> after the test is run, so it's better suited for use in an xpcshell test.

I don't parse this comment.
 
> @@ +63,5 @@
> > +    file.lastModifiedTime = date.getTime();
> > +    dump("Created fake crash: " + file.path + "\n");
> > +
> > +    return uuid;
> > +  };
> 
> It's possible, but I don't think it adds much to this test.

*scratches head*

I'll resubmit updated patch so you have a chance to clarify.
(Assignee)

Updated

5 years ago
Attachment #833354 - Attachment is obsolete: true
(Assignee)

Comment 66

5 years ago
Comment on attachment 8356364 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data

I think I missed Yoric's comment. Yet another reminder how crappy Bugzilla is for code review.

I may very well submit these reviews on ReviewBoard...
Attachment #8356364 - Flags: review?(ted)
Sorry, my comments were in reply to comments Yoric left. That's totally unclear. :-(
(Assignee)

Comment 68

5 years ago
If anyone feels like using a review tool with better UX than bugzilla:

https://reviewboard.allizom.org/r/21/
https://reviewboard.allizom.org/r/22/
Comment on attachment 8356347 [details] [diff] [review]
Part 1: Document existing crash reporter behavior

I might tweak it later, but this is a good summary.
Attachment #8356347 - Flags: review?(benjamin) → review+
(Assignee)

Comment 71

5 years ago
Incorporated Yoric's feedback.

Also up for review at https://reviewboard.allizom.org/r/22/. I highly
prefer the review be conducted there. I think you will too :)
Attachment #8357318 - Flags: review?(ted)
Attachment #8357318 - Flags: review?(dteller)
(Assignee)

Updated

5 years ago
Attachment #8356364 - Attachment is obsolete: true
Comment on attachment 8357318 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data;

Review of attachment 8357318 [details] [diff] [review]:
-----------------------------------------------------------------

Side-note: could you add version numbers to your patches when you upload them?

Reviewed on https://reviewboard.allizom.org/r/22/
Attachment #8357318 - Flags: review?(dteller) → review+
(Assignee)

Comment 74

5 years ago
> Side-note: could you add version numbers to your patches when you upload
> them?

Sorry, I usually do. FWIW, ReviewBoard creates the version numbers automatically. Another reason why ReviewBoard is superior to Bugzilla.
 
> Reviewed on https://reviewboard.allizom.org/r/22/

Thanks. I also noticed ReviewBoard isn't sending notification emails. I filed bug 958236.
(Assignee)

Comment 75

5 years ago
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #61)
> Comment on attachment 8344320 [details] [diff] [review]
> Part 3: XPCOM service for managing crash data

> ::: toolkit/components/crashes/CrashService.js
> @@ +34,5 @@
> > +    reportsDir.append("Crash Reports");
> > +    let pendingDir = reportsDir.clone();
> > +    let submittedDir = reportsDir.clone();
> > +    pendingDir.append("pending");
> > +    submittedDir.append("submitted");
> 
> I wish we had a better JS API for path management. Do we not have that in
> OS.File yet?

I'd need to patch OS.Path.Constants to provide access to UAppData (XRE_USER_APP_DATA_DIR).

I looked at dom/system/OSFileConstants.cpp, but I'm not sure what all is involved. Perhaps Yoric can help...
(In reply to Gregory Szorc [:gps] from comment #75)
> I'd need to patch OS.Path.Constants to provide access to UAppData
> (XRE_USER_APP_DATA_DIR).
> 
> I looked at dom/system/OSFileConstants.cpp, but I'm not sure what all is
> involved. Perhaps Yoric can help...

Sure. File a big and I'll do it (or mentor it).
(Assignee)

Comment 77

5 years ago
Attempting to upload an attachment that redirects to ReviewBoard...

ted has already granted r+.
Attachment #8344320 - Attachment is obsolete: true
Attachment #8358101 - Flags: review?(benjamin)
(Assignee)

Comment 78

5 years ago
Holy crap, I can't believe the ReviewBoard auto redirect actually worked! Rich MIME type too. I owe someone a beer.
Attachment #8344322 - Attachment is obsolete: true
Attachment #8344322 - Flags: review?(ted)
(Assignee)

Comment 79

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> Comment on attachment 8344322 [details] [diff] [review]
> Part 4: Add Support for crash event files to CrashManager
> >+  _processEventFile: function (entry) {
> >+    return Task.spawn(function () {
> >+      let data;
> >+      try {
> >+        data = yield OS.File.read(entry.path);
> >+      } catch (ex) {
> >+        // TODO log
> >+        throw new Task.Result(this.EVENT_FILE_ERROR_IO);
> >+      }
> >+
> >+      let store = yield this._getStore();
> >+
> >+      let decoder = new TextDecoder();
> >+      data = decoder.decode(data);
> >+
> >+      let sepIndex = data.indexOf("\n");
> >+      if (sepIndex == -1) {
> >+        throw new Task.Result(this.EVENT_FILE_ERROR_MALFORMED);
> >+      }
> >+
> >+      let type = data.substring(0, sepIndex);
> >+      let payload = data.substring(sepIndex + 1);
> >+
> >+      if (type == "crash.gecko.1" || type == "crash.plugin.1") {
> 
> .gecko seems confusing, can this be .chrome? This seems to be missing
> content-process crashes; these are already being used for thumbnails. I
> suspect that we should paramaterize this rather than having separate
> .addGeckoCrash and .addPluginCrash methods.

Good point.

Nit pick: Is "chrome" the best choice? "chrome" to me implies the main UI process. That just happens to be the "main" (parent) application process today. Will this assumption always hold? Should we just call it "main?"
 
> Also how does this handle plugin hangs?

It doesn't, yet. That's mainly because I'm not sure how plugin/subprocess hangs/crashes work.

> I don't understand what the store expiration timer is for. Why is there an
> expiration at all? Is that because of memory usage?

Yes. No sense keeping unused resources allocated! I'm not sure if this will actually impact the bottom line in practice. But at least I won't be contributing to the problem!

> >diff --git a/toolkit/components/crashes/crash-events.rst b/toolkit/components/crashes/crash-events.rst
> 
> >+crash.gecko.1
> 
> As noted above, I'm wary of having separate events for crash.gecko and
> crash.plugin. Also it seems to me that what we really want is to include all
> of the crash metadata in the event file: this will naturally include the
> ProcessType= annotation as well as any Hang= annotation.

I was kinda hoping to avoid processing the annotations at this point. If this is inevitable for the initial release, I guess I'll just bite the bullet.

> Finally, it doesn't need to be in this patch, but we should probably limit
> the crash data by total count as well as the 180-day limit. There have been
> people who crash Flash hundreds of times a day (it basically crashes on
> plugin startup every time), and we want to apply a sane limit. Maybe 5000
> crashes total?

Agreed. Unbound data structures and performance don't go well together.

I worry about the impact of a naive chomping algorithm on data reporting. Perhaps we should have separate ceilings for plugins and main process crashes so we don't lose the (presumably) more important data on main process crashes.

We can store a high water mark marker to indicate excessive crashes occurred on a given day.
(Assignee)

Comment 80

5 years ago
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #59)
> Comment on attachment 8344322 [details] [diff] [review]
> Part 4: Add Support for crash event files to CrashManager
> ::: toolkit/components/crashes/CrashManager.jsm
> @@ +48,2 @@
> >   */
> >  this.CrashManager = function (options) {
> 
> Wouldn't it be simpler to store a shallow copy of |options| as
> |this._options| and use this everywhere?

Yes. But then you have to type more at use sites. Plus you involve an extra property lookup. Meh.

> @@ +79,5 @@
> >        default:
> >          throw new Error("Unknown property in options: " + k);
> >      }
> >    }
> > +
> 
> Could you please document these fields?

They are documented above.

> @@ +105,5 @@
> > +
> > +  EVENT_FILE_SUCCESS: "ok",
> > +  EVENT_FILE_ERROR_IO: "io",
> > +  EVENT_FILE_ERROR_MALFORMED: "malformed",
> > +  EVENT_FILE_ERROR_UNKNOWN_EVENT: "unknown-event",
> 
> These look like global constants rather than prototype fodder. Do you have
> any reason to put these things in the prototype?

As stated on an earlier review, I like attaching things to the logical closest scope. That's how lots of programming languages work (globals are evil) and it makes refactoring easier since the only external symbols are typically imports. Just my style pref.

> @@ +177,5 @@
> > +  aggregateEventsFiles: function () {
> > +    return Task.spawn(function () {
> > +      if (this._aggregateInProgress) {
> > +        throw new Error("Event aggregation already in progress. Try again " +
> > +                        "later.");
> 
> Shouldn't you return the promise of the aggregation in progress?

Good idea! Will fix.
 
> @@ +553,5 @@
> > +        normalized.crashes[id] = c;
> > +      }
> > +
> > +      let method = this._storePath.endsWith(".gz") ? "writeGzippedFile"
> > +                                                   : "writeFile";
> 
> As mentioned in the other patch, I suspect that we should normalize upon lz4
> for the time being.

I'm fine with that. What's the timetable for landing lz4?

> ::: toolkit/components/crashes/CrashService.js
> @@ +8,5 @@
> >  
> >  Cu.import("resource://gre/modules/Services.jsm");
> >  Cu.import("resource://gre/modules/XPCOMUtils.jsm");
> >  
> > +const AGGREGATE_STARTUP_DELAY_MS = 60000;
> 
> Please document.
> Also, I realize that the maintenance tasks will probably be quite fast in
> most cases, but since we're delaying them, I'd rather we put stuff in idle
> or on idle-daily, rather than after a completely arbitrary amount of time.

In Sync land, we learned the hard way to not trust idle-daily because it apparently doesn't fire in a lot of our users. I believe sometimes it fires many signals all at once! See bug 845558 where I explained the problem and requested Telemetry to assess the scope of the problem. In the mean time, I'm very wary about relying on idle anything for important things.

> @@ +22,5 @@
> > +The filename of the event file is not relevant. However, producers need
> > +to choose a filename intelligently to avoid name collisions and race
> > +conditions. Since file locking is potentially dangerous at crash time,
> > +the convention of generating a UUID and using it as a filename has been
> > +adopted.
> 
> If data can be in UAppData, I believe that we should insist upon a
> recognizable file extension (say .mozcrash).

Why? The directory name already denotes the content as belonging to the Gecko app.

> @@ +53,5 @@
> > +
> > +crash.gecko.1
> > +^^^^^^^^^^^^^
> > +
> > +This event is produced when the Gecko process crashes.
> 
> In the context of e10s, does this include both the parent process and
> children processes?

I have no clue since I know little about e10n. See earlier reply to bsmedberg's review, where I suggest we call crashes in the main process as "main."
 

> @@ +83,5 @@
> > +This event is produced by the crash reporter when it attempts to upload a
> > +crash (likely to Mozilla). Crashes may also be uploaded in product. This
> > +event is only meant for use by the crash reporter.
> > +
> > +The payload consists of 3 lines::
> 
> Single ":", I guess.

No, this is restructured text. The double colon tells it that preformatted text follows.

> 
> @@ +93,5 @@
> > +The submission ID is an ID that uniquely identifies this submission attempt.
> > +It is used to correlate results against the request that initiated them.
> > +
> > +Time of submission is the number of seconds since UNIX epoch expressed as
> > +an integer.
> 
> Number of seconds as an integer? That's unusual. Why not the result of
> Date.now()?

Because we don't need millisecond precision IMO.
 
> ::: toolkit/components/crashes/tests/xpcshell/test_crash_store.js
> @@ +34,5 @@
> > +  let s = new CrashStore("/some/path");
> > +  Assert.ok(s instanceof CrashStore);
> > +
> > +  run_next_test();
> > +});
> 
> Is there a reason to combine add_test and add_task?

Originally, it was because non-task tests didn't need to be executed as tasks. I suppose it really doesn't matter. I like using tasks because they can't hang as easily as add_test() can. I'll probably just switch to add_task() for everything from now on.
I really want to insist that we call all crash reporting events "crash.1" and then use something like ProcessType to distinguish between main-process/plugin-process/content-process.
(Assignee)

Comment 82

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #81)
> I really want to insist that we call all crash reporting events "crash.1"
> and then use something like ProcessType to distinguish between
> main-process/plugin-process/content-process.

What about hangs? I'm still unclear on every scenario where we create a minidump. Should we even refer to "minidumps" as "crashes?"

Perhaps you could answer by updating the in-tree docs. I'd like to see these things documented where others will find them later. If you don't update the docs, I will :)
Posted patch crash-docsSplinter Review
Attachment #8358567 - Flags: review?(gps)

Updated

5 years ago
Attachment #8358101 - Flags: review?(benjamin)
(Assignee)

Comment 84

5 years ago
Comment on attachment 8358567 [details] [diff] [review]
crash-docs

Review of attachment 8358567 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine. But the patch is rather small. Did you upload everything you intended to?
Attachment #8358567 - Flags: review?(gps) → review+
Yes, I was just writing the docs that I thought would be helpful in terms of plugin and content crash flow. If you have other stuff you'd like me to write, let me know.

Updated

5 years ago
Priority: -- → P1
(Assignee)

Comment 87

5 years ago
(In reply to Gregory Szorc [:gps] from comment #84)
> Comment on attachment 8358567 [details] [diff] [review]
> crash-docs
> 
> Review of attachment 8358567 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks fine. But the patch is rather small. Did you upload everything
> you intended to?

Oh, turns out Splinter wasn't formatting the patch properly! My browser is only showing a net addition of ~2 lines but the raw patch contained much more. Sadness.
Comment on attachment 8357318 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data;

Review of attachment 8357318 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, sorry for the delay.
Attachment #8357318 - Flags: review?(ted) → review+
(Assignee)

Comment 89

5 years ago
Comment on attachment 8358101 [details]
Part 3: XPCOM Service for managing crash data

Latest revision on ReviewBoard incorporates review feedback.
Attachment #8358101 - Flags: review?(benjamin)
(Assignee)

Comment 91

5 years ago
Comment on attachment 8358101 [details]
Part 3: XPCOM Service for managing crash data

Review was granted on ReviewBoard.
Attachment #8358101 - Flags: review?(benjamin) → review+
(Assignee)

Comment 92

5 years ago
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager

I paired the patch down to just the basic crash event files implementation. Hopefully we can focus on that. We can bikeshed event specifics later (in the next day or two).
Attachment #8358103 - Flags: review?(dteller)
Attachment #8358103 - Flags: review?(benjamin)
And a follow-up to hopefully un-bust comm-central.
https://hg.mozilla.org/comm-central/rev/1e1da3a4dac3
(Assignee)

Comment 95

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #94)
> And a follow-up to hopefully un-bust comm-central.
> https://hg.mozilla.org/comm-central/rev/1e1da3a4dac3

Thanks, Ryan.
Attachment #8358103 - Flags: review?(dteller) → feedback+

Updated

5 years ago
Attachment #8358103 - Flags: review?(benjamin)
(Assignee)

Comment 96

5 years ago
Attachment #8368279 - Flags: review?(benjamin)
(Assignee)

Comment 97

5 years ago
Attachment #8368280 - Flags: review?(benjamin)
(Assignee)

Comment 98

5 years ago
Attachment #8368282 - Flags: review?(benjamin)
(Assignee)

Comment 99

5 years ago
Attachment #8368319 - Flags: review?(rnewman)
Attachment #8368319 - Flags: review?(benjamin)
(Assignee)

Comment 100

5 years ago
And with that, I think the JS is feature complete. Now, time for the C++.

Updated

5 years ago
Attachment #8368280 - Flags: review?(benjamin)

Updated

5 years ago
Attachment #8368282 - Flags: review?(benjamin)

Updated

5 years ago
Attachment #8368319 - Flags: review?(benjamin)
(Assignee)

Updated

5 years ago
Blocks: 966691
(Assignee)

Comment 102

5 years ago
Attachment #8369147 - Flags: feedback?(ted)
Attachment #8369147 - Flags: feedback?(benjamin)

Updated

5 years ago
Attachment #8368279 - Flags: review?(benjamin)

Updated

5 years ago
Attachment #8369147 - Flags: feedback?(benjamin)
I think I've got all the reviews you needed from me. Please let me know in the flags if there is something that needs re-review or that I missed.
(Assignee)

Updated

5 years ago
Attachment #8369147 - Flags: feedback?(ted)
(Assignee)

Comment 105

5 years ago
Looks like the latter patches are going to miss the 29 Aurora train. I'm flagging this bug to target 29 for uplift. This bug will give us significantly better FHR data on crashes in Firefox. The old mechanism was prone to double counting and incomplete data. Hopefully we can get the entire patch set uplifted to Aurora in the next week.
Not sure that we need to track this I spoke with :gps on IRC and advised he could request uplift.
(Assignee)

Comment 107

5 years ago
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager

Feedback applied. Requesting re-review.
Attachment #8358103 - Flags: review?(dteller)
(Assignee)

Updated

5 years ago
Attachment #8368282 - Flags: review?(dteller)
Attachment #8358103 - Flags: review?(dteller) → review+
(Assignee)

Comment 109

5 years ago
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager

This bug tracks an overhaul of how Gecko tracks crashes. The goal of this feature is to enable Firefox Health Report to report more accurate crash data. This data will be extremely useful to Mozilla and will directly help the stability efforts. It therefore will be beneficial to our users.

The sooner we can get this code to the release channel, the better.

The feature consists of a new XPCOM component that aggregates and stores crash data. This feature is self-contained. Initially, FHR will be the only consumer. Although, about:crashes and other "health" browser features will likely be built on top of it.

I believe the risk of the patch set should be low. The highest risk is that latter patches in this series break crash reporting. But, that should be noticeable via dropped crash volume on Soccoro and via FHR numbers changing.

I highly doubt we'd need to back out the new XPCOM service. But if we did, it is self-contained within a single directory, which we could easily delete.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): New feature for better crash reporting
User impact if declined: FHR crash reporting will continue to be bad.
Testing completed (on m-c, etc.): Unit tests. Feature in process of landing.
Risk to taking this patch (and alternatives if risky): Self-isolated component. 
String or IDL/UUID changes made by this patch: None
Attachment #8358103 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 110

5 years ago
Comment on attachment 8368279 [details]
Part 5: Document crash event types

See comment #109 for more uplift context.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Crash logging 
User impact if declined: Bad crash data
Testing completed (on m-c, etc.): This is a docs-only change and is NPOTB.
Risk to taking this patch (and alternatives if risky): None
String or IDL/UUID changes made by this patch: None
Attachment #8368279 - Flags: review+
Attachment #8368279 - Flags: approval-mozilla-aurora?
(In reply to Gregory Szorc [:gps] from comment #109)
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): New feature for better crash
> reporting
> User impact if declined: FHR crash reporting will continue to be bad.
> Testing completed (on m-c, etc.): Unit tests. Feature in process of landing.
> Risk to taking this patch (and alternatives if risky): Self-isolated
> component. 
> String or IDL/UUID changes made by this patch: None

Bug 952335 might need to be uplifted, too.
Depends on: 952335
(Assignee)

Comment 112

5 years ago
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #111)
> Bug 952335 might need to be uplifted, too.

Thanks for the awareness! I requested it to be uplifted.
We should probably get QA to do some verification of this first before we request uplift.
(Assignee)

Comment 115

5 years ago
Comment on attachment 8368280 [details]
Part 6: Implement support for initial crash events

This needs re-review since r+'d patch has been updated significantly.
Attachment #8368280 - Flags: review?(benjamin)

Updated

5 years ago
Attachment #8368280 - Flags: review?(benjamin)
gps: does Part X still need review from me? I can't tell from the 116 comments and 10 active attachments :P
Attachment #8368279 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8358103 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager

Too early
Attachment #8358103 - Flags: approval-mozilla-aurora+ → approval-mozilla-aurora?
Attachment #8368279 - Flags: approval-mozilla-aurora+ → approval-mozilla-aurora-
Attachment #8368279 - Flags: approval-mozilla-aurora- → approval-mozilla-aurora?
(Assignee)

Comment 120

5 years ago
(In reply to Richard Newman [:rnewman] from comment #117)
> gps: does Part X still need review from me? I can't tell from the 116
> comments and 10 active attachments :P

Yes it does. (Reducing clutter is one reason I prefer an external review tool.)
Attachment #8368282 - Flags: review?(dteller) → review+
(Assignee)

Updated

5 years ago
Blocks: 970103
(Assignee)

Comment 123

5 years ago
Comment on attachment 8369147 [details]
Part 8: Write main process crash events files

I think I got things working on Linux and Windows. Time to request formal review. I expect many review comments.
Attachment #8369147 - Flags: review?(benjamin)
(Assignee)

Comment 124

5 years ago
I spent a large portion of my time on Friday trying to sort out a Windows debug only failure with part 8 (seen also at https://tbpl.mozilla.org/?tree=Try&rev=9f47be8a7faa).

When I run the xpcshell test (read: the code that produces event files) locally, the window about a "program encountering an error... would you like to abort, retry, ignore" comes up. But I'm not sure why. No matter what I do at that prompt, the test fails.

I've tried starting the xpcshell.exe process, attaching the Visual Studio debugger, then resuming the test, but I get the same behavior.

I've banged my head against this for far too long. I suspect the issue is something silly, but I'm not sure what. I tried commenting out whole sections of my code to isolate the problem, but I can't seem to isolate it. I'm not sure if my code is just doing something silly, if the failure in automation is the same issue I'm seeing or what. I suspect/hope the issue will get teased out during code review.
The way do_crash works is it launches another xpcshell process and crashes *that* process. So I expect that you are seeing the error dialog from that process and not the main xpcshell process running the test.

That dialog is expected when the crash reporter is not turned on properly, or when the crash handler itself crashes. I just posted the review of part 8 and I expect that the non-nullterminated id_ascii is causing the crash handler itself to crash.

Updated

5 years ago
Attachment #8369147 - Flags: review?(benjamin)
Don't forget that if you want uplift for aurora for all patches, it should be requested soon. 
If it going to be harder to uplift them in beta.
Comment on attachment 8368319 [details]
Part 9: Change FHR provider to pull from crash manager

Three small issues on ReviewBoard.
Attachment #8368319 - Flags: review?(rnewman) → review+
(Assignee)

Comment 128

5 years ago
Comment on attachment 8369147 [details]
Part 8: Write main process crash events files

New version is posted on RB and passes Try \o/
Attachment #8369147 - Flags: review?(benjamin)
(Assignee)

Comment 129

5 years ago
(In reply to Richard Newman [:rnewman] from comment #127)
> Comment on attachment 8368319 [details]
> Part X: Change FHR provider to pull from crash manager
> 
> Three small issues on ReviewBoard.

Did you forget to post your review? I'm not seeing any new content on RB...
Flags: needinfo?(rnewman)
Doesn't show as pending to me:

https://reviewboard.allizom.org/r/30/#

  Posted 2 weeks ago (Feb. 11, 2014, 7:08 p.m.)
Flags: needinfo?(rnewman)

Updated

5 years ago
Attachment #8369147 - Flags: review?(benjamin)
(Assignee)

Comment 131

5 years ago
Part 8 (finally):

https://hg.mozilla.org/integration/mozilla-inbound/rev/de17e5deb35d

With nits addressed and Vidyo confirmation from bsmedberg that landing this before plugin events are recorded is acceptable.
(Assignee)

Updated

5 years ago
Blocks: 977464
(Assignee)

Comment 132

5 years ago
Needed to update the UUID of nsICrashReporter.

https://hg.mozilla.org/integration/mozilla-inbound/rev/bb4d62536dfd
Depends on: 976241
(Assignee)

Updated

5 years ago
Blocks: 982836
(Assignee)

Comment 134

5 years ago
Comment on attachment 8368319 [details]
Part 9: Change FHR provider to pull from crash manager

Requesting re-review since plugin references were removed.

I was pretty close to carrying r+ forward because code changes are minimal (ReviewBoard should give proper interdiff). But I like a review.

rnewman: I'll review the Sync patches right now.
Attachment #8368319 - Flags: review?(rnewman)
Attachment #8368319 - Flags: review?(benjamin)
Attachment #8368319 - Flags: review+
(Assignee)

Updated

5 years ago
Attachment #8368319 - Attachment description: Part X: Change FHR provider to pull from crash manager → Part 9: Change FHR provider to pull from crash manager
(Assignee)

Comment 135

5 years ago
Also, plugin reporting isn't going to land by Monday unless there is a miracle. We'll hopefully get that in next cycle and possibly uplift.

I plan to close out this bug once the FHR provider patch lands.
Comment on attachment 8368319 [details]
Part 9: Change FHR provider to pull from crash manager

rs=me
Attachment #8368319 - Flags: review?(rnewman) → review+
Gregory, does that mean I should refuse the two pending uplift requests?
Flags: needinfo?(gps)

Updated

5 years ago
Attachment #8368319 - Flags: review?(benjamin)
(Assignee)

Comment 138

5 years ago
https://hg.mozilla.org/integration/fx-team/rev/21793ea94b09

Landing FHR provider change and closing out this bug.

Will track plugin event recording in a soon-to-be-filed bug.
Whiteboard: [leave open]
(Assignee)

Comment 139

5 years ago
Switching needinfo to bsmedberg to answer 29 uplift question.
Flags: needinfo?(gps) → needinfo?(benjamin)
This is not safe enough to uplift less than a week before beta.
Flags: needinfo?(benjamin)
(Assignee)

Updated

5 years ago
Depends on: 983313
(Assignee)

Comment 142

5 years ago
Likely due to --disable-crashreporter in ASAN builds. We didn't have a direct dependency on the crash reporter before because we slurped files directory off the filesystem. Will likely need to put the crashes provider behind an #ifdef.
(Assignee)

Comment 143

5 years ago
https://hg.mozilla.org/integration/fx-team/rev/4496b6e98cf6

(Review conducted on ReviewBoard and IRC.)
Attachment #8368279 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager

Taking in account comment #140, I am going to reject the uplift request.
Fill free to submit it again later.
Attachment #8358103 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
https://hg.mozilla.org/mozilla-central/rev/4496b6e98cf6
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
No longer blocks: fxdesktoptriage

Updated

5 years ago
Blocks: 994707
Depends on: 1004623
I couldn't generate any crashes to show up in about:healthreport on a Mac OS X 10.9.2 and Ubuntu 13.10 x32 machines. I suppose this is because bug 1004623 is not yet on beta branch?
Flags: needinfo?(gps)
Flags: needinfo?(gps) → needinfo?(benjamin)
Alexandra, I don't know exactly what builds you're testing. If you are testing a build which doesn't have bug 1004623, then you need to run a page with a multiprocess plugin before testing.
Flags: needinfo?(benjamin)
Sorry for not mentioning it, I was testing the fix for Firefox 30.

Verified as fixed on Firefox 30 beta 6 (Build ID: 20140520115057) after loading http://benjamin.smedbergs.us/tests/ctptests/javaflash-together.html and following the steps from bug 1004623 comment 7 - a record of org.mozilla.crashes.crashes is present in about:healthreport on Ubuntu 12.04 x32, Windows 7 x64 and Mac OS X 10.9.2:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Mozilla/5.0 (X11; Linux i686; rv:30.0) Gecko/20100101 Firefox/30.0
Status: RESOLVED → VERIFIED
Keywords: verifyme
You need to log in before you can comment on or make changes to this bug.