Closed
Bug 875562
Opened 12 years ago
Closed 11 years ago
Write a log of crashes for FHR consumption
Categories
(Toolkit :: Crash Reporting, defect, P1)
Toolkit
Crash Reporting
Tracking
()
VERIFIED
FIXED
mozilla30
People
(Reporter: ted, Assigned: gps)
References
(Blocks 2 open bugs, )
Details
Attachments
(10 files, 10 obsolete files)
6.52 KB,
patch
|
benjamin
:
review+
|
Details | Diff | Splinter Review |
12.43 KB,
patch
|
ted
:
review+
Yoric
:
review+
|
Details | Diff | Splinter Review |
37 bytes,
text/x-review-board-request
|
gps
:
review+
|
Details |
37 bytes,
text/x-review-board-request
|
Yoric
:
review+
Yoric
:
feedback+
Sylvestre
:
approval-mozilla-aurora-
|
Details |
4.68 KB,
patch
|
gps
:
review+
|
Details | Diff | Splinter Review |
37 bytes,
text/x-review-board-request
|
gps
:
review+
Sylvestre
:
approval-mozilla-aurora-
|
Details |
37 bytes,
text/x-review-board-request
|
Details | |
37 bytes,
text/x-review-board-request
|
Yoric
:
review+
|
Details |
37 bytes,
text/x-review-board-request
|
rnewman
:
review+
|
Details |
37 bytes,
text/x-review-board-request
|
Details |
gps and I talked about this today. Currently FHR is trying to read the tea leaves in Crash Reports/{pending,submitted} to determine how many crashes a user experienced. This kind of sucks because those locations weren't designed for data collection so it's hard to get useful numbers out of them. We can't distinguish plugin and browser crashes, nor can we tell if the user crashed but declined to submit the crash report (because we simply delete it in that case).
We settled on having the exception handler callback write a simple log of actions, something like:
<timestamp> <local crash uuid> <action>
Examples might be:
1369351425 8355c849-2688-4697-afa9-1164b27cdd47 browser-crash
1369351456 c7dffee9-621e-44de-853a-8697a1d710e8 plugin-crash
1369351521 8355c849-2688-4697-afa9-1164b27cdd47 crash-submit
1369351538 c7dffee9-621e-44de-853a-8697a1d710e8 crash-delete
We should write a log entry every time the browser crashes, a plugin crashes, or a content process crashes. We should write an entry when we successfully submit a crash, or when the user declines to submit a crash and we delete it (clicking quit or restart in the crash reporter with the "submit" checkbox unchecked, for example).
We discussed some related issues that FHR could measure, like uncaught crashes. FHR could compare the crash log against the clean-shutdown tracking mechanism to see when unclean shutdowns happened that were not recorded as browser crashes.
Reporter | ||
Comment 1•12 years ago
|
||
I forgot to mention, but we'll want to truncate this log periodically to make sure it doesn't get ridiculously long. We'll want to make sure we're doing that in a non-racy way.
Reporter | ||
Comment 2•12 years ago
|
||
After discussion, we might want to write this log in potentially two separate places:
1) In the Crash Reports directory
2) In the profile directory
If we crash before we get a profile we can write it in the former. If we crash after we get a profile we can write it in the latter.
Assignee | ||
Comment 3•12 years ago
|
||
Daniel: Since we're talking about changing how crashes are recorded, are there any more pieces of data Metrics would like from FHR?
Flags: needinfo?(deinspanjer)
Comment 4•12 years ago
|
||
What is the proposed aggregation for sending this crash data back via FHR? I'm not sure sending the local crash UUIDs is appropriate, but perhaps, using your example, the number of browser-crash, plugin-crash, crash-submit, crash-delete per day or session.
Assignee | ||
Comment 5•12 years ago
|
||
Presumably FHR would read the log from the last recorded offset and store new data as new per-day counts (just like it does today). The big difference from today is the underlying crash data would be much more reliable.
Comment 6•12 years ago
|
||
FHR should for sure not send any crash IDs. The client also does not know about what category (browser or plugin, etc.) a crash belongs to, I think, at least not without putting more analysis in. I think what FHR should send is only the count of encountered and submitted crashes per day.
Comment 7•12 years ago
|
||
The client can and should know about browser/content/plugin crashes. That's one of the points of this bug.
Comment 8•12 years ago
|
||
From a user-facing feature POV we'll want to be able to correlate those crashes to plugin+plugin version and app+app version. The reason is so that we can provide accurate diagnostics to users (if the previous version of a plugin was crashing when a user sneezed, but the new version isn't crashing, we shouldn't tell them to update/disable that plugin, for example).
Assignee | ||
Comment 9•11 years ago
|
||
Ted: Didn't we brainstorm this on an etherpad? I seem to have lost the link.
Also, this came up at the crashkill and stability work week last week. A lot of people would like to see richer crash data that this crash "log" will provide. Hopefully we can finalize a technical plan in the next week or two and then start to move forward on implementation. I volunteer myself to write the JS API. C++ is still up in the air.
Flags: needinfo?(ted)
Reporter | ||
Comment 10•11 years ago
|
||
I don't think we ever had an etherpad, I think we just brainstormed in a text editor and I pasted the contents we wound up with in comment 0.
Flags: needinfo?(ted)
For Session Restore, we find ourselves needing a simple mechanism to determine whether the previous session has crashed. We have been working on bug 888373 for that purpose. It seems that the FHR crash log would be a superset of what we have been doing, so if the ETA is close enough, we should be able to make use of this crash log instead of landing our component.
Which brings me to my question: do we have an ETA?
Flags: needinfo?(ted)
Reporter | ||
Comment 12•11 years ago
|
||
We don't have an ETA, this was just something gps and I discussed wanting for FHR. I'd be a little worried about relying on this for detecting unclean shutdowns, because we know there are still crashes that Breakpad doesn't catch, so they wouldn't show up in this log. In fact I mentioned that in comment 0:
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #0)
> We discussed some related issues that FHR could measure, like uncaught
> crashes. FHR could compare the crash log against the clean-shutdown tracking
> mechanism to see when unclean shutdowns happened that were not recorded as
> browser crashes.
Flags: needinfo?(ted)
Assignee | ||
Updated•11 years ago
|
Flags: needinfo?(deinspanjer)
Assignee | ||
Comment 13•11 years ago
|
||
Let's brainstorm this feature here:
https://etherpad.mozilla.org/crash-log
Comment 14•11 years ago
|
||
Ben, I am CC'ing you so you can be aware of the current limitations around counting crashes in FHR as well as potentially providing more feedback on the new special.
Comment 15•11 years ago
|
||
Who is going to work on this? Does anyone have this in their near-term planning?
Assignee | ||
Comment 16•11 years ago
|
||
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #15)
> Who is going to work on this? Does anyone have this in their near-term
> planning?
This was my action item. However, I'll need help with the C++ bits. I can do the JS side and the technical work up front, however.
First thing's first: we need a technical design.
I believe bsmedberg has contributed to https://etherpad.mozilla.org/crash-log.
Ted: Could you please fill in your thoughts?
Flags: needinfo?(ted)
Reporter | ||
Comment 17•11 years ago
|
||
I've already added my comments to that etherpad (in green). The design is sensible, it just needs an implementation.
Flags: needinfo?(ted)
Assignee | ||
Comment 18•11 years ago
|
||
I talked with bsmedberg about this yesterday. To make the implementation easier, we propose having the crash reporter write individual "event" files into well-defined locations and having an XPCOM service in Gecko slurp these up and write them into a unified crash log. This makes things like locking much more manageable. Plus, it means we only need to write 1 version of the code for interfacing with the crash log files.
I've updated https://etherpad.mozilla.org/crash-log. If this design is accepted, I can start working on the XPCOM service (which I prefer to implement in JavaScript).
Also, now seems like the right time to get a preliminary perf review of the design. CC vladan.
Flags: needinfo?(vdjeric)
Flags: needinfo?(ted)
Flags: needinfo?(benjamin)
Comment 19•11 years ago
|
||
Bear in mind that said XPCOM service (a) would need to run on a timer, 'cos it won't be touched by FHR on Android and thus can't lazily do the work, (b) not run on a timer, 'cos Gecko dies all the time on Android, so the timer would never get to run, and (c) would need to be designed to not do IO on Gecko startup, 'cos that'll kill perf on Android even worse than on desktop.
Those are tricky requirements to meet, so like most other lazy/background activities on Android, most likely we'd reimplement this in Java and just not ship the Gecko component -- Gecko is an engine for fetching and displaying web pages, not a cross-platform service layer.
If you want to make a call on this now and factor it into your design, I'd be happy to sign off.
Assignee | ||
Comment 20•11 years ago
|
||
Why does Fennec always have to rain on the parade?
If Fennec wanted to reimplement the crash log aggregating and writing in Java, power to them.
At the end of the day, we need an API for crash data that is accessible from the FHR backend (for submitting data) and from "health report"/"doctor" features in the application (such as about:healthreport) so said data can be consumed to make intelligent choices. Who knows, perhaps things like the plugin manager also need this data so they can choose to automatically disable crashy plugins or change the CtP notification appropriately, etc. I /think/ there will be a requirement for at least one component in Gecko to access this data, so a read API is a requirement.
At the end of the day, I think we just need to agree on the XPCOM surface area that placates all customers. Maybe one globally available interface for reading the log and another optional interface for managing/updating the log?
On a related front, this feature would be a candidate for the not-yet-invented "background Gecko service" people talk about every so often.
Comment 21•11 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #20)
> Why does Fennec always have to rain on the parade?
Not rain, just illustrating the enormously varied worlds in which this stuff might have to work! It's very easy to sketch out a solution that works great when Gecko is running on modern hardware for several minutes, but doesn't work well when Gecko only runs for 30 seconds at a time, and you have a few hundred KB of memory available while it does so.
> If Fennec wanted to reimplement the crash log aggregating and writing in
> Java, power to them.
"them" == "us", to be fair. And if we don't reimplement it, we're essentially WONTFIXing Bug 891597.
> plugin manager also need this data so they can choose to automatically
> disable crashy plugins or change the CtP notification appropriately, etc. I
> /think/ there will be a requirement for at least one component in Gecko to
> access this data, so a read API is a requirement.
Does it have to run within Gecko (e.g., for extensibility reasons), or is it enough to communicate via notifications/messages/calls?
Does the read API need to provide up-to-date data (and thus need to trigger a rollup)?
> At the end of the day, I think we just need to agree on the XPCOM surface
> area that placates all customers. Maybe one globally available interface for
> reading the log and another optional interface for managing/updating the log?
So long as the update part is hidden behind an interface, and its behavior is well-specified, we can pretty easily swap that out for a Java service, just as we do for FHR document generation itself. There would be no need to reimplement the log reader.
> On a related front, this feature would be a candidate for the
> not-yet-invented "background Gecko service" people talk about every so often.
It would, as soon as more than a minority of phones on the market can run Gecko without swapping out every other app on the device... for the foreseeable future, Gecko is a foreground feature, and we can reasonably expect to implement most new non-foreground functionality in Java to ship on Android.
Comment 22•11 years ago
|
||
I took a quick look at the Etherpad with the proposed design, I have only a couple of comments so far:
- I think it's ok to put crash event files from different profiles in the same directory. You could include the profile name (if available) as another column. When crash data is shown in FHR, it could be broken down by profile or only shown for the current profile.
- We should also include information about the version of the app that crashed in the event logs. If Firefox starts crashing frequently after updating to version X, we might want to make note of it in the FHR dash
- When exactly will the crash event files be processed into crash logs? On-demand or on every startup?
Flags: needinfo?(vdjeric)
Comment 23•11 years ago
|
||
At this point I don't think we should fret too much about separating crashes from profiles. Perhaps as a later step we should consider keeping crashes in the profile instead of the shared location (I forget exactly why we chose the shared location). Then we'd only have to deal with the shared location for crashes that happen before we've picked a profile.
Otherwise if we need to do this in the Firefox and Android frontend separately, we should split out a separate bug for Android and at least get the desktop bits landed.
Flags: needinfo?(benjamin)
Reporter | ||
Comment 24•11 years ago
|
||
We used the shared location just because we wanted to handle startup crashes that happened before a profile was chosen, and I didn't bother writing it to have a separate code path for the "have a profile" case.
Flags: needinfo?(ted)
Comment 25•11 years ago
|
||
Comment [1] mentions two types of crashes: browser-crash and plugin-crash.
Q1. Are there other possible categories for desktop crashes? Will we include them?
Q2. What are the relevant categories for Fennec crashes?
thanks
Reporter | ||
Comment 26•11 years ago
|
||
There are content process crashes, we're not shipping any e10s on desktop yet, but when we do we'll need to handle those. We have them in B2G, but I'm not sure if that's in scope yet.
Fennec only has browser crashes, since there's no e10s there at all.
Comment 27•11 years ago
|
||
We are shipping e10s on desktop 26 and maybe event 25 for thumbnails, and I do want the FHR stats to include these.
Comment 28•11 years ago
|
||
Yes, e10s is already in desktop builds, it's disabled by default for tabs (but a simple pref flip away) and even enabled and used by default for thumbnails generation (we don't have any UI hooked up for sending crashes there but IMHO we should strive to record them in that log for FHR). And in any case, we should make that mechanism ready to be able to handle those content process crashes so we don't run into needing a redesign early after having this in place.
Assignee | ||
Comment 29•11 years ago
|
||
I started diving into this bug on the plane yesterday. I was unable to
find decent comprehensive docs on how the crash reporter works and how
all the components play together, so I decided to write some! The docs
aren't super detailed, but I think they are helpful enough to newbies
wanting to learn how everything is hooked up.
This patch applies on top of bug 939367 so Sphinx docs can come from
multiple parts of the tree. I still need to improve bug 939367 to not
require changes to tools/docs when adding new docs sources. And, I
wouldn't be surprised if there were pushback to using moz.build files
for declaring documentation sources. So please limit your review to just
the index.rst file. And, since this is all NPOTB and since you,
Benjamin, are a subject matter expert, perhaps you should just change
the docs to be accurate instead of leaving review notes?
Attachment #833354 -
Flags: review?(benjamin)
Assignee | ||
Comment 30•11 years ago
|
||
This is what I have in mind.
It doesn't compile because I'm still trying to wrap my head around how
strings work. I probably need to expand the API a bit because eventually
we'll need to know both directories where crash event data can be
located.
Questions:
* Is it acceptable for me to have a statically allocated string or
should it be initialized at run-time? The string should be allocated
very early in process lifetime, so perf concerns over static
constructors may not apply?
* Is passing the path to the crash reporter client as an argument
acceptable? A crash annotation or note didn't feel right to me.
And environment variable manipulation would require global state
or memory mucking. Both seem bad.
* I'm not sure where the best place for the "paths to events files"
API is. It needs to be accessible via JS. Directory service? Or some
XPCOM service?
* What about directory creation? Should I create these directories at
app startup time? That seems bad for perf. Is it safe to create them at
crash time? The system may be unstable. But I suppose we're already
doing I/O in the crash reporter client... Should I create directories
at profile creation time?
Most of my questions are probably obvious to you. But this is my first
C++ patch for Gecko and I know practically nothing abount the underlying
systems.
Attachment #8334201 -
Flags: feedback?(benjamin)
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → gps
Status: NEW → ASSIGNED
Comment 31•11 years ago
|
||
Comment on attachment 833354 [details] [diff] [review]
Part 1: Crash reporter docs
This doc looks correct and fine in terms of firefox.exe crashes. It does not cover plugin-process crashes at all, nor does it mention that crash report submissions can also occur from about:crashes
child-process crashes are handled by the dom/ipc/CrashReproterParent class, and then into the Firefox frontend for submission via CrashSubmit.jsm
Updated•11 years ago
|
Attachment #8334201 -
Flags: feedback?(benjamin) → feedback+
Comment 32•11 years ago
|
||
> * Is it acceptable for me to have a statically allocated string or
> should it be initialized at run-time? The string should be allocated
> very early in process lifetime, so perf concerns over static
> constructors may not apply?
Static constructors are still bad. Use "nsString* gFoo";
> * Is passing the path to the crash reporter client as an argument
> acceptable?
Yes.
> * I'm not sure where the best place for the "paths to events files"
> API is. It needs to be accessible via JS. Directory service? Or some
> XPCOM service?
The directory service sucks. nsIXULRuntime or nsICrashReporter would both be ok places to hang this.
> * What about directory creation? Should I create these directories at
> app startup time? That seems bad for perf. Is it safe to create them at
> crash time?
Yes, you can make syscalls safely at crash time.
Oh also, rereading the patch: you're using nsIFile.getNativePath, which is a lossy method on Windows. We need to use unicode paths/nsAString in order to capture things properly.
Assignee | ||
Comment 33•11 years ago
|
||
The tree doesn't have a robust and reusable API for interfacing with
crash data. This patch is the start of a new API.
In this patch, I created the CrashManager type. It has APIs for
retrieving the lists of files related to crash dumps. In subsequent
patches, I will convert existing code in the tree that does similar
things to the new API. I will also build the events/timeline API onto
this type.
I made CrashManager generic because I hate, hate, hate singletons and
global variables. Allowing it to be instantiated multiple times with
different options (instead of say binding a global instance to ProfD)
makes the testing story much, much nicer. That is reason enough, IMO. In
a subsequent patch, I'll add an XPCOM service that instantiates the
"global" instance of CrashManager with the appropriate options.
It was tempting to add this code into the existing CrashReports.jsm.
However, this file does not import cleanly in xpcshell tests and I
didn't want to bloat scope to include fixing that file... yet.
CrashReports.jsm is using synchronous I/O. So, depending on how
adventerous I feel, I may replace consumers of CrashReports.jsm with the
new CrashManager.jsm, remove CrashReports.jsm, and eliminate another
source of synchronous I/O in the tree.
There is a lot that will build on this patch. Things should be clearer a
few patches in.
Attachment #8334977 -
Flags: review?(benjamin)
Assignee | ||
Updated•11 years ago
|
Attachment #8334201 -
Attachment is obsolete: true
Assignee | ||
Updated•11 years ago
|
Attachment #8334201 -
Attachment is obsolete: false
Assignee | ||
Comment 34•11 years ago
|
||
Oh, somehow I missed the existence of nsICrashReporter! I can only image what hysterical raisin has it sitting in xpcom/system and its implementation in toolkit/xre/nsAppRunner.cpp instead of toolkit/crashreporter. I didn't even realize nsXULAppInfo inherited from so many interfaces!
I would like to hang a lot of things off nsICrashReporter. Is it possible to implement part of that interface in JavaScript and part in C++? If so, could you please point me to an interface that is implemented in both languages so I have can have a clue what's going on. Or, will I need to write an interface just for JS (and possibly hang proxy methods off nsICrashReporter/nsXULAppInfo)?
Comment 35•11 years ago
|
||
> I would like to hang a lot of things off nsICrashReporter. Is it possible to
> implement part of that interface in JavaScript and part in C++?
No.
Reporter | ||
Comment 36•11 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #34)
> Oh, somehow I missed the existence of nsICrashReporter! I can only image
> what hysterical raisin has it sitting in xpcom/system and its implementation
> in toolkit/xre/nsAppRunner.cpp instead of toolkit/crashreporter. I didn't
> even realize nsXULAppInfo inherited from so many interfaces!
It's...complicated.
Comment 37•11 years ago
|
||
Comment on attachment 833354 [details] [diff] [review]
Part 1: Crash reporter docs
r- based on the previous comment.
Attachment #833354 -
Flags: review?(benjamin) → review-
Comment 38•11 years ago
|
||
Comment on attachment 8334977 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data
Use for...of instead of for...in Iterator
I'm a little worried about the regexes: on nightly/aurora we're still using bp-hr- IDs, and I think we should relax that to bp-*.txt.
I'm not confident reviewing the async JS code: perhaps yoric should review this also...
Finally, there are some pieces of data that are important which we might want to roll up in this API. Specifically the crash process type (browser/plugin/content) and whether this is a hang or a true crash. That information is in the .ini file for pending crashes, but isn't anywhere currently for the saved crashes; we may need to add it to the submitted .txt file or log it some other way.
Attachment #8334977 -
Flags: review?(benjamin) → feedback+
Assignee | ||
Comment 39•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #38)
> Comment on attachment 8334977 [details] [diff] [review]
> Part 2: Create CrashManager API for managing crash data
>
> Use for...of instead of for...in Iterator
js> for (let k of {foo: true, bar: false}) { print(k); }
typein:1:0 TypeError: (intermediate value)['@@iterator'] is not a function
js> for (let [k, v] in Iterator({foo: true, bar: false})) { print(k); }
foo
bar
> I'm a little worried about the regexes: on nightly/aurora we're still using
> bp-hr- IDs, and I think we should relax that to bp-*.txt.
Will do.
> Finally, there are some pieces of data that are important which we might
> want to roll up in this API. Specifically the crash process type
> (browser/plugin/content) and whether this is a hang or a true crash. That
> information is in the .ini file for pending crashes, but isn't anywhere
> currently for the saved crashes; we may need to add it to the submitted .txt
> file or log it some other way.
Noted. I may roll these features in incrementally.
Assignee | ||
Comment 40•11 years ago
|
||
I moved the code from toolkit/crashreporter to
toolkit/components/crashes. I think it logically makes sense to create a
wall between toolkit/crashreporter and the code that consumes it. If
nothing else, there's already a lot happening in toolkit/crashreporter
and I didn't want to contribute to the confusion. Subsequent patches in
the series will build upon this patch and make the establishment of a
new component more easily justified.
Attachment #8342971 -
Flags: review?(dteller)
Assignee | ||
Updated•11 years ago
|
Attachment #8334201 -
Attachment is obsolete: true
Assignee | ||
Updated•11 years ago
|
Attachment #8334977 -
Attachment is obsolete: true
Assignee | ||
Comment 41•11 years ago
|
||
This patch now works. I also added an IDL interface so we do things
properly. The service is mostly a stub now. Will build features in
subsequent patches. I'm authoring patches so every patch is standalone
and should result in a green tree. I can squash patches down before
landing if desired.
Attachment #8342972 -
Flags: review?(benjamin)
Comment 42•11 years ago
|
||
Comment on attachment 8342972 [details] [diff] [review]
Part 3: XPCOM service for managing crash data
It's not clear from the nsICrashesManager IDL whether it is intended to be used as a service (getService or createInstance). This is more confusing because of the name CrashesService.js. Can we make it CrashesManager.js for name consistency and add an .idl comment explaining that this is intended as a service?
Does this service really need to be created at startup, or only when somebody starts to use it? In any case, the profile-after-change dance doesn't seem to be necessary, since you aren't using the profile directory anywhere; and if you were, you could just use ProfDS to get the directory before the profile is official "started".
Ted wanted to see this, so with those changes please mark him for second review on the new patch.
Attachment #8342972 -
Flags: review?(benjamin)
Assignee | ||
Comment 43•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #42)
> Comment on attachment 8342972 [details] [diff] [review]
> Part 3: XPCOM service for managing crash data
>
> It's not clear from the nsICrashesManager IDL whether it is intended to be
> used as a service (getService or createInstance). This is more confusing
> because of the name CrashesService.js. Can we make it CrashesManager.js for
> name consistency and add an .idl comment explaining that this is intended as
> a service?
While the underlying JS type can be instantiated multiple times (to facilitate testing), there likely will only ever be one instance inside Gecko and that instance can live as a service. Perhaps I should expose a "service" IDL that exposes a single "manager" attribute. That seems like over-engineering.
> Does this service really need to be created at startup, or only when
> somebody starts to use it? In any case, the profile-after-change dance
> doesn't seem to be necessary, since you aren't using the profile directory
> anywhere; and if you were, you could just use ProfDS to get the directory
> before the profile is official "started".
I'll eventually hang a timer to aggregate event files on there. I totally agree that lazy loading is preferred for perf reasons - that's why .manager is a getter.
Comment 44•11 years ago
|
||
> Gecko and that instance can live as a service. Perhaps I should expose a
> "service" IDL that exposes a single "manager" attribute. That seems like
> over-engineering.
Indeed. Basically I just want the IDL and the JS to match naming, and a comment explaining that this is supposed to be used with .getService()
> I'll eventually hang a timer to aggregate event files on there. I totally
> agree that lazy loading is preferred for perf reasons - that's why .manager
> is a getter.
ok, then you don't need the profile stuff at all. If you want it created on startup continue with the app-startup category.
Assignee | ||
Comment 45•11 years ago
|
||
Initial stab at support for "event files" in the JS API. Needs a lot of
love in the "robust" department (recovering from corruption, etc). But,
there's enough here to start asking for feedback.
I've gone around the block a number of times on what storage format to
use. I've considered append-only files, files grouped by ID prefix,
files grouped by date of crash. I decided on the easiest implementation:
a single soon-to-be-gzipped JSON file. We currently have an unbound data
problem. Definitely need to add pruning in here. Storage is abstracted
from the public API, so we can swap things out easily enough. Although
once the cat is out of the bag, we'll need to write migration scripts.
Considerations for forwards compatibility fall under the "robust"
department. Hopefully the next iteration of the patch.
I still need to hook the XPCOM service up to this new code. Future
patch.
Attachment #8343749 -
Flags: feedback?(benjamin)
Assignee | ||
Comment 46•11 years ago
|
||
Start of the CrashManager API. Just supports dump file management for
now.
Attachment #8344319 -
Flags: review?(ted)
Attachment #8344319 -
Flags: review?(dteller)
Assignee | ||
Updated•11 years ago
|
Attachment #8342971 -
Attachment is obsolete: true
Attachment #8342971 -
Flags: review?(dteller)
Assignee | ||
Comment 47•11 years ago
|
||
Here is the start of the XPCOM service so our shiny new CrashManager is
instantiated at app startup. I changed the service name and
initialization logic since the last patch.
Attachment #8344320 -
Flags: review?(ted)
Assignee | ||
Updated•11 years ago
|
Attachment #8342972 -
Attachment is obsolete: true
Assignee | ||
Comment 48•11 years ago
|
||
This patch adds support for "crash events" in the CrashManager.
It could use more in the way of tests, but I think I hit all the
high-level desires for an initial landing.
Ted and Benjamin can fight over code review.
Vladan gets perf review.
Some of the design decisions were documented in the attached .rst file.
The choice of a single JSON file for event storage is probably what
scares me the most. It's very easy to come up with scenarios where that
doesn't scale. But without real data from clients in the wild, I have no
clue how realistic those concerns are.
Attachment #8344322 -
Flags: review?(ted)
Attachment #8344322 -
Flags: review?(benjamin)
Attachment #8344322 -
Flags: feedback?(vdjeric)
Comment 49•11 years ago
|
||
Comment on attachment 8344322 [details] [diff] [review]
Part 4: Add Support for crash event files to CrashManager
> /**
> * Provides an interface for crash data.
> *
> * This type is generic and can be instantiated any number of times.
> * However, most applications will typically only have one instance
> * instantiated and that instance will point to profile and user appdata
> * directories.
> *
> * Instances are created by passing an object with properties.
> * Recognized properties are:
> *
> * pendingDumpsDir (string)
> * Where dump files that haven't been uploaded are located.
> *
> * submittedDumpsDir (string)
> * Where records up uploaded dumps are located.
>+ *
>+ * eventsDirs (array)
>+ * Directories (defined as strings) where events files are written. This
>+ * instance will collects events from files in the directories specified.
It's not clear to me whether this will delete the submitted dumps after it has loaded the submitted dumps directory. Will it (in this patch or a later one)?
>+ _processEventFile: function (entry) {
>+ return Task.spawn(function () {
>+ let data;
>+ try {
>+ data = yield OS.File.read(entry.path);
>+ } catch (ex) {
>+ // TODO log
>+ throw new Task.Result(this.EVENT_FILE_ERROR_IO);
>+ }
>+
>+ let store = yield this._getStore();
>+
>+ let decoder = new TextDecoder();
>+ data = decoder.decode(data);
>+
>+ let sepIndex = data.indexOf("\n");
>+ if (sepIndex == -1) {
>+ throw new Task.Result(this.EVENT_FILE_ERROR_MALFORMED);
>+ }
>+
>+ let type = data.substring(0, sepIndex);
>+ let payload = data.substring(sepIndex + 1);
>+
>+ if (type == "crash.gecko.1" || type == "crash.plugin.1") {
.gecko seems confusing, can this be .chrome? This seems to be missing content-process crashes; these are already being used for thumbnails. I suspect that we should paramaterize this rather than having separate .addGeckoCrash and .addPluginCrash methods.
Also how does this handle plugin hangs?
I don't understand what the store expiration timer is for. Why is there an expiration at all? Is that because of memory usage?
>diff --git a/toolkit/components/crashes/crash-events.rst b/toolkit/components/crashes/crash-events.rst
>+crash.gecko.1
As noted above, I'm wary of having separate events for crash.gecko and crash.plugin. Also it seems to me that what we really want is to include all of the crash metadata in the event file: this will naturally include the ProcessType= annotation as well as any Hang= annotation.
>+crashreporter.submit.decline.1
>+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>+
>+This event is produced by the crash reporter when the user declines to submit
>+a crash. The crash reporter will likely delete the crash data files from disk.
It will!? I'm pretty sure that the crash remains in the pending/ directory in this case.
Also this doc doesn't indicate whether this event is written by both the in-product crash reporting UI (plugin/content crash submission) or just the binary crashreporter. I think this is an important detail.
Finally, it doesn't need to be in this patch, but we should probably limit the crash data by total count as well as the 180-day limit. There have been people who crash Flash hundreds of times a day (it basically crashes on plugin startup every time), and we want to apply a sane limit. Maybe 5000 crashes total?
Attachment #8344322 -
Flags: review?(benjamin)
Assignee | ||
Comment 50•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #49)
> Comment on attachment 8344322 [details] [diff] [review]
> Part 4: Add Support for crash event files to CrashManager
Excellent feedback, Benjamin.
> It's not clear to me whether this will delete the submitted dumps after it
> has loaded the submitted dumps directory. Will it (in this patch or a later
> one)?
I think eventually we'll get there. Maybe not in this patch set, but definitely a follow-up. The reason is, once the event timeline can be trusted as canonical, there should be no need for the "submitted" directory to exist at all - we can just have the crash submission mechanism write the appropriate event file or call into the "timeline API" from inside Gecko. I didn't want to bloat scope too much. But I do anticipate rewriting FHR, about:crashes, any others.
> >+ if (type == "crash.gecko.1" || type == "crash.plugin.1") {
>
> .gecko seems confusing, can this be .chrome? This seems to be missing
> content-process crashes; these are already being used for thumbnails. I
> suspect that we should paramaterize this rather than having separate
> .addGeckoCrash and .addPluginCrash methods.
>
> Also how does this handle plugin hangs?
>
> As noted above, I'm wary of having separate events for crash.gecko and
> crash.plugin. Also it seems to me that what we really want is to include all
> of the crash metadata in the event file: this will naturally include the
> ProcessType= annotation as well as any Hang= annotation.
Bikeshedding on the message names and their payloads is very much welcome. I'm no expert on the various types of crashes we experience and the metadata associated with them. My decisions for message types and content are likely very suboptimal. Can you please brain dump to https://etherpad.mozilla.org/crash-log?
> I don't understand what the store expiration timer is for. Why is there an
> expiration at all? Is that because of memory usage?
Yes, it's because of memory usage. Chances are we load the crash data once near startup then never use it again. I don't like idle/useless memory sitting around like that. I like the TTL pattern.
>
> >+crashreporter.submit.decline.1
> >+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >+
> >+This event is produced by the crash reporter when the user declines to submit
> >+a crash. The crash reporter will likely delete the crash data files from disk.
>
> It will!? I'm pretty sure that the crash remains in the pending/ directory
> in this case.
I copied that from https://etherpad.mozilla.org/crash-log without validating it!
> Also this doc doesn't indicate whether this event is written by both the
> in-product crash reporting UI (plugin/content crash submission) or just the
> binary crashreporter. I think this is an important detail.
In theory, both can happen. Although, there are performance implications. I'm not sure if we want in-product store writing happening directly because patterns like plugins crashing every 60s could have significant performance implications compared to writing events files and aggregating much less frequently.
> Finally, it doesn't need to be in this patch, but we should probably limit
> the crash data by total count as well as the 180-day limit. There have been
> people who crash Flash hundreds of times a day (it basically crashes on
> plugin startup every time), and we want to apply a sane limit. Maybe 5000
> crashes total?
Sounds reasonable. I do worry about loss of data (to FHR) when a high water mark is reached. i.e. is the difference between 5000 and 50000 worth reporting? Could we just store a "high water mark reached" flag when we exceed the daily limit? What if lots of plugin crashes mask (more important) Gecko crashes or other important events?
Comment 51•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #49)
> .gecko seems confusing, can this be .chrome?
I wonder if .chrome is less confusing on B2G, where this would be when the "b2g" process (i.e. "the whole phone") crashes. In crash-stats, we call it "browser" process due to legacy, but I tend to call it "main process" nowadays and that would work cross-product.
Comment 52•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #49)
> Comment on attachment 8344322 [details] [diff] [review]
> >+This event is produced by the crash reporter when the user declines to submit
> >+a crash. The crash reporter will likely delete the crash data files from disk.
>
> It will!? I'm pretty sure that the crash remains in the pending/ directory
> in this case.
Actually, AFAIK, that comment might very well be correct, as I was told we delete crash reports even locally when people decline to submit them. I know for a fact we do that on B2G and IIRC when we fixed that on B2G to be like that, ted said this would be the correct thing as it would match what breakpad UI does.
Comment 53•11 years ago
|
||
(In reply to Gregory Szorc [:gps] (in southeast Asia until Dec 14) from comment #50)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #49)
> > It's not clear to me whether this will delete the submitted dumps after it
> > has loaded the submitted dumps directory. Will it (in this patch or a later
> > one)?
>
> I think eventually we'll get there. Maybe not in this patch set, but
> definitely a follow-up. The reason is, once the event timeline can be
> trusted as canonical, there should be no need for the "submitted" directory
> to exist at all - we can just have the crash submission mechanism write the
> appropriate event file or call into the "timeline API" from inside Gecko. I
> didn't want to bloat scope too much. But I do anticipate rewriting FHR,
> about:crashes, any others.
Note that right now, an |ls -l submitted/| is the only way to get your crash IDs at all. So when we end up deleting those, we better end up having a better way in place to get to those (bugs are filed but have not really been moving forward). But as you said, probably a different bug anyhow where this happens. ;-)
Reporter | ||
Comment 54•11 years ago
|
||
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #52)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #49)
> > Comment on attachment 8344322 [details] [diff] [review]
> > >+This event is produced by the crash reporter when the user declines to submit
> > >+a crash. The crash reporter will likely delete the crash data files from disk.
> >
> > It will!? I'm pretty sure that the crash remains in the pending/ directory
> > in this case.
>
>
> Actually, AFAIK, that comment might very well be correct, as I was told we
> delete crash reports even locally when people decline to submit them. I know
> for a fact we do that on B2G and IIRC when we fixed that on B2G to be like
> that, ted said this would be the correct thing as it would match what
> breakpad UI does.
If the user explicitly declines to submit a crash report (for example, if the browser crashes, the user does not have "Tell Mozilla about this crash" checked, and they click quit or restart in the crash reporter UI), then we'll delete the pending crash.
If the crash submission simply fails for some reason (network etc) we'll leave it in pending.
Plugin crashes are a little weird in that it's really easy for the user to not submit but not explicitly decline submission, so we sometimes wind up leaving them in pending.
Comment 55•11 years ago
|
||
Comment on attachment 8343749 [details] [diff] [review]
Part 6: Support events files in CrashManager
I feel like this patch and part 4 are versions of the same thing, or am I confused?
Attachment #8343749 -
Flags: feedback?(benjamin)
Assignee | ||
Updated•11 years ago
|
Attachment #8343749 -
Attachment is obsolete: true
Comment on attachment 8344319 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data
Review of attachment 8344319 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/crashes/CrashManager.jsm
@@ +6,5 @@
> +
> +const {utils: Cu} = Components;
> +
> +Cu.import("resource://gre/modules/osfile.jsm")
> +Cu.import("resource://gre/modules/Task.jsm");
Shouldn't we import into |this|, these days?
@@ +13,5 @@
> + "CrashManager",
> +];
> +
> +/**
> + * Provides an interface for crash data.
I'm not a big fan of this docstring. This makes the module look like a InstanceFactoryImplFactoryFactory, which it isn't :)
Any way to make it more friendly?
Also, so far, it's an interface for *accessing* crash data.
@@ +27,5 @@
> + * pendingDumpsDir (string)
> + * Where dump files that haven't been uploaded are located.
> + *
> + * submittedDumpsDir (string)
> + * Where records up uploaded dumps are located.
Could you make it clear that these properties (which are really labelled args) are required?
@@ +100,5 @@
> + this.SUBMITTED_REGEX);
> + },
> +
> + _getDirectoryEntries: function (path, re) {
> + return Task.spawn(function () {
Here too, I'd suggest function* and return.
@@ +106,5 @@
> + yield OS.File.stat(path);
> + } catch (ex if ex instanceof OS.File.Error) {
> + if (ex.becauseNoSuchFile) {
> + throw new Task.Result([]);
> + }
I'd go for
} catch (ex if ex instanceof OS.File.Error && ex.becauseNoSuchFile) {
Also, these days, it's probably simpler to make it a |function*()| and use |return| instead of |throw new Task.Result|.
@@ +119,5 @@
> + while (true) {
> + let entry;
> + try {
> + entry = yield it.next();
> + } catch (ex if ex == StopIteration) {
Unfortunately, this is going to break with bug 938704. I'd suggest calling it.forEach.
@@ +130,5 @@
> +
> + let match = re.exec(entry.name);
> + if (!match) {
> + continue;
> + }
You could also use a winPattern to optimize the code for Windows.
::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
@@ +21,5 @@
> + f.create(Ci.nsIFile.DIRECTORY_TYPE, dirMode);
> + }
> +
> + const dirMode = OS.Constants.libc.S_IRWXU;
> + let baseFile = do_get_tempdir();
Not a request, but if you want, OS.Constants.Path.tmpDir does the trick, too.
@@ +63,5 @@
> + file.lastModifiedTime = date.getTime();
> + dump("Created fake crash: " + file.path + "\n");
> +
> + return uuid;
> + };
Wouldn't it be possible to use the actual Crash Reporter to generate crashes? There are xpcshell tests that call it in a subprocess, so that would probably be a much better test, wouldn't it?
@@ +91,5 @@
> + let m = new CrashManager({
> + pendingDumpsDir: "/foo",
> + submittedDumpsDir: "/bar",
> + });
> + Assert.ok(m);
Assert? Where does this come from? I want some :)
@@ +94,5 @@
> + });
> + Assert.ok(m);
> +
> + run_next_test();
> +});
Nit: Please use add_task these days.
@@ +104,5 @@
> +
> + run_next_test();
> +});
> +
> +add_test(function test_get_manager() {
Could you document what each of your test ensures? Looking at the source code, I'm often unsure.
Attachment #8344319 -
Flags: review?(dteller) → feedback+
Assignee | ||
Comment 57•11 years ago
|
||
What is this function* you speak of? I haven't been writing JS for a few months and am behind on the times. Of course, it's not documented on https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/function or https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions_and_function_scope?redirectlocale=en-US&redirectslug=JavaScript%2FReference%2FFunctions_and_function_scope or https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Generator or https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Iterators_and_Generators or any other page I looked at. The best I found was http://bjouhier.wordpress.com/2013/06/01/bringing-asyncawait-to-life-in-javascript/, which doesn't exactly inspire confidence that SM supports it to the point it is usable in production code.
Generators have now been standardized and should now be written with |function*|, see http://dxr.mozilla.org/mozilla-central/source/toolkit/modules/Task.jsm .
Updated•11 years ago
|
Attachment #8344322 -
Flags: feedback?(vdjeric) → feedback?(dteller)
Comment on attachment 8344322 [details] [diff] [review]
Part 4: Add Support for crash event files to CrashManager
Review of attachment 8344322 [details] [diff] [review]:
-----------------------------------------------------------------
Overall, looks good to me, if a bit lacking on in-code documentation.
::: toolkit/components/crashes/CrashManager.jsm
@@ +48,2 @@
> */
> this.CrashManager = function (options) {
Wouldn't it be simpler to store a shallow copy of |options| as |this._options| and use this everywhere?
@@ +79,5 @@
> default:
> throw new Error("Unknown property in options: " + k);
> }
> }
> +
Could you please document these fields?
@@ +105,5 @@
> +
> + EVENT_FILE_SUCCESS: "ok",
> + EVENT_FILE_ERROR_IO: "io",
> + EVENT_FILE_ERROR_MALFORMED: "malformed",
> + EVENT_FILE_ERROR_UNKNOWN_EVENT: "unknown-event",
These look like global constants rather than prototype fodder. Do you have any reason to put these things in the prototype?
@@ +166,5 @@
> + }.bind(this));
> + },
> +
> + /**
> + * Aggregates "loose" events files into the unified "database."
Could you document the effect on files?
@@ +177,5 @@
> + aggregateEventsFiles: function () {
> + return Task.spawn(function () {
> + if (this._aggregateInProgress) {
> + throw new Error("Event aggregation already in progress. Try again " +
> + "later.");
Shouldn't you return the promise of the aggregation in progress?
@@ +198,5 @@
> + case this.EVENT_FILE_SUCCESS:
> + needsSave = true;
> + // Fall through.
> +
> + case this.EVENT_FILE_ERROR_MALFORMED:
Looks like it should be an exception.
@@ +304,5 @@
> +
> + let decoder = new TextDecoder();
> + data = decoder.decode(data);
> +
> + let sepIndex = data.indexOf("\n");
Please reference crash-event.rst from a comment.
@@ +427,5 @@
> +
> + this._storeTimer.cancel();
> +
> + let timerCB = function () {
> + if (this._storeProtected) {
Please document both _storeProtected and the role if this callback.
@@ +475,5 @@
> + this._storeDir = storeDir;
> + this._telemetryKey = telemetryKey;
> +
> + this._storePath = OS.Path.join(storeDir, "store.json");
> + this._data = null;
Could you document all these fields/args?
@@ +480,5 @@
> +}
> +
> +CrashStore.prototype = Object.freeze({
> + TYPE_GECKO: 1,
> + TYPE_PLUGIN: 2,
As above, this doesn't look like it should be in the prototype.
@@ +502,5 @@
> + if (data.corruptDate) {
> + this._data.corruptDate = new Date(data.corruptDate);
> + }
> +
> + for (let [id, crash] of Iterator(data.crashes)) {
I hear that |Iterator| is deprecated and that the way to go is now Object.keys.
@@ +511,5 @@
> + }
> +
> + this._data.crashes[id] = unnormalized;
> + }
> + } catch (ex if ex instanceof OS.File.Error && ex.becauseNoSuchFile) {
// Ignore
(I assume)
@@ +518,5 @@
> + // and swallow the error.
> + //
> + // The marking of a corrupted file is intentionally not persisted to
> + // disk yet. Instead, we wait until the next save(). This is to give
> + // non-permanent failures the opportunity to recover on their own.
What kind of failures are expected?
@@ +553,5 @@
> + normalized.crashes[id] = c;
> + }
> +
> + let method = this._storePath.endsWith(".gz") ? "writeGzippedFile"
> + : "writeFile";
As mentioned in the other patch, I suspect that we should normalize upon lz4 for the time being.
@@ +600,5 @@
> +
> + return normalized;
> + },
> +
> + _unnormalize: function (o) {
That's a very weird method name. It really deserves some documentation.
@@ +627,5 @@
> + this._data.crashes[id].type = this.TYPE_GECKO;
> + },
> +
> + /**
> + * Record the declined submission of a particular crash.
You should explain what this means.
@@ +764,5 @@
> +/**
> + * Represents an individual crash with metadata.
> + *
> + * This is a wrapper around the low-level anonymous JS objects that define
> + * crashes. It exposes a consistent and helpful API.
Please document the argument.
@@ +767,5 @@
> + * This is a wrapper around the low-level anonymous JS objects that define
> + * crashes. It exposes a consistent and helpful API.
> + */
> +function CrashRecord(o) {
> + this._o = o;
Nice smiley :)
However, _o is a very awkward name.
::: toolkit/components/crashes/CrashService.js
@@ +8,5 @@
>
> Cu.import("resource://gre/modules/Services.jsm");
> Cu.import("resource://gre/modules/XPCOMUtils.jsm");
>
> +const AGGREGATE_STARTUP_DELAY_MS = 60000;
Please document.
Also, I realize that the maintenance tasks will probably be quite fast in most cases, but since we're delaying them, I'd rather we put stuff in idle or on idle-daily, rather than after a completely arbitrary amount of time.
::: toolkit/components/crashes/crash-events.rst
@@ +7,5 @@
> +
> +When an event worthy of recording occurs, a file containing that event's
> +information is written to a well-defined location on the filesystem. The Gecko
> +process periodically scans for produced files and consolidates information
> +into a more unified and effecient filesystem format.
"efficient"
@@ +22,5 @@
> +The filename of the event file is not relevant. However, producers need
> +to choose a filename intelligently to avoid name collisions and race
> +conditions. Since file locking is potentially dangerous at crash time,
> +the convention of generating a UUID and using it as a filename has been
> +adopted.
If data can be in UAppData, I believe that we should insist upon a recognizable file extension (say .mozcrash).
@@ +53,5 @@
> +
> +crash.gecko.1
> +^^^^^^^^^^^^^
> +
> +This event is produced when the Gecko process crashes.
In the context of e10s, does this include both the parent process and children processes?
@@ +57,5 @@
> +This event is produced when the Gecko process crashes.
> +
> +The payload of this event is a single line: the UUID of the crash that was
> +produced and saved to disk by Breakpad. There should exist *<UUID>*.dmp and
> +*<UUID>*.extra files on disk.
Where?
@@ +80,5 @@
> +crashreporter.submit.attempt.1
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +This event is produced by the crash reporter when it attempts to upload a
> +crash (likely to Mozilla). Crashes may also be uploaded in product. This
Nit: "in product" might not be clear for all readers.
@@ +83,5 @@
> +This event is produced by the crash reporter when it attempts to upload a
> +crash (likely to Mozilla). Crashes may also be uploaded in product. This
> +event is only meant for use by the crash reporter.
> +
> +The payload consists of 3 lines::
Single ":", I guess.
@@ +93,5 @@
> +The submission ID is an ID that uniquely identifies this submission attempt.
> +It is used to correlate results against the request that initiated them.
> +
> +Time of submission is the number of seconds since UNIX epoch expressed as
> +an integer.
Number of seconds as an integer? That's unusual. Why not the result of Date.now()?
::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
@@ +377,5 @@
> + yield m.aggregateEventsFiles();
> +
> + let crashes = yield m.getCrashes();
> + Assert.equal(crashes.length, 1);
> + let c = crashes[0];
Despite the comment, I don't understand what you have tested here.
::: toolkit/components/crashes/tests/xpcshell/test_crash_service.js
@@ +32,5 @@
>
> run_next_test();
> });
> +
> +add_task(function test_aggregate_events_files() {
What does this test, exactly?
::: toolkit/components/crashes/tests/xpcshell/test_crash_store.js
@@ +34,5 @@
> + let s = new CrashStore("/some/path");
> + Assert.ok(s instanceof CrashStore);
> +
> + run_next_test();
> +});
Is there a reason to combine add_test and add_task?
Attachment #8344322 -
Flags: feedback?(dteller)
Attachment #8344322 -
Flags: feedback+
Reporter | ||
Comment 60•11 years ago
|
||
Comment on attachment 8344319 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data
Review of attachment 8344319 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/crashes/CrashManager.jsm
@@ +27,5 @@
> + * pendingDumpsDir (string)
> + * Where dump files that haven't been uploaded are located.
> + *
> + * submittedDumpsDir (string)
> + * Where records up uploaded dumps are located.
I assume this was meant to read "Where records of uploaded dumps are located."
@@ +99,5 @@
> + return this._getDirectoryEntries(this._submittedDumpsDir,
> + this.SUBMITTED_REGEX);
> + },
> +
> + _getDirectoryEntries: function (path, re) {
None of these functions mentions that the array of objects is sorted by date, FWIW.
::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
@@ +21,5 @@
> + f.create(Ci.nsIFile.DIRECTORY_TYPE, dirMode);
> + }
> +
> + const dirMode = OS.Constants.libc.S_IRWXU;
> + let baseFile = do_get_tempdir();
Actually do_get_tempdir is a directory that the Python harness will clean up after the test is run, so it's better suited for use in an xpcshell test.
@@ +63,5 @@
> + file.lastModifiedTime = date.getTime();
> + dump("Created fake crash: " + file.path + "\n");
> +
> + return uuid;
> + };
It's possible, but I don't think it adds much to this test.
@@ +91,5 @@
> + let m = new CrashManager({
> + pendingDumpsDir: "/foo",
> + submittedDumpsDir: "/bar",
> + });
> + Assert.ok(m);
This was bug 873126 FYI.
Attachment #8344319 -
Flags: review?(ted) → review+
Reporter | ||
Comment 61•11 years ago
|
||
Comment on attachment 8344320 [details] [diff] [review]
Part 3: XPCOM service for managing crash data
Review of attachment 8344320 [details] [diff] [review]:
-----------------------------------------------------------------
I wish we had a standard way for defining chrome interfaces in WebIDL. :-/
::: toolkit/components/crashes/CrashService.js
@@ +34,5 @@
> + reportsDir.append("Crash Reports");
> + let pendingDir = reportsDir.clone();
> + let submittedDir = reportsDir.clone();
> + pendingDir.append("pending");
> + submittedDir.append("submitted");
I wish we had a better JS API for path management. Do we not have that in OS.File yet?
::: toolkit/components/crashes/nsICrashes.idl
@@ +45,5 @@
> + * - path
> + * - date
> + *
> + * This interface could be changed to expose richer types if consumers
> + * require it. Please file a bug.
This comment doesn't really need to be here.
Attachment #8344320 -
Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #61)
> Comment on attachment 8344320 [details] [diff] [review]
> Part 3: XPCOM service for managing crash data
>
> Review of attachment 8344320 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> I wish we had a standard way for defining chrome interfaces in WebIDL. :-/
>
> ::: toolkit/components/crashes/CrashService.js
> @@ +34,5 @@
> > + reportsDir.append("Crash Reports");
> > + let pendingDir = reportsDir.clone();
> > + let submittedDir = reportsDir.clone();
> > + pendingDir.append("pending");
> > + submittedDir.append("submitted");
>
> I wish we had a better JS API for path management. Do we not have that in
> OS.File yet?
We do.
Hint, to make it readable:
let Path = OS.Path;
let Paths = OS.Constants.Path;
Assignee | ||
Comment 63•11 years ago
|
||
Incorporated review feedback. The plugin and child process crash section
could be improved. Perfect is the enemy of done. I'd much prefer to
commit something and let a subject matter expert fill it in later.
Attachment #8356347 -
Flags: review?(benjamin)
Assignee | ||
Comment 64•11 years ago
|
||
function* is now used. Review feedback incorporated (except for the
comments I didn't grok).
Attachment #8356364 -
Flags: review?(ted)
Assignee | ||
Updated•11 years ago
|
Attachment #8344319 -
Attachment is obsolete: true
Assignee | ||
Comment 65•11 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #60)
> ::: toolkit/components/crashes/tests/xpcshell/test_crash_manager.js
> @@ +21,5 @@
> > + f.create(Ci.nsIFile.DIRECTORY_TYPE, dirMode);
> > + }
> > +
> > + const dirMode = OS.Constants.libc.S_IRWXU;
> > + let baseFile = do_get_tempdir();
>
> Actually do_get_tempdir is a directory that the Python harness will clean up
> after the test is run, so it's better suited for use in an xpcshell test.
I don't parse this comment.
> @@ +63,5 @@
> > + file.lastModifiedTime = date.getTime();
> > + dump("Created fake crash: " + file.path + "\n");
> > +
> > + return uuid;
> > + };
>
> It's possible, but I don't think it adds much to this test.
*scratches head*
I'll resubmit updated patch so you have a chance to clarify.
Assignee | ||
Updated•11 years ago
|
Attachment #833354 -
Attachment is obsolete: true
Assignee | ||
Comment 66•11 years ago
|
||
Comment on attachment 8356364 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data
I think I missed Yoric's comment. Yet another reminder how crappy Bugzilla is for code review.
I may very well submit these reviews on ReviewBoard...
Attachment #8356364 -
Flags: review?(ted)
Reporter | ||
Comment 67•11 years ago
|
||
Sorry, my comments were in reply to comments Yoric left. That's totally unclear. :-(
Assignee | ||
Comment 68•11 years ago
|
||
If anyone feels like using a review tool with better UX than bugzilla:
https://reviewboard.allizom.org/r/21/
https://reviewboard.allizom.org/r/22/
Comment 69•11 years ago
|
||
Comment on attachment 8356347 [details] [diff] [review]
Part 1: Document existing crash reporter behavior
I might tweak it later, but this is a good summary.
Attachment #8356347 -
Flags: review?(benjamin) → review+
Assignee | ||
Comment 70•11 years ago
|
||
Whiteboard: [leave open]
Assignee | ||
Comment 71•11 years ago
|
||
Incorporated Yoric's feedback.
Also up for review at https://reviewboard.allizom.org/r/22/. I highly
prefer the review be conducted there. I think you will too :)
Attachment #8357318 -
Flags: review?(ted)
Attachment #8357318 -
Flags: review?(dteller)
Assignee | ||
Updated•11 years ago
|
Attachment #8356364 -
Attachment is obsolete: true
Comment 72•11 years ago
|
||
Comment on attachment 8357318 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data;
Review of attachment 8357318 [details] [diff] [review]:
-----------------------------------------------------------------
Side-note: could you add version numbers to your patches when you upload them?
Reviewed on https://reviewboard.allizom.org/r/22/
Attachment #8357318 -
Flags: review?(dteller) → review+
Assignee | ||
Comment 74•11 years ago
|
||
> Side-note: could you add version numbers to your patches when you upload
> them?
Sorry, I usually do. FWIW, ReviewBoard creates the version numbers automatically. Another reason why ReviewBoard is superior to Bugzilla.
> Reviewed on https://reviewboard.allizom.org/r/22/
Thanks. I also noticed ReviewBoard isn't sending notification emails. I filed bug 958236.
Assignee | ||
Comment 75•11 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #61)
> Comment on attachment 8344320 [details] [diff] [review]
> Part 3: XPCOM service for managing crash data
> ::: toolkit/components/crashes/CrashService.js
> @@ +34,5 @@
> > + reportsDir.append("Crash Reports");
> > + let pendingDir = reportsDir.clone();
> > + let submittedDir = reportsDir.clone();
> > + pendingDir.append("pending");
> > + submittedDir.append("submitted");
>
> I wish we had a better JS API for path management. Do we not have that in
> OS.File yet?
I'd need to patch OS.Path.Constants to provide access to UAppData (XRE_USER_APP_DATA_DIR).
I looked at dom/system/OSFileConstants.cpp, but I'm not sure what all is involved. Perhaps Yoric can help...
(In reply to Gregory Szorc [:gps] from comment #75)
> I'd need to patch OS.Path.Constants to provide access to UAppData
> (XRE_USER_APP_DATA_DIR).
>
> I looked at dom/system/OSFileConstants.cpp, but I'm not sure what all is
> involved. Perhaps Yoric can help...
Sure. File a big and I'll do it (or mentor it).
Assignee | ||
Comment 77•11 years ago
|
||
Attempting to upload an attachment that redirects to ReviewBoard...
ted has already granted r+.
Attachment #8344320 -
Attachment is obsolete: true
Attachment #8358101 -
Flags: review?(benjamin)
Assignee | ||
Comment 78•11 years ago
|
||
Holy crap, I can't believe the ReviewBoard auto redirect actually worked! Rich MIME type too. I owe someone a beer.
Attachment #8344322 -
Attachment is obsolete: true
Attachment #8344322 -
Flags: review?(ted)
Assignee | ||
Comment 79•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #49)
> Comment on attachment 8344322 [details] [diff] [review]
> Part 4: Add Support for crash event files to CrashManager
> >+ _processEventFile: function (entry) {
> >+ return Task.spawn(function () {
> >+ let data;
> >+ try {
> >+ data = yield OS.File.read(entry.path);
> >+ } catch (ex) {
> >+ // TODO log
> >+ throw new Task.Result(this.EVENT_FILE_ERROR_IO);
> >+ }
> >+
> >+ let store = yield this._getStore();
> >+
> >+ let decoder = new TextDecoder();
> >+ data = decoder.decode(data);
> >+
> >+ let sepIndex = data.indexOf("\n");
> >+ if (sepIndex == -1) {
> >+ throw new Task.Result(this.EVENT_FILE_ERROR_MALFORMED);
> >+ }
> >+
> >+ let type = data.substring(0, sepIndex);
> >+ let payload = data.substring(sepIndex + 1);
> >+
> >+ if (type == "crash.gecko.1" || type == "crash.plugin.1") {
>
> .gecko seems confusing, can this be .chrome? This seems to be missing
> content-process crashes; these are already being used for thumbnails. I
> suspect that we should paramaterize this rather than having separate
> .addGeckoCrash and .addPluginCrash methods.
Good point.
Nit pick: Is "chrome" the best choice? "chrome" to me implies the main UI process. That just happens to be the "main" (parent) application process today. Will this assumption always hold? Should we just call it "main?"
> Also how does this handle plugin hangs?
It doesn't, yet. That's mainly because I'm not sure how plugin/subprocess hangs/crashes work.
> I don't understand what the store expiration timer is for. Why is there an
> expiration at all? Is that because of memory usage?
Yes. No sense keeping unused resources allocated! I'm not sure if this will actually impact the bottom line in practice. But at least I won't be contributing to the problem!
> >diff --git a/toolkit/components/crashes/crash-events.rst b/toolkit/components/crashes/crash-events.rst
>
> >+crash.gecko.1
>
> As noted above, I'm wary of having separate events for crash.gecko and
> crash.plugin. Also it seems to me that what we really want is to include all
> of the crash metadata in the event file: this will naturally include the
> ProcessType= annotation as well as any Hang= annotation.
I was kinda hoping to avoid processing the annotations at this point. If this is inevitable for the initial release, I guess I'll just bite the bullet.
> Finally, it doesn't need to be in this patch, but we should probably limit
> the crash data by total count as well as the 180-day limit. There have been
> people who crash Flash hundreds of times a day (it basically crashes on
> plugin startup every time), and we want to apply a sane limit. Maybe 5000
> crashes total?
Agreed. Unbound data structures and performance don't go well together.
I worry about the impact of a naive chomping algorithm on data reporting. Perhaps we should have separate ceilings for plugins and main process crashes so we don't lose the (presumably) more important data on main process crashes.
We can store a high water mark marker to indicate excessive crashes occurred on a given day.
Assignee | ||
Comment 80•11 years ago
|
||
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #59)
> Comment on attachment 8344322 [details] [diff] [review]
> Part 4: Add Support for crash event files to CrashManager
> ::: toolkit/components/crashes/CrashManager.jsm
> @@ +48,2 @@
> > */
> > this.CrashManager = function (options) {
>
> Wouldn't it be simpler to store a shallow copy of |options| as
> |this._options| and use this everywhere?
Yes. But then you have to type more at use sites. Plus you involve an extra property lookup. Meh.
> @@ +79,5 @@
> > default:
> > throw new Error("Unknown property in options: " + k);
> > }
> > }
> > +
>
> Could you please document these fields?
They are documented above.
> @@ +105,5 @@
> > +
> > + EVENT_FILE_SUCCESS: "ok",
> > + EVENT_FILE_ERROR_IO: "io",
> > + EVENT_FILE_ERROR_MALFORMED: "malformed",
> > + EVENT_FILE_ERROR_UNKNOWN_EVENT: "unknown-event",
>
> These look like global constants rather than prototype fodder. Do you have
> any reason to put these things in the prototype?
As stated on an earlier review, I like attaching things to the logical closest scope. That's how lots of programming languages work (globals are evil) and it makes refactoring easier since the only external symbols are typically imports. Just my style pref.
> @@ +177,5 @@
> > + aggregateEventsFiles: function () {
> > + return Task.spawn(function () {
> > + if (this._aggregateInProgress) {
> > + throw new Error("Event aggregation already in progress. Try again " +
> > + "later.");
>
> Shouldn't you return the promise of the aggregation in progress?
Good idea! Will fix.
> @@ +553,5 @@
> > + normalized.crashes[id] = c;
> > + }
> > +
> > + let method = this._storePath.endsWith(".gz") ? "writeGzippedFile"
> > + : "writeFile";
>
> As mentioned in the other patch, I suspect that we should normalize upon lz4
> for the time being.
I'm fine with that. What's the timetable for landing lz4?
> ::: toolkit/components/crashes/CrashService.js
> @@ +8,5 @@
> >
> > Cu.import("resource://gre/modules/Services.jsm");
> > Cu.import("resource://gre/modules/XPCOMUtils.jsm");
> >
> > +const AGGREGATE_STARTUP_DELAY_MS = 60000;
>
> Please document.
> Also, I realize that the maintenance tasks will probably be quite fast in
> most cases, but since we're delaying them, I'd rather we put stuff in idle
> or on idle-daily, rather than after a completely arbitrary amount of time.
In Sync land, we learned the hard way to not trust idle-daily because it apparently doesn't fire in a lot of our users. I believe sometimes it fires many signals all at once! See bug 845558 where I explained the problem and requested Telemetry to assess the scope of the problem. In the mean time, I'm very wary about relying on idle anything for important things.
> @@ +22,5 @@
> > +The filename of the event file is not relevant. However, producers need
> > +to choose a filename intelligently to avoid name collisions and race
> > +conditions. Since file locking is potentially dangerous at crash time,
> > +the convention of generating a UUID and using it as a filename has been
> > +adopted.
>
> If data can be in UAppData, I believe that we should insist upon a
> recognizable file extension (say .mozcrash).
Why? The directory name already denotes the content as belonging to the Gecko app.
> @@ +53,5 @@
> > +
> > +crash.gecko.1
> > +^^^^^^^^^^^^^
> > +
> > +This event is produced when the Gecko process crashes.
>
> In the context of e10s, does this include both the parent process and
> children processes?
I have no clue since I know little about e10n. See earlier reply to bsmedberg's review, where I suggest we call crashes in the main process as "main."
> @@ +83,5 @@
> > +This event is produced by the crash reporter when it attempts to upload a
> > +crash (likely to Mozilla). Crashes may also be uploaded in product. This
> > +event is only meant for use by the crash reporter.
> > +
> > +The payload consists of 3 lines::
>
> Single ":", I guess.
No, this is restructured text. The double colon tells it that preformatted text follows.
>
> @@ +93,5 @@
> > +The submission ID is an ID that uniquely identifies this submission attempt.
> > +It is used to correlate results against the request that initiated them.
> > +
> > +Time of submission is the number of seconds since UNIX epoch expressed as
> > +an integer.
>
> Number of seconds as an integer? That's unusual. Why not the result of
> Date.now()?
Because we don't need millisecond precision IMO.
> ::: toolkit/components/crashes/tests/xpcshell/test_crash_store.js
> @@ +34,5 @@
> > + let s = new CrashStore("/some/path");
> > + Assert.ok(s instanceof CrashStore);
> > +
> > + run_next_test();
> > +});
>
> Is there a reason to combine add_test and add_task?
Originally, it was because non-task tests didn't need to be executed as tasks. I suppose it really doesn't matter. I like using tasks because they can't hang as easily as add_test() can. I'll probably just switch to add_task() for everything from now on.
Comment 81•11 years ago
|
||
I really want to insist that we call all crash reporting events "crash.1" and then use something like ProcessType to distinguish between main-process/plugin-process/content-process.
Assignee | ||
Comment 82•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #81)
> I really want to insist that we call all crash reporting events "crash.1"
> and then use something like ProcessType to distinguish between
> main-process/plugin-process/content-process.
What about hangs? I'm still unclear on every scenario where we create a minidump. Should we even refer to "minidumps" as "crashes?"
Perhaps you could answer by updating the in-tree docs. I'd like to see these things documented where others will find them later. If you don't update the docs, I will :)
Comment 83•11 years ago
|
||
Attachment #8358567 -
Flags: review?(gps)
Updated•11 years ago
|
Attachment #8358101 -
Flags: review?(benjamin)
Assignee | ||
Comment 84•11 years ago
|
||
Comment on attachment 8358567 [details] [diff] [review]
crash-docs
Review of attachment 8358567 [details] [diff] [review]:
-----------------------------------------------------------------
This looks fine. But the patch is rather small. Did you upload everything you intended to?
Attachment #8358567 -
Flags: review?(gps) → review+
Comment 85•11 years ago
|
||
Yes, I was just writing the docs that I thought would be helpful in terms of plugin and content crash flow. If you have other stuff you'd like me to write, let me know.
Updated•11 years ago
|
Priority: -- → P1
Comment 86•11 years ago
|
||
Assignee | ||
Comment 87•11 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #84)
> Comment on attachment 8358567 [details] [diff] [review]
> crash-docs
>
> Review of attachment 8358567 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> This looks fine. But the patch is rather small. Did you upload everything
> you intended to?
Oh, turns out Splinter wasn't formatting the patch properly! My browser is only showing a net addition of ~2 lines but the raw patch contained much more. Sadness.
Reporter | ||
Comment 88•11 years ago
|
||
Comment on attachment 8357318 [details] [diff] [review]
Part 2: Create CrashManager API for managing crash data;
Review of attachment 8357318 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, sorry for the delay.
Attachment #8357318 -
Flags: review?(ted) → review+
Assignee | ||
Comment 89•11 years ago
|
||
Comment on attachment 8358101 [details]
Part 3: XPCOM Service for managing crash data
Latest revision on ReviewBoard incorporates review feedback.
Attachment #8358101 -
Flags: review?(benjamin)
Assignee | ||
Comment 90•11 years ago
|
||
Assignee | ||
Comment 91•11 years ago
|
||
Comment on attachment 8358101 [details]
Part 3: XPCOM Service for managing crash data
Review was granted on ReviewBoard.
Attachment #8358101 -
Flags: review?(benjamin) → review+
Assignee | ||
Comment 92•11 years ago
|
||
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager
I paired the patch down to just the basic crash event files implementation. Hopefully we can focus on that. We can bikeshed event specifics later (in the next day or two).
Attachment #8358103 -
Flags: review?(dteller)
Attachment #8358103 -
Flags: review?(benjamin)
Comment 93•11 years ago
|
||
Comment 94•11 years ago
|
||
And a follow-up to hopefully un-bust comm-central.
https://hg.mozilla.org/comm-central/rev/1e1da3a4dac3
Assignee | ||
Comment 95•11 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #94)
> And a follow-up to hopefully un-bust comm-central.
> https://hg.mozilla.org/comm-central/rev/1e1da3a4dac3
Thanks, Ryan.
Updated•11 years ago
|
Attachment #8358103 -
Flags: review?(dteller) → feedback+
Updated•11 years ago
|
Attachment #8358103 -
Flags: review?(benjamin)
Assignee | ||
Comment 96•11 years ago
|
||
Attachment #8368279 -
Flags: review?(benjamin)
Assignee | ||
Comment 97•11 years ago
|
||
Attachment #8368280 -
Flags: review?(benjamin)
Assignee | ||
Comment 98•11 years ago
|
||
Attachment #8368282 -
Flags: review?(benjamin)
Assignee | ||
Comment 99•11 years ago
|
||
Attachment #8368319 -
Flags: review?(rnewman)
Attachment #8368319 -
Flags: review?(benjamin)
Assignee | ||
Comment 100•11 years ago
|
||
And with that, I think the JS is feature complete. Now, time for the C++.
Updated•11 years ago
|
Attachment #8368280 -
Flags: review?(benjamin)
Updated•11 years ago
|
Attachment #8368282 -
Flags: review?(benjamin)
Updated•11 years ago
|
Attachment #8368319 -
Flags: review?(benjamin)
Assignee | ||
Comment 101•11 years ago
|
||
Pushed a review fixup for part 3:
https://hg.mozilla.org/integration/mozilla-inbound/rev/09013fef24b1
Assignee | ||
Comment 102•11 years ago
|
||
Attachment #8369147 -
Flags: feedback?(ted)
Attachment #8369147 -
Flags: feedback?(benjamin)
Comment 103•11 years ago
|
||
Updated•11 years ago
|
Attachment #8368279 -
Flags: review?(benjamin)
Updated•11 years ago
|
Attachment #8369147 -
Flags: feedback?(benjamin)
Comment 104•11 years ago
|
||
I think I've got all the reviews you needed from me. Please let me know in the flags if there is something that needs re-review or that I missed.
Assignee | ||
Updated•11 years ago
|
Attachment #8369147 -
Flags: feedback?(ted)
Assignee | ||
Comment 105•11 years ago
|
||
Looks like the latter patches are going to miss the 29 Aurora train. I'm flagging this bug to target 29 for uplift. This bug will give us significantly better FHR data on crashes in Firefox. The old mechanism was prone to double counting and incomplete data. Hopefully we can get the entire patch set uplifted to Aurora in the next week.
tracking-firefox29:
--- → ?
Updated•11 years ago
|
Comment 106•11 years ago
|
||
Not sure that we need to track this I spoke with :gps on IRC and advised he could request uplift.
Assignee | ||
Comment 107•11 years ago
|
||
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager
Feedback applied. Requesting re-review.
Attachment #8358103 -
Flags: review?(dteller)
Assignee | ||
Updated•11 years ago
|
Attachment #8368282 -
Flags: review?(dteller)
Updated•11 years ago
|
Attachment #8358103 -
Flags: review?(dteller) → review+
Assignee | ||
Comment 108•11 years ago
|
||
Assignee | ||
Comment 109•11 years ago
|
||
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager
This bug tracks an overhaul of how Gecko tracks crashes. The goal of this feature is to enable Firefox Health Report to report more accurate crash data. This data will be extremely useful to Mozilla and will directly help the stability efforts. It therefore will be beneficial to our users.
The sooner we can get this code to the release channel, the better.
The feature consists of a new XPCOM component that aggregates and stores crash data. This feature is self-contained. Initially, FHR will be the only consumer. Although, about:crashes and other "health" browser features will likely be built on top of it.
I believe the risk of the patch set should be low. The highest risk is that latter patches in this series break crash reporting. But, that should be noticeable via dropped crash volume on Soccoro and via FHR numbers changing.
I highly doubt we'd need to back out the new XPCOM service. But if we did, it is self-contained within a single directory, which we could easily delete.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): New feature for better crash reporting
User impact if declined: FHR crash reporting will continue to be bad.
Testing completed (on m-c, etc.): Unit tests. Feature in process of landing.
Risk to taking this patch (and alternatives if risky): Self-isolated component.
String or IDL/UUID changes made by this patch: None
Attachment #8358103 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 110•11 years ago
|
||
Comment on attachment 8368279 [details]
Part 5: Document crash event types
See comment #109 for more uplift context.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Crash logging
User impact if declined: Bad crash data
Testing completed (on m-c, etc.): This is a docs-only change and is NPOTB.
Risk to taking this patch (and alternatives if risky): None
String or IDL/UUID changes made by this patch: None
Attachment #8368279 -
Flags: review+
Attachment #8368279 -
Flags: approval-mozilla-aurora?
(In reply to Gregory Szorc [:gps] from comment #109)
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): New feature for better crash
> reporting
> User impact if declined: FHR crash reporting will continue to be bad.
> Testing completed (on m-c, etc.): Unit tests. Feature in process of landing.
> Risk to taking this patch (and alternatives if risky): Self-isolated
> component.
> String or IDL/UUID changes made by this patch: None
Bug 952335 might need to be uplifted, too.
Depends on: 952335
Assignee | ||
Comment 112•11 years ago
|
||
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #111)
> Bug 952335 might need to be uplifted, too.
Thanks for the awareness! I requested it to be uplifted.
Comment 113•11 years ago
|
||
We should probably get QA to do some verification of this first before we request uplift.
Comment 114•11 years ago
|
||
Updated•11 years ago
|
Updated•11 years ago
|
status-firefox29:
--- → affected
Assignee | ||
Comment 115•11 years ago
|
||
Comment on attachment 8368280 [details]
Part 6: Implement support for initial crash events
This needs re-review since r+'d patch has been updated significantly.
Attachment #8368280 -
Flags: review?(benjamin)
Updated•11 years ago
|
Attachment #8368280 -
Flags: review?(benjamin)
Assignee | ||
Comment 116•11 years ago
|
||
Comment 117•11 years ago
|
||
gps: does Part X still need review from me? I can't tell from the 116 comments and 10 active attachments :P
Updated•11 years ago
|
Attachment #8368279 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Updated•11 years ago
|
Attachment #8358103 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 118•11 years ago
|
||
Comment 119•11 years ago
|
||
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager
Too early
Attachment #8358103 -
Flags: approval-mozilla-aurora+ → approval-mozilla-aurora?
Updated•11 years ago
|
Attachment #8368279 -
Flags: approval-mozilla-aurora+ → approval-mozilla-aurora-
Updated•11 years ago
|
Attachment #8368279 -
Flags: approval-mozilla-aurora- → approval-mozilla-aurora?
Assignee | ||
Comment 120•11 years ago
|
||
(In reply to Richard Newman [:rnewman] from comment #117)
> gps: does Part X still need review from me? I can't tell from the 116
> comments and 10 active attachments :P
Yes it does. (Reducing clutter is one reason I prefer an external review tool.)
Updated•11 years ago
|
Attachment #8368282 -
Flags: review?(dteller) → review+
Assignee | ||
Comment 121•11 years ago
|
||
Comment 122•11 years ago
|
||
Flags: in-testsuite+
Assignee | ||
Comment 123•11 years ago
|
||
Comment on attachment 8369147 [details]
Part 8: Write main process crash events files
I think I got things working on Linux and Windows. Time to request formal review. I expect many review comments.
Attachment #8369147 -
Flags: review?(benjamin)
Assignee | ||
Comment 124•11 years ago
|
||
I spent a large portion of my time on Friday trying to sort out a Windows debug only failure with part 8 (seen also at https://tbpl.mozilla.org/?tree=Try&rev=9f47be8a7faa).
When I run the xpcshell test (read: the code that produces event files) locally, the window about a "program encountering an error... would you like to abort, retry, ignore" comes up. But I'm not sure why. No matter what I do at that prompt, the test fails.
I've tried starting the xpcshell.exe process, attaching the Visual Studio debugger, then resuming the test, but I get the same behavior.
I've banged my head against this for far too long. I suspect the issue is something silly, but I'm not sure what. I tried commenting out whole sections of my code to isolate the problem, but I can't seem to isolate it. I'm not sure if my code is just doing something silly, if the failure in automation is the same issue I'm seeing or what. I suspect/hope the issue will get teased out during code review.
Comment 125•11 years ago
|
||
The way do_crash works is it launches another xpcshell process and crashes *that* process. So I expect that you are seeing the error dialog from that process and not the main xpcshell process running the test.
That dialog is expected when the crash reporter is not turned on properly, or when the crash handler itself crashes. I just posted the review of part 8 and I expect that the non-nullterminated id_ascii is causing the crash handler itself to crash.
Updated•11 years ago
|
Attachment #8369147 -
Flags: review?(benjamin)
Comment 126•11 years ago
|
||
Don't forget that if you want uplift for aurora for all patches, it should be requested soon.
If it going to be harder to uplift them in beta.
Comment 127•11 years ago
|
||
Comment on attachment 8368319 [details]
Part 9: Change FHR provider to pull from crash manager
Three small issues on ReviewBoard.
Attachment #8368319 -
Flags: review?(rnewman) → review+
Assignee | ||
Comment 128•11 years ago
|
||
Comment on attachment 8369147 [details]
Part 8: Write main process crash events files
New version is posted on RB and passes Try \o/
Attachment #8369147 -
Flags: review?(benjamin)
Assignee | ||
Comment 129•11 years ago
|
||
(In reply to Richard Newman [:rnewman] from comment #127)
> Comment on attachment 8368319 [details]
> Part X: Change FHR provider to pull from crash manager
>
> Three small issues on ReviewBoard.
Did you forget to post your review? I'm not seeing any new content on RB...
Flags: needinfo?(rnewman)
Comment 130•11 years ago
|
||
Doesn't show as pending to me:
https://reviewboard.allizom.org/r/30/#
Posted 2 weeks ago (Feb. 11, 2014, 7:08 p.m.)
Flags: needinfo?(rnewman)
Updated•11 years ago
|
Attachment #8369147 -
Flags: review?(benjamin)
Assignee | ||
Comment 131•11 years ago
|
||
Part 8 (finally):
https://hg.mozilla.org/integration/mozilla-inbound/rev/de17e5deb35d
With nits addressed and Vidyo confirmation from bsmedberg that landing this before plugin events are recorded is acceptable.
Assignee | ||
Comment 132•11 years ago
|
||
Needed to update the UUID of nsICrashReporter.
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb4d62536dfd
Comment 133•11 years ago
|
||
Updated•11 years ago
|
Blocks: fxdesktoptriage
Assignee | ||
Comment 134•11 years ago
|
||
Comment on attachment 8368319 [details]
Part 9: Change FHR provider to pull from crash manager
Requesting re-review since plugin references were removed.
I was pretty close to carrying r+ forward because code changes are minimal (ReviewBoard should give proper interdiff). But I like a review.
rnewman: I'll review the Sync patches right now.
Attachment #8368319 -
Flags: review?(rnewman)
Attachment #8368319 -
Flags: review?(benjamin)
Attachment #8368319 -
Flags: review+
Assignee | ||
Updated•11 years ago
|
Attachment #8368319 -
Attachment description: Part X: Change FHR provider to pull from crash manager → Part 9: Change FHR provider to pull from crash manager
Assignee | ||
Comment 135•11 years ago
|
||
Also, plugin reporting isn't going to land by Monday unless there is a miracle. We'll hopefully get that in next cycle and possibly uplift.
I plan to close out this bug once the FHR provider patch lands.
Comment 136•11 years ago
|
||
Comment on attachment 8368319 [details]
Part 9: Change FHR provider to pull from crash manager
rs=me
Attachment #8368319 -
Flags: review?(rnewman) → review+
Comment 137•11 years ago
|
||
Gregory, does that mean I should refuse the two pending uplift requests?
Updated•11 years ago
|
Flags: needinfo?(gps)
Updated•11 years ago
|
Attachment #8368319 -
Flags: review?(benjamin)
Assignee | ||
Comment 138•11 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/21793ea94b09
Landing FHR provider change and closing out this bug.
Will track plugin event recording in a soon-to-be-filed bug.
Whiteboard: [leave open]
Assignee | ||
Comment 139•11 years ago
|
||
Switching needinfo to bsmedberg to answer 29 uplift question.
Flags: needinfo?(gps) → needinfo?(benjamin)
Comment 140•11 years ago
|
||
This is not safe enough to uplift less than a week before beta.
Flags: needinfo?(benjamin)
Backed out in https://hg.mozilla.org/integration/fx-team/rev/56c230aeaa61 for ASAN xpcshell failure: https://tbpl.mozilla.org/php/getParsedLog.php?id=36094299&tree=Fx-Team
Assignee | ||
Comment 142•11 years ago
|
||
Likely due to --disable-crashreporter in ASAN builds. We didn't have a direct dependency on the crash reporter before because we slurped files directory off the filesystem. Will likely need to put the crashes provider behind an #ifdef.
Assignee | ||
Comment 143•11 years ago
|
||
https://hg.mozilla.org/integration/fx-team/rev/4496b6e98cf6
(Review conducted on ReviewBoard and IRC.)
Updated•11 years ago
|
Attachment #8368279 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment 144•11 years ago
|
||
Comment on attachment 8358103 [details]
Part 4: Add support for crash event files to crashes manager
Taking in account comment #140, I am going to reject the uplift request.
Fill free to submit it again later.
Attachment #8358103 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora-
Comment 145•11 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Updated•11 years ago
|
No longer blocks: fxdesktoptriage
Updated•11 years ago
|
status-firefox30:
--- → fixed
Comment 146•11 years ago
|
||
Comment 147•11 years ago
|
||
I couldn't generate any crashes to show up in about:healthreport on a Mac OS X 10.9.2 and Ubuntu 13.10 x32 machines. I suppose this is because bug 1004623 is not yet on beta branch?
Flags: needinfo?(gps)
Updated•11 years ago
|
Flags: needinfo?(gps) → needinfo?(benjamin)
Comment 148•11 years ago
|
||
Alexandra, I don't know exactly what builds you're testing. If you are testing a build which doesn't have bug 1004623, then you need to run a page with a multiprocess plugin before testing.
Flags: needinfo?(benjamin)
Comment 149•11 years ago
|
||
Sorry for not mentioning it, I was testing the fix for Firefox 30.
Verified as fixed on Firefox 30 beta 6 (Build ID: 20140520115057) after loading http://benjamin.smedbergs.us/tests/ctptests/javaflash-together.html and following the steps from bug 1004623 comment 7 - a record of org.mozilla.crashes.crashes is present in about:healthreport on Ubuntu 12.04 x32, Windows 7 x64 and Mac OS X 10.9.2:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Mozilla/5.0 (X11; Linux i686; rv:30.0) Gecko/20100101 Firefox/30.0
You need to log in
before you can comment on or make changes to this bug.
Description
•