Last Comment Bug 1293656 - Send stack traces for each content crash to the crash manager
: Send stack traces for each content crash to the crash manager
Status: VERIFIED FIXED
:
Product: Toolkit
Classification: Components
Component: Breakpad Integration (show other bugs)
: unspecified
: Unspecified Unspecified
-- normal (vote)
: mozilla53
Assigned To: Gabriele Svelto [:gsvelto]
:
: Ted Mielczarek [:ted.mielczarek]
Mentors:
Depends on: 1319071 1322611 1280477 1319702 1324249 1328768
Blocks: 1280469 1306426 1310674 1310664 1310703 1328657
  Show dependency treegraph
 
Reported: 2016-08-09 06:30 PDT by Gabriele Svelto [:gsvelto]
Modified: 2017-02-06 05:47 PST (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed


Attachments
[PATCH WIP] Send stack traces for content crashes to the crash manager (26.21 KB, patch)
2016-10-12 05:25 PDT, Gabriele Svelto [:gsvelto]
no flags Details | Diff | Splinter Review
[PATCH WIP] Send stack traces for content crashes to the crash manager (31.46 KB, patch)
2016-10-18 06:39 PDT, Gabriele Svelto [:gsvelto]
no flags Details | Diff | Splinter Review
[PATCH 1/3] Clean up the exception handler code (13.13 KB, patch)
2016-10-19 06:14 PDT, Gabriele Svelto [:gsvelto]
no flags Details | Diff | Splinter Review
[PATCH 2/3] Move filtering annotations that go in the crash ping into the crash manager (10.33 KB, patch)
2016-10-19 06:26 PDT, Gabriele Svelto [:gsvelto]
no flags Details | Diff | Splinter Review
[PATCH 3/3] Send crash pings for content crashes complete with stack traces (10.32 KB, patch)
2016-10-19 06:30 PDT, Gabriele Svelto [:gsvelto]
no flags Details | Diff | Splinter Review
[PATCH 2/3 v2] Move filtering annotations that go in the crash ping into the crash manager (15.31 KB, patch)
2016-10-20 02:29 PDT, Gabriele Svelto [:gsvelto]
benjamin: review-
Details | Diff | Splinter Review
[PATCH 3/3 v2] Send crash pings for content crashes complete with stack traces (10.36 KB, patch)
2016-10-21 03:42 PDT, Gabriele Svelto [:gsvelto]
ted: feedback+
Details | Diff | Splinter Review
[PATCH 3/3 v3] Send crash pings for content crashes complete with stack traces (15.76 KB, patch)
2016-10-25 07:53 PDT, Gabriele Svelto [:gsvelto]
benjamin: review-
Details | Diff | Splinter Review
[PATCH 2/3 v3] Store all crash annotations in the crash manager's database and filter out privacy-sensitive ones when sending a crash ping (17.69 KB, patch)
2016-11-03 08:11 PDT, Gabriele Svelto [:gsvelto]
benjamin: review+
Details | Diff | Splinter Review
[PATCH 3/3 v4] Send crash pings for content crashes complete with stack traces (25.22 KB, patch)
2016-11-03 08:17 PDT, Gabriele Svelto [:gsvelto]
no flags Details | Diff | Splinter Review
[PATCH 3/3 v5] Send crash pings for content crashes complete with stack traces (23.45 KB, patch)
2016-11-04 08:28 PDT, Gabriele Svelto [:gsvelto]
benjamin: review-
Details | Diff | Splinter Review
[PATCH 3/3 v6] Send crash pings for content crashes complete with stack traces (27.92 KB, patch)
2016-11-11 09:58 PST, Gabriele Svelto [:gsvelto]
benjamin: review+
Details | Diff | Splinter Review

Description User image Gabriele Svelto [:gsvelto] 2016-08-09 06:30:39 PDT
+++ This bug was initially created as a clone of Bug #1280477 +++

With the infrastructure setup in bug 1280477 we can analyze minidumps for content process crashes. This requires invoking the minidump analyzer from the CrashReporter parent and modifying it so that we'll be able to gather its output easily. Currently the minidump analyzer writes directly into the crash event file, we'll need to either write to a temporary file and read back the output or find a way to capture stdout directly.
Comment 1 User image Gabriele Svelto [:gsvelto] 2016-08-09 06:32:27 PDT
Ouch, wrong title.
Comment 2 User image Gabriele Svelto [:gsvelto] 2016-10-10 06:32:38 PDT
I've started hacking on this but I soon realized that there's a fairly big issue I hadn't considered. Right now the crash ping is sent only for full browser crashes. Sending a ping for a content crash without flagging it as different than others is likely to cause our server-side stuff to consider it a regular, browser crash. I can add a field that describes the type of crash but I don't know what needs to be changed in our serer-side telemetry code to pick it up. Chris, could you help me out with this?
Comment 3 User image Benjamin Smedberg [:bsmedberg] 2016-10-10 07:22:02 PDT
Mauro is also involved with crash datasets and is your data engineering lead.
Comment 4 User image David Durst [:ddurst] 2016-10-10 09:03:18 PDT
(In reply to Gabriele Svelto [:gsvelto] from comment #2)
> I've started hacking on this but I soon realized that there's a fairly big
> issue I hadn't considered. Right now the crash ping is sent only for full
> browser crashes. Sending a ping for a content crash without flagging it as
> different than others is likely to cause our server-side stuff to consider
> it a regular, browser crash. I can add a field that describes the type of
> crash but I don't know what needs to be changed in our serer-side telemetry
> code to pick it up. Chris, could you help me out with this?

The thought was that we could add a field to the datastore; any derived datasets will need to know about that (requires modifying the scala job, if I understand it correctly). But we were looking at that in terms of identifying the new ones, not differentiating the old ones -- so we should probably file that as a separate bug, because it seems like we want to make sure that any existing dataset-feeding job ignores records it shouldn't care about.
Comment 5 User image Gabriele Svelto [:gsvelto] 2016-10-10 14:03:52 PDT
(In reply to David Durst [:ddurst] from comment #4)
> The thought was that we could add a field to the datastore; any derived
> datasets will need to know about that (requires modifying the scala job, if
> I understand it correctly). But we were looking at that in terms of
> identifying the new ones, not differentiating the old ones -- so we should
> probably file that as a separate bug, because it seems like we want to make
> sure that any existing dataset-feeding job ignores records it shouldn't care
> about.

Yeah, what I fear is that adding a field (e.g. type of process that crashed) before we update our server-side jobs will cause the new crash pings to be lumped together with the old ones.

Basically I think that the current assumption is that each crash ping received is a browser crash. With new pings one must check if the additional field is present and categorize them based on that, and if not categorize them as browser crashes (they're old pings).
Comment 6 User image Mauro Doglio [:mdoglio] 2016-10-11 03:46:57 PDT
The crash_aggregates job needs to be changed [1] but it's not a lot of work. We just need to coordinate on the name of the flag.
Do you have a rough estimate for the size of the additional payload? What's the format? Does that change the frequency of the crash ping deliveries? (some of these questions may be answered by the telemetry client folks)

[1]https://github.com/mozilla/telemetry-batch-view/blob/72e448d4ae0b90ce4745d414137c45f1413b3c50/src/main/scala/com/mozilla/telemetry/views/CrashAggregateView.scala#L284
Comment 7 User image Gabriele Svelto [:gsvelto] 2016-10-11 04:01:36 PDT
Thanks Maurio, this is exactly what I was looking. The frequency of crash pings is going to go up since we'll send content process pings in addition to the pings for browser crashes we already get. The ping is going to be quite a bit bigger since it will contain the stack traces for every thread. I estimate it will be up to ten times larger but it will be restricted only to the nightly and aurora channels. BTW the size increase is not going to happen on this bug; it will happen earlier when I land bug 1280484.
Comment 8 User image Chris H-C :chutten 2016-10-11 06:14:29 PDT
The additional payload (stack frames) format is JSON, but it's probably best to ignore it for the time being as it won't be terribly useful to anyone until we write the server-side symbolication step.

The data in crash_aggregates can give you the exact count of new pings we're expecting (the child crashes from SUBPROCESS_ABNORMAL_ABORT/content will now also show up as crash pings) if that helps.
Comment 9 User image David Durst [:ddurst] 2016-10-11 10:49:23 PDT
Should we be looking at bug 1297843 also (and any others like it) so that we treat them all the right way?
Comment 10 User image Gabriele Svelto [:gsvelto] 2016-10-11 14:16:39 PDT
(In reply to David Durst [:ddurst] from comment #9)
> Should we be looking at bug 1297843 also (and any others like it) so that we
> treat them all the right way?

Yes, it's using the same code I'm hooking in to send stack traces to the crash manager so eventually we'll be able to support that too. For now I'm restricting this to content crashes only because they're more frequent and I want to add this functionality one step at a time.
Comment 11 User image Gabriele Svelto [:gsvelto] 2016-10-12 05:25:37 PDT
Created attachment 8800207 [details] [diff] [review]
[PATCH WIP] Send stack traces for content crashes to the crash manager

This is a WIP patch that does a few different things to enable this:

- Cleans up and factorizes some code in nsExceptionHandler.* to allow calling the minidump analyzer easily and also splits out the code for invoking the crashreporter client on Android which uses an entirely different method than on desktop platforms

- Calls the minidump analyzer on all content crashes so that the .extra file for those crashes includes the stack traces

- In the crash manager extracts the metadata from the .extra file for all crashes added via nsICrashService.addCrash()

This is still incomplete; I haven't added the code for sending the ping itself and I haven't introduced any new tests to cover the new functionality. I'll probably split up the patch in multiple pieces for ease of review.
Comment 12 User image Gabriele Svelto [:gsvelto] 2016-10-18 06:39:28 PDT
Created attachment 8802114 [details] [diff] [review]
[PATCH WIP] Send stack traces for content crashes to the crash manager

Another WIP patch; the biggest change being that I've removed the logic we use for filtering entries that go into the crash ping from the exception handler to the crash manager. This has multiple upsides: first of all we do this filtering in the same place where we assemble the ping which makes more sense. Additionally if we add more annotations to the event file then this will have to be reflected in the crash manager, otherwise the new annotation will not be sent thus reducing the risk of adding something to the ping accidentally. Third the in-code documentation was wrong, now there's no need for documentation anymore as the code always reflect the list of entries that go into the crash ping. Last but not least this code can be easily tested and I intend to add tests before landing so that we get this covered.

Since the changes are piling up I think I'll split this bug in two or possibly three pieces to simplify review and separate changes that are not directly related to each other.
Comment 13 User image Gabriele Svelto [:gsvelto] 2016-10-19 06:14:37 PDT
Created attachment 8802496 [details] [diff] [review]
[PATCH 1/3] Clean up the exception handler code

This is the first patch needed to enable this functionality. It cleans up and factorizes some code in the exception handler to simplify subsequent modifications. Specifically this does the following:

- It factors out into separate functions the code to launch the crashreporter client and splits it between desktop platforms (where we launch a program) and Android (where we launch an activity)

- It makes string manipulation functions use size_t for the length of strings instead of int (yuck)

- It removes some dead code

Overall this is pretty straightforward but makes reading this code a little easier and gives me a simple entry point to launch the minidump analyzer later.
Comment 14 User image Gabriele Svelto [:gsvelto] 2016-10-19 06:26:03 PDT
Created attachment 8802499 [details] [diff] [review]
[PATCH 2/3] Move filtering annotations that go in the crash ping into the crash manager

This is the second patch I've written to pave the way for adding content crash pings. What it does is to move the filtering logic that decides what annotations goes into the crash ping from the exception handler to the crash manager.

First of all this is a more logical approach IMHO. Right now we take the .extra file and copy the annotations it contains into the event file while filtering them in the exception handler. Later we pick up the annotations from the event file and send *all of them* along with the crash ping. This means that if afterwards something ended up in the event file, it will also be sent along the crash ping. Additionally there's no way to test this code to prevent a (privacy-sensitive) field from being accidentally added.

What this patch does is to put a complete whitelist of the fields we send into the crash manager and use it to filter out what we don't want before assembling the contents of the crash ping. This also makes the exception handler simpler which is a Good Thing™ (IMHO).

I'm not asking for review just yet because this is missing tests; I'll upload a revised patch with tests soon.
Comment 15 User image Gabriele Svelto [:gsvelto] 2016-10-19 06:30:15 PDT
Created attachment 8802500 [details] [diff] [review]
[PATCH 3/3] Send crash pings for content crashes complete with stack traces

Last but not least this is the patch that assembles and send a crash ping for every content crash. As you can see it's rather small as it leverages the refactoring I've done in the two previous patches.

In short what this does is run the minidump analyzer to get the stack traces for each content crash, then in the crash manager process the .extra file to extract the relevant annotations and then send them into a ping.

The ping now has an additional field called 'processType' which specifies the process type. This can be set to 'browser' for main content crashes or 'content' for content process crashes. Mauro does this work for you? As discussed on IRC if the field is absent it should be interpreted as 'browser' to preserve compatibility with older pings.
Comment 16 User image Gabriele Svelto [:gsvelto] 2016-10-19 06:32:06 PDT
I forgot something, I'm not asking for review for the third patch for the same reason as for the second: I'm not done writing the tests yet.
Comment 17 User image Gabriele Svelto [:gsvelto] 2016-10-20 02:29:38 PDT
Created attachment 8802849 [details] [diff] [review]
[PATCH 2/3 v2] Move filtering annotations that go in the crash ping into the crash manager

Updated patch with proper testing; I've tweaked the existing test to cover this functionality.
Comment 18 User image Mauro Doglio [:mdoglio] 2016-10-20 06:15:52 PDT
:gsvelto that works for me. I'll update this bug once bug 1310673 is solved, please wait for it before you land this patch on central.
Comment 19 User image Gabriele Svelto [:gsvelto] 2016-10-20 07:48:03 PDT
(In reply to Mauro Doglio [:mdoglio] from comment #18)
> :gsvelto that works for me. I'll update this bug once bug 1310673 is solved,
> please wait for it before you land this patch on central.

I'll wait for your changes before landing any of this so we don't risk polluting the existing data. BTW I've assumed that the volume of new pings is not going to be a problem for our infra but it might be worth checking. As with browser crashes, content crash pings will have stack traces on nightly and aurora (so it's going to be pretty big, up to 100-120KiB) but it won't have them once it will get to beta and release. Once there it'll be a little smaller than our regular crash ping (a few KiB probably).
Comment 20 User image Gabriele Svelto [:gsvelto] 2016-10-21 03:42:31 PDT
Created attachment 8803296 [details] [diff] [review]
[PATCH 3/3 v2] Send crash pings for content crashes complete with stack traces

The previous patch had a pretty big issue when parsing the extra file for content crashes. That's fixed in this patch but I'm encountering an unexpected issue on try which I'm still investigating.
Comment 21 User image Ted Mielczarek [:ted.mielczarek] 2016-10-24 07:33:14 PDT
Comment on attachment 8803296 [details] [diff] [review]
[PATCH 3/3 v2] Send crash pings for content crashes complete with stack traces

Review of attachment 8803296 [details] [diff] [review]:
-----------------------------------------------------------------

This generally looks fine.

::: ipc/glue/CrashReporterHost.cpp
@@ +91,5 @@
>    nsCString telemetryKey;
>  
>    switch (aProcessType) {
>      case GeckoProcessType_Content:
> +      CrashReporter::RunMinidumpAnalyzer(aChildDumpID);

Are we not going to bother with this for other crash types, or is that something we'd add later?

::: toolkit/components/crashes/CrashManager.jsm
@@ +586,5 @@
> +        let extraFile = yield OS.File.read(extraPath);
> +        let store = yield this._getStore();
> +        let decoder = new TextDecoder();
> +        let extraData = decoder.decode(extraFile);
> +        let metadata = parseKeyValuePairsFromLines(extraData.split("\n"));

`parseKeyValuePairs` does the split for you:
https://dxr.mozilla.org/mozilla-central/rev/215f9686117673a2c914ed207bc7da9bb8d741ad/toolkit/crashreporter/KeyValueParser.jsm#34

::: toolkit/crashreporter/nsExceptionHandler.cpp
@@ +1618,5 @@
>  #ifdef XP_WIN32
> +    exePath->GetPath(tempPath);
> +    crashReporterPath = reinterpret_cast<wchar_t*>(ToNewUnicode(tempPath));
> +    analyzerPath->GetPath(tempPath);
> +    minidumpAnalyzerPath = reinterpret_cast<wchar_t*>(ToNewUnicode(tempPath));

You don't need to store this as a global char* since you're not trying to run it during the in-process dump callback. We do that for the client path because we don't want to be calculating it in a crashed process.

It'd be fine to just derive the path at runtime in `RunMinidumpAnalyzer`.
Comment 22 User image Gabriele Svelto [:gsvelto] 2016-10-25 01:15:19 PDT
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> ::: ipc/glue/CrashReporterHost.cpp
> @@ +91,5 @@
> >    nsCString telemetryKey;
> >  
> >    switch (aProcessType) {
> >      case GeckoProcessType_Content:
> > +      CrashReporter::RunMinidumpAnalyzer(aChildDumpID);
> 
> Are we not going to bother with this for other crash types, or is that
> something we'd add later?

We'll add support for other crashes later. We're still working on the server-side infrastructure to accommodate for this and use this data in a meaningful way which is why I'm restricting this only to content crashes for now.

> `parseKeyValuePairs` does the split for you:
> https://dxr.mozilla.org/mozilla-central/rev/
> 215f9686117673a2c914ed207bc7da9bb8d741ad/toolkit/crashreporter/
> KeyValueParser.jsm#34

Thanks, I hadn't noticed it :-|

> You don't need to store this as a global char* since you're not trying to
> run it during the in-process dump callback. We do that for the client path
> because we don't want to be calculating it in a crashed process.
> 
> It'd be fine to just derive the path at runtime in `RunMinidumpAnalyzer`.

Good point, that'll make the code simpler. I've got most of the tests working already and I'm adding some more before I ask for review.
Comment 23 User image Gabriele Svelto [:gsvelto] 2016-10-25 07:53:46 PDT
Created attachment 8804270 [details] [diff] [review]
[PATCH 3/3 v3] Send crash pings for content crashes complete with stack traces

OK, this should be good for review, it include tests and it seems to be working fine:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=60b6006c1320

While applying the changes suggested by Ted I've tried to factorize out the code that retrieves the executable path for the crashreporter client and minidump analyzer to make it a little more readable.
Comment 24 User image Benjamin Smedberg [:bsmedberg] 2016-10-25 09:41:38 PDT
Comment on attachment 8802849 [details] [diff] [review]
[PATCH 2/3 v2] Move filtering annotations that go in the crash ping into the crash manager

Why is this a good change to make? In particular if we're going to implement a separate crash ping sender, wouldn't we need to duplicate this whitelist into there again?

>Bug 1293656 - Move filtering annotations that go in the crash ping into the crash manager

Please add more detail here in the commit message. In particular, we're going to record all of the data in the event file on disk, and do the filtering later?

>diff --git a/toolkit/components/crashes/CrashManager.jsm b/toolkit/components/crashes/CrashManager.jsm

>+function parseAndRemove(obj, field, raw) {

This function needs doccomments. I can't figure out reading the signature what it actually does. It seems `raw` is a bool? Since many values are raw, perhaps we should flip this and make it `parseAsJSON` to be clear?

>+  // A whitelist of crash annotations which do not contain sensitive data
>+  // and are saved in the crash record and sent with Firefox Health Report.
>+  // This needs to be kept in sync with nsCrashReporter.cpp

There is no file nsCrashReporter.cpp, and didn't you remove the whitelist from nsExceptionHandler.cpp so there's nothing to keep in sync?

>   _handleEventFilePayload: function (store, entry, type, date, payload) {
>       // The payload types and formats are documented in docs/crash-events.rst.
>       // Do not change the format of an existing type. Instead, invent a new
>       // type.
>       // DO NOT ADD NEW TYPES WITHOUT DOCUMENTING!
>       let lines = payload.split("\n");
> 
>       switch (type) {
>@@ -526,58 +625,17 @@ this.CrashManager.prototype = Object.fre
>           }
>           // fall-through
>         case "crash.main.2":
>           let crashID = lines[0];
>           let metadata = parseKeyValuePairsFromLines(lines.slice(1));
>           store.addCrash(this.PROCESS_TYPE_MAIN, this.CRASH_TYPE_CRASH,
>                          crashID, date, metadata);

The consequence here is that we are recording more data than before? Previously we only recorded metadata that was on the whitelist: now we're recording everything but not sending it in the ping to the server? Is that the behavior we actually want here? I'm willing to say "yes" but we should document and understand this.

Do we have crash manager .rst docs? If so don't they need some updating?
Comment 25 User image Benjamin Smedberg [:bsmedberg] 2016-10-25 09:53:35 PDT
Comment on attachment 8804270 [details] [diff] [review]
[PATCH 3/3 v3] Send crash pings for content crashes complete with stack traces

Are you asking for a data review or a code review? I don't see any doc updates that I'd use for a data review.

>diff --git a/ipc/glue/CrashReporterHost.cpp b/ipc/glue/CrashReporterHost.cpp

>+      CrashReporter::RunMinidumpAnalyzer(aChildDumpID);

None of the method comments say whether this method is synchronous or asynchronous. I imagine that analyzing a minidump takes a while and isn't something that we'd want to block the main thread on. If this is sync, that should be fixed. If async, documented ;-)

>diff --git a/toolkit/components/crashes/CrashManager.jsm b/toolkit/components/crashes/CrashManager.jsm

>   addCrash: function (processType, crashType, id, date, metadata) {
>     return Task.spawn(function* () {
>+      // Send a telemetry ping for each content process crash
>+      if (processType === this.PROCESS_TYPE_CONTENT) {
>+        let metadata = yield this._processExtraFile(id);
>+
>+        this._sendCrashPing(id, processType, date, metadata);
>+      }
>+
>       let store = yield this._getStore();
>       if (store.addCrash(processType, crashType, id, date, metadata)) {
>         yield store.save();
>       }
>     }.bind(this));

I'm a little uncomfortable with this. The crash manager has previously not been aware of extra files or had any external state dependencies on them. In terms of the API design, I feel like it would be better to explicitly pass in the extra data somehow, rather than having to grovel for it/read it off disk.

Related, you have an inner `let metadata` which hides the outer-scope parameter. I think this is bad style and should fail linting at least, but also feels like we have multiple ways of passing metadata around which aren't good.

>diff --git a/toolkit/crashreporter/nsExceptionHandler.h b/toolkit/crashreporter/nsExceptionHandler.h

>+void RunMinidumpAnalyzer(const nsAString& id);

doccomments!

Also moar docs in general! I think you need to update http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/data/crash-ping.rst and http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/docs/concepts/crashes.rst
Comment 26 User image Gabriele Svelto [:gsvelto] 2016-10-26 05:43:27 PDT
(In reply to Benjamin Smedberg [:bsmedberg] from comment #24)
> Why is this a good change to make?

Right now it allows us to keep the list in a single place where it can be tested. Otherwise I'd have to duplicate it in nsExceptionHandler.cpp (for main process crashes) and in the crash manager (for content process crashes).

> In particular if we're going to implement
> a separate crash ping sender, wouldn't we need to duplicate this whitelist
> into there again?

That would require duplicating more than just the list. I'm still thinking about the design of that to minimize that.

> Please add more detail here in the commit message. In particular, we're
> going to record all of the data in the event file on disk, and do the
> filtering later?

Yes, that's my goal so that we can be sure that no other data can accidentally end in the ping, no matter where it's coming from (extra file, event file, etc...) or when it was added.

> This function needs doccomments. I can't figure out reading the signature
> what it actually does. It seems `raw` is a bool? Since many values are raw,
> perhaps we should flip this and make it `parseAsJSON` to be clear?

Right, I'll do that.

> >+  // A whitelist of crash annotations which do not contain sensitive data
> >+  // and are saved in the crash record and sent with Firefox Health Report.
> >+  // This needs to be kept in sync with nsCrashReporter.cpp
> 
> There is no file nsCrashReporter.cpp, and didn't you remove the whitelist
> from nsExceptionHandler.cpp so there's nothing to keep in sync?

Yeah, it's a leftover from an older version of the patch and it's about nsExceptionHandler.cpp.

> >   _handleEventFilePayload: function (store, entry, type, date, payload) {
> >       // The payload types and formats are documented in docs/crash-events.rst.
> >       // Do not change the format of an existing type. Instead, invent a new
> >       // type.
> >       // DO NOT ADD NEW TYPES WITHOUT DOCUMENTING!
> >       let lines = payload.split("\n");
> > 
> >       switch (type) {
> >@@ -526,58 +625,17 @@ this.CrashManager.prototype = Object.fre
> >           }
> >           // fall-through
> >         case "crash.main.2":
> >           let crashID = lines[0];
> >           let metadata = parseKeyValuePairsFromLines(lines.slice(1));
> >           store.addCrash(this.PROCESS_TYPE_MAIN, this.CRASH_TYPE_CRASH,
> >                          crashID, date, metadata);
> 
> The consequence here is that we are recording more data than before?
> Previously we only recorded metadata that was on the whitelist: now we're
> recording everything but not sending it in the ping to the server? Is that
> the behavior we actually want here? I'm willing to say "yes" but we should
> document and understand this.

Yes, the extra data is now being put into the database too though it's not much. The stack traces are larger by far in comparison.

> Do we have crash manager .rst docs? If so don't they need some updating?

We do and it's rather bare right now so I can document this and the rest of the logic in greater detail. The contents of the ping on the other hand are well documented; I've listed all the fields we include in bug 1280484.
Comment 27 User image Ted Mielczarek [:ted.mielczarek] 2016-10-26 09:27:21 PDT
Comment on attachment 8804270 [details] [diff] [review]
[PATCH 3/3 v3] Send crash pings for content crashes complete with stack traces

Review of attachment 8804270 [details] [diff] [review]:
-----------------------------------------------------------------

bsmedberg's review should be sufficient here. If you'd like me to take another look at it feel free to re-request review.
Comment 28 User image Gabriele Svelto [:gsvelto] 2016-10-27 06:00:18 PDT
(In reply to Benjamin Smedberg [:bsmedberg] from comment #25)
> Are you asking for a data review or a code review? I don't see any doc
> updates that I'd use for a data review.

Both but you're right, the extra documentation is missing.

> >diff --git a/ipc/glue/CrashReporterHost.cpp b/ipc/glue/CrashReporterHost.cpp
> 
> >+      CrashReporter::RunMinidumpAnalyzer(aChildDumpID);
> 
> None of the method comments say whether this method is synchronous or
> asynchronous. I imagine that analyzing a minidump takes a while and isn't
> something that we'd want to block the main thread on. If this is sync, that
> should be fixed. If async, documented ;-)
> [...]
> I'm a little uncomfortable with this. The crash manager has previously not
> been aware of extra files or had any external state dependencies on them. In
> terms of the API design, I feel like it would be better to explicitly pass
> in the extra data somehow, rather than having to grovel for it/read it off
> disk.

Currently this code is synchronous because the content crash handler [1] will eventually delete the files so this ensures that doesn't happen before we're done with it. One way to handle this would be to move the code that runs the minidump analyzer and parse the extra file to the content crash handler. However I still feel that the code that assembles the crash ping should all be in the same place so that we have a consistent mechanism to assemble pings for all types of crashes.

[1] https://dxr.mozilla.org/mozilla-central/source/browser/modules/ContentCrashHandlers.jsm

> Related, you have an inner `let metadata` which hides the outer-scope
> parameter. I think this is bad style and should fail linting at least, but
> also feels like we have multiple ways of passing metadata around which
> aren't good.

That was a mistake, thanks for spotting it.

I'll try to rework this code so that the extra file is parsed in the content crash handler instead and see if it makes more sense.
Comment 29 User image Gabriele Svelto [:gsvelto] 2016-10-28 03:21:19 PDT
While reworking the patch I've noticed something odd: if we submit a content crash then the minidump and extra files are deleted, but if we don't they're just left around. Mike, Felipe, you both seem to have worked on this. Was this behavior intentional? Is there any other mechanism that cleans up the content process crash dumps? CrashSubmit.jsm has a pruneSavedDumps() and delete() method but they both seem to be used only from b2g code.
Comment 30 User image Gabriele Svelto [:gsvelto] 2016-10-28 05:36:02 PDT
OK, there's also this that removes all crash dumps & extra files:

https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/content/crashes.js#131

All in all the lifetime of all crash files that are not main process crashes is... unclear. They can be left alone or deleted depending on user actions and it doesn't necessarily happen in the same place. Running the minidump analyzer asynchronously without risking races is complicated. I think we should first consolidate the lifetime management of these files into a single place but I don't know which would be appropriate.
Comment 31 User image Gabriele Svelto [:gsvelto] 2016-10-28 06:34:58 PDT
(In reply to Benjamin Smedberg [:bsmedberg] from comment #25)
> I'm a little uncomfortable with this. The crash manager has previously not
> been aware of extra files or had any external state dependencies on them. In
> terms of the API design, I feel like it would be better to explicitly pass
> in the extra data somehow, rather than having to grovel for it/read it off
> disk.

In the light of this comment and what I found out about the crash dump lifecycle do you think it would be reasonable to augment CrashSubmit so that it's the only piece of code that tracks and manages all pending crashdumps? This would have a few benefits:

- Provide a single point from where to list those files

- Provide a mechanism to ensure a crashdump is kept around until all its "consumers" are done with it

- Provide a reliable mechanism to clean up crashdumps the user has already submitted, or ignored

CrashSubmit already has most of this functionality but some of it has only ever been used in b2g (!). My guess is that's because leaving crash dumps around on b2g could easily fill up one of the phone's partitions which is a lot less likely to happen on a desktop machine.

I could start by adding what's missing and using it in this patch then file a follow up to adapt the other places in the code where we deal with pending files.
Comment 32 User image Mike Conley (:mconley) 2016-10-28 11:39:49 PDT
(In reply to Gabriele Svelto [:gsvelto] from comment #29)
> While reworking the patch I've noticed something odd: if we submit a content
> crash then the minidump and extra files are deleted, but if we don't they're
> just left around. Mike, Felipe, you both seem to have worked on this. Was
> this behavior intentional?

Not a thing I know, I'm afraid. I've worked a bit with CrashSubmit in order to make about:tabcrashed and the plugin crash reporter work, but lifetimes of the crash report files themselves is not a thing I've investigated or know about.
Comment 33 User image Gabriele Svelto [:gsvelto] 2016-10-31 02:21:47 PDT
(In reply to Mike Conley (:mconley) - (digging out of review / needinfo backlog) from comment #32)
> Not a thing I know, I'm afraid. I've worked a bit with CrashSubmit in order
> to make about:tabcrashed and the plugin crash reporter work, but lifetimes
> of the crash report files themselves is not a thing I've investigated or
> know about.

Thanks Mike. I've dug a little further myself and it looks like CrashSubmit and most of its infrastructure was born and designed around b2g. That's unsurprising because b2g was the first scenario in which we had plenty of content processes and they did crash rather often (mostly due to e10s bugs or crappy dependencies). Since it seems appropriate for the job I think I'll re-purpose most of that functionality to fulfill our desktop requirements.
Comment 34 User image Mauro Doglio [:mdoglio] 2016-10-31 07:58:44 PDT
:gsvelto heads up, bug 1310673 is now fixed.
Comment 35 User image :Felipe Gomes (needinfo me!) 2016-10-31 09:38:48 PDT
I don't know about the minidump files management either. I believe you found information about the other questions.. Needinfo me again if I can help with something else
Comment 36 User image Benjamin Smedberg [:bsmedberg] 2016-10-31 12:40:03 PDT
(In reply to Gabriele Svelto [:gsvelto] from comment #29)
> While reworking the patch I've noticed something odd: if we submit a content
> crash then the minidump and extra files are deleted, but if we don't they're
> just left around.

This is intentional: if we don't submit, then the crash continues to show up as unsubmitted in about:crashes and the user can choose to submit later. The files should be deleted if/when the user chooses to submit. Also I believe there's cleanup code that deletes old dump/extra files after a certain threshold or period of time. But I don't remember where that lives!

You're right that this is currently messy and lives in various scripts all over the tree. I think originally the goal of the crash manager was to be the API and driver for about:crashes and crash submission. But you're welcome to organize that any way you think is reasonable.

I just don't think we should enable doing that work synchronously on the main thread, since that's potentially a *lot* of jank.
Comment 37 User image Gabriele Svelto [:gsvelto] 2016-11-03 08:11:45 PDT
Created attachment 8807167 [details] [diff] [review]
[PATCH 2/3 v3] Store all crash annotations in the crash manager's database and filter out privacy-sensitive ones when sending a crash ping

I've addressed the review comments, added some extra documentation and made the commit message clearer.
Comment 38 User image Gabriele Svelto [:gsvelto] 2016-11-03 08:17:16 PDT
Created attachment 8807172 [details] [diff] [review]
[PATCH 3/3 v4] Send crash pings for content crashes complete with stack traces

I've modified this patch heavily compared to the previous version:

- The whole process of analyzing a content crash dump and adding it to the crash manager is now entirely asynchronous as far as the main thread is concerned, the synchronous parts have been moved to a background thread which is spawned only when needed and then torn down

- I've updated both the in-code and .rst documentation

- I've moved processing the extra file out of the crash manager and into the crash service. I deliberately did not put this part in the runnable that processes the minidump file because in C++ it took a _lot_ more code than in JS.

The rest of the patch should be essentially the same.
Comment 39 User image Gabriele Svelto [:gsvelto] 2016-11-03 08:28:09 PDT
I forgot something. In attachment 8807172 [details] [diff] [review] I also had to change the regexp that's used to detect .dmp files in the crash manager. This is because breakpad uses two different formats for minidump file names. One uses the traditional 8-4-4-4-12 arrangement for the file name GUID, the other one - on Linux - uses an 8-4-4-8-8 arrangment:

https://chromium.googlesource.com/breakpad/breakpad/+/master/src/common/linux/guid_creator.h#38

The change that introduced that string dates back to 200, I'm not sure why it was made.
Comment 40 User image Gabriele Svelto [:gsvelto] 2016-11-04 08:28:21 PDT
Created attachment 8807588 [details] [diff] [review]
[PATCH 3/3 v5] Send crash pings for content crashes complete with stack traces

I found that the previous patch had an issue that showed up in the tests:

https://treeherder.mozilla.org/logviewer.html#?job_id=30394453&repo=try#L1796

It's not the first time I run into this and I couldn't figure out what causes it. I worked around it quickly by replacing my use of CrashManager.pendingDumps() when adding a content crash with something that doesn't need to touch the CrashManager.
Comment 41 User image Benjamin Smedberg [:bsmedberg] 2016-11-09 09:32:40 PST
Comment on attachment 8807588 [details] [diff] [review]
[PATCH 3/3 v5] Send crash pings for content crashes complete with stack traces

>diff --git a/ipc/glue/CrashReporterHost.cpp b/ipc/glue/CrashReporterHost.cpp


>+/**
>+ * Add information about a crash to the crash manager. This method runs the
>+ * various activities required to gather additional information and notify the
>+ * crash manager asynchronously, since many of them involve I/O or potentially
>+ * long processing.

How does this interact with shutdown? Is this long enough that it shouldn't block shutdown? I don't see this adding a shutdown blocker anywhere: we do wait for the lazy idle threads to complete before doing the very end of shutdown, but by that time I'm sure we've finished shutting down the profile etc. So I expect that we need extra code here to explicitly block shutdown and perhaps do timeouts if things take too long.

Additional questions:
* should we monitor telemetry to measure how long this process takes?
* what happens if the analyzer fails? Do we even notice (should we be monitoring telemetry on this)?
* If the main process crashes while we're processing, I guess we'll lose this entirely? Is that the correct tradeoff? I believe we've had content crashes that *cause* main-process crashes soon after, so this doesn't seem entirely theoretical.

>diff --git a/toolkit/crashreporter/nsExceptionHandler.cpp b/toolkit/crashreporter/nsExceptionHandler.cpp


>+/**
>+ * Runs the minidump analyzer program on the specified crash dump. The analyzer
>+ * will extract the stack traces from the dump and store them in JSON format as
>+ * an annotation in the extra file associated with the crash.
>+ *
>+ * This method waits synchronously for the program to have finished executing,
>+ * do not call it from the main thread!
>+ */
>+void
>+RunMinidumpAnalyzer(const nsAString& id)
>+{
>+#if !defined(MOZ_WIDGET_ANDROID)
>+  nsCOMPtr<nsIFile> xreDir;
>+  nsDirectoryService::gService->Get(NS_GRE_BIN_DIR,
>+                                    NS_GET_IID(nsIFile),
>+                                    getter_AddRefs(xreDir));
>+

This is now running on a non-main thread, right? You may not use the directory service off the main thread. It should assert!

Is there a different patch which updates toolkit/crashreporter/docs/index.rst ? This machinery is really complex, and we need to document a bunch of the data flow there to give anyone else a chance to understand and modify this code in the future.
Comment 42 User image Gabriele Svelto [:gsvelto] 2016-11-10 05:47:10 PST
(In reply to Benjamin Smedberg [:bsmedberg] from comment #41)
> How does this interact with shutdown? Is this long enough that it shouldn't
> block shutdown? I don't see this adding a shutdown blocker anywhere: we do
> wait for the lazy idle threads to complete before doing the very end of
> shutdown, but by that time I'm sure we've finished shutting down the profile
> etc. So I expect that we need extra code here to explicitly block shutdown
> and perhaps do timeouts if things take too long.

Yes, we're not blocking shutdown in any way or form. If the crash eventually causes the main process to shut down then we might not send the crash ping.

> Additional questions:
> * should we monitor telemetry to measure how long this process takes?

I don't think this is going to take much; it's about processing a small file (<1MB) which has just been written to disk and thus is highly likely to be in the FS cache on any modern OS. Plus now that it's happening on a background thread it really shouldn't hamper the user in any way or form.

> * what happens if the analyzer fails? Do we even notice (should we be
> monitoring telemetry on this)?

If the analyzer fails in a catastrophic way (crashes) we won't have a StackTraces annotation in the .extra file. The ping will still be sent nevertheless, just without including the stack traces. If the analyzer fails in a controlled way the StackTraces annotation will be there (though possibly with incomplete traces) and will contain the error that breakpad encountered while parsing the minidump file (e.g. minidump file is corrupt, the version is not supported, etc...).

> * If the main process crashes while we're processing, I guess we'll lose
> this entirely? Is that the correct tradeoff? I believe we've had content
> crashes that *cause* main-process crashes soon after, so this doesn't seem
> entirely theoretical.

Yes, we'll lose everything. The rationale behind the crashsender (bug 1310703) is that having an external, resilient tool capable of analyzing the minidump and sending the ping will allow us to gather information about crashes irrespective of what ultimately happens to Firefox (and possibly enables us to cut some of this complexity out of Gecko).

> >+void
> >+RunMinidumpAnalyzer(const nsAString& id)
> >+{
> >+#if !defined(MOZ_WIDGET_ANDROID)
> >+  nsCOMPtr<nsIFile> xreDir;
> >+  nsDirectoryService::gService->Get(NS_GRE_BIN_DIR,
> >+                                    NS_GET_IID(nsIFile),
> >+                                    getter_AddRefs(xreDir));
> >+
> 
> This is now running on a non-main thread, right? You may not use the
> directory service off the main thread. It should assert!

Strangely enough it does not, and if you look at the code there's no assertion in place:

https://dxr.mozilla.org/mozilla-central/rev/336759fad4621dfcd0a3293840edbed67018accd/xpcom/io/nsDirectoryService.cpp#344

Since the implementation is caching the paths, it might be possible that we're hitting a cached entry and thus avoiding an assert that may be lurking somewhere else if we were to retrieve the actual path. I'll rewrite this code to read that directory on the main thread just in case. But if the directory service is main-thread only then we should file a bug to enforce it.

> Is there a different patch which updates
> toolkit/crashreporter/docs/index.rst ? This machinery is really complex, and
> we need to document a bunch of the data flow there to give anyone else a
> chance to understand and modify this code in the future.

Good point. I had updated that file with the flow for main process crashes but I forgot to do it for content process crashes. I'll add it to the patch.
Comment 43 User image Benjamin Smedberg [:bsmedberg] 2016-11-10 07:11:18 PST
> Yes, we're not blocking shutdown in any way or form. If the crash eventually
> causes the main process to shut down then we might not send the crash ping.

We should block profile-shutdown on this async task, then. Calling the crash service after that isn't a good idea.

> code to read that directory on the main thread just in case. But if the
> directory service is main-thread only then we should file a bug to enforce
> it.

See bug 746830, bug 641367 and a few others at https://bugzilla.mozilla.org/buglist.cgi?quicksearch=directory%20service%20thread&list_id=13303828
Comment 44 User image Gabriele Svelto [:gsvelto] 2016-11-10 09:10:58 PST
(In reply to Benjamin Smedberg [:bsmedberg] from comment #43)
> We should block profile-shutdown on this async task, then. Calling the crash
> service after that isn't a good idea.

I'm looking at the AsyncShutdown documentation & code, if I understood it correctly this means adding a blocker to the profileBeforeChange barrier right?

I'll have to think this through carefully as I'll have to call AddBlocker() on the C++ side, then RemoveBlocker() on the JS side once we're done. I hope it doesn't turn out too ugly.
Comment 45 User image Benjamin Smedberg [:bsmedberg] 2016-11-10 11:08:49 PST
I think you can add and remove this in c++, but don't remove the blocker until after you've called into the crash manager? And then if the crash manager needs its own shutdown blocker (which it might for its I/O), that would be entirely in JS.
Comment 46 User image Gabriele Svelto [:gsvelto] 2016-11-11 03:29:36 PST
(In reply to Benjamin Smedberg [:bsmedberg] from comment #45)
> I think you can add and remove this in c++, but don't remove the blocker
> until after you've called into the crash manager? And then if the crash
> manager needs its own shutdown blocker (which it might for its I/O), that
> would be entirely in JS.

Right, calling into the CrashService is synchronous so I can grab another blocker in there and relinquish the C++ one after CrashService.addCrash() returns.
Comment 47 User image Gabriele Svelto [:gsvelto] 2016-11-11 09:58:11 PST
Created attachment 8809888 [details] [diff] [review]
[PATCH 3/3 v6] Send crash pings for content crashes complete with stack traces

Updated patch, the changes are:

- Both running the minidump analyzer, adding the crash to the CrashManager database and sending the crash ping now block shutdown

- CrashService.addCrash() and CrashManager.addCrash() now chain all their asynchronous operations so that they can be returned as a single promise

- I went back to computing the minidump analyzer path at startup and storing it in a global so that we don't have to worry about access off the main thread to the directory service, building it after a crash, etc... It's crude, but it works.

- I've added more documentation
Comment 48 User image Benjamin Smedberg [:bsmedberg] 2016-11-17 10:25:58 PST
Comment on attachment 8809888 [details] [diff] [review]
[PATCH 3/3 v6] Send crash pings for content crashes complete with stack traces

>diff --git a/ipc/glue/CrashReporterHost.cpp b/ipc/glue/CrashReporterHost.cpp

>+  /**
>+   * Create a new minidump analyzer runnable, this will also block shutdown
>+   * until the associated crash has been added to the crash manager.
>+   */
>+  AsyncMinidumpAnalyzer(int32_t aProcessType,
>+                        int32_t aCrashType,
>+                        const nsString& aChildDumpID)
>+    : mProcessType(aProcessType)
>+    , mCrashType(aCrashType)
>+    , mChildDumpID(aChildDumpID)
>+    , mName(NS_LITERAL_STRING("Crash reporter: blocking on minidump analysis"))
>+  {
>+    nsresult rv = GetShutdownBarrier()->AddBlocker(
>+      this, NS_LITERAL_STRING(__FILE__), __LINE__,
>+      NS_LITERAL_STRING("Minidump analisys"));

nit "analysis"

Reading through this patch, it's not clear to me which thread is running each method. I'd suggest at a minimum comment @notes explaining that, although threading assertions sound even better. Asserting on NS_IsMainThread or !NS_IsMainThread.

If I'm reading this right, this constructor runs on the main thread?

>+    MOZ_RELEASE_ASSERT(NS_SUCCEEDED(rv));

I doubt a release assert is the right thing here. Crashing in a crash-handling path sounds too aggressive.

>+  NS_IMETHOD Run() override
>+  {

And this runs on the threadpool helper?

>+  // Returns the profile-before-change shutdown barrier, cannot fail
>+  static nsCOMPtr<nsIAsyncShutdownClient> GetShutdownBarrier()

This is mainthread-only?

>+  {
>+    nsCOMPtr<nsIAsyncShutdownService> svc = services::GetAsyncShutdown();
>+    MOZ_RELEASE_ASSERT(svc);
>+
>+    nsCOMPtr<nsIAsyncShutdownClient> barrier;
>+    nsresult rv = svc->GetProfileBeforeChange(getter_AddRefs(barrier));
>+    MOZ_RELEASE_ASSERT(NS_SUCCEEDED(rv));
>+    MOZ_RELEASE_ASSERT(barrier);
>+    return barrier.forget();

Again seems like too much release asserting.

>+  }
>+
>+  int32_t mProcessType;
>+  int32_t mCrashType;
>+  nsString mChildDumpID;
>+  const nsString mName;

Why is nName const but not mChildDumpID? Seems like they both should be.

>+/* static */ void
>+CrashReporterHost::AsyncAddCrash(int32_t aProcessType,
>+                                 int32_t aCrashType,
>+                                 const nsString& aChildDumpID)
>+{

mainthread only?

>diff --git a/toolkit/components/crashes/CrashManager.jsm b/toolkit/components/crashes/CrashManager.jsm

>-  DUMP_REGEX: /^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.dmp$/i,
>+  DUMP_REGEX: /^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-([0-9a-f]{4}-[0-9a-f]{12}|[0-9a-f]{8}-[0-9a-f]{8}))\.dmp$/i,

what changed here? Do subprocess dumps have a different filename/uuid format?

>-    TelemetryController.submitExternalPing("crash",
>+    return TelemetryController.submitExternalPing("crash",

Reading this and below, this is probably-incorrect. submitExternalPing is documented to return a test-only promise, but it's not clear exactly when that will be resolved. We want the telemetry controller to block shutdown if necessary, and not this code (and it doesn't block shutdown on sending the ping, just recording it to disk). So I believe that this should return void as it currently does and not affect this shutdown blocker.

This also means that `addCrash` should not yield the result of _sendCrashPing, only call that functions.

Marking r+ with changes because I don't think anything here is too complex and it just needs minor fixup.
Comment 49 User image Gabriele Svelto [:gsvelto] 2016-11-21 06:11:36 PST
(In reply to Benjamin Smedberg [:bsmedberg] from comment #48)
> Reading through this patch, it's not clear to me which thread is running
> each method. I'd suggest at a minimum comment @notes explaining that,
> although threading assertions sound even better. Asserting on
> NS_IsMainThread or !NS_IsMainThread.

I'll update the patch and put assertions everywhere instead. Comments might be ignored, assertions will turn try runs red.

> I doubt a release assert is the right thing here. Crashing in a
> crash-handling path sounds too aggressive.

OK, I've tried to mirror other code dealing with barriers hence the assertions. I'll change this to fail gracefully instead.

> >+  NS_IMETHOD Run() override
> >+  {
> 
> And this runs on the threadpool helper?

Yes.

> >+  // Returns the profile-before-change shutdown barrier, cannot fail
> >+  static nsCOMPtr<nsIAsyncShutdownClient> GetShutdownBarrier()
> 
> This is mainthread-only?

Yes.

> >+  }
> >+
> >+  int32_t mProcessType;
> >+  int32_t mCrashType;
> >+  nsString mChildDumpID;
> >+  const nsString mName;
> 
> Why is nName const but not mChildDumpID? Seems like they both should be.

Right, I'll do that.

> mainthread only?

Yep.

> >-  DUMP_REGEX: /^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\.dmp$/i,
> >+  DUMP_REGEX: /^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-([0-9a-f]{4}-[0-9a-f]{12}|[0-9a-f]{8}-[0-9a-f]{8}))\.dmp$/i,
> 
> what changed here? Do subprocess dumps have a different filename/uuid format?

It's a leftover from a previous version. It's about breakpad using a different format for turning UUIDs into filenames in Linux (see comment 39). This change is not needed anymore since I'm not calling CrashManager.pendingDumps() anymore, I'll remove it and file a separate bug to fix it.

> >-    TelemetryController.submitExternalPing("crash",
> >+    return TelemetryController.submitExternalPing("crash",
> 
> Reading this and below, this is probably-incorrect. submitExternalPing is
> documented to return a test-only promise, but it's not clear exactly when
> that will be resolved. We want the telemetry controller to block shutdown if
> necessary, and not this code (and it doesn't block shutdown on sending the
> ping, just recording it to disk). So I believe that this should return void
> as it currently does and not affect this shutdown blocker.
> 
> This also means that `addCrash` should not yield the result of
> _sendCrashPing, only call that functions.
> 
> Marking r+ with changes because I don't think anything here is too complex
> and it just needs minor fixup.

I see, I hadn't noticed the note about the promise being test-only. Looking at the code in TelemetrySend.submitPing() the promise is effectively resolved only for persisting the ping, not for sending. Arguably once the ping is persisted it will be sent eventually. It seems to me that Telemetry is already blocking shutdown when a ping is pending (by calling Impl._trackPendingTask()) so I think we're good here.
Comment 50 User image Gabriele Svelto [:gsvelto] 2016-11-21 06:22:46 PST
Here's a try run with review comments addressed:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=af0057b95ca93329e609c3e31a68d7690ed31760
Comment 51 User image Gabriele Svelto [:gsvelto] 2016-11-23 00:50:19 PST
I've been running try-runs for the past couple of days and I've identified at least two mochitests that are timing-based and now fail very often due to the asynchronous nature of my changes. They were probably prone to failure before too (the process of adding a crash was already asynchronous, I've just made it a little bit more) so I'll have to fix them first before landing.
Comment 52 User image Gabriele Svelto [:gsvelto] 2016-11-25 02:50:42 PST
This is a try run with the lastest patch from bug 1319702 applied:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=60d52514302aff7268dea9572ed46a3d28dc5931

Finally everything's green as it should; no more crash-related orange.
Comment 53 User image Gabriele Svelto [:gsvelto] 2016-12-05 13:32:32 PST
Now that bug 1319702 is fixed I can finally land this. This is the green try-run with refreshed patches that apply to the current tree and have all the remaining review comments / nits addressed:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec4a1ccb8141ccf8a96da6cec16c8b595725f96b
Comment 54 User image Pulsebot 2016-12-05 13:34:33 PST
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b997bb5bccb4
Clean up the exception handler code r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/f59861923f81
Store all crash annotations in the crash manager's database and filter out privacy-sensitive ones when sending a crash ping r=bsmedberg
https://hg.mozilla.org/integration/mozilla-inbound/rev/c8545ffbd0cb
Send crash pings for content crashes complete with stack traces r=bsmedberg
Comment 55 User image Phil Ringnalda (:philor) 2016-12-05 20:36:30 PST
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/93d785448a69 for what retriggering says is about a 10% chance of hitting https://treeherder.mozilla.org/logviewer.html#?job_id=40320964&repo=mozilla-inbound
Comment 56 User image Phil Ringnalda (:philor) 2016-12-05 23:21:40 PST
Also quite frequent https://treeherder.mozilla.org/logviewer.html#?job_id=40322282&repo=mozilla-inbound
Comment 57 User image Gabriele Svelto [:gsvelto] 2016-12-07 02:20:18 PST
(In reply to Phil Ringnalda (:philor) from comment #55)
> Backed out in
> https://hg.mozilla.org/integration/mozilla-inbound/rev/93d785448a69 for what
> retriggering says is about a 10% chance of hitting
> https://treeherder.mozilla.org/logviewer.html#?job_id=40320964&repo=mozilla-
> inbound

I've spent some time trying to reproduce this and I managed to get a similar issue using rr's chaos mode. I think that like for bug 1319702 this is a race that was already in the code but became more prone to triggering due to my changes in the crash reporting sequence.

(In reply to Phil Ringnalda (:philor) from comment #56)
> Also quite frequent
> https://treeherder.mozilla.org/logviewer.html#?job_id=40322282&repo=mozilla-
> inbound

This is... surprising. I couldn't analyze it yet but it most certainly comes from my changes; it's strange I've never encountered it in my testing (or maybe I just didn't notice).
Comment 58 User image Gabriele Svelto [:gsvelto] 2016-12-09 13:01:20 PST
Phil, are you sure that the issue in comment 56 was caused by my changes? I've reproduce the other issues on try (not locally yet though) but I cannot reproduce that one at all. See here for example:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a174e48c5911f0c4c19e3fc58bb47265e9873f18&selectedJob=32443110

I've re-triggered the failing test a dozen times without hitting the condition; I've also made other try-runs with only a subset of the patches but still couldn't hit it.
Comment 59 User image Phil Ringnalda (:philor) 2016-12-09 13:36:39 PST
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=c8545ffbd0cb&tochange=93d785448a69&filter-searchStr=7683e774f942b4179faf5e4c1dce1865187a22f2&group_state=expanded seems pretty conclusive, and I don't see any Win8 tests on that try run, so I triggered them for you.
Comment 60 User image Gabriele Svelto [:gsvelto] 2016-12-13 06:34:20 PST
(In reply to Phil Ringnalda (:philor) from comment #59)
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> inbound&fromchange=c8545ffbd0cb&tochange=93d785448a69&filter-
> searchStr=7683e774f942b4179faf5e4c1dce1865187a22f2&group_state=expanded
> seems pretty conclusive, and I don't see any Win8 tests on that try run, so
> I triggered them for you.

Thanks Phil, I forgot to specifically trigger the Windows 8 tests when I did the other try run. I've re-triggered bc4 a bunch of times but it never crashes and yet I haven't modified my patch: I have the feeling this was caused by an interaction with something else (that might also have been backed out?).

Either way I've fixed the other issue (test_process_error.xul) and while the test itself was racy it allowed me to discover another genuine and unrelated race in my code so I've fixed that too. Here's the new try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7529031f407d490e50c7613b1f650333cac6c914
Comment 61 User image Phil Ringnalda (:philor) 2016-12-13 07:18:25 PST
It wasn't in bc4 at that last try push's revision (browser-chrome does ridiculous balancing of the size of chunks, so any time you go from one revision to another, if a single browser-chrome test has been added or removed or enabled or disabled, every single test may run in a different chunk), it was in bc2.
Comment 62 User image Gabriele Svelto [:gsvelto] 2016-12-13 08:44:55 PST
(In reply to Phil Ringnalda (:philor) from comment #61)
> It wasn't in bc4 at that last try push's revision (browser-chrome does
> ridiculous balancing of the size of chunks, so any time you go from one
> revision to another, if a single browser-chrome test has been added or
> removed or enabled or disabled, every single test may run in a different
> chunk), it was in bc2.

OK, thanks for pointing that out. So to err on the side of caution I made a Win8-only run here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=958676fd7b824ba16f21675801d8202297828531

... and I'm going to re-trigger all the mochitests a lot to be sure everything's OK.
Comment 63 User image Gabriele Svelto [:gsvelto] 2016-12-13 14:38:57 PST
And it still crashes :(
Comment 64 User image Gabriele Svelto [:gsvelto] 2016-12-14 15:12:20 PST
I just found out what was wrong and I feel incredibly silly. I was passing an UTF-8 string to a Win32 API call that expected an UTF-16 one; for a number of reasons the compiler wasn't complaining but weird things were obviously happening. I've got a fixed patch ready and here are the try runs:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ae4601da6ca39ab7677b4382d0c11a9bf8ef22b
https://treeherder.mozilla.org/#/jobs?repo=try&revision=190f5a040b6d6f8258450e29eed563f34b7fedff
Comment 65 User image Gabriele Svelto [:gsvelto] 2016-12-16 05:00:51 PST
There were still some tests failing intermittently and I've traced them down to races where the test would cause a crash, analyze it then try to remove the minidumps and found the files locked as the minidump-analyzer was still extracting the stack traces from the files. I've got a new try run with all the fixes here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9e4d8911a5f798ce923fa09cda49888190d64d2

It's looking good for now but I want to re-trigger multiple times the mochitests to be confident that there's no more intermittents left.
Comment 66 User image Pulsebot 2016-12-16 09:59:12 PST
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f54cd3be1771
Clean up the exception handler code r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/3a3dd6e12c89
Store all crash annotations in the crash manager's database and filter out privacy-sensitive ones when sending a crash ping r=bsmedberg
https://hg.mozilla.org/integration/mozilla-inbound/rev/a848fea61a8f
Send crash pings for content crashes complete with stack traces r=bsmedberg
Comment 68 User image Abe - QA (:Abe_LV) 2017-01-26 07:56:41 PST
Verified as fixed.

Note You need to log in before you can comment on or make changes to this bug.