Closed Bug 1024672 Opened 10 years ago Closed 8 years ago

Firefox should fetch the support classification for each crash after submission

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 36

People

(Reporter: benjamin, Assigned: poiru)

References

Details

Attachments

(7 files, 5 obsolete files)

9.06 KB, patch
gps
: review+
Details | Diff | Splinter Review
10.15 KB, patch
poiru
: review+
Details | Diff | Splinter Review
19.59 KB, patch
gps
: review+
Details | Diff | Splinter Review
5.04 KB, patch
gps
: review+
Details | Diff | Splinter Review
1.07 KB, patch
gps
: review+
Details | Diff | Splinter Review
5.23 KB, patch
gps
: review+
Details | Diff | Splinter Review
5.96 KB, patch
benjamin
: review+
Details | Diff | Splinter Review
Currently when we submit a crash, the client only knows the server crash ID but doesn't have any more information about the crash signature etc. About 30 minutes after a crash, we want Firefox to go back to the server and ask it for some basic details of the crash: the crash signature and the "support classification". This will be saved off for future use by the support API.

Lonnen, I know we've talked about this a lot, but did we ever decide what the API endpoint was going to be? Given that it will be a high-volume API, I wonder if we should separate it from the normal API. Happy to file a separate server bug for that if appropriate.
Flags: needinfo?(chris.lonnen)
Summary: Firefox should fetch the "support reason" for each crash after submission → Firefox should fetch the support classification for each crash after submission
Blocks: 1024677
We can use this as a tracker and get it into q3 goals to set up our end.

I suspect the v1 looks like this --

A 3 column table of UUID, Time Processed + support reason(s), with a single API endpoint. Processors will write to that table. After a short TTL we'll have a cron clean them up.

For performance we could also split this out as an independent service with it's own database, or swap the database with s3.
Flags: needinfo?(chris.lonnen)
(In reply to Chris Lonnen :lonnen from comment #1)
> A 3 column table of UUID, Time Processed + support reason(s), with a single
> API endpoint. Processors will write to that table. After a short TTL we'll
> have a cron clean them up.

"short TTL" might not be long enough. If the Firefox client is to fetch this, the client could disappear for any amount of time - potentially weeks at a time. I would design the TTL no shorter than 72 hours (to account for somebody disappearing over the weekend). Generally speaking, the longer you can retain the data, the better. Looking at my about:crashes, I'm able to pull up crashes from last July on crash-stats.mozilla.com. Why do we need to expire these mappings so quickly?
In earlier discussion with Benjamin we set it at 14 days, but that's easy to configure. It's short relative to how long we retain other crash data. We want to expire them quickly just to keep the storage requirements low.

In the future we may implement an advanced delete pattern, or maybe even process on demand. Once we get some usage data that will be easier to figure out.
FWIW, Socorro bug 915667 was filed for the API for this functionality.
This bug is about the client implementation. We should either use bug 915667 or file a new bug to track the server work. (14 days will be sufficient: we don't need 100% coverage).
I was thinking of doing something like this:

We add two new fields to the crash metadata (stored in CrashStore._data.crashes): submissionDate and classification. When CrashManager.addSubmission is called, we set submissionDate to the current time. Every now and then, we check if we have unclassified crashes and attempt to retrieve the classification if sufficient time has passed.

To find unclassified crashes, we could either iterate over all crashes and find crashes without a classification or we could add e.g. CrashStore._data.unclassifiedCrashIDs, which would be an array of crash IDs pending classification. Which method you prefer?

Does this sound reasonable?
Assignee: nobody → birunthan
Status: NEW → ASSIGNED
Flags: needinfo?(gps)
OS: Linux → All
Hardware: x86_64 → All
(In reply to Birunthan Mohanathas [:poiru] from comment #6)
> I was thinking of doing something like this:
> 
> We add two new fields to the crash metadata (stored in
> CrashStore._data.crashes): submissionDate and classification. When
> CrashManager.addSubmission is called, we set submissionDate to the current
> time. Every now and then, we check if we have unclassified crashes and
> attempt to retrieve the classification if sufficient time has passed.

+1. Please do this work as separate patches since the work is logically separate.

I think the "now and then" check should be part of the to-be-implemented self-support XPCOM background service.

> To find unclassified crashes, we could either iterate over all crashes and
> find crashes without a classification or we could add e.g.
> CrashStore._data.unclassifiedCrashIDs, which would be an array of crash IDs
> pending classification. Which method you prefer?

I almost always favor APIs that hide complexity. This keeps the lower-level logical abstracted away from callers. This reduces the surface area for code duplication and bugs. Let's expose CrashManager.getUnclassifiedCrashes() that returns an array of CrashRecord instances. This is better than a simple Set of IDs because it doesn't require callers to subsequently re-fetch data from the CrashManager (assembling the set of IDs from a list of CrashRecord is a one-liner).
Flags: needinfo?(gps)
Associating the crash-reporter bug for the service that will expose these reasons with the larger dependency chain.
Depends on: 915667
Right now, each submission is a separate record in the store (see bug 994707 comment 19 and 20). For example, a store could contain the following records:
  main-crash {id: 1, ...}
  plugin-crash {id: 2, ...}
  plugin-crash-submission-succeeded {id: 2, ...}

This is not quite ideal for what we want to do in this bug (for example, should the classification be attached to the "plugin-crash" record or the "plugin-crash-submission-succeeded" record?). It would be much easier to work with something like this:
  main-crash {id: 1, ...}
  plugin-crash {id: 2, submissionDate: x, submissionResult: true,
                classification: "foo", ...}

Or, if we want to record multiple submissions:
  plugin-crash {id: 2, submissions: [{date: x, result: false},
                                     {date: y, result: true}],
                classification: "foo", ...}

I think it would be easier to fix this sooner rather than later. gps, do you think we should fix this first?
Flags: needinfo?(gps)
If you look at the old patches in bug 875562, you'll see that my original implementation for submissions had support for tracking multiple submissions and that submissions were metadata against the crash itself. Essentially, there was an array of "submission events" for each crash. Each "submission event" featured a date and result type. This allowed us to easily track the submission success of each individual crash. It also allowed us to see how often crash submissions were failing. In other words, it was designed for the use case in this bug.

That feature was cut from the initial landing of CrashManager.jsm because the functionality wasn't needed. I am a bit disappointed to see that my original work appears to have been thrown away and a less robust solution for tracking submissions has since landed. Although, this was somewhat acknowledged in bug 994707 comment 19, so it's not all bad.

It appears the sub-optimal code has recently landed, so the cat is only half out of the bag. My vote would be to reimplement submission storage more robustly (similarly to how you have proposed it - which is nearly identical to how I did it originally) and to write some code to convert the *-crash-submission-* records into the new format. Fortunately the public APIs for adding submissions are mostly sane, so it should be a mostly internal refactor of CrashManager.jsm, tests, and possibly FHR.

As for which submission tracking method, I always liked tracking each submission attempt separately because it gives us insight into e.g. failed submission attempts instead of merely last result. We like data that shows us how many times it takes things to happen because this data uncovers bugs and other weirdness.
Flags: needinfo?(gps)
Before I do further work, does what I have so far look reasonable?

In bug 875562 attachment 8344322 [details] [diff] [review], the submissions are stored like so:

  crash.submissions: {
    "id": {
      attemptDate: x,
      successDate: y
    },
  }

With this patch, it looks like:

  crash.submissions: {
    "id": {
      date: y,
      result: true
    },
  }

I suspect the approach you took in attachment 8344322 [details] [diff] [review] might be to keep `attemptDate` around if the submission first fails and later succeeds. However, in that case, the submission ID should arguably change. Also, if submission fails multiple times, we would only be recording the last failure date.

If the submission ID does not change, perhaps we should just store an array of submissions without IDs? For example:

  crash.submissions: [
    {
      date: x,
      result: false
    },
    {
      date: y,
      result: true
    },
  ]

Also, note that CrashManager.addSubmission still uses the old code path. I'll attach a subsequent patch to switch over things to the new stuff after we are happy with this bit.
Attachment #8471135 - Flags: feedback?(gps)
Comment on attachment 8471135 [details] [diff] [review]
Part 1: Allow submission details to be stored in crash metadata

Review of attachment 8471135 [details] [diff] [review]:
-----------------------------------------------------------------

The "submission ID" was intended as a purely client-side entity to correlate the "submission started" and "submission finished" events that would presumably be produced and written to a "crash log" event file. These would be randomly generated (probably UUIDs) and have no relationship to crash IDs (also UUIDs in practice). The thinking was we needed multiple events (and thus submission IDs) because when submitting from e.g. the crash reporter, the system could be in a funky state and could very well crash when doing the HTTP request, never leading to a "finished" event and thus never leading to us recording the submission attempt. That's an important signal!

The intent of tracking multiple submissions was to allow us to see how effective the crash upload process is - whether we see common failures, etc. The intent was only to ever have 1 final "remote crash ID," even if multiple submissions were performed.

I think the structure should be something like:

  .submissions: {
    "id1": {
      requestDate: x,
      responseDate: null,
      result: "failed-http-500",
    },
    "id2": {
      requestDate: y,
      responseDate: z,
      result: "ok",
    }
  }

Here, "result" would be some kind of string enumeration. The values could probably be "ok" and "failed" initially. (We can always add more values depending on metrics needs.)

I also included recording of request and response times so we can calculate upload times. I think it would be useful for us to identify "slow uploading" and "fast uploading" clients so we could potentially be kinder to the "slow uploading" crowd. This is completely a non-requirement, but easy enough to add.
Attachment #8471135 - Flags: feedback?(gps) → feedback+
(In reply to Gregory Szorc [:gps] from comment #12)
> Comment on attachment 8471135 [details] [diff] [review]
> Part 1: Allow submission details to be stored in crash metadata
> 
> Review of attachment 8471135 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think the structure should be something like:
> 
>   .submissions: {
>     "id1": {
>       requestDate: x,
>       responseDate: null,
>       result: "failed-http-500",
>     },
>     "id2": {
>       requestDate: y,
>       responseDate: z,
>       result: "ok",
>     }
>   }

Done.

Also, thanks for the detailed replies!
Attachment #8471135 - Attachment is obsolete: true
Attachment #8473148 - Flags: review?(gps)
Comment on attachment 8473148 [details] [diff] [review]
Part 1: Allow submission details to be stored in crash metadata

Review of attachment 8473148 [details] [diff] [review]:
-----------------------------------------------------------------

This is good, but not quite ready for r+.

The biggest issue is that of API design. You've designed the "addSubmission" API such that last write wins for every attribute. This is great... if you are recording submissions after they took place and submission always finishes. Here's the thing. Logging systems that hold on to a "start" event and don't write until the "finish" event are prone to losing data because the "finish" event that triggers the write never occurs.

Say you are submitting a crash from within Firefox. Before you get a response from the server, Firefox crashes. You lost any record of the submission attempt. Or, say the client is on a slow connection or the crash server is being really slow. It takes >1 minute to submit a crash. While that's happening, the user quits Firefox. The submission attempt is not recorded.

I would change the API to separate submission attempts from submission results. e.g. addSubmissionAttempt(date), addSubmissionResult(date, result). This reinforces the idea that submission recording is an event-based logging operation and that consumers should call it as soon as those events occur. This API is simple for callers (they don't need to hold state in variables during submission) and helps us not lose data.

::: toolkit/components/crashes/tests/xpcshell/test_crash_store.js
@@ +96,1 @@
>    );

This convention is silly. It masks which expression caused a test failure. Please split this into 3 Assert.ok() calls/lines/statements.

@@ +532,5 @@
> +
> +  let submission = submissions.get("sub1");
> +  Assert.ok(!!submission);
> +  Assert.equal(submission.requestDate, DUMMY_DATE);
> +  Assert.equal(submission.responseDate, DUMMY_DATE_2);

This test is passing by accident because you are comparing the same Date instances. Construct a new Date instance representing the same value and this test will fail. This also exposes a hole in your testing: testing that serialization and deserialization are working.

More tests, please!
Attachment #8473148 - Flags: review?(gps) → feedback+
(In reply to Gregory Szorc [:gps] from comment #14)
> Comment on attachment 8473148 [details] [diff] [review]
> Part 1: Allow submission details to be stored in crash metadata
> 
> Review of attachment 8473148 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I would change the API to separate submission attempts from submission
> results. e.g. addSubmissionAttempt(date), addSubmissionResult(date, result).
> This reinforces the idea that submission recording is an event-based logging
> operation and that consumers should call it as soon as those events occur.
> This API is simple for callers (they don't need to hold state in variables
> during submission) and helps us not lose data.

Makes sense, changed.

> This also exposes a hole in your testing: testing that serialization and
> deserialization are working.

The additions to test_save_load were for exactly that. Would you prefer a separate test?
Attachment #8473148 - Attachment is obsolete: true
Attachment #8473385 - Flags: review?(gps)
Gah, missed a change in the previous patch.
Attachment #8473385 - Attachment is obsolete: true
Attachment #8473385 - Flags: review?(gps)
Attachment #8473391 - Flags: review?(gps)
Comment on attachment 8473391 [details] [diff] [review]
Part 1: Allow submission details to be stored in crash metadata

Review of attachment 8473391 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/crashes/CrashManager.jsm
@@ +1053,5 @@
> +   * @return boolean True if the attempt was recorded.
> +   */
> +  addSubmissionAttempt: function (crashID, submissionID, date) {
> +    return !!this._ensureSubmissionRecord(crashID, submissionID, date,
> +                                          null, null);

I'm not sure we'll ever hit the code path, but if we call this API twice with the same parameters, we'll potentially lose data.

I would change _ensureSubmissionRecord to set null values for all the properties and leave it up to this function and addSubmissionResponse to set what it knows. (addSubmissionResponse already does it this way.)

I don't think this is a huge deal. Your choice whether to land as-is or follow-up.

::: toolkit/components/crashes/tests/xpcshell/test_crash_store.js
@@ +106,5 @@
> +
> +  Assert.ok(!!c.submissions);
> +  let submission = c.submissions.get("sub1");
> +  Assert.ok(!!submission);
> +  Assert.equal(submission.responseDate.getTime(), d2.getTime());

Please check *all* the properties.
Attachment #8473391 - Flags: review?(gps) → review+
This does not yet remove the old functions. I'll remove the old code in a subsequent patch.
Attachment #8473954 - Flags: review?(gps)
(In reply to Gregory Szorc [:gps] from comment #17)
> I would change _ensureSubmissionRecord to set null values for all the
> properties and leave it up to this function and addSubmissionResponse to set
> what it knows. (addSubmissionResponse already does it this way.)

Done.
Attachment #8473391 - Attachment is obsolete: true
Attachment #8473955 - Flags: review+
Comment on attachment 8473954 [details] [diff] [review]
Part 2: Convert submission details stored in separate crash records to crash metadata

Review of attachment 8473954 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/crashes/CrashManager.jsm
@@ +417,5 @@
> +   *
> +   * @return boolean True if the attempt was recorded and false if not.
> +   */
> +  addSubmissionAttempt: function (crashID, submissionID, date) {
> +    return Task.spawn(function* () {

The new hotness for doing this pattern is:

addSubmissionAttempt: Task.async(function* (crashID, submissionID, date) {
  ...
  yield ...
  ...
}),

@@ +739,5 @@
> +        // with "-submission", we will need to convert the data to be stored
> +        // as actual submissions.
> +        //
> +        // TODO: The old way of storing submissions was used from FF33 to
> +        //       FF34. We should drop the conversion code after a few releases.

You should file the bug now and reference it.
Attachment #8473954 - Flags: review?(gps) → review+
Comment on attachment 8474002 [details] [diff] [review]
Part 3: Remove CrashManager.addSubmission in favor of CrashManager.addSubmission{Attempt,Result}

Review of attachment 8474002 [details] [diff] [review]:
-----------------------------------------------------------------

The patch before this inadvertently stopped FHR from reporting crash submission data. You'll need to fix that before these patches are pushed. A separate patch is fine.

::: browser/base/content/browser-plugins.js
@@ +503,5 @@
>        if (this.getPluginUI(plugin, "submitURLOptIn").checked)
>          keyVals.PluginContentURL = plugin.ownerDocument.URL;
>      }
>  
> +    this.CrashSubmit.submit(pluginDumpID, { recordSubmission: true,

Programming lesson: note how a well thought out data layout and API can lead to less code and complexity. Always design things with UX in mind.
Attachment #8474002 - Flags: review?(gps) → review+
(In reply to Gregory Szorc [:gps] from comment #22)
> The patch before this inadvertently stopped FHR from reporting crash
> submission data. You'll need to fix that before these patches are pushed. A
> separate patch is fine.

Fixed by storing the submission counts in _countsByDay like the current logic does. We should probably devise a better format for reporting submission data, but perhaps that is best left for a follow-up. I will file a bug for it if you agree.
Attachment #8474595 - Flags: review?(gps)
Comment on attachment 8474595 [details] [diff] [review]
Part 4: Store submission counts in order to keep sending data to FHR

Review of attachment 8474595 [details] [diff] [review]:
-----------------------------------------------------------------

I'd prefer we solve the FHR data problem now rather than carry the bad data representation any farther.

I would prefer you change the FHR measurement to draw a clear line between crashes and submissions. Extracting meaning from strings is a very non-robust approach and AFAICT was only applied because it was easy.

If you want to take a shortcut, change the crashes FHR provider to build the old field names from the submission data in CrashManager. This preserves the data format but leaves CrashManager.jsm "clean."
Attachment #8474595 - Flags: review?(gps) → review-
(In reply to Gregory Szorc [:gps] from comment #24)
> I would prefer you change the FHR measurement to draw a clear line between
> crashes and submissions. Extracting meaning from strings is a very
> non-robust approach and AFAICT was only applied because it was easy.
> 
> If you want to take a shortcut, change the crashes FHR provider to build the
> old field names from the submission data in CrashManager. This preserves the
> data format but leaves CrashManager.jsm "clean."

I completely agree, but I went with the shortcut route for now in order to move forward with this bug. I'll file a new bug for the proper solution and work on it in parallel.
Attachment #8474595 - Attachment is obsolete: true
Attachment #8475933 - Flags: review?(gps)
Blocks: 1056157
Comment on attachment 8475933 [details] [diff] [review]
Part 4: Add workaround to continue providing submission data using DailyCrashesMeasurement

Review of attachment 8475933 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/healthreport/providers.jsm
@@ +1148,5 @@
> +    // TODO: CrashManager no longer stores submissions as crashes, but we still
> +    // want to send the submission data to FHR. As a temporary workaround, we
> +    // populate |crashCounts| with the submission data to match past behaviour.
> +    // The proper solution would be to devise a separate, sane measurement
> +    // scheme for submissions (see bug N).

Please fill in "bug N"
Attachment #8475933 - Flags: review?(gps) → review+
Attachment #8477501 - Flags: review?(gps)
Attachment #8477501 - Flags: review?(gps) → review+
Comment on attachment 8478727 [details] [diff] [review]
Part 5: Allow classifications to be stored in crash metadata

Review of attachment 8478727 [details] [diff] [review]:
-----------------------------------------------------------------

Code looks fine.

Personally, I find it weird that we can only associate a single classification with crashes. Classification usually involves "tagging" and tags imply sets or lists of things.

But in the absence of an "upstream" requirement to have multiple classifications, YAGNI applies and we don't need to over-engineer this. It should be easy enough to migrate later, if needed.
Attachment #8478727 - Flags: review?(gps) → review+
Depends on: 1059390
Blocks: 1041327
This makes us store the classifications as an array of strings in order to match the classification API[0].

[0]: https://github.com/mozilla/socorro/blob/master/socorro/external/ceph/crashstorage.py#L489
Attachment #8498824 - Flags: review?(benjamin)
Attachment #8498824 - Flags: review?(benjamin) → review+
Assignee: birunthan → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(birunthan)
Looks like everything has landed here? Can this be closed?
(In reply to Georg Fritzsche [:gfritzsche] from comment #39)
> Looks like everything has landed here? Can this be closed?

The fetch part has not been implemented. I should have a WIP/POC patch for that somewhere. I'll polish it a bit and submit for review when I have some extra time. (Sorry for the huge delay.)
Assignee: nobody → birunthan
Flags: needinfo?(birunthan)
Gah, I didn't mean to assign myself.
Assignee: birunthan → nobody
There were multiple patches landed already - for tracking landed changes/regressions/... against releases we are probably better off to move the remaining work to a separate bug.
(In reply to Georg Fritzsche [:gfritzsche] from comment #42)
> There were multiple patches landed already - for tracking landed
> changes/regressions/... against releases we are probably better off to move
> the remaining work to a separate bug.

OK, closing this.
Assignee: nobody → birunthan
Status: NEW → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → Firefox 36
Blocks: 1225910
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.