Closed Bug 1187270 Opened 9 years ago Closed 8 years ago

Add Telemetry session id to crash annotations

Categories

(Toolkit :: Telemetry, defect, P2)

defect
Points:
2

Tracking

()

RESOLVED FIXED
mozilla48
Iteration:
42.3 - Aug 10
Tracking Status
firefox40 --- wontfix
firefox41 --- affected
firefox42 --- affected
firefox46 --- fixed
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: rvitillo, Assigned: benjamin)

References

(Blocks 1 open bug)

Details

(Whiteboard: [unifiedTelemetry][measurement:client])

Attachments

(1 file, 4 obsolete files)

Crash pings are hard to chain/join back with main pings as the info field is missing. Is there a reason why the info section is missing?
Flags: needinfo?(gfritzsche)
Crash pings are separate to the "main" pings and we don't store that session information with the crashes.

I suggest we add a crash annotation with the current session id.
From talking with Roberto, that should be enough and allows us to correlate the crash pings with the specific session they belong to.
Blocks: 1120356
Flags: needinfo?(gfritzsche)
Summary: Missing info field in telemetry crash pings. → Add Telemetry session id to crash annotations
Whiteboard: [unifiedTelemetry] [uplift3]
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
We try not to correlate the crash reports with the telemetry data. It's fine to store it and send it as part of the "crash" telemetry ping, but it shouldn't be sent to crash-stats as part of the metadata.
Attachment #8639433 - Flags: review?(benjamin) → review-
Good point. Also, we dropped the retentionDays parameter for now, so removing that here.
Attachment #8639819 - Flags: review?(benjamin)
Attachment #8639433 - Attachment is obsolete: true
Whiteboard: [unifiedTelemetry] [uplift3] → [unifiedTelemetry] [uplift4]
Comment on attachment 8639819 [details] [diff] [review]
Add Telemetry session id to crash annotations

The metadata is still included in the crash report though. You're only excluding it here from the crash manager.
Attachment #8639819 - Flags: review?(benjamin) → review-
Iteration: --- → 42.3 - Aug 10
Ok, figured out my misconceptions. This now always filters out TelemetrySessionId fields from the extra data on crash submissions. I am a bit sad that we dont have proper end-to-end testing from crash generation to submission, but that would take more time to fix than i can justify now. So for now the test coverage is a bit... distributed.
Attachment #8641649 - Flags: review?(benjamin)
Attachment #8639819 - Attachment is obsolete: true
Whiteboard: [unifiedTelemetry] [uplift4] → [unifiedTelemetry] [uplift5]
Comment on attachment 8641649 [details] [diff] [review]
Add Telemetry session id to crash annotations

This still only deals with the CrashSubmit.jsm submission path which is primarily used for plugin/content crashes. The native client submission path would still have this.

Overall I think that this approach of annotating and then removing is fragile and we should do something different. Probably add the session ID specifically only to the event file here: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#770
Attachment #8641649 - Flags: review?(benjamin) → review-
Blocks: 1122482
No longer blocks: 1120356, 1185123
Whiteboard: [unifiedTelemetry] [uplift5] → [unifiedTelemetry]
Dropping this for now over other work as it's not urgent.
Assignee: gfritzsche → nobody
Status: ASSIGNED → NEW
Priority: -- → P3
Points: --- → 2
Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][measurement:client]
Priority: P3 → P4
Blocks: 1245490
Assignee: nobody → benjamin
Status: NEW → ASSIGNED
Priority: P4 → P2
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

https://reviewboard.mozilla.org/r/39769/#review36555

::: toolkit/components/crashes/CrashManager.jsm:546
(Diff revision 1)
>              }
>              delete reportMeta.TelemetryEnvironment;
>            }
> +          if ('TelemetrySessionId' in reportMeta) {
> +            sessionId = reportMeta.TelemetrySessionId;
> +            delete reportMeta.TelemetrySessionId;

Given your concerns about submitting the session id with crash reports; can we also assert that we are not submitting it with crash reports?

::: toolkit/components/telemetry/docs/crash-ping.rst:14
(Diff revision 1)
>  The client ID is submitted with this ping.
>  
>  Structure::
>  
>      {
>        version: 1,

Given that the addition is in the payload, we probably don't want to bump the version?

::: toolkit/components/telemetry/docs/crash-ping.rst:21
(Diff revision 1)
>        ... common ping data
>        clientId: <UUID>,
>        environment: { ... },
>        payload: {
>          crashDate: "YYYY-MM-DD",
> +        sessionId: <UUID>, // may be missing for crashes that happen early

Let's add a note below on what Firefox version this was added in.
Finding the right historic cutoffs will be a pain otherwise.

::: toolkit/components/telemetry/docs/crash-ping.rst:22
(Diff revision 1)
>        clientId: <UUID>,
>        environment: { ... },
>        payload: {
>          crashDate: "YYYY-MM-DD",
> +        sessionId: <UUID>, // may be missing for crashes that happen early
> +                           // in startup

We should also update the schema:
https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/telemetry/crash.schema.json

::: toolkit/crashreporter/nsExceptionHandler.cpp:912
(Diff revision 1)
> +        WriteLiteral(eventFile, "TelemetrySessionId=");
> +        WriteString(eventFile, currentSessionId);
> +        WriteLiteral(eventFile, "\n");

This can use `WriteAnnotation()`.

::: toolkit/crashreporter/nsExceptionHandler.cpp:1943
(Diff revision 1)
>      free(eventsDirectory);
>      eventsDirectory = nullptr;
>    }
>  
> +  if (currentSessionId) {
> +    free(currentSessionId);

Our documentation says to use `NS_Free()` for `ToNewCstring()`:
https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Internal_strings#Getting_a_char_*_buffer_from_a_String
I see that this matches existing code here though.

Is this an oversight or is the documentation outdated?

::: toolkit/crashreporter/nsExceptionHandler.cpp:2664
(Diff revision 1)
> +{
> +  if (!gExceptionHandler) {
> +    return;
> +  }
> +  if (currentSessionId) {
> +    free(currentSessionId);

Ditto on the `NS_Free()`.

This seems to potentially race with other uses of `currentSessionId`.
Are we sure those can't run concurrently/interleaved?

Currently we only have one call per session for `SetTelemetrySessionId()`, but this seems like a bad surprise to leave in.

::: toolkit/crashreporter/test/unit/test_crashreporter_crash.js:1
(Diff revision 1)
> +Components.utils.import("resource://gre/modules/TelemetrySession.jsm");

Unused import.

::: toolkit/crashreporter/test/unit/test_event_files.js:51
(Diff revision 1)
>    Assert.equal(crashes.length, 1);
>    let crash = crashes[0];
>    Assert.ok(crash.isOfType(cm.PROCESS_TYPE_MAIN, cm.CRASH_TYPE_CRASH));
>    Assert.equal(crash.id + ".dmp", basename, "ID recorded properly");
>    Assert.equal(crash.metadata.ShutdownProgress, "event-test");
> +  Assert.ok("TelemetrySessionId" in crash.metadata);

We could explicitly compare the session id here or at least check for a valid UUID format (to protect against junk or `null` values).

::: xpcom/system/nsICrashReporter.idl:130
(Diff revision 1)
>     * @throws NS_ERROR_NOT_INITIALIZED if crash reporting is disabled.
>     */
>    void saveMemoryReport();
> +
> +  /**
> +   * Set the telemetry session ID which is recorded in crash metadata.

Lets add a note that this is not to be submitted with crash reports and only used for crash pings etc.
Attachment #8730231 - Flags: review?(gfritzsche)
Attachment #8641649 - Attachment is obsolete: true
Attachment #8729014 - Attachment is obsolete: true
https://reviewboard.mozilla.org/r/39769/#review36555

> Given that the addition is in the payload, we probably don't want to bump the version?

I don't know what bumping the version would mean in practice.

> Our documentation says to use `NS_Free()` for `ToNewCstring()`:
> https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Internal_strings#Getting_a_char_*_buffer_from_a_String
> I see that this matches existing code here though.
> 
> Is this an oversight or is the documentation outdated?

NS_Free is equivalent to free() when inside libxul. It's only an issue if you're writing external code and need the consistent heap. So we actually prefer free() for internal use now.

> Ditto on the `NS_Free()`.
> 
> This seems to potentially race with other uses of `currentSessionId`.
> Are we sure those can't run concurrently/interleaved?
> 
> Currently we only have one call per session for `SetTelemetrySessionId()`, but this seems like a bad surprise to leave in.

We don't have a safe way to protect from this kind of threading hazard. You can't lock because that would just deadlock. So we live with the slight risk.
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/39769/diff/1-2/
Attachment #8730231 - Flags: review?(gfritzsche)
https://reviewboard.mozilla.org/r/39769/#review36555

> I don't know what bumping the version would mean in practice.

I think this is only interesting right now if we would make `sessionId` required.
As it is optional i think we can just leave it.

> NS_Free is equivalent to free() when inside libxul. It's only an issue if you're writing external code and need the consistent heap. So we actually prefer free() for internal use now.

Ok, updated the wiki.

> We don't have a safe way to protect from this kind of threading hazard. You can't lock because that would just deadlock. So we live with the slight risk.

What about an `atomic.exchange(nullptr)` using `Atomics.h`?
That won't help. The only race here is if we crash while we are setting the session ID. And in that case, using an atomic-set or atomic exchange doesn't make a difference because the word-size pointer set is atomic on all our platforms anyway.
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

https://reviewboard.mozilla.org/r/39769/#review36689
Attachment #8730231 - Flags: review?(gfritzsche) → review+
Build bustage:
/toolkit/crashreporter/nsExceptionHandler.cpp(912) : error C2780: 'void CrashReporter::WriteAnnotation(CrashReporter::PlatformWriter &,const char (&)[N],const char *)' : expects 3 arguments - 2 provided

Sample log:
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win64/1458081472/mozilla-inbound-win64-bm74-build1-build333.txt.gz

Backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e9a5d52d88dd
Flags: needinfo?(benjamin)
Flags: needinfo?(benjamin)
Blocks: 1257321
https://hg.mozilla.org/mozilla-central/rev/a5ded32ae906
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

Approval Request Comment
[Feature/regressing bug #]: Not a regression, but a necessary part of the crash analysis for bug 1257321 to implement the new telemetry-based stability dashboards
[Describe test coverage new/current, TreeHerder]: manual and automated tests
[Risks and why]: relatively low risk. Very isolated code paths. The main risk is that we're changing nsICrashReporter, but we're doing that at the end of a vtable so it's unlikely to matter.
[String/UUID change made/needed]: UUID changes are no longer done, but this does add a method to the end of nsICrashReporter.
Attachment #8730231 - Flags: approval-mozilla-beta?
Attachment #8730231 - Flags: approval-mozilla-aurora?
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

Another improvement to the crash metadata that will help us make more informed decisions about release quality and severity of crashes. Aurora47+, Beta46+
Attachment #8730231 - Flags: approval-mozilla-beta?
Attachment #8730231 - Flags: approval-mozilla-beta+
Attachment #8730231 - Flags: approval-mozilla-aurora?
Attachment #8730231 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.