Add Telemetry session id to crash annotations

RESOLVED FIXED in Firefox 46

Status

()

P2
normal
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: rvitillo, Assigned: benjamin)

Tracking

(Blocks: 1 bug)

unspecified
mozilla48
Points:
2
Dependency tree / graph

Firefox Tracking Flags

(firefox40 wontfix, firefox41 affected, firefox42 affected, firefox46 fixed, firefox47 fixed, firefox48 fixed)

Details

(Whiteboard: [unifiedTelemetry][measurement:client])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment, 4 obsolete attachments)

Crash pings are hard to chain/join back with main pings as the info field is missing. Is there a reason why the info section is missing?
(Reporter)

Updated

3 years ago
Flags: needinfo?(gfritzsche)
Crash pings are separate to the "main" pings and we don't store that session information with the crashes.

I suggest we add a crash annotation with the current session id.
From talking with Roberto, that should be enough and allows us to correlate the crash pings with the specific session they belong to.
Blocks: 1120356
Flags: needinfo?(gfritzsche)
Summary: Missing info field in telemetry crash pings. → Add Telemetry session id to crash annotations
Whiteboard: [unifiedTelemetry] [uplift3]
Created attachment 8639433 [details] [diff] [review]
Add Telemetry session id to crash annotations
Attachment #8639433 - Flags: review?(benjamin)
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
(Assignee)

Comment 3

3 years ago
We try not to correlate the crash reports with the telemetry data. It's fine to store it and send it as part of the "crash" telemetry ping, but it shouldn't be sent to crash-stats as part of the metadata.
(Assignee)

Updated

3 years ago
Attachment #8639433 - Flags: review?(benjamin) → review-
Created attachment 8639819 [details] [diff] [review]
Add Telemetry session id to crash annotations

Good point. Also, we dropped the retentionDays parameter for now, so removing that here.
Attachment #8639819 - Flags: review?(benjamin)
Attachment #8639433 - Attachment is obsolete: true
Whiteboard: [unifiedTelemetry] [uplift3] → [unifiedTelemetry] [uplift4]
(Assignee)

Comment 5

3 years ago
Comment on attachment 8639819 [details] [diff] [review]
Add Telemetry session id to crash annotations

The metadata is still included in the crash report though. You're only excluding it here from the crash manager.
Attachment #8639819 - Flags: review?(benjamin) → review-
Iteration: --- → 42.3 - Aug 10
Created attachment 8641649 [details] [diff] [review]
Add Telemetry session id to crash annotations

Ok, figured out my misconceptions. This now always filters out TelemetrySessionId fields from the extra data on crash submissions. I am a bit sad that we dont have proper end-to-end testing from crash generation to submission, but that would take more time to fix than i can justify now. So for now the test coverage is a bit... distributed.
Attachment #8641649 - Flags: review?(benjamin)
Attachment #8639819 - Attachment is obsolete: true
status-firefox40: --- → wontfix
status-firefox41: --- → affected
status-firefox42: --- → affected
Whiteboard: [unifiedTelemetry] [uplift4] → [unifiedTelemetry] [uplift5]
(Assignee)

Comment 7

3 years ago
Comment on attachment 8641649 [details] [diff] [review]
Add Telemetry session id to crash annotations

This still only deals with the CrashSubmit.jsm submission path which is primarily used for plugin/content crashes. The native client submission path would still have this.

Overall I think that this approach of annotating and then removing is fragile and we should do something different. Probably add the session ID specifically only to the event file here: http://mxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#770
Attachment #8641649 - Flags: review?(benjamin) → review-
Blocks: 1122482
No longer blocks: 1120356, 1185123
Whiteboard: [unifiedTelemetry] [uplift5] → [unifiedTelemetry]
Dropping this for now over other work as it's not urgent.
Assignee: gfritzsche → nobody
Status: ASSIGNED → NEW
Priority: -- → P3

Updated

3 years ago
Points: --- → 2
Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][measurement:client]
Priority: P3 → P4
(Reporter)

Updated

3 years ago
Blocks: 1245490
(Assignee)

Updated

3 years ago
Assignee: nobody → benjamin
Status: NEW → ASSIGNED
Priority: P4 → P2
(Assignee)

Comment 9

3 years ago
Created attachment 8729014 [details] [diff] [review]
Add Telemetry session ID to crash annotations, CHECKPOINT NOT READY YET
(Assignee)

Comment 10

3 years ago
Created attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

Review commit: https://reviewboard.mozilla.org/r/39769/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/39769/
Attachment #8730231 - Flags: review?(gfritzsche)
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

https://reviewboard.mozilla.org/r/39769/#review36555

::: toolkit/components/crashes/CrashManager.jsm:546
(Diff revision 1)
>              }
>              delete reportMeta.TelemetryEnvironment;
>            }
> +          if ('TelemetrySessionId' in reportMeta) {
> +            sessionId = reportMeta.TelemetrySessionId;
> +            delete reportMeta.TelemetrySessionId;

Given your concerns about submitting the session id with crash reports; can we also assert that we are not submitting it with crash reports?

::: toolkit/components/telemetry/docs/crash-ping.rst:14
(Diff revision 1)
>  The client ID is submitted with this ping.
>  
>  Structure::
>  
>      {
>        version: 1,

Given that the addition is in the payload, we probably don't want to bump the version?

::: toolkit/components/telemetry/docs/crash-ping.rst:21
(Diff revision 1)
>        ... common ping data
>        clientId: <UUID>,
>        environment: { ... },
>        payload: {
>          crashDate: "YYYY-MM-DD",
> +        sessionId: <UUID>, // may be missing for crashes that happen early

Let's add a note below on what Firefox version this was added in.
Finding the right historic cutoffs will be a pain otherwise.

::: toolkit/components/telemetry/docs/crash-ping.rst:22
(Diff revision 1)
>        clientId: <UUID>,
>        environment: { ... },
>        payload: {
>          crashDate: "YYYY-MM-DD",
> +        sessionId: <UUID>, // may be missing for crashes that happen early
> +                           // in startup

We should also update the schema:
https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/master/telemetry/crash.schema.json

::: toolkit/crashreporter/nsExceptionHandler.cpp:912
(Diff revision 1)
> +        WriteLiteral(eventFile, "TelemetrySessionId=");
> +        WriteString(eventFile, currentSessionId);
> +        WriteLiteral(eventFile, "\n");

This can use `WriteAnnotation()`.

::: toolkit/crashreporter/nsExceptionHandler.cpp:1943
(Diff revision 1)
>      free(eventsDirectory);
>      eventsDirectory = nullptr;
>    }
>  
> +  if (currentSessionId) {
> +    free(currentSessionId);

Our documentation says to use `NS_Free()` for `ToNewCstring()`:
https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Internal_strings#Getting_a_char_*_buffer_from_a_String
I see that this matches existing code here though.

Is this an oversight or is the documentation outdated?

::: toolkit/crashreporter/nsExceptionHandler.cpp:2664
(Diff revision 1)
> +{
> +  if (!gExceptionHandler) {
> +    return;
> +  }
> +  if (currentSessionId) {
> +    free(currentSessionId);

Ditto on the `NS_Free()`.

This seems to potentially race with other uses of `currentSessionId`.
Are we sure those can't run concurrently/interleaved?

Currently we only have one call per session for `SetTelemetrySessionId()`, but this seems like a bad surprise to leave in.

::: toolkit/crashreporter/test/unit/test_crashreporter_crash.js:1
(Diff revision 1)
> +Components.utils.import("resource://gre/modules/TelemetrySession.jsm");

Unused import.

::: toolkit/crashreporter/test/unit/test_event_files.js:51
(Diff revision 1)
>    Assert.equal(crashes.length, 1);
>    let crash = crashes[0];
>    Assert.ok(crash.isOfType(cm.PROCESS_TYPE_MAIN, cm.CRASH_TYPE_CRASH));
>    Assert.equal(crash.id + ".dmp", basename, "ID recorded properly");
>    Assert.equal(crash.metadata.ShutdownProgress, "event-test");
> +  Assert.ok("TelemetrySessionId" in crash.metadata);

We could explicitly compare the session id here or at least check for a valid UUID format (to protect against junk or `null` values).

::: xpcom/system/nsICrashReporter.idl:130
(Diff revision 1)
>     * @throws NS_ERROR_NOT_INITIALIZED if crash reporting is disabled.
>     */
>    void saveMemoryReport();
> +
> +  /**
> +   * Set the telemetry session ID which is recorded in crash metadata.

Lets add a note that this is not to be submitted with crash reports and only used for crash pings etc.
Attachment #8730231 - Flags: review?(gfritzsche)
Attachment #8641649 - Attachment is obsolete: true
Attachment #8729014 - Attachment is obsolete: true
(Assignee)

Comment 13

3 years ago
https://reviewboard.mozilla.org/r/39769/#review36555

> Given that the addition is in the payload, we probably don't want to bump the version?

I don't know what bumping the version would mean in practice.

> Our documentation says to use `NS_Free()` for `ToNewCstring()`:
> https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Guide/Internal_strings#Getting_a_char_*_buffer_from_a_String
> I see that this matches existing code here though.
> 
> Is this an oversight or is the documentation outdated?

NS_Free is equivalent to free() when inside libxul. It's only an issue if you're writing external code and need the consistent heap. So we actually prefer free() for internal use now.

> Ditto on the `NS_Free()`.
> 
> This seems to potentially race with other uses of `currentSessionId`.
> Are we sure those can't run concurrently/interleaved?
> 
> Currently we only have one call per session for `SetTelemetrySessionId()`, but this seems like a bad surprise to leave in.

We don't have a safe way to protect from this kind of threading hazard. You can't lock because that would just deadlock. So we live with the slight risk.
(Assignee)

Comment 14

3 years ago
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/39769/diff/1-2/
Attachment #8730231 - Flags: review?(gfritzsche)
https://reviewboard.mozilla.org/r/39769/#review36555

> I don't know what bumping the version would mean in practice.

I think this is only interesting right now if we would make `sessionId` required.
As it is optional i think we can just leave it.

> NS_Free is equivalent to free() when inside libxul. It's only an issue if you're writing external code and need the consistent heap. So we actually prefer free() for internal use now.

Ok, updated the wiki.

> We don't have a safe way to protect from this kind of threading hazard. You can't lock because that would just deadlock. So we live with the slight risk.

What about an `atomic.exchange(nullptr)` using `Atomics.h`?
(Assignee)

Comment 16

3 years ago
That won't help. The only race here is if we crash while we are setting the session ID. And in that case, using an atomic-set or atomic exchange doesn't make a difference because the word-size pointer set is atomic on all our platforms anyway.
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

https://reviewboard.mozilla.org/r/39769/#review36689
Attachment #8730231 - Flags: review?(gfritzsche) → review+
Build bustage:
/toolkit/crashreporter/nsExceptionHandler.cpp(912) : error C2780: 'void CrashReporter::WriteAnnotation(CrashReporter::PlatformWriter &,const char (&)[N],const char *)' : expects 3 arguments - 2 provided

Sample log:
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win64/1458081472/mozilla-inbound-win64-bm74-build1-build333.txt.gz

Backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e9a5d52d88dd
Flags: needinfo?(benjamin)
(Assignee)

Updated

3 years ago
Flags: needinfo?(benjamin)
(Assignee)

Updated

3 years ago
Blocks: 1257321

Comment 21

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a5ded32ae906
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox48: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
(Assignee)

Comment 22

2 years ago
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

Approval Request Comment
[Feature/regressing bug #]: Not a regression, but a necessary part of the crash analysis for bug 1257321 to implement the new telemetry-based stability dashboards
[Describe test coverage new/current, TreeHerder]: manual and automated tests
[Risks and why]: relatively low risk. Very isolated code paths. The main risk is that we're changing nsICrashReporter, but we're doing that at the end of a vtable so it's unlikely to matter.
[String/UUID change made/needed]: UUID changes are no longer done, but this does add a method to the end of nsICrashReporter.
Attachment #8730231 - Flags: approval-mozilla-beta?
Attachment #8730231 - Flags: approval-mozilla-aurora?

Updated

2 years ago
status-firefox46: --- → affected
status-firefox47: --- → affected
Comment on attachment 8730231 [details]
MozReview Request: Bug 1187270 - Add Telemetry session ID to crash annotations, r?gfritzsche

Another improvement to the crash metadata that will help us make more informed decisions about release quality and severity of crashes. Aurora47+, Beta46+
Attachment #8730231 - Flags: approval-mozilla-beta?
Attachment #8730231 - Flags: approval-mozilla-beta+
Attachment #8730231 - Flags: approval-mozilla-aurora?
Attachment #8730231 - Flags: approval-mozilla-aurora+

Comment 24

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/fcb8fe0f0090
status-firefox47: affected → fixed

Comment 25

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/06f0e63fbfb0
status-firefox46: affected → fixed
You need to log in before you can comment on or make changes to this bug.