Closed Bug 1452745 Opened 2 years ago Closed 2 years ago

Crash pings frequently seem to not have the displayVersion set correctly

Categories

(Toolkit :: Crash Reporting, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla61
Tracking Status
firefox61 --- fixed

People

(Reporter: wlach, Assigned: wlach)

References

Details

Attachments

(2 files)

:sylvestre noticed a few days ago that the mac beta main crash rate looked rather low:

https://data-missioncontrol.dev.mozaws.net/#/beta/mac

i.e. over the entire 60 beta cycle so far, we have only recorded 263 crashes

Looking more closely on stmo, it appears that the majority of the recorded crashes on Mac had a display version of "60.0" and *no* usage hours:

https://sql.telemetry.mozilla.org/queries/52441/source

Initially I had thought that we were not setting the displayVersion field correctly, based on this code:

https://searchfox.org/mozilla-central/rev/7ccb618f45a1398e31a086a009f87c8fd3a790b6/toolkit/crashreporter/client/ping.cpp#210
https://searchfox.org/mozilla-central/rev/7ccb618f45a1398e31a086a009f87c8fd3a790b6/toolkit/crashreporter/client/ping.cpp#323

However, on further investigation it appears as if most crash pings *do* have the field as expected:

https://gist.github.com/wlach/b26b59bd71ece458eeaf886354548863

... so there's evidently also some kind of problem in telemetry-streaming. I will file a bug for that, but meanwhile, :chutten, do you have any idea why we might not be setting the displayVersion field in crash pings correctly so much of the time? I'm super confused.
Flags: needinfo?(chutten)
In https://gist.github.com/wlach/b26b59bd71ece458eeaf886354548863 you're not limiting to main crashes. So it could be that the pings echo exactly what you're seeing in error_aggregates.

displayVersion comes from application.version which comes from the Version key in the .extra file.

Near as I can tell, that's all in cross-platform code[1] which shouldn't have any mac-specific behaviour.

Originally I wondered if I could blame this on release candidates, where the display version becomes that of a release version... but I think it's too early for that (we're still at 60.0 beta 10)

Now I wonder if we could blame it on pings not being built by pingsender. Pings built and sent by CrashManager have their application section built by TelemetryController which gets its displayVersion from MOZ_APP_VERSION_DISPLAY instead of from AppData (I can't tell if they have to be the same or not).

We could check this by splitting your gist's analysis based on whether or not there was an "meta/X-PingSender-Version" field in the ping or not. If all the '60.0's came from not-pingsender-sent pings, then we may have a lead.

[1]: https://searchfox.org/mozilla-central/rev/7ccb618f45a1398e31a086a009f87c8fd3a790b6/toolkit/xre/nsAppRunner.cpp#3465
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #1)
> In https://gist.github.com/wlach/b26b59bd71ece458eeaf886354548863 you're not
> limiting to main crashes. So it could be that the pings echo exactly what
> you're seeing in error_aggregates.

Ah good catch.

> We could check this by splitting your gist's analysis based on whether or
> not there was an "meta/X-PingSender-Version" field in the ping or not. If
> all the '60.0's came from not-pingsender-sent pings, then we may have a lead.
> 
> [1]:
> https://searchfox.org/mozilla-central/rev/
> 7ccb618f45a1398e31a086a009f87c8fd3a790b6/toolkit/xre/nsAppRunner.cpp#3465

It looks like it's the opposite of what you thought: pingsender pings *never* have the correct display version (which is actually what I'd expect given the source code I saw in the first comment):

https://gist.github.com/wlach/b26b59bd71ece458eeaf886354548863

Interestingly on Linux it looks like we have a large quantity of non-pingsender crash pings with the beta-less version (60.0) but I'm willing to chalk that up to that being a non-official version (note how few pingsender pings there are on that platform).

Can you think of a good person to assign this to?
Flags: needinfo?(chutten)
This does not appear to be a problem on nightly because the version is specified as the same as the display version:

https://sql.telemetry.mozilla.org/queries/52459/source
(In reply to William Lachance (:wlach) (use needinfo!) from comment #2)

> Can you think of a good person to assign this to?

On reflection this seems simultaneously more urgent and less hard to do than I thought, so I think I'm going to take a stab at fixing this myself.
Assignee: nobody → wlachance
Flags: needinfo?(chutten)
Gabriele Svelto's a good resource for pingsender stuff.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #7)
> Created attachment 8966707 [details]
> Bug 1452745 - Annotate crash pings with actual display and platform versions
> 
> Before we were falling back to using the raw version, which isn't
> correct on at least beta (i.e. we would get `60.0` instead of `60.0b1`).
> 
> Review commit: https://reviewboard.mozilla.org/r/235414/diff/#index_header
> See other reviews: https://reviewboard.mozilla.org/r/235414/

So I'm pretty confident this works, but am uncertain how to test this (it compiles fine). :gsvelto, do you have any ideas? (I've already flagged you for review)
(In reply to William Lachance (:wlach) (use needinfo!) from comment #8)
> So I'm pretty confident this works, but am uncertain how to test this (it
> compiles fine). :gsvelto, do you have any ideas? (I've already flagged you
> for review)

You can test this manually with the following procedure:

- Get the dummy telemetry server from https://github.com/mozilla/gzipServer
- Launch it on your machine (note, it requires Python 2.7 IIRC):

python gzipServer.py --port 8080

- Launch your nightly build and set the following three preferences, you'll have to create the last one:

toolkit.telemetry.enabled true
toolkit.telemetry.server "http://localhost:8080"
toolkit.telemetry.send.overrideOfficialCheck true

- Close your nightly build and launch it again, if you're using mach you'll need to enable the crash reporter explicitly:

./mach run --enable-crash-reporter

- Find the PID of the main process and kill it with:

kill -ABRT <PID of the main process>

- This will crash Firefox and launch the crashreporter client which in turn will assemble the ping and send it. The gzipServer.py script should save it to disk so you can inspect it.

Note that it is technically possible to write a unit test covering this using the functionality we recently implemented in bug 379290. However we haven't written any yet and it might be tricky to get everything right.
Just finished testing with a custom displayVersion. It works! Thanks for the instructions.

(In reply to Gabriele Svelto [:gsvelto] from comment #9)
> (In reply to William Lachance (:wlach) (use needinfo!) from comment #8)


> You can test this manually with the following procedure:
> 
> - Get the dummy telemetry server from https://github.com/mozilla/gzipServer
> - Launch it on your machine (note, it requires Python 2.7 IIRC):
> 
> python gzipServer.py --port 8080
> 
> - Launch your nightly build and set the following three preferences, you'll
> have to create the last one:
> 
> toolkit.telemetry.enabled true

This part appears to be out of date. Checking with :chutten, you now need to add `MOZ_TELEMETRY_REPORTING=1` to your .mozconfig.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #11)
> Just finished testing with a custom displayVersion. It works! Thanks for the
> instructions.

YW

> > - Launch your nightly build and set the following three preferences, you'll
> > have to create the last one:
> > 
> > toolkit.telemetry.enabled true
> 
> This part appears to be out of date. Checking with :chutten, you now need to
> add `MOZ_TELEMETRY_REPORTING=1` to your .mozconfig.

Thanks! I hadn't done it in a while and I didn't notice that the preference is locked now. Thanks for the heads up.
Comment on attachment 8966707 [details]
Bug 1452745 - Annotate crash pings with actual display and platform versions

https://reviewboard.mozilla.org/r/235414/#review241260

Looking good!
Attachment #8966707 - Flags: review?(gsvelto) → review+
I filed bug 1453238 for the Fennec fix.
Pushed by wlachance@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/706454cc63c4
Annotate crash pings with actual display and platform versions r=gsvelto
Comment on attachment 8966707 [details]
Bug 1452745 - Annotate crash pings with actual display and platform versions

https://reviewboard.mozilla.org/r/235414/#review241390

::: toolkit/components/telemetry/TelemetryEnvironment.jsm:1347
(Diff revision 1)
>        applicationName: Services.appinfo.name || null,
>        architecture: Services.sysinfo.get("arch"),
>        buildId: Services.appinfo.appBuildID || null,
>        version: Services.appinfo.version || null,
>        vendor: Services.appinfo.vendor || null,
> +      displayVersion: AppConstants.MOZ_APP_VERSION_DISPLAY || null,

Adding fields to the Environment is subject to Data Collection Review.
Attached file data review request
Data review request for chutten to approve the new field in the telemetry environment.
Attachment #8966988 - Flags: review?(chutten)
DATA COLLECTION REVIEW RESPONSE:

    Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes. The Telemetry Environment documentation will be updated to include this field.

    Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, standard Telemetry checkbox.

    If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes and no. As part of the environment it will be under some decent levels of scrutiny by many, but there is no facility (or need) for monitoring it. It is a duplication of the field of the same name in the application section of the common ping format.

    Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1.

    Is the data collection request for default-on or default-off?

Default on.

    Does the instrumentation include the addition of any new identifiers?

No. It is a duplicate field.

    Is the data collection covered by the existing Firefox privacy notice? 

Yes.

    Does there need to be a check-in in the future to determine whether to renew the data?

No. Though I have filed bug 1453344 to hopefully remove the duplicate field in future.

---
result: datareview+
Attachment #8966988 - Flags: review?(chutten) → review+
https://hg.mozilla.org/mozilla-central/rev/706454cc63c4
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Heya! Sorry, I just stumbled upon this :) William, any chance you could also file a PR against https://github.com/mozilla/probe-dictionary/blob/master/environment.json to add the |displayVersion| field there as well?
Flags: needinfo?(wlachance)
(In reply to Alessio Placitelli [:Dexter] from comment #20)
> Heya! Sorry, I just stumbled upon this :) William, any chance you could also
> file a PR against
> https://github.com/mozilla/probe-dictionary/blob/master/environment.json to
> add the |displayVersion| field there as well?

I can do this, but I don't think I'll have time before I return from PTO in a week. Leaving NI open to keep this on my radar.
Does this need to be uplifted to Beta60? This is so we can use MC dashboard beta crash data reliably.
Flags: needinfo?(jcristau)
(In reply to Ritu Kothari (:ritu) from comment #22)
> Does this need to be uplifted to Beta60? This is so we can use MC dashboard
> beta crash data reliably.

I thought of this but didn't think it was worth the churn at this point. On the one hand, the patch is simple so it's low risk. On the other, since most of the beta cycle has gone by without this information, so in practice we can't really rely on 60's overall beta numbers as any kind of baseline within mission control even if we do uplift the patch now. I think having info on the end of the beta cycle might have *some* value, but it would be extremely limited.

If you want to uplift you can, but my vote would be to just let this ride the trains.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #21)
> (In reply to Alessio Placitelli [:Dexter] from comment #20)
> > Heya! Sorry, I just stumbled upon this :) William, any chance you could also
> > file a PR against
> > https://github.com/mozilla/probe-dictionary/blob/master/environment.json to
> > add the |displayVersion| field there as well?
> 
> I can do this, but I don't think I'll have time before I return from PTO in
> a week. Leaving NI open to keep this on my radar.

Done now, over to you Alessio.

https://github.com/mozilla/probe-dictionary/pull/40
Flags: needinfo?(wlachance)
(In reply to William Lachance (:wlach) (use needinfo!) from comment #24)
> (In reply to William Lachance (:wlach) (use needinfo!) from comment #21)
> > (In reply to Alessio Placitelli [:Dexter] from comment #20)
> > > Heya! Sorry, I just stumbled upon this :) William, any chance you could also
> > > file a PR against
> > > https://github.com/mozilla/probe-dictionary/blob/master/environment.json to
> > > add the |displayVersion| field there as well?
> > 
> > I can do this, but I don't think I'll have time before I return from PTO in
> > a week. Leaving NI open to keep this on my radar.
> 
> Done now, over to you Alessio.
> 
> https://github.com/mozilla/probe-dictionary/pull/40

Thanks!
comment 23 makes sense to me, looking forward to seeing this in beta61!
Flags: needinfo?(jcristau)
You need to log in before you can comment on or make changes to this bug.