Closed Bug 1250640 Opened 4 years ago Closed 4 years ago

Unexpected subsessionLength values in Telemetry pings

Categories

(Toolkit :: Telemetry, defect, P2)

Points:
3

Tracking


RESOLVED FIXED
mozilla48
Tracking Status
firefox48 --- fixed

People

(Reporter: benjamin, Assigned: gfritzsche)

References

Details

(Whiteboard: [measurement:client])

Attachments

(3 files, 1 obsolete file)

I am seeing some negative values for subsessionLength, and also some very large values that shouldn't be possible under normal operation. It's essential to have high confidence in this data so that things like crash rates are meaningful.

In yesterday's main_summary table, there are 362M rows.

2051 rows had negative values, as low as -1919721125 which is -22,000 days.
2.7M rows had subsessionLength values larger than 28 hours; the largest was 2145922087, which is almost 25,000 days.
(14 rows had NULL subsessionlength, which is surprising but probably not relevant to this bug)

The large positive values might be genuine. It's possible that a computer is put to sleep, woken up days or weeks later, and we somehow count that time. However, we're not supposed to be counting sleep time on Mac/Linux, and those OSes are reporting some large times:

pipeline=> select os, count(*) FROM main_summary_20160222 WHERE subsessionlength > 3600 * 24 * 2 GROUP BY os;
   os    |  count  
---------+---------
 Mac     |   10134
 Windows | 1355607
 Linux   |     152
(3 rows)

This needs urgent investigation. For the time being, I think I'm going to cap subsessionlength in the range 0-25 hours for the stability rollups, but I'm concerned that may not be the right solution long-term.
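The proposed cap can be sketched as follows (the helper name is hypothetical; the real rollup would apply the equivalent clamp in SQL against main_summary):

```javascript
// Hypothetical helper mirroring the proposed stability-rollup cap:
// clamp subsessionLength (seconds) into the range [0, 25 hours].
const MAX_SUBSESSION_SECONDS = 25 * 3600;

function capSubsessionLength(seconds) {
  if (!Number.isFinite(seconds)) {
    return null; // e.g. the NULL rows seen in main_summary
  }
  return Math.min(Math.max(seconds, 0), MAX_SUBSESSION_SECONDS);
}
```

Negative values clamp to 0 and the multi-day outliers clamp to 25 hours, so the rollups stay bounded while the root cause is investigated.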
Priority: -- → P2
Whiteboard: [measurement:client]
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #0)
> 2051 rows had negative values, as low as -1919721125 which is -22,000 days.

We can investigate to make sure that we don't have any breakage in the client code, but from previous data issues I think we will always see a small amount of data corruption (this is 0.0007% of the data).

> 2.7M rows had values larger than 28 hours as a subsessionlength, the top
> ones had a subsessionlength as high as 2145922087 which is almost 25 days.
> (14 rows had NULL subsessionlength, which is surprising but probably not
> relevant to this bug)
> 
> The large positive values might be genuine. It's possible that we put a
> computer to sleep and wake it up days or weeks later and somehow count them.
> However, we're not supposed to be counting sleep time on Mac/Linux, and
> those OSes are reporting some large times:
> 
> pipeline=> select os, count(*) FROM main_summary_20160222 WHERE
> subsessionlength > 3600 * 24 * 2 GROUP BY os;
>    os    |  count  
> ---------+---------
>  Mac     |   10134
>  Windows | 1355607
>  Linux   |     152
> (3 rows)

Ok, this is strange - we will have to do manual testing to verify the assumed behavior on Mac & Linux here.
Besides manual spot-testing, we can also do sanity-checking on a per-client-history basis and see if those values are possible in the individual histories (although the profile-cloning groups will make this problematic).
Priority: P2 → P1
Assignee: nobody → gfritzsche
Points: --- → 3
Looking at the distributions of subsession lengths on Darwin, this seems to be at a sane volume:
https://ec2-54-218-5-112.us-west-2.compute.amazonaws.com/queries/37/source#table

This still leaves the issue of how we can have subsessionLength > 1 day in the first place; I'll dig into that next.
From what we understand of the monotonic clock behavior across platforms and our "at least daily" subsession splits, this should not be possible.
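As a reference point, the intended behavior can be sketched like this (illustrative names only, not the actual TelemetrySession.jsm code): the subsession length is the delta of a monotonic clock between splits, so with an "at least daily" split it should stay near one day and can never be negative.

```javascript
// Simplified sketch of deriving a subsession length from a monotonic
// clock. The real code lives in TelemetrySession.jsm; these names are
// illustrative only.
function makeSubsessionCounter(monotonicNowMs) {
  let startMs = monotonicNowMs();
  return {
    // Called on each "at least daily" subsession split.
    split() {
      const now = monotonicNowMs();
      const lengthSeconds = Math.round((now - startMs) / 1000);
      startMs = now; // the next subsession starts where this one ended
      return lengthSeconds;
    },
  };
}
```

Because a monotonic clock cannot go backwards, any negative value implies either a different code path computing the length or corruption downstream.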
Summary: Negative subsessionlength in telemetry pings → Unexpected subsessionLength values in Telemetry pings
Comparing this to session length is interesting:
https://ec2-54-218-5-112.us-west-2.compute.amazonaws.com/queries/38/source#66

sessionLength does not have negative values.
Some breakdowns done with the longitudinal_v20160213 dataset, focusing on Darwin (some more details at [0]):

For the large subsessionLengths:
* 1.8% of the subsessionLengths are >1day
* 0.1% are >1.1 days and 0.09% are >1.2 days
* most of those (>97%) are "daily" pings
* this doesn't correlate with a specific OS X version (major or full)

subsessionLengths slightly exceeding 1 day are expected, as we use longer scheduler tick times for idle users.
The bigger ones could be:
- the monotonic clock counting sleep time in some cases
- a bug preventing us from triggering daily pings
- client arithmetic failure

-------------------------------

For negative subsessionLengths:
* only 0.003% of the subsessionLengths are negative
* nearly all of the negative subsessionLengths are for the shutdown pings
* over 50% of those shutdown pings have:
  * thisPing.sessionLength < previousPing.sessionLength
  * subsessionLength = previousPing.sessionLength - thisPing.sessionLength
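One hypothetical way such values could arise (an illustration, not the actual client code): if the length were derived as a delta between the current and previous session-length counters, a counter that goes backwards across shutdown produces exactly this negative pattern.

```javascript
// Hypothetical sketch: if a subsession length were derived as a delta
// between the current and the previous session-length counter, a
// counter that goes backwards (thisPing.sessionLength <
// previousPing.sessionLength) yields a negative value.
function subsessionDelta(previousSessionLength, thisSessionLength) {
  return thisSessionLength - previousSessionLength;
}
```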

Suspicions here are:
- buggy client calculations
- clock issues around shutdown

-------------------------------

Fun side notes:
* we only see negative sessionLengths on Windows
* Win95/98 are submitting modern Telemetry data

0: https://docs.google.com/spreadsheets/d/11YqZrqktmOFrsmi1oSie6Ob88Bwp7f0LzMRzzVWRteM/
Win95/98 aren't actually submitting data: something triggered a Windows compatibility mode so that it *appears* to be an older version of Windows. But Firefox can't actually run on those versions.
See Also: → 1262386
I have been taking this through some investigation and testing.
I haven't been able to find a clear reason for this, only some suspicions, so the best thing seems to be to proceed with adding diagnostics and some hardening (patches coming up).
Attachment #8738486 - Flags: review?(alessio.placitelli)
Comment on attachment 8738483 [details] [diff] [review]
Part 1 - Avoid potential timer races in the TelemetrySession scheduler

Review of attachment 8738483 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +422,5 @@
>     * @return {Promise} A promise, only used when testing, resolved when the scheduled
>     *                   operation completes.
>     */
>    _onSchedulerTick: function() {
> +    // This call might not be triggered from a timeout. In that case we don't want to

nit: It might be worthwhile mentioning at least one case when this happens other than the timeout (e.g. user entering/exiting the idle status).

Good catch!
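The race being guarded against can be sketched as follows (all names hypothetical; the real logic lives in TelemetrySession.jsm): a tick may arrive from the scheduler timer or from idle-state notifications, and a non-timer tick must cancel the pending timeout before rescheduling so two timers never race each other.

```javascript
// Illustrative sketch of the guard (names hypothetical). A tick can
// come from the scheduler timer or from idle/active notifications; if
// it did not come from the timer, the pending timeout is cleared
// before rescheduling so two timers never race.
class SchedulerSketch {
  constructor() {
    this._timerArmed = false;
    this.rescheduleCount = 0;
  }
  _rearmTimer() {
    this._timerArmed = true;
    this.rescheduleCount++;
  }
  onSchedulerTick(dispatchedByTimer) {
    if (!dispatchedByTimer && this._timerArmed) {
      this._timerArmed = false; // cancel the pending timeout first
    }
    // ... tick logic (subsession split, daily ping, etc.) runs here ...
    this._rearmTimer();
  }
}
```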
Attachment #8738483 - Flags: review?(alessio.placitelli) → review+
Attachment #8738485 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8738486 [details] [diff] [review]
Part 3 - Add diagnostics for suspected scheduler scenarios

Review of attachment 8738486 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good with the one below addressed.

Are you planning on adding tests in a separate patch or manually testing the probes?

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +435,5 @@
>      let promise = Promise.resolve();
>      try {
>        promise = this._schedulerTickLogic();
>      } catch (e) {
> +      Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION", 1);

I think you meant getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION").add(1)
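To make the difference concrete, here is the same pattern against a minimal mock of the histogram API (Telemetry.getHistogramById is Firefox-internal, so the mock below is hypothetical):

```javascript
// Minimal mock of the Telemetry histogram API to show the fix:
// getHistogramById(id, 1) only looks the histogram up and drops the
// extra argument, while getHistogramById(id).add(1) accumulates.
const histograms = new Map();

const Telemetry = {
  getHistogramById(id) {
    if (!histograms.has(id)) {
      histograms.set(id, { sum: 0, add(v) { this.sum += v; } });
    }
    return histograms.get(id);
  },
};

// Buggy: the second argument is silently ignored, nothing is recorded.
Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION", 1);
// Fixed: the exception is actually counted.
Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION").add(1);
```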
Attachment #8738486 - Flags: review?(alessio.placitelli) → review+
(In reply to Alessio Placitelli [:Dexter] from comment #11)
> Are you planning on adding tests in a separate patch or manually testing the
> probes?

These are temporary measures, so I'm planning on manual testing.
Attachment #8738486 - Attachment is obsolete: true
Comment on attachment 8738507 [details] [diff] [review]
Part 3 - Add diagnostics for suspected scheduler scenarios

data-review?bsmedberg

These are temporary measures to help with diagnosing possible issues for this bug per comment 4.
If we don't get enough data out of aurora/beta, we might want to make this opt-out later.
Attachment #8738507 - Flags: review+
Attachment #8738507 - Flags: feedback?(benjamin)
Attachment #8738507 - Flags: feedback?(benjamin) → feedback+
Priority: P1 → P2