Closed Bug 1250640 Opened 4 years ago Closed 4 years ago

Unexpected subsessionLength values in Telemetry pings

Categories

(Toolkit :: Telemetry, defect, P2)

Points:
3

Tracking


RESOLVED FIXED
mozilla48
Tracking Status
firefox48 --- fixed

People

(Reporter: benjamin, Assigned: gfritzsche)

References

Details

(Whiteboard: [measurement:client])

Attachments

(3 files, 1 obsolete file)

I am seeing some negative values for subsessionLength, and also some very large values that shouldn't be possible under normal operation. It's essential to have high confidence in this data so that things like crash rates are meaningful.

In yesterday's main_summary table, there are 362M rows.

2051 rows had negative values, as low as -1919721125 which is -22,000 days.
2.7M rows had subsessionLength values larger than 28 hours; the largest was 2145922087, which is almost 25,000 days.
(14 rows had NULL subsessionlength, which is surprising but probably not relevant to this bug)

The large positive values might be genuine. It's possible that a computer is put to sleep, woken up days or weeks later, and we somehow count that time. However, we're not supposed to be counting sleep time on Mac/Linux, and those OSes are reporting some large times:

pipeline=> select os, count(*) FROM main_summary_20160222 WHERE subsessionlength > 3600 * 24 * 2 GROUP BY os;
   os    |  count  
---------+---------
 Mac     |   10134
 Windows | 1355607
 Linux   |     152
(3 rows)

This needs urgent investigation. For the time being, I think I'm going to cap subsessionlength in the range 0-25 hours for the stability rollups, but I'm concerned that may not be the right solution long-term.
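The proposed cap can be sketched as follows (the helper name is hypothetical; the real rollup would apply the equivalent clamp in SQL against main_summary):

```javascript
// Hypothetical helper mirroring the proposed stability-rollup cap:
// clamp subsessionLength (seconds) into the range [0, 25 hours].
const MAX_SUBSESSION_SECONDS = 25 * 3600;

function capSubsessionLength(seconds) {
  if (!Number.isFinite(seconds)) {
    return null; // e.g. the NULL rows seen in main_summary
  }
  return Math.min(Math.max(seconds, 0), MAX_SUBSESSION_SECONDS);
}
```

Negative values clamp to 0 and the multi-day outliers clamp to 25 hours, so the rollups stay bounded while the root cause is investigated.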
Priority: -- → P2
Whiteboard: [measurement:client]
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #0)
> 2051 rows had negative values, as low as -1919721125 which is -22,000 days.

We can investigate to make sure that we don't have any breakage in the client code, but from previous data issues I think we will always see a small amount of data corruption (this is 0.0007% of the data).

> 2.7M rows had values larger than 28 hours as a subsessionlength, the top
> ones had a subsessionlength as high as 2145922087 which is almost 25 days.
> (14 rows had NULL subsessionlength, which is surprising but probably not
> relevant to this bug)
> 
> The large positive values might be genuine. It's possible that we put a
> computer to sleep and wake it up days or weeks later and somehow count them.
> However, we're not supposed to be counting sleep time on Mac/Linux, and
> those OSes are reporting some large times:
> 
> pipeline=> select os, count(*) FROM main_summary_20160222 WHERE
> subsessionlength > 3600 * 24 * 2 GROUP BY os;
>    os    |  count  
> ---------+---------
>  Mac     |   10134
>  Windows | 1355607
>  Linux   |     152
> (3 rows)

Ok, this is strange - we will have to do manual testing to verify the assumed behavior on Mac & Linux here.
Besides manual spot-testing, we can also do sanity-checking on a per-client-history basis and see if those values are possible in the individual histories (although the profile-cloning groups will make this problematic).
Priority: P2 → P1
Assignee: nobody → gfritzsche
Points: --- → 3
Looking at the distributions of subsession lengths on Darwin, this seems to be at a sane volume:
https://ec2-54-218-5-112.us-west-2.compute.amazonaws.com/queries/37/source#table

This still leaves the issue of how we can have subsessionLength > 1 day in the first place; I'll dig into that next.
From what we understand of the monotonic clock behavior across platforms and our "at least daily" subsession splits, this should not be possible.
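As a reference point, the intended behavior can be sketched like this (illustrative names only, not the actual TelemetrySession.jsm code): the subsession length is the delta of a monotonic clock between splits, so with an "at least daily" split it should stay near one day and can never be negative.

```javascript
// Simplified sketch of deriving a subsession length from a monotonic
// clock. The real code lives in TelemetrySession.jsm; these names are
// illustrative only.
function makeSubsessionCounter(monotonicNowMs) {
  let startMs = monotonicNowMs();
  return {
    // Called on each "at least daily" subsession split.
    split() {
      const now = monotonicNowMs();
      const lengthSeconds = Math.round((now - startMs) / 1000);
      startMs = now; // the next subsession starts where this one ended
      return lengthSeconds;
    },
  };
}
```

Because a monotonic clock cannot go backwards, any negative value implies either a different code path computing the length or corruption downstream.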
Summary: Negative subsessionlength in telemetry pings → Unexpected subsessionLength values in Telemetry pings
Comparing this to session length is interesting:
https://ec2-54-218-5-112.us-west-2.compute.amazonaws.com/queries/38/source#66

sessionLength does not have negative values.
Some breakdowns done with the longitudinal_v20160213 dataset, focusing on Darwin (some more details at [0]):

For the large subsessionLengths:
* 1.8% of the subsessionLengths are >1day
* 0.1% are >1.1 days and 0.09% are >1.2 days
* most of those (>97%) are "daily" pings
* this doesn't correlate with a specific OS X version (major or full)

subsessionLengths slightly exceeding 1 day are expected, as we use longer scheduler tick times for idle users.
The bigger ones could be:
- the monotonic clock counting sleep time in some cases
- a bug preventing us from triggering daily pings
- client arithmetic failure

-------------------------------

For negative subsessionLengths:
* only 0.003% of the subsessionLengths are negative
* nearly all of the negative subsessionLengths are for the shutdown pings
* over 50% of those shutdown pings have:
  * thisPing.sessionLength < previousPing.sessionLength
  * subsessionLength = previousPing.sessionLength - thisPing.sessionLength
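One hypothetical way such values could arise (an illustration, not the actual client code): if the length were derived as a delta between the current and previous session-length counters, a counter that goes backwards across shutdown produces exactly this negative pattern.

```javascript
// Hypothetical sketch: if a subsession length were derived as a delta
// between the current and the previous session-length counter, a
// counter that goes backwards (thisPing.sessionLength <
// previousPing.sessionLength) yields a negative value.
function subsessionDelta(previousSessionLength, thisSessionLength) {
  return thisSessionLength - previousSessionLength;
}
```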

Suspicions here are:
- buggy client calculations
- clock issues around shutdown

-------------------------------

Fun side notes:
* we only see negative sessionLengths on Windows
* Win95/98 are submitting modern Telemetry data

0: https://docs.google.com/spreadsheets/d/11YqZrqktmOFrsmi1oSie6Ob88Bwp7f0LzMRzzVWRteM/
Win95/98 aren't actually submitting data: something triggered a Windows compatibility mode so that it *appears* to be an older version of Windows. But Firefox can't actually run on those versions.
See Also: → 1262386
I have been taking this through some investigation and testing.
I haven't been able to find a clear reason for this, only some suspicions, so the best thing seems to be to proceed with adding diagnostics and some hardening (patches coming up).
Attachment #8738486 - Flags: review?(alessio.placitelli)
Comment on attachment 8738483 [details] [diff] [review]
Part 1 - Avoid potential timer races in the TelemetrySession scheduler

Review of attachment 8738483 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +422,5 @@
>     * @return {Promise} A promise, only used when testing, resolved when the scheduled
>     *                   operation completes.
>     */
>    _onSchedulerTick: function() {
> +    // This call might not be triggered from a timeout. In that case we don't want to

nit: It might be worthwhile mentioning at least one case when this happens other than the timeout (e.g. user entering/exiting the idle status).

Good catch!
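The race being guarded against can be sketched as follows (all names hypothetical; the real logic lives in TelemetrySession.jsm): a tick may arrive from the scheduler timer or from idle-state notifications, and a non-timer tick must cancel the pending timeout before rescheduling so two timers never race each other.

```javascript
// Illustrative sketch of the guard (names hypothetical). A tick can
// come from the scheduler timer or from idle/active notifications; if
// it did not come from the timer, the pending timeout is cleared
// before rescheduling so two timers never race.
class SchedulerSketch {
  constructor() {
    this._timerArmed = false;
    this.rescheduleCount = 0;
  }
  _rearmTimer() {
    this._timerArmed = true;
    this.rescheduleCount++;
  }
  onSchedulerTick(dispatchedByTimer) {
    if (!dispatchedByTimer && this._timerArmed) {
      this._timerArmed = false; // cancel the pending timeout first
    }
    // ... tick logic (subsession split, daily ping, etc.) runs here ...
    this._rearmTimer();
  }
}
```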
Attachment #8738483 - Flags: review?(alessio.placitelli) → review+
Attachment #8738485 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8738486 [details] [diff] [review]
Part 3 - Add diagnostics for suspected scheduler scenarios

Review of attachment 8738486 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good with the one below addressed.

Are you planning on adding tests in a separate patch or manually testing the probes?

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +435,5 @@
>      let promise = Promise.resolve();
>      try {
>        promise = this._schedulerTickLogic();
>      } catch (e) {
> +      Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION", 1);

I think you meant getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION").add(1)
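To make the difference concrete, here is the same pattern against a minimal mock of the histogram API (Telemetry.getHistogramById is Firefox-internal, so the mock below is hypothetical):

```javascript
// Minimal mock of the Telemetry histogram API to show the fix:
// getHistogramById(id, 1) only looks the histogram up and drops the
// extra argument, while getHistogramById(id).add(1) accumulates.
const histograms = new Map();

const Telemetry = {
  getHistogramById(id) {
    if (!histograms.has(id)) {
      histograms.set(id, { sum: 0, add(v) { this.sum += v; } });
    }
    return histograms.get(id);
  },
};

// Buggy: the second argument is silently ignored, nothing is recorded.
Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION", 1);
// Fixed: the exception is actually counted.
Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION").add(1);
```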
Attachment #8738486 - Flags: review?(alessio.placitelli) → review+
(In reply to Alessio Placitelli [:Dexter] from comment #11)
> Are you planning on adding tests in a separate patch or manually testing the
> probes?

These are temporary measures, so I'm planning on manual testing.
Attachment #8738486 - Attachment is obsolete: true
Comment on attachment 8738507 [details] [diff] [review]
Part 3 - Add diagnostics for suspected scheduler scenarios

data-review?bsmedberg

These are temporary measures to help with diagnosing possible issues for this bug per comment 4.
If we don't get enough data out of aurora/beta, we might want to make this opt-out later.
Attachment #8738507 - Flags: review+
Attachment #8738507 - Flags: feedback?(benjamin)
Attachment #8738507 - Flags: feedback?(benjamin) → feedback+
Priority: P1 → P2