Closed Bug 1250640 • Opened 8 years ago • Closed 8 years ago

Unexpected subsessionLength values in Telemetry pings

Categories: Toolkit :: Telemetry, defect, P2
Status: RESOLVED FIXED
Target Milestone: mozilla48
Tracking: firefox48 fixed
People: Reporter: benjamin; Assignee: gfritzsche
Whiteboard: [measurement:client]

Attachments (3 files, 1 obsolete file):
* 2.49 KB, patch (Dexter: review+)
* 2.61 KB, patch (Dexter: review+)
* 5.12 KB, patch (gfritzsche: review+; benjamin: feedback+)
I am seeing some negative values for subsessionLength, and also some very large values which shouldn't be possible under normal operation. It's essential to have high confidence in this data so that things like crash rates are meaningful.

In yesterday's main_summary table, there are 362M rows. 2051 rows had negative values, as low as -1919721125, which is -22,000 days. 2.7M rows had values larger than 28 hours as a subsessionLength; the top ones had a subsessionLength as high as 2145922087, which is almost 25 days. (14 rows had NULL subsessionLength, which is surprising but probably not relevant to this bug.)

The large positive values might be genuine: it's possible that we put a computer to sleep, wake it up days or weeks later, and somehow count that time. However, we're not supposed to be counting sleep time on Mac/Linux, and those OSes are reporting some large times:

pipeline=> select os, count(*) FROM main_summary_20160222
           WHERE subsessionlength > 3600 * 24 * 2 GROUP BY os;
   os    |  count
---------+---------
 Mac     |   10134
 Windows | 1355607
 Linux   |     152
(3 rows)

This needs urgent investigation. For the time being, I think I'm going to cap subsessionLength to the range 0-25 hours for the stability rollups, but I'm concerned that may not be the right solution long-term.
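The interim capping described above can be sketched as a small helper. This is a hypothetical illustration of clamping to the proposed 0-25 hour range, not the actual rollup code; the function name and the NULL handling are assumptions.

```javascript
// Hypothetical sketch of the interim mitigation: clamp subsessionLength
// (in seconds) to the proposed 0-25 hour range for stability rollups.
// The helper name and NULL handling are illustrative assumptions.
const MAX_SUBSESSION_SECONDS = 25 * 3600; // 90000 seconds

function capSubsessionLength(seconds) {
  if (seconds === null || Number.isNaN(seconds)) {
    return null; // keep the rare NULL rows distinguishable from capped ones
  }
  return Math.min(Math.max(seconds, 0), MAX_SUBSESSION_SECONDS);
}
```

With the extreme values quoted above, -1919721125 clamps to 0 and 2145922087 clamps to 90000, while ordinary values pass through unchanged.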
Updated • 8 years ago (Assignee)
Priority: -- → P2
Whiteboard: [measurement:client]
Comment 1 • 8 years ago (Assignee)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #0)
> 2051 rows had negative values, as low as -1919721125 which is -22,000 days.

We can investigate to make sure that we don't have any breakage in the client code, but from previous data issues I think we will always see a low amount of data corruption (this is 0.0007% of the data).

> 2.7M rows had values larger than 28 hours as a subsessionlength, the top
> ones had a subsessionlength as high as 2145922087 which is almost 25 days.
> (14 rows had NULL subsessionlength, which is surprising but probably not
> relevant to this bug)
>
> The large positive values might be genuine. It's possible that we put a
> computer to sleep and wake it up days or weeks later and somehow count them.
> However, we're not supposed to be counting sleep time on Mac/Linux, and
> those OSes are reporting some large times:
>
> pipeline=> select os, count(*) FROM main_summary_20160222 WHERE
> subsessionlength > 3600 * 24 * 2 GROUP BY os;
>    os    |  count
> ---------+---------
>  Mac     |   10134
>  Windows | 1355607
>  Linux   |     152
> (3 rows)

Ok, this is strange - we will have to do manual testing to verify the assumed behavior on Mac & Linux here. Besides manual spot-testing, we can also do sanity-checking on a per-client-history basis and see if those values are possible in the individual histories (although the profile-cloning groups will make this problematic).
Updated • 8 years ago (Assignee)
Priority: P2 → P1
Updated • 8 years ago (Assignee)
Assignee: nobody → gfritzsche
Updated • 8 years ago (Assignee)
Points: --- → 3
Comment 2 • 8 years ago (Assignee)
Looking at the distributions of subsession lengths on Darwin, this seems to be at a sane volume:
https://ec2-54-218-5-112.us-west-2.compute.amazonaws.com/queries/37/source#table

This still leaves the issue of how we can have subsessionLength > 1 day in the first place; I'll dig into that next. From what we understand of the monotonic clock behavior across platforms and our "at least daily" subsession splits, this should not be possible.
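To see why values far beyond one day are surprising, here is a toy simulation of the "at least daily" split, under the simplifying assumption that the scheduler cuts a subsession whenever a tick sees a day or more elapsed. This is an illustrative model, not the real TelemetrySession scheduler logic.

```javascript
// Toy model of the "at least daily" subsession split: a subsession is cut
// whenever a scheduler tick observes >= 1 day elapsed since the last cut.
// This is a simplified assumption, not the real TelemetrySession code.
const DAY_SECONDS = 24 * 3600;

function simulateSplits(tickTimesSeconds) {
  let subsessionStart = 0;
  const subsessionLengths = [];
  for (const now of tickTimesSeconds) {
    if (now - subsessionStart >= DAY_SECONDS) {
      subsessionLengths.push(now - subsessionStart); // cut a "daily" ping
      subsessionStart = now;
    }
  }
  return subsessionLengths;
}
```

Under this model a subsession can only exceed one day by at most one tick interval, so longer idle tick times explain lengths slightly over a day, but a 25-day value would require ticks to stop firing entirely (e.g. during sleep) or the clock to misbehave.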
Updated • 8 years ago (Assignee)
Summary: Negative subsessionlength in telemetry pings → Unexpected subsessionLength values in Telemetry pings
Comment 3 • 8 years ago (Assignee)
Comparing this to session length is interesting:
https://ec2-54-218-5-112.us-west-2.compute.amazonaws.com/queries/38/source#66

sessionLength does not have negative values.
Comment 4 • 8 years ago (Assignee)
Some breakdowns done with the longitudinal_v20160213 dataset, focusing on Darwin (some more details at [0]):

For the large subsessionLengths:
* 1.8% of the subsessionLengths are >1 day
* 0.1% are >1.1 days and 0.09% are >1.2 days
* most of those (>97%) are "daily" pings
* this doesn't correlate with a specific OS X version (major or full)

subsessionLengths exceeding 1 day a bit is expected, as we use longer scheduler tick times for idle users. The bigger ones could be:
- the monotonic clock counting sleep time in some cases
- a bug preventing us from triggering daily pings
- client arithmetic failure

For negative subsessionLengths:
* only 0.003% of the subsessionLengths are negative
* nearly all of the negative subsessionLengths are for the shutdown pings
* over 50% of those shutdown pings have:
  * thisPing.sessionLength < previousPing.sessionLength
  * subsessionLength = previousPing.sessionLength - thisPing.sessionLength

Suspicions here are:
- buggy client calculations
- clock issues around shutdown

Fun side notes:
* we only see negative sessionLengths on Windows
* Win95/98 are submitting modern Telemetry data

0: https://docs.google.com/spreadsheets/d/11YqZrqktmOFrsmi1oSie6Ob88Bwp7f0LzMRzzVWRteM/
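The shutdown-ping pattern above can be checked per client history with a small scan. This is a hypothetical sketch: the flat field names are illustrative assumptions (real pings carry these values under payload.info), and it only encodes the two co-occurring conditions listed in the breakdown.

```javascript
// Hypothetical sketch: scan an ordered client ping history for shutdown
// pings matching the pattern observed above, i.e.
//   thisPing.sessionLength < previousPing.sessionLength, and
//   subsessionLength == previousPing.sessionLength - thisPing.sessionLength.
// Field names are illustrative; real pings nest these under payload.info.
function findSuspectShutdownPings(pings) {
  const suspects = [];
  for (let i = 1; i < pings.length; i++) {
    const prev = pings[i - 1];
    const cur = pings[i];
    if (cur.reason === "shutdown" &&
        cur.sessionLength < prev.sessionLength &&
        cur.subsessionLength === prev.sessionLength - cur.sessionLength) {
      suspects.push(i); // index of the suspicious shutdown ping
    }
  }
  return suspects;
}
```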
Comment 5 • 8 years ago (Reporter)
Win95/98 aren't actually submitting data: something triggered a Windows compatibility mode so that it *appears* to be an older version of Windows. But Firefox can't actually run on those versions.
Comment 6 • 8 years ago (Assignee)
I have done some investigation and testing on this. I haven't been able to find a clear reason, only some suspicions, so the best next step seems to be adding diagnostics and some hardening (patches coming up).
Comment 7 • 8 years ago (Assignee)
Attachment #8738483 - Flags: review?(alessio.placitelli)
Comment 8 • 8 years ago (Assignee)
Attachment #8738485 - Flags: review?(alessio.placitelli)
Comment 9 • 8 years ago (Assignee)
Attachment #8738486 - Flags: review?(alessio.placitelli)
Comment 10 • 8 years ago
Comment on attachment 8738483 [details] [diff] [review]
Part 1 - Avoid potential timer races in the TelemetrySession scheduler

Review of attachment 8738483 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +422,5 @@
>  * @return {Promise} A promise, only used when testing, resolved when the scheduled
>  *         operation completes.
>  */
> _onSchedulerTick: function() {
> +  // This call might not be triggered from a timeout. In that case we don't want to

nit: It might be worthwhile mentioning at least one case when this happens other than the timeout (e.g. the user entering/exiting the idle status). Good catch!
Attachment #8738483 - Flags: review?(alessio.placitelli) → review+
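The patch title and the quoted review comment suggest the race is between a tick triggered directly (e.g. by an idle-state change) and a still-pending timeout. A minimal sketch of that kind of guard follows; every name here is invented for illustration, and this is not the real TelemetrySession code.

```javascript
// Hypothetical sketch of avoiding a timer race: if the tick was invoked
// directly (not from its timeout), cancel the pending timeout first so
// the tick logic cannot run twice in quick succession.
// All names are illustrative, not the real TelemetrySession ones.
class Scheduler {
  constructor() {
    this._timer = null;
    this.ticks = 0;
  }
  scheduleNextTick(delayMs) {
    this._cancelTimer();
    this._timer = setTimeout(() => {
      this._timer = null;
      this.onTick(/* fromTimer */ true);
    }, delayMs);
  }
  _cancelTimer() {
    if (this._timer !== null) {
      clearTimeout(this._timer);
      this._timer = null;
    }
  }
  onTick(fromTimer = false) {
    if (!fromTimer) {
      // Triggered e.g. by the user entering/exiting idle: drop the
      // pending timeout so it can't fire a second tick right after.
      this._cancelTimer();
    }
    this.ticks++;
    this.scheduleNextTick(60 * 1000); // illustrative tick interval
  }
}
```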
Updated • 8 years ago
Attachment #8738485 - Flags: review?(alessio.placitelli) → review+
Comment 11 • 8 years ago
Comment on attachment 8738486 [details] [diff] [review]
Part 3 - Add diagnostics for suspected scheduler scenarios

Review of attachment 8738486 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good with the one below addressed. Are you planning on adding tests in a separate patch or manually testing the probes?

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +435,5 @@
> let promise = Promise.resolve();
> try {
>   promise = this._schedulerTickLogic();
> } catch (e) {
> +  Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION", 1);

I think you meant getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION").add(1)
Attachment #8738486 - Flags: review?(alessio.placitelli) → review+
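The review above points out that the patched line only looks the histogram up and never records anything. A tiny mock of a lookup-then-add API illustrates why the extra argument is silently ignored; the mock itself is an assumption for illustration, not the real Telemetry implementation.

```javascript
// Minimal mock of a histogram API to illustrate the review comment:
// getHistogramById() only looks the histogram up; a count is recorded
// only by calling .add() on the returned object. This mock is an
// illustrative assumption, not the real nsITelemetry implementation.
const histograms = new Map();
const Telemetry = {
  getHistogramById(id) {
    if (!histograms.has(id)) {
      histograms.set(id, { sum: 0, add(v) { this.sum += v; } });
    }
    return histograms.get(id);
  },
};

// Buggy form from the patch: the extra argument is ignored, nothing recorded.
Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION", 1);
// Corrected form suggested in the review: the count is actually added.
Telemetry.getHistogramById("TELEMETRY_SCHEDULER_TICK_EXCEPTION").add(1);
```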
Comment 12 • 8 years ago (Assignee)
(In reply to Alessio Placitelli [:Dexter] from comment #11)
> Are you planning on adding tests in a separate patch or manually testing the
> probes?

These are temporary measures, so I'm planning on manual testing.
Comment 13 • 8 years ago (Assignee)
Updated • 8 years ago (Assignee)
Attachment #8738486 - Attachment is obsolete: true
Comment 14 • 8 years ago (Assignee)
Comment on attachment 8738507 [details] [diff] [review]
Part 3 - Add diagnostics for suspected scheduler scenarios

data-review?bsmedberg

These are temporary measures to help with diagnosing possible issues for this bug, per comment 4. If we don't get enough data out of aurora/beta, we might want to make this opt-out later.
Attachment #8738507 - Flags: review+
Attachment #8738507 - Flags: feedback?(benjamin)
Updated • 8 years ago (Reporter)
Attachment #8738507 - Flags: feedback?(benjamin) → feedback+
Comment 15 • 8 years ago
https://hg.mozilla.org/integration/fx-team/rev/2727f9056580
https://hg.mozilla.org/integration/fx-team/rev/d287704400aa
https://hg.mozilla.org/integration/fx-team/rev/61e7426512e5
Updated • 8 years ago (Assignee)
Priority: P1 → P2
Comment 16 • 8 years ago (bugherder)
https://hg.mozilla.org/mozilla-central/rev/2727f9056580
https://hg.mozilla.org/mozilla-central/rev/d287704400aa
https://hg.mozilla.org/mozilla-central/rev/61e7426512e5
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox48: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48