Investigate increase in payload.info.subsessionLength errors in main ping
Categories
(Data Platform and Tools :: General, task, P2)
Tracking
(Not tracked)
People
(Reporter: ascholtz, Assigned: klukas)
References
Details
(Whiteboard: [dataquality])
Attachments
(1 file)
11.99 KB,
image/png
|
Details |
The number of payload.info.subsessionLength
errors for main has spiked at ~80k on 2022-03-12
Comment 1•3 years ago
|
||
Spiked to 1 million on 2022-03-14.
Assignee | ||
Comment 2•3 years ago
|
||
This is affecting main
and first-shutdown
.
The specific error is:
org.everit.json.schema.ValidationException: #/payload/info/subsessionLength: #: no subschema matched out of the total 2 subschemas
cc :chutten in case you have any thoughts on what could cause this to spike.
Assignee | ||
Comment 3•3 years ago
•
|
||
The values appear to be negative numbers (often -1
but also many different larger values like -82983
).
Example query:
SELECT
JSON_QUERY(udf_js.gunzip(payload), '$.payload.info.subsessionLength') AS v,
JSON_QUERY(udf_js.gunzip(payload), '$.payload.info') AS info
FROM
`moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
DATE(submission_timestamp) = "2022-03-14"
and error_message like '%subsessionLength%'
limit 1000
Comment 4•3 years ago
|
||
Hrm. subsessionLength
is based on what should be monotonic time. We're using msSinceProcessStart
unconditionally since 2019 (see bug 1514392, Fx 67). More recently (bug 1205985, Fx 86) but still not recently enough to make sense that it's causing problems only now, :padenot made some changes to make time during sleep accounted for consistently across platforms. Going one step deeper, the most recent change to the underlying timestamp API we're using was bug 1751041 (Fx 98, so the timing might work out) where we moved from lazy-load to greedy-load. Not sure how that could cause any grief whatsoever, but it's the only change close to what's happening.
If we could split the errors by platform and app display version that could help determine whether it's a client-side change that could be causing this. (If it's Mac-only, then maybe it has to do with time during sleep. If it's cross-platform, but only starts in Fx 98 maybe bug 1751041 did have an unintended effect, etc)
Assignee | ||
Comment 5•3 years ago
|
||
OMG. I wonder if this is DST-related. The time change (in the US at least) happened on March 13; we'd expect most pings for that time period to be sent midnight local time of the next day.
Comment 6•3 years ago
|
||
geo'd be a fun thing to look at, if that were the case. However I challenge that most pings are sent at midnight local: most "main" pings are actually sent at Firefox shutdown (except Mac due to the no-windows-but-still-running thing that bug 1715855 talks about).
DST should have no effect on monotonic times which these should be... but wouldn't that be just grand.
Assignee | ||
Comment 7•3 years ago
|
||
About 350k of these are from US, then DE and FR each have ~100k. If this were DST-related I'd not expect Europe to be affected.
Assignee | ||
Comment 8•3 years ago
|
||
The vast majority are from 98.0 or above:
WITH base AS (
SELECT
*, udf_js.gunzip(payload) AS p
FROM
`moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
DATE(submission_timestamp) = "2022-03-14"
and error_message like '%subsessionLength%')
select
udf.parse_desktop_telemetry_uri(uri).app_version,
count(*),
from base
group by 1 order by 2 desc
That gives 970985 on 98.0. The most significant entry for pre-98 is 97.0.2 at 2217 entries.
So this definitely looks like something related to a change in 98.0.
Assignee | ||
Comment 9•3 years ago
|
||
I'm seeing both Mac and Windows appearing frequently in this dataset. Linux occasionally as well, so doesn't appear to be platform-dependent.
Comment 10•3 years ago
|
||
:gsvelto, do you know of any way that bug 1751041 (or other TimeStamp changes in Fx98) might cause problems of this sort? Specifically, Telemetry uses TimeStamp::NowLoRes - TimeStamp::ProcessCreation
to get a monotonic "now". We're seeing pings that contain data showing that calling this twice can give us a later answer that is before the earlier answer, starting mostly in Fx98.
That is, we store a value _subsessionStartTime
, then later subtract it from the current monotonic "now" and find it is negative.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 11•3 years ago
|
||
We see another spike to 1 million for March 25.
Comment 12•3 years ago
|
||
The patch changed two things: when we compute the startup timestamp and when we re-compute it after a restart, not how it's computed so it's odd that you get a negative timestamp. That being said this code is complex and I've seen it interact in odd ways with other code so my change remains the most likely culprit. I need a few more bits of information to narrow it down: is this happening on all platforms or just one? AFAIK only Windows cares about the low-res part of the timestamp you use, so if the issue were happening only on Windows host it might be related to that (there might be a discrepancy between the startup time calculation and the low-res timebase). Additionally, is the negative result you get large or small? In general issues with the startup time have resulted in very small negative discrepancies in the past (on the order of a few hundred ms at most), so if it's small then it would likely be a timebase issue. If the value is large though that would point to something else, likely a significant logic problem in my patch.
Assignee | ||
Comment 13•3 years ago
|
||
is this happening on all platforms or just one?
It's happening on all platforms, not just windows.
is the negative result you get large or small?
The majority are small values (absolute value below 1000). The most common value is -1
with lots of -2
and -3
as well. Perhaps 25% are larger values; the biggest I saw in a spot check was -242723
.
Assignee | ||
Comment 14•3 years ago
|
||
We haven't seen a new spike since March 25, but there continues to be a high baseline rate for this error.
In the past week, we rejected approximately 0.05% of the overall volume of main pings due to negative subsessionLength.
Comment 15•3 years ago
|
||
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #13)
The majority are small values (absolute value below 1000). The most common value is
-1
with lots of-2
and-3
as well. Perhaps 25% are larger values; the biggest I saw in a spot check was-242723
.
This makes it likely that the issue is a time inversion when taking the first timestamp. The fact that it happens on all platforms indicates a significant flaw in that logic. Should I open a separate bug for the investigation/fix of the TimeStamp class or should I do it within this bug?
Assignee | ||
Comment 16•3 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] (pto) from comment #15)
This makes it likely that the issue is a time inversion when taking the first timestamp. The fact that it happens on all platforms indicates a significant flaw in that logic. Should I open a separate bug for the investigation/fix of the TimeStamp class or should I do it within this bug?
Feel free to use this bug, but it's also fine to file a separate one if that's better for tracking on your end.
Assignee | ||
Comment 18•3 years ago
|
||
This has returned to more nominal levels, so closing as fixed.
Updated•3 years ago
|
Reporter | ||
Updated•2 years ago
|
Comment 19•1 year ago
|
||
note says this should be fixed but increased last week
Description
•