Closed Bug 1515396 Opened 6 years ago Closed 6 years ago

Monotonic now functions are not actually monotonic

Categories: Toolkit :: Telemetry (enhancement)
Severity: normal
Status: RESOLVED INVALID
People: (Reporter: wlach, Unassigned)
Following up from bug 1514392, I tried to determine whether the behaviour that was obvious to me (that we weren't checking for invalid results from our monotonic now functions) was sufficient to explain the very large values we were seeing on Windows. tl;dr: I don't believe so, and I suspect the problem is that these "monotonic now" functions are not working as advertised.

The [schema](https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/dc458113a7a523e60a9ba50e1174a3b1e0cfdc24/schemas/telemetry/main/main.4.schema.json#L1009) we're using for the main ping currently discards any pings with negative values. I was wondering how many of these we're seeing... it seems like only a relatively small number, and the values aren't in the millions as we're seeing with the large subsession lengths:

https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/56411/command/56522

This led me to think that the problem isn't a "monotonic now" function misbehaving in a random way, as I thought might be the case. We calculate the subsession length based on two samples of this function:

https://searchfox.org/mozilla-central/rev/2e5e28f518524f8af5b158ddb605022b6a2d68cf/toolkit/components/telemetry/pings/TelemetrySession.jsm#860

If either sample could radically misbehave in a random way, I would expect to see both very large negative numbers (in the case where the initial sample was high and the second was low) as well as very large positive ones (for the opposite). It seems we are only seeing cases where the first sample is much lower than the second. But why?

On closer inspection, it appears that we only get the really weird values on Windows (some suspicious values in the thousands on mac/linux, but that's it). What could explain that? It turns out that I don't think our "monotonic now" function is actually as monotonic as intended.
Under the hood, it appears both variants use a function called MsSinceProcessStart:

https://searchfox.org/mozilla-central/rev/2e5e28f518524f8af5b158ddb605022b6a2d68cf/toolkit/components/telemetry/core/TelemetryCommon.cpp#85

That function relies on TimeStamp::NowLoRes(), which uses the system timestamp. In addition, the function it calls to get the process's startup time (ProcessCreation) also appears to use the same system time under the hood (not surprising, since it returns a system timestamp):

https://searchfox.org/mozilla-central/rev/2e5e28f518524f8af5b158ddb605022b6a2d68cf/mozglue/misc/TimeStamp.cpp#69

My hypothesis of what's happening is that in some cases on Windows, the system timestamp is super low (like centuries off) when first sampled, and then it gets corrected to something like the current time. Or alternately, it starts off normal and then gets set to some time way off in the future. Same difference. We should be able to fix all of this by using the ComputeProcessUptime() function, which doesn't depend at all on the current timestamp:

https://searchfox.org/mozilla-central/rev/2e5e28f518524f8af5b158ddb605022b6a2d68cf/mozglue/misc/TimeStamp.h#593

We should still fix the issues I pointed out in bug 1514392, but I think this will fix most if not all cases of the really weird usage hours we're seeing.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #0)

> My hypothesis of what's happening is that in some cases on Windows, the
> system timestamp is super low (like centuries off) when first sampled, and
> then it gets corrected to something like the current time. Or alternately,
> it starts off normal and then gets set to some time way off in the future.
> Same difference. We should be able to fix all of this by using the
> ComputeProcessUptime() function, which doesn't depend at all on the current
> timestamp:
>
> https://searchfox.org/mozilla-central/rev/2e5e28f518524f8af5b158ddb605022b6a2d68cf/mozglue/misc/TimeStamp.h#593

This part is wrong. As :froydnj pointed out to me, ComputeProcessUptime() depends on the system clock (to determine how much time has elapsed since this process was started). Instead, just using TimeStamp::Now() as is should do the trick: telemetry doesn't actually care about the process uptime, it just wants a baseline to measure elapsed time against.
Telemetry sometimes does care about the process uptime, at least nominally. For instance, Event telemetry timestamps are in millis since main process start. We can probably change the documentation to adjust this implied guarantee, but there may be someone expecting it to bear some relation to session start.
Update: after several frustrating re-reads of the timestamp code (in particular ComputeProcessUptime and the ProcessCreation function that calls it), I'm now pretty sure most of the above is wrong. I feel like I've already caused quite a bit of confusion, so I want to check my assumptions again before saying more, but for now please don't trust my diagnosis.
Ok, I went through the code with :mreid today during our 1:1 and I'm pretty confident now about what's going on. I don't think there are any major issues with our monotonicity; rather, the problem is in how our telemetry timestamp code handles error conditions, which is bug 1514392. Let's close this as invalid and resume investigation/discussion there.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID