Open Bug 1204823 Opened 5 years ago Updated 2 years ago

TimeStamp::NowLowRes() behaves differently over computer sleep over platforms

Categories

(Core :: mozglue, defect, P4)

defect
Points:
1

Tracking

()

Tracking Status
firefox43 --- affected

People

(Reporter: gfritzsche, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [measurement:client])

We have TimeStamp::NowLowRes() etc. for monotonic clock usage, as e.g. used by nsITelemetry.msSinceProcessStart():
https://dxr.mozilla.org/mozilla-central/rev/9ed17db42e3e46f1c712e4dffd62d54e915e0fac/toolkit/components/telemetry/Telemetry.cpp#3565

However, we noticed that this behaves differently over platforms when a computer goes to sleep, so it's not useful for any cross-platform time measurements.

On Windows this comes down to GetTickCount64(), which does increase over sleep periods:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms724411%28v=vs.85%29.aspx
https://dxr.mozilla.org/mozilla-central/rev/9ed17db42e3e46f1c712e4dffd62d54e915e0fac/mozglue/misc/TimeStamp_windows.cpp#530

OS X comes down to mach_absolute_time(), which does not increase over sleep periods:
https://dxr.mozilla.org/mozilla-central/rev/9ed17db42e3e46f1c712e4dffd62d54e915e0fac/mozglue/misc/TimeStamp_darwin.cpp#49

POSIX comes down to clock_gettime(CLOCK_MONOTONIC, &ts), which should not increase over sleep time either:
https://dxr.mozilla.org/mozilla-central/rev/9ed17db42e3e46f1c712e4dffd62d54e915e0fac/mozglue/misc/TimeStamp_posix.cpp#77
Whiteboard: [unifiedTelemetry]
Priority: -- → P2
Benjamin, do you have an opinion on the expected behavior here or know who might have one?

I'm leaning towards the Windows behavior of actual time elapsed, not "time spent in working state".
Flags: needinfo?(benjamin)
I don't have an opinion. You'd need to survey existing callers to figure out whether they expect one thing or the other. Nathan may know more.

But rather than asking about TimeStamp specifically, can you just articulate the behavior you actually want and we can decide where that implementation should live?
Flags: needinfo?(benjamin) → needinfo?(nfroyd)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> But rather than asking about TimeStamp specifically, can you just articulate
> the behavior you actually want and we can decide where that implementation
> should live?

There are two issues here:
* the TimeStamp implementation behaves inconsistent across platforms
* what behavior our affected use should have (moving that needinfo to another bug)
(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #3)
> * the TimeStamp implementation behaves inconsistent across platforms

Is this true for high resolution timing as well?  I can't tell straightaway from reading the Windows code.

I can see arguments both ways for what people prefer.  For telemetry measurements, I think you'd typically want actual time spent working.  But I'm not entirely sure about that--would have to think about it a little more.  The answer might depend on the particular measurement, too...
Flags: needinfo?(nfroyd)
From a casual check on DXR, TimeStamp seems to be used both ways:
(1) for checking how long some operation takes (including sleep time probably undesired)
(2) for checking how much time passed since a certain event (sleep time should probably be included)

I'm thinking defaulting to including sleep time might be the safer approach with less issues triggered.
Giving that is the effective behavior on Windows, features should already work ok with that.

E.g. the slow script detection works around sleep time jumps somewhat awkwardly:
https://dxr.mozilla.org/mozilla-central/rev/9ed17db42e3e46f1c712e4dffd62d54e915e0fac/js/xpconnect/src/XPCJSRuntime.cpp#1345
Blocks: 1205567
(In reply to Nathan Froyd [:froydnj] from comment #4)
> (In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #3)
> > * the TimeStamp implementation behaves inconsistent across platforms
> 
> Is this true for high resolution timing as well?  I can't tell straightaway
> from reading the Windows code.

From this:
http://www.janbambas.cz/usage-of-queryperformancecounter-in-gecko/
... and a quick look over this:
https://dxr.mozilla.org/mozilla-central/rev/9ed17db42e3e46f1c712e4dffd62d54e915e0fac/mozglue/misc/TimeStamp_windows.cpp#317

... my understanding is that high-resolution timing on Windows should just fall back to GetTickCount64() values after sleep periods and thus behave the same.

Honza, can you confirm?
Flags: needinfo?(honzab.moz)
Points: --- → 1
Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][measurement:client]
Priority: P2 → P1
>... my understanding is that high-resolution timing on Windows should just fall > back to GetTickCount64() values after sleep periods

Yes, the mechanism is described in my blog post.  It's rather complicated and we often work with "difference of a difference".  But gives good results.

> and thus behave the same

Yes.  If you use NowLoRes(), it always only works with GTC which (AFAIK) reflects the sleep time as "working time".  I don't see any problem with NowLoRes() on win according sleep.


On Windows, I'm not sure which of the timers [1] is "stopped" during sleep/hibernation.  I would have to check on all of them.  Also, it can differ from hardware to hardware (as QPC does).

The problem with sleep, GetTickCount and QueryPerformanceCoounter is that they perceive the sleep period time differently.  But when you are using NowLoRes() you don't need to worry ;)  It only uses GTC.  Also, the implementation as is ensures "monotonicity" and constant "calibration" according to GTC values.  GTC is considered (also according my findings) a very stable timer, tho with low usually 15.6ms resolution.


[1] https://bugzilla.mozilla.org/show_bug.cgi?id=676349#c0
Flags: needinfo?(honzab.moz)
Nathan, what do you think?
Do you think staying consistently with incrementing over sleep times sounds reasonable, given the last 3 comments?

The approach chosen here affects how we solve this in Telemetry.
Flags: needinfo?(nfroyd)
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> Nathan, what do you think?
> Do you think staying consistently with incrementing over sleep times sounds
> reasonable, given the last 3 comments?
> 
> The approach chosen here affects how we solve this in Telemetry.

Is it possible to address this in telemetry by always explicitly separating out things by platform (on the server/dashboard side) or by applying some sort of compensation for time spent asleep (possibly recorded in a separate histogram)?

It's a good point that things already work this way on Windows *ought* to make changing it on other platforms work out just fine.  But I'm a little leery of the consequences; support for monotonic suspend-aware clocks on Linux is not universal (>= 2.6.39, according to the manpage, which may not be reliable; I don't know how our kernel version and/or kernel support for this on Android looks), and it's certainly not portable across POSIX platforms.  Mac seems to have support for this, though: http://stackoverflow.com/questions/12488481/getting-ios-system-uptime-that-doesnt-pause-when-asleep/12490414#12490414

Also worth noting that somebody was at least thinking about suspend when they wrote the TimeStamp code:

http://mxr.mozilla.org/mozilla-central/source/mozglue/misc/TimeStamp.h#363

whether or not all the users of TimeStamp were aware of that is another question...

I am cautiously in favor, but I'd like to know if there are other alternatives we can pursue first.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #9)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> > Nathan, what do you think?
> > Do you think staying consistently with incrementing over sleep times sounds
> > reasonable, given the last 3 comments?
> > 
> > The approach chosen here affects how we solve this in Telemetry.
> 
> Is it possible to address this in telemetry by always explicitly separating
> out things by platform (on the server/dashboard side) or by applying some
> sort of compensation for time spent asleep (possibly recorded in a separate
> histogram)?

We have a chance to submit proper values now and we want to specifically measure time including sleep (and later also time excluding sleep etc.). So we should fix i now and avoid accumulating weird server-side special cases.
If the decision for this bug is that it's ok for `TimeStamp`, then i will implement something specifically for Telemetry that gives us what we need - i just need to know :)

> It's a good point that things already work this way on Windows *ought* to
> make changing it on other platforms work out just fine.  But I'm a little
> leery of the consequences; support for monotonic suspend-aware clocks on
> Linux is not universal (>= 2.6.39, according to the manpage, which may not
> be reliable; I don't know how our kernel version and/or kernel support for
> this on Android looks), and it's certainly not portable across POSIX
> platforms.  Mac seems to have support for this, though:
> http://stackoverflow.com/questions/12488481/getting-ios-system-uptime-that-
> doesnt-pause-when-asleep/12490414#12490414

Ok, so Linux/Android are the problematic parts. Do we have anyone with good knowledge there we could ping?
Flags: needinfo?(nfroyd)
(In reply to Georg Fritzsche [:gfritzsche] from comment #10)
> (In reply to Nathan Froyd [:froydnj] from comment #9)
> > (In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> > > Nathan, what do you think?
> > > Do you think staying consistently with incrementing over sleep times sounds
> > > reasonable, given the last 3 comments?
> > > 
> > > The approach chosen here affects how we solve this in Telemetry.
> > 
> > Is it possible to address this in telemetry by always explicitly separating
> > out things by platform (on the server/dashboard side) or by applying some
> > sort of compensation for time spent asleep (possibly recorded in a separate
> > histogram)?
> 
> We have a chance to submit proper values now and we want to specifically
> measure time including sleep (and later also time excluding sleep etc.). So
> we should fix i now and avoid accumulating weird server-side special cases.
> If the decision for this bug is that it's ok for `TimeStamp`, then i will
> implement something specifically for Telemetry that gives us what we need -
> i just need to know :)

You want to measure time including and excluding sleep for everything?

This paragraph confuses me.  Did you mean "...is that it's *not* ok for TimeStamp"?

> > It's a good point that things already work this way on Windows *ought* to
> > make changing it on other platforms work out just fine.  But I'm a little
> > leery of the consequences; support for monotonic suspend-aware clocks on
> > Linux is not universal (>= 2.6.39, according to the manpage, which may not
> > be reliable; I don't know how our kernel version and/or kernel support for
> > this on Android looks), and it's certainly not portable across POSIX
> > platforms.  Mac seems to have support for this, though:
> > http://stackoverflow.com/questions/12488481/getting-ios-system-uptime-that-
> > doesnt-pause-when-asleep/12490414#12490414
> 
> Ok, so Linux/Android are the problematic parts. Do we have anyone with good
> knowledge there we could ping?

I don't know if we have minimum kernel versions required for either Linux or Android, or whether our minimum required Android version implies a minimum kernel version.  Maybe :snorp would know more about the Android situation?
Flags: needinfo?(nfroyd) → needinfo?(snorp)
Oh, :jld might know something about our Linux kernel support, since he's groveled around in places that care about that.  Jed, see comment 11?
Flags: needinfo?(jld)
(In reply to Nathan Froyd [:froydnj] from comment #11)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #10)
> > (In reply to Nathan Froyd [:froydnj] from comment #9)
> > > (In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> > > > Nathan, what do you think?
> > > > Do you think staying consistently with incrementing over sleep times sounds
> > > > reasonable, given the last 3 comments?
> > > > 
> > > > The approach chosen here affects how we solve this in Telemetry.
> > > 
> > > Is it possible to address this in telemetry by always explicitly separating
> > > out things by platform (on the server/dashboard side) or by applying some
> > > sort of compensation for time spent asleep (possibly recorded in a separate
> > > histogram)?
> > 
> > We have a chance to submit proper values now and we want to specifically
> > measure time including sleep (and later also time excluding sleep etc.). So
> > we should fix i now and avoid accumulating weird server-side special cases.
> > If the decision for this bug is that it's ok for `TimeStamp`, then i will
> > implement something specifically for Telemetry that gives us what we need -
> > i just need to know :)
> 
> You want to measure time including and excluding sleep for everything?

Yes - starting with excluding sleep here. Browser activity/usage/... times are very important to our analysis needs and we need consistent data here.
 
> This paragraph confuses me.  Did you mean "...is that it's *not* ok for
> TimeStamp"?

I mean, if the decision here is to leave TimeStamp as is, then i will move on.
I just need to know either way, the current situation doesn't work for us.
One thing I know is that B2G ICS emulator is 2.6.29 but all the physical B2G ICS devices are 3.0.x.  I'm a little fuzzy on the larger Android ecosystem (specifically, I'm not sure if Android ICS shipped 2.6.29 on actual devices or just the emulator), but we still support Gingerbread for Fennec so I don't think we can require 2.6.39 there.  But... if I understand correctly, Android/B2G devices don't actually sleep like that, so if that's the case they could just use CLOCK_MONOTONIC.

On desktop, we support (and build on!) CentOS 6, which is on a kernel that calls itself 2.6.32 but has a lot of patches; I don't know offhand whether it has CLOCK_BOOTTIME.

If some kind of graceful degradation is possible and we're wondering what percentage of desktop Linux would be affected, FHR would have data on that but I'm pretty sure it's <5%.

Also maybe useful to know: Linux desktop automated tests other than gtests currently run on 3.2 (gtests run on the build host), and all tests will run on 3.13 kernels under TaskCluster.  This may have implications for test coverage.
Flags: needinfo?(jld)
kicked to p2 until we can pack into  a sprint (likely next)
Priority: P1 → P2
As Jed said, the Android minimum kernel is pretty low. Eugen did some work around this recently, though, so maybe he has some more insight.
Flags: needinfo?(snorp) → needinfo?(esawin)
Android 3 (Honeycomb) is based on 2.6.36, Android 4 (ICS) on 3.0.1 [1].

If monotonicity is not a hard requirement, then you could use CPU time [2] instead, which should give consistent results including "sleep times" across platforms.

[1] https://en.wikipedia.org/wiki/Android_version_history
[2] bug 1193821
Flags: needinfo?(esawin)
Priority: P2 → P3
Whiteboard: [unifiedTelemetry][measurement:client] → [measurement:client]
Priority: P3 → P4
Blocks: 1359403
One thing I came across while investigating bug 1359403 relating to this is that if TimeStamp *were* to use CLOCK_BOOTTIME on Linux, the sleeping code that the timer thread uses is based on CLOCK_MONOTONIC (via pthreads), and it doesn't appear that that code can use CLOCK_BOOTTIME.  That code would likely need to use different code for sleeping.

Also, CLOCK_BOOTTIME came out in Linux 2.6.39, which came out about 7 years ago now - are Android kernels still that far behind?
You need to log in before you can comment on or make changes to this bug.