Closed Bug 363258 Opened 18 years ago Closed 17 years ago

bad millisecond resolution for (new Date).getTime() / Date.now() on Windows

Categories

(Core :: JavaScript Engine, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

VERIFIED FIXED

People

(Reporter: Swatinem, Assigned: robarnold)

References

Details

Attachments

(5 files, 4 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20061207 Minefield/3.0a1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20061207 Minefield/3.0a1

(new Date).getTime() should output the time accurate to one millisecond. But in my case it has a resolution of 15~16ms, which means that every time difference (by calling getTime() twice) is a multiple of 15~16ms, which makes profiling applications (f.ex. Firebug 1.0 Beta, lori and venkmal) completely useless.

Reproducible: Always

Steps to Reproduce:
Open the testcase I will attach.
Actual Results:  
time elapsed: 0
time elapsed: 16
time elapsed: 16
time elapsed: 16
time elapsed: 31
time elapsed: 31
time elapsed: 47
time elapsed: 47
time elapsed: 47
time elapsed: 63

Expected Results:  
the time elapsed accurate to one ms.
Attached file testcase
testcase that mesures script execution time.
You can change the iteration count of the for-loops to better show the bug.
It seems likely that your complaint is against your operating system (or your system load), which is exhibiting either a) poor resolution in its timer (very doubtful), or b) ~15ms long context-switch times, and not against the browser, which is simply reporting the time sample it was delivered from the operating system.

Using "wall-clock duration" as a profiling tool, especially for work-loads as small as this one, is bound to lead you on a bit of a wild-goose chase.  I'd suggest checking out the profiling features in Venkman instead.
Status: UNCONFIRMED → RESOLVED
Closed: 18 years ago
Resolution: --- → INVALID
btw, the output of your test case on my system (Mac OS with a dual-core Intel CPU) is this:

time elapsed: 5
time elapsed: 9
time elapsed: 13
time elapsed: 17
time elapsed: 21
time elapsed: 25
time elapsed: 29
time elapsed: 34
time elapsed: 38
time elapsed: 42
I think Venkman is also affected by this bug.
And maybe it is a Windows specific Bug? I asked a few friends of mine with different Windows-based machines and they could reproduce the bug.
Well it seems like it really is a problem with windows. After a little research i found the following site which explains the details: http://www.sykronix.com/researching/browser/loop/
And I also tested Opera with my testcase and it also shows this "bug".
But how does for example PHPs microtime() work that accurate? Maybe it uses some other OS timer? I don't understand much of this but isn't it possible for Mozilla to use the same time source for getTime() as PHP uses for its microtime() function?

That would change this "bug report" into a "feature request". For example "Use a more accurate OS timer for getTime()" or "Provide a timer with the resolution of microseconds (instead of milliseconds) for extension authors". Would that be possible?
I know we've talked about doing the equivalent of a portable RDTSC, but I think that facility is actually becoming increasingly _less_ reliable, and I don't know what its counterpart would be on non-Intel platforms.  In either case, it wouldn't be appropriate functionality for the Date() object to provide, and would probably be a non-portable "extension" of Firefox/spidermonkey even if we could somehow provide it (either through the DOM or through core JS facilities?)

A better approach to this for your problem is probably to try increasing your work-load until millisecond timing becomes less of an issue.
I think Boost and the PHP timer function on Win32 use this GetSystemTimeAsFileTime routine, something like this (though the PHP version looks overly complicated): http://www.mail-archive.com/php-general@lists.php.net/msg04981.html

Here's boost's version:

    //! Return the local time based on computer clock settings
    static time_type local_time() {
      FILETIME ft;
      #if BOOST_WORKAROUND(__MWERKS__, BOOST_TESTED_AT(0x3205))
      // Some runtime library implementations expect local times as the norm for ctime.
      FILETIME ft_utc;
      GetSystemTimeAsFileTime(&ft_utc);
      FileTimeToLocalFileTime(&ft_utc,&ft);
      #else
      GetSystemTimeAsFileTime(&ft);
      #endif
      return create_time(ft, LOCAL);
    }
We need this for our performance tests.

This works fine on Mac and Linux, but shows 15/16 millisecond resolution on Windows.  Rob Arnold seems to have some ideas about how to reduce it.
Status: RESOLVED → UNCONFIRMED
Resolution: INVALID → ---
Summary: bad millisecond resolution for (new Date).getTime() → bad millisecond resolution for (new Date).getTime() / Date.now() on Windows
Rob - off to you and Jim can you assist with Win32 timer APIs as needed..
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #8)
>but shows 15/16 millisecond resolution on Windows.

Possibly due to change by Bug 307527, use timeGetTime() instead of QueryPerformanceCounter(), in order to bypass MS Win's problem in QueryPerformanceCounter when Multi-CPU&ACPI. (Use of RDTSC instruction by user is also affected. Hotfix is avail, but Service Pack is not shipped yet.)
See Bug 307527 Comment #102 and see patch for Bug 307527.

Is problem observed on Firefox or Seamonkey without the change? (The patch is not applied at leaset Firefox 1.5.0.0)
Just looking at this preliminarily I'm getting what appears to be correct results on

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9a6pre) Gecko/20070628 Minefield/3.0a6pre

time elapsed: 2
time elapsed: 5
time elapsed: 6
time elapsed: 8
time elapsed: 10
time elapsed: 12
time elapsed: 14
time elapsed: 16
time elapsed: 40
time elapsed: 42

If this is related to a specific hardware configuration it's not what I'm running - Vista / Core 2 6700 @ 2.66GHz.

Second test case also appears to be working correctly. 
Attached file Timer comparison program (obsolete) —
(In reply to comment #12)
> Just look at this preliminarily I'm getting what appears to be correct
> results on

Yes, my Vista machine gives the correct results on both tests, however my XP machine gets incorrect results on both tests. I don't know about Windows Server 2003, but I suspect all versions of windows except vista likely exhibit this problem.

Attached is the source to a small C++ program that runs some different windows timer apis through a roughly equivalent version of the javascript test pages. I do remember at one point seeing incorrect output from this program on Vista (from the FileTime/SystemTime timers) but I can't seem to reproduce that. Behavior on XP is as expected.
FYI.
"Whether QueryPerformanceCounter or timeGetTime" seems to have no relation.
Next page says timer resolution of Win-XP(probably NT or later) is 10ms(or 15ms).
  http://www.microsoft.com/whdc/system/CEC/mm-timer.mspx
And next page says timeBeginPeriod is required to use higer resolution.
  http://msdn2.microsoft.com/en-us/library/ms810433.aspx
And Bug 307527 Comment #102 says that timeBeginPeriod is not currently issued by NSPR.
Possibly a result of timer related spec change(improvement) on Windows Vista.
I've found a forum (MSDN Forums»Software Development for Windows Vista) article who says next by google search.
> when an audio client is created in exclusive mode, timer resolution is changed from 1ms to 15ms.
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1322935&SiteID=1
And next is wikipedia article which says "exclusive mode(also called DMA mode)".
http://en.wikipedia.org/wiki/Technical_features_new_to_Windows_Vista#Audio_stack_architecture
Does someone know about relation between "timer resolution" and "exclusive mode" in MS Windows Vista?   
I was under the impression this was something new but it doesn't appear to be. It looks like Date calls PRMJ_Now -

http://lxr.mozilla.org/mozilla/source/js/src/prmjtime.c#147

and uses GetSystemTimeAsFileTime for the value. This call has a
resolution of about 10ms from what I'm reading on the net.

Running Rob's app on an XP VMWare image yields - 

Using timer: GetSystemTimeAsFileTime
0 0 0 0 0 10 0 10 0 10 0 10 10 10 10 10 10 10 10 20 10 10 10 20 10 20 20 20
20 20 21 20 20 20 20 30 20 30 30 20

Using timer: GetSystemTime
0 0 10 0 0 0 10 0 10 0 10 10 0 0 10 10 20 10 10 10 20 10 20 10 20 20 10 20 2
0 20 20 20 30 20 20 30 20 31 20 30

Using timer: timeGetSystemTime
0 0 0 0 0 10 4 4 5 5 6 6 9 7 9 9 11 11 12 12 13 13 14 15 16 17 18 18 19 20 2
1 21 21 22 23 24 24 25 30 27

Using timer: timeGetSystemTime interpolated with multimedia
0 0 1 1 3 3 4 4 4 5 6 6 8 7 9 10 11 10 13 13 13 13 15 17 16 17 17 19 18 20 2
0 21 23 22 23 24 24 26 27 26

Using timer: timeGetSystemTime interpolated with QueryPerformanceFrequency
0 0 2 1 2 3 4 4 6 6 7 8 8 9 9 10 11 10 12 12 13 14 14 16 17 17 19 19 19 19 2
0 21 22 23 23 28 25 26 24 27

Using timer: GetSystemTimeAsFileTime interpolated with multimedia
0 0 0 1 2 3 3 4 5 6 6 8 8 9 9 9 11 13 12 12 13 14 16 15 16 16 17 19 19 20 19
 20 22 22 22 25 24 24 26 27

Using timer: GetSystemTimeAsFileTime interpolated with QueryPerformanceFrequency
0 1 1 2 3 3 3 5 5 6 8 8 9 9 16 10 14 15 15 22 17 19 17 18 16 18 21 18 20 22
30 27 23 22 24 23 29 32 27 27 

So update Now to use timeGetTime? The only catch are the required calls to timeBeginPeriod and timeEndPeriod.

http://msdn2.microsoft.com/en-us/library/ms713418.aspx
Hmm, maybe not. 

"The timer APIs work on elapsed times since a moving epoch. For example, timeGetTime() measures elapsed time in "system time" or milliseconds since the operating system was last started. We've seen it generate measurements to individual milliseconds, so it is better than ftime. However, to relate system times to calendar dates and times, you need to use the pair of methods GetSystemTimeAsFileTime and FileTimeAsSystemTime. Unfortunately, the resulting milliseconds accuracy is only 1/100th (similar to ftime). So, timeGetTime appears to be the best of these APIs but its output doesn't translate to calendar dates and times."
Just an idea(pseudo code).
 1. D1=GetSystemTimeAsFileTime();timeBeginPeriod();T1=timeGetTime();
 2. D2=GetSystemTimeAsFileTime();T2=timeGetTime();timeEndPeriod();
 3. Return D1+(T2-T1) as system time with higher timer resolution; 
 Note: QueryPerformanceCounter can be used instead of timeGetTime,
       when MS Win for single CPU, or MS Win XP SP3 or later(not shipped yet).  
Attached patch Quick fix (obsolete) — Splinter Review
This is a preliminary patch that should address the issues that we see on tinderbox and js timing in general, however there are a few remaining issues to be sorted out before it can be deployed for general use.

1. If the user changes the time (or DST kicks in) while the program is running, that change is not picked up and the old time is used instead.

2. If the user suspends/hibernates while the program is running, that change is also not picked up.

3. Multi-core/dynamic clockspeed timing issues.

Since no (known/documented) API call seems to exist in Windows 2k/XP/Vista that by default gives better than 15 ms precision, the strategy seems to be to grab the system time and use a high-resolution timer to extrapolate the current date. Windows provides two different timers for our task: the multimedia timer and QueryPerformanceCounter/Frequency.

Multimedia timer:
Pros:
 * Can provide 1ms resolution
 * Does not seem to suffer from multi-core/speedstep issues.
 * Can query system time directly, no interpolation needed (timeGetSystemTime)
Cons:
 * Obtaining 1 ms precision comes at a cost of battery life. See "Battery Life Impact of High-Resolution Periodic Timers" in http://72.14.253.104/search?q=cache:taVMNra3EcoJ:download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/PM_apps.doc+timeBeginPeriod+battery+life&hl=en&ct=clnk&cd=2&gl=us&client=firefox-a
 * timeGetTime is a 32 bit value, so we have to worry about wrap around

QueryPerformanceCounter/Frequency:
See http://blogs.msdn.com/oldnewthing/archive/2005/09/02/459952.aspx
Pros:
 * Microseconds resolution 
 * 64 bit value will not wrap around for years
 * Does not cause an extra drain the battery/low overhead
Cons:
 * Accuracy depends on the available hardware
 * Availability depends on hardware, though any Pentium-class processor should be sufficient (it may rely upon the RDTSC instruction).

I don't like the idea of draining battery life, so I chose to use QueryPerformanceCounter. Microsoft has put together a small guide aimed at game developers who want accurate timing on Windows: http://msdn2.microsoft.com/en-us/library/bb173458.aspx
In this article, it puts out 5 recommendations; this patch follows 3 of them (1,2,5). The remaining recommendations are to compute timing on a single thread (All JS content does this in Firefox) and to set the thread affinity to one processor. The latter could be dealt with by setting/unsetting the affinity around calls to QueryPerformanceCounter, but I haven't found out how expensive those calls are.

Multicore issues:
The counter value from QueryPerformanceCounter may decrease. See http://channel9.msdn.com/ShowPost.aspx?PostID=156175
There is a hot fix for XP (http://support.microsoft.com/?id=896256) as these users discovered (and as WADA indicated), but it has not been pushed out by Microsoft so we cannot rely on it being present. That said, for users who do not have this hotfix, I don't think we should try to provide anything better than 15ms resolution (and I'm not sure that we can). Still, since our interpolation method relies upon the counter to steadily increase, the timing will be off which is why we cache the largest value we've seen and use that if our current value is less than it. Thus Date.now() will never decrease.

Similarly, the counter can leap ahead by a few seconds (http://support.microsoft.com/kb/q274323/).

There is also the issue that when no dedicated timer hardware is present, QPC seems to use the processor's frequency/tsc however modern mobile cpus dynamically scale their frequency so the tsc will update at a different rate than our code expects. I do not think there is a good way to detect processor frequency changes.

Dealing with these issues should be handled by detecting and correcting clock skew. What I haven't figured out is how often to check, how to check and what should be recalibrated.

Also in the patch is some custom mutex code. The JS engine is required to be multithread safe, so potentially there could be multiple threads invoking PRMJ_Now for the 'first' time. Since the nspr mutex code is probably not expecting simultaneous initialization on multiple threads with the same mutex, I wrote a small mutex structure that can be initialized to 0 before execution. This is also used to protect writes/reads to the largest value.
I ran across an internal API (http://msdn2.microsoft.com/en-us/library/ms724512.aspx) that I thought might be promising. As I somewhat expected, it is affected by the timeBeginPeriod/timeEndPeriod calls, so it is effectively the same as timeGetTime/GetSystemTimeAsFileTime. I also wondered if calling timeBegin/EndPeriod around calls to those functions would increase the resolution, but it appears that there is a delay (perhaps the old duration) before they take effect, so they are not useful.
Attachment #270253 - Attachment is obsolete: true
(In reply to comment #20)
> Created an attachment (id=270629) [details]
> New timer comparison program
> 

Are you looking for a r+ on this or is this a WIP?
Cc-ing to Wan-Teh Chang.
Is use of QueryPerfornanceCounter safe?
I think that "negative increase in two consecutive QueryPerformanceCounter()" should be cared and "if negative, then consider it as ZERO" is sufficient, in order to avoid RDTSC related issue when multiple CPU and "advanced power management" is enabled.

To Rob Arnold:   
A) Wraparound can occur between two consecutive timer.sample() requests. I think "negative increase" should be cared. What do you think?
B) If PC is suspended between two consecutive timer.sample() request, returned  date/time value after resume can become invalid(very old) one.
Do you think care on this is required?
(In reply to comment #21)
The timer program I posted is intended only to show the characteristics of the different timers available on Windows. The patch (comment #19) is not ready for review unless it becomes ok to break Date.now() under the aforementioned circumstances. As it stands, the patch will fix the issues on the tinderbox machines.

(In reply to comment #22)
> 
> To Rob Arnold:   
> A) Wraparound can occur between two consecutive timer.sample() requests. I
> think "negative increase" should be cared. What do you think?

Yes, it should be cared for; most programmers do not expect it. In the patch I proposed (comment #19), there are no negative increases however in the timer comparison program, negative increases are not handled. I did not feel it was necessary to handle them in the program since negative increases occur only with QueryPerformanceCounter on certain systems (I have not found one I can test on yet).

> B) If PC is suspended between two consecutive timer.sample() request, returned 
> date/time value after resume can become invalid(very old) one.
> Do you think care on this is required?
> 
Care is absolutely required. I suspended by machine for a few seconds today and the time reported by Date.now() jumped ahead by approx. 4 minutes. As I mentioned in comment #19, this issue is one of the problems that my proposed patch does not deal with, but needs to deal with before it can be checked in.

(In reply to comment #23)
> negative increases occur only with QueryPerformanceCounter on certain systems
Number of the "certain systems" doesn't seem to be so small.
See Bug 277547, Bug 319138 and so on, and already pointed Bug 307527, please. 
(In reply to comment #24)
> (In reply to comment #23)
> > negative increases occur only with QueryPerformanceCounter on certain systems
> Number of the "certain systems" doesn't seem to be so small.
> See Bug 277547, Bug 319138 and so on, and already pointed Bug 307527, please. 
> 

The number of filed bugs is not necessarily indicative of the number of affected systems. That said, I know that the timer program I wrote does not handle negative increases. The timer program is not intended to contain reference implementations of PRMJ_Now; it has several notable flaws, some of which you have discovered already: it does not handle negative increases, it does not handle suspend/hibernate, it does not handle the user changing the time while it is running, some implementations do not actually return the current date, and it is not threadsafe as PRMJ_Now will have to be.

If you looks at the patch I proposed (see comment #19, please), you will see that it handles negative increases and it should be thread safe. As I mentioned in comment #19, the final patch will have to address suspend/hibernate and user changes the operating system time; my current patch does not do this.
WADA: NSPR is not using QueryPerformanceCounter() now.  This
change was made in a patch for bug 307527.
Attached patch Revised patch (obsolete) — Splinter Review
This patch should be suitable for general use. While I do not have a machine with a broken implementation of QueryPerformanceCounter, I think this will work.

We check for clock skew every call; the exact tolerance is the 'tick' rate of the NT kernel itself, conveniently exposed through an api call. If clock skew is detected, we recalibrate which takes up to the tick rate (15.6 ms on my machine). This is longer than I would like, but I see no alternative without sacrificing accuracy. Also to avoid rounding errors that were causing some false clock skews detections, all the calculations are done with long doubles which should represent all 64 bit integers exactly; while this does increase the cost of each call, the cost to recalibrate after a clock skew is much larger.
Attachment #270617 - Attachment is obsolete: true
Attachment #270802 - Flags: review?(brendan)
I think this patch is much more complicated than it needs to be.  It seems like once the timebase origin has been calibrated, it should be fairly simple to figure out if the QueryPerformanceCounter is somehow out of whack because of one of the many bugs noted above.  Once calibrated the current low-res time shouldn't be more than a kernel tick different from the current time calculated based on the difference of the current counter value to the counter value at calibration time.

At calibration time:

lowResStart (from GetSystemTimeAsFileTime)
countStart (from QueryPerformanceCounter)

When PRMJ_Now is called:

lowResNow (from GetSystemTimeAsFileTime) 
countNow (from QueryPerformanceCounter)

highResNow = ( countNow - countStart ) / countFreq + lowResStart

if ( abs( highResNow - lowResNow ) > kernel tick ) then recalibrate

As for multi-core CPUs, we should also be able to detect if the thread has moved to a different processor and handle that as a special case, using SetProcessAffinityMask to shift the thread around to the original processor.  Intel has published code for doing just this:

http://www.intel.com/cd/ids/developer/asmo-na/eng/209859.htm

Are the threads involved here relatively long-lived?  Can we assume that Date.now() values will not be shared across threads?  If so, it might simplify things a *lot* to use thread-local storage, obviating the need for mutex calls in a timing routine.

Can we avoid the goto?
FYI.
Sun looks to have rejected request for use of etimer way (which Intel presented, set CPU Affinity) or technique of "disabling of frequency changing" in System.nanoTime() of Java on MS Windows.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6440250
> Synopsis : On Windows System.nanoTime() may be 25x slower than System.currentTimeMillis()
(In reply to comment #28)
> I think this patch is much more complicated than it needs to be.  It seems like
> once the timebase origin has been calibrated, it should be fairly simple to
> figure out if the QueryPerformanceCounter is somehow out of whack because of
> one of the many bugs noted above.  Once calibrated the current low-res time
> shouldn't be more than a kernel tick different from the current time calculated
> based on the difference of the current counter value to the counter value at
> calibration time.
> 
> At calibration time:
> 
> lowResStart (from GetSystemTimeAsFileTime)
> countStart (from QueryPerformanceCounter)
> 
> When PRMJ_Now is called:
> 
> lowResNow (from GetSystemTimeAsFileTime) 
> countNow (from QueryPerformanceCounter)
> 
> highResNow = ( countNow - countStart ) / countFreq + lowResStart
> 
> if ( abs( highResNow - lowResNow ) > kernel tick ) then recalibrate

This is pretty much what is going on (modulo some arithmetic).  There is one additional optimization to avoid recalibration for small negative increases in time.

> 
> As for multi-core CPUs, we should also be able to detect if the thread has
> moved to a different processor

This is a bad line of thought; we cannot assume how long we will be on a particular processor.

> and handle that as a special case, using
> SetProcessAffinityMask to shift the thread around to the original processor. 
> Intel has published code for doing just this:
> 
> http://www.intel.com/cd/ids/developer/asmo-na/eng/209859.htm

I considered that, but I do not know what the cost of SetThread/ProcessAffinityMask is.  There is also the problem of picking the processor to be our timer; the user may not want us to run on a particular processor (i.e. they have a more important/critical task) so we ought to respect that.  The multiprocessor/core issue has two cases: time is behind what we last reported and time is ahead of what we would have reported if we only ran on one processor/core.

We handle the first case by caching the last value returned; this avoids the expensive recalibration step (one kernel tick) and prevents negative increases (time going backwards) since most code does not handle that case.

The second case we cannot detect without switching to the processor/core we last ran on and even then, there's no guarantee that we'll switch soon enough. Every api call may "take a while."  I do not think the difference between any two processors will be more than one kernel tick (Windows claims to maintain some synchronization between them).  Of course, the only way to really know is to test on such a machine, but I don't think I have one.

> 
> Are the threads involved here relatively long-lived?

Yes, most calls will be from the main fx thread.

> Can we assume that Date.now() values will not be shared across threads?

Probably.  Most if not all JS code runs on a single thread in fx.  As for other Spidermonkey users, I don't know but I don't want to ignore them.

> If so, it might simplify things a *lot* to use thread-local storage, 
> obviating the need for mutex calls in a timing routine.

TLS might make things easier, but it is more expensive than the static variables that are currently used.  There is still the problem of multithreaded initialization, so mutexes are not entirely out but they would not have to be locked every time.  Also, this does require a per-thread memory allocation that we have no good way to free.  For fx, this is probably fine since we know the usage pattern, but for other applications this might not be ok.

> 
> Can we avoid the goto?
> 

Yes, if I move the static variables into the global scope and move calibration into its own function.  I dislike using goto, but it seemed simpler than doing the former.  Spidermonkey is no stranger to goto either; there are over 1000 instances of goto in the code.
(In reply to comment #30)

Looking at the code a little more, seems like we should be storing the timer state in a structure attached to the JSContext, initializing it within js_NewContext and cleaning up in js_DestroyContext.  Or alternatively within the JSThread which uses thread-local storage that is attached to each JSContext.  This would allow you to clean out all the mutex code, simplify the structure of the routine and get rid of the goto.  Mutex's are not cheap and this is a timing routine. 

With regard to the problem on multi-core machines, I was thinking we could use GetCurrentProcessorNumber but this appears to only be available on Vista and XP x64, so it's not a simple task to figure out when the thread changes processor.  Clamping it as the patch does is probably the simplest alternative.  This at least assures that the value returned from PRMJ_Now monotonically increases.

We should also probably only be calling QueryPerformanceFrequency once since it never changes after boot, or at least that's what the MS docs say...  ;)
(In reply to comment #31)
> (In reply to comment #30)
> 
> Looking at the code a little more, seems like we should be storing the timer
> state in a structure attached to the JSContext, initializing it within
> js_NewContext and cleaning up in js_DestroyContext.  Or alternatively within
> the JSThread which uses thread-local storage that is attached to each
> JSContext.  This would allow you to clean out all the mutex code, simplify the
> structure of the routine and get rid of the goto.  Mutex's are not cheap and
> this is a timing routine. 

I did consider attaching the static variables to a JSContext, however this involves the caller actually having a JSContext which is not always the case (see JS_Now in jsapi.c and random_init in jsmath.c).  We could pass NULL and fall back to the old timing code, but that seems inconsistent, especially since JS_Now is a public API.  I will look into eliminating the need to lock the mutex every call.  I don't think the way that I've implemented the mutex is expensive, no more than the cost of QueryPerformanceCounter or any system call.

> 
> With regard to the problem on multi-core machines, I was thinking we could use
> GetCurrentProcessorNumber but this appears to only be available on Vista and XP
> x64, so it's not a simple task to figure out when the thread changes processor.

It's also not very useful since we should assume we can and will switch to a different processor at any instruction.  Microsoft states in their documentation that "This function is used to provide information for estimating process performance" which is not how we'd be using it.

>  Clamping it as the patch does is probably the simplest alternative.  This at
> least assures that the value returned from PRMJ_Now monotonically increases.
> 
> We should also probably only be calling QueryPerformanceFrequency once since it
> never changes after boot, or at least that's what the MS docs say...  ;)
> 

Yes, I meant to look into that on a machine where the actual counter frequency changes with clockspeed. If it doesn't, then yeah we should only call it once.

I'll review on Monday, current patch or newer if you get results from that "counter freq varies with clock speed" machine.

/be
(In reply to comment #33)
> I'll review on Monday, current patch or newer if you get results from that
> "counter freq varies with clock speed" machine.

The current patch is not threadsafe and I could not get QueryPerformanceFrequency to return different values while the program was running. I have an idea to improve concurrency as well.

> /be
> 

From the MSDN page on QueryPerformanceFrequency:

"The QueryPerformanceFrequency function retrieves the frequency of the high-resolution performance counter, if one exists. The frequency cannot change while the system is running."

(http://msdn2.microsoft.com/en-us/library/ms644905.aspx)

From the Intel article describing their approach to working around changing processor frequencies and multi-core timing issues:

"This Enhanced Timer (Etimer) is based on two Windows* API functions; QueryPerformanceCounter and QueryPerformanceFrequency. There is no way to know which frequency Microsoft is using to implement those two functions on any given platform. But one thing for sure is that the frequency of that timer will never change during the course of timing. The timer can be either the chipset timer or the power management timer or something else."

(http://www.intel.com/cd/ids/developer/asmo-na/eng/209859.htm?page=1)

Has anyone seen QueryPerformanceFrequency vary?

FYI.
Following is one of Google search search results for "QueryPerformanceFrequency acpi dual core". 
 - http://www.gamedev.net/community/forums/topic.asp?topic_id=370281
This forum article is report of following phenomenon of QueryPerformanceCounter on AMD Athlon64 X2 in 2004 (this is the problem).
> Timer: 0.0375251740255071 NewTickCount: 36148061846142 LastTickCount: 36147986698979
> Timer: -0.0197190838817925 NewTickCount: 36148022357099 LastTickCount: 36148061846142
> Timer: 0.00889879405566819 NewTickCount: 36148040177646 LastTickCount: 36148022357099
This report says;
> Additionally, in another test, I called QueryPerformanceFrequency every frame as well,
> and it always returned the same number. So that's at least constant.
although it is not clear whether problem was observed or not when no QueryPerformanceFrequency change was observed.
FYI.
Another source of "no QueryPerformanceFrequency value change".
 - http://msdn2.microsoft.com/en-us/library/ms979201.aspx
This .NET Framework says;
> Add a constructor. In the constructor, call QueryPerformanceFrequency
This indicates there is no need to call QueryPerformanceFrequency multiple times. 
Attached patch Proposed solution (obsolete) — Splinter Review
Changes this patch:
* No more goto!
* QueryPerformanceFrequency is called only once.
Separate initialization, calibration, shutdown functions.
* Style fixes.
* No more custom mutexes; win32 critical sections are now used so that long waits during calibration will cause threads to sleep (and less code to maintain).
* The skew threshold is doubled to two kernel ticks to avoid some strange behavior with other programs causing a double clock skew (see comment on lines 408-412 in the patch).
* There is a (apparently for Vista only) optimization that reduces calibration time from 15.6 ms to 1 ms
* To avoid regressing startup time, there is a small static counter that reverts to the old behavior for the first n calls (where n = 10 currently). This value was obtained by printing out each call during startup and the start page was blank.
* More comments, including a pseudo code outline of PRMJ_Now which makes it easier to spot race conditions/deadlocks.

Although this patch is not perfect, I have spent long enough trying to make it so. The two things I would like to see fixed are the startup regression hack and the double kernel tick workaround. Any suggestions or ideas are welcome.
Attachment #270802 - Attachment is obsolete: true
Attachment #272213 - Flags: review?(brendan)
Attachment #270802 - Flags: review?(brendan)
Comment on attachment 272213 [details] [diff] [review]
Proposed solution

> #ifdef XP_WIN
> #include <windef.h>
> #include <winbase.h>
>+/* for fabs */
>+#include <math.h>
>+/* for timeBegin/EndPeriod */
>+#include <mmsystem.h>

Nit: prevailing style would put minor comments (unpunctuated, uncapitalized sentence fragments) on the same line and to the right of each commented #include, indented to line up nicely.

>+    /* The performance counter frequency */
>+    long double freq;
>+    /* The low res 'epoch' */
>+    long double offset;
>+    /* The high res 'epoch' */
>+    long double timer_offset;

Uber-nit: these could be indented on the right, or if the comment is long enough and a sentence, as written above -- but then a blank line before each but the first (because it follows an opening brace) is prevailing style.

>+    JSBool calibrated;

This should relieve the need for a local calibrated variable in PRMJ_Now, or I'm misreading.

>+static
>+void

Prevailing style puts static void on one line before the unindented function declarator.

>+PRMJ_NowCalibrate()

Statics should not be named as if they were API entry points. They don't need any prmj_ or other prefix, so NowCalibrate is minimal good style here.

>+        /* By wrapping a timeBegin/EndPeriod pair of calls around this loop,
>+           the loop seems to take much less time (1 ms vs 15ms) on Vista. */

Wild, any idea why?

>+        timeBeginPeriod(1);
>+        GetSystemTimeAsFileTime(&ftStart);
>+        do {
>+            GetSystemTimeAsFileTime(&ft);
>+        } while (memcmp(&ftStart,&ft, sizeof(ft)) == 0);

Noting the spin-on-syscall for order msec here, while caller owns a critical section.

>+static
>+#ifdef JS_THREADSAFE
>+PRStatus PR_CALLBACK
>+#else
>+void
>+#endif
>+PRMJ_NowInit(void)

Ditto previous style and naming comment, plus better to avoid #ifdefs by making it return a PRStatus always, and just move the return PR_SUCCESS out of the #ifdef here:

>+{
>+    memset(&calibration, 0, sizeof(calibration));
>+    PRMJ_NowCalibrate();
>+#ifdef JS_THREADSAFE
>+    InitializeCriticalSectionAndSpinCount(&calibration.calibration_lock, CALIBRATIONLOCK_SPINCOUNT);
>+    InitializeCriticalSectionAndSpinCount(&calibration.data_lock, DATALOCK_SPINCOUNT);
>+    return PR_SUCCESS;
>+#endif
>+}

Just for Win3.1 (ha ha) compat, or perhaps my own sanity, you could make NowInit be a JS_STATIC_DLL_CALLBACK(void) function named prmj_NowInit (the prefix was good style because the static was removed by a definition of JS_STATIC_DLL_CALLBACK for Win3.1 -- IIRC!). But the main change I'm suggesting here, the declspec or whatever to make NowInit have the right calling convention, seems not to be necessary any longer, so take this whole comment with a grain of salt.

Nice to write Win32 code without fear of old linker or compiler brain damage!

>+#endif // defined XP_WIN

Just // XP_WIN is enough per house style.

>+/*
>+
>+Win32 pseudo code

(Psueudo-Python -- worth noting?)

>+        else:
>+            needCalibration = True

Nit: else clause is overindented one basic offset.

>+    else:
>+      returnedTime = lowres
>+  while needCalibration
>+
>+*/
>+
> JSInt64
> PRMJ_Now(void)
> {
> #ifdef XP_OS2
>     JSInt64 s, us, ms2us, s2us;
>-    struct timeb b;
>+    timeb b;

Don't change #ifdef'd code you aren't testing, and do note that my injunction to use typdef'd struct names instead of struct tags applies only to our code. IIRC there is no OS2 or Unix typedef-name "timeb", so this line must remain "struct timeb b". Yay, C.

> #if defined(XP_UNIX) || defined(XP_BEOS)
>-    struct timeval tv;
>+    timeval tv;

Same comment.

>+    /* For non threadsafe platforms, PRMJ_NowInit is not necessary */
>+#ifdef JS_THREADSAFE
>+    PR_CallOnce(&calibrationOnce, PRMJ_NowInit);
>+#endif
>+    do {
>+        if (!calibration.calibrated || needsCalibration) {
>+            MUTEX_LOCK(&calibration.calibration_lock);
>+            MUTEX_LOCK(&calibration.data_lock);
>+
>+            /* Recalibrate only if no one else did before us */
>+            if(calibration.offset == cachedOffset) {
>+                /* Since calibration can take a while, make any other
>+                   threads immediately wait */
>+                MUTEX_SETSPINCOUNT(&calibration.data_lock, 0);

Once again I'm concerned that nesting criticals and *then* trying to avoid competitive spinning is not going to save other threads from being hung out to dry for msec or tens of msec. The way to avoid this is to use even shorter criticals in conjunction with condition variables or semaphores. Was that something considered but rejected because too complex? Hard to weigh code complexity and implementation burden vs. the spinning hazard.

If my Windows ignorance is showing, and setting the spincount to 0 after entering criticals will get other CPUs to stop spinning right away, perhaps I'm worrying about too small a spin cost. How much real time is that 4096 spin-limit equal to on modern hardware?

/be
(In reply to comment #39)

> >+    JSBool calibrated;
> 
> This should relieve the need for a local calibrated variable in PRMJ_Now, or
> I'm misreading.

Never mind, I misread, read, then forgot to remove this comment.

/be
(In reply to comment #39)
> (From update of attachment 272213 [details] [diff] [review]) 
> >+        /* By wrapping a timeBegin/EndPeriod pair of calls around this loop,
> >+           the loop seems to take much less time (1 ms vs 15ms) on Vista. */
> 
> Wild, any idea why?
Yes, timeBeginPeriod increases the kernel tick rate. The changes seems to take effect immediately on Vista, but it didn't seem to effect XP (or at least not before timeEndPeriod was called). We don't leave it on all the time because it does require more power so battery life on mobiles would be shortened (see my  comments in an earlier post)

> 
> >+        timeBeginPeriod(1);
> >+        GetSystemTimeAsFileTime(&ftStart);
> >+        do {
> >+            GetSystemTimeAsFileTime(&ft);
> >+        } while (memcmp(&ftStart,&ft, sizeof(ft)) == 0);
> 
> Noting the spin-on-syscall for order msec here, while caller owns a critical
> section.

Unfortunate, but yes. All other threads in this function will immediately block on the critical sections while the caller owns them.

> 
> Nice to write Win32 code without fear of old linker or compiler brain damage!
> 

Yes, it's much cleaner. Users who want Win9x support should know that I removed some Win9x specific workaround for dates around midnight (in case they want to port my changes to 9x).

> >+/*
> >+
> >+Win32 pseudo code
> 
> (Psueudo-Python -- worth noting?)

Noted. I had written a ton of python code the night before.

> 
> >+        else:
> >+            needCalibration = True
> 
> Nit: else clause is overindented one basic offset.

I should have run it through the python interpreter first :)

> >+
> > JSInt64
> > PRMJ_Now(void)
> > {
> > #ifdef XP_OS2
> >     JSInt64 s, us, ms2us, s2us;
> >-    struct timeb b;
> >+    timeb b;
> 
> Don't change #ifdef'd code you aren't testing, and do note that my injunction
> to use typdef'd struct names instead of struct tags applies only to our code.
> IIRC there is no OS2 or Unix typedef-name "timeb", so this line must remain
> "struct timeb b". Yay, C.
> 
> > #if defined(XP_UNIX) || defined(XP_BEOS)
> >-    struct timeval tv;
> >+    timeval tv;
> 
> Same comment.

I did a s/struct // over the range of my changes and I missed these. Good catch, thanks.

> 
> >+    /* For non threadsafe platforms, PRMJ_NowInit is not necessary */
> >+#ifdef JS_THREADSAFE
> >+    PR_CallOnce(&calibrationOnce, PRMJ_NowInit);
> >+#endif
> >+    do {
> >+        if (!calibration.calibrated || needsCalibration) {
> >+            MUTEX_LOCK(&calibration.calibration_lock);
> >+            MUTEX_LOCK(&calibration.data_lock);
> >+
> >+            /* Recalibrate only if no one else did before us */
> >+            if(calibration.offset == cachedOffset) {
> >+                /* Since calibration can take a while, make any other
> >+                   threads immediately wait */
> >+                MUTEX_SETSPINCOUNT(&calibration.data_lock, 0);
> 
> Once again I'm concerned that nesting criticals and *then* trying to avoid
> competitive spinning is not going to save other threads from being hung out to
> dry for msec or tens of msec. The way to avoid this is to use even shorter
> criticals in conjunction with condition variables or semaphores. Was that
> something considered but rejected because too complex? Hard to weigh code
> complexity and implementation burden vs. the spinning hazard.

Avoiding the competitive spinning after acquiring the lock is done because doing it before didn't make as much sense to me. After acquiring the calibration lock (which has no spinning since it will always be held for 'a while'), we want to spin on the data lock in order to acquire it as fast as possible, just as most threads will do when no calibration is needed. Windows makes no guarantee about the order that threads are granted access to critical sections (or any other synchronization primitive), so I figured sleeping would decrease the chance of acquiring the lock since another thread would be scheduled which might vie for the lock and it will be spinning for it. I definitely do not want the norm to be no spinning since in the common case the data critical section will be held for a short period of time. If this becomes a worry, then the code could be modified to pull the data members from the calibration structure onto the thread's stack, release the critical section, then perform the computation, relock, update the .last field and then release.

> 
> If my Windows ignorance is showing, and setting the spincount to 0 after
> entering criticals will get other CPUs to stop spinning right away, perhaps I'm
> worrying about too small a spin cost. How much real time is that 4096
> spin-limit equal to on modern hardware?

I don't know if setting it will cause other threads to stop spinning if they're already spinning, but I imagine that it's more likely for threads to try and enter the critical section after we've set the spin count to 0.

For those who are curious, 4096 is approximately the number of cycles that Microsoft uses for the critical section for their heap (it is mentioned in the docs for InitializeCriticalSectionWithSpinCount and a few other places).

> 
> /be
>
Attachment #272213 - Attachment is obsolete: true
Attachment #272259 - Flags: review?(brendan)
Attachment #272213 - Flags: review?(brendan)
Comment on attachment 272259 [details] [diff] [review]
Proposed solution

Looks good, thanks for all the follow-through. Only remaining thought, fodder for a followup patch at this point: should there be some metering optioinally compiled in to tell how often the machinery recalibrates, takes the slow road, etc.?

/be
Attachment #272259 - Flags: review?(brendan) → review+
Assignee: general → robarnold
I checked Rob's patch in for him, marking this as FIXED.
Status: NEW → RESOLVED
Closed: 18 years ago17 years ago
Resolution: --- → FIXED
With this patch building the JS shell with "make -f Makefile.ref" fails at the link stage unless I set OTHER_LIBS=winmm.lib first. Probably all config/WINNTx.y.mk files must be updated so that LIB_LINK_FLAGS includes winmm.lib?
Built 1.8.1, 1.9.0 opt, debug shells with this. Blake, pretty please?
Attachment #272778 - Flags: review?(mrbkap)
Attachment #272778 - Flags: review?(mrbkap) → review+
Comment on attachment 272778 [details]
patch for WINNT*.mk to build shell

/cvsroot/mozilla/js/src/config/WINNT4.0.mk,v  <--  WINNT4.0.mk
new revision: 1.15; /cvsroot/mozilla/js/src/config/WINNT5.0.mk,v  <--  WINNT5.0.mk
new revision: 1.9; /cvsroot/mozilla/js/src/config/WINNT5.1.mk,v  <--  WINNT5.1.mk
new revision: 1.5; /cvsroot/mozilla/js/src/config/WINNT5.2.mk,v  <--  WINNT5.2.mk
new revision: 1.4; 

Thanks Seno!
/cvsroot/mozilla/js/tests/js1_5/extensions/regress-363258.js,v  <--  regress-363258.js
initial revision: 1.1
Flags: in-testsuite+
Note to self: this test passes in the shell and browser on standalone physical machines with winxp installed, but on the winxp vms it passes in the shell but fails in the browser.

verified fixed 1.9.0 winxp
Status: RESOLVED → VERIFIED
(In reply to comment #48)
> Note to self: this test passes in the shell and browser on standalone physical
> machines with winxp installed, but on the winxp vms it passes in the shell but
> fails in the browser.
> 
> verified fixed 1.9.0 winxp
> 

Out of curiosity, what's the host os and cpu of the vm?
(In reply to comment #49)

> Out of curiosity, what's the host os and cpu of the vm?

vmware esx. I don't know the cpu.
Flags: wanted1.9.2?
Flags: in-litmus+
Flags: blocking1.9.2?
Flags: blocking1.9.0.8?
Flags: blocking1.9.0.7?
Flags: blocking1.9.0.6?
Flags: blocking1.8.1.next?
Flags: blocking-fennec1.0?
This was fixed on the 1.9.0 branch on 2007-07-16. Please don't play with the nomination flags.
Flags: wanted1.9.2?
Flags: blocking1.9.2?
Flags: blocking1.9.0.8?
Flags: blocking1.9.0.7?
Flags: blocking1.9.0.6?
Flags: blocking-fennec1.0?
Not a 1.8.1 blocker, and I really doubt this is in litmus.
Flags: in-litmus?
Flags: in-litmus+
Flags: blocking1.8.1.next?
Flags: blocking1.8.1.next-
Flags: in-litmus?
Blocks: 603872
No longer blocks: 603872
Depends on: 603872
You need to log in before you can comment on or make changes to this bug.