Closed Bug 749894 Opened 12 years ago Closed 11 years ago

Intermittent test_performance_now.html | After a timeout, the value of now() should be strictly greater than before.

Categories

(Core :: DOM: Core & HTML, defect)

10 Branch
x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla15

People

(Reporter: philor, Assigned: Gijs)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 3 obsolete files)

Nice test messages.

https://tbpl.mozilla.org/php/getParsedLog.php?id=11281645&tree=Mozilla-Inbound
Rev3 WINNT 5.1 mozilla-inbound debug test mochitests-3/5 on 2012-04-27 16:40:14 PDT for push 1ed9525b2a9f

12352 INFO TEST-START | /tests/dom/tests/mochitest/general/test_performance_now.html
++DOMWINDOW == 56 (11D6A730) [serial = 8221] [outer = 05BEEDC0]
12353 INFO TEST-PASS | /tests/dom/tests/mochitest/general/test_performance_now.html | undefined
12354 INFO TEST-PASS | /tests/dom/tests/mochitest/general/test_performance_now.html | undefined
12355 INFO TEST-PASS | /tests/dom/tests/mochitest/general/test_performance_now.html | undefined
12356 INFO TEST-PASS | /tests/dom/tests/mochitest/general/test_performance_now.html | undefined
12357 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/general/test_performance_now.html | undefined
12358 INFO TEST-END | /tests/dom/tests/mochitest/general/test_performance_now.html | finished in 684ms
Yeah, those ok() calls should probably have a second argument.

Do we lack a high-precision timer on WinXP?  If we do, I can see the test failing due to the filter thing in timers.
(In reply to Boris Zbarsky (:bz) from comment #1)
> Yeah, those ok() calls should probably have a second argument.
> 
> Do we lack a high-precision timer on WinXP?  If we do, I can see the test
> failing due to the filter thing in timers.

Obviously the messages is something I can fix. But what is 'the filter thing'?

Also, I was afraid of this. :-\
Oh, basically you can't expect XPCOM timers to fire reliably.  When you ask for 20ms, it might fire the very next time through the event loop, or it might fire 100ms later.  The "filter thing" is just their attempt to fire "on time" that makes it possible for them to fire early....

If this only happens on WinXP, and if we don't have high-precision timers there, we should just skip the check on WinXP or something.  But we should double-check those to claims.
(In reply to Boris Zbarsky (:bz) from comment #3)
> Oh, basically you can't expect XPCOM timers to fire reliably.  When you ask
> for 20ms, it might fire the very next time through the event loop, or it
> might fire 100ms later.  The "filter thing" is just their attempt to fire
> "on time" that makes it possible for them to fire early....
> 
> If this only happens on WinXP, and if we don't have high-precision timers
> there, we should just skip the check on WinXP or something.  But we should
> double-check those to claims.

Could we abuse the offset parameter that they pass to throw another timeout if they cut us short? Or would that be too evil?

Regarding XP: bug 676349 seems to say they have no high-res timers on any Windows platform (where I consider high-res to be sub-ms).
I nixed that parameter a few months ago (bug 394769).  So no, you can't use it.

Bug 676349 was all about adding high-res TimeStamp on Windows, no?  Looks like it uses QueryPerformanceFrequency, which exists back to Win2k... but maybe not on some hardware (e.g. in VMs)?
(In reply to Boris Zbarsky (:bz) from comment #5)
> I nixed that parameter a few months ago (bug 394769).  So no, you can't use
> it.

:-(

> Bug 676349 was all about adding high-res TimeStamp on Windows, no?  Looks
> like it uses QueryPerformanceFrequency, which exists back to Win2k... but
> maybe not on some hardware (e.g. in VMs)?

bug 676349 comment #1 says:

> - "LONGLONG" QueryPerformanceCounter()
>   - always gives 1ms resolution, it is the best resolution timer available on Windows

And I thought that'd be milliseconds (that's how it's used everywhere else in that comment, AIUI). Anyway, googling proves my interpretation wrong, sorry for the confusion.

From reading the patch, it seems it'll fall back to GetTickCount if the calibration code detects a very large difference. I can't tell if this could be a real problem with the calibration code or if it's just XP (and/or how to tell the difference)...
(In reply to Boris Zbarsky (:bz) from comment #1)
> Yeah, those ok() calls should probably have a second argument.
> 
> Do we lack a high-precision timer on WinXP?  If we do, I can see the test
> failing due to the filter thing in timers.

OK, so irrespective of the XP situation, there'll probably always be platforms that miss this. Should we:
a) increase the timeout to something less likely to cause problems (100ms? 500ms?)
b) take out the timeout 'strict increase' test?
c) just disable the (entire?) test on each platform where we run into issues with the precision of the timer?

(I'd probably pick (a), but I'm not sure how high we can reasonably make this and if that'll successfully avoid the problem or if there'll always be a (significant) number of test runs that go orange because XPCOM timers screw us over)
SM 2.12:

http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1335880889.1335883711.26929.gz
WINNT 5.2 comm-central-trunk debug test mochitests-3/5 on 2012/05/01 07:01:29
s: cb-seamonkey-win32-03
{
9053 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/general/test_performance_now.html | undefined
}
Not sure this is the desired approach, I'll let bz decide that. Basically, the idea is to check whether the Date.now() time didn't increase, either, and whether there's no resolution difference between Date.now() and performance.now(), which signals we've fallen back to ms resolution. In this case, it does another timeout to see if we can get both of them to increase. I limited the number of tries so as not to hang the buildbot in case something ever goes seriously wrong.
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Attachment #621159 - Flags: review?(bzbarsky)
Comment on attachment 621159 [details] [diff] [review]
Add descriptions; hack around Windows issue

"n == d2 && n == d" makes no sense.  n has the zero point at page load start, while d and d2 have the zero point in 1970 or so.  Did you mean to compare d == d2?
(In reply to Boris Zbarsky (:bz) from comment #20)
> Comment on attachment 621159 [details] [diff] [review]
> Add descriptions; hack around Windows issue
> 
> "n == d2 && n == d" makes no sense.  n has the zero point at page load
> start, while d and d2 have the zero point in 1970 or so.  Did you mean to
> compare d == d2?

Errr, d'oh. OK, second attempt. If this is broken too, please feel free to r- and obsolete, and I'll try again later... I may be too tired to produce reasonable code at this point. :-\
Attachment #621159 - Attachment is obsolete: true
Attachment #621159 - Flags: review?(bzbarsky)
Attachment #621180 - Flags: review?(bzbarsky)
Comment on attachment 621180 [details] [diff] [review]
[checked in] Add descriptions; hack around Windows issues

You should probably just drop the |&& n + "" == n.toFixed(0)| bit; I don't see anything that would guarantee that.  Then r=me.
Attachment #621180 - Flags: review?(bzbarsky) → review+
Checked into inbound without the toFixed bits to get this cleared up:

http://hg.mozilla.org/integration/mozilla-inbound/rev/37088c921cb5
Target Milestone: --- → mozilla15
http://hg.mozilla.org/mozilla-central/rev/37088c921cb5
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
This has now changed to:
"12368 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/general/test_performance_now.html | After a timeout, the value of now() should be strictly greater than before. "

Given that was part of what was just checked in above, I'll morph this rather than filing new.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Intermittent test_performance_now.html | undefined → Intermittent test_performance_now.html | undefined (or: After a timeout, the value of now() should be strictly greater than before. ")
OK... now I'm not sure what could still be wrong, apart from the Windows implementation actually being broken in some way... :-\

I have an XP machine at home, but won't be able to look at this until at least Wednesday night, possibly Thursday (CEST). A quick way to have more info is to log the number of checks, d and d2, if anyone wants to look at this before then.

Which reminds me... what kind of tests does the TimeDuration/TimeStamp API for this actually have on its own?
Blocks: 539095
Summary: Intermittent test_performance_now.html | undefined (or: After a timeout, the value of now() should be strictly greater than before. ") → Intermittent test_performance_now.html | After a timeout, the value of now() should be strictly greater than before. ")
Attachment #621180 - Attachment description: Add descriptions; hack around Windows issues → [checked in] Add descriptions; hack around Windows issues
Attachment #621180 - Attachment is obsolete: true
Summary: Intermittent test_performance_now.html | After a timeout, the value of now() should be strictly greater than before. ") → Intermittent test_performance_now.html | After a timeout, the value of now() should be strictly greater than before.
Whiteboard: [orange]
The recent spike seems to coincide with some of Avi's timing work.
The test fails on XP, mostly in debug builds.

The part which fails is (semi pseudo code):

d = Date.now();
n = window.performance.now();

wait 20 ms (using setTimeout)

if (navigator.platform.indexOf("Win") != 0)
  ok(window.performance.now() > n, "expecting now() to have increased")
else
  iterate up to 5 times in 20ms timeouts as long as Date.now()==d
  AND performance.now()==n

ok(window.performance.now() > n, "expecting now() to have increased")


So the test can fail in one of the following cases:
1. 5 iterations of setTimeout(20) passed and performance.now() didn't change.
2. If it fails to detect the windows OS AND after setTimeout(20): performance.now() didn't change.
3. Date.time() changed but performance.now() didn't.


#1 would be a real bug.

#2 even without the windows "protection" of waiting 100ms instead of 20, I would have expected performance.now() to increase, as long as it's at least as accurate as Date.now() (since even back then we should have had a <20ms minimal resolution). Also, the fact that it was written to test that performance.now() changes after 100ms, points IMO to something seriously fishy to begin with.

3. This could be possible. What should the relation be between performance.now() and Date.now()?
Date.now() is wall-clock time, with millisecond accuracy.

performance.now() is monotonic time with microsecond accuracy.  So really, getting to performance.now() values in a row on trunk takes some doing, since getting window.performance takes order of a microsecond (yes, it's insane).

The fact that this was ever failing was, I assumed, due to the timer filter stuff causing the 20ms timeouts to fire way too early.  I have no idea why it would _still_ be failing with the timer filter removed, much less failing more often, unless performance.now() is actually just completely broken on those boxes.
(In reply to Boris Zbarsky (:bz) from comment #97)
> The fact that this was ever failing was, I assumed, due to the timer filter
> stuff causing the 20ms timeouts to fire way too early.  I have no idea why
> it would _still_ be failing with the timer filter removed, much less failing
> more often, unless performance.now() is actually just completely broken on
> those boxes.

The timer filter patch got backed out. This is now failing with the filter.

However, as much as we're not expecting the refresh driver to be part of this orange, Ryan says the regression range fits bug 842967 which landed a day or two before the timer filter patch. This bug will cause the refresh driver to target a frame later than it would without the patch, on cases where it returned too early in more than a frame.


Let's see what the logs say: https://tbpl.mozilla.org/?tree=Try&rev=ad9783bbd1dd
(In reply to Boris Zbarsky (:bz) from comment #97)
> Date.now() is wall-clock time, with millisecond accuracy.
> 
> performance.now() is monotonic time with microsecond accuracy.

I could be totally mistaken here (I'm speaking from memory and don't have time to doublecheck atm) but I believe this is not true on XP, where it's millisecond accuracy. Which seems dodgy as comment #95 says it's those boxen which are failing particularly.
While the most of the auto-comments on this bug are like comment #93 (for which we added the log prints), there are also few memleak failures like on comment #92, which look like this (and are probably a different symptom of the same basic issue which we tried to debug):

06:34:40 leaked 180 bytes during test execution
06:34:40 leaked 1 instance of Mutex with size 12 bytes
06:34:40 leaked 1 instance of ReentrantMonitor with size 16 bytes
06:34:40 leaked 3* instances of nsRunnable with size 12 bytes each (36 bytes total)
06:34:40 leaked 1 instance of nsTArray_base with size 4 bytes
06:34:40 leaked 1 instance of nsThread with size 112 bytes

* - number of leaked nsRunnable objects fluctuates. At my failures I've seen the values: 1,3,4,5 objects.

Got 2 M3 passes on XP-debug and 4 fails, and added few more which are still pending.

All the failures which my build produced are the memleak type, and the added log prints look fine.

It also appears that most (but definitely not all) of the recent M3 failures on inbound are the memleak type, but few are also the performance() test failing "normally", but unfortunately we don't have our log prints there...
The leak is bug 833769, our top non-Android intermittent orange. It's not related to this.
What exactly is the HW config of the XP machine?  I wonder if this is some QueryPerformanceCounter jitterring.  Where in C is performance.now() implemented?
207 nsPerformance::Now()
208 {
209   return GetDOMTiming()->TimeStampToDOMHighRes(mozilla::TimeStamp::Now());

All TimeStampToDOMHighRes does is:

98     mozilla::TimeDuration duration = aStamp - mNavigationStartTimeStamp;
99     return duration.ToMilliseconds();

So this is just pure TimeStamp behavior.
Wild hypothesis:

There's a potential time-limited deadlock which prevents getting updated performance.now() and/or Date.now(), and is more likely to happen on windows XP in debug builds.

If the deadlock doesn't happen or is released before or during the test starts: The test passes.

If deadlock is released after the test reported error but before shutdown: the test fails (performance.now() updates got held up).

If the deadlock isn't released until shutdown: we get the memleak failure.
Found one performance.now() failure from my logged build. Date.now() changed while performance.now() didn't ... :

test_performance_now.html | ALOG #0: D: 1362222758739 P: 453
test_performance_now.html | The value of now() should be equal to or greater than 0.
test_performance_now.html | The value of now() should monotonically increase.
test_performance_now.html | ALOG #1: perfnow: 453
test_performance_now.html | ALOG #2: navigator.platform: Win32  navigator.platform.indexOf("Win"): 0
test_performance_now.html | ALOG #3: Date.now(): 1362222758748
test_performance_now.html | ALOG #4: loop done: perfnow: 453
Also, note that the first setTimeout(20) returned after 9ms (1362222758748 - 1362222758739). So the 100ms Windows "slack" was, in fact, never used, since Date.now() got updated after the first timeout, and performance.now() stayed the same after those 9ms.
If this turns out to be an OS API failure, maybe we can workaround this by performance.now() implementation also checking Date.now(), and if performance < date, return date instead?
No.  performance.now() is supposed to be monotonic time, flowing at a constant rate.  Date.now() is the wall clock and is affected by NTP adjustments, the user messing with the control panel, etc.  They have totally different behavior and different flow rates.

The key part here is that 9ms had passed (unless NTP had adjusted the clock by 9ms for you just then) and performance.now() did not update.  If the resolution on performance.now() is in fact only 15ms or whatever on WinXP we should really fix that....  that's a serious bug.  The spec says it SHOULD be accurate to a microsecond and MAY be accurate to a millisecond if the system doesn't support the other, but being less accurate than that is a spec violation.
(In reply to Boris Zbarsky (:bz) from comment #111)
> The key part here is that 9ms had passed (unless NTP had adjusted the clock
> by 9ms for you just then) and performance.now() did not update.  If the
> resolution on performance.now() is in fact only 15ms or whatever on WinXP we
> should really fix that....  that's a serious bug.
> ...


Added 100 iterations of setTimeout(1) before the test starts, where each iteration logs the delta date.now and delta perf.now, and pushed to all platforms with mochitest 3. Let's see the perf.now resolution everywhere while we're at it. : https://tbpl.mozilla.org/?tree=Try&rev=799857063504

On my system resolution seems sub ms according to the values (#3 is a direct call, the first sample after setTimeout(1) is #4):

ALOG #3: Ellapsed: date.now: +0 perf.now: +0.014539259421383122
ALOG #4: Ellapsed: date.now: +0 perf.now: +0.5631824899422782
ALOG #5: Ellapsed: date.now: +5 perf.now: +5.444952653329324
ALOG #6: Ellapsed: date.now: +10 perf.now: +9.998734229179775
ALOG #7: Ellapsed: date.now: +10 perf.now: +10.48237841758123
ALOG #8: Ellapsed: date.now: +10 perf.now: +10.632047264566609
ALOG #9: Ellapsed: date.now: +12 perf.now: +11.949988368592471
ALOG #10: Ellapsed: date.now: +18 perf.now: +17.985491529598505
ALOG #11: Ellapsed: date.now: +27 perf.now: +27.00710200060206
ALOG #12: Ellapsed: date.now: +35 perf.now: +34.983596294370386

Also, note that the perf.now value on the xp-debug failures was an integer. I found another "proper" failure of this test (among the ~40 M3 retries), where the delta Date.now() was 11, and perf.now was also integer (and the test failed because perf.now didn't update after 11 ms).
I notice 4 classes of setTimeout(1) behavior:

1. *nix and OS X:
- setTimeout : Nice consistent behavior and intervals.
- perf.now   : true microseconds (upto float/double rounding errors).

2. Windows (not xp):
- setTimeout : Nice consistent behavior and intervals.
- perf.now   : monotonous, but doesn't look like true microsecond values.

3. Windows XP (opt or most debug):
- setTimeout : seems to have internal ~15ms resolution. Intervals are 0/1/2 or ~15ms.
- perf.now   : monotonous, but doesn't look like true microseconds values.

4. Windows XP (only saw in some of the debug tests):
- setTimeout: seems to have internal ~15ms resolution. Intervals are 0/1/2 or ~15ms.
- perf.now  : integer values.

So our bug is in case #4, where the sample below shows that date.now and perf.now don't always advance together, and we can have cases that date advances but perf doesn't.

Also, I pushed another build and added 50 setTimeout(0) before the 100 setTimeout(1). The logs shows that the XP machines still wait ~0ms or ~15ms. I would probably have expected setTimeout(0) to behave differently than with actual wait value, and not fall into this 15ms internal resolution, but it seems it doesn't: https://tbpl.mozilla.org/?tree=Try&rev=6dc2a2658168


Here are sample log extracts of the 4 classes:


Android:
-----------
- no performance_now test at Mochitest 3


1. Fedora, OS X
e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=20257642&tree=Try&full=1
-------------
ALOG #4: Ellapsed: date.now: +1 perf.now: +0.7200009999999963
ALOG #5: Ellapsed: date.now: +6 perf.now: +5.878120999999993
ALOG #6: Ellapsed: date.now: +8 perf.now: +7.933121
ALOG #7: Ellapsed: date.now: +9 perf.now: +8.528960999999995
ALOG #8: Ellapsed: date.now: +13 perf.now: +12.691760999999985
ALOG #9: Ellapsed: date.now: +13 perf.now: +13.293759999999992
ALOG #10: Ellapsed: date.now: +14 perf.now: +13.834960999999993
ALOG #11: Ellapsed: date.now: +14 perf.now: +14.292999999999992
ALOG #12: Ellapsed: date.now: +15 perf.now: +14.750961000000004
ALOG #13: Ellapsed: date.now: +15 perf.now: +15.197919999999996
ALOG #14: Ellapsed: date.now: +16 perf.now: +15.650000999999989


2. win-opt:
e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=20258438&tree=Try&full=1
--------
ALOG #58: Ellapsed: date.now: +51 perf.now: +50.43361320730304
ALOG #59: Ellapsed: date.now: +53 perf.now: +52.84401320476047
ALOG #60: Ellapsed: date.now: +55 perf.now: +54.83762149047078
ALOG #61: Ellapsed: date.now: +57 perf.now: +56.89479278232989
ALOG #62: Ellapsed: date.now: +59 perf.now: +58.830286319561196
ALOG #63: Ellapsed: date.now: +61 perf.now: +60.8284348199964
ALOG #64: Ellapsed: date.now: +64 perf.now: +63.83950522556012
ALOG #65: Ellapsed: date.now: +67 perf.now: +66.86510431824365
ALOG #66: Ellapsed: date.now: +70 perf.now: +69.84802539251291
ALOG #67: Ellapsed: date.now: +73 perf.now: +72.84865330420939
ALOG #68: Ellapsed: date.now: +76 perf.now: +75.78072397355959
ALOG #69: Ellapsed: date.now: +79 perf.now: +78.85308727790971


3. winxp opt ([filtered?] 15ms resolution?):
e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=20257968&tree=Try&full=1
---------------
ALOG #32: Ellapsed: date.now: +160 perf.now: +159.84545521847053
ALOG #33: Ellapsed: date.now: +175 perf.now: +175.50414927036815
ALOG #34: Ellapsed: date.now: +191 perf.now: +191.10026553654166
ALOG #35: Ellapsed: date.now: +191 perf.now: +191.26676714498632
ALOG #36: Ellapsed: date.now: +191 perf.now: +191.40533224194695
ALOG #37: Ellapsed: date.now: +191 perf.now: +191.5341195598882
ALOG #38: Ellapsed: date.now: +191 perf.now: +191.65983386156626
ALOG #39: Ellapsed: date.now: +192 perf.now: +191.78890054462232
ALOG #40: Ellapsed: date.now: +192 perf.now: +192.48507841080362
ALOG #41: Ellapsed: date.now: +192 perf.now: +192.62923081006107
ALOG #42: Ellapsed: date.now: +193 perf.now: +192.76556098610303
ALOG #43: Ellapsed: date.now: +193 perf.now: +192.90217052725973
ALOG #44: Ellapsed: date.now: +193 perf.now: +193.11448801453815
ALOG #45: Ellapsed: date.now: +193 perf.now: +193.2558467626472
ALOG #46: Ellapsed: date.now: +193 perf.now: +193.59611347252235
ALOG #47: Ellapsed: date.now: +193 perf.now: +193.73719285551658
ALOG #48: Ellapsed: date.now: +194 perf.now: +193.87743414316623
ALOG #49: Ellapsed: date.now: +206 perf.now: +206.71258497937583
ALOG #50: Ellapsed: date.now: +222 perf.now: +222.3696028405845
ALOG #51: Ellapsed: date.now: +222 perf.now: +222.51040285846386
ALOG #52: Ellapsed: date.now: +222 perf.now: +222.62857430204116
ALOG #53: Ellapsed: date.now: +222 perf.now: +222.75372987348953
ALOG #54: Ellapsed: date.now: +238 perf.now: +237.96013180446118


most xp-debug:
e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=20258429&tree=Try&full=1
-------------
- mostly same as xp-opt


4. some xp-debug
e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=20258434&tree=Try&full=1
------------
ALOG #4: Ellapsed: date.now: +1 perf.now: +0
ALOG #5: Ellapsed: date.now: +15 perf.now: +16
ALOG #6: Ellapsed: date.now: +28 perf.now: +32
ALOG #7: Ellapsed: date.now: +29 perf.now: +32
ALOG #8: Ellapsed: date.now: +30 perf.now: +32
ALOG #9: Ellapsed: date.now: +31 perf.now: +32
ALOG #10: Ellapsed: date.now: +31 perf.now: +32
ALOG #11: Ellapsed: date.now: +32 perf.now: +32
ALOG #12: Ellapsed: date.now: +33 perf.now: +32
ALOG #13: Ellapsed: date.now: +33 perf.now: +32
ALOG #14: Ellapsed: date.now: +34 perf.now: +32
ALOG #15: Ellapsed: date.now: +34 perf.now: +32
ALOG #16: Ellapsed: date.now: +35 perf.now: +32
ALOG #17: Ellapsed: date.now: +38 perf.now: +32
ALOG #18: Ellapsed: date.now: +41 perf.now: +47
ALOG #19: Ellapsed: date.now: +57 perf.now: +63
Ok; peformance.now() on that case #4 is likely not using QueryPerformanceCounter.  See the comments in http://dxr.mozilla.org/mozilla-central/xpcom/ds/TimeStamp_windows.cpp#l58 and other code below that; ew're probably disabling QPC on those machines..  This is likely due to the underlying hardware, HAL (ACPI vs. legacy), etc. that these particular machines are running.  I don't know what talos-r3-xp-083 is, but it would be useful to know.
> I would probably have expected setTimeout(0) to behave differently than with actual wait
> value

Arguments smaller than 4 are clamped to 4 (at least for nested timeouts, which we certainly have here).  So setTimeout(0) and setTimeout(4) would give you the same behavior.
(In reply to Boris Zbarsky (:bz) from comment #116)
> Arguments smaller than 4 are clamped to 4...

Thanks.

(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #115)
> ... ew're probably disabling QPC on those machines.. 
> This is likely due to the underlying hardware, HAL (ACPI vs. legacy), etc.
> that these particular machines are running.  I don't know what
> talos-r3-xp-083 is, but it would be useful to know.

So, if that's the case, then everything seems normal. It seems that on most of the test machines a single setTimeout would be enough to see performance.now() change, but on legacy HW we need to modify this test.

Do we want to distinguish the HW type and perform different test for each? can we do that? or do we test for the worst case of legacy HW?
(In reply to Avi Halachmi (:avih) from comment #117)
> Do we want to distinguish the HW type and perform different test for each?
> can we do that? or do we test for the worst case of legacy HW?

According to the logs, we can probably identify the HW type by checking if perf.now is an integer for 2-3 different values in a row. Ugly, but I'd say it should be pretty reliable.
This patch uses the integer test together with XP (navigator.oscpu) to determine if it's a low resolution timer. It also allows maximum of 1 setTimeout(1) in order to pass if it's not a low-res timer. For the low-res timer, it allows up to 50 iterations where Date.now() didn't change in more than 25ms. https://tbpl.mozilla.org/?tree=Try&rev=2fd856944d61

I ran 5 M3 on each desktop platform, 10 on windows platforms, and 20 on XP debug/opt. All performance.now() tests passed (I watched the log of each of the tests).

The low resolution case appeared on 4/20 runs on XP-opt, on 9/20 runs on XP-debug, and didn't appear on non-XP platforms.

Mostly 1-2 iterations were enough to get a change in perf.now value, but there were also cases of 12 or 22 iterations of setTimeout(1) to get a 15ms change in perf.now, with 8ms change in Date.now() ( https://tbpl.mozilla.org/php/getParsedLog.php?id=20276054&tree=Try&full=1 ).
Attachment #720434 - Flags: review?(bzbarsky)
Comment on attachment 720434 [details] [diff] [review]
Better spec compliance, stricter test for high res, better slack for low res

r=me.  Let's try this stuff...

The main worry is I'm not sure why we think the checks == 1 thing is reasonable to check even while we allow 2ms to elapse.  Can you at least comment that?
Attachment #720434 - Flags: review?(bzbarsky) → review+
If there is a way, you can turn on NSPR_LOG_MODULES="TimeStampWindows:5" to get logs to confirm we detect faulty QPC.
(In reply to Honza Bambas (:mayhemer) from comment #124)
> If there is a way, you can turn on NSPR_LOG_MODULES="TimeStampWindows:5" to
> get logs to confirm we detect faulty QPC.

https://tbpl.mozilla.org/?tree=Try&rev=88889739736c . I'll verify before landing. Thanks for the tip.
Attachment #720434 - Attachment is obsolete: true
Attachment #721063 - Flags: review+
- The test agrees with the NSPR log message of "QPC disabled" at 94% of the cases (30 of 32).

- On the 2 cases of disagreement, the test detected a low res timer, but there wasn't a "QPC disabled" message.
- On the disagreements, the values of delta performance.now() suggest that it is, in fact, a low resolution timer. FWIW, I think that currently the "integers only" test is a better indicator of low resolution timer than the jitter detection heuristics which is used by Firefox. I didn't try to understand why integer values indicate lowres timer.

try: https://tbpl.mozilla.org/?tree=Try&rev=88889739736c

Legends:
"test: lowres": the performance_now test decided that the timer is low resolution.
"dt" : delta Date.now() after performance.now() changed.
"dp" : delta performance.now() after it changed.
"N timeouts": N iterations of setTimeout(1) until performance.now changed.
"OK": test detected as low resolution timer, and also NSPR log message of "QPC Disabled".
"BAD": test lowres detection doesn't agree with NSPR log message.

winxp opt (16 runs of M3):
--------------------------
- BAD, test: lowres, 1 timeout,  dt  8ms, dp 15ms.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20366298&tree=Try&full=1


xp-debug (16 runs of M3):
-------------------------
- BAD, test: lowres, 2 timeouts, dt 20ms, dp 15ms.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20357690&tree=Try&full=1

- OK,  test: lowres, 2 timeouts, dt 21ms, dp 15ms.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20366809&tree=Try&full=1

- OK,  test: lowres, 1 timeout,  dt 8ms,  dp 16ms.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20366537&tree=Try&full=1

- OK,  test: lowres, 2 timeouts, dt 19ms, dp 15ms.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20366794&tree=Try&full=1

- OK,  test: lowres, 2 timeouts, dt 17ms, dp 16ms.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20366824&tree=Try&full=1

- OK,  test: lowres, 2 timeouts, dt 13ms, dp 15ms.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20366528&tree=Try&full=1
(In reply to Avi Halachmi (:avih) from comment #129)
> - The test agrees with the NSPR log message of "QPC disabled" at 94% of the
> cases (30 of 32).

Apologies. 100% agreement. There is a "QPC Disabled" message on the 2 cases which I indicated as "BAD". Not sure why I missed them.
So the fact that we're getting QPC disabled is bad and the increase in the occurrences of this problem suggest that bug 822490 has increased how often we disable QPC.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #131)
> So the fact that we're getting QPC disabled is bad and the increase in the
> occurrences of this problem suggest that bug 822490 has increased how often
> we disable QPC.

The existing test was designed to pass even if QPC is disabled. But it wasn't actually always passing on that case.

Since the test made it look like a valid case, and comment #115 didn't contradict that, my patch "fixes" it to properly handle the low-res case (and pass).

Should we remove the slack completely such that most of the "QPC Disabled" cases would fail the test? (most, because sometimes they do pass regardless)
Blocks: 822490
Avi, do we have access to the hardware the tests run on?  If I give you a piece of program that will do some QPC behavior analyzes, can you run it on the machine?  Also, what is the hardware (CPU, arch) just for curiosity?
(In reply to Honza Bambas (:mayhemer) from comment #134)
> Avi, do we have access to the hardware the tests run on?  If I give you a
> piece of program that will do some QPC behavior analyzes, can you run it on
> the machine?  Also, what is the hardware (CPU, arch) just for curiosity?

Discussed on IRC, logs from comment 129 are now enough info for me.
Blocks: 836869
No longer blocks: 836869
Depends on: 850846
(In reply to Vladan Djeric (:vladan) from comment #139)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/91952c264bcc

Thanks. Hopefully this will fix the oranges without the need to disable perf.now test, but this will make the test pass also when QPC is disabled, which isn't necessarily a good thing.

So I filed bug 850846 which is to decide if perf.now test should pass or fail when QPC is disabled. I set it as blocker for bug 836869 (improve QPC decision).
https://hg.mozilla.org/mozilla-central/rev/91952c264bcc
Status: REOPENED → RESOLVED
Closed: 12 years ago11 years ago
Resolution: --- → FIXED
Can we get this test-only fix uplifted to Aurora? Just starred an orange…
FYI, comment #142 and comment #144 are the old test. The fix seems solid so far.

(In reply to Richard Newman [:rnewman] from comment #143)
> Can we get this test-only fix uplifted to Aurora? Just starred an orange…

This seems pretty rare now, not sure it's worth the hassle of uplifting, your call.
(In reply to Avi Halachmi (:avih) from comment #145)

> This seems pretty rare now, not sure it's worth the hassle of uplifting,
> your call.

Concur, but please keep an eye on it!
(In reply to Richard Newman [:rnewman] from comment #146)
> Concur, but please keep an eye on it!

Always ;)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: