Closed Bug 831404 Opened 7 years ago Closed 6 years ago

Intermittent services\datareporting\tests\xpcshell\test_policy.js | false == true

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect, P1)

x86
Windows 8
defect

Tracking

(firefox24 fixed, firefox25 fixed, firefox26 fixed)

RESOLVED FIXED
Firefox 26
Tracking Status
firefox24 --- fixed
firefox25 --- fixed
firefox26 --- fixed

People

(Reporter: emorley, Assigned: gps)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files)

Rev3 WINNT 5.1 mozilla-inbound debug test xpcshell on 2013-01-15 13:19:10 PST for push 8e7daee5f5a9

slave: talos-r3-xp-063

https://tbpl.mozilla.org/php/getParsedLog.php?id=18831369&tree=Mozilla-Inbound

{
TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 610] 1 == 1
Adjusting fake system clock to Wed Jan 16 2013 14:32:01 GMT-0800 (Pacific Standard Time)

TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 616] 1 == 1

TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 617] 0 == 0
Adjusting fake system clock to Wed Jan 16 2013 14:32:11 GMT-0800 (Pacific Standard Time)
Adjusting fake system clock to Wed Jan 16 2013 14:32:16 GMT-0800 (Pacific Standard Time)

TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 625] 1 == 1

TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 626] 1 == 1

TEST-INFO | (xpcshell/head.js) | test 3 pending

TEST-INFO | (xpcshell/head.js) | test 3 finished

TEST-INFO | (xpcshell/head.js) | test 2 finished

TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | Starting test_polling

TEST-INFO | (xpcshell/head.js) | test 2 finished

TEST-UNEXPECTED-FAIL | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | false == true - See following stack:
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_throw :: line 461
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_report_result :: line 563
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: _do_check_eq :: line 573
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_check_eq :: line 580
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_check_true :: line 594
JS frame :: C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js :: fakeCheckStateAndTrigger :: line 646
JS frame :: resource://gre/modules/services/datareporting/policy.jsm :: notify :: line 695
native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0
}
I pushed some logging additions which should make the root cause stand out.

My guess is that the Windows timer is imprecise, or that our use of Date was causing problems.

It's possible I also fixed the bug, but we'll see :)

https://hg.mozilla.org/integration/mozilla-inbound/rev/91db03dc9c5a
Whiteboard: [leave open]
We haven't seen this on a modern tree since rnewman's patch landed. So, I'm calling this one done.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → mozilla21
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Richard, this has flared up again in the last week - don't suppose you could take a look? :-)
Flags: needinfo?(rnewman)
Whiteboard: [disable-me 2013-04-01]
That last log has an additional test_sessionrecorder.js failure. What's happening in the failure is essentially:

1) Date.now() -> t0
2) nsITimer.init(50ms)
3) sleep 50ms
4) Date.now() -> t0

I know nsITimer sometimes fudges when it fires (it adjusts to account for previous deviation from the expected firing times). My guess is it is firing the 50ms timer immediately (or at least soon enough for Date.now to not increment by 1ms).

Perhaps if we used a precise timer for the sleep function (test only code). Or, we could always increase the sleep duration to reduce the risk for immediate timer firing.
Each of them is failing like this:

15:41:34     INFO -  Polled at 1363905694148 after 462ms, intended 500

which yes, implies that a timer is firing too soon, or we're not firing enough timers!
Flags: needinfo?(rnewman)
The equivalent code that's failing:

  let delay = 500;
  let then = Date.now();

  function onTimer() {
    let now = Date.now();
    let after = now - then;
    do_check_true(after > delay);
    then = Date.now();
  }

  Cc["@mozilla.org/timer;1"].createInstance(Ci.nsITimer)
                            .initWithCallback({
                              notify: function () {
                                onTimer();
                              }
                            }, delay, TYPE_REPEATING_SLACK);

This test essentially boils down to "do timers obey their spec?".

      // "Specified timer period will be at least the time between when
      // processing for last firing the callback completes and when the next
      // firing occurs."

That is, now[n] - then[n-1] > 500.

That is apparently not true on Windows.
khuey: could you take a look at comment 29, see if my analysis is correct? I'd welcome any suggestions.
Flags: needinfo?(khuey)
Well this has flared up because we brought Windows 8 tests online.  Every failure on trunk is on a Windows 8 box.  So perhaps something in the timing APIs is different on Windows 8?

As an aside, I would be *very* reluctant to make any assumptions on the ordering of XPCOM timers and Date.now, because they're sourced from two entirely different pieces of code.
Flags: needinfo?(khuey)
Thanks for the input!

> As an aside, I would be *very* reluctant to make any assumptions on the
> ordering of XPCOM timers and Date.now, because they're sourced from two
> entirely different pieces of code.

That's one of the pieces of advice in "Avoiding intermittent oranges", but on inspection we actually don't compare values between JS and XPCOM. The only dates we touch are Date.now() values, and the msec input to the nsITimer.

The only assumption here is that JS timestamp A, followed by a JS timestamp B grabbed in a 500msec timer handler, should be at least 500msec apart.

> Well this has flared up because we brought Windows 8 tests online.  Every
> failure on trunk is on a Windows 8 box.  So perhaps something in the timing
> APIs is different on Windows 8?

I wonder if it's worth kicking off a try push with DEBUG_TIMERS…
OS: Windows XP → Windows 8
Whiteboard: [disable-me 2013-04-01] → [disable-me 2013-04-07]
> The only assumption here is that JS timestamp A, followed by a JS timestamp
> B grabbed in a 500msec timer handler, should be at least 500msec apart.

You should not assume this. The timer code has fudge factors built in to predict when a timer event will hit the main event loop and sometimes dispatch timers a bit early. Also note that even minor variations in time from NTP clock adjustments will affect Date.now but will not affect timers. Basically this test is bogus and should be removed or significantly modified.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #49)
> > The only assumption here is that JS timestamp A, followed by a JS timestamp
> > B grabbed in a 500msec timer handler, should be at least 500msec apart.
> 
> You should not assume this. The timer code has fudge factors built in to
> predict when a timer event will hit the main event loop and sometimes
> dispatch timers a bit early.

Sounds like the MDN docs for nsITimer need to be clarified:

"Specified timer period will be at least the time between when processing for last firing the callback completes and when the next firing occurs."

Our test -- and the assumption above -- directly encode this piece of documentation as a test.

It looks like the timer code firing a timer up to 40msec early, which is quite surprising, particularly for a slack timer which implies that it should be waiting for the *next* tick (a little late), not firing early.

I will revise the nsITimer docs to remove any certainty!


> Also note that even minor variations in time
> from NTP clock adjustments will affect Date.now but will not affect timers.

I hope that we wouldn't be getting NTP adjusted within this 500msec range on (nearly?) every Windows push…

But yes, we care about this in the abstract.


> Basically this test is bogus and should be removed or significantly modified.

Gotcha. Are there any guarantees for how early the timer code can fire? If we tell it to wait for 500msec, can it fire earlier than (500msec - one event loop tick)?

I would prefer to adjust the test than discard it completely.
Does the fact that this test is failing on just Windows 8 seem to bother anybody else?

Is it possible that our wait code behaves differently due to a change in semantics inside the Win32 API when running on Windows 8? The rabbit hole tracing nsITimer has led me to PR_WaitCondVar and eventually https://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/w95cv.c#230 which uses WaitForSingleObject from the Win32 API. I doubt that changed in Windows 8, as that is a pretty core Win32 API and the world would break if it changed, I would think.

Now, what could be different about Windows 8 is different clocks in the different environments. I wouldn't be surprised if there was a difference between the clock source in Metro than in Win32 classic and we were somehow mixing them. There is more than 1 way to obtain the current time and time offsets, after all. Perhaps something in Windows 8 is different from Windows before.

At this point, this is a core platform issue or a Windows 8 specific problem. I don't want to see this test disabled (there might be legal implications if functionality covered by this test failed and went unnoticed). But, I'm not sure what more we can do to fix the problem that doesn't involve lapsing test coverage. Boo.
That last failure is actually different. Looks like the change I pushed made it reach the next point of failure, which is test_polling_implicit_acceptance.

That's weird, because that test isn't obviously timer-related.

15:34:21  WARNING -  TEST-UNEXPECTED-FAIL | C:/slave/test/build/tests/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | false == true - See following stack:
15:34:21     INFO -  JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_throw :: line 461
15:34:21     INFO -  JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_report_result :: line 563
15:34:21     INFO -  JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: _do_check_eq :: line 573
15:34:21     INFO -  JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_check_eq :: line 580
15:34:21     INFO -  JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_check_true :: line 594
15:34:21     INFO -  JS frame :: C:/slave/test/build/tests/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js :: CheckStateAndTriggerProxy :: line 721
15:34:21     INFO -  JS frame :: resource://gre/modules/services/datareporting/policy.jsm :: notify :: line 752

It looks like the issue is that after four 250ms ticks, our 750ms implicit acceptance time hasn't passed. That would be weird.

I'm prepping a patch to add some logging, see if it happens again.
Yup, on that version of Windows our 250ms timer is firing every 205ms, or thereabouts. Hooray.
This should fix the next failing test.
Attachment #729847 - Flags: review?(gps)
Attachment #729847 - Flags: review?(gps) → review+
(In reply to TinderboxPushlog Robot from comment #69)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=21154448&tree=Firefox
> WINNT 6.2 mozilla-central pgo test xpcshell on 2013-03-27 03:54:13
> slave: t-w864-ix-083
> 
> 04:45:53  WARNING -  TEST-UNEXPECTED-FAIL |
> C:
> \slave\test\build\tests\xpcshell\tests\services\datareporting\tests\xpcshell\
> test_policy.js | test failed (with xpcshell return code: 0), see following
> log:
> 04:45:53  WARNING -  TEST-UNEXPECTED-FAIL |
> C:/slave/test/build/tests/xpcshell/tests/services/datareporting/tests/
> xpcshell/test_policy.js | false == true - See following stack:
> 04:55:41    ERROR - Return code: 1

This was on the s-c merge push.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I have a one-line change for this that should fix Windows. Again.
https://hg.mozilla.org/mozilla-central/rev/400feaf9e495
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Target Milestone: mozilla21 → mozilla22
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The high failure frequency appears to be gone. I'm asserting this is no longer a candidate for disabling.

We'll continue to look at the remaining intermittent failures.
Whiteboard: [disable-me 2013-04-07]
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla22 → ---
Depends on: 860930
gps: take a look at this?
Flags: needinfo?(gps)
That's weird how we've encountered a spike in this failure.

Stefan is touching a lot of this code in bug 862563 and bug 850709 and I'm holding out hope his changes magically fix things. I was holding out hope a month ago too. But, his patches are very near r+ and I'd rather not bit rot them. If the spike continues, I can look into this.
Flags: needinfo?(gps)