831404 - Intermittent services\datareporting\tests\xpcshell\test_policy.js | false == true

Reporter

Description

•

13 years ago

Rev3 WINNT 5.1 mozilla-inbound debug test xpcshell on 2013-01-15 13:19:10 PST for push 8e7daee5f5a9
slave: talos-r3-xp-063
https://tbpl.mozilla.org/php/getParsedLog.php?id=18831369&tree=Mozilla-Inbound

{
TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 610] 1 == 1
Adjusting fake system clock to Wed Jan 16 2013 14:32:01 GMT-0800 (Pacific Standard Time)
TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 616] 1 == 1
TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 617] 0 == 0
Adjusting fake system clock to Wed Jan 16 2013 14:32:11 GMT-0800 (Pacific Standard Time)
Adjusting fake system clock to Wed Jan 16 2013 14:32:16 GMT-0800 (Pacific Standard Time)
TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 625] 1 == 1
TEST-PASS | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | [test_delete_remote_data_in_progress_upload : 626] 1 == 1
TEST-INFO | (xpcshell/head.js) | test 3 pending
TEST-INFO | (xpcshell/head.js) | test 3 finished
TEST-INFO | (xpcshell/head.js) | test 2 finished
TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | Starting test_polling
TEST-INFO | (xpcshell/head.js) | test 2 finished
TEST-UNEXPECTED-FAIL | C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | false == true - See following stack:
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_throw :: line 461
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_report_result :: line 563
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: _do_check_eq :: line 573
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_check_eq :: line 580
JS frame :: C:\talos-slave\test\build\xpcshell\head.js :: do_check_true :: line 594
JS frame :: C:/talos-slave/test/build/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js :: fakeCheckStateAndTrigger :: line 646
JS frame :: resource://gre/modules/services/datareporting/policy.jsm :: notify :: line 695
native frame :: <unknown filename> :: <TOP_LEVEL> :: line 0
}

Ed Morley [:emorley]

Reporter

Comment 1

•

13 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=18857612&tree=Mozilla-Inbound

Ed Morley [:emorley]

Reporter

Comment 2

•

13 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=18857513&tree=Mozilla-Inbound

Ed Morley [:emorley]

Reporter

Comment 3

•

13 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=18860588&tree=Mozilla-Inbound

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 7

•

13 years ago

I pushed some logging additions which should make the root cause stand out. My guess is that the Windows timer is imprecise, or that our use of Date was causing problems. It's possible I also fixed the bug, but we'll see :) https://hg.mozilla.org/integration/mozilla-inbound/rev/91db03dc9c5a

Whiteboard: [leave open]

Ryan VanderMeulen [:RyanVM]

Comment 8

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/91db03dc9c5a

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 10

•

13 years ago

We haven't seen this on a modern tree since rnewman's patch landed. So, I'm calling this one done.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Whiteboard: [leave open]

Target Milestone: --- → mozilla21

Comment hidden (Legacy TBPL/Treeherder Robot)

Ed Morley [:emorley]

Reporter

Updated

•

13 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Comment hidden (Legacy TBPL/Treeherder Robot)

Ed Morley [:emorley]

Reporter

Comment 26

•

13 years ago

Richard, this has flared up again in the last week - don't suppose you could take a look? :-)

Flags: needinfo?(rnewman)

Whiteboard: [disable-me 2013-04-01]

Gregory Szorc [:gps]

Assignee

Comment 27

•

13 years ago

That last log has an additional test_sessionrecorder.js failure. What's happening in the failure is essentially:

1) Date.now() -> t0
2) nsITimer.init(50ms)
3) sleep 50ms
4) Date.now() -> t0

I know nsITimer sometimes fudges when it fires (it adjusts to account for previous deviation from the expected firing times). My guess is it is firing the 50ms timer immediately (or at least soon enough for Date.now to not increment by 1ms). Perhaps we could use a precise timer for the sleep function (test-only code). Or we could always increase the sleep duration to reduce the risk of immediate timer firing.
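One way to picture a sleep that cannot return before the clock has moved is sketched below. This is not code from the tree, and it is a different technique from the two options suggested above: `sleepAtLeast` is a hypothetical helper that re-arms its timer until `Date.now()` has advanced by the requested amount. It uses plain `setTimeout` rather than nsITimer so that it runs standalone, and the `now` and `schedule` parameters exist only so the logic can be driven by a fake clock.

```javascript
// Hypothetical test-only helper; not taken from the Firefox tree.
// `now` and `schedule` are injectable so the logic can be exercised with a fake clock.
function sleepAtLeast(ms, now = Date.now, schedule = setTimeout) {
  const start = now();
  return new Promise((resolve) => {
    function check() {
      const elapsed = now() - start;
      if (elapsed >= ms) {
        // The clock really has advanced by the requested amount.
        resolve(elapsed);
      } else {
        // The timer fired early, or the clock has not ticked yet:
        // wait for at least 1 ms of the remainder and check again.
        schedule(check, Math.max(1, ms - elapsed));
      }
    }
    schedule(check, ms);
  });
}
```

A caller would write something like `sleepAtLeast(50).then(next)`. The trade-off is that the sleep can run longer than requested, which is usually acceptable in test-only code.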

Richard Newman [:rnewman]

Comment 28

•

13 years ago

Each of them is failing like this:

15:41:34 INFO - Polled at 1363905694148 after 462ms, intended 500

which yes, implies that a timer is firing too soon, or we're not firing enough timers!

Flags: needinfo?(rnewman)

Richard Newman [:rnewman]

Comment 29

•

13 years ago

The equivalent code that's failing:

  let delay = 500;
  let then = Date.now();
  function onTimer() {
    let now = Date.now();
    let after = now - then;
    do_check_true(after > delay);
    then = Date.now();
  }
  Cc["@mozilla.org/timer;1"].createInstance(Ci.nsITimer)
    .initWithCallback({ notify: function () { onTimer(); } }, delay, TYPE_REPEATING_SLACK);

This test essentially boils down to "do timers obey their spec?".

  // "Specified timer period will be at least the time between when
  // processing for last firing the callback completes and when the next
  // firing occurs."

That is, now[n] - then[n-1] > 500. That is apparently not true on Windows.

Richard Newman [:rnewman]

Comment 30

•

13 years ago

We are in undocumented murky waters here: https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges#Tests_that_depend_on_time_differences_or_comparison

Richard Newman [:rnewman]

Comment 31

•

13 years ago

khuey: could you take a look at comment 29, see if my analysis is correct? I'd welcome any suggestions.

Flags: needinfo?(khuey)

Comment hidden (Legacy TBPL/Treeherder Robot)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 39

•

13 years ago

Well this has flared up because we brought Windows 8 tests online. Every failure on trunk is on a Windows 8 box. So perhaps something in the timing APIs is different on Windows 8? As an aside, I would be *very* reluctant to make any assumptions on the ordering of XPCOM timers and Date.now, because they're sourced from two entirely different pieces of code.

Flags: needinfo?(khuey)

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 41

•

13 years ago

Thanks for the input!

> As an aside, I would be *very* reluctant to make any assumptions on the
> ordering of XPCOM timers and Date.now, because they're sourced from two
> entirely different pieces of code.

That's one of the pieces of advice in "Avoiding intermittent oranges", but on inspection we actually don't compare values between JS and XPCOM. The only dates we touch are Date.now() values, and the msec input to the nsITimer. The only assumption here is that JS timestamp A, followed by a JS timestamp B grabbed in a 500msec timer handler, should be at least 500msec apart.

> Well this has flared up because we brought Windows 8 tests online. Every
> failure on trunk is on a Windows 8 box. So perhaps something in the timing
> APIs is different on Windows 8?

I wonder if it's worth kicking off a try push with DEBUG_TIMERS…

OS: Windows XP → Windows 8

Whiteboard: [disable-me 2013-04-01] → [disable-me 2013-04-07]

Comment hidden (Legacy TBPL/Treeherder Robot)

Benjamin Smedberg

Comment 49

•

13 years ago

> The only assumption here is that JS timestamp A, followed by a JS timestamp
> B grabbed in a 500msec timer handler, should be at least 500msec apart.

You should not assume this. The timer code has fudge factors built in to predict when a timer event will hit the main event loop and sometimes dispatch timers a bit early. Also note that even minor variations in time from NTP clock adjustments will affect Date.now but will not affect timers.

Basically this test is bogus and should be removed or significantly modified.
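The point about clock adjustments can be illustrated with a small stopwatch that reads a monotonic clock instead of `Date.now()`. This is only a sketch under assumptions: `makeStopwatch` is a hypothetical name, and the default relies on `performance.now()` being available, which is true in browsers and recent Node.js but was not checked for xpcshell. It sidesteps wall-clock adjustments only; it does nothing about timers being dispatched early.

```javascript
// Hypothetical helper; assumes the environment exposes performance.now().
// A monotonic clock is not moved by NTP or manual wall-clock adjustments.
function makeStopwatch(monotonicNow = () => performance.now()) {
  const start = monotonicNow();
  // Returns the number of milliseconds elapsed since the stopwatch was created.
  return () => monotonicNow() - start;
}
```

A timer callback would call the returned function to get its elapsed time, rather than subtracting two `Date.now()` readings.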

Richard Newman [:rnewman]

Comment 50

•

13 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #49)
> > The only assumption here is that JS timestamp A, followed by a JS timestamp
> > B grabbed in a 500msec timer handler, should be at least 500msec apart.
>
> You should not assume this. The timer code has fudge factors built in to
> predict when a timer event will hit the main event loop and sometimes
> dispatch timers a bit early.

Sounds like the MDN docs for nsITimer need to be clarified:

"Specified timer period will be at least the time between when processing for last firing the callback completes and when the next firing occurs."

Our test -- and the assumption above -- directly encode this piece of documentation as a test. It looks like the timer code is firing a timer up to 40msec early, which is quite surprising, particularly for a slack timer, which implies that it should be waiting for the *next* tick (a little late), not firing early. I will revise the nsITimer docs to remove any certainty!

> Also note that even minor variations in time
> from NTP clock adjustments will affect Date.now but will not affect timers.

I hope that we wouldn't be getting NTP adjusted within this 500msec range on (nearly?) every Windows push… But yes, we care about this in the abstract.

> Basically this test is bogus and should be removed or significantly modified.

Gotcha. Are there any guarantees for how early the timer code can fire? If we tell it to wait for 500msec, can it fire earlier than (500msec - one event loop tick)? I would prefer to adjust the test than discard it completely.

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 52

•

13 years ago

Timer slop: https://hg.mozilla.org/services/services-central/rev/f260465b7a29 https://hg.mozilla.org/integration/mozilla-inbound/rev/8bf75f6d2fd7
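For readers without the changesets to hand, a slop-tolerant check could look roughly like the sketch below. It is not the content of the landed patches: `TIMER_SLOP_MS` and `firedWithinSlop` are hypothetical names, and the 50 ms tolerance is an assumed value chosen to cover the roughly 40 ms of early firing reported earlier in this bug.

```javascript
// Assumed tolerance; the thread reports timers firing up to ~40 ms early.
const TIMER_SLOP_MS = 50;

// Returns true when the callback ran no more than `slopMs` before it was due.
function firedWithinSlop(elapsedMs, intendedMs, slopMs = TIMER_SLOP_MS) {
  return elapsedMs >= intendedMs - slopMs;
}
```

With these values the "after 462ms, intended 500" poll from comment 28 would be accepted, while a callback arriving 100 ms early would still fail the check.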

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 56

•

13 years ago

Does the fact that this test is failing on just Windows 8 seem to bother anybody else? Is it possible that our wait code behaves differently due to a change in semantics inside the Win32 API when running on Windows 8?

The rabbit hole tracing nsITimer has led me to PR_WaitCondVar and eventually https://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/w95cv.c#230 which uses WaitForSingleObject from the Win32 API. I doubt that changed in Windows 8, as that is a pretty core Win32 API and the world would break if it changed, I would think.

Now, what could be different about Windows 8 is different clocks in the different environments. I wouldn't be surprised if there was a difference between the clock source in Metro than in Win32 classic and we were somehow mixing them. There is more than 1 way to obtain the current time and time offsets, after all. Perhaps something in Windows 8 is different from Windows before.

At this point, this is a core platform issue or a Windows 8 specific problem. I don't want to see this test disabled (there might be legal implications if functionality covered by this test failed and went unnoticed). But, I'm not sure what more we can do to fix the problem that doesn't involve lapsing test coverage. Boo.

Richard Newman [:rnewman]

Comment 57

•

13 years ago

That last failure is actually different. Looks like the change I pushed made it reach the next point of failure, which is test_polling_implicit_acceptance. That's weird, because that test isn't obviously timer-related.

15:34:21 WARNING - TEST-UNEXPECTED-FAIL | C:/slave/test/build/tests/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js | false == true - See following stack:
15:34:21 INFO - JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_throw :: line 461
15:34:21 INFO - JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_report_result :: line 563
15:34:21 INFO - JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: _do_check_eq :: line 573
15:34:21 INFO - JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_check_eq :: line 580
15:34:21 INFO - JS frame :: C:\slave\test\build\tests\xpcshell\head.js :: do_check_true :: line 594
15:34:21 INFO - JS frame :: C:/slave/test/build/tests/xpcshell/tests/services/datareporting/tests/xpcshell/test_policy.js :: CheckStateAndTriggerProxy :: line 721
15:34:21 INFO - JS frame :: resource://gre/modules/services/datareporting/policy.jsm :: notify :: line 752

It looks like the issue is that after four 250ms ticks, our 750ms implicit acceptance time hasn't passed. That would be weird. I'm prepping a patch to add some logging, see if it happens again.

Richard Newman [:rnewman]

Comment 58

•

13 years ago

Logging: https://hg.mozilla.org/integration/mozilla-inbound/rev/1555815d144a

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Comment 60

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/8bf75f6d2fd7 https://hg.mozilla.org/mozilla-central/rev/1555815d144a

Assignee: nobody → rnewman

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 62

•

13 years ago

Yup, on that version of Windows our 250ms timer is firing every 205ms, or thereabouts. Hooray.
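The effect of a 250ms timer that actually fires about every 205ms on a 750ms window can be checked with a little arithmetic. The helper names below are hypothetical, and how the test counts its ticks against the acceptance window is not shown in this thread, so this only compares nominal and observed pacing.

```javascript
// Hypothetical helpers for the numbers quoted in comments 57 and 62.

// Total time covered by `intervals` consecutive timer periods.
function elapsedAfter(intervals, intervalMs) {
  return intervals * intervalMs;
}

// Smallest number of timer periods needed to cover `thresholdMs`.
function intervalsToReach(thresholdMs, intervalMs) {
  return Math.ceil(thresholdMs / intervalMs);
}

const nominal = elapsedAfter(3, 250);  // 750 ms: three nominal periods cover the window
const observed = elapsedAfter(3, 205); // 615 ms: three observed periods fall 135 ms short
```

At the nominal rate `intervalsToReach(750, 250)` is 3, while at the observed rate `intervalsToReach(750, 205)` is 4, so a check that counts on three full periods having elapsed would see the acceptance time as not yet passed.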

Richard Newman [:rnewman]

Comment 63

•

13 years ago

Attached patch: Proposed patch. v1

This should fix the next failing test.

Attachment #729847 - Flags: review?(gps)

Gregory Szorc [:gps]

Assignee

Updated

•

13 years ago

Attachment #729847 - Flags: review?(gps) → review+

Richard Newman [:rnewman]

Comment 64

•

13 years ago

Fingers crossed: https://hg.mozilla.org/services/services-central/rev/1965a470f497

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 68

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/f260465b7a29 https://hg.mozilla.org/mozilla-central/rev/c743ad0df73d https://hg.mozilla.org/mozilla-central/rev/1965a470f497

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Comment 70

•

13 years ago

(In reply to TinderboxPushlog Robot from comment #69)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=21154448&tree=Firefox
> WINNT 6.2 mozilla-central pgo test xpcshell on 2013-03-27 03:54:13
> slave: t-w864-ix-083
>
> 04:45:53 WARNING - TEST-UNEXPECTED-FAIL |
> C:
> \slave\test\build\tests\xpcshell\tests\services\datareporting\tests\xpcshell\
> test_policy.js | test failed (with xpcshell return code: 0), see following
> log:
> 04:45:53 WARNING - TEST-UNEXPECTED-FAIL |
> C:/slave/test/build/tests/xpcshell/tests/services/datareporting/tests/
> xpcshell/test_policy.js | false == true - See following stack:
> 04:55:41 ERROR - Return code: 1

This was on the s-c merge push.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 74

•

13 years ago

I have a one-line change for this that should fix Windows. Again.

Richard Newman [:rnewman]

Comment 75

•

13 years ago

https://hg.mozilla.org/services/services-central/rev/400feaf9e495 Waiting for inbound to open.

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Comment 78

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/400feaf9e495

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Target Milestone: mozilla21 → mozilla22

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Updated

•

13 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Gregory Szorc [:gps]

Assignee

Comment 85

•

13 years ago

The high failure frequency appears to be gone. I'm asserting this is no longer a candidate for disabling. We'll continue to look at the remaining intermittent failures.

Whiteboard: [disable-me 2013-04-07]

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Updated

•

13 years ago

Component: Metrics and Firefox Health Report → Client: Desktop

Product: Mozilla Services → Firefox Health Report

Target Milestone: mozilla22 → ---

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Updated

•

13 years ago

Depends on: 860930

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 259

•

12 years ago

gps: take a look at this?

Flags: needinfo?(gps)

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 264

•

12 years ago

That's weird how we've encountered a spike in this failure. Stefan is touching a lot of this code in bug 862563 and bug 850709 and I'm holding out hope his changes magically fix things. I was holding out hope a month ago too. But, his patches are very near r+ and I'd rather not bit rot them. If the spike continues, I can look into this.

Flags: needinfo?(gps)

Comment hidden (Legacy TBPL/Treeherder Robot)

Ryan VanderMeulen [:RyanVM]

Comment 321

•

12 years ago

Richard, any chance you could take another look at this?

status-firefox24: --- → affected

status-firefox25: --- → affected

status-firefox26: --- → affected

Flags: needinfo?(rnewman)

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 323

•

12 years ago

(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #321)
> Richard, any chance you could take another look at this?

See Comment 264 -- this is very much in Greg's court!

Flags: needinfo?(rnewman)

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 325

•

12 years ago

The failing test will go away with bug 862563. The patch in that bug is in my review queue.

Depends on: 862563

Gregory Szorc [:gps]

Assignee

Comment 326

•

12 years ago

Attached patch: Increase timer interval to hopefully prevent intermittent failure

Let's see if this fixes it.

Attachment #792396 - Flags: review?(rnewman)

Gregory Szorc [:gps]

Assignee

Updated

•

12 years ago

Assignee: rnewman → gps

Comment hidden (Legacy TBPL/Treeherder Robot)

Richard Newman [:rnewman]

Comment 329

•

12 years ago

Comment on attachment 792396
Increase timer interval to hopefully prevent intermittent failure

By "r+" I mean "I have no idea if this will fix the problem, but it doesn't look like a dangerous change, and failing or passing tests will be the measure of success here".

Attachment #792396 - Flags: review?(rnewman) → review+

Gregory Szorc [:gps]

Assignee

Comment 330

•

12 years ago

https://hg.mozilla.org/integration/fx-team/rev/1ce965e9ea4c Let's see what happens...

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 337

•

12 years ago

Patch didn't work. I'll look at this more when I have time to escape from this work week.

Priority: -- → P1

Whiteboard: [leave open]

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 339

•

12 years ago

I reverted the original patch then lowered the implicit acceptance interval to allow some slack for timers firing. This arguably makes the test "less perfect," but since this test is going away in bug 862563 and sheriffs want an upliftable solution, this is a preferable outcome compared to disabling. https://hg.mozilla.org/integration/fx-team/rev/67f0ad6adc58

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Updated

•

12 years ago

Whiteboard: [leave open]

Comment hidden (Legacy TBPL/Treeherder Robot)

Ed Morley [:emorley]

Reporter

Comment 345

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/1ce965e9ea4c https://hg.mozilla.org/mozilla-central/rev/67f0ad6adc58

Status: REOPENED → RESOLVED

Closed: 13 years ago → 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → Firefox 26

Ryan VanderMeulen [:RyanVM]

Comment 346

•

12 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/bab2e697a32f https://hg.mozilla.org/releases/mozilla-beta/rev/148161ac177f Thanks!

status-firefox24: affected → fixed

status-firefox25: affected → fixed

status-firefox26: affected → fixed

Comment hidden (Legacy TBPL/Treeherder Robot)

Gregory Szorc [:gps]

Assignee

Comment 348

•

12 years ago

The last one is the same signature but different location. Is there a way to force TBPL to ignore this bug from now on?

Ed Morley [:emorley]

Reporter

Comment 349

•

12 years ago

We could just file another bug and people would star against that rather than this resolved bug (which appears with strikethrough in TBPL)

Comment hidden (Legacy TBPL/Treeherder Robot)

BMO Automation

Updated

•

7 years ago

Product: Firefox Health Report → Firefox Health Report Graveyard

Proposed patch. v1 | 13 years ago | Richard Newman [:rnewman] | 2.92 KB, patch | gps: review+
Increase timer interval to hopefully prevent intermittent failure | 12 years ago | Gregory Szorc [:gps] | 2.07 KB, patch | rnewman: review+