Closed Bug 1121117 Opened 9 years ago Closed 9 years ago

Intermittent test_ocsp_timeout.js | xpcshell return code: 0 | - false == true

Categories

(Core :: Security: PSM, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox37 --- fixed
firefox38 --- fixed
firefox39 --- fixed
firefox-esr31 --- unaffected

People

(Reporter: RyanVM, Assigned: Cykesiopka)

References

Details

(Keywords: intermittent-failure)

Attachments

(2 files, 2 obsolete files)

08:36:10 INFO - TEST-START | security/manager/ssl/tests/unit/test_ocsp_timeout.js
08:36:21 WARNING - TEST-UNEXPECTED-FAIL | security/manager/ssl/tests/unit/test_ocsp_timeout.js | xpcshell return code: 0
08:36:21 INFO - TEST-INFO took 10525ms
08:36:21 INFO - >>>>>>>
08:36:21 INFO - (xpcshell/head.js) | test MAIN run_test pending (1)
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 0 pending (2)
08:36:21 INFO - (xpcshell/head.js) | test MAIN run_test finished (2)
08:36:21 INFO - running event loop
08:36:21 INFO - security/manager/ssl/tests/unit/test_ocsp_timeout.js | Starting
08:36:21 INFO - (xpcshell/head.js) | test pending (2)
08:36:21 INFO - TEST-PASS | security/manager/ssl/tests/unit/test_ocsp_timeout.js | - true == true
08:36:21 INFO - TEST-PASS | security/manager/ssl/tests/unit/test_ocsp_timeout.js | - true == true
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 0 finished (2)
08:36:21 INFO - (xpcshell/head.js) | test pending (2)
08:36:21 INFO - (xpcshell/head.js) | test finished (2)
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 1 pending (2)
08:36:21 INFO - (xpcshell/head.js) | test finished (2)
08:36:21 INFO - security/manager/ssl/tests/unit/test_ocsp_timeout.js | Starting
08:36:21 INFO - (xpcshell/head.js) | test pending (2)
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 2 pending (3)
08:36:21 INFO - (xpcshell/head.js) | test finished (3)
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 1 finished (2)
08:36:21 INFO - security/manager/ssl/tests/unit/test_ocsp_timeout.js | Starting
08:36:21 INFO - (xpcshell/head.js) | test pending (2)
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 2 finished (2)
08:36:21 INFO - "handling ocsp-stapling-none.example.com"
08:36:21 INFO - TEST-PASS | security/manager/ssl/tests/unit/test_ocsp_timeout.js | - 2153389959 == 2153389959
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 3 pending (2)
08:36:21 INFO - (xpcshell/head.js) | test finished (2)
08:36:21 INFO - security/manager/ssl/tests/unit/test_ocsp_timeout.js | Starting
08:36:21 INFO - (xpcshell/head.js) | test pending (2)
08:36:21 WARNING - TEST-UNEXPECTED-FAIL | security/manager/ssl/tests/unit/test_ocsp_timeout.js | - false == true
08:36:21 INFO - C:/slave/test/build/tests/xpcshell/tests/security/manager/ssl/tests/unit/test_ocsp_timeout.js:add_tests_in_mode/<:62
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:_run_next_test:1378
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:do_execute_soon/<.run:640
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:_do_main:208
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:_execute_test:500
08:36:21 INFO - -e:null:1
08:36:21 INFO - null:null:0
08:36:21 INFO - exiting test
08:36:21 INFO - 2147500036
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:_run_next_test:1380
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:do_execute_soon/<.run:640
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:_do_main:208
08:36:21 INFO - C:\slave\test\build\tests\xpcshell\head.js:_execute_test:500
08:36:21 INFO - -e:null:1
08:36:21 INFO - null:null:0
08:36:21 INFO - exiting test
08:36:21 INFO - (xpcshell/head.js) | test run_next_test 3 finished (2)
08:36:21 INFO - <<<<<<<
I've been investigating this issue every once in a while (e.g. https://treeherder.mozilla.org/#/jobs?repo=try&revision=091fd641644c ).

It would be nice to see exactly what the time difference is, so this patch adds some print logging to reveal what the exact values are.
Attachment #8572211 - Flags: review?(dkeeler)
Comment on attachment 8572211 [details] [diff] [review]
bug1121117_test_ocsp_timeout-add-logging_v1.patch

Review of attachment 8572211 [details] [diff] [review]:
-----------------------------------------------------------------

Sounds good to me.
Attachment #8572211 - Flags: review?(dkeeler) → review+
Thanks for the review.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c97b0d327cb5
Assignee: nobody → cykesiopka.bmo
Status: NEW → ASSIGNED
Attached patch bug1121117_add-fuzz_v1.patch (obsolete) — Splinter Review
After chasing one or two red herrings, it looks like the non-monotonicity of Date.now() is the issue here (see https://treeherder.mozilla.org/#/jobs?repo=try&revision=0bdd62e8d5c1 where I compared Date.now() to what should be a monotonic time source). 

I'm not aware of any way to access a monotonic time source in a xpcshell test other than abusing nsITelemetry::msSinceProcessStart(), so this patch just adds a 300ms fuzz to the assertion. So far, the lowest time difference has been 9765ms, so 300ms fuzz should hopefully be enough.

Try push looks good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=09b4269a5353
(BuildAPI says this is all green, don't know why TH still says some jobs are running).
Attachment #8578942 - Flags: review?(dkeeler)
Comment on attachment 8578942 [details] [diff] [review]
bug1121117_add-fuzz_v1.patch

Review of attachment 8578942 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/tests/unit/test_ocsp_timeout.js
@@ +65,5 @@
>      // With OCSP hard-fail on, we timeout after 10 seconds.
>      // With OCSP soft-fail, we timeout after 2 seconds.
> +    // Date.now() is not guaranteed to be monotonic, so add extra fuzz time
> +    // to prevent intermittent failures (see Bug 1121117).
> +    const FUZZ_MS = 300;

Seems reasonable. Let's also mention that this only appeared to be a problem on Win XP, so when we no longer support that platform, we can revert this. Alternatively, do you know of a reason we shouldn't use nsITelemetry.msSinceProcessStart() as in the debugging patch?
Attachment #8578942 - Flags: review?(dkeeler) → review+
(In reply to David Keeler [:keeler] (use needinfo?) from comment #54)
> Alternatively, do you know of a reason we shouldn't use
> nsITelemetry.msSinceProcessStart() as in the debugging patch?

Other than "it felt hacky" to me, none. I guess this is just a test file though... I'll submit a new patch.
This patch implements the alternative approach of using nsITelemetry.msSinceProcessStart().

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e453e7550fe
Attachment #8578942 - Attachment is obsolete: true
Attachment #8579927 - Flags: review?(dkeeler)
Comment on attachment 8579927 [details] [diff] [review]
bug1121117_monotonic-clock_v1.patch

Review of attachment 8579927 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think this is quite the approach we should take. More specifically, this test will be simpler and more maintainable if we use only one timing implementation - either telemetry if it doesn't throw exceptions or regular js Date() with the fuzz factor.

::: security/manager/ssl/tests/unit/test_ocsp_timeout.js
@@ +41,5 @@
>    run_next_test();
>  }
>  
>  function add_tests_in_mode(useHardFail) {
> +  const nsITelemetry = Cc["@mozilla.org/base/telemetry;1"]

My feeling about this is that it should be called something like "telemetry" or "telemetryService", since "nsITelemetry" is the name of the interface. Not a big deal, though.

@@ +50,5 @@
>    add_test(function () {
>      Services.prefs.setBoolPref("security.OCSP.require", useHardFail);
> +    startTimeNonMonotonic = new Date();
> +    try {
> +      startTime = nsITelemetry.msSinceProcessStart();

Does this tend to throw exceptions? (does telemetry need to be initialized earlier in the test?) I think it would be best to rely on only one timing implementation. If using telemetry is not reliable, then we'll be falling back to using Date() anyway, and without the fuzz factor we'll still see intermittent failures.
Attachment #8579927 - Flags: review?(dkeeler) → review-
(In reply to David Keeler [:keeler] (use needinfo?) from comment #62)
> I don't think this is quite the approach we should take. More specifically,
> this test will be simpler and more maintainable if we use only one timing
> implementation - either telemetry if it doesn't throw exceptions or regular
> js Date() with the fuzz factor.

Sure.

> My feeling about this is that it should be called something like "telemetry"
> or "telemetryService", since "nsITelemetry" is the name of the interface.
> Not a big deal, though.

Fair.

> Does this tend to throw exceptions? (does telemetry need to be initialized
> earlier in the test?)

I don't think this will throw often (assuming I'm understanding the current implementation correctly).

> If using telemetry is not reliable, then we'll be falling back to using Date()
> anyway, and without the fuzz factor we'll still see intermittent failures.

Right. I guess it's simpler to just add the fuzz factor then.
This is attachment 8578942 [details] [diff] [review] with the feedback raised in comment 54 addressed.
Attachment #8579927 - Attachment is obsolete: true
Attachment #8580465 - Flags: review+
Comment on attachment 8580465 [details] [diff] [review]
bug1121117_add-fuzz_v2.patch

Thanks for the reviews!

(Try push is in comment 53).
Attachment #8580465 - Flags: checkin?
Keywords: leave-open
Attachment #8580465 - Flags: checkin? → checkin+
https://hg.mozilla.org/mozilla-central/rev/8cd33e72eb47
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: