Closed Bug 1772416 Opened 2 years ago Closed 2 years ago

Address intermittent WPT failure: measure_associated_with_navigation_timing.html

Categories

(Core :: DOM: Performance, enhancement)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: mcomella, Assigned: mcomella)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Type: task → enhancement

I don't know what changed but these are all green now.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID

These actually appear to be intermittently failing so reopening.

Status: RESOLVED → REOPENED
Resolution: INVALID → ---

This doesn't fail in our CI on desktop because the test isn't disabled. Similarly, I've been unable to reproduce the error on macOS with ./mach wpt --retry-until-unexpected. I wonder what's different about WPT's configuration that it fails intermittently on Linux but not in our CI.

Curiously, the tests that fail intermittently are actually the same tests that are disabled on Android for intermittent failures. I was unable to reproduce these failures on the Android emulator at all. I tried with ./mach wpt --retry-until-unexpected or ./mach wpt <test> --verify (the latter failed but due to a harness timeout). I was also unable to reproduce after 10 runs on try or by visiting http://wpt.live/user-timing/measure_associated_with_navigation_timing.html and rerunning the tests manually tens of times.

After talking with :sefeng, we decided to take two approaches:

  • re-enable the tests on Android in CI and see if there are failures (since we can't reproduce by hand)
  • when the tests fail, add the specific timings to the error message so we get debug info in wpt.fyi

See bug 1461298 for the intermittent failures that caused these tests to be disabled on Android.

See Also: → 1461298

This test failed intermittently 4 years ago when it was disabled. However, I'm
unable to reproduce the failure today on try (10 iterations) or locally so let's
re-enable to see if it's still an issue.

Assignee: nobody → michael.l.comella

Here is sample output if this assertion function fails:
FAIL Second measure of current mark to navigationStart should be negative value.

  • assert_greater_than: Second measure of current mark to navigationStart should
    be negative value. expected a number greater than 352.1 but got 332.44
Pushed by mcomella@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e25f63783185
enable measure_associated_with_navigation_timing on android. r=sefeng
Pushed by mcomella@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/47a07cd0d028
print failure values in user-timing/ test_greater_than*. r=sefeng
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/35506 for changes under testing/web-platform/tests
Upstream PR merged by moz-wptsync-bot

With the additional debugging info, the new error messages on wpt.fyi (after clicking the "Show Details" selector) are:

FAIL message: assert_greater_than: Second measure of current mark to navigationStart should be negative value. expected a number greater than 84 but got 84

FAIL message: assert_greater_than: Measure from domComplete event to most recent mark "a" should have longer duration. expected a number greater than 2 but got 2

FAIL message: assert_greater_than: Measure from most recent mark to navigationStart should have longer duration. expected a number greater than -84 but got -84

The root cause of the intermittent failure is that the test records one set of
PerformanceMarks, records another, and verifies the duration of the measures of
the second set are larger than the first set but they're unexpectedly equal.

I speculate this happens because either the test executes fast enough that the
clock doesn't increment or our precision reduction code (to mitigate against
timing attacks) is rounding the before/after values to the same value.

This solution is speculative because I was unable to reproduce the error
locally so need to test it on WPT CI. The solution may fail intermittently if we
don't sleep for long enough.

A more correct solution to this problem would be to mock the clock and
programmatically advance it but I didn't want to take the time. Another possible
solution, that is stable, is to change the assertions to test_greater_or_equal
but this could hide legitimate bugs, e.g. we accidentally return the same
PerformanceMeasure for both queries.

Summary: Make WPT tests pass: measure_associated_with_navigation_timing.html → Address intermittent WPT failure: measure_associated_with_navigation_timing.html
Pushed by mcomella@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/93de313ddf70
sleep in measure_associated_with_navigation_timing. r=sefeng
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/35603 for changes under testing/web-platform/tests
Upstream PR merged by moz-wptsync-bot

Over the past two weeks, we've seen 1 failure (on 2022-09-09T10:52:01.202Z) out of 17 runs. This is much improved but not good enough.

My best idea is to just sleep longer in the test. To make this less random, we can determine how much we reduce precision by and calculate the theoretical worst case and sleep longer than that (assuming the precision reduction is the root cause of this issue).

Over two weeks, this test intermittently failed on Firefox in 1/17 runs. After
reading the timestamp precision reduction code, the cause of the failure
appears to be that we're not sleeping long enough. We clamp the value (i.e.
discard sub-millisecond precision) and randomly round up to the next millisecond
value via the midpoint & clampedAndJittered code. In the worst case, this will
cause values 1.999ms apart to be rounded to the same value, e.g. 10ms and
11.999ms round to 11ms. In theory, we need to sleep at least 2ms but we've been
sleeping 1ms.

Pushed by mcomella@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/61831cfe799a
increase sleep duration in measure_associated... WPT. r=sefeng
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/35855 for changes under testing/web-platform/tests
Upstream PR merged by moz-wptsync-bot

Next step: review in ~2 weeks to see if we're still experiencing intermittent test failures.

There have been no failures in 20 runs from 9/29 to 10/4 (6 days) so I'm closing this as fixed.

Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: