WPT event-timing/duration-with-target-low.html intermittently times out in Firefox
Categories
(Core :: DOM: Performance APIs, defect, P3)
Tracking
()
People
(Reporter: dholbert, Assigned: canova)
References
Details
Attachments
(1 file)
This WPT intermittently times out in Firefox:
https://wpt.fyi/results/event-timing/duration-with-target-low.html
Here are some recent wpt.fyi runs where we had this timeout:
https://wpt.fyi/results/event-timing/duration-with-target-low.html?sha=6f1045ad48&label=master&max-count=1&product=firefox
https://wpt.fyi/results/event-timing/duration-with-target-low.html?sha=2d4988ea24&label=master&max-count=1&product=firefox
https://wpt.fyi/results/event-timing/duration-with-target-low.html?sha=8ffa4295d6&label=master&max-count=1&product=firefox
https://wpt.fyi/results/event-timing/duration-with-target-low.html?sha=78be46cde7&label=master&max-count=1&product=firefox
https://wpt.fyi/results/event-timing/duration-with-target-low.html?sha=0963c452f6&label=master&max-count=1&product=firefox
https://wpt.fyi/results/event-timing/duration-with-target-low.html?sha=8e25c28d31&label=master&max-count=1&product=firefox
We have an in-tree annotation to allow it to intermittently time out (the test as-a-whole as well as the subtest):
https://searchfox.org/firefox-main/source/testing/web-platform/meta/event-timing/duration-with-target-low.html.ini
Probably we want to remove that .ini file as part of investigating this. (And maybe/hopefully the existence of that .ini file is a clue that we do hit this intermittently on TreeHerder on some platforms, and can investigate with try runs etc.)
| Reporter | ||
Comment 1•16 days ago
•
|
||
On wpt.fyi at least, I found those 6 test runs that had timeouts when reviewing the most-recent 60 test runs (and the remaining 54 test runs had green results for Firefox nightly, aside from a few cases where Firefox didn't get run as part of the cycle). So, that suggests this timeout happens about 10% of the time.
| Reporter | ||
Comment 2•16 days ago
|
||
Adding dependency on bug 1968204 since that's the last bug to have touched the .ini file (removing a bunch of expected-failure annotations), and changing component to match.
ni=canova who's investigating.
| Assignee | ||
Comment 3•16 days ago
•
|
||
Thanks Daniel! I can reproduce it locally. Here are my findings:
Before Bug 1968204, this test was perma failing. The issue then was because of the rounding. The spec expected duration value to be the rounded to the multiplies of 8.
The duration of event blocking is 300ms and the test expects 304 (as it's the rounded value to closest multiples of 8).
It looks like in the timing out case, this performance event timing entry is never firing. So "duration of pointerdown event" >= 300 check is returning false somehow.
I put some printf values to see, and here's the failing case:
0:05.27 pid:3577367 DispatchPendingEventTimingEntries: renderingTime=672.912660, startTime=373.236058, calculatedDuration=299.676602
0:05.27 pid:3577367 EventTiming::SetDuration [pointerdown]: raw=299.676602, rounded=296.000000
0:05.27 pid:3577367 DispatchPendingEventTimingEntries: renderingTime=672.912660, startTime=373.236058, calculatedDuration=299.676602
0:05.27 pid:3577367 EventTiming::ShouldAddEntryToObserverBuffer [pointerdown]: rawDuration=296.000000, threshold=300.000000, result=0
So the tests blocks the main thread for 300ms, but we get 299.676602 as the duration for this event, which is lower than 300 threshold.
It looks like it's essentially because the expected duration is lower than the threshold, which ideally shouldn't happen. I'm not 100% sure what the best solution is though. We might be taking the start time too late? Or end too early?
This is where we get the start time (aEvent->mTimeStamp) : https://searchfox.org/firefox-main/rev/3eaf7e2acf8186eb7aa579561eaa1312cb89132b/dom/performance/PerformanceEventTiming.cpp#138
This is where we get the end time: https://searchfox.org/firefox-main/rev/3eaf7e2acf8186eb7aa579561eaa1312cb89132b/dom/performance/PerformanceMainThread.cpp#295,303
| Reporter | ||
Comment 4•16 days ago
|
||
The test does its 300ms-main-thread-blocking here (with ms = 300):
https://searchfox.org/firefox-main/rev/3eaf7e2acf8186eb7aa579561eaa1312cb89132b/testing/web-platform/tests/event-timing/resources/event-timing-test-utils.js#1-4
function mainThreadBusy(ms) {
const target = performance.now() + ms;
while (performance.now() < target);
}
This doesn't reliably block the main thread for 300ms, specifically in Firefox.
In Firefox, by default, performance.now() rounds to the nearest 1ms[1]. That means the above loop might actually only block the main thread for 299 milliseconds for us, if the rounding happens to work out poorly. E.g. using 0-based timing to make the numbers easy... let's say this loop starts at t=0.499ms (which rounds down to t=0). So we set target = 300. And then we loop, and as soon as we reach the loop iteration where performance.now() internally observes a timestamp of 299.5ms (which rounds up to 300ms), we'll have reached the target and we'll exit the loop. So we will have only blocked the main thread for ~299ms, and there'll be an up-to-1ms window where things might go wrong if the test logic assumes that 300ms have truly passed.
[1] This is behind the pref privacy.reduceTimerPrecision, which is a spectre mitigation. Googling turns up https://github.com/lumen/threading-benchmarks/issues/7 with a record of it; I didn't do all the bugzilla archeology to find where we actually added this behavior.
| Reporter | ||
Comment 5•16 days ago
•
|
||
Assuming comment 4 is all there is, then we could work around this by adjusting mainThreadBusy to add 1ms to the ms value, hypothetically. That would ensure that at least that many milliseconds have passed (0-1ms more than the passed-in value).
[EDIT: Maybe +2ms would be better, to also account for the "jitter" privacy feature; see next comment]
I'm not sure if that's the right level at which to make that change; this is a "utils" script, so depending on how other tests use it and whether they make similar or completely-different assumptions about its behavior, I could imagine this being a good or bad place to make that change.
| Reporter | ||
Comment 6•16 days ago
|
||
(In reply to Daniel Holbert [:dholbert] from comment #4)
[1] This is behind the pref
privacy.reduceTimerPrecision, which is a spectre mitigation. Googling turns up https://github.com/lumen/threading-benchmarks/issues/7 with a record of it; I didn't do all the bugzilla archeology to find where we actually added this behavior.
Did some archeology -- our clamping-to-1ms behavior comes from bug 1451790 (where we set privacy.resistFingerprinting.reduceTimerPrecision.microseconds to 1000, where 1000 microseconds = 1ms).
One other bit of error that we may need to account for:
we intentionally apply a fuzzy jitter factor to timestamps (via privacy.resistFingerprinting.reduceTimerPrecision.jitter, added in bug 1425462), which may make timestamps potentially wrong by up-to-one-precision-value (1ms), though still always monotonically increasing when observed. That might conceivably be another source of "off-by-one" issues here. Though maybe the jitter would be consistent between the two clocks here (the event-based clock and the performance.now() based clock)?
So -- out of an abundance of caution, it'd probably be best for the busy-loop to overshoot by 2ms, to allow for 1ms of error due to rounding and 1ms of error due to jitter. But if the jitter is really shared between the performance.now() and event-timing timestamps here (I haven't checked), then it's fine for the busy-loop to just overshoot by 1ms, I think.
| Reporter | ||
Comment 7•16 days ago
|
||
canova, could you handle fixing the test logic to take care of this and anything else that you noticed that might need fixing here?
| Assignee | ||
Comment 8•15 days ago
|
||
Ah, thanks for looking into it! Yes, it looks like the root of the issue. I'm running this test locally in a loop until it fails and I can't really reproduce the intermittent timeout with --setpref privacy.reduceTimerPrecision=false anymore (which I could easily reproduce in less than 10 iteration without it).
I'm trying to update mainThreadBusy to to wait for duration + 2ms now. It looks like other tests are still running successfully. Will verify with the other browsers too.
| Assignee | ||
Comment 9•15 days ago
|
||
This is the try run before my patch, but the ini file removed: https://treeherder.mozilla.org/jobs?repo=try&revision=bd0ccdbbff808f754dd96be88e829cd5fcaacb50
3 failures in 20 triggers.
This is the try run with the duration adjusted to +2ms, and all the .ini files removed from passing tests: https://treeherder.mozilla.org/jobs?repo=try&revision=37e7e6544cf6fdc32994a01a28fcbc65a23bd413
No failure in 100 triggers.
So it looks like the duration-with-target-low.html timeouts are gone. And the other tests are passing as well.
Unrelated to this bug, but it looks like there is an intermittent test failure on /event-timing/gap-keydown-keyup.html. I filed bug 2012095 for that, but it's not a part of interop 2025.
| Assignee | ||
Comment 10•15 days ago
|
||
Also pushed another try run with all OSes to double check if the removed wpt ini files were necessary or not: https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=174643
Removing quite a bit of ini files, assuming that they are not needed anymore, but if you think we should start small, we can only remove the ini file of duration-with-target-low.html
| Assignee | ||
Comment 11•15 days ago
|
||
Firefox rounds performance.now() to 1ms for privacy protection against
timing attacks like Spectre. This can cause the busy-wait loop to exit
early (for example 299ms instead of 300ms in duration-with-target-low.html)
when timestamps round unfavorably. Adding a 2ms accounts for both
rounding and jitter in timestamp measurements.
This fixes intermittent timeouts in event-timing/duration-with-target-low.html
and other tests that rely on mainThreadBusy.
Updated•15 days ago
|
| Assignee | ||
Comment 12•15 days ago
•
|
||
One test failure on Android was legitimate. I reverted that ini file in my patch above.
| Assignee | ||
Comment 13•15 days ago
|
||
I filed an interop issue to get a greenlight from other interop parties: https://github.com/web-platform-tests/interop/issues/1247
| Assignee | ||
Comment 14•15 days ago
|
||
The test changes are moved to the upstream PR: https://github.com/web-platform-tests/wpt/pull/57314
I will remove the test changes from the phabricator patch and land the ini file changes after this PR lands.
Updated•4 days ago
|
Description
•