[wayland] Remote test failures: Timestamp <random> in expected range [<before> - <after>]
Categories
(Core :: JavaScript Engine, defect, P3)
Tracking
Branch | Tracking | Status
---|---|---
firefox-esr115 | --- | unaffected
firefox121 | --- | wontfix
firefox122 | --- | wontfix
firefox123 | --- | wontfix
People
(Reporter: whimboo, Assigned: stransky)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: regression)
Attachments
(5 obsolete files)
CDP remote tests are randomly failing with the following error:
Timestamp 1697616453.232 in expected range [1697616453.101 - 1697616452.987]
Here is an example:
[task 2023-10-18T08:07:36.523Z] 08:07:36 INFO - TEST-UNEXPECTED-FAIL | remote/cdp/test/browser/page/browser_domContentEventFired.js | Timestamp 1697616453.232 in expected range [1697616453.101 - 1697616452.987] -
[task 2023-10-18T08:07:36.523Z] 08:07:36 INFO - Stack trace:
[task 2023-10-18T08:07:36.524Z] 08:07:36 INFO - chrome://mochikit/content/browser-test.js:test_ok:1583
[task 2023-10-18T08:07:36.524Z] 08:07:36 INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/page/browser_domContentEventFired.js:runContentEventFiredTest:91
[task 2023-10-18T08:07:36.524Z] 08:07:36 INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/page/browser_domContentEventFired.js:eventWhenNavigatingWithNestedFrames:52
[task 2023-10-18T08:07:36.524Z] 08:07:36 INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/head.js:fn:87
[task 2023-10-18T08:07:36.524Z] 08:07:36 INFO - chrome://mochikit/content/browser-test.js:handleTask:1134
[task 2023-10-18T08:07:36.524Z] 08:07:36 INFO - chrome://mochikit/content/browser-test.js:_runTaskBasedTest:1206
[task 2023-10-18T08:07:36.525Z] 08:07:36 INFO - chrome://mochikit/content/browser-test.js:Tester_execTest:1348
[task 2023-10-18T08:07:36.525Z] 08:07:36 INFO - chrome://mochikit/content/browser-test.js:nextTest/<:1123
This happens for different events and has not been easy to reproduce so far.
We should check why this is only happening on test jobs running on the new Wayland workers.
Reporter • Comment 1 • 7 months ago
The test remote/cdp/test/browser/page/browser_navigate.js is the one that fails most of the time. I pushed a try build for only this specific test so that we can get the full logs:
https://treeherder.mozilla.org/jobs?repo=try&revision=9980cba28b988fb1906b6390bfbb84f5b311b14c
Reporter • Comment 2 • 7 months ago
As discussed on Monday I've now pushed a try build that iterates 100 times over a timestamp check for the load event. Let's see if that gives us some more details about this failure:
https://treeherder.mozilla.org/jobs?repo=try&revision=8add05d34fba14ee278874714977ed1432766fd2
Reporter • Comment 3 • 7 months ago
This is actually very strange behavior on Wayland! See the following failure:
Timestamp 1699389658.01 in expected range [1699389657.871 - 1699389657.297]
The logged time for start is later than the logged time for end. In both cases we use Date.now() / 1000.
This failure can only be seen on Wayland and we never had it on any other platform.
Martin, do you have an idea why this could happen?
Reporter • Comment 4 • 7 months ago
Here is a new try push that just checks the values from the Date.now() calls in ms and in s:
https://treeherder.mozilla.org/jobs?repo=try&revision=2e26ab3436c752f291e069a937bb110d3f8cb69e
Assignee • Comment 5 • 7 months ago
First of all I need to check how Date.now() is implemented in the widget code. I see there's nsWindow::GetEventTimeStamp() https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/widget/gtk/nsWindow.cpp#4972 but I'm not sure it's related.
Reporter • Comment 6 • 7 months ago
With the updated test from the last try build it's clear that it is indeed Date.now() that is misbehaving once in a while:
https://treeherder.mozilla.org/logviewer?job_id=435346305&repo=try&lineNumber=3479-3490
The test looks like:
const timeBefore = Date.now();
await new Promise(resolve => setTimeout(resolve, 200));
const timeAfter = Date.now();
ok(
timeBefore < timeAfter,
`Timestamps (ms) in expected order [${timeBefore} - ${timeAfter}]`
);
Test results:
INFO - TEST-PASS | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (ms) in expected order [1699397093616 - 1699397093817] -
INFO - TEST-PASS | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (sec) in expected order [1699397093616 - 1699397093817] -
INFO - TEST-UNEXPECTED-FAIL | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (ms) in expected order [1699397093819 - 1699397093404] -
Assignee • Comment 7 • 7 months ago
Date.now() calls js::date_now() which leads to NowAsMillis()/PRMJ_Now() call and gettimeofday(). I don't see what can be wrong here. Is clampAndJitter set in the test? https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/js/src/jsdate.cpp#1869
Assignee • Comment 8 • 7 months ago
I don't see the failure locally (I updated the test to run 1000 times). I wonder if it's related to the test environment.
Assignee • Comment 9 • 7 months ago
By the way, I tested it as:
./mach test remote/cdp/test/browser/page/browser_loadEventFired.js
Reporter • Comment 10 • 7 months ago
Interesting. Smaug also pointed me to the following preferences:
https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/modules/libpref/init/StaticPrefList.yaml#14417-14430
I may try to set timer.maximum_firing_delay_tolerance_ms to 0 to disable timer coalescing. As of now it's not clear if that could have an impact here. The new try build:
https://treeherder.mozilla.org/jobs?repo=try&revision=2b53eb85da3aba7cd69f4e26b3521b96f7c0d7cb
Reporter • Comment 11 • 7 months ago
Other preferences around timers that we should try are privacy.reduceTimerPrecision.* and privacy.resistFingerprinting.*. I will await the current results first and try to push those to try later today.
Reporter • Comment 12 • 7 months ago
Here are two new try pushes with privacy.reduceTimerPrecision=false set:
artifact: https://treeherder.mozilla.org/jobs?repo=try&revision=3bd9b4a255ab3ea72d5a2e94929d2481ad4a2726
full+pernosco: https://treeherder.mozilla.org/jobs?repo=try&revision=1d5b2804f28a88e96fa70bc53166671f62c47430
If we cannot reproduce it locally maybe we could use a Pernosco recording? The second try build might work for that and catch the problem.
Reporter • Comment 13 • 7 months ago
The failure is still present with the latest try build and the above mentioned preferences set.
(In reply to Martin Stránský [:stransky] (ni? me) from comment #7)
Date.now() calls js::date_now() which leads to NowAsMillis()/PRMJ_Now() call and gettimeofday(). I don't see what can be wrong here. Is clampAndJitter set in the test? https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/js/src/jsdate.cpp#1869
I'm not sure if the aforementioned prefs affect the behavior of clampAndJitter or how to test that.
Tom, do you have an idea what might go on here on Wayland? For details please see my comment 6. Thanks!
Comment 14 • 7 months ago
Bug 1857488 landed 4 days before this bug was filed. However, this test doesn't modify any of the timer preferences, so I'm not sure how it might interact. But it's a strong contender if this just started happening.
As far as prefs go:
- Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=false to test time precision with no clamping or jittering. It's nanosecond precision. If the test fails with this, it's almost certainly not related to any timer clamping code.
- Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=true to get 20 microseconds everywhere.
- Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=20 to get 20 microseconds without jitter.*
- Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=XXX to get XXX microseconds without jitter; you can dial that up to 1000, which is the default.**
- Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=1000 to get normal browser behavior without jitter.
- Clear all prefs to get normal browser behavior (1000 usec / 1 ms + jitter), which is how this test appears to run.
* Jitter means: take a random midpoint between two timestamps (e.g. with a precision of 100 usec and timestamps of 1.3 ms and 1.4 ms, a random midpoint could be 1.372 ms) and round the actual timestamp up or down based on that midpoint (e.g. a timestamp of 1.370 would go down and a timestamp of 1.374 would go up).
** In normal behavior, a page running with COEP+COOP (meaning it's granted SAB) will get 20 microsecond precision, other pages will get 1ms - and jitter is on for both.
Reporter • Comment 15 • 7 months ago
As discussed with Tom via Matrix it would be good to have try runs for the first two settings. Verbose logging would be helpful, especially on the second one, and should be doable via MOZ_LOG="nsResistFingerprinting:5".
Reporter • Comment 16 • 7 months ago
(In reply to Tom Ritter [:tjr] from comment #14)
Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=false to test time precision with no clamping or jittering. It's nanosecond precision. If the test fails with this, it's almost certainly not related to any timer clamping code.
Try build: https://treeherder.mozilla.org/jobs?repo=try&revision=1948e6f46cf106b84d884b2fd2e53d1041ce2fa5
Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=true to get 20 microseconds everywhere.
Try build: https://treeherder.mozilla.org/jobs?repo=try&revision=3a1004756ec5c5d1b3f74eb33f6ffea2caed08d1
Let's see what the results are.
Reporter • Comment 17 • 7 months ago
Tom, both try builds show the same failure, so I assume it's not related to fingerprinting? Do you have another idea for how to get further details? Would it make sense to log the native call when we get the timestamp? If yes, where would that have to be done?
Comment 18 • 7 months ago
You could add fprintf(stderr logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything, with those prefs set) to be safe...
After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...
Assignee • Comment 19 • 7 months ago
PRMJ_Now(), which is the primary source of time data, is the same for X11 and Wayland:
https://searchfox.org/mozilla-central/rev/21001eee69767de5db95952154d49f01b72318ba/js/src/vm/Time.cpp#37
Reporter • Comment 20 • 7 months ago
Thanks! I'll try that next.
For now I just pushed a try build that disables setting the CDP-specific preferences fission.bfcacheInParent=false and fission.webContentIsolationStrategy=0. Given that we do not see this problem in other Remote Protocol test suites this might be a factor as well.
https://treeherder.mozilla.org/jobs?repo=try&revision=83643f8d23a9a25804254b0433bc90a7c792ad8f
Reporter • Comment 21 • 6 months ago
(In reply to Tom Ritter [:tjr] from comment #18)
You could add fprintf(stderr logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything, with those prefs set) to be safe... After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...
I was finally able to add the print statement close to the call to PRMJ_Now() in jsdate.cpp, right before the clamping code that I commented out. The log lines make it clear that this call returns wrong timestamps:
[task 2023-11-24T21:34:28.701Z] 21:34:28 INFO - GECKO(4367) | *** current datetime: 1700861668694.864990
[task 2023-11-24T21:34:26.419Z] 21:34:26 INFO - GECKO(4367) | *** current datetime: 1700861666418.251953
[task 2023-11-24T21:34:26.420Z] 21:34:26 INFO - GECKO(4367) | *** current datetime: 1700861666419.215088
The same timestamp difference can actually be seen as well in the mozlog timestamp. So I assume that it's not related to any code in Firefox / Gecko but really an issue with the platform / worker?
Martin, do you have any idea what to do next?
Andrew and Joel, are we aware of other already ported test suites where we see a larger amount of test failures compared to the old 18.04 workers? Because of the time differences I could imagine all kinds of possible race conditions.
Comment 22 • 6 months ago
There's a breakdown of which suites have been enabled here:
https://docs.google.com/spreadsheets/d/173Y7PIldTomoDbGvtcc_YL-NSeUDjU1c_5d2oV3rhdg/edit#gid=0
There's been the usual slew of failures / intermittents that always accompany any new test platform, but I can't say it's seemed particularly bad. What's wrong with the timestamps? Is it that the first one is bigger?
I wonder if this is a case of the test harness buffering messages and not emitting the logs in the exact order it received them?
Assignee • Comment 23 • 6 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #21)
(In reply to Tom Ritter [:tjr] from comment #18)
You could add fprintf(stderr logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything, with those prefs set) to be safe... After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...
I was finally able to add the print statement close to the call to PRMJ_Now() in jsdate.cpp, right before the clamping code that I commented out. And from the log lines it makes it clear that this call returns wrong timestamps:
[task 2023-11-24T21:34:28.701Z] 21:34:28 INFO - GECKO(4367) | *** current datetime: 1700861668694.864990
[task 2023-11-24T21:34:26.419Z] 21:34:26 INFO - GECKO(4367) | *** current datetime: 1700861666418.251953
[task 2023-11-24T21:34:26.420Z] 21:34:26 INFO - GECKO(4367) | *** current datetime: 1700861666419.215088
The same timestamp difference can actually be seen as well in the mozlog timestamp. So I assume that it's not related to any code in Firefox / Gecko but really an issue with the platform / worker?
Martin, do you have any idea what to do next?
I'd put the logging directly here:
https://searchfox.org/mozilla-central/rev/6eb2ebcafb1b4a8576eb513e6cd2c61e3f3ae6dc/js/src/vm/Time.cpp#45
that's the OS-level source of time data and it should be correct. Or at least save the latest PRMJ_Now() value and check that each new call returns a bigger value than the previous one. Something like:
int64_t PRMJ_Now() {
  ...
  static int64_t latest_time = 0;
  int64_t recent_time = int64_t(tv.tv_sec) * PRMJ_USEC_PER_SEC + int64_t(tv.tv_usec);
  // Equal timestamps are fine; only a step backwards is a bug.
  MOZ_RELEASE_ASSERT(latest_time <= recent_time);
  latest_time = recent_time;
  return recent_time;
}
Reporter • Comment 24 • 6 months ago
That's a good idea. Let me try the assertion approach. I've pushed a new try build here:
https://treeherder.mozilla.org/jobs?repo=try&revision=5254d5e6416a48f84ebf7b2f36499781da004924
Once tests are run and fail I might request a one time loaner and try to check the behavior on the command line as well.
Reporter • Comment 25 • 6 months ago
And the assertion fails:
https://treeherder.mozilla.org/jobs?repo=try&revision=5254d5e6416a48f84ebf7b2f36499781da004924&selectedTaskRun=eqN1j9gaT3iM_sPTDZn_Vw.0
I tried to create an interactive task for the remote job on Wayland but that doesn't seem to be possible. Andrew, do we not support interactive tasks for Wayland at all yet?
Assignee • Comment 26 • 6 months ago
Looks like it's a known bug that depends on the Linux kernel version: https://bugzilla.redhat.com/show_bug.cgi?id=244697
I guess it's a bug in the Ubuntu 22.04 kernel.
Assignee • Comment 27 • 6 months ago
Looks like we should use clock_gettime(CLOCK_MONOTONIC) instead of gettimeofday():
https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html
Assignee • Comment 28 • 6 months ago
Will look at it. We may need a different code path for systems without monotonic time.
Assignee • Comment 29 • 6 months ago
Assignee • Comment 30 • 6 months ago
Reporter • Comment 31 • 6 months ago
Great find, Martin! Nevertheless I still wonder why we do not see such a massive amount of test failures for other jobs already running on Wayland. The list in Andrew's link is quite large.
I've pushed a new try build with my test and your patch applied:
https://treeherder.mozilla.org/jobs?repo=try&revision=917cb508facb7733e7d634af0592ec411cc44bdd
I wonder if such a test would be helpful as well to detect a problem like that.
Assignee • Comment 32 • 6 months ago
Monotonic doesn't work here. Let's try CLOCK_TAI, which is available on Linux 3.0+.
Assignee • Comment 33 • 6 months ago
try with CLOCK_TAI: https://treeherder.mozilla.org/jobs?repo=try&revision=861ef258f93155bbbc2f59b5c7a597b3d9f6cf79
It should use the same timebase as gettimeofday() but with guaranteed time value increment.
Assignee • Comment 34 • 6 months ago
Henrik, please try the latest patch (Use CLOCK_TAI in PRMJ_Now()) on Wayland / Ubuntu 22.04 as it's supposed to work there.
Thanks.
Reporter • Comment 35 • 6 months ago
Sure, here is the try build for the timestamp test:
https://treeherder.mozilla.org/jobs?repo=try&revision=e39aa3229723d308dd5c3399cfc9ae8b435088f0
Reporter • Comment 36 • 6 months ago
Martin, I have sad news. As my above try build shows, the problem is still present in CI, though maybe at a lower rate.
Comment 37 • 6 months ago
Here are the tests skipped so far:
https://searchfox.org/mozilla-central/search?q=display+%3D%3D+%27wayland&path=&case=false&regexp=false
Reporter • Comment 38 • 6 months ago
Martin, do you have any further idea? Is there something that we might want to check on such a system directly via the command line? If that would be helpful I'm happy to find out how to get an interactive task.
Assignee • Comment 39 • 6 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #38)
Martin, do you have any further idea? Is there something that we might want to check on such a system directly via the command line? If that would be helpful I'm happy to find out how to get an interactive task.
Hello, I did my own try here:
and it looks quite good. CLOCK_TAI generally works as a replacement for gettimeofday() and uses the same timebase, so it's compatible with it. So I believe it's the way forward.
But CLOCK_TAI is guaranteed to be atomic and monotonic (https://man7.org/linux/man-pages/man3/clock_gettime.3.html). If we see different behavior here it's definitely a serious bug somewhere.
Reporter • Comment 40 • 6 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #39)
Hello, I did my own try here:
This is using the actual remote tests, which hit this problem intermittently and do not fail often enough for it to be caught easily. As such I've written a patch that I applied on top of the changes so far, and you should probably do the same for your try push.
Should we also have a very basic xpcshell test or similar unit test to verify the behavior? I don't think a browser chrome test is appropriate for that.
Assignee • Comment 41 • 6 months ago
Okay, I'll do debugging with your patch.
Assignee • Comment 42 • 6 months ago
Try with TAI patch and your debugging patch + assertion:
https://treeherder.mozilla.org/jobs?repo=try&revision=bc7e9a213603fce2589c4a5c925b53348618e5ba&selectedTaskRun=RVvZxMiXTYe-OX-DZOH8KQ.0
That looks quite good.
Assignee • Comment 43 • 6 months ago
I wonder if the problem happens when clock_gettime() returns the same time twice but we fuzz the second one into the future. Is that possible?
Assignee • Comment 44 • 6 months ago
Depends on D194968
Assignee • Comment 45 • 6 months ago
Also attached the test patch. It makes sure the time is monotonic.
Reporter • Comment 46 • 6 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #43)
I wonder if the problem happens when clock_gettime() returns the same time twice but we fuzzle the second one to future. Is that possible?
Martin, that's an interesting idea! I triggered some more tests and they still fail! Given that you have the assertion in place (right?) it would mean that we indeed do some modifications to the timestamp.
Assignee • Comment 47 • 6 months ago
Depends on D194968
Assignee • Comment 48 • 6 months ago
Assignee • Comment 49 • 6 months ago
Depends on D195579
Assignee • Comment 50 • 6 months ago
debug try with assertion: https://treeherder.mozilla.org/jobs?repo=try&revision=ee8ef8c560c4911b72f58a75e1d8756260a813fa
Reporter • Comment 51 • 6 months ago
Note that I triggered some more tests and the assertion as added for PRMJ_Now() still fails intermittently.
Assignee • Comment 52 • 6 months ago
Yes. Let's look at: https://treeherder.mozilla.org/jobs?repo=try&revision=ee8ef8c560c4911b72f58a75e1d8756260a813fa&selectedTaskRun=O7EGGBcMR02JLTC7CI2BZQ.0
It fails because we build on an old system (Ubuntu 18.04?) without CLOCK_TAI, so gettimeofday() is used. But we then run it on Ubuntu 22.04. Looks like we should use a run-time check.
Reporter • Comment 53 • 6 months ago
Oh, that is a good finding and would also explain why you haven't seen it locally, given that you build and run on the same system, right?
Comment 54 • 6 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #52)
It fails because we build on an old system (Ubuntu 18.04?) without CLOCK_TAI, so gettimeofday() is used. But we then run it on Ubuntu 22.04. Looks like we should use a run-time check.
I was going to mention that during review, but I checked our build sysroot (see bug 1690930 and related) and it defines CLOCK_TAI in <linux/time.h>:
/home/jld/.mozbuild/sysroot-x86_64-linux-gnu/usr/include/linux/time.h
58:#define CLOCK_TAI 11
…so I thought it wasn't a concern for Mozilla's builds. But what I missed is that <time.h> includes <bits/time.h>, which is from glibc, and it looks like that doesn't have CLOCK_TAI. (Modern versions do, but not the old one in the sysroot.) Also we probably can't just include <linux/time.h> because it will conflict with <time.h>.
So it looks like we'll need to define it locally like we do for a few other things (and a few more we inherited from Chromium, and also these in IPC, etc.).
Assignee • Comment 55 • 6 months ago
I guess we need to implement a run-time fallback to gettimeofday() for BSD/Sun and similar systems, as CLOCK_TAI may be Linux-only.
Assignee • Comment 56 • 6 months ago
Did another try with various debug info:
https://treeherder.mozilla.org/jobs?repo=try&revision=2d1f3763238ec9cd77517adce3b043b34bacb800
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf1f36d3f1e4599da067a7e9fba38095ee35fe20
So far it looks like CLOCK_TAI is broken too, which is quite strange. It uses https://phabricator.services.mozilla.com/D195766 with a run-time check, and CLOCK_TAI is used according to the log.
Assignee • Comment 57 • 6 months ago
Depends on D195623
Assignee • Comment 58 • 6 months ago
Assignee • Comment 59 • 6 months ago
Here's the latest try: https://treeherder.mozilla.org/logviewer?job_id=439348214&repo=try&lineNumber=3076
[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** last_ts.tv_sec = 1702032429 last_ts.tv_nsec = 37248922
[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** ts.tv_sec = 1702032428 ts.tv_nsec = 406780504
[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** lastTime = 1702032429037248 recentTime = 1702032428406780
[task 2023-12-08T10:47:08.411Z] 10:47:08 INFO - GECKO(3776) | Assertion failure: lastTime <= recentTime (clock_gettime() returned time from past?), at /builds/worker/checkouts/gecko/js/src/vm/Time.cpp:71
Assignee • Comment 60 • 6 months ago
So it really looks like a bug with the system itself.
Reporter • Comment 61 • 6 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)
So it really looks like a bug with the system itself.
Does it mean that we do not require any change / addition to Firefox but require folks to check what's wrong with the docker image that we are using for these tests?
Assignee • Comment 62 • 6 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #61)
(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)
So it really looks like a bug with the system itself.
Does it mean that we do not require any change / addition to Firefox but require folks to check what's wrong with the docker image that we are using for these tests?
IMHO the patches here are correct, should work, and may be better than the original gettimeofday(). But the real problem is the kernel call failing: CLOCK_TAI is guaranteed to be monotonic but it isn't (unless we're dealing with a multi-thread scenario here; is PRMJ_Now() supposed to run from various threads?).
Reporter • Comment 63 • 6 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #62)
IMHO the patches here are correct, should work, and may be better than the original gettimeofday(). But the real problem is the kernel call failing: CLOCK_TAI is guaranteed to be monotonic but it isn't (unless we're dealing with a multi-thread scenario here; is PRMJ_Now() supposed to run from various threads?).
Jed, do you have an answer to this question or know whom we can ask here? If that's not the reason, we should find someone who can most likely help us fix the docker image that we use. At least it would also explain why normal users might not have this problem.
Comment 64 • 6 months ago
I don't know the JS engine very well, but JS is multithreaded (Web Workers), so there could be problems with non-atomic shared state. However, comment #59 reports a ~600ms (not µs) back-step if I'm reading it right, which is a larger order of magnitude than I'd expect for a simple data race.
However (and this is something I should have noticed earlier), it looks like this is the backend function for JS Date.now(), in which case the spec and the linked definition of "time value" implies that using TAI instead of UTC is incorrect, but it doesn't explicitly state that time values can't decrease if the system clock is changed. Maybe it's the test that's incorrect.
And, now that I look at Linux's clock documentation more closely, the description of CLOCK_TAI simply says that it doesn't experience discontinuities or backsteps caused by leap seconds (which occur at most twice a year, haven't occurred since 2016, and so far have all been positive jumps). Further, the text "nonsettable [and] derived from wall-clock time but ignoring leap seconds" kind of suggests that it's really just CLOCK_REALTIME but with (currently) 37 seconds added. So that would explain why it's not helping.
As for what is happening: I don't know exactly how timekeeping works on this system, but the ntpd(8) man page mentions that it will slew the clock (make it run slower or faster but still increase monotonically; see adjtime(2)) if the adjustment needed is less than 128 ms, but will step the clock (see clock_settime(2)) otherwise. So, if the new containers are running on something with an unusually erratic clock (maybe a VM?) that could explain things.
Note in particular that, if I understand the time_namespaces(7) man page correctly, CLOCK_REALTIME is always systemwide, so this would be due to a timekeeping daemon running outside of the container.
Comment 65 • 6 months ago
(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #64)
leap seconds (which occur at most twice a year, haven't occurred since 2016, and so far have all been positive jumps).
Slight mistake here: positive leap seconds are extra seconds that actually happen, i.e. there's a minute with 61 seconds and so on, so it's a point when TAI advances and UTC doesn't, so that is where UTC steps backwards if I understand correctly. (Also, “occur at most twice a year” is historically true but wasn't a rule.) Nonetheless, there hasn't been one since the end of 2016.
Reporter • Comment 66 • 6 months ago
Can I assume that it is still not clear where this problem happens? I don't think the test is wrong, given that it works fine everywhere else including the old Ubuntu 18.04 test docker image. It's only failing with the new one.
What are the next steps that we should take? Is there anything else that we would have to patch in our date implementation, or shall we request more information about the docker image in use and maybe get it upgraded in case its kernel version is broken in a way that was already fixed upstream?
Assignee • Comment 67 • 6 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #66)
What are the next steps that we should take? Is there anything else that we would have to patch in our date implementation, or shall we request more information about the docker image in use and maybe get it upgraded in case its kernel version is broken in a way that was already fixed upstream?
Looks like an external problem with the VM. I'll try the tests on a local VM with Ubuntu 22.04.
Assignee • Comment 68 • 6 months ago
I tested an unpatched nightly on an Ubuntu 22.04 VM but I can't reproduce it (with the test patch https://hg.mozilla.org/try/rev/807ac30048530acbb8d2aa33da10feb5343b0a83). So it really looks like a problem with the Mozilla VM. Maybe the time is fiddled with there, as both gettimeofday() and CLOCK_TAI/CLOCK_REALTIME are broken?
Assignee • Comment 69 • 6 months ago
For instance, does the Mozilla box use NTP? In that case the system time may be updated during a test.
Reporter • Comment 70 • 6 months ago
(In reply to Martin Stránský [:stransky] (ni? me) from comment #69)
For instance does the Mozilla box use NTP? In such case the system time may be updated during test.
But that shouldn't happen multiple times in a row, right? When it fails, it does so quite often within a single test job while other jobs are not affected at all.
Joel, who actually takes care of the Linux docker images with Wayland? Looks like we need some help. Thanks.
Comment 71 • 6 months ago
Keep in mind this is a VM, not docker, and right now :aerickson is the mastermind behind it.
Comment 72 • 6 months ago
The workers are VirtualBox VMs inside Google GCP VMs (we have work planned to replace them with a bare VM now that we've fixed the GCP kernel issue). It's not too surprising that there is system clock funkiness.
I thought we were just greening up functional tests, not perf/timing tests on these workers.
Reporter • Comment 73 • 6 months ago
(In reply to Andrew Erickson [:aerickson] from comment #72)
The workers are Virtualbox VMs inside a Google GCP VMs (we have work planned to replace with a bare VM now that we've fixed the GCP kernel issue). It's not too surprising that there is system clock funkiness.
Do you have any ETA yet for how long this will take? If it's a quarter then we should definitely revert the remote tests to run with the old Ubuntu 18.04 test image. The number of failures is just too high and we would like to see it gone. See our meta bug 1859239 and its dependencies.
I thought we were just greening up functional tests, not perf/timing tests on these workers.
These tests are functional tests, but they also check the time and order of events sent out by our CDP (Chrome DevTools Protocol) implementation. Right now we see timing issues, mixed orders of events, events not being sent at all (most likely setTimeout etc. doesn't work correctly), and other issues that I haven't investigated yet because I believe this bug is the underlying issue.
Comment 74 • 6 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #73)
(In reply to Andrew Erickson [:aerickson] from comment #72)
Do you have any ETA yet how long this will take? If that's a quarter then we should definitely revert the remote tests to run with the old Ubuntu 18.04 test image. The number of failures is just too high and we would like to see this gone. See our meta bug 1859239 and its dependencies.
I'd like to have a Wayland bare VM pool up by mid-January... so a month or so.
Here are the tickets to track:
bare vm wayland image: https://mozilla-hub.atlassian.net/browse/RELOPS-793
d2g image: https://mozilla-hub.atlassian.net/browse/RELOPS-528 (same image will be used for 2204 headless, https://mozilla-hub.atlassian.net/browse/RELOPS-692)
Reporter • Comment 75 • 5 months ago
Thanks aerickson! That is still a bit of time, but maybe we could wait.
Ahal, what do you think? Would it cause any trouble to revert the Taskcluster changes for remote mochitests to use the old 18.04 docker image? Then once we have the new bare VM we could run try builds to prove that the timestamp issues are no longer present. If the bare VM cannot solve it we would keep this top failure for our tests, which is not satisfying.
Reporter • Comment 76 • 5 months ago
In case Ahal is already out, maybe Jmaher could help with that decision? As it turned out, this is really a regression from enabling remote tests on Wayland (bug 1857319).
Comment 77 • 5 months ago
The goal was to get as much automation as possible turned on for Wayland, but there was no hard requirement that specific suites be enabled. Everything is still running on 18.04. I would vote for disabling these on 22.04 Wayland for now and then look into turning them back on when we have the new VM image.
Comment 78 • 5 months ago
Yeah we can turn it off if we need to, but further regressions could slip in in the meantime, meaning you'd be causing yourself more work in the long run.
Is it possible to disable a subset of the tests that fail most frequently to get this down to an acceptable intermittency rate instead?
Reporter • Comment 79 • 5 months ago
Thanks. I think that we will take that risk to have clean tests again. Personally I think it should help us later to switch to the updated test workers, and that way we can ensure that no new test failures start to happen.
It looks like we only have to disable the tests for CDP under /remote/cdp/test/browser/. How do we do that for the tag tags = "cdp"?
Comment 80 • 5 months ago
You can just add a skip-if to the [DEFAULT] section of the manifest, e.g.:
https://searchfox.org/mozilla-central/source/accessible/tests/browser/atk/browser.toml#3
You can skip on the condition display == "wayland" in this case.
Reporter • Comment 81 • 5 months ago
All tests for CDP will be skipped soon via bug 1871611.
Reporter • Comment 82 • 5 months ago
We have now turned off all remote tests for CDP on Wayland, which should make this error go away. Based on the feedback from Martin and Andrew this issue is caused by the current worker setup and should hopefully be fixed by late January by a replacement worker.
This means that this bug can be marked as invalid, given that the problem is not a fault on our side. Thanks for all of your time and feedback, Martin!