Closed Bug 1861933 Opened 7 months ago Closed 5 months ago

[wayland] Remote test failures: Timestamp <random> in expected range [<before> - <after>]

Categories

(Core :: JavaScript Engine, defect, P3)

defect

Tracking


RESOLVED INVALID
Tracking Status
firefox-esr115 --- unaffected
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- wontfix

People

(Reporter: whimboo, Assigned: stransky)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(5 obsolete files)

CDP remote tests are randomly failing with the following error:

Timestamp 1697616453.232 in expected range [1697616453.101 - 1697616452.987]

Here's an example:

[task 2023-10-18T08:07:36.523Z] 08:07:36     INFO - TEST-UNEXPECTED-FAIL | remote/cdp/test/browser/page/browser_domContentEventFired.js | Timestamp 1697616453.232 in expected range [1697616453.101 - 1697616452.987] - 
[task 2023-10-18T08:07:36.523Z] 08:07:36     INFO - Stack trace:
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:test_ok:1583
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/page/browser_domContentEventFired.js:runContentEventFiredTest:91
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/page/browser_domContentEventFired.js:eventWhenNavigatingWithNestedFrames:52
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/head.js:fn:87
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:handleTask:1134
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:_runTaskBasedTest:1206
[task 2023-10-18T08:07:36.525Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:Tester_execTest:1348
[task 2023-10-18T08:07:36.525Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:nextTest/<:1123

This happens for different events and so far cannot easily be reproduced.

We should check why this is only happening on test jobs running on the new Wayland workers.

Blocks: 1794938
Blocks: 1859310
Blocks: 1809418
Blocks: 1838833
Blocks: 1776024
Blocks: 1776064
Depends on: 1781376
Blocks: 1781376
No longer depends on: 1781376
Blocks: 1858636
Blocks: 1858635

The test remote/cdp/test/browser/page/browser_navigate.js is the one that fails most often. I pushed a try build for only this specific test so that we can get the full logs:

https://treeherder.mozilla.org/jobs?repo=try&revision=9980cba28b988fb1906b6390bfbb84f5b311b14c

Severity: -- → S3
Priority: -- → P3

As discussed on Monday I've now pushed a try build that iterates 100 times over a timestamp check for the load event. Let's see if that gives us some more details about this failure:

https://treeherder.mozilla.org/jobs?repo=try&revision=8add05d34fba14ee278874714977ed1432766fd2

This is actually a very strange behavior on Wayland! See the following failure:

Timestamp 1699389658.01 in expected range [1699389657.871 - 1699389657.297]

The logged time for start is later than the logged time for end. In both cases we use Date.now() / 1000.

This failure can only be seen on Wayland and we never had it on any other platform.

Martin, do you have an idea why this could happen?

Flags: needinfo?(stransky)

Here's a new try push that just checks the values from the Date.now() calls in ms and s:
https://treeherder.mozilla.org/jobs?repo=try&revision=2e26ab3436c752f291e069a937bb110d3f8cb69e

First of all I need to check how Date.now() is implemented in widget code. I see there's nsWindow::GetEventTimeStamp() https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/widget/gtk/nsWindow.cpp#4972 but I'm not sure it's related.

With the updated test from the last try build it's clear that it is indeed Date.now() that is misbehaving once in a while:
https://treeherder.mozilla.org/logviewer?job_id=435346305&repo=try&lineNumber=3479-3490

The test looks like:

    const timeBefore = Date.now();
    await new Promise(resolve => setTimeout(resolve, 200));
    const timeAfter = Date.now();

    ok(
      timeBefore < timeAfter,
      `Timestamps (ms) in expected order [${timeBefore} - ${timeAfter}]`
    );

Test results:

INFO - TEST-PASS | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (ms) in expected order [1699397093616 - 1699397093817] -
INFO - TEST-PASS | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (sec) in expected order [1699397093616 - 1699397093817] -
INFO - TEST-UNEXPECTED-FAIL | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (ms) in expected order [1699397093819 - 1699397093404] -

Date.now() calls js::date_now() which leads to NowAsMillis()/PRMJ_Now() call and gettimeofday(). I don't see what can be wrong here. Is clampAndJitter set in the test? https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/js/src/jsdate.cpp#1869

I don't see the failure locally (I updated it to run 1000 times). I wonder if it's related to the test environment.

Btw, I tested it with:

./mach test remote/cdp/test/browser/page/browser_loadEventFired.js
Flags: needinfo?(stransky)

Interesting. Smaug also pointed me to the following preferences:
https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/modules/libpref/init/StaticPrefList.yaml#14417-14430

I may try to set timer.maximum_firing_delay_tolerance_ms to 0 to disable timer coalescing. As of now it's not clear if that could have an impact here. The new try build:

https://treeherder.mozilla.org/jobs?repo=try&revision=2b53eb85da3aba7cd69f4e26b3521b96f7c0d7cb

Other preferences around timers that we should try are privacy.reduceTimerPrecision.* and privacy.resistFingerprinting.*. I will await the results first and push those to try later today.

Blocks: 1781337

Here are two new try pushes with privacy.reduceTimerPrecision=false set:

artifact: https://treeherder.mozilla.org/jobs?repo=try&revision=3bd9b4a255ab3ea72d5a2e94929d2481ad4a2726
full+pernosco: https://treeherder.mozilla.org/jobs?repo=try&revision=1d5b2804f28a88e96fa70bc53166671f62c47430

If we cannot reproduce it locally, maybe we could use a Pernosco recording? The second try build might work and catch the problem.

The failure is still present with the latest try build and the above-mentioned preferences set.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #7)

Date.now() calls js::date_now() which leads to NowAsMillis()/PRMJ_Now() call and gettimeofday(). I don't see what can be wrong here. Is clampAndJitter set in the test? https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/js/src/jsdate.cpp#1869

I'm not sure if the previously mentioned prefs affect the behavior of clampAndJitter, or how to test that.

Tom, do you have an idea what might be going on here on Wayland? For details please see my comment 6. Thanks!

Flags: needinfo?(tom)

Bug 1857488 landed 4 days before this bug was filed. However, this test doesn't modify any of the timer preferences, so I'm not sure how it might interact. But it's a strong contender if this just started happening.

As far as prefs go:

  1. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=false to test time precision with no clamping or jittering. It's nanosecond precision. If that test fails with this, it's almost certainly not related to any timer clamping code.
  2. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=true to get 20 microseconds everywhere.
  3. Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=20 to get 20 microseconds without jitter*
  4. Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=XXX to get XXX microseconds without jitter and you can dial that up to 1000 which is the default**
  5. Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=1000 to get normal browser behavior without jitter
  6. Clear all prefs to get normal browser behavior (1000usec/1ms + jitter) which is how this test appears to run

* Jitter means taking a random midpoint between two clamped timestamps (e.g. with a precision of 100 usec and adjacent values of 1.3ms and 1.4ms, a random midpoint could be 1.372ms) and rounding the actual timestamp up or down based on that midpoint (e.g. a timestamp of 1.370 would go down and a timestamp of 1.374 would go up); see the sketch after these notes.

** In normal behavior, a page running with COEP+COOP (meaning it's granted SAB) will get 20 microsecond precision, other pages will get 1ms - and jitter is on for both.
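To make the jitter footnote above concrete, here is a minimal, hedged sketch of that scheme. It is not Gecko's actual nsRFPService code; the function name and the use of rand() are illustrative assumptions.

#include <cstdint>
#include <cstdlib>

// Clamp a timestamp to a grid of `precisionUs` microseconds and round it up or
// down around a random midpoint inside the interval, as described above.
int64_t ClampAndJitter(int64_t timestampUs, int64_t precisionUs) {
  int64_t lower = (timestampUs / precisionUs) * precisionUs;  // e.g. 1300 us
  int64_t upper = lower + precisionUs;                        // e.g. 1400 us
  // The real code derives the midpoint deterministically from a secret seed so
  // it stays stable per interval; rand() merely stands in for that here.
  int64_t midpoint = lower + (std::rand() % precisionUs);     // e.g. 1372 us
  return timestampUs < midpoint ? lower : upper;
}

With a precision of 100 usec this reproduces the example above: 1370 rounds down to 1300 and 1374 rounds up to 1400.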

Flags: needinfo?(tom)

As discussed with Tom via Matrix it would be good to have try runs for the first two settings. Verbose logging would be helpful, especially on the second one, and this should be doable via MOZ_LOG="nsResistFingerprinting:5".
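For reference, a local invocation following the same pattern as the earlier ./mach test call could look like this (the CI jobs would set the variable in the task environment instead):

MOZ_LOG="nsResistFingerprinting:5" ./mach test remote/cdp/test/browser/page/browser_loadEventFired.js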

(In reply to Tom Ritter [:tjr] from comment #14)

  1. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=false to test time precision with no clamping or jittering. It's nanosecond precision. If that test fails with this, it's almost certainly not related to any timer clamping code.

Try build: https://treeherder.mozilla.org/jobs?repo=try&revision=1948e6f46cf106b84d884b2fd2e53d1041ce2fa5

  2. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=true to get 20 microseconds everywhere.

Try build: https://treeherder.mozilla.org/jobs?repo=try&revision=3a1004756ec5c5d1b3f74eb33f6ffea2caed08d1

Let's see what the results are.

Tom, both try builds show the same failure, so I assume it's not related to fingerprinting? Do you have another idea how to get further details? Would it make sense to log the native call when we get the timestamp? If so, where would that have to be done?

Flags: needinfo?(tom)

You could add fprintf(stderr, ...) logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything with those prefs set) to be safe...

After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...

Flags: needinfo?(tom)

PRMJ_Now(), which is the primary source of time data, is the same for X11 and Wayland:
https://searchfox.org/mozilla-central/rev/21001eee69767de5db95952154d49f01b72318ba/js/src/vm/Time.cpp#37

Thanks! I'll try that next.

For now I just pushed a try build that disables setting the CDP-specific preferences fission.bfcacheInParent=false and fission.webContentIsolationStrategy=0. Given that we do not see this problem in other Remote Protocol related test suites, this might be a factor as well.

https://treeherder.mozilla.org/jobs?repo=try&revision=83643f8d23a9a25804254b0433bc90a7c792ad8f

Blocks: 1861765
Blocks: 1861320
Blocks: 1858778
Blocks: 1863926
Blocks: 1858632

(In reply to Tom Ritter [:tjr] from comment #18)

You could add fprintf(stderr, ...) logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything with those prefs set) to be safe...

After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...

I was finally able to add the print statement close to the call to PRMJ_Now() in jsdate.cpp, right before the clamping code that I commented out. The log lines make it clear that this call returns wrong timestamps:

[task 2023-11-24T21:34:28.701Z] 21:34:28     INFO - GECKO(4367) | *** current datetime: 1700861668694.864990
[task 2023-11-24T21:34:26.419Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666418.251953
[task 2023-11-24T21:34:26.420Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666419.215088

The same timestamp difference can actually be seen as well in the mozlog timestamp. So I assume that it's not related to any code in Firefox / Gecko but really an issue with the platform / worker?

Martin, do you have any idea what to do next?

Andrew and Joel, are we aware of other already ported test suites that show a larger amount of test failures compared to the old 18.04 workers? Because of the time differences I could imagine all kinds of possible race conditions.

Flags: needinfo?(stransky)
Flags: needinfo?(jmaher)
Flags: needinfo?(ahal)
Blocks: 1858282
Blocks: 1858353

There's a breakdown of which suites have been enabled here:
https://docs.google.com/spreadsheets/d/173Y7PIldTomoDbGvtcc_YL-NSeUDjU1c_5d2oV3rhdg/edit#gid=0

There's been the usual slew of failures / intermittents that always accompany any new test platform, but I can't say it's seemed particularly bad or anything. What's wrong with the timestamps, is it that the first one is bigger?

I wonder if this is a case of the test harness buffering messages and not emitting the logs in the exact order it received them?

Flags: needinfo?(ahal)
Flags: needinfo?(jmaher)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #21)

(In reply to Tom Ritter [:tjr] from comment #18)

You could add fprintf(stderr, ...) logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything with those prefs set) to be safe...

After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...

I was finally able to add the print statement close to the call to PRMJ_Now() in jsdate.cpp, right before the clamping code that I commented out. The log lines make it clear that this call returns wrong timestamps:

[task 2023-11-24T21:34:28.701Z] 21:34:28     INFO - GECKO(4367) | *** current datetime: 1700861668694.864990
[task 2023-11-24T21:34:26.419Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666418.251953
[task 2023-11-24T21:34:26.420Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666419.215088

The same timestamp difference can actually be seen as well in the mozlog timestamp. So I assume that it's not related to any code in Firefox / Gecko but really an issue with the platform / worker?

Martin, do you have any idea what to do next?

I'd put the logging directly here:
https://searchfox.org/mozilla-central/rev/6eb2ebcafb1b4a8576eb513e6cd2c61e3f3ae6dc/js/src/vm/Time.cpp#45

That's the OS-level source of time data and it should be correct. Or at least save the latest PRMJ_Now() value and check that any new call returns a bigger value than the previous one? Something like:

int64_t PRMJ_Now() {
  ...
  // Remember the last value returned and assert that each new call returns a
  // strictly larger timestamp than the previous one.
  static int64_t latest_time = 0;
  int64_t recent_time = int64_t(tv.tv_sec) * PRMJ_USEC_PER_SEC + int64_t(tv.tv_usec);
  MOZ_RELEASE_ASSERT(latest_time < recent_time);
  latest_time = recent_time;
  return recent_time;
}
Flags: needinfo?(stransky)

That's a good idea. Let me try the assertion way. I've pushed a new try build here:
https://treeherder.mozilla.org/jobs?repo=try&revision=5254d5e6416a48f84ebf7b2f36499781da004924

Once the tests have run and failed I might request a one-time loaner and try to check the behavior on the command line as well.

And the assertion fails:
https://treeherder.mozilla.org/jobs?repo=try&revision=5254d5e6416a48f84ebf7b2f36499781da004924&selectedTaskRun=eqN1j9gaT3iM_sPTDZn_Vw.0

I tried to create an interactive task for the remote job on Wayland but that doesn't seem to be possible. Andrew, don't we support interactive tasks for Wayland at all yet?

Flags: needinfo?(ahal)

Looks like it's a known bug and depends on Linux kernel version: https://bugzilla.redhat.com/show_bug.cgi?id=244697
I guess it's a bug in Ubuntu 22.04 kernel.

Looks like we should use clock_gettime(CLOCK_MONOTONIC) instead of gettimeofday():
https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html
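A minimal sketch of what such a replacement could look like (the helper name is an assumption, and this is not the actual patch). One caveat: CLOCK_MONOTONIC counts from an arbitrary, roughly boot-relative origin rather than the Unix epoch, which matters for an epoch-based API like Date.now().

#include <cstdint>
#include <ctime>

// Microseconds from a monotonic clock; never steps backwards, but the origin
// is arbitrary (roughly system boot), not 1970-01-01.
int64_t MonotonicNowMicroseconds() {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return int64_t(ts.tv_sec) * 1000000 + int64_t(ts.tv_nsec) / 1000;
}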

Flags: needinfo?(stransky)

Will look at it. We may need a different code path for systems without monotonic time.

Assignee: nobody → stransky

Great find, Martin! Nevertheless I still wonder why we do not see such a massive amount of test failures for the other jobs already running on Wayland. The list in Andrew's link is quite large.

I've pushed a new try build with my test and your patch applied:
https://treeherder.mozilla.org/jobs?repo=try&revision=917cb508facb7733e7d634af0592ec411cc44bdd

I wonder if such a test would be helpful as well to detect a problem like that.

Status: NEW → ASSIGNED
Component: CDP → JavaScript Engine
Flags: needinfo?(ahal)
Product: Remote Protocol → Core

Monotonic doesn't work here. Let's try CLOCK_TAI which is available on Linux 3.0+

try with CLOCK_TAI: https://treeherder.mozilla.org/jobs?repo=try&revision=861ef258f93155bbbc2f59b5c7a597b3d9f6cf79
It should use the same timebase as gettimeofday() but with guaranteed time value increment.

Flags: needinfo?(stransky)
Attachment #9366045 - Attachment description: Bug 1861933 [Linux] Use CLOCK_MONOTONIC in PRMJ_Now() r?tjr → Bug 1861933 [Linux] Use CLOCK_TAI in PRMJ_Now() r?tjr

Henrik, please try the latest patch (Use CLOCK_TAI in PRMJ_Now()) on Wayland / Ubuntu 22.04 as it's supposed to work there.
Thanks.

Flags: needinfo?(hskupin)
Flags: needinfo?(hskupin)

Martin, I have sad news. As my try build above shows, the problem is still present in CI, though maybe at a lower rate.

Martin, do you have any further ideas? Is there something that we might want to check on such a system directly via the command line? If that would be helpful I'm happy to find out how to get an interactive task.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #38)

Martin, do you have any further ideas? Is there something that we might want to check on such a system directly via the command line? If that would be helpful I'm happy to find out how to get an interactive task.

Hello, I did my own try here:

https://treeherder.mozilla.org/jobs?repo=try&revision=8399c77ac22b8816ca80a81efafb80c56bfc30bc&selectedTaskRun=TLBRLNWZQ0CfU9yjt3CZRw.0

and it looks quite good. CLOCK_TAI generally works as a replacement for gettimeofday() and uses the same timebase, so it's compatible with it. So I believe it's the way forward.

But CLOCK_TAI is guaranteed to be atomic and monotonic (https://man7.org/linux/man-pages/man3/clock_gettime.3.html). If we see a different behavior here it's definitely a serious bug somewhere.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #39)

Hello, I did my own try here:

https://treeherder.mozilla.org/jobs?repo=try&revision=8399c77ac22b8816ca80a81efafb80c56bfc30bc&selectedTaskRun=TLBRLNWZQ0CfU9yjt3CZRw.0

This is using the actual remote tests, which hit this problem only intermittently and do not fail often enough that it can easily be caught. As such I've written a patch that I applied on top of the changes so far; you should probably do the same for your try push.

Should we also have a very basic xpcshell test or similar unit test to verify the behavior? I don't think a browser-chrome test is appropriate for that.

Okay, I'll do debugging with your patch.

I wonder if the problem happens when clock_gettime() returns the same time twice but we fuzz the second one into the future. Is that possible?

Attachment #9366045 - Attachment description: Bug 1861933 [Linux] Use CLOCK_TAI in PRMJ_Now() r?tjr → Bug 1861933 [Linux] Use CLOCK_TAI in PRMJ_Now() r?jld

Also attached the test patch. It makes sure the time is monotonic.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #43)

I wonder if the problem happens when clock_gettime() returns the same time twice but we fuzz the second one into the future. Is that possible?

Martin, that's an interesting idea! I triggered some more tests and they still fail! Given that you have the assertion in place (right?), it would mean that we indeed modify the timestamp somewhere.

Blocks: 1861767
Attachment #9366509 - Attachment is obsolete: true

Note that I triggered some more tests and the assertion as added for PRMJ_Now() still fails intermittently.

Yes. Let's look at: https://treeherder.mozilla.org/jobs?repo=try&revision=ee8ef8c560c4911b72f58a75e1d8756260a813fa&selectedTaskRun=O7EGGBcMR02JLTC7CI2BZQ.0

It fails because we build it on an old system (Ubuntu 18.04?) without CLOCK_TAI, so gettimeofday() is used. But we then run it on Ubuntu 22.04. Looks like we should use a run-time check.

Oh, that is a good finding and would also explain why you haven't seen it locally given that you might run on the same kernel, right?

(In reply to Martin Stránský [:stransky] (ni? me) from comment #52)

It fails because we build it on an old system (Ubuntu 18.04?) without CLOCK_TAI, so gettimeofday() is used. But we then run it on Ubuntu 22.04. Looks like we should use a run-time check.

I was going to mention that during review, but I checked our build sysroot (see bug 1690930 and related) and it defines CLOCK_TAI in <linux/time.h>:

/home/jld/.mozbuild/sysroot-x86_64-linux-gnu/usr/include/linux/time.h
58:#define CLOCK_TAI                    11

…so I thought it wasn't a concern for Mozilla's builds. But, what I missed is that <time.h> includes <bits/time.h>, which is from glibc, and it looks like that doesn't have CLOCK_TAI. (Modern versions do, but not the old one in the sysroot.) Also we probably can't just include <linux/time.h> because it will conflict with <time.h>.

So, it looks like we'll need to define it locally like we do for a few other things (and a few more we inherited from Chromium, and also these in IPC, etc.).

I guess we need to implement run-time fallback to gettimeofday() for BSD/Sun and similar systems. CLOCK_TAI may be Linux only.
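A hedged sketch of how that run-time check, a local CLOCK_TAI definition, and a gettimeofday() fallback could fit together (an illustration, not the actual D195766 patch; the helper name is an assumption):

#include <cstdint>
#include <ctime>
#include <sys/time.h>

#if defined(__linux__) && !defined(CLOCK_TAI)
#  define CLOCK_TAI 11  // value from <linux/time.h>, as quoted above
#endif

static int64_t NowMicroseconds() {
#ifdef CLOCK_TAI
  // Probe once: older kernels (and non-Linux systems) reject CLOCK_TAI.
  static const bool useTai = [] {
    struct timespec probe;
    return clock_gettime(CLOCK_TAI, &probe) == 0;
  }();
  if (useTai) {
    struct timespec ts;
    clock_gettime(CLOCK_TAI, &ts);
    return int64_t(ts.tv_sec) * 1000000 + int64_t(ts.tv_nsec) / 1000;
  }
#endif
  // Fallback for kernels/systems without CLOCK_TAI.
  struct timeval tv;
  gettimeofday(&tv, nullptr);
  return int64_t(tv.tv_sec) * 1000000 + int64_t(tv.tv_usec);
}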

Did another try with various debug infos:
https://treeherder.mozilla.org/jobs?repo=try&revision=2d1f3763238ec9cd77517adce3b043b34bacb800
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf1f36d3f1e4599da067a7e9fba38095ee35fe20

So far it looks like CLOCK_TAI is broken too, which is quite strange. It uses https://phabricator.services.mozilla.com/D195766 with the run-time check, and CLOCK_TAI is used according to the log.

Here's the latest try: https://treeherder.mozilla.org/logviewer?job_id=439348214&repo=try&lineNumber=3076

[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** last_ts.tv_sec = 1702032429 last_ts.tv_nsec = 37248922
[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** ts.tv_sec = 1702032428 ts.tv_nsec = 406780504
[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** lastTime = 1702032429037248 recentTime = 1702032428406780
[task 2023-12-08T10:47:08.411Z] 10:47:08 INFO - GECKO(3776) | Assertion failure: lastTime <= recentTime (clock_gettime() returned time from past?), at /builds/worker/checkouts/gecko/js/src/vm/Time.cpp:71

So it really looks like a bug in the system itself.

Flags: needinfo?(stransky)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)

So it really looks like a bug in the system itself.

Does that mean we do not need any change / addition to Firefox, but instead need folks to check what's wrong with the docker image that we are using for these tests?

Flags: needinfo?(stransky)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #61)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)

So it really looks like a bug in the system itself.

Does that mean we do not need any change / addition to Firefox, but instead need folks to check what's wrong with the docker image that we are using for these tests?

IMHO the patches here are correct, should work, and may be better than the original gettimeofday(). But the real problem is the failing kernel call: CLOCK_TAI is guaranteed to be monotonic but it isn't (unless we are dealing with a multi-thread scenario here; is PRMJ_Now() supposed to run from various threads?).

Flags: needinfo?(stransky)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #62)

IMHO the patches here are correct, should work, and may be better than the original gettimeofday(). But the real problem is the failing kernel call: CLOCK_TAI is guaranteed to be monotonic but it isn't (unless we are dealing with a multi-thread scenario here; is PRMJ_Now() supposed to run from various threads?).

Jed, do you have an answer to this question, or know whom we can ask here? If that's not the reason, we should find someone who can help us fix the docker image that we use, which is most likely the culprit. That would at least also explain why normal users might not have this problem.

Flags: needinfo?(jld)

I don't know the JS engine very well, but JS is multithreaded (Web Workers), so there could be problems with non-atomic shared state. However, comment #59 reports a ~600ms (not µs) back-step if I'm reading it right, which is a larger order of magnitude than I'd expect for a simple data race.

However — and this is something I should have noticed earlier — it looks like this is the backend function for JS Date.now(), in which case the spec and the linked definition of “time value” implies that using TAI instead of UTC is incorrect, but it doesn't explicitly state that time values can't decrease if the system clock is changed. Maybe it's the test that's incorrect.

And, now that I look at Linux's clock documentation more closely, the description of CLOCK_TAI simply says that it doesn't experience discontinuities or backsteps caused by leap seconds (which occur at most twice a year, haven't occurred since 2016, and so far have all been positive jumps). Further, the text “nonsettable [and] derived from wall-clock time but ignoring leap seconds” kind of suggests that it's really just CLOCK_REALTIME but with (currently) 37 seconds added. So that would explain why it's not helping.

As for what is happening: I don't know exactly how timekeeping works on this system, but the ntpd(8) man page mentions that it will slew the clock (make it run slower or faster but still increase monotonically; see adjtime(2)) if the adjustment needed is less than 128ms, but will step the clock (see clock_settime(2)) otherwise. So, if the new containers are running on something with an unusually erratic clock (maybe a VM?), that could explain things.

Note in particular that, if I understand the time_namespaces(7) man page correctly, CLOCK_REALTIME is always systemwide, so this would be due to a timekeeping daemon running outside of the container.

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #64)

leap seconds (which occur at most twice a year, haven't occurred since 2016, and so far have all been positive jumps).

Slight mistake here: positive leap seconds are extra seconds that actually happen, i.e. there's a minute with 61 seconds and so on, so it's a point when TAI advances and UTC doesn't, so that is where UTC steps backwards if I understand correctly. (Also, “occur at most twice a year” is historically true but wasn't a rule.) Nonetheless, there hasn't been one since the end of 2016.

Can I assume that it is still not clear where this problem originates? I don't think that the test is wrong, given that it works fine everywhere else, including on the old Ubuntu 18.04 test docker image. It's only failing on the new one.

What are the next steps that we should take? Is there anything else that we would have to patch in our date implementation, or shall we request more information about the docker image in use and maybe get it upgraded, in case its kernel version has a bug that has already been fixed?

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #66)

What are the next steps that we should take? Is there anything else that we would have to patch in our date implementation, or shall we request more information about the docker image in use and maybe get it upgraded, in case its kernel version has a bug that has already been fixed?

Looks like an external problem with the VM. I'll try the tests on a local VM with Ubuntu 22.04.

I tested an unpatched Nightly on an Ubuntu 22.04 VM but I can't reproduce it (with the test patch https://hg.mozilla.org/try/rev/807ac30048530acbb8d2aa33da10feb5343b0a83). So it really looks like a problem with the Mozilla VM. Maybe the time is fiddled with there, as both gettimeofday() and CLOCK_TAI/CLOCK_REALTIME are broken.

For instance, does the Mozilla box use NTP? In that case the system time may be updated during the test.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #69)

For instance, does the Mozilla box use NTP? In that case the system time may be updated during the test.

But that shouldn't run multiple times in a row, right? When it fails, it quite often fails several times within a single test job, while other jobs are not affected at all.

Joel, who actually takes care of the Linux docker images with Wayland? Looks like we need some help. Thanks.

Flags: needinfo?(jmaher)

keep in mind this is a VM, not docker, and right now :aerickson is the mastermind behind it.

Flags: needinfo?(jmaher) → needinfo?(aerickson)

The workers are VirtualBox VMs inside Google GCP VMs (we have work planned to replace them with a bare VM now that we've fixed the GCP kernel issue). It's not too surprising that there is system clock funkiness.

I thought we were just greening up functional tests, not perf/timing tests on these workers.

Flags: needinfo?(aerickson)

(In reply to Andrew Erickson [:aerickson] from comment #72)

The workers are VirtualBox VMs inside Google GCP VMs (we have work planned to replace them with a bare VM now that we've fixed the GCP kernel issue). It's not too surprising that there is system clock funkiness.

Do you have any ETA yet for how long this will take? If it's a quarter, then we should definitely revert the remote tests to run with the old Ubuntu 18.04 test image. The number of failures is just too high and we would like to see this gone. See our meta bug 1859239 and its dependencies.

I thought we were just greening up functional tests, not perf/timing tests on these workers.

These tests are functional tests but they also check the time and order of events as sent out by our CDP (Chrome DevTools Protocol) implementation. Right now we can see timing issues, mixed orders of events, events not being sent at all (most likely setTimeout etc doesn't work correctly), and other issues that I haven't investigated yet because I feel that this bug is the underlying issue.

Flags: needinfo?(aerickson)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #73)

(In reply to Andrew Erickson [:aerickson] from comment #72)
Do you have any ETA yet for how long this will take? If it's a quarter, then we should definitely revert the remote tests to run with the old Ubuntu 18.04 test image. The number of failures is just too high and we would like to see this gone. See our meta bug 1859239 and its dependencies.

I'd like to have a Wayland bare VM pool up by mid-January... so a month or so.

Here are the tickets to track:

bare vm wayland image: https://mozilla-hub.atlassian.net/browse/RELOPS-793
d2g image: https://mozilla-hub.atlassian.net/browse/RELOPS-528 (same image will be used for 2204 headless, https://mozilla-hub.atlassian.net/browse/RELOPS-692)

Flags: needinfo?(aerickson)
See Also: → 1871279

Thanks, aerickson! That's still a fair amount of time, but maybe we can wait.

Ahal, what do you think? Would it cause any trouble to revert the Taskcluster changes for remote mochitests to use the old 18.04 docker image? Then, once we have the new bare VM, we could run try builds to prove that the timestamp issues are no longer present. If the bare VM cannot solve it, we would keep this top failure for our tests, which is not satisfying.

Flags: needinfo?(ahal)

In case Ahal is already out, maybe Jmaher could still help with that decision? As it turns out, this is really a regression from enabling remote tests on Wayland (bug 1857319).

Flags: needinfo?(jmaher)
Keywords: regression
Regressed by: 1857319

The goal was to get as much automation as possible turned on for Wayland, but there was no hard requirement that specific suites be enabled. Everything is still running on 18.04. I would vote for disabling these on 22.04 Wayland for now and then looking into turning them back on when we have the new VM image.

Flags: needinfo?(jmaher)

Yeah we can turn it off if we need to, but further regressions could slip in in the meantime, meaning you'd be causing yourself more work in the long run.

Is it possible to disable a subset of the tests that fail most frequently to get this down to an acceptable intermittency rate instead?

Flags: needinfo?(ahal)

Thanks. I think we will take that risk to have clean tests again. Personally I think it should help us later when switching to the updated test workers, and we can ensure that no new test failures start to happen.

It looks like we only have to disable the CDP tests under /remote/cdp/test/browser/. How do we do that for the manifests with tags = "cdp"?

Flags: needinfo?(ahal)

You can just add a skip-if to the [DEFAULT] section of the manifest. E.g.:
https://searchfox.org/mozilla-central/source/accessible/tests/browser/atk/browser.toml#3

You can skip on the condition display == "wayland" in this case.
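For illustration, a hedged sketch of what that could look like in one of the CDP manifests (the section contents and the comment are placeholders, not the actual change):

[DEFAULT]
tags = "cdp"
skip-if = ["display == 'wayland'"]  # Bug 1861933 - system clock jumps backwards on the Wayland workers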

Flags: needinfo?(ahal)
Depends on: 1871611

All tests for CDP will be skipped soon via bug 1871611.

We have now turned off all remote tests for CDP on Wayland, which should make this error go away. Based on the feedback from Martin and Andrew, this issue is caused by the current worker setup and should hopefully be fixed by late January with a replacement worker.

This means that this bug can be marked as invalid given that the problem is not a fault on our side. Thanks for all of your time and feedback Martin!

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → INVALID
Blocks: 1871279
Attachment #9366045 - Attachment is obsolete: true
Attachment #9367127 - Attachment is obsolete: true
Attachment #9367202 - Attachment is obsolete: true
Attachment #9367443 - Attachment is obsolete: true