Closed Bug 1861933 Opened 7 months ago Closed 5 months ago

[wayland] Remote test failures: Timestamp <random> in expected range [<before> - <after>]

Categories

(Core :: JavaScript Engine, defect, P3)

defect

Tracking


RESOLVED INVALID
Tracking Status
firefox-esr115 --- unaffected
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- wontfix

People

(Reporter: whimboo, Assigned: stransky)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(5 obsolete files)

CDP remote tests are randomly failing with the following error:

Timestamp 1697616453.232 in expected range [1697616453.101 - 1697616452.987]

Here's an example:

[task 2023-10-18T08:07:36.523Z] 08:07:36     INFO - TEST-UNEXPECTED-FAIL | remote/cdp/test/browser/page/browser_domContentEventFired.js | Timestamp 1697616453.232 in expected range [1697616453.101 - 1697616452.987] - 
[task 2023-10-18T08:07:36.523Z] 08:07:36     INFO - Stack trace:
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:test_ok:1583
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/page/browser_domContentEventFired.js:runContentEventFiredTest:91
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/page/browser_domContentEventFired.js:eventWhenNavigatingWithNestedFrames:52
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochitests/content/browser/remote/cdp/test/browser/head.js:fn:87
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:handleTask:1134
[task 2023-10-18T08:07:36.524Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:_runTaskBasedTest:1206
[task 2023-10-18T08:07:36.525Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:Tester_execTest:1348
[task 2023-10-18T08:07:36.525Z] 08:07:36     INFO - chrome://mochikit/content/browser-test.js:nextTest/<:1123

This happens for different events and so far cannot easily be reproduced.

We should check why this is only happening on test jobs running on the new Wayland workers.

Blocks: 1794938
Blocks: 1859310
Blocks: 1809418
Blocks: 1838833
Blocks: 1776024
Blocks: 1776064
Depends on: 1781376
Blocks: 1781376
No longer depends on: 1781376
Blocks: 1858636
Blocks: 1858635

The test remote/cdp/test/browser/page/browser_navigate.js is the one that fails most often. I pushed a try build for only this specific test so that we can get the full logs:

https://treeherder.mozilla.org/jobs?repo=try&revision=9980cba28b988fb1906b6390bfbb84f5b311b14c

Severity: -- → S3
Priority: -- → P3

As discussed on Monday I've now pushed a try build that iterates 100 times over a timestamp check for the load event. Let's see if that gives us some more details about this failure:

https://treeherder.mozilla.org/jobs?repo=try&revision=8add05d34fba14ee278874714977ed1432766fd2

This is actually a very strange behavior on Wayland! See the following failure:

Timestamp 1699389658.01 in expected range [1699389657.871 - 1699389657.297]

The logged time for start is later than the logged time for end. In both cases we use Date.now() / 1000.

This failure can only be seen on Wayland and we never had it on any other platform.

Martin, do you have an idea why this could happen?

Flags: needinfo?(stransky)

Here's a new try push that just checks the values from the Date.now() calls in ms and s:
https://treeherder.mozilla.org/jobs?repo=try&revision=2e26ab3436c752f291e069a937bb110d3f8cb69e

First of all I need to check how Date.now() is implemented in widget code. I see there's nsWindow::GetEventTimeStamp() https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/widget/gtk/nsWindow.cpp#4972 but I'm not sure it's related.

With the updated test from the last try build it's clear that it is indeed Date.now() that is misbehaving once in a while:
https://treeherder.mozilla.org/logviewer?job_id=435346305&repo=try&lineNumber=3479-3490

The test looks like:

    const timeBefore = Date.now();
    await new Promise(resolve => setTimeout(resolve, 200));
    const timeAfter = Date.now();

    ok(
      timeBefore < timeAfter,
      `Timestamps (ms) in expected order [${timeBefore} - ${timeAfter}]`
    );

Test results:

INFO - TEST-PASS | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (ms) in expected order [1699397093616 - 1699397093817] -
INFO - TEST-PASS | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (sec) in expected order [1699397093616 - 1699397093817] -
INFO - TEST-UNEXPECTED-FAIL | remote/cdp/test/browser/page/browser_loadEventFired.js | Timestamps (ms) in expected order [1699397093819 - 1699397093404] -

Date.now() calls js::date_now() which leads to NowAsMillis()/PRMJ_Now() call and gettimeofday(). I don't see what can be wrong here. Is clampAndJitter set in the test? https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/js/src/jsdate.cpp#1869

I don't see the failure locally (I updated it to run 1000 times). I wonder if it's related to the test environment.

Btw, I tested it with:

./mach test remote/cdp/test/browser/page/browser_loadEventFired.js
Flags: needinfo?(stransky)

Interesting. Smaug also pointed me to the following preferences:
https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/modules/libpref/init/StaticPrefList.yaml#14417-14430

I may try to set timer.maximum_firing_delay_tolerance_ms to 0 to disable timer coalescing. As of now it's not clear if that could have an impact here. The new try build:

https://treeherder.mozilla.org/jobs?repo=try&revision=2b53eb85da3aba7cd69f4e26b3521b96f7c0d7cb

Other preferences around timers that we should try are privacy.reduceTimerPrecision.* and privacy.resistFingerprinting.*. I will await the results first and push those to try later today.

Blocks: 1781337

Here are two new try pushes with privacy.reduceTimerPrecision=false set:

artifact: https://treeherder.mozilla.org/jobs?repo=try&revision=3bd9b4a255ab3ea72d5a2e94929d2481ad4a2726
full+pernosco: https://treeherder.mozilla.org/jobs?repo=try&revision=1d5b2804f28a88e96fa70bc53166671f62c47430

If we cannot reproduce it locally, maybe we could use a Pernosco recording? The second try build might work and catch the problem.

The failure is still present with the latest try build and the above-mentioned preferences set.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #7)

Date.now() calls js::date_now() which leads to NowAsMillis()/PRMJ_Now() call and gettimeofday(). I don't see what can be wrong here. Is clampAndJitter set in the test? https://searchfox.org/mozilla-central/rev/02841791400cf7cf5760c0cfaf31f5d772624253/js/src/jsdate.cpp#1869

I'm not sure if the previously mentioned prefs affect the behavior of clampAndJitter, or how to test that.

Tom, do you have an idea what might be going on here on Wayland? For details please see my comment 6. Thanks!

Flags: needinfo?(tom)

Bug 1857488 landed 4 days before this bug was filed. However, this test doesn't modify any of the timer preferences, so I'm not sure how it might interact. But it's a strong contender if this just started happening.

As far as prefs go:

  1. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=false to test time precision with no clamping or jittering. It's nanosecond precision. If that test fails with this, it's almost certainly not related to any timer clamping code.
  2. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=true to get 20 microseconds everywhere.
  3. Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=20 to get 20 microseconds without jitter*
  4. Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=XXX to get XXX microseconds without jitter and you can dial that up to 1000 which is the default**
  5. Set privacy.reduceTimerPrecision=true and privacy.resistFingerprinting.reduceTimerPrecision.jitter=false and privacy.resistFingerprinting.reduceTimerPrecision.microseconds=1000 to get normal browser behavior without jitter
  6. Clear all prefs to get normal browser behavior (1000usec/1ms + jitter) which is how this test appears to run

* Jitter means taking a random midpoint between two clamped timestamps (e.g. with a precision of 100 usec and adjacent values of 1.3ms and 1.4ms, a random midpoint could be 1.372ms) and rounding the actual timestamp up or down based on that midpoint (e.g. a timestamp of 1.370 would go down and a timestamp of 1.374 would go up); see the sketch after these notes.

** In normal behavior, a page running with COEP+COOP (meaning it's granted SAB) will get 20 microsecond precision, other pages will get 1ms - and jitter is on for both.
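To make the jitter footnote above concrete, here is a minimal, hedged sketch of that scheme. It is not Gecko's actual nsRFPService code; the function name and the use of rand() are illustrative assumptions.

#include <cstdint>
#include <cstdlib>

// Clamp a timestamp to a grid of `precisionUs` microseconds and round it up or
// down around a random midpoint inside the interval, as described above.
int64_t ClampAndJitter(int64_t timestampUs, int64_t precisionUs) {
  int64_t lower = (timestampUs / precisionUs) * precisionUs;  // e.g. 1300 us
  int64_t upper = lower + precisionUs;                        // e.g. 1400 us
  // The real code derives the midpoint deterministically from a secret seed so
  // it stays stable per interval; rand() merely stands in for that here.
  int64_t midpoint = lower + (std::rand() % precisionUs);     // e.g. 1372 us
  return timestampUs < midpoint ? lower : upper;
}

With a precision of 100 usec this reproduces the example above: 1370 rounds down to 1300 and 1374 rounds up to 1400.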

Flags: needinfo?(tom)

As discussed with Tom via Matrix it would be good to have try runs for the first two settings. Verbose logging would be helpful, especially on the second one, and this should be doable via MOZ_LOG="nsResistFingerprinting:5".
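For reference, a local invocation following the same pattern as the earlier ./mach test call could look like this (the CI jobs would set the variable in the task environment instead):

MOZ_LOG="nsResistFingerprinting:5" ./mach test remote/cdp/test/browser/page/browser_loadEventFired.js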

(In reply to Tom Ritter [:tjr] from comment #14)

  1. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=false to test time precision with no clamping or jittering. It's nanosecond precision. If that test fails with this, it's almost certainly not related to any timer clamping code.

Try build: https://treeherder.mozilla.org/jobs?repo=try&revision=1948e6f46cf106b84d884b2fd2e53d1041ce2fa5

  2. Set privacy.reduceTimerPrecision=false and privacy.reduceTimerPrecision.unconditional=true to get 20 microseconds everywhere.

Try build: https://treeherder.mozilla.org/jobs?repo=try&revision=3a1004756ec5c5d1b3f74eb33f6ffea2caed08d1

Let's see what the results are.

Tom, both try builds show the same failure, so I assume it's not related to fingerprinting? Do you have another idea how to get further details? Would it make sense to log the native call when we get the timestamp? If so, where would that have to be done?

Flags: needinfo?(tom)

You could add fprintf(stderr, ...) logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything with those prefs set) to be safe...

After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...

Flags: needinfo?(tom)

PRMJ_Now(), which is the primary source of time data, is the same for X11 and Wayland:
https://searchfox.org/mozilla-central/rev/21001eee69767de5db95952154d49f01b72318ba/js/src/vm/Time.cpp#37

Thanks! I'll try that next.

For now I just pushed a try build that disables setting the CDP-specific preferences fission.bfcacheInParent=false and fission.webContentIsolationStrategy=0. Given that we do not see this problem in other Remote Protocol related test suites, this might be a factor as well.

https://treeherder.mozilla.org/jobs?repo=try&revision=83643f8d23a9a25804254b0433bc90a7c792ad8f

Blocks: 1861765
Blocks: 1861320
Blocks: 1858778
Blocks: 1863926
Blocks: 1858632

(In reply to Tom Ritter [:tjr] from comment #18)

You could add fprintf(stderr, ...) logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything with those prefs set) to be safe...

After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...

I was finally able to add the print statement close to the call to PRMJ_Now() in jsdate.cpp, right before the clamping code that I commented out. The log lines make it clear that this call returns wrong timestamps:

[task 2023-11-24T21:34:28.701Z] 21:34:28     INFO - GECKO(4367) | *** current datetime: 1700861668694.864990
[task 2023-11-24T21:34:26.419Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666418.251953
[task 2023-11-24T21:34:26.420Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666419.215088

The same timestamp difference can actually be seen as well in the mozlog timestamp. So I assume that it's not related to any code in Firefox / Gecko but really an issue with the platform / worker?

Martin, do you have any idea what to do next?

Andrew and Joel, are we aware of other already ported test suites that show a larger amount of test failures compared to the old 18.04 workers? Because of the time differences I could imagine all kinds of possible race conditions.

Flags: needinfo?(stransky)
Flags: needinfo?(jmaher)
Flags: needinfo?(ahal)
Blocks: 1858282
Blocks: 1858353

There's a breakdown of which suites have been enabled here:
https://docs.google.com/spreadsheets/d/173Y7PIldTomoDbGvtcc_YL-NSeUDjU1c_5d2oV3rhdg/edit#gid=0

There's been the usual slew of failures / intermittents that always accompany any new test platform, but I can't say it's seemed particularly bad or anything. What's wrong with the timestamps, is it that the first one is bigger?

I wonder if this is a case of the test harness buffering messages and not emitting the logs in the exact order it received them?

Flags: needinfo?(ahal)
Flags: needinfo?(jmaher)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #21)

(In reply to Tom Ritter [:tjr] from comment #18)

You could add fprintf(stderr, ...) logging over in jsdate and comment out the timestamp clamping code (which shouldn't be doing anything with those prefs set) to be safe...

After that, it's getting into the internals of setTimeout and timers and I'm not familiar with that...

I was finally able to add the print statement close to the call to PRMJ_Now() in jsdate.cpp, right before the clamping code that I commented out. The log lines make it clear that this call returns wrong timestamps:

[task 2023-11-24T21:34:28.701Z] 21:34:28     INFO - GECKO(4367) | *** current datetime: 1700861668694.864990
[task 2023-11-24T21:34:26.419Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666418.251953
[task 2023-11-24T21:34:26.420Z] 21:34:26     INFO - GECKO(4367) | *** current datetime: 1700861666419.215088

The same timestamp difference can actually be seen as well in the mozlog timestamp. So I assume that it's not related to any code in Firefox / Gecko but really an issue with the platform / worker?

Martin, do you have any idea what to do next?

I'd put the logging directly here:
https://searchfox.org/mozilla-central/rev/6eb2ebcafb1b4a8576eb513e6cd2c61e3f3ae6dc/js/src/vm/Time.cpp#45

That's the OS-level source of time data and it should be correct. Or at least save the latest PRMJ_Now() value and check that any new call returns a bigger value than the previous one? Something like:

int64_t PRMJ_Now() {
  ...
  // Remember the last value returned and assert that each new call returns a
  // strictly larger timestamp than the previous one.
  static int64_t latest_time = 0;
  int64_t recent_time = int64_t(tv.tv_sec) * PRMJ_USEC_PER_SEC + int64_t(tv.tv_usec);
  MOZ_RELEASE_ASSERT(latest_time < recent_time);
  latest_time = recent_time;
  return recent_time;
}
Flags: needinfo?(stransky)

That's a good idea. Let me try the assertion way. I've pushed a new try build here:
https://treeherder.mozilla.org/jobs?repo=try&revision=5254d5e6416a48f84ebf7b2f36499781da004924

Once the tests have run and failed I might request a one-time loaner and try to check the behavior on the command line as well.

And the assertion fails:
https://treeherder.mozilla.org/jobs?repo=try&revision=5254d5e6416a48f84ebf7b2f36499781da004924&selectedTaskRun=eqN1j9gaT3iM_sPTDZn_Vw.0

I tried to create an interactive task for the remote job on Wayland but that doesn't seem to be possible. Andrew, don't we support interactive tasks for Wayland at all yet?

Flags: needinfo?(ahal)

Looks like it's a known bug and depends on Linux kernel version: https://bugzilla.redhat.com/show_bug.cgi?id=244697
I guess it's a bug in Ubuntu 22.04 kernel.

Looks like we should use clock_gettime(CLOCK_MONOTONIC) instead of gettimeofday():
https://blog.habets.se/2010/09/gettimeofday-should-never-be-used-to-measure-time.html
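A minimal sketch of what such a replacement could look like (the helper name is an assumption, and this is not the actual patch). One caveat: CLOCK_MONOTONIC counts from an arbitrary, roughly boot-relative origin rather than the Unix epoch, which matters for an epoch-based API like Date.now().

#include <cstdint>
#include <ctime>

// Microseconds from a monotonic clock; never steps backwards, but the origin
// is arbitrary (roughly system boot), not 1970-01-01.
int64_t MonotonicNowMicroseconds() {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return int64_t(ts.tv_sec) * 1000000 + int64_t(ts.tv_nsec) / 1000;
}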

Flags: needinfo?(stransky)

Will look at it. We may need a different code path for systems without monotonic time.

Assignee: nobody → stransky

Great find, Martin! Nevertheless I still wonder why we do not see such a massive amount of test failures for the other jobs already running on Wayland. The list in Andrew's link is quite large.

I've pushed a new try build with my test and your patch applied:
https://treeherder.mozilla.org/jobs?repo=try&revision=917cb508facb7733e7d634af0592ec411cc44bdd

I wonder if such a test would be helpful as well to detect a problem like that.

Status: NEW → ASSIGNED
Component: CDP → JavaScript Engine
Flags: needinfo?(ahal)
Product: Remote Protocol → Core

Monotonic doesn't work here. Let's try CLOCK_TAI which is available on Linux 3.0+

try with CLOCK_TAI: https://treeherder.mozilla.org/jobs?repo=try&revision=861ef258f93155bbbc2f59b5c7a597b3d9f6cf79
It should use the same timebase as gettimeofday() but with guaranteed time value increment.

Flags: needinfo?(stransky)
Attachment #9366045 - Attachment description: Bug 1861933 [Linux] Use CLOCK_MONOTONIC in PRMJ_Now() r?tjr → Bug 1861933 [Linux] Use CLOCK_TAI in PRMJ_Now() r?tjr

Henrik, please try the latest patch (Use CLOCK_TAI in PRMJ_Now()) on Wayland / Ubuntu 22.04 as it's supposed to work there.
Thanks.

Flags: needinfo?(hskupin)
Flags: needinfo?(hskupin)

Martin, I have sad news. As my try build above shows, the problem is still present in CI, though maybe at a lower rate.

Martin, do you have any further ideas? Is there something that we might want to check on such a system directly via the command line? If that would be helpful I'm happy to find out how to get an interactive task.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #38)

Martin, do you have any further ideas? Is there something that we might want to check on such a system directly via the command line? If that would be helpful I'm happy to find out how to get an interactive task.

Hello, I did my own try here:

https://treeherder.mozilla.org/jobs?repo=try&revision=8399c77ac22b8816ca80a81efafb80c56bfc30bc&selectedTaskRun=TLBRLNWZQ0CfU9yjt3CZRw.0

and it looks quite good. CLOCK_TAI generally works as a replacement for gettimeofday() and uses the same timebase, so it's compatible with it. So I believe it's the way forward.

But CLOCK_TAI is guaranteed to be atomic and monotonic (https://man7.org/linux/man-pages/man3/clock_gettime.3.html). If we see a different behavior here it's definitely a serious bug somewhere.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #39)

Hello, I did my own try here:

https://treeherder.mozilla.org/jobs?repo=try&revision=8399c77ac22b8816ca80a81efafb80c56bfc30bc&selectedTaskRun=TLBRLNWZQ0CfU9yjt3CZRw.0

This is using the actual remote tests, which hit this problem only intermittently and do not fail often enough that it can easily be caught. As such I've written a patch that I applied on top of the changes so far; you should probably do the same for your try push.

Should we also have a very basic xpcshell test or similar unit test to verify the behavior? I don't think a browser-chrome test is appropriate for that.

Okay, I'll do debugging with your patch.

I wonder if the problem happens when clock_gettime() returns the same time twice but we fuzz the second one into the future. Is that possible?

Attachment #9366045 - Attachment description: Bug 1861933 [Linux] Use CLOCK_TAI in PRMJ_Now() r?tjr → Bug 1861933 [Linux] Use CLOCK_TAI in PRMJ_Now() r?jld

Also attached the test patch. It makes sure the time is monotonic.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #43)

I wonder if the problem happens when clock_gettime() returns the same time twice but we fuzz the second one into the future. Is that possible?

Martin, that's an interesting idea! I triggered some more tests and they still fail! Given that you have the assertion in place (right?), it would mean that we indeed modify the timestamp somewhere.

Blocks: 1861767
Attachment #9366509 - Attachment is obsolete: true

Note that I triggered some more tests and the assertion as added for PRMJ_Now() still fails intermittently.

Yes. Let's look at: https://treeherder.mozilla.org/jobs?repo=try&revision=ee8ef8c560c4911b72f58a75e1d8756260a813fa&selectedTaskRun=O7EGGBcMR02JLTC7CI2BZQ.0

It fails because we build it on an old system (Ubuntu 18.04?) without CLOCK_TAI, so gettimeofday() is used. But we then run it on Ubuntu 22.04. Looks like we should use a run-time check.

Oh, that is a good finding and would also explain why you haven't seen it locally given that you might run on the same kernel, right?

(In reply to Martin Stránský [:stransky] (ni? me) from comment #52)

It fails because we build it on an old system (Ubuntu 18.04?) without CLOCK_TAI, so gettimeofday() is used. But we then run it on Ubuntu 22.04. Looks like we should use a run-time check.

I was going to mention that during review, but I checked our build sysroot (see bug 1690930 and related) and it defines CLOCK_TAI in <linux/time.h>:

/home/jld/.mozbuild/sysroot-x86_64-linux-gnu/usr/include/linux/time.h
58:#define CLOCK_TAI                    11

…so I thought it wasn't a concern for Mozilla's builds. But, what I missed is that <time.h> includes <bits/time.h>, which is from glibc, and it looks like that doesn't have CLOCK_TAI. (Modern versions do, but not the old one in the sysroot.) Also we probably can't just include <linux/time.h> because it will conflict with <time.h>.

So, it looks like we'll need to define it locally like we do for a few other things (and a few more we inherited from Chromium, and also these in IPC, etc.).

I guess we need to implement run-time fallback to gettimeofday() for BSD/Sun and similar systems. CLOCK_TAI may be Linux only.
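A hedged sketch of how that run-time check, a local CLOCK_TAI definition, and a gettimeofday() fallback could fit together (an illustration, not the actual D195766 patch; the helper name is an assumption):

#include <cstdint>
#include <ctime>
#include <sys/time.h>

#if defined(__linux__) && !defined(CLOCK_TAI)
#  define CLOCK_TAI 11  // value from <linux/time.h>, as quoted above
#endif

static int64_t NowMicroseconds() {
#ifdef CLOCK_TAI
  // Probe once: older kernels (and non-Linux systems) reject CLOCK_TAI.
  static const bool useTai = [] {
    struct timespec probe;
    return clock_gettime(CLOCK_TAI, &probe) == 0;
  }();
  if (useTai) {
    struct timespec ts;
    clock_gettime(CLOCK_TAI, &ts);
    return int64_t(ts.tv_sec) * 1000000 + int64_t(ts.tv_nsec) / 1000;
  }
#endif
  // Fallback for kernels/systems without CLOCK_TAI.
  struct timeval tv;
  gettimeofday(&tv, nullptr);
  return int64_t(tv.tv_sec) * 1000000 + int64_t(tv.tv_usec);
}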

Did another try with various debug infos:
https://treeherder.mozilla.org/jobs?repo=try&revision=2d1f3763238ec9cd77517adce3b043b34bacb800
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf1f36d3f1e4599da067a7e9fba38095ee35fe20

So far it looks like CLOCK_TAI is broken too, which is quite strange. It uses https://phabricator.services.mozilla.com/D195766 with the run-time check, and CLOCK_TAI is used according to the log.

Here's the latest try: https://treeherder.mozilla.org/logviewer?job_id=439348214&repo=try&lineNumber=3076

[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** last_ts.tv_sec = 1702032429 last_ts.tv_nsec = 37248922
[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** ts.tv_sec = 1702032428 ts.tv_nsec = 406780504
[task 2023-12-08T10:47:08.410Z] 10:47:08 INFO - GECKO(3776) | ******** lastTime = 1702032429037248 recentTime = 1702032428406780
[task 2023-12-08T10:47:08.411Z] 10:47:08 INFO - GECKO(3776) | Assertion failure: lastTime <= recentTime (clock_gettime() returned time from past?), at /builds/worker/checkouts/gecko/js/src/vm/Time.cpp:71

So it really looks like a bug in the system itself.

Flags: needinfo?(stransky)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)

So it really looks like a bug in the system itself.

Does that mean we do not need any change / addition to Firefox, but instead need folks to check what's wrong with the docker image that we are using for these tests?

Flags: needinfo?(stransky)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #61)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #60)

So it really looks like a bug in the system itself.

Does that mean we do not need any change / addition to Firefox, but instead need folks to check what's wrong with the docker image that we are using for these tests?

IMHO the patches here are correct, should work, and may be better than the original gettimeofday(). But the real problem is the failing kernel call: CLOCK_TAI is guaranteed to be monotonic but it isn't (unless we are dealing with a multi-thread scenario here; is PRMJ_Now() supposed to run from various threads?).

Flags: needinfo?(stransky)

(In reply to Martin Stránský [:stransky] (ni? me) from comment #62)

IMHO the patches here are correct, should work, and may be better than the original gettimeofday(). But the real problem is the failing kernel call: CLOCK_TAI is guaranteed to be monotonic but it isn't (unless we are dealing with a multi-thread scenario here; is PRMJ_Now() supposed to run from various threads?).

Jed, do you have an answer to this question, or know whom we can ask here? If that's not the reason, we should find someone who can help us fix the docker image that we use, which is most likely the culprit. That would at least also explain why normal users might not have this problem.

Flags: needinfo?(jld)

I don't know the JS engine very well, but JS is multithreaded (Web Workers), so there could be problems with non-atomic shared state. However, comment #59 reports a ~600ms (not µs) back-step if I'm reading it right, which is a larger order of magnitude than I'd expect for a simple data race.

However — and this is something I should have noticed earlier — it looks like this is the backend function for JS Date.now(), in which case the spec and the linked definition of “time value” implies that using TAI instead of UTC is incorrect, but it doesn't explicitly state that time values can't decrease if the system clock is changed. Maybe it's the test that's incorrect.

And, now that I look at Linux's clock documentation more closely, the description of CLOCK_TAI simply says that it doesn't experience discontinuities or backsteps caused by leap seconds (which occur at most twice a year, haven't occurred since 2016, and so far have all been positive jumps). Further, the text “nonsettable [and] derived from wall-clock time but ignoring leap seconds” kind of suggests that it's really just CLOCK_REALTIME but with (currently) 37 seconds added. So that would explain why it's not helping.

As for what is happening: I don't know exactly how timekeeping works on this system, but the ntpd(8) man page mentions that it will slew the clock (make it run slower or faster but still increase monotonically; see adjtime(2)) if the adjustment needed is less than 128ms, but will step the clock (see clock_settime(2)) otherwise. So, if the new containers are running on something with an unusually erratic clock (maybe a VM?), that could explain things.

Note in particular that, if I understand the time_namespaces(7) man page correctly, CLOCK_REALTIME is always systemwide, so this would be due to a timekeeping daemon running outside of the container.

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #64)

leap seconds (which occur at most twice a year, haven't occurred since 2016, and so far have all been positive jumps).

Slight mistake here: positive leap seconds are extra seconds that actually happen, i.e. there's a minute with 61 seconds and so on, so it's a point when TAI advances and UTC doesn't, so that is where UTC steps backwards if I understand correctly. (Also, “occur at most twice a year” is historically true but wasn't a rule.) Nonetheless, there hasn't been one since the end of 2016.

Can I assume that it is still not clear where this problem originates? I don't think that the test is wrong, given that it works fine everywhere else, including on the old Ubuntu 18.04 test docker image. It's only failing on the new one.

What are the next steps that we should take? Is there anything else that we would have to patch in our date implementation, or shall we request more information about the docker image in use and maybe get it upgraded, in case its kernel version has a bug that has already been fixed?

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #66)

What are the next steps that we should take? Is there anything else that we would have to patch in our date implementation, or shall we request more information about the docker image in use and maybe get it upgraded, in case its kernel version has a bug that has already been fixed?

Looks like an external problem with the VM. I'll try the tests on a local VM with Ubuntu 22.04.

I tested an unpatched Nightly on an Ubuntu 22.04 VM but I can't reproduce it (with the test patch https://hg.mozilla.org/try/rev/807ac30048530acbb8d2aa33da10feb5343b0a83). So it really looks like a problem with the Mozilla VM. Maybe the time is fiddled with there, as both gettimeofday() and CLOCK_TAI/CLOCK_REALTIME are broken.

For instance, does the Mozilla box use NTP? In that case the system time may be updated during the test.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #69)

For instance, does the Mozilla box use NTP? In that case the system time may be updated during the test.

But that shouldn't run multiple times in a row, right? When it fails, it quite often fails several times within a single test job, while other jobs are not affected at all.

Joel, who actually takes care of the Linux docker images with Wayland? Looks like we need some help. Thanks.

Flags: needinfo?(jmaher)

keep in mind this is a VM, not docker, and right now :aerickson is the mastermind behind it.

Flags: needinfo?(jmaher) → needinfo?(aerickson)

The workers are VirtualBox VMs inside Google GCP VMs (we have work planned to replace them with a bare VM now that we've fixed the GCP kernel issue). It's not too surprising that there is system clock funkiness.

I thought we were just greening up functional tests, not perf/timing tests on these workers.

Flags: needinfo?(aerickson)

(In reply to Andrew Erickson [:aerickson] from comment #72)

The workers are VirtualBox VMs inside Google GCP VMs (we have work planned to replace them with a bare VM now that we've fixed the GCP kernel issue). It's not too surprising that there is system clock funkiness.

Do you have any ETA yet for how long this will take? If it's a quarter, then we should definitely revert the remote tests to run with the old Ubuntu 18.04 test image. The number of failures is just too high and we would like to see this gone. See our meta bug 1859239 and its dependencies.

I thought we were just greening up functional tests, not perf/timing tests on these workers.

These tests are functional tests but they also check the time and order of events as sent out by our CDP (Chrome DevTools Protocol) implementation. Right now we can see timing issues, mixed orders of events, events not being sent at all (most likely setTimeout etc doesn't work correctly), and other issues that I haven't investigated yet because I feel that this bug is the underlying issue.

Flags: needinfo?(aerickson)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #73)

(In reply to Andrew Erickson [:aerickson] from comment #72)
Do you have any ETA yet for how long this will take? If it's a quarter, then we should definitely revert the remote tests to run with the old Ubuntu 18.04 test image. The number of failures is just too high and we would like to see this gone. See our meta bug 1859239 and its dependencies.

I'd like to have a Wayland bare VM pool up by mid-January... so a month or so.

Here are the tickets to track:

bare vm wayland image: https://mozilla-hub.atlassian.net/browse/RELOPS-793
d2g image: https://mozilla-hub.atlassian.net/browse/RELOPS-528 (same image will be used for 2204 headless, https://mozilla-hub.atlassian.net/browse/RELOPS-692)

Flags: needinfo?(aerickson)
See Also: → 1871279

Thanks, aerickson! That's still a fair amount of time, but maybe we can wait.

Ahal, what do you think? Would it cause any trouble to revert the Taskcluster changes for remote mochitests to use the old 18.04 docker image? Then, once we have the new bare VM, we could run try builds to prove that the timestamp issues are no longer present. If the bare VM cannot solve it, we would keep this top failure for our tests, which is not satisfying.

Flags: needinfo?(ahal)

In case Ahal is already out, maybe Jmaher could still help with that decision? As it turns out, this is really a regression from enabling remote tests on Wayland (bug 1857319).

Flags: needinfo?(jmaher)
Keywords: regression
Regressed by: 1857319

The goal was to get as much automation as possible turned on for Wayland, but there was no hard requirement that specific suites be enabled. Everything is still running on 18.04. I would vote for disabling these on 22.04 Wayland for now and then looking into turning them back on when we have the new VM image.

Flags: needinfo?(jmaher)

Yeah we can turn it off if we need to, but further regressions could slip in in the meantime, meaning you'd be causing yourself more work in the long run.

Is it possible to disable a subset of the tests that fail most frequently to get this down to an acceptable intermittency rate instead?

Flags: needinfo?(ahal)

Thanks. I think we will take that risk to have clean tests again. Personally I think it should help us later when switching to the updated test workers, and we can ensure that no new test failures start to happen.

It looks like we only have to disable the CDP tests under /remote/cdp/test/browser/. How do we do that for the manifests with tags = "cdp"?

Flags: needinfo?(ahal)

You can just add a skip-if to the [DEFAULT] section of the manifest. E.g.:
https://searchfox.org/mozilla-central/source/accessible/tests/browser/atk/browser.toml#3

You can skip on the condition display == "wayland" in this case.
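For illustration, a hedged sketch of what that could look like in one of the CDP manifests (the section contents and the comment are placeholders, not the actual change):

[DEFAULT]
tags = "cdp"
skip-if = ["display == 'wayland'"]  # Bug 1861933 - system clock jumps backwards on the Wayland workers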

Flags: needinfo?(ahal)
Depends on: 1871611

All tests for CDP will be skipped soon via bug 1871611.

We have now turned off all remote tests for CDP on Wayland, which should make this error go away. Based on the feedback from Martin and Andrew, this issue is caused by the current worker setup and should hopefully be fixed by late January with a replacement worker.

This means that this bug can be marked as invalid given that the problem is not a fault on our side. Thanks for all of your time and feedback Martin!

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → INVALID
Blocks: 1871279
Attachment #9366045 - Attachment is obsolete: true
Attachment #9367127 - Attachment is obsolete: true
Attachment #9367202 - Attachment is obsolete: true
Attachment #9367443 - Attachment is obsolete: true