Closed Bug 1624868 Opened 5 years ago Closed 4 years ago

Perma linux ccov TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output [browser/base/content/test/performance/io]

Categories: Firefox :: General, defect, P1

Tracking: RESOLVED FIXED
Tracking Status: firefox78 --- disabled

People: Reporter: gbrown, Assigned: aryx

References: Blocks 1 open bug

Keywords: intermittent-failure, test-disabled

Attachments: 1 file, 1 obsolete file

See https://bugzilla.mozilla.org/show_bug.cgi?id=1414495#c376 through https://bugzilla.mozilla.org/show_bug.cgi?id=1414495#c393: on linux/ccov, browser-chrome mochitests in browser/base/content/test/performance/io perma-fail with a 370-second timeout because browser startup exceeds the 180 seconds that marionette waits for startup.

browser/base/content/test/performance/io/browser.ini defines extra prefs and environment variables, and we expect startup to be slow, more so on ccov.

Assignee: nobody → gbrown
Priority: -- → P1

(In reply to Geoff Brown [:gbrown] (reduced availability) from comment #0)

browser/base/content/test/performance/io/browser.ini defines extra prefs and environment variables, and we expect startup to be slow, more so on ccov.

I think we should add another pref to disable taking screenshots in startupRecorder for that folder.

I was going to simply skip the directory on ccov, as we have for other slow platforms:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f6b37a89430030896a6ce95c982961615fde6e8
https://hg.mozilla.org/try/rev/478bf2538fb481a20e30b699ccd1b4e118a87bfc#l1.8

but take the bug if you want another approach.

Skip the entire performance/io/browser.ini manifest on ccov (as on other slower platforms)
to avoid perma-fail.

(In reply to Geoff Brown [:gbrown] (reduced availability) from comment #2)

I was going to simply skip the directory on ccov, as we have for other slow platforms:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f6b37a89430030896a6ce95c982961615fde6e8
https://hg.mozilla.org/try/rev/478bf2538fb481a20e30b699ccd1b4e118a87bfc#l1.8

but take the bug if you want another approach.

I'll first get a profile from try of a run where the test doesn't fail, to see where the time is spent. If that confirms my impression that the screenshots are what's taking the time, I'll disable only the screenshots, and that should save time on all platforms.

From looking more at the failure reports in bug 1414495 comment 382 and bug 1414495 comment 394, it seems this is (almost) a perma-fail on both linux1804-64-ccov and windows10-64 asan.

Attachment #9135770 - Attachment description: Bug 1624868 - Skip browser/base/content/test/performance/io tests on ccov; r= → Bug 1624868 - Skip browser/base/content/test/performance/io tests on ccov and windows asan; r=

After several attempts using the try server, I finally managed to get useful profiles of the problem on Linux.

The patch at bug 1414495 comment 391 to increase the timeout helped me a lot to extract information out of runs that would normally fail.

Here's a profile of a Linux ccov run (https://bit.ly/2WJEF2U). We are blocked for 116s when making the first window visible, in mozilla::a11y::ShouldA11yBeEnabled(), which calls dbus_connection_send_preallocated, a call that blocks.

It turns out Linux opt runs also block on this, but for a shorter amount of time. Here's a profile of a Linux opt run (https://bit.ly/2Upcwg9). We are blocked for "only" 59s there, and that's not enough to cause timeouts.

Geoff, do you have a sense of whether this is likely a configuration issue on our Linux test machines, or if this is a Firefox bug in our Linux accessibility code?
I ran these builds locally on a thinkpad running Ubuntu 18.04 and didn't encounter these delays when profiling startup there.

Flags: needinfo?(gbrown)

Sorry, no idea.

Flags: needinfo?(gbrown)

(In reply to Florian Quèze [:florian] from comment #6)

Here's a profile of a Linux ccov run (https://bit.ly/2WJEF2U). We are blocked for 116s when making the first window visible, in mozilla::a11y::ShouldA11yBeEnabled(), which calls dbus_connection_send_preallocated, a call that blocks.

It turns out Linux opt runs also block on this, but for a shorter amount of time. Here's a profile of a Linux opt run (https://bit.ly/2Upcwg9). We are blocked for "only" 59s there, and that's not enough to cause timeouts.

(In reply to Florian Quèze [:florian] from comment #7)

Geoff, do you have a sense of whether this is likely a configuration issue on our Linux test machines, or if this is a Firefox bug in our Linux accessibility code?
I ran these builds locally on a thinkpad running Ubuntu 18.04 and didn't encounter these delays when profiling startup there.

Geoff didn't know. Jamie, do you know what this dbus call is doing and why it'd be slow (or who might know)?

Flags: needinfo?(jteh)

For the Windows asan failures, even with the longer timeouts we still get a timeout without getting a profile: https://treeherder.mozilla.org/#/jobs?repo=try&author=florian%40queze.net&selectedJob=294958636

I'm not super familiar with DBUS internals. DBUS is what Linux a11y uses to communicate. ShouldA11yBeEnabled is trying to ask (via DBUS) whether any a11y tools are currently enabled on the system. If there are any, Gecko a11y gets enabled.

This page notes that at-spi-bus-launcher and at-spi2-registryd need to be running. This should normally be done by the desktop environment's session manager. Are these running on those machines? I'm not sure if that would cause this call to block - I just don't know enough about it - but it seems like something worth checking.

Failing that, as a last resort, I guess we could disable a11y for this set of tests. You can do this by setting the pref accessibility.force_disabled to 1 or setting the environment variable GNOME_ACCESSIBILITY to 0.
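
As a rough illustration of the kind of session-bus query involved, here is a hypothetical Python sketch using dbus-python. The org.a11y.Bus service, /org/a11y/bus object path, and org.a11y.Status.IsEnabled property come from at-spi2, not from the Gecko code in question, so the exact query ShouldA11yBeEnabled performs may differ; this is only meant to show where a missing or unresponsive bus would make a caller block.

    import dbus  # dbus-python

    def a11y_enabled_on_session_bus(timeout_s=5):
        # Connecting to the session bus (or the property Get below) is roughly
        # where an unresponsive bus would make a caller hang.
        bus = dbus.SessionBus()
        obj = bus.get_object("org.a11y.Bus", "/org/a11y/bus")
        props = dbus.Interface(obj, dbus_interface="org.freedesktop.DBus.Properties")
        # Ask whether any a11y tools are enabled (at-spi2's org.a11y.Status).
        return bool(props.Get("org.a11y.Status", "IsEnabled", timeout=timeout_s))

    if __name__ == "__main__":
        print("a11y enabled:", a11y_enabled_on_session_bus())

A call of this shape hanging for minutes would match the 116s block seen in the profile from comment 6.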

Flags: needinfo?(jteh)

:egao - I thought comment 11 might be of interest (another interaction between tests and OS for you to consider). Do you know much about dbus on Ubuntu 18.04? Any idea about at-spi-bus-launcher and at-spi2-registryd on 18.04?

Flags: needinfo?(egao)

Florian, any idea why we haven't seen those extremely long startup times with a Talos test yet? I mean, we run Talos on the Ubuntu 18.04 platform for various scenarios and report to Perfherder, right?

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)

Florian, any idea why we haven't seen those extremely long startup times with a Talos test yet? I mean, we run Talos on the Ubuntu 18.04 platform for various scenarios and report to Perfherder, right?

I think there's probably some interaction between the startup recording / profiling etc. and the dbus thing - but it's not clear what it is...

The weird thing is that on the linux ccov runs, we now run the perf/io/ tests after running accessible/tests/browser/e10s/browser.ini, so if anything I'd expect any caching effects to help, not hurt.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)

Florian, any idea why we haven't seen those extremely long startup times with a Talos test yet? I mean, we run Talos on the Ubuntu 18.04 platform for various scenarios and report to Perfherder, right?

Possible ideas: Talos machines may be configured slightly differently, or Talos startup tests ignoring the first startup could hide the problem.

Flags: needinfo?(florian)

(In reply to Florian Quèze [:florian] from comment #10)

For the Windows asan failures, even with the longer timeouts we still get a timeout without getting a profile: https://treeherder.mozilla.org/#/jobs?repo=try&author=florian%40queze.net&selectedJob=294958636

Looking at the history of failures (https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-28&endday=2020-03-27&tree=all&bug=1414495), the first Windows asan failure happened on March 13, 2020.

Gerald, can you think of any change to the profiler that landed in the last few weeks that would make startup with startup profiling (including mainthread IO) more likely to deadlock on Windows asan? It's hard to bisect to find when this started, as in my recent try runs we encountered the failure only about once every 11 runs.

Flags: needinfo?(gsquelart)

(In reply to Florian Quèze [:florian] from comment #15)

Possible ideas: Talos machines may be configured slightly differently, or Talos startup tests ignoring the first startup could hide the problem.

Got the information from Joel this morning that Talos actually still runs on Ubuntu 16.04. Updating those machines to 18.04 might happen late in 2020, or more likely in 2021. So that seems to be the reason.

Yes, talos and raptor (and any other tests that run on hardware) are still running on Ubuntu 16.04, as upgrading the physical OS isn't as easy as setting up a docker image.

As for dbus: I recall I had difficulty setting up an environment that worked for tests. Some of that was due to dbus needing to be running while the docker containers we run are not privileged. I had to work around that with dbus.sh, calling that script from the Dockerfile.

The 1804 image, while good enough to run most of the tests, still has some issues that, I will admit, are over my head. Specifically, there appear to be important differences between 1604 and 1804 that are causing some really hard-to-debug issues, this bug being one and bug 1607713 being another.

I suggest that we loop in a Linux and/or Docker expert to take a look at this, because I've spent quite some time on bug 1607713 trying to figure out the issues, to no avail, frustrating as this is.

Flags: needinfo?(egao)

Karl, is there anything you can help us with? See comment 6, comment 11, and comment 18 for details. Thanks!

Flags: needinfo?(karlt)
See Also: → 1540367

IIUC comment 2 confirms that browser/base/content/test/performance/io/browser.ini tests are part of what is going wrong on Linux.
Whether finer disabling makes a difference might provide further clues.

Putting GNOME_ACCESSIBILITY=0 in the environment would confirm that the a11y dbus call is part of the problem on Linux (distinguishing the possibility that it just happens to be where this shows up first).

The dbus session message bus, used by a11y, is separate from the system bus provided by dbus.sh.

If there is no session bus on first attempt to connect, then a new dbus process is spawned. This is one dbus path that uses poll() (identified in comment 6). Could something in performance/io affect child processes?
Or file descriptor behavior?

Flags: needinfo?(karlt)
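
Along the lines of comment 20, a minimal diagnostic sketch (Python, assuming dbus-python is installed in the test image, which may not be the case) that reports whether the test environment already points at a session bus and whether that bus is reachable:

    import os
    import dbus  # dbus-python

    # If this is unset, libdbus would try to autolaunch a new session bus.
    addr = os.environ.get("DBUS_SESSION_BUS_ADDRESS")
    print("DBUS_SESSION_BUS_ADDRESS:", addr or "<not set>")

    try:
        dbus.SessionBus()  # connects to (or autolaunches) the session bus
        print("session bus reachable")
    except dbus.exceptions.DBusException as e:
        print("could not connect to a session bus:", e)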

(In reply to Karl Tomlinson (:karlt) from comment #20)

IIUC comment 2 confirms that browser/base/content/test/performance/io/browser.ini tests are part of what is going wrong on Linux.
Whether finer disabling makes a difference might provide further clues.

https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown%40mozilla.com&tochange=a4bd5238ebb2d4bd2814f7807ebea91099dffb33&fromchange=d934d6f673c5de7d25104e1b6578347a8cd48f87&test_paths=performance%2Fio

It appears that frequent failures continue if any one test is skipped alone.

(In reply to Karl Tomlinson (:karlt) from comment #20)

IIUC comment 2 confirms that browser/base/content/test/performance/io/browser.ini tests are part of what is going wrong on Linux.
Whether finer disabling makes a difference might provide further clues.

Putting GNOME_ACCESSIBILITY=0 in the environment would confirm that the a11y dbus call is part of the problem on Linux (distinguishing the possibility that it just happens to be where this shows up first).

Putting GNOME_ACCESSIBILITY=0 in the environment 'fixes' the problem. Startup takes 25s on a ccov build and 5s on the opt build. (My try run is at https://treeherder.mozilla.org/#/jobs?repo=try&revision=697dd3b291f404e9f95c1671fd9c42d20a13c1fd if you would like to look at the profiles yourself).

Could something in performance/io affect child processes?
Or file descriptor behavior?

For this folder we take a profile of startup, using the environment variables at: https://searchfox.org/mozilla-central/rev/064b0f9501ad76802853b43f18e33d8713fd54d3/browser/base/content/test/performance/io/browser.ini#18
This means the IO interposer is active to give us information about which I/O is happening. AFAIK on Linux this only means we interpose IO calls at the NSPR level, so I wouldn't expect this to affect system libraries.

Any idea about what the next step here could be?

Flags: needinfo?(karlt)
Assignee: gbrown → nobody

Looking at the profile from the try run in comment 22 with GNOME_ACCESSIBILITY=0, I don't see any other dbus calls, so that doesn't actually distinguish whether it is the a11y dbus call or dbus itself.

Comment 22 does confirm that setting GNOME_ACCESSIBILITY=0 in browser.ini would be an effective workaround for Linux.

An alternative may be to provide a dbus-daemon --session process in the test system, as would usually exist in users' desktop sessions. That would require setting DBUS_SESSION_BUS_ADDRESS to match. I'm not familiar with the exact procedure required, sorry. GNOME_ACCESSIBILITY=0 would be a simpler option, at least until we have dbus-daemon --session.
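
A hedged sketch of what providing that bus might look like from the harness side, in Python; the dbus-daemon flags (--session, --print-address, --nofork) are standard, but how this would actually be wired into the test images is an open question here.

    import os
    import subprocess

    # Start a private session bus; --print-address makes dbus-daemon write the
    # bus address to stdout so we can hand it to the browser.
    daemon = subprocess.Popen(
        ["dbus-daemon", "--session", "--print-address", "--nofork"],
        stdout=subprocess.PIPE,
        text=True,
    )
    address = daemon.stdout.readline().strip()

    # Point the browser (and the a11y code) at that bus.
    test_env = dict(os.environ, DBUS_SESSION_BUS_ADDRESS=address)
    # ... launch the tests with test_env, and terminate `daemon` afterwards.

(Comment 26 below notes that DBUS_SESSION_BUS_ADDRESS is in fact already set in the logs, so a session bus was already present on these machines.)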

Looking at IOInterposer::Init(), I also see only NSPR-level interposing, and I don't know how that would affect child processes.

However, the a11y dbus call is performed even for other tests. If similar problems are not showing up elsewhere, then I guess that indicates some interaction between startup recording or the IO interposer and a11y or dbus. If the tests can run with GNOME_ACCESSIBILITY=0, then the priority of analysing that interaction and removing the workaround can be addressed separately.

MOZ_PROFILER_STARTUP_FEATURES=js,mainthreadio,ipcmessages looks like it has three different features. I guess we're assuming the mainthreadio feature is the relevant one, but I don't know whether that has been confirmed.

Flags: needinfo?(karlt)

(In reply to Karl Tomlinson (:karlt) from comment #23)

MOZ_PROFILER_STARTUP_FEATURES=js,mainthreadio,ipcmessages looks like it has three different features. I guess we're assuming the mainthreadio feature is the relevant one, but I don't know whether that has been confirmed.

We looked into the mainthreadio feature because you asked if something could affect file descriptor behavior.

I just did a try push with only the js and stackwalk features: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=296388852&revision=73b2fa3e939a0fb72f180228138a15935f0ce9c5 (obviously the tests fail as they don't have the data they need, but I got a profile anyway).
Here's the resulting startup profile, where startup is blocked for 61700ms on dbus: https://perfht.ml/34cG6Zv
The initial profiles here were taken without the stackwalk feature (which samples native code), and the js feature (which samples only JS code) doesn't play a role in this part of the code.

I tried looking at the startup times of other Linux 18.04 bc jobs, and they seem to typically be around 5s. So maybe the super slow startup only happens for the performance/io folder where we use startup profiling. If that's the case, I think we should just add GNOME_ACCESSIBILITY=0 in the environment and move on. But not understanding what's actually going on is frustrating.
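
For reproducing this locally, a minimal sketch of launching a build with the proposed workaround plus startup profiling; GNOME_ACCESSIBILITY and MOZ_PROFILER_STARTUP_FEATURES are the variables discussed in this bug, while the binary/profile paths and MOZ_PROFILER_STARTUP=1 are illustrative assumptions rather than the exact contents of browser.ini.

    import os
    import subprocess

    env = dict(
        os.environ,
        GNOME_ACCESSIBILITY="0",   # the workaround under discussion
        MOZ_PROFILER_STARTUP="1",  # record a profile from startup
        MOZ_PROFILER_STARTUP_FEATURES="js,mainthreadio,ipcmessages",
    )

    # Hypothetical local paths; point these at your own build and a scratch profile.
    subprocess.run(
        ["./obj-ccov/dist/bin/firefox", "-no-remote", "-profile", "/tmp/io-test-profile"],
        env=env,
        check=False,
    )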

Thank you for doing that analysis. It is helpful to at least narrow down somewhat the factors involved.

I see from the logs now that DBUS_SESSION_BUS_ADDRESS is already set, implying that dbus-daemon --session is already running. That means the a11y dbus call should not be launching a new process, and I was jumping to conclusions in suspecting it was.

Updating the bug summary and keyword to make it an intermittent bug which can be classified by sheriffs.

Summary: Perma-fail application timed out after 370 seconds with no output on Linux/ccov on browser/base/content/test/performance/io → Perma linux ccov TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output [browser/base/content/test/performance/io]
Attachment #9135770 - Attachment is obsolete: true

(In reply to Karl Tomlinson (:karlt) from comment #26)

Thank you for doing that analysis. It is helpful to at least narrow down somewhat the factors involved.

I see from the logs now that DBUS_SESSION_BUS_ADDRESS is already set, implying that dbus-daemon --session is already running. That means the a11y dbus call should not be launching a new process, and I was jumping to conclusions in suspecting it was.

Do we have an idea of what the next step would be here to understand what's going on? If we don't make progress I think we'll need to add GNOME_ACCESSIBILITY=0 in the environment.

Flags: needinfo?(karlt)

Setting GNOME_ACCESSIBILITY=0 for this test seems to me a reasonable thing to do.

I can only guess at what might be helpful to find the core issue. I know little about this test or what startup recording does, but perhaps whatever these do can be further reduced to identify a trigger.

Flags: needinfo?(karlt)
Assignee: nobody → aryx.bugmail
Status: NEW → ASSIGNED
Attachment #9146152 - Attachment description: Bug 1624868 - Disable browser/base/content/test/performance/io/ on Linux ccov for almost failing permanently. r=florian → Bug 1624868 - Disable browser/base/content/test/performance/io/ on Linux ccov for almost failing permanently. r=florian DONTBUILD
Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/0a92f3dc6085 Disable browser/base/content/test/performance/io/ on Linux ccov for almost failing permanently. r=florian DONTBUILD
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 78
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Target Milestone: Firefox 78 → ---
Assignee: aryx.bugmail → nobody
Status: REOPENED → NEW
Keywords: test-disabled

The commit message in comment 34 was incorrect: the patch (after addressing my review comment) didn't disable the test, but only disabled GNOME_ACCESSIBILITY, which was interfering with the startup profiling. I would say this bug is fixed. The intermittent failures reported in comments 42 and 43 are unrelated.

Status: NEW → RESOLVED
Closed: 5 years ago4 years ago
Flags: needinfo?(gsquelart)
Resolution: --- → FIXED
Assignee: nobody → aryx.bugmail