Closed Bug 1624868 Opened 4 years ago Closed 4 years ago

Perma linux ccov TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output [browser/base/content/test/performance/io]

Categories

(Firefox :: General, defect, P1)

Tracking

RESOLVED FIXED
Tracking Status
firefox78 --- disabled

People

(Reporter: gbrown, Assigned: aryx)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, test-disabled)

Attachments

(1 file, 1 obsolete file)

See https://bugzilla.mozilla.org/show_bug.cgi?id=1414495#c376 through https://bugzilla.mozilla.org/show_bug.cgi?id=1414495#c393: on linux/ccov, browser-chrome mochitests in browser/base/content/test/performance/io perma-fail with a 370-second timeout because browser startup exceeds the 180 seconds that Marionette allows for it.

browser/base/content/test/performance/io/browser.ini defines extra prefs and environment variables, and we expect startup to be slow there, more so on ccov.

Assignee: nobody → gbrown
Priority: -- → P1

(In reply to Geoff Brown [:gbrown] (reduced availability) from comment #0)

browser/base/content/test/performance/io/browser.ini defines extra prefs and environment variables, and we expect startup to be slow there, more so on ccov.

I think we should add another pref to disable taking screenshots in startupRecorder for that folder.

I was going to simply skip the directory on ccov, as we have for other slow platforms:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f6b37a89430030896a6ce95c982961615fde6e8
https://hg.mozilla.org/try/rev/478bf2538fb481a20e30b699ccd1b4e118a87bfc#l1.8

but take the bug if you want another approach.
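A rough sketch of what such a skip annotation might look like in the manifest (the condition name "ccov" and the comment are assumptions made for illustration; see the try push above for the actual change):

    [DEFAULT]
    # Hypothetical skip annotation for coverage builds; the real condition string may differ.
    skip-if = ccov # Bug 1624868 - startup exceeds the Marionette startup timeout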

Skip the entire performance/io/browser.ini manifest on ccov (as on other slower platforms)
to avoid perma-fail.

(In reply to Geoff Brown [:gbrown] (reduced availability) from comment #2)

I was going to simply skip the directory on ccov, as we have for other slow platforms:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f6b37a89430030896a6ce95c982961615fde6e8
https://hg.mozilla.org/try/rev/478bf2538fb481a20e30b699ccd1b4e118a87bfc#l1.8

but take the bug if you want another approach.

I'll first get a profile from try of a run where the test doesn't fail, to see where the time is spent. If that confirms my impression that the screenshots are what's taking the time, I'll disable only the screenshots, and that should save time on all platforms.

From looking more at the failure reports in bug 1414495 comment 382 and bug 1414495 comment 394, it seems this is an (almost) perma-fail on both linux1804-64-ccov and windows10-64 asan.

Attachment #9135770 - Attachment description: Bug 1624868 - Skip browser/base/content/test/performance/io tests on ccov; r= → Bug 1624868 - Skip browser/base/content/test/performance/io tests on ccov and windows asan; r=

After several attempts using the try server, I finally managed to get useful profiles of the problem on Linux.

The patch at bug 1414495 comment 391 to increase the timeout helped me a lot to extract information out of runs that would normally fail.

Here's a profile of a Linux ccov run (https://bit.ly/2WJEF2U): we are blocked for 116s while making the first window visible, in mozilla::a11y::ShouldA11yBeEnabled(), which calls dbus_connection_send_preallocated and blocks there.

It turns out Linux opt runs also block on this, but for a shorter amount of time. Here's a profile of a Linux opt run (https://bit.ly/2Upcwg9): we are blocked for "only" 59s there, which is not enough to cause timeouts.

Geoff, do you have a sense of whether this is likely a configuration issue on our Linux test machines, or a Firefox bug in our Linux accessibility code?
I ran these builds locally on a ThinkPad running Ubuntu 18.04 and didn't encounter these delays when profiling startup there.

Flags: needinfo?(gbrown)

Sorry, no idea.

Flags: needinfo?(gbrown)

(In reply to Florian Quèze [:florian] from comment #6)

Here's a profile of a Linux ccov run (https://bit.ly/2WJEF2U): we are blocked for 116s while making the first window visible, in mozilla::a11y::ShouldA11yBeEnabled(), which calls dbus_connection_send_preallocated and blocks there.

It turns out Linux opt runs also block on this, but for a shorter amount of time. Here's a profile of a Linux opt run (https://bit.ly/2Upcwg9): we are blocked for "only" 59s there, which is not enough to cause timeouts.

(In reply to Florian Quèze [:florian] from comment #7)

Geoff, do you have a sense of whether this is likely a configuration issue on our Linux test machines, or a Firefox bug in our Linux accessibility code?
I ran these builds locally on a ThinkPad running Ubuntu 18.04 and didn't encounter these delays when profiling startup there.

Geoff didn't know - Jamie, do you know what this dbus call is doing and why it'd be slow (or who might know)?

Flags: needinfo?(jteh)

For the Windows asan failures, even with the longer timeouts we still get a timeout without getting a profile: https://treeherder.mozilla.org/#/jobs?repo=try&author=florian%40queze.net&selectedJob=294958636

I'm not super familiar with DBUS internals. DBUS is what Linux a11y uses to communicate. ShouldA11yBeEnabled is trying to ask (via DBUS) whether any a11y tools are currently enabled on the system. If there are any, Gecko a11y gets enabled.

This page notes that at-spi-bus-launcher and at-spi2-registryd need to be running. This should normally be done by the desktop environment's session manager. Are these running on those machines? I'm not sure whether that would cause this call to block - I just don't know enough about it - but it seems like something worth checking.
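For illustration, an equivalent query can be made from a shell on the test machine with dbus-send, assuming the standard AT-SPI org.a11y.Status interface is what's being consulted (this is a sketch of the kind of call involved, not necessarily the exact message Gecko sends):

    # Ask the session bus whether assistive technologies are enabled;
    # if the a11y bus setup is broken, this is the sort of call that can stall.
    dbus-send --session --print-reply --dest=org.a11y.Bus /org/a11y/bus \
      org.freedesktop.DBus.Properties.Get string:org.a11y.Status string:IsEnabled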

Failing that, as a last resort, I guess we could disable a11y for this set of tests. You can do this by setting the pref accessibility.force_disabled to 1 or setting the environment variable GNOME_ACCESSIBILITY to 0.
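As a hedged illustration of the environment variable option, one could run the affected directory locally with the GNOME a11y integration disabled (assuming the harness forwards the variable to the browser process):

    # Run the affected tests with the a11y dbus probe short-circuited.
    GNOME_ACCESSIBILITY=0 ./mach mochitest browser/base/content/test/performance/io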

Flags: needinfo?(jteh)

:egao - I thought comment 11 might be of interest (another interaction between tests and OS for you to consider). Do you know much about dbus on Ubuntu 18.04? Any idea about at-spi-bus-launcher and at-spi2-registryd on 18.04?

Flags: needinfo?(egao)

Florian, any idea why we haven't seen those extremely long startup times with a Talos test yet? I mean, we run Talos on the Ubuntu 18.04 platform for various scenarios and report to Perfherder, right?

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)

Florian, any idea why we haven't seen those extremely long startup times with a Talos test yet? I mean, we run Talos on the Ubuntu 18.04 platform for various scenarios and report to Perfherder, right?

I think there's probably some interaction between the startup recording / profiling etc. and the dbus thing - but it's not clear what it is...

The weird thing is that on the Linux ccov runs, we now run the perf/io/ tests after running accessible/tests/browser/e10s/browser.ini, so if anything I'd expect any caching effects to help, not hurt.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #13)

Florian, any idea why we haven't seen those extremely long startup times with a Talos test yet? I mean, we run Talos on the Ubuntu 18.04 platform for various scenarios and report to Perfherder, right?

Possible ideas: Talos machines may be configured slightly differently, or Talos startup tests ignoring the first startup could hide the problem.

Flags: needinfo?(florian)

(In reply to Florian Quèze [:florian] from comment #10)

For the Windows asan failures, even with the longer timeouts we still get a timeout without getting a profile: https://treeherder.mozilla.org/#/jobs?repo=try&author=florian%40queze.net&selectedJob=294958636

Looking at the history of failures (https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-11-28&endday=2020-03-27&tree=all&bug=1414495), the first Windows asan failure happened on March 13, 2020.

Gerald, can you think of any change to the profiler that landed in the last few weeks that would make startup with startup profiling (including mainthreadio) more likely to deadlock on Windows asan? It's hard to bisect to find when this started, as in my recent try runs we encountered the failure only about once every 11 runs.

Flags: needinfo?(gsquelart)

(In reply to Florian Quèze [:florian] from comment #15)

Possible ideas: Talos machines may be configured slightly differently, or Talos startup tests ignoring the first startup could hide the problem.

Got the information from Joel this morning that Talos actually still runs on Ubuntu 16.04. Updating those machines to 18.04 might happen late in 2020 or in 2021. So that seems to be the reason.

Yes, Talos and Raptor (and any other tests that run on hardware) are still running on Ubuntu 16.04, as upgrading the physical OS isn't as easy as setting up a Docker image.

As for dbus - I recall I had difficulty setting up an environment that worked for the tests. Some of that was due to dbus needing to be running, but the Docker containers we run are not privileged. I had to work around that with dbus.sh and then call that script from the Dockerfile.

The 1804 image, while good enough to run most of the tests, still has some issues that I will admit are over my head. Specifically, there appear to be important differences between 1604 and 1804 that are causing some really hard-to-debug issues, this bug being one and bug 1607713 being another.

I suggest that we loop in a Linux and/or Docker expert to take a look at this, because I've spent quite some time on bug 1607713 trying to figure out the issues, to no avail, as frustrating as that is.

Flags: needinfo?(egao)

Karl, is there anything you can help us with? See comment 6, comment 11, and comment 18 for details. Thanks!

Flags: needinfo?(karlt)
See Also: → 1540367

IIUC comment 2 confirms that the browser/base/content/test/performance/io/browser.ini tests are part of what is going wrong on Linux.
Whether finer-grained disabling makes a difference might provide further clues.

Putting GNOME_ACCESSIBILITY=0 in the environment would confirm that the a11y dbus call is part of the problem on Linux (as opposed to it just happening to be where the problem shows up first).

The dbus session message bus, used by a11y, is separate from the system bus provided by dbus.sh.

If there is no session bus on the first attempt to connect, then a new dbus process is spawned. This is one dbus path that uses poll() (identified in comment 6). Could something in performance/io affect child processes?
Or file descriptor behavior?

Flags: needinfo?(karlt)

(In reply to Karl Tomlinson (:karlt) from comment #20)

IIUC comment 2 confirms that the browser/base/content/test/performance/io/browser.ini tests are part of what is going wrong on Linux.
Whether finer-grained disabling makes a difference might provide further clues.

https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown%40mozilla.com&tochange=a4bd5238ebb2d4bd2814f7807ebea91099dffb33&fromchange=d934d6f673c5de7d25104e1b6578347a8cd48f87&test_paths=performance%2Fio

It appears that frequent failures continue if any one test is skipped alone.

(In reply to Karl Tomlinson (:karlt) from comment #20)

IIUC comment 2 confirms that the browser/base/content/test/performance/io/browser.ini tests are part of what is going wrong on Linux.
Whether finer-grained disabling makes a difference might provide further clues.

Putting GNOME_ACCESSIBILITY=0 in the environment would confirm that the a11y dbus call is part of the problem on Linux (as opposed to it just happening to be where the problem shows up first).

Putting GNOME_ACCESSIBILITY=0 in the environment 'fixes' the problem. Startup takes 25s on a ccov build and 5s on the opt build. (My try run is at https://treeherder.mozilla.org/#/jobs?repo=try&revision=697dd3b291f404e9f95c1671fd9c42d20a13c1fd if you would like to look at the profiles yourself).

Could something in performance/io affect child processes?
Or file descriptor behavior?

For this folder we take a profile of startup, using the environment variables at: https://searchfox.org/mozilla-central/rev/064b0f9501ad76802853b43f18e33d8713fd54d3/browser/base/content/test/performance/io/browser.ini#18
This means the IO interposer is active to give us information about which I/O is happening. AFAIK on Linux this only means we interpose IO calls at the NSPR level, so I wouldn't expect this to affect system libraries.
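For context, the setup amounts to the manifest enabling startup profiling through environment variables, roughly like this (a reconstruction from the comments in this bug; the exact keys, values and layout are assumptions, the searchfox link above has the real file):

    [DEFAULT]
    # Hypothetical sketch of the startup-profiling environment for this directory.
    environment =
      MOZ_PROFILER_STARTUP=1
      MOZ_PROFILER_STARTUP_FEATURES=js,mainthreadio,ipcmessages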

Any idea about what the next step here could be?

Flags: needinfo?(karlt)
Assignee: gbrown → nobody

Looking at the profile from the try run in comment 22 with GNOME_ACCESSIBILITY=0, I don't see any other dbus calls, so that doesn't actually distinguish whether it is the a11y dbus call or dbus itself.

Comment 22 does confirm that setting GNOME_ACCESSIBILITY=0 in browser.ini would be an effective workaround for Linux.

An alternative may be to provide a dbus-daemon --session process in the test system, as would usually exist in a user's desktop session. That would require setting DBUS_SESSION_BUS_ADDRESS to match. I'm not familiar with the exact procedure required, sorry. GNOME_ACCESSIBILITY=0 would be a simpler option, at least until we have dbus-daemon --session.
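A minimal sketch of providing a session bus in such an environment, assuming dbus-launch is available in the image (where exactly this would hook into the test setup is left open):

    # Start a session dbus-daemon and export its address so child processes find it.
    eval "$(dbus-launch --sh-syntax)"
    echo "$DBUS_SESSION_BUS_ADDRESS"  # should now point at the newly started session bus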

Looking at IOInterposer::Init(), I also see only NSPR-level interposing, and I don't know how that would affect child processes.

However, the a11y dbus call is performed even for other tests. If similar problems are not showing up elsewhere, then I guess that indicates some interaction between startup recording or the IO interposer and a11y or dbus. If the tests can run with GNOME_ACCESSIBILITY=0, then the priority of analysing that interaction and removing the workaround can be addressed separately.

MOZ_PROFILER_STARTUP_FEATURES=js,mainthreadio,ipcmessages looks like it has three different features. I guess we're assuming the mainthreadio feature, but I don't know whether that has been confirmed.

Flags: needinfo?(karlt)

(In reply to Karl Tomlinson (:karlt) from comment #23)

MOZ_PROFILER_STARTUP_FEATURES=js,mainthreadio,ipcmessages looks like it has three different features. I guess we're assuming the mainthreadio feature, but I don't know whether that has been confirmed.

We looked into the mainthreadio feature because you asked if something could affect file descriptor behavior.

I just did a try push with only the js and stackwalk features: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=296388852&revision=73b2fa3e939a0fb72f180228138a15935f0ce9c5 (obviously the tests fail as they don't have the data they need, but I got a profile anyway).
Here's the resulting startup profile, where startup is blocked for 61700ms on dbus: https://perfht.ml/34cG6Zv
The initial profiles here were without the stackwalk feature (which samples native code), and the js feature (which samples only JS code) doesn't play a role in this part of the code.

I tried looking at the startup times of other Linux 18.04 bc jobs, and they seem to typically be around 5s. So maybe the super slow startup only happens for the performance/io folder, where we use startup profiling. If that's the case, I think we should just add GNOME_ACCESSIBILITY=0 to the environment and move on. But not understanding what's actually going on is frustrating.

Thank you for doing that analysis. It is helpful to at least somewhat narrow down the factors involved.

I see from the logs now that DBUS_SESSION_BUS_ADDRESS is already set, implying that dbus-daemon --session is already running. That means the a11y dbus call should not be launching a new process, and I was jumping to conclusions in suspecting it was.

Updating the bug summary and keyword to make it an intermittent bug which can be classified by sheriffs.

Summary: Perma-fail application timed out after 370 seconds with no output on Linux/ccov on browser/base/content/test/performance/io → Perma linux ccov TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output [browser/base/content/test/performance/io]
Attachment #9135770 - Attachment is obsolete: true

(In reply to Karl Tomlinson (:karlt) from comment #26)

Thank you for doing that analysis. It is helpful to at least somewhat narrow down the factors involved.

I see from the logs now that DBUS_SESSION_BUS_ADDRESS is already set, implying that dbus-daemon --session is already running. That means the a11y dbus call should not be launching a new process, and I was jumping to conclusions in suspecting it was.

Do we have an idea of what the next step would be here to understand what's going on? If we don't make progress I think we'll need to add GNOME_ACCESSIBILITY=0 in the environment.

Flags: needinfo?(karlt)

Setting GNOME_ACCESSIBILITY=0 for this test seems to me a reasonable thing to do.

I can only guess at what might be helpful in finding the core issue. I know little about this test or what startup recording does, but perhaps whatever these do can be further reduced to identify a trigger.

Flags: needinfo?(karlt)
Assignee: nobody → aryx.bugmail
Status: NEW → ASSIGNED
Attachment #9146152 - Attachment description: Bug 1624868 - Disable browser/base/content/test/performance/io/ on Linux ccov for almost failing permanently. r=florian → Bug 1624868 - Disable browser/base/content/test/performance/io/ on Linux ccov for almost failing permanently. r=florian DONTBUILD
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/0a92f3dc6085
Disable browser/base/content/test/performance/io/ on Linux ccov for almost failing permanently. r=florian DONTBUILD
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 78
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Target Milestone: Firefox 78 → ---
Assignee: aryx.bugmail → nobody
Status: REOPENED → NEW
Keywords: test-disabled

The commit message in comment 34 was incorrect: the patch (after addressing my review comment) didn't disable the tests, it only disabled GNOME_ACCESSIBILITY, which was interfering with the startup profiling. I would say this bug is fixed. The intermittent failures reported in comments 42 and 43 are unrelated.
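For clarity, the landed change amounts to adding the variable to the directory's manifest defaults, roughly as follows (the exact manifest syntax here is an assumption; see the attached patch for the real change):

    [DEFAULT]
    environment =
      # Avoid the blocking a11y dbus probe while startup profiling is active.
      GNOME_ACCESSIBILITY=0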

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(gsquelart)
Resolution: --- → FIXED
Assignee: nobody → aryx.bugmail