Closed Bug 1635451 (semi-headless) Opened 4 years ago Closed 3 years ago

Limit number of client connections to X for fission to avoid xclient exhaustion (fission)

Categories

(Core :: Widget: Gtk, defect, P2)

Unspecified
Linux
defect

Tracking

()

RESOLVED FIXED
91 Branch
Fission Milestone M7a
Tracking Status
firefox91 + fixed

People

(Reporter: jesup, Assigned: jld)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fission])

Attachments

(7 files, 2 obsolete files)

In fission, we may have a large number of content processes. In linux, each currently opens 2 xclient connections. Xorg code defaults to a max number of clients from all programs of 256, though it can be manually increased to either 512 or 2048. Running fission with ~100 content processes, I've run out of xclient connections multiple times.

See Bug 1621762, bug 788319, bug 1535761 and bug 1129492

See also bug 1129492 comment #21 — if there aren't other regressions, we might be able to connect lazily to X if and when WebGL is used, or at least remove one of the two connections per process. That would require non-native widgets but it wouldn't block on WebGL remoting.

From #ipc:
"So it turns out we actually connect to the X server twice per content process — once from gtk_init, once from gfxPlatform::Init.
Which is odd, because I didn't see them separately in xlsclients, but I can replay in rr and see it open both sockets and never close them."

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #1)

See also bug 1129492 comment #21 — if there aren't other regressions, we might be able to connect lazily to X if and when WebGL is used, or at least remove one of the two connections per process. That would require non-native widgets but it wouldn't block on WebGL remoting.

I should add that this is all theoretical; there might turn out to be reasons why we'd need both X connections just for WebGL, and the idea of lazily initializing the X client is even more speculative.

Bug 1195359 appears to be the source of the two X connections; I don't know how much the original rationale there still applies.

Blocks: 1636493

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #24)

There are two problems here:

  • Remove X11 connections from content processes in all cases, for security, which is bug 1129492
  • Avoid X11 connections in the common case, to avoid exhausting resource limits, which is bug 1635451

Based on Jed's bug 1129492 comment 24, we should try to fix this bug (avoiding X11 in the common case) before Fission Nightly. Tracking for Fission Nightly milestone M6c.

Fission Milestone: --- → M6c

https://bugzilla.mozilla.org/show_bug.cgi?id=1617078#c17 indicates that the X connections limit is also seen with running ~100 copies of reftest layout/reftests/scrolling/frame-scrolling-attr-1.html

See Also: → 1617078
Blocks: 1629093
Priority: -- → P2

There has been a change to our docker images to work around this issue ignoring the underlying cause, see bug 1638183.

See Also: → 1638183

Hi Neha, Martin, this is spiking up on bug 1629093. If I understand correctly, someone should look into this here in this bug?

Flags: needinfo?(stransky)
Flags: needinfo?(nkochar)

This is a duplicate of Bug 1129492 where the work is tracked.

Flags: needinfo?(stransky)
Depends on: 1657315

Some thoughts how to limit the number of X11 connections:

  • load/cache GTK theme info - sizes/borders/etc and the close the X11 connection. Unfortunately almost every page contain scrollbar which is painted in content process. Can we re-connect to X server just for the widget repaint?
  • To use Gtk styles without X11 connection. That needs gsettings which needs X11 display. Not sure if we can fiddle with that somehow.
  • Use a fake X11 connection (like Xvfb) for the offscreen drawing.

(In reply to Martin Stránský [:stransky] from comment #10)

Some thoughts how to limit the number of X11 connections:

  • load/cache GTK theme info - sizes/borders/etc and the close the X11 connection. Unfortunately almost every page contain scrollbar which is painted in content process. Can we re-connect to X server just for the widget repaint?
  • To use Gtk styles without X11 connection. That needs gsettings which needs X11 display. Not sure if we can fiddle with that somehow.
  • Use a fake X11 connection (like Xvfb) for the offscreen drawing.

Aren't we already drawing widgets offscreen and compositing them ourselves? I remember talking to someone about that, probably karlt, in mid-2017, and using a debugger to break on writing to the X11 socket seemed to confirm it.

But there are problems with loading GTK at all: the memory usage mentioned in bug 1470983 is the main one, but also we occasionally get sandbox crash reports where GTK theme loading was trying to use DBus or something else like that instead of just reading files from a well-known location.

If GTK3 themes were close enough to standard CSS, and if it were feasible to implement these widgets in HTML itself via shadow DOM or similar, then there would be a relatively nice solution to the problem of transporting the theme between processes and rendering it without any external dependencies, but I don't know if either of those preconditions is true.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #11)

Aren't we already drawing widgets offscreen and compositing them ourselves? I remember talking to someone about that, probably karlt, in mid-2017, and using a debugger to break on writing to the X11 socket seemed to confirm it.

Yes, you're right. We use GtkStyleContext to draw the widgets to cairo surfaces. And on recent enough Gtk systems we use styles created from CSS styles: https://searchfox.org/mozilla-central/rev/dfd9c0f72f9765bd4a187444e0c1e19e8834a506/widget/gtk/WidgetStyleCache.cpp#1356

The problem is:

  • The style can't be created without display connection / gsettings as Gtk itself is not designed to run without active display connection (Wayland/X11). If something works it may be just partial functionality and it's not guaranteed.
  • Some widgets can't be created from CSS and we use actual widgets to get the style.

But there are problems with loading GTK at all: the memory usage mentioned in bug 1470983 is the main one, but also we occasionally get sandbox crash reports where GTK theme loading was trying to use DBus or something else like that instead of just reading files from a well-known location.

That's interesting. I think the DBus can be avoided, I'm not aware of any styling over DBus or so. I suspect the DBus connection comes from unrelated parts, I'm interesting to hear more about it.

If GTK3 themes were close enough to standard CSS, and if it were feasible to implement these widgets in HTML itself via shadow DOM or similar, then there would be a relatively nice solution to the problem of transporting the theme between processes and rendering it without any external dependencies, but I don't know if either of those preconditions is true.

I'm afraid we need to take Gtk as-is unless we want to migrate to Gtk4.

I think the easiest solution may be to fake the X11 connection to make Gtk3 happy or close display connection after the style initialization (if that's even possible).

Comment 13 would make impossible to handle theme changes and such, right? We use some gtk function calls to actually paint the widget backgrounds and borders and such which may not work if you close the display connection like this right?

Does something like extracting all the geometry+style data we need from the parent process and forward it, then use Gecko instead of gtk to paint with that style information, sound doable?

So, like, make a Gecko re-implementation of gtk_render_background / gtk_render_frame, etc? I think making a WebRender-based one should be relatively straight-forward (if a bit of code). Not sure how easy would it be to do a skia-based one though, hopefully not too terrible.

Just for curiosity I asked the WebKitGTK folks, and they have apparently given up on rendering native form controls and scrollbars for both GTK3 and GTK4: https://bugs.webkit.org/show_bug.cgi?id=208129. They told me there's a switch for scrollbars in GTK3 but that it'll go away.

Another thing that we use GtkStyleContext for but wouldn't be solved by the non-native theme (at least as-is) is the selection colors and other system colors.

We could probably forward them from the parent process and such (we have the LookAndFeel caches to do this kind of thing).

(In reply to Emilio Cobos Álvarez (:emilio) from comment #15)

Comment 13 would make impossible to handle theme changes and such, right?

I suppose so.

We use some gtk function calls to actually paint the widget backgrounds and borders and such which may not work if you close the display connection like this right?

The painting is done via cairo to cairo context we create which is an offscreen image. I don't think that needs an active display connection. But It may depend on actual screen configuration, like desktop scale for instance.

A bit crazy idea may be to bundle gtk3 library, stripped of unrelated code and fixed to work offscreen, which may be used just for the widget/style rendering. That means to create a fake display connection and some stock screen/settings. That library may still load themes from /usr/share and use it.

Just for curiosity I asked the WebKitGTK folks, and they have apparently given up on rendering native form controls and scrollbars for both GTK3 and GTK4: https://bugs.webkit.org/show_bug.cgi?id=208129. They told me there's a switch for scrollbars in GTK3 but that it'll go away.

Just so everyone is on the same page, we are planning to do this for Firefox as well:
https://bugzilla.mozilla.org/show_bug.cgi?id=1535761
https://bugzilla.mozilla.org/show_bug.cgi?id=1411425

The underlying driver isn't directly this bug (although it helps fix it as a side effect so that's nice...), but rather that drawing native controls means the content process ends up accessing (native) graphics APIs (with all their dependencies) which ends up making sandboxing more complicated and less secure, because we have no control over what those libraries are and end up doing (which tends to be an even larger problem on Linux due to ecosystem diversity). Being able to draw them ourselves with the same appearance as a native one would be a nice to have, but the aforementioned upsides mean that a non-entirely native look wouldn't be a blocker - especially as unstyled form controls aren't that common.

Attached file A stub GdkScreen for GdkStyleContext (obsolete) —

(In reply to Martin Stránský [:stransky] from comment #14)

As for the style creation, we're blocked here:
https://gitlab.gnome.org/GNOME/gtk/blob/3.24.20/gtk/gtkstylecontext.c#L345

Thanks for identifying that. A GdkScreen is required for settings. It need not be a GdkX11Screen, but there is no existing headless backend. Perhaps the broadway backend might be adaptable, but I assume we can't depend on that being installed.

We can create a plain GdkScreen, but it's an abstract class really. Filling in the necessary parts for an implementation, such as screen settings from ipc, requires assumptions about GDK internals. That would make this approach risky, but maybe it's an option if there will be no more GDK 3 development?

Attachment #9179809 - Attachment mime type: text/x-c++src → text/plain

To be clear, much of this discussion is about the possibility of an interim solution until bug 1411425 is fixed.

Would it be feasible to have Fission use a new process only when under some limit (say 100 processes), or is the code starting to make assumptions that same process means same site? I'm wondering whether 100 would be sufficient for the vast majority of users at least.

As a means for delaying reaching the limit, lazy GdkDisplay creation may be possible for when native widgets are drawn or sized, which is at least usually on-demand. We'd still need to send at least some of the LookAndFeel info from the parent (bug 1470983 and/or bug 1503054) because some of this is not on-demand. I don't have a good feel for how many cross-site iframes would use native widgets, but I guess that would be a smaller number. Lazy GdkDisplay creation would help only if processes are being started and stopped often. If processes are pooled for re-use, then they'll all end up having a connection.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #15)

Does something like extracting all the geometry+style data we need from the parent process and forward it, then use Gecko instead of gtk to paint with that style information, sound doable?

So, like, make a Gecko re-implementation of gtk_render_background / gtk_render_frame, etc?

I expect this is doable, but a large chunk of work. There would be a large number different combinations of widgets and states and properties to serialize, and there is also some likelihood that different GTK versions will have different implementations of the render functions. Any interim solution would kind of be wasted work, given the plan is to remove it, so I'd prefer a solution with less work.

See Also: → 1626536

Would it be feasible to have Fission use a new process only when under some limit (say 100 processes), or is the code starting to make assumptions that same process means same site? I'm wondering whether 100 would be sufficient for the vast majority of users at least.

I am told that, unlike e10s, Fission's process architecture does not allow content processes to be reused for multiple sites. So capping the number of Fission content processes as an interim solution is not feasible.

We will need to wait for non-native theming (bug 1411425) and the other work gcp listed in bug 1129492 comment 29.

(In reply to Karl Tomlinson (:karlt) from comment #21)

We can create a plain GdkScreen, but it's an abstract class really. Filling in the necessary parts for an implementation, such as screen settings from ipc, requires assumptions about GDK internals. That would make this approach risky, but maybe it's an option if there will be no more GDK 3 development?

Yes, there's no development on Gtk3, all Gtk devs are focused to Gtk4 and Gtk3 is in maintenance mode.

Fission Milestone: M6c → M7
See Also: → 1683420

Nika doesn't think this bug needs to block Fission MVP. Non-native theming will remove the biggest use of X11 in content processes and that's probably good enough for Fission MVP.

Fission Milestone: M7 → Future

(In reply to Chris Peterson [:cpeterson] from comment #31)

Nika doesn't think this bug needs to block Fission MVP. Non-native theming will remove the biggest use of X11 in content processes and that's probably good enough for Fission MVP.

Not exactly. If you just flip the widget.disable-native-theme-for-content pref, we'll still connect to the X server at content process startup, and have various dependencies on it. If we can't ship with one connection per content process, and we don't think out-of-process WebGL will be ready in time, then we'll need to set MOZ_HEADLESS on content processes add code to WebGL so that it can do its own XOpenDisplay (and ideally also close the display if it's unused — we might not want to tie up a connection for every domain that uses WebGL only for fingerprinting).

(We'd also need a pref for that that isn't security.sandbox.content.headless, because that also controls sandbox policy and would break in-process WebGL, but that's a minor issue.)

See Also: → 1687853

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #32)

(In reply to Chris Peterson [:cpeterson] from comment #31)

Nika doesn't think this bug needs to block Fission MVP. Non-native theming will remove the biggest use of X11 in content processes and that's probably good enough for Fission MVP.

Not exactly. If you just flip the widget.disable-native-theme-for-content pref, we'll still connect to the X server at content process startup, and have various dependencies on it. If we can't ship with one connection per content process, and we don't think out-of-process WebGL will be ready in time, then we'll need to set MOZ_HEADLESS on content processes add code to WebGL so that it can do its own XOpenDisplay (and ideally also close the display if it's unused — we might not want to tie up a connection for every domain that uses WebGL only for fingerprinting).

In that case, I filed new bug 1687853 for Fission triage to discuss. We definitely do not want to ship Fission with one X11 connection per content process. Sounds like there is work to do, even if we enable the widget.disable-native-theme-for-content pref and out-of-process WebGL.

Fission Milestone: Future → ?

I think this should be targeted for M8 as it will definitely block Fission to Release channel.

Fission Milestone: ? → M8
Depends on: 1687853
See Also: 1687853

I have some proof-of-concept patches that accomplish this:

  1. Start most content processes in MOZ_HEADLESS mode (the exceptions being where moz-icon URLs are needed; see also bug 1695381).
  2. Enhance our GLX support to open/close its own X server connection when needed, similar to what we already do with EGL on X11.

In theory the second part could be its own bug, but I don't think there's much point: the only ways to test it are either the X11-avoidance mode (which depends on the GLX change to not break WebGL) or security.sandbox.content.headless which is still experimental.

There remain a few issues:

  1. Not breaking Wayland support; my plan is to have Wayland mode continue to use GTK as normal.
  2. Minor memory leaks in Mesa which probably don't matter in practice but break our ASan test runs; I'll file a separate bug for that.

(More words about Wayland: for in-process WebGL, in general we need to supply the EGL library with a Wayland display connection (bug 1460603), and we'd need to tell the “headless” content process whether it should be using X11 or Wayland. For out-of-process WebGL, MOZ_HEADLESS should probably just work but I haven't tested it.)

Assignee: nobody → jld

Comment on attachment 9179809 [details]
A stub GdkScreen for GdkStyleContext

I don't think we'll need this (but it remains in Bugzilla for future reference).

Attachment #9179809 - Attachment is obsolete: true

One other concern: avoiding use-after-free of the Display; this will change its lifetime, so I need to find anywhere the GL context's Display* can flow into and make sure it's handled properly (including and especially in error cases which we don't test on automation — I already had one issue when context creation failed due to bug 1700799).

This is required for adding Linux users to Beta experiment (M7a) so fixing the fission milestone flag on this.

Status: NEW → ASSIGNED
Fission Milestone: M8 → M7a
Alias: semi-headless
See Also: → 1704965

It would be nice to have a test case to verify that the fix for this is working (or detect similar issues, like bug 1704965); one approach is to somehow launch a large number of content processes and then obtain a count of the X server's clients.

The attached C program handles the second part: it uses the XRes extension to get a count of clients for the server (including ones with no windows, which xlsclients doesn't show; credit here goes to :gcp for finding the more complex tool xrestop). Turning this into something we could run in an automated test may be complicated, especially given that it adds a new dependency on the libxres1 package.

This follows what we're already doing for EGL: a refcounted object
which can own the X connection, where we hold a weak reference from
the library object so that multiple contexts opportunistically share
the display but we close the connection when the last context is
freed/GCed.

In a process where GTK is initialized, we borrow its display instead of
opening a new one, which preserves the existing behavior.

This patch launches content processes with the MOZ_HEADLESS env var set if:

  1. the widget toolkit is GTK
  2. the display is X11
  3. the content process doesn't need GTK for other reasons (e.g., bug 1695381)

The goal is to avoid exhausting Xorg's default limit of 256 clients if
there are many content processes due to Fission. If these conditions
are met, the content process doesn't need to eagerly connect to the X
server. This does not affect the sandbox policy, and content processes
can still use X if needed for, e.g., WebGL.

The new env var MOZ_HEADLESS_BROWSER is added, set for content
processes to indicate if the entire browser is headless (rather than
just a specific process); this is needed in several cases.

The boolean pref dom.ipc.avoid-x11, set by default, controls this feature.

Note that disabling widget.non-native-theme.enabled, which is also
enabled by default, will restore the use of X11 in all content processes
even if this pref is set; the alternative is that widgets wouldn't render
in that case.

This change should also save us some memory for now-unnecessary
instances of GTK's global state, and may improve content processes
startup time.

Blocks: fission

Current work-in-progress patch stack, in git format-patch format. Also includes bug 1715182 and bug 1695381.

Currently we return an error when creating a WebGL context in headless
mode, but our WebGL implementation renders to an offscreen context, so
in theory it could work normally in a headless browser, and in practice
it already does work on some OSes. This patch removes that check; the
attempt to use GL may fail, in which case we'll return an error to
content.

The main purpose of this patch is to run content processes with headless
mode set in an otherwise non-headless browser, but it should also be
useful for fully headless mode. Comments in bug NNNNNNN indicate that
this change should be sufficient for headless WebGL on Windows and MacOS,
although it may not have been extensively tested.

Linux is more complicated. The EGL/X11 backend manages its own
connection to the X server (indirectly via the EGL library); a later
patch in this series allows doing that in GLX mode as well. Our Wayland
support can't do this yet, but it should be possible.

This patch also modifies the Linux sandbox policy so that content
processes can connect to a local X server (via the file broker) even when
the parent process is in headless mode.

Depends on D118687

Currently we return an error when creating a WebGL context in headless
mode, but our WebGL implementation renders to an offscreen context, so
in theory it could work normally in a headless browser, and in practice
it already does work on some OSes. This patch removes that check; the
attempt to use GL may fail, in which case we'll return an error to
content.

The main purpose of this patch is to run content processes with headless
mode set in an otherwise non-headless browser, but it should also be
useful for fully headless mode. Comments in bug NNNNNNN indicate that
this change should be sufficient for headless WebGL on Windows and MacOS,
although it may not have been extensively tested.

Linux is more complicated. The EGL/X11 backend manages its own
connection to the X server (indirectly via the EGL library); a later
patch in this series allows doing that in GLX mode as well. Our Wayland
support can't do this yet, but it should be possible.

This patch also modifies the Linux sandbox policy so that content
processes can connect to a local X server (via the file broker) even when
the parent process is in headless mode.

Attachment #9216100 - Attachment description: Bug 1635451 - Allow GLX to work in headless content processes. → Bug 1635451 - Allow GLX to work in headless content processes. r?jgilbert
Attachment #9216102 - Attachment description: Bug 1635451 - Minimize content processes' connections to the X server. → Bug 1635451 - Minimize content processes' connections to the X server. r?jgilbert,stransky,nika
Attachment #9228709 - Attachment is obsolete: true

Local run of BUILD_DEBUG=1 xvfb-run ./mach test dom/ipc/tests/browser_very_fission.js without the patches triggers some exhaustion of X clients, and it looks like the same run with all the patches passes fine with the 256 tabs.

Pushed by alissy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2974b1ba2beb
Attempt to start WebGL even in headless mode. r=jgilbert
https://hg.mozilla.org/integration/autoland/rev/c1bd0996764c
Allow GLX to work in headless content processes. r=jgilbert
https://hg.mozilla.org/integration/autoland/rev/c49de061c1fa
Minimize content processes' connections to the X server. r=jgilbert,stransky,nika

== Change summary for alert #30525 (as of Thu, 08 Jul 2021 04:50:56 GMT) ==

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
41% Base Content Heap Unclassified linux1804-64-shippable 3,522,987.00 -> 2,090,219.33
40% Base Content Heap Unclassified linux1804-64-shippable-qr 3,565,518.33 -> 2,132,685.00
16% Base Content Explicit linux1804-64-shippable-qr 8,958,618.67 -> 7,553,178.67
16% Base Content Explicit linux1804-64-shippable 8,915,440.00 -> 7,530,480.00
12% Heap Unclassified linux1804-64-shippable 64,812,045.25 -> 57,191,783.93
... ... ... ... ... ...
2% Explicit Memory linux1804-64-shippable 362,469,246.48 -> 353,560,435.66

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=30525

Regressions: 1719668
See Also: → 1654658
See Also: 1654658
Regressions: 1721611
See Also: → 1503991
Regressions: 1725573
Regressions: 1725148
See Also: → 1658523
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: