[e10s] There is a delay for about 1 seconds to display <select> drop-down list with 1600+ items

NEW
Unassigned

Status

()

Core
Layout: Form Controls
P2
normal
3 years ago
a month ago

People

(Reporter: Alice0775 White, Unassigned)

Tracking

(Depends on: 2 bugs, Blocks: 4 bugs, {multiprocess, perf, testcase})

Trunk
multiprocess, perf, testcase
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s+)

Details

(Whiteboard: [qf:f61][qf:p1])

Attachments

(3 attachments, 1 obsolete attachment)

(Reporter)

Description

3 years ago
Steps to reproduce:
0. Open e10s window
1. open https://bugzilla.mozilla.org/request.cgi?action=queue&requester=&product=&type=all&requestee=jdaggett%40mozilla.com&component=&group=type
2. Click drop-down marker of "Component:"

Actual results:
There is a delay for about 1 seconds

Expected results:
It should be as same as on non-e10s window
Assignee: nobody → gwright
tracking-e10s: ? → m6+
Flags: needinfo?(gwright)

Updated

3 years ago
Blocks: 1154677
tracking-e10s: m6+ → m7+
Flags: needinfo?(gwright)
tracking-e10s: m7+ → m8+
tracking-e10s: m8+ → +
Using reported STRs, I see 2-3 tenths of a second delay on Mac.  In a non-e10s window the same <select> is opened without any perceptible delay.
Priority: -- → P2
From profile [1], there's a sync IPC when chrome process is busy restyle/reflow (>600ms). However, clicking another drop-down "Product:" does not have the same issue, seems it does less in populateChildren() so the restyle/reflow is much shorter (~50ms).

[1] http://people.mozilla.org/~bgirard/cleopatra/#report=e100ac486f3367683eaaa1bb50381b4729a1d86f&select=333870,334900
[2] http://people.mozilla.org/~bgirard/cleopatra/#report=0453644e63cf393173d955bb32fc329ffe4b9649&select=838790,838900
The dropdown "Component" has 1593 options, but "Prodcut" has only 124 options and 5 groups.

I am lack of e10s background knowledge, one thing I don't understand is why the content of <select> need to send to parent to draw (XUL).

Comment 5

2 years ago
The title of this bug is misleading, I have experienced times when it takes a good 5+ seconds to open a SELECT element.

Comment 6

2 years ago
Renominating: this is the kind of perf regression which should block M9.
tracking-e10s: + → ?

Updated

2 years ago
Flags: needinfo?(jgriffiths)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> Renominating: this is the kind of perf regression which should block M9.

We don't really have enough data to understand the regression. The missing pieces are:

1. STR has a list with 1620 items. At what length do we really run into problems? On my Mac ( a super fast machine ) the time is way less than 1 second but it's noticeable. If I start reducing the length of the list, the delay gets way better at around 600-800 items. 

2. based on data from 1, how long are lists on the web? there are extreme cases like our components select but I'm betting most lists are short.

If we had a talos test that showed us how list perf degrades based on list length across platforms, and if the list length at which a list starts to take significantly longer than baseline is more than most lists out there, this is low priority.

Jim: in absence of data this is '+' and P2. If we got data that told us this is more serious, I would set it to block but my intuition is that even 400+ list lengths are an edge case.

If we had infinity engineering time I would say let's measure this. I'm not sure it's worth our time, I would rather gut this out and defer the work past M9.
Flags: needinfo?(jmathies)
Summary: [e10s] There is a delay for about 1 seconds to display <select> drop-down list → [e10s] There is a delay for about 1 seconds to display <select> drop-down list with 1600+ items
Flags: needinfo?(jgriffiths)

Updated

2 years ago
tracking-e10s: ? → +
Flags: needinfo?(jmathies)
(In reply to Ting-Yu Chou [:ting] from comment #3)
> I am lack of e10s background knowledge, one thing I don't understand is why
> the content of <select> need to send to parent to draw (XUL).

Seems it's because of bug 897060.

Updated

2 years ago
Duplicate of this bug: 1293599
I frequently use a <select> with ~7420 entries. With recent updates enabling e10s I've had complaints from users that the lists are slow, occasionally taking minutes to render.
Tests on my development PC show instant display with e10s disabled, and 3.5s with e10s enabled.

e10s also ignores CSS applied to <option>s.
Duplicate of this bug: 1336428
Created attachment 8833705 [details]
select.html

Here is a link to a profile that was recorded by opening the popup on the attached HTML file (10,000 options):

https://cleopatra.io/#report=0ac09b5ee28b5c391bd7fea4b04780e722d76018&selection=0,87,88,89,90,91,92,1,93,94,95,96,97,98,99,100,200,101,307,308,309,310,311,675,313,676,677,678,39,9,3,714,14,15,9,10,11,9,3,1015,14,15,9,10,11,9,1016

The profile shows that the majority of the time in the delay comes from the call to getBoundingClientRect within SelectParentHelper.jsm's open():

http://searchfox.org/mozilla-central/rev/b1aadb3572eaf7d2c70e19a2ba5413809d9ac698/toolkit/modules/SelectParentHelper.jsm#64
Two thirds of the time is creating the frames for the options, one third is reflowing them. Do we just create the frames and reflow them eagerly in old style (non-e10s) selects so that we pay that cost on page load but the select is then ready to display later?
To put it in to context:

435 samples for getBoundingClientRect in SelectParentHelper.jsm's open().
195 samples for populateChildren() in SelectParentHelper.jsm
40  samples for set_textContent in SelectParentHelper.jsm's populate().

Those three categories cover 99% of the work involved in opening the popup. So finding a faster approach than using getBoundingClientRect() in open() potentially could solve up to 65% of the delay. The work in populateChildren() is a long tail, but there are two operations that each account for 55 and 35 samples separately. The 55 sample chunk looks like its related to a js::ProxyGetProperty call, and the 35 sample chunk looks like its related to a CSS style that is set as a property.
(In reply to Jared Wein [:jaws] (please needinfo? me) from comment #14)
> Those three categories cover 99% of the work involved in opening the popup.
> So finding a faster approach than using getBoundingClientRect() in open()
> potentially could solve up to 65% of the delay. The work in

Is the time taken by getBoundingClientRect frame construction/reflow work that we would have to be doing anyways? Or is it "getBoundingClientRect, change something that requires layout, getBoundingClientRect,  change something that requires layout, ..."?
The getBoundingClientRect work in that profile may not have come from the line that I initially thought it was[1]. Indeed that call to getBoundingClientRect is called each time the popup is opened, same with the one above it to get the height of the first item[2].

However, that may have been a red herring from the profiler. If I use the debugger and step through the code, all of the delay within open() is coming from when we set menulist.hidden=false[3]. At this point, the menulist has already been populated and we are now adding it to the frame tree. This line is here to fix a test that was failing because the menulist was getting focus when it shouldn't have[4].

Ideally we could find a way to do this that wasn't synchronous. populate() takes some time already, and moving the menulist.hidden=false line to run before the list is populated almost tripled the delay (rough counting in my head while waiting).

[1] http://searchfox.org/mozilla-central/rev/b1aadb3572eaf7d2c70e19a2ba5413809d9ac698/toolkit/modules/SelectParentHelper.jsm#64
[2] http://searchfox.org/mozilla-central/rev/b1aadb3572eaf7d2c70e19a2ba5413809d9ac698/toolkit/modules/SelectParentHelper.jsm#53
[3] http://searchfox.org/mozilla-central/rev/b1aadb3572eaf7d2c70e19a2ba5413809d9ac698/toolkit/modules/SelectParentHelper.jsm#37
[4] https://people-mozilla.org/~mconley2/bugnotes/bug-1108761.html
Also, I should add that it looks like we could replace that call to browser.getBoundingClientRect() by using window.screen, and the avail* properties. Though this didn't fix the slow-down as described by comment 16.
For the record the non-e10s case is pretty much instant for loading the page and opening the dropdown. So there's no reason this can't be fast.

Digging into a local profile the one big flush takes 2781ms. Of that 1424 ms are in DidDoReflow calling the reflow callbacks nsAsyncAccesskeyUpdate and nsAsyncMenuInitialization. 1352 ms of that is removing weak frames. Both those classes hold a weakframe. Removing a weakframe just walks a linear linked list of all existing weak frames. I'm guessing we get one of these classes for each select option, so we get a huge number of weakframes that we have to search.

We don't need to use weakframes for this, we can just revoke the reflow callback in the frames Destroy function.

That's about 50% of the time knocked off right there, But that still leaves over 1000ms of work we are doing. Reflow is 468ms is gross archaic XUL layout. 888ms is in constructing frames, which it looks like we do one at a time for each option. If you can set display: none on some element that contains everything, then insert all the options, then remove the display: none we could see if constructing them in one go improves things at all.
Depends on: 1340451
Depends on: 1340452
(In reply to Timothy Nikkel (:tnikkel) from comment #18)
> Digging into a local profile the one big flush takes 2781ms. Of that 1424 ms
> are in DidDoReflow calling the reflow callbacks nsAsyncAccesskeyUpdate and
> nsAsyncMenuInitialization. 1352 ms of that is removing weak frames. Both
> those classes hold a weakframe. Removing a weakframe just walks a linear
> linked list of all existing weak frames. I'm guessing we get one of these
> classes for each select option, so we get a huge number of weakframes that
> we have to search.
> 
> We don't need to use weakframes for this, we can just revoke the reflow
> callback in the frames Destroy function.

Bug 1340452 and bug 1340451 fix this. They knock off about half the time.
I'm away from my desktop computer until Monday (which is a US holiday), so maybe consider it Tuesday.

However, these nodes should have display:none on them already. This is why we have menulist.hidden=false at the beginning of open(). The menulist is hidden in the XUL at http://searchfox.org/mozilla-central/source/browser/base/content/browser.xul#179 and then is made visible in SelectParentHelper.jsm's open() call at http://searchfox.org/mozilla-central/source/toolkit/modules/SelectParentHelper.jsm#63.

The menulist is populated by the populate() function, but open() is called after populate(), as seen at http://searchfox.org/mozilla-central/rev/ea31c4b64f34a29415a69fb768f8051495547315/toolkit/content/widgets/remote-browser.xml#474-478
Flags: needinfo?(jaws)
Hmm, why is the e10s <select> combobox implemented using XUL frames!?!?
Please re-implement it using HTML+CSS instead, which is highly optimized,
unlike XUL.  Our goal in platform is to *remove* XUL!
Mike, are you able to provide any background on comment #21?
Flags: needinfo?(mconley)
Bug 1108761 was where the original work was done. The general idea was that this was the shortest path, at the time, to get the appearance and behaviour characteristics that we wanted of the popup.

If we were to try to move this over to HTML, we'd need some kind of popup thing to put the HTML into. I mean, we could use a popup with a <xul:browser> and then put some HTML in there for the items but... but that sounds crazy.

Unless I'm completely missing a different solution.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) from comment #23)
> If we were to try to move this over to HTML, we'd need some kind of popup
> thing to put the HTML into. I mean, we could use a popup with a
> <xul:browser> and then put some HTML in there for the items but... but that
> sounds crazy.

Using XUL popups is much much less of a problem since the menu popup code is relatively modern, understood, and well maintained. So using a xul popup and then as thin a layer of XUL as possible to contain HTML would be good. Isn't it possible for XUL for contain HTML elements without putting them in an iframe?
Flags: needinfo?(jaws)
Depends on: 1340771
No longer depends on: 1340452

Updated

a year ago
Duplicate of this bug: 1303352
(Reporter)

Updated

a year ago
Blocks: 1353344
(In reply to Timothy Nikkel (:tnikkel) from comment #24)
> 
> Isn't
> it possible for XUL for contain HTML elements without putting them in an
> iframe?

Yes, but mixing XUL and HTML has burned front-end in the past, particularly around sizing bugs, since the box models differ between the two and this causes weird bugs to creep in.

In any case, I agree that XUL is not ideal here. We did what we could to ship e10s with working <select>'s, and maybe this is worth a revisit - especially if performance gains are to be had.
Blocks: 1348289
Whiteboard: [qf]

Updated

a year ago
Whiteboard: [qf] → [qf:p1]

Updated

a year ago
Assignee: gw → nobody
See Also: → bug 1359263

Updated

11 months ago
No longer blocks: 1348289
Here's a profile that I just captured in current Nightly (fresh profile), for opening the menu, on Linux:
 https://perfht.ml/2qTil54

There are two red jank-bars there, the second of which is 3218ms long and is mostly frame construction & XUL layout.
OK, here's a large overview of everything that happens in this profile from comment 27.

FIRST RED JANK BAR (814ms): Content process
===========================================
https://perf-html.io/public/629acab11dec2d5d2ce5376ba515daa51c0a56b4/calltree/?range=1.9350_2.7492&thread=4

* This is mostly the child process calling buildOptionListForChildren(), for 500ms
https://perf-html.io/public/629acab11dec2d5d2ce5376ba515daa51c0a56b4/calltree/?callTreeFilters=prefix-01&range=1.9350_2.7492&thread=4
Source link:
https://dxr.mozilla.org/mozilla-central/rev/5bc1c758ab57c1885dceab4e7837e58af27b998c/toolkit/modules/SelectContentHelper.jsm#329

The profiler shows buildOptionListForChildren only calling into code that doesn't have symbols (maybe it's jitted code or something?).  But drilling into them, a lot of it seems to be restyles.  For example: the most expensive "child" (0x7fb101523e6f) is mostly a query to CSS2PropertiesBinding::get_display.  This must correspond to the JS line "display: cs.display" in SelectContentHelper.jsm:
https://dxr.mozilla.org/mozilla-central/rev/5bc1c758ab57c1885dceab4e7837e58af27b998c/toolkit/modules/SelectContentHelper.jsm#352,359
And we spend 63ms in get_display because that call triggers a style flush.  (I'm not sure if this is a one-time large cost or whether there's a penalty we're paying for each child.)


SECOND RED JANK BAR (3.2 sec): Parent process
=============================================
https://perf-html.io/public/629acab11dec2d5d2ce5376ba515daa51c0a56b4/calltree/?range=2.7481_5.9705&thread=0

* This is basically one long call to mozilla::dom::TabParent::RecvAsyncMessage.

* First: we spend 72ms continuously in ReadFromBuffer, which is probably parsing the message from the content process.

* Then we spend 436ms in a JS function called "populate":
http://searchfox.org/mozilla-central/rev/6c2dbacbba1d58b8679cee700fd0a54189e0cf1b/toolkit/modules/SelectParentHelper.jsm#30
  - Of that, 111ms is in JS clearing textContent ("set_textContent") and destroying some XUL, which is from this JS code in populate:
http://searchfox.org/mozilla-central/rev/6c2dbacbba1d58b8679cee700fd0a54189e0cf1b/toolkit/modules/SelectParentHelper.jsm#33-34
  - And the remaining 325ms is in a JS function called "populateChildren" (in a bunch of no-symbols child functions, which are probably jitted code)

* Then we spend 2134ms in a JS function called "open", and nearly all of that time (2132ms) is in "getBoundingClientRect" as mconley hinted. That breaks down like so:
  - 586ms in frame construction (ConstructFramesFromItemList)
  - 1491ms in XUL layout (nsBoxFrame::DoXULLayout)
  - 51ms in nsAsyncAccesskeyUpdate::ReflowFinished()

* Then some GTK/GDK calls like "gtk_lock_button_set_permission" which result in:
  - 49ms of restyles (RestyleTracker::ProcessRestyles) which seems to be for a text/font-related restyle
  - 460ms of XUL relayout.

* And finally, we spend 6ms in UpdateNativeWidgetZIndexes.
Here's a profile from Windows, for what it's worth: https://perfht.ml/2rzQREp

I haven't looked at it much yet.
Here's a better Windows profile (1ms instead of 0.5ms, since times smaller than 1ms seem to be a bit busted right now): https://perfht.ml/2rzS0vr
(In reply to Daniel Holbert [:dholbert] (reduced availability - travel & post-PTO backlog) from comment #30)
> FIRST RED JANK BAR (814ms): Content process
> And we spend 63ms in get_display because that call triggers a style flush. 
> (I'm not sure if this is a one-time large cost or whether there's a penalty
> we're paying for each child.)

We're trashing with a restyle and flush for each child we iterate, because inDOMUtils::AddPseudClassLock and inDOMUtils::ClearPseudClassLocks cause content state changes that post restyles.
(In reply to David Baron :dbaron: ⌚️UTC-4 from comment #33)
> We're trashing with a restyle and flush for each child we iterate, because
> inDOMUtils::AddPseudClassLock and inDOMUtils::ClearPseudClassLocks cause
> content state changes that post restyles.

I'm working on a patch for this.
(I'm patching the JS... though it's likely it would also help to fix Element::LockStyleStates and Element::UnlockStyleStates to notify that there's a style change only when it's actually changing the state, though that might have other disadvantages.)
Created attachment 8870878 [details] [diff] [review]
fix the restyle thrashing in the child

This fixes the restyle thrashing in the child process, and removes the restyle stuff from the profile... but doesn't help very much with the overall time of the child process's pause!
Attachment #8870878 - Attachment is patch: true
Attachment #8870878 - Attachment mime type: message/rfc822 → text/plain
OK, I made it actually be a significant win and split it out into bug 1367505.
Depends on: 1367505
So for the parent process hang.  I have a profile of a 10196ms hang, which got 10001ms of profiling samples.  There are four main pieces of the profile, in this order:
 - 268ms is in IPC code receiving the Forms::ShowDropDown IPC message
 - 1104ms in the populate JS function in SelectParentHelper.jsm
   - the bulk of the time in this seems to be JS execution and various JS engine functions
 - 6897ms in the open JS function in SelectParentHelper.jsm (much of which is the native code below)
 - then there's a final 148ms restyle and 1483ms Reflow

For interesting native code (split between running inside the open JS function and the final restyle and reflow):
 - 4624ms is in nsNativeThemeGTK methods
 - 2157ms is in RestyleTracker::DoProcessRestyles
   - within this, 1059ms is in nsXBLService::LoadBindings
(It seems to me that this would have been much better if the thing we were using in the parent process were just our old select code, rather than all this XUL and XBL stuff...)
(In reply to David Baron :dbaron: ⌚️UTC-4 from comment #38)
>  - 268ms is in IPC code receiving the Forms::ShowDropDown IPC message

The time here is *entirely* inside JSStructuredCloneReader::read (called via mozilla::dom::TabParent::RecvAsyncMessage -> mozilla::dom::TabParent::ReceiveMessage -> nsFrameMessageManager::ReceiveMessage -> nsFrameMessageManager::ReceiveMessage -> mozilla::dom::StructuredCloneHolder::ReadFromBuffer -> mozilla::dom::StructuredCloneHolder::ReadFromBuffer -> ReadStructuredClone -> JSStructuredCloneReader::read).
Comment hidden (typo)

Updated

11 months ago
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #39)
> (It seems to me that this would have been much better if the thing we were
> using in the parent process were just our old select code, rather than all
> this XUL and XBL stuff...)

Do we already have a bug for this? It seems at first glance that we should have a dependent bug here.

People were asking in Quantum Flow triage if we could assign this bug to you, dbaron :) But maybe it should be a meta bug?
Duplicate of this bug: 1364143
No longer blocks: 1353344
Duplicate of this bug: 1353344

Comment 45

10 months ago
(In reply to Andrew Overholt [:overholt] from comment #42)
> (In reply to David Baron :dbaron: ⌚️UTC-7 from comment #39)
> > (It seems to me that this would have been much better if the thing we were
> > using in the parent process were just our old select code, rather than all
> > this XUL and XBL stuff...)
> 
> Do we already have a bug for this? It seems at first glance that we should
> have a dependent bug here.
> 
> People were asking in Quantum Flow triage if we could assign this bug to
> you, dbaron :) But maybe it should be a meta bug?

It looks like it was a conscious decision (see bug 1300784) to unify this stuff using the current code though I can't tell why we went with the more complicated option.
Comment on attachment 8870878 [details] [diff] [review]
fix the restyle thrashing in the child

obsoleting since this moved to bug 1367505.
Attachment #8870878 - Attachment is obsolete: true
Something we could look into here:
 * can we avoid having an XBL binding per option?  Or if we can't, can we make that binding cheaper?
(Reporter)

Updated

10 months ago
Blocks: 1374998
(Reporter)

Updated

10 months ago
No longer blocks: 1374998
Duplicate of this bug: 1374998

Updated

10 months ago
Duplicate of this bug: 1377155

Updated

10 months ago
Duplicate of this bug: 1378367

Comment 52

10 months ago
I see delay about 4~5 seconds to display <select> of attachment 8833705 [details] or opening "Written in this language" on GitHub's Advanced search, while e10s on only. Nightly 56.0a1 (2017-07-06) on Win10.
OS: Windows 7 → All
Hardware: x86_64 → All
Duplicate of this bug: 1379870
(In reply to Mats Palmgren (vacation - back in August) from comment #21)
> Hmm, why is the e10s <select> combobox implemented using XUL frames!?!?
> Please re-implement it using HTML+CSS instead, which is highly optimized,
> unlike XUL.  Our goal in platform is to *remove* XUL!

FWIW, if we were to move the e10s <select> to an HTML+CSS implementation, that might also provide an opportunity to fix the vertical writing-mode regression that e10s introduced (see bug 1170129).
(Reporter)

Updated

9 months ago
Duplicate of this bug: 1382119
It's not clear to me if someone's working on this or if it should remain qf:p1. Jet?
Flags: needinfo?(bugs)

Updated

9 months ago
Blocks: 1383205

Comment 57

9 months ago
(In reply to Andrew Overholt [:overholt] from comment #56)
> It's not clear to me if someone's working on this or if it should remain
> qf:p1. Jet?

The test case in bug 1383205 is really bad and I'd still love to fix this.

(In reply to David Baron :dbaron: ⌚️UTC+1 from comment #47)
> Something we could look into here:
>  * can we avoid having an XBL binding per option?  Or if we can't, can we
> make that binding cheaper?

I think this could still help, though I'm not sure how to wire that up. 
Olli: I can probably find an assignee if you suggest an approach for the XBL.
Flags: needinfo?(bugs) → needinfo?(bugs)
Just don't use XBL?
I assume nsXBLService::LoadBindings shows up badly only when doing the first load, since
XBL's prototypes are then used for the next bindings.
Of course if XBL binding constructor takes lots of time, that can be rather bad.

Has anyone considered using xul:tree? Or something similar implemented using JS+HTML?
Flags: needinfo?(bugs)
Assignee: nobody → npancholi
(In reply to Jet Villegas (:jet) from comment #57)
> (In reply to Andrew Overholt [:overholt] from comment #56)
> > It's not clear to me if someone's working on this or if it should remain
> > qf:p1. Jet?
> 
> The test case in bug 1383205 is really bad and I'd still love to fix this.

Do we understand the problem well enough to think 57 is realistic?

My original assessment last year was that I didn't think 400+ item lists were common enough to block shipping e10s to, I believe, beta at the time.

I agree it is galling loading the test case in Canary only to see the drop list render and be interactive immediately...
No longer blocks: 1383205
Depends on: 1383205
Depends on: 1386015
(In reply to Jeff Griffiths (:canuckistani) (:⚡︎) from comment #59)
> Do we understand the problem well enough to think 57 is realistic?

I suspect 57 isn't realistic here, particularly given that we're trying to front-load riskier changes at the very start of the 57 cycle (i.e. landing that sort of stuff ~next week), and this will be a risky change.  And we're not entirely settled on precisely what sort of change this will be, even.

So, I'm adjusting this to qf:p2 to match reality.  A fix for 57 would be great, but unless we happen to get a ready-to-land patch real soon, we should probably target this for the 58 timeframe to avoid introducing webcompat/stability risk late in the 57 cycle.

> My original assessment last year was that I didn't think 400+ item lists
> were common enough to block shipping e10s to, I believe, beta at the time.

(I share your hope that this sort of content is rare/pathological enough that users don't hit this super-often.  If we end up needing to make more calls about priority here, we could probably collect telemetry to confirm/refute this suspicion; but at this point I'm not sure that'd be worth it.)
Whiteboard: [qf:p1] → [qf:p2]
Hi Mats, 

After speaking with mconley, one solution to this problem seems to be that we can constrain the size of the popup to stay within the content area and then draw the popup in the content process itself. It seems like for E10S, we are already constraining the popup because of the fix for Bug 1276976 (Although, Felipe noticed that this might be wrong, will confirm with :Enn when he gets back)

The problem with doing this is that carrying the OS styling over to the child process might not be trivial. Do you have any suggestions for if this is doable/any pointers on how this can be done?
(**I guess someone from UX will have to weight in on constraining the popup for both E10S and non-E10S and also on the pop up styling itself, if we cannot keep some/all of the OS styling)

I also sent you an email with additional details and screenshots. If there is anything that you think is relevant to this, please let me know.

Thanks a lot!
Neerja
Flags: needinfo?(mats)
Bram,
Wondering if you could weigh in on constraining the popup to be drawn within the content area only. We are already doing this with E10S enabled right now because of Bug 1276976 (but this might be unintentional, will confirm that). 
What do you think of keeping this restriction for both E10S and non-E10S? 
Thanks!
Flags: needinfo?(bram)

Comment 63

9 months ago
Hi Neerja, we’ll get back to you on this. I’ve contacted Stephen Horlander who is designing our Photon form elements.
Flags: needinfo?(bram)
(In reply to Bram Pitoyo [:bram] from comment #63)
> Hi Neerja, we’ll get back to you on this. I’ve contacted Stephen Horlander
> who is designing our Photon form elements.

Thanks bram - I've set a needinfo? on shorlander so that (hopefully) this doesn't slip through the cracks[1].

[1]: Since I know how busy we are with 57. :)
Flags: needinfo?(shorlander)
(In reply to Neerja Pancholi[:neerja] from comment #62)
> Wondering if you could weigh in on constraining the popup to be drawn within
> the content area only. We are already doing this with E10S enabled right now
> because of Bug 1276976 (but this might be unintentional, will confirm that). 
> What do you think of keeping this restriction for both E10S and non-E10S? 
> Thanks!

So the ability to draw outside of the browser window is the main reason to have the complexity of sending the options to the parent process and drawing them there.  Given that we're not doing that, it seems like we'd be better off just drawing the popup in the child process, as you suggest.

Keeping more-native looking styling as we do since the switch to e10s (which I think was a side-effect of writing new code for e10s, so I think not intentionally) should be quite doable.  Native-theme drawing should work just fine in the child process.
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #65)
> So the ability to draw outside of the browser window is the main reason to
> have the complexity of sending the options to the parent process and drawing
> them there.  Given that we're not doing that, it seems like we'd be better
> off just drawing the popup in the child process, as you suggest.
> 
> Keeping more-native looking styling as we do since the switch to e10s (which
> I think was a side-effect of writing new code for e10s, so I think not
> intentionally) should be quite doable.  Native-theme drawing should work
> just fine in the child process.

Thanks Dbaron!
In that case, what do you think about breaking this fix into two parts? 
(1) Move the popup drawing code to the child process (Without considering the styling)
(2) Fix the native styling in the child process

It think it will be simpler and faster to first get the popup drawing to work in the child process and the styling can come later since it might be optional. What do you think?
(In reply to Neerja Pancholi[:neerja] from comment #66)
> (1) Move the popup drawing code to the child process (Without considering
> the styling)

Incidentally, this should also fix the problem that webfonts (loaded by the child process) are not available in the parent, and therefore a <select> that tries to use a webfont (e.g. for icons) in its menu items will currently fail to render them properly when the popup appears; see bug 1345793.

Marking that bug as dependent on this one, given that it sounds like what we're going to do here will resolve it.
Blocks: 1345793
(In reply to Neerja Pancholi[:neerja] from comment #66)
> (1) Move the popup drawing code to the child process (Without considering
> the styling)

I looked into doing (1) and realized it won't actually solve this problem in the two process architecture. We can't really go back to the way we were drawing the popup in the single process architecture because the way it works now is that the parent process sends the click event to the child process, then the child process creates the option list, which is huge, and sends it via IPC. The parent then processes this huge list and at the end only a small part of it is actually drawn. Even if we try to "draw" in the content process, the most we can do is build and send a huge "layer tree object" to the parent process because we don't have direct control over the drawing in the content process. 

So, the only way to make this performant in the two process architecture is to limit the list built by the child process to only the items that will actually be drawn. This should speedup every method mentioned in the performance analysis. Eg: In dholbert's  Comment 30, it should remove most of the 500ms overhead in buildOptionListForChildren() because buildOptionList skips hidden childen -> https://dxr.mozilla.org/mozilla-central/rev/5bc1c758ab57c1885dceab4e7837e58af27b998c/toolkit/modules/SelectContentHelper.jsm#335-338. Similarly we will send a much smaller object via IPC, drastically reducing the RecvAsyncMessage, ReadFromBuffer and XUL layout time. (Yay!)

After talking with :smaug it seems like XUL tree already seems to do this. One example that smaug gave me was that xul:tree is currently being used in Thunderbird to display the email list (>220000) and it's still super fast because it only lays out what is currently being shown. So I'm going to try this out. 

If something here feels amiss please feel free to let me know!

Comment 69

8 months ago
(In reply to Neerja Pancholi[:neerja] from comment #68)
> After talking with :smaug it seems like XUL tree already seems to do this.
> One example that smaug gave me was that xul:tree is currently being used in
> Thunderbird to display the email list (>220000) and it's still super fast
> because it only lays out what is currently being shown. So I'm going to try
> this out. 

While this approach will improve perf, replacing the widget is too big a change to pick up for v57. As noted, our Photon UI depends on the styling support that we also don't want to lose. 

> So, the only way to make this performant in the two process architecture is 
> to limit the list built by the child process to only the items that will 
> actually be drawn.

I agree, and I'd like to get that capability into the current widget. Can we spin off a new bug for that? Thx!
Flags: needinfo?(npancholi)
I think moving the popup drawing back into the child process might be easier than you think, because we have existing code to do that in the non-e10s codepath, and that code is totally different from the code we use for e10s.  It would only need to be adjusted to:

 * not use native popup widgets (which the child process can't do), but instead use the display list to be on top of everything else drawn by the child process

 * (maybe) make some of the adjustments to styling that we've made for e10s (i.e., a more native look), though this probably isn't a requirement since it would just be reverting things to the way they were a few releases ago.
Flags: needinfo?(dbaron)
Created attachment 8898018 [details]
Safari select dropdown

I should note that the OS native look wasn't an accident from the current implementation, it was the main driver that constrained the solution to rendering in the parent, and it's the direction that all browsers are going.

Through the years of e10s testing I've seen reports of "select dropdowns suddenly look ugly", from users who got moved to non-e10s by accident (due to installing an e10s-incompatible addon) and were unaware.

Look at the attached screenshot from Safari to see how they look nice. I think by rendering in the content we will be limiting ourselves to not be able to implement that (although, if this is wrong, and we could implement some sort of OS-style transparency in content, I'd be happy to hear it!)

(Firefox and Chrome look the same as Safari right now, except for the bug where we don't render it outside of the viewport, and should be fixed)
Created attachment 8898020 [details]
Firefox e10s vs. non-e10s dropdowns

And a comparison of our e10s vs. non-e10s dropdowns
I don't think it should be particularly hard to apply the same native-theme styling that we apply in the parent to the child.  I don't see why that required rewriting everything to use XUL popups instead of the existing pre-e10s select code, which didn't have the performance problems.

comment 62 contradicts the claim that we want to draw the popup outside of the content area.  And *that* was, I believe, the primary reason we had to draw the options in the parent.
I spoke with dbaron today and here is a summary. It seems like going the display list route only makes sense if we are ok with not painting anywhere outside of the content area of the child process. This seemed like an ok assumption at the time but all other browsers, including us in non-E10S mode (we paint only *below* the address bar) allow painting outside of the content process. So it's not clear why we constrain this only for E10S and relying on this doesn't seem wise. Also, using the display list approach will mean a significant amount of work to make the styling appear just as the native styling that we support now. Because of this, it doesn't seem worthwhile going the display list route.

The alternative is to use something like a XUL tree so that we don't create the entire option list in BuildOptionList, but on the basis of my trials, adding a XUL tree is a lot of work. Also, there is a possibility here that we might see a beach ball when scrolling through the option list (as we do when the email list is loading in Thunderbird)

So dbaron agrees that the amount of effort any of these would take to implement is not worth it for now. But, I'm filing a new bug as Jet suggested in case there is a better way other than XUL tree to only build items in the option list that will currently be seen. 

(Clearing the needinfo for mats and shorelander and removing the bug dependency added by jfkthame since it doesn't seem relevant anymore)
No longer blocks: 1345793
Flags: needinfo?(shorlander)
Flags: needinfo?(npancholi)
Flags: needinfo?(mats)
Depends on: 1394063

Comment 75

8 months ago
I have to ask: Does this mean that after 3 years you come to the conclusion to not fix this bug?
[removing autohiding tag from comment 75, since it seems like it was just a legitimate misunderstanding]

(In reply to Michael Hohner from comment #75)
> I have to ask: Does this mean that after 3 years you come to the conclusion
> to not fix this bug?

No, that's not what Neerja meant.  It's unacceptable long-term for us to have dropdown lists that are this janky, and we do need to fix it.

Comment 74 is just saying that:
(a) there is no easy fix that would fit into the Firefox 57 timeframe (and that's what we're primarily focused on at the moment, since 57 is a big release)
...and:
(b) we've eliminated some possible solutions as impractical, and we've spun off as helper-bug 1394063 as the best way forward for the time being (and that may be sufficient to fix this)

Comment 77

7 months ago
This issue is now breaking my site. Full info in a support case I created ( https://support.mozilla.org/en-US/questions/1175379 ) but in short I was updating an outdated page with a new look.

The current page (has the 1 sec delay issue):

http://www3.co.henrico.va.us/maps/subdivnamesearch.php

The new page with bootstrap (hangs and then crashes when you try to select a Street Name):

http://www3.co.henrico.va.us/mapstest/

Disabling e10s fixes the issue on both pages

I guess I'll post on the helper bug too.
(In reply to favosys from comment #77)
> This issue is now breaking my site. Full info in a support case I created (
> https://support.mozilla.org/en-US/questions/1175379 ) but in short I was
> updating an outdated page with a new look.
> 
> The current page (has the 1 sec delay issue):
> 
> http://www3.co.henrico.va.us/maps/subdivnamesearch.php
> 
> The new page with bootstrap (hangs and then crashes when you try to select a
> Street Name):
> 
> http://www3.co.henrico.va.us/mapstest/
> 
> Disabling e10s fixes the issue on both pages
> 
> I guess I'll post on the helper bug too.

I took a look at your new page and what's affecting it badly is a combination of bug 1383205 (fixed in 56) and bug 1386015 (fixed in 57), which is unfortunately caused by bootstrap's CSS styling.

In Firefox 57 there's still a slight delay opening it, but it's much shorter, and there are no freezes anymore.

If you want to apply a workaround for Firefox 56 and earlier, you can remove the CSS border transition on :focus, and the text color for the select element.

I'm gonna mention in bug 1386015 that it'd be nice to uplift it to beta so that this is fixed in 56 instead of just 57.

Comment 79

7 months ago
Thank you very much, I'll try playing with the CSS.

Updated

7 months ago
Flags: needinfo?(bmo)
Duplicate of this bug: 1379563
Duplicate of this bug: 1402045

Updated

7 months ago
Duplicate of this bug: 1402697

Updated

7 months ago
Flags: needinfo?(bmo)

Comment 83

7 months ago
From the front of bug 1379563 ( marked as duplicate of this one ) : 

I am not really able to run on a nightly build . On the latest release ( 56 ) version, the problem persists ( The strange is that this comes up in the same page on the same resource on some workstation ). 

May be other plugins installed make this to be more <<intense>> as a problem.

I will try to check this on a nightly build of 57. Should I try to disable e10s in order to verify the bug?

Updated

7 months ago
Duplicate of this bug: 1403892

Comment 85

7 months ago
We've just had a couple of users of a system report this. The select (single select dropdown) has 1742 options in it. For all browser/OS combinations that we tried bar one - and including Firefox on Linux, the dropdown appears effectively instantaneously: Chromium on Linux, Firefox 56.0 on Linux, Pale Moon 27.5.0 on both Linux and Windows, Edge on Windows, Internet Explorer 11 on Windows. The exception is Firefox on Windows, whereby there is a delay of between 3 to 4 seconds, meaning that users are clicking multiple times because they think their first click didn't register. We tried Firefox 56.0 and Firefox 57.0 beta 3 for Windows, and they are both affected by this issue. (We also tried a nightly, but it wasn't in a functional state.) Having found this thread, we've told the users that it isn't likely to be fixed any time soon, and we have suggested that they switch to using any other browser.
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #73)
> I don't think it should be particularly hard to apply the same native-theme
> styling that we apply in the parent to the child.  I don't see why that
> required rewriting everything to use XUL popups instead of the existing
> pre-e10s select code, which didn't have the performance problems.

BTW, there may be some work useful for doing this in bug 674443.
I'm fairly strongly in favor of rendering the dropdown menu in the content
process as we do for non-e10s.  It has good performance; it's a lot simpler
design (less error prone); and sandboxing all content to the content process
seems inherently more secure to me.

Constraining the dropdown to the viewport doesn't seem like a problem in
practice.  We can just "flip it inwards" when it would overlap an edge.
We can handle the edge case of extremely small window heights by resizing
the menu to fit and add a scrollbar to it, we already have code for this.
(Note that menu systems on the web implemented in JS/DOM/CSS are already
constrained to thi viewport so I think users are already used to this
behavior to some extent.)

A little over a year ago (bug 1300784 comment 2), I suggested making the
dropdown an abs.pos. box with a reserved z-index to make it render above
all other content on the page.  That still seems like a reasonable
solution to me.

FWIW, I recently read in some w3c or whatwg forum(*) that web designers
want more control over the <select> rendering, such as allowing <i> and
<b> inside <option> to style the text.  I think we should expect that
standards will move in this direction and I suspect that rendering content
in the parent process will become increasingly more difficult as a result.
We have no problem rendering arbitrary content in <option> in the content
process though (we actually allowed that at some point, IIRC).

(*) sorry, I don't remember exactly where

Updated

6 months ago
See Also: → bug 1399156
Duplicate of this bug: 1410219
Whiteboard: [qf:p2] → [perf:p2]

Updated

5 months ago
Whiteboard: [perf:p2] → [qf:p2]
I created Bug 1421229 to work on the idea that mats proposed in comment 87. I already had a hacky patch to render the dropdown menu in the content process, and it seems to work well without this performance issue.

Updated

5 months ago
Whiteboard: [qf:p2] → [qf:p1]

Updated

4 months ago
Whiteboard: [qf:p1] → [qf:i60][qf:p1]

Updated

4 months ago
Whiteboard: [qf:i60][qf:p1] → [qf:f60][qf:p1]

Updated

4 months ago
Duplicate of this bug: 1424301

Updated

2 months ago
Whiteboard: [qf:f60][qf:p1] → [qf:f61][qf:p1]
Whiteboard: [qf:f61][qf:p1] → [qf:f61][qf:p1][fxperf]

Comment 91

2 months ago
Since FF 58 even the workaround of disabling e10s does not work anymore. Now drop-down lists are slow always.
(In reply to Michael Hohner from comment #91)
> Since FF 58 even the workaround of disabling e10s does not work anymore. Now
> drop-down lists are slow always.

Are you sure you are still disabling e10s? Your workaround might have been ignored in a new version.

Comment 93

2 months ago
## Cross reference

e10s in Waterfox 56.0.4.x: multi-process versus performance of the 'Select Device' menu/form at samsung-updates.com · Issue #442 · MrAlex94/Waterfox

From <https://github.com/MrAlex94/Waterfox/issues/442#issuecomment-366548741>: 

> Firefox
> 5.8.0.2

> … a split-second delay before appearance of the menu. …

Comment 94

2 months ago
Dropping the automatically added [fxperf] tag here because this seems like it's squarely in the [qf] bucket of things. I'll chase in some of the deps to ensure we keep pushing to (eventually) fix this.
Whiteboard: [qf:f61][qf:p1][fxperf] → [qf:f61][qf:p1]
Depends on: 1421229
Assignee: npancholi → nobody
(In reply to Michael Hohner from comment #91)
> Since FF 58 even the workaround of disabling e10s does not work anymore. Now
> drop-down lists are slow always.

The setting name has switched back to browser.tabs.remote.autostart, set this to false, restart and you're good to go.

I don't know why the setting had to change, it was annoying enough that the e10s rollout was increasing with such a huge performance hit, but then to change the setting to that everyone that already had the fix in place was impacted again is beyond comprehension.

Amusingly, if you set disable e10s to have a performant browser, responsive design mode is disabled. As a result, having specified Firefox as the default for 12 years, we've switched to Chrome.
You need to log in before you can comment on or make changes to this bug.