Open Bug 1503581 Opened 6 years ago Updated 1 month ago

text with lots of dom nodes makes text-selection take multiple seconds in webrender

Categories

(Core :: DOM: Selection, enhancement, P3)

enhancement

Tracking


ASSIGNED
Tracking Status
firefox65 --- affected

People

(Reporter: Gankra, Assigned: jjaschke)

References

(Depends on 2 open bugs, Blocks 2 open bugs)

Details

Attachments

(3 files)

Attached file testcase.html
The time we spend in the text painting code computing selections (nsRange::IsNodeSelected) is still really harsh. The performance issues with this code are worse in webrender because we don't have any kind of partial display list invalidation -- we have to recompute the selection status of every text node in the display list whenever the page is dirtied at all.

profile: http://bit.ly/2CT3nUp

I believe the correct fix, for webrender, is to move back to caching selection state on frames, and doing a global linear pass over the nodes to set that state before starting to invoke CreateWebrenderCommands on the individual items. I'm not sure this caching even needs to have fine-grained invalidation for our purposes. Just *only* doing a single O(n) pass for each scene (display list build) should be a huge win.
Attachment #9021518 - Attachment mime type: text/plain → text/html
Marking as P3-trains since this is pretty nasty but potentially deferrable, since on "normal" text pages this isn't bad (since the actual number of nodes is fairly low). Could easily see it bumped to P2 though.
Priority: -- → P3
I'm not sure mats will be super-happy of putting the selection state back in frames...
Even if we clobber it on every paint? Cache might be the wrong word here -- I just want to compute it all at once, but actually recompute it on every frame. We could later make that mechanism into an actually invalidated cache, but it's not clear to me that that's totally necessary.
ni?smaug and mats because matt and I think they would have insights/opinions on this:

----- Problem -----

When running the display list painting code, we have each item in the display list individually look up all the live selections and try to determine if its frame is selected. This code (nsRange::IsNodeSelected) is really expensive. Expensive enough that doing this for every display item takes *seconds* on pages with lots of sibling nodes (such as 1000 comma-separated links). In this case I believe we take ~O(n^2) time computing selections, where `n` is the number of nodes in the DOM.

Mostly this isn't a problem for gecko because we have display item invalidation, letting us skip this work on nodes outside the range of the selection. However webrender has no such feature, and as such becomes incredibly unresponsive (checkerboarding due to rendering at ~0.2 fps). Adding more caching to this code has improved it a lot, but we're still orders of magnitude off of where we need to be.

----- Solution -----

I believe we should be able to overcome this by *amortizing* the selection code, batching the selection computation. In particular, once we have all the state set up to compute whether a given node is selected, it becomes much simpler to compute if its neighbour is selected, as we just need to check if we're entering/leaving any of the selection ranges by moving over one node. At that point we can store the computed selected-ness on the frame or its display items.

Note that this does not propose caching and invalidating this information across multiple display lists. Rather the selectedness would only be cached for the current display list build. However if desirable it would be a natural extension for us to start preserving and invalidating this information across display lists.
As mattwoodrow noted: we already are technically caching/invalidating selection state in the display list, in that selection state is visible in the rasterization, and we are able to correctly invalidate the rasterization in a fine-grained way when selection changes. But let's assume we're not doing any invalidation more fine-grained than "something changed, all selection state must be recomputed". The question then becomes in what way we do this amortized selection pass. Here are a few options we discussed in IRC:

----- Range-Directed (bad) -----

The most naive option would be to simply iterate the selection ranges and mark all the frames inside of them as selected. We don't think this is a practical/viable solution because it would not scale well to very large pages, where most of the DOM is not actually visible. O(size-of-selection) is probably too expensive. (Note that we could at least avoid the O(size-of-page) cost of clearing the selection bit on every frame by doing a sort of epoch-based approach, where each frame stores the last epoch where it was selected, and so `IsSelected = mSelectionEpoch == sCurrentEpoch`.) A better solution is to base our amortization off the display list, which is only the visible frames.

----- Naive-DL-Directed (impossible?) -----

Naively trying to run an amortized analysis on the DL can't really work because z-sorting clobbers the relationship between DL order and DOM order.

----- StackingContext-Directed (maybe ok?) -----

However z-sorting is a pass that is done in chunks: we build the sub-DL for a stacking context, z-sort it, then concatenate it onto the full DL. As such we have a point where we have a chunk of contiguous items in DOM order. The kind of page we are struggling on tends to be sparse in stacking contexts (e.g. the comma-separated links example is all in one stacking context), and so this may work well enough in practice.

----- DL-Construction-Directed (maybe best?) -----

Alternatively, it may be possible to amortize this analysis over the entire DL construction by just holding onto this state as we walk over the frame tree and compute which ones should produce display items. Not sure if this would be easier or harder than the StackingContext-Directed approach. It is unclear how exactly skipping over hidden frames should work -- should we be forced to walk over them to maintain our amortized state, or simply hard reset our state, and hope that visibility is "usually" contiguous enough that we don't need to reset state much? The latter sounds plausibly true?

---- Conclusions ----

The high level idea seems promising, but this is all based on a pretty fuzzy understanding of the whole system. In particular I personally have no idea how to implement this, and what assumptions actually hold true. Any insights would be greatly appreciated!
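As an aside, the epoch trick mentioned under Range-Directed (avoiding the O(size-of-page) cost of clearing a selection bit on every frame) is small enough to sketch. `Frame` and `SelectionEpoch` below are illustrative stand-ins, not actual Gecko types:

```cpp
#include <cstdint>

// Hypothetical stand-in for nsIFrame: each frame remembers the last
// epoch in which it was marked selected.
struct Frame {
  uint64_t mSelectionEpoch = 0;
};

// Hypothetical epoch counter owned by the paint pipeline.
struct SelectionEpoch {
  uint64_t mCurrent = 1;

  // Instead of walking every frame to clear its "selected" bit,
  // bump the epoch: all previous marks become stale in O(1).
  void StartNewPaint() { ++mCurrent; }

  void MarkSelected(Frame& aFrame) { aFrame.mSelectionEpoch = mCurrent; }

  bool IsSelected(const Frame& aFrame) const {
    return aFrame.mSelectionEpoch == mCurrent;
  }
};
```

Marking remains O(size-of-selection) per paint, but invalidating all marks is a single increment.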
Flags: needinfo?(mats)
Flags: needinfo?(bugs)
A variant of the range-directed approach might not be too bad. We wouldn't need to go through the whole document or range (if a large page is fully selected) or anything like that. Start from point A based on whatever is being painted. If A is under a selection container, check if A's next N nodes are selected and store X[i] = selected|not_selected|partially_selected in, say, a ring buffer. Also store the flag on the node or nsIFrame. When the buffer is full, remove one item and clear the flags stored on it. (Also at the end clear the buffer and the flags on the relevant nodes.) If some other node B isn't then cached, do the slow selectedness check and re-populate the cache. If a node/nsIFrame has the selected or not_selected flag, returning from IsNodeSelected would be O(1) with a very small constant. If partially selected, we would need to do a somewhat slower operation, but that seems to be fast already in current WR. Unfortunately nsIFrame doesn't seem to have enough spare FrameState bits, and nsINode has spare bits on 64-bit only (but I need to check if we could have an extra 4 bytes on 32-bit too).
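A rough sketch of the ring-buffer bookkeeping described above, with hypothetical stand-in types (`Node`, `Selectedness`, `SelectionCacheRing` are not real Gecko names). The point of the ring buffer is to bound how many nodes carry a cached flag at once, so cleanup stays cheap:

```cpp
#include <array>
#include <cstddef>

// Hypothetical per-node selection state, as suggested in the comment.
enum class Selectedness { Unknown, NotSelected, Selected, PartiallySelected };

// Stand-in for nsINode/nsIFrame with a cached flag.
struct Node {
  Selectedness mCachedState = Selectedness::Unknown;
};

// Fixed-size ring buffer of recently cached nodes. When it fills up,
// the oldest entry is evicted and its node's flag is cleared, so at
// most N nodes ever carry stale flags that need cleanup.
template <size_t N>
class SelectionCacheRing {
  std::array<Node*, N> mEntries{};
  size_t mNext = 0;
  size_t mSize = 0;

 public:
  void Cache(Node* aNode, Selectedness aState) {
    if (mSize == N) {
      // Evict the oldest entry and clear its cached flag.
      mEntries[mNext]->mCachedState = Selectedness::Unknown;
    } else {
      ++mSize;
    }
    aNode->mCachedState = aState;
    mEntries[mNext] = aNode;
    mNext = (mNext + 1) % N;
  }

  // Called when painting finishes: clear the flags on all cached nodes.
  void Clear() {
    for (size_t i = 0; i < mSize; ++i) {
      mEntries[i]->mCachedState = Selectedness::Unknown;
    }
    mSize = 0;
    mNext = 0;
  }
};
```

With this in place, IsNodeSelected could first check `mCachedState` and only fall back to the slow range comparison on `Unknown`.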
Flags: needinfo?(bugs)
(In reply to Alexis Beingessner [:Gankro] from comment #3)
> Even if we clobber it on every paint? Cache might be the wrong word here --
> I just want to compute it all at once, but actually recompute it on every
> frame. We could later make that mechanism into an actually invalidated
> cache, but it's not clear to me that that's totally necessary.

Adding a frame bit to mark it as selected during one painting pass is fine with me as long as it's clobbered at the start of every paint (and the bit is used for painting only, obviously). I'm skeptical about a persistent selection state on frames that needs to be updated on DOM changes though, since we had that in the past and it was buggy, slow and hard to maintain.

(In reply to Alexis Beingessner [:Gankro] from comment #4)
> However z-sorting is a pass that is done in chunks: we build the sub-DL for
> a stacking context, z-sort it, then concatenate it onto the full DL. As such
> we have a point where we have a chunk of contiguous items in DOM-order.

I don't think you can infer anything about DOM order from DisplayLists because grid/flex items may have CSS 'order': https://drafts.csswg.org/css-flexbox-1/#propdef-order (and XUL elements can have an 'ordinal' attribute)
Flags: needinfo?(mats)
Smaug was asking about this so I did a quick check-in with jrmuizel to clarify my understanding of this. Why are Vanilla Gecko and WR so different here, and where are they the same?

Both Vanilla and WR make use of the "retained (gecko) display list" feature, which reuses parts of the Gecko DL whenever the scene changes (rather than recomputing it from scratch from the frame tree every time). Also async changes such as scrolling or animations don't count as a scene change here; they skip over to later subsystems.

We now determine that we would like the Gecko DL to be painted. Both WR and Vanilla take the Gecko DL, which is on the main thread, and as quickly as possible convert it into a representation that can be taken off the main thread to unblock it. This is a process of running over every display item, gathering up all the style and layout information for it, and producing lower-level painting commands. Both Vanilla and WR largely use the same code for this, since the style and layout calculations are the same regardless. It's just a matter of what is done with these computed values. Of note for this issue, the computation of selection style is completely identical.

For Vanilla, we use DrawTargetCapture to capture all the item's drawing commands and replay them on the painting thread. For WR, we push WR display items onto the WR display list, which is then sent over IPC to the WR process to be further processed.

**However, there is one key difference:** Vanilla has FrameLayerBuilder, which implements *display-list-based painting invalidation*. FrameLayerBuilder is (somehow) able to know when a change to the Gecko DL might have invalidated the painting commands (DrawTargetCapture) for a gecko display item, and can skip recomputing them. WR doesn't have FrameLayerBuilder or any kind of display-list-based invalidation. It just always recomputes all the WR display items for every scene change.
So if you select some text with Vanilla, we only rerun the selection code on text items that have *just* become selected or unselected for that paint. But with WR, we rerun the selection code on all text items every time your selection area changes. So WR is much more exposed to the high cost of checking if a text item is selected. This implies two "obvious" solutions:

* Make text-selection fast (through any of the approaches described in this thread)
* Reimplement and/or abstract enough of FrameLayerBuilder for WR to be able to determine if the WR display items for a text item need to be recomputed, and also add a way to retain WR DL fragments between scenes.

I was naturally inclined towards the former solution because making text selection checks faster is a win for Vanilla *anyway*, but if it's deemed infeasible, we'll need to look into invalidating the WR display list. There *has* been some desire to have a bit of this anyway just because sending all the text-runs over IPC all the time is pretty expensive.
Assignee: nobody → bugs
(I still believe webrender should utilize invalidation information and repaint less if possible, but selection handling can certainly be a lot faster too.)
Component: Graphics: WebRender → Selection
Priority: P3 → P4
Another problem is that even if checking whether some node is selected were fast, painting a selected node isn't: https://searchfox.org/mozilla-central/source/layout/generic/nsTextFrame.cpp#6477,6482
Since this is actually about a couple of different things, splitting this into sub-bugs. (But I still wonder if webrender could just somehow utilize the non-webrender setup, so that it wouldn't need to paint so much.)
Assignee: bugs → nobody
Blocks: wr-67
No longer blocks: stage-wr-trains
Priority: P4 → P3
Blocks: wr-68
No longer blocks: wr-67
Depends on: 1558926
Blocks: wr-70
No longer blocks: wr-68
Blocks: stage-wr-next
No longer blocks: wr-70
See Also: → 1679645
Severity: normal → S3

Retesting after bug 1867249 has landed.
Action: Select some lines of text and then scroll on the page.

Profile of the testcase: https://share.firefox.dev/3WDDK1c
Profile of testcase with 10x the loop count: https://share.firefox.dev/4fwydlP

:jjaschke, is there anything to improve here further?

Flags: needinfo?(jjaschke)
Attached file testcase-50k-nodes

Attached a testcase with 50k nodes instead of just 5k for testing.

I've looked into the profiles and this is my hypothesis of what's causing the bad performance here:

The cache implemented in bug 1867249 deals with fully selected nodes. Checking whether a node is fully selected now only takes a hashtable lookup instead of comparing the node's position to the range start/end positions. Comparing the node's position to the start/end of the range involves computing the index of the node in its parent, which is an O(n) operation (linked list).

The performance problems we see in the latest profile are now only the partially selected nodes. Since dragging a selection updates the end point of the range constantly, the painting code has to recompute the selection for (almost) every frame. Since the last selected node is likely only partially selected, it falls back to the old implementation and compares its position to range start/end.
The farther down one scrolls, the more expensive this gets, as this example is basically one parent node with a lot of child nodes. I think this is noticeable - dragging the selection on the top of the view-source isn't too bad, but if you scroll down while dragging the selection, it gets increasingly less responsive.

I have an idea that I want to try out to overcome this. I'm wondering if it's possible to have a node->index cache (i.e., a hashtable) in the document. The cache would need to be cleared each time a DOM mutation happens, and given that hashtables have an overhead of their own, there likely needs to be a threshold for when to use it.

Whenever nsINode::ComputeIndexOf() is being called:

  • first, do the existing fast-path checks (is aPossibleChild actually a child of this node; is it the first/last child)
  • if the cache is active (i.e. mChildCount > THRESHOLD):
    • look up aPossibleChild in the cache; if found, return its index
    • otherwise, call ComputeIndexOf(aPossibleChild->GetPreviousSibling()), which starts a recursion until we have a cache hit or reach the first child
    • add 1 to the returned index of the previous node, insert it into the cache and return it
  • otherwise, run the old code path

This would introduce a to-be-measured initial overhead due to the hashtable lookups/insertions. But a subsequent call would only need to traverse the siblings until a cache hit, which should be faster than traversing to the start/end. And if there's a direct cache hit, computing the index is just one hashtable lookup. For the given situation where no DOM mutations happen, the cache would only be populated once. This should be faster overall... let's find out! :)
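The steps above can be sketched roughly as follows. All names here are illustrative stand-ins (the real code lives on nsINode and went through review separately), and this version walks the siblings iteratively rather than recursing, since a 50k-sibling chain could otherwise overflow the stack:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Minimal stand-in for a DOM node in a sibling linked list.
struct Node {
  Node* mPreviousSibling = nullptr;
};

// Hypothetical node->index cache: walk backwards over previous siblings
// until we hit a cached node (or run off the first child), then fill the
// cache forwards from the nearest known index.
class IndexCache {
  std::unordered_map<const Node*, size_t> mIndices;

 public:
  size_t ComputeIndexOf(const Node* aChild) {
    std::vector<const Node*> pending;  // nodes with unknown index, nearest first
    const Node* cur = aChild;
    size_t base = 0;
    while (cur) {
      auto it = mIndices.find(cur);
      if (it != mIndices.end()) {  // cache hit: base index is known
        base = it->second;
        break;
      }
      pending.push_back(cur);
      cur = cur->mPreviousSibling;
    }
    if (!cur) {
      // Reached past the first child; the first pending node gets index 0.
      // (unsigned wrap-around: ++base below yields 0 on the first step)
      base = static_cast<size_t>(-1);
    }
    // Fill the cache forwards: each node is its predecessor's index + 1.
    for (auto rit = pending.rbegin(); rit != pending.rend(); ++rit) {
      mIndices[*rit] = ++base;
    }
    return mIndices[aChild];
  }

  // Any DOM mutation invalidates all cached indices.
  void Invalidate() { mIndices.clear(); }
};
```

For the no-mutation case (dragging a selection), the cache is populated once and every later call is either a direct hit or a short sibling walk to the nearest cached node.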

Mayank, thanks for bringing this to my attention!

Flags: needinfo?(jjaschke)

Here's a profile of the 50k nodes test case after implementing the mentioned cache: https://share.firefox.dev/46CJZXE

This is way better than before, so the general idea seems to be promising.

The try build crashes for me if I open the 50k testcase and do a "Ctrl + A".
Unfortunately, the crash reporter doesn't submit these crashes. And it is not symbolicated.

Can't repro on my local build. What build did you use?

Ah, it's likely too much recursion depth. :) I'm preparing a patch that works without recursion.

Assignee: nobody → jjaschke
Attachment #9418161 - Attachment description: WIP: Bug 1503581: Speed up drag Selection for nodes with many child nodes. → Bug 1503581: Implement a cache that reduces computing the index of a node in its parent to a hash table lookup. r=masayuki!,#dom-core
Status: NEW → ASSIGNED

Try build makes the testcase from bug 1892514 very slow (I waited for ~1 minute before giving up).

Thanks for bringing this up. It’s not unexpected that I missed some cases. Maybe I should land the patch behind a pref so I can search for bad cases without affecting everyone else.

Here's what I found:

  • Open this attachment, copy the content, paste it into an external app (e.g. VSCode)
  • Select all text in VSCode, copy
  • Open the codepen example
  • Paste
  • Scroll around
  • Select all and delete

Profile (local build/macOS): https://share.firefox.dev/3ArzEC3

There's a 43ish second jank visible when pasting the content. It looks like this is running a code path which keeps clearing the cache and forces it to be rebuilt constantly. I'll investigate if I can add the batching there as well. Thank you!

Is it just me, or does the codepen example only highlight the first line after pasting? The original code is highlighted throughout, but after pasting the d3.js code, only the first line seems to be highlighted. This happens in my local build, current Nightly, and Release, so it should be unrelated to what we're doing here.

Masayuki, I added the node index cache batching guard here. This solves the issue and pasting happens instantly (<1s). Is it safe to do that or would it break other things?

Flags: needinfo?(masayuki)

Ah, when replacing the pasted code with pasting jQuery, the code is highlighted properly. So this not working properly with the original example is likely a bug in the codepen code.

Mayank, here is a try run with the latest changes in case you want to follow along. (ni? for visibility)

Flags: needinfo?(mayankleoboy1)

(In reply to Jan Jaeschke [:jjaschke] from comment #25)
> Mayank, here is a try run with the latest changes in case you want to follow along. (ni? for visibility)

STR from the 50k testcase on the current bug:
Selecting some text and scrolling from the scrollbar: https://share.firefox.dev/4dK4GTX
Selecting all the text and scrolling from the scrollbar: https://share.firefox.dev/3AmqV46
Selecting text by dragging mouse to the bottom of the screen and letting the page scroll: https://share.firefox.dev/3M6LKTB
Rapidly selecting and deselecting some text: https://share.firefox.dev/3SPqHZq

STR from comment #22 : https://share.firefox.dev/3yNlgDu
STR from bug 1912069 : https://share.firefox.dev/4dmVzbT

Edit: updated all the profiles.

Flags: needinfo?(mayankleoboy1)

Replied in Phabricator.
https://phabricator.services.mozilla.com/D218755#inline-1218332

On the other hand, is this the right place for the patch?

Flags: needinfo?(masayuki)
Blocks: 1914833