Open Bug 1503581 Opened 2 years ago Updated 13 days ago
text with lots of dom nodes makes text-selection take multiple seconds in webrender
The time we spend in the text painting code computing selections (nsRange::IsNodeSelected) is still really harsh. The performance issues with this code are worse in webrender because we don't have any kind of partial display list invalidation -- we have to recompute the selection status of every text node in the display list whenever the page is dirtied at all.

profile: http://bit.ly/2CT3nUp

I believe the correct fix, for webrender, is to move back to caching selection state on frames, and doing a global linear pass over the nodes to set that state before starting to invoke CreateWebrenderCommands on the individual items. I'm not sure this caching even needs to have fine-grained invalidation for our purposes. Just doing a *single* O(n) pass for each scene (display list build) should be a huge win.
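(A minimal sketch of what that single pass might look like, using hypothetical stand-in types rather than the real Gecko frame and selection classes; it only illustrates the "clear, mark, then answer in O(1)" shape of the idea.)

```cpp
// Hypothetical sketch: a single pass that stamps per-frame selection state
// before any display items are translated into WebRender commands.
// `Frame`, `SelectionRange`, and `MarkSelectedFrames` are illustrative
// stand-ins, not the actual Gecko types.
#include <cstddef>
#include <vector>

struct Frame {
  bool mIsSelectedThisPaint = false;  // recomputed on every scene build
};

struct SelectionRange {
  size_t mStart;  // index of first selected frame, in tree order
  size_t mEnd;    // one past the last selected frame
};

void MarkSelectedFrames(std::vector<Frame*>& aFramesInTreeOrder,
                        const std::vector<SelectionRange>& aRanges) {
  // Clear last build's answers, then walk the ranges once.
  for (Frame* f : aFramesInTreeOrder) {
    f->mIsSelectedThisPaint = false;
  }
  for (const SelectionRange& r : aRanges) {
    for (size_t i = r.mStart; i < r.mEnd && i < aFramesInTreeOrder.size(); ++i) {
      aFramesInTreeOrder[i]->mIsSelectedThisPaint = true;
    }
  }
  // Each display item can now answer "am I selected?" in O(1) instead of
  // calling back into nsRange::IsNodeSelected.
}
```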
Attachment #9021518 - Attachment mime type: text/plain → text/html
Marking as P3-trains: this is pretty nasty but potentially deferrable, since on "normal" text pages it isn't bad (the actual number of nodes is fairly low). Could easily see it bumped to P2 though.
I'm not sure mats will be super-happy about putting the selection state back in frames...
Even if we clobber it on every paint? Cache might be the wrong word here -- I just want to compute it all at once, but actually recompute it on every frame. We could later make that mechanism into an actually invalidated cache, but it's not clear to me that that's totally necessary.
ni?smaug and mats because matt and I think they would have insights/opinions on this:

----- Problem -----

When running the display list painting code, we have each item in the display list individually look up all the live selections and try to determine if its frame is selected. This code (nsRange::IsNodeSelected) is really expensive. Expensive enough that doing this for every display item takes *seconds* on pages with lots of sibling nodes (such as 1000 comma-separated links). In this case I believe we take ~O(n^2) time computing selections, where `n` is the number of nodes in the DOM.

Mostly this isn't a problem for gecko because we have display item invalidation, letting us skip this work on nodes outside the range of the selection. However webrender has no such feature, and as such becomes incredibly unresponsive (checkerboarding due to rendering at ~0.2 fps). Adding more caching to this code has improved it a lot, but we're still orders of magnitude off of where we need to be.

----- Solution -----

I believe we should be able to overcome this by *amortizing* the selection code, batching the selection computation. In particular, once we have all the state set up to compute whether a given node is selected, it becomes much simpler to compute whether its neighbour is selected: we just need to check if we're entering/leaving any of the selection ranges by moving over one node. At that point we can store the computed selected-ness on the frame or its display items.

Note that this does not propose caching and invalidating this information across multiple display lists. Rather, the selectedness would only be cached for the current display list build. However, if desirable, it would be a natural extension for us to start preserving and invalidating this information across display lists. As mattwoodrow noted: we already are technically caching/invalidating selection state in the display list, in that selection state is visible in the rasterization, and we are able to correctly invalidate the rasterization in a fine-grained way when selection changes. But let's assume we're not doing any invalidation more fine-grained than "something changed, all selection state must be recomputed".

The question then becomes in what way we do this amortized selection pass. Here are a few options we discussed on IRC:

----- Range-Directed (bad) -----

The most naive option would be to simply iterate the selection ranges and mark all the frames inside of them as selected. We don't think this is a practical/viable solution because it would not scale well to very large pages, where most of the DOM is not actually visible. O(size-of-selection) is probably too expensive. (Note that we could at least avoid the O(size-of-page) cost of clearing the selection bit on every frame by doing a sort of epoch-based approach, where each frame stores the last epoch in which it was selected, so that `IsSelected = mSelectionEpoch == sCurrentEpoch`; see the sketch after this comment.)

A better solution is to base our amortization off the display list, which contains only the visible frames.

----- Naive-DL-Directed (impossible?) -----

Naively trying to run an amortized analysis on the DL can't really work because z-sorting clobbers the relationship between DL order and DOM order.

----- StackingContext-Directed (maybe ok?) -----

However, z-sorting is a pass that is done in chunks: we build the sub-DL for a stacking context, z-sort it, then concatenate it onto the full DL. As such we have a point where we have a chunk of contiguous items in DOM order.
The kind of page we are struggling on tends to be sparse in stacking contexts (e.g. the comma-separated links example is all in one stacking context), and so this may work well enough in practice.

----- DL-Construction-Directed (maybe best?) -----

Alternatively, it may be possible to amortize this analysis over the entire DL construction by just holding onto this state as we walk over the frame tree and compute which frames should produce display items. Not sure if this would be easier or harder than the StackingContext-Directed approach. It is unclear how exactly skipping over hidden frames should work -- should we be forced to walk over them to maintain our amortized state, or simply hard-reset our state and hope that visibility is "usually" contiguous enough that we don't need to reset state much? The latter sounds plausibly true?

----- Conclusions -----

The high-level idea seems promising, but this is all based on a pretty fuzzy understanding of the whole system. In particular, I personally have no idea how to implement this, and what assumptions actually hold true. Any insights would be greatly appreciated!
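(A minimal sketch of the epoch trick mentioned under Range-Directed above, with hypothetical names; the real frame/selection types aren't used here.)

```cpp
// Hypothetical sketch: instead of clearing a "selected" bit on every frame at
// the start of each paint, each frame remembers the epoch in which it was
// last marked selected, so "not selected" is the implicit default.
#include <cstdint>

static uint64_t sSelectionEpoch = 1;  // bumped whenever selection state must be recomputed

struct FrameSelectionState {
  uint64_t mSelectedInEpoch = 0;

  void MarkSelected() { mSelectedInEpoch = sSelectionEpoch; }

  // O(1) check; frames never touched in the current epoch read as
  // "not selected", so no O(size-of-page) clearing pass is needed.
  bool IsSelected() const { return mSelectedInEpoch == sSelectionEpoch; }
};

void OnSelectionChanged() {
  // Invalidate every cached answer at once by moving to a new epoch.
  ++sSelectionEpoch;
}
```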
A variant of the range-directed approach might not be too bad. We wouldn't need to go through the whole document or range (if a large page is fully selected) or anything like that.

Start from point A based on whatever is being painted. If A is under a selection container, check if A's next x nodes are selected and store X[i] = selected | not_selected | partially_selected in, say, a ring buffer. Also store the flag on the node or nsIFrame. When the buffer is full, remove one item and clear the flags stored on it. (Also, at the end, clear the buffer and the flags on the relevant nodes.) If some other node B isn't cached at that point, do the slow selectedness check and re-populate the cache.

If a node/nsIFrame has a selected or not_selected flag, returning from IsNodeSelected would be O(1) with a very small constant. If partially selected, we'd need a somewhat slower operation, but that seems to be fast already in current wr.

Unfortunately nsIFrame doesn't seem to have enough spare FrameState bits, and nsINode has spare bits on 64-bit only (but I need to check if we could have an extra 4 bytes on 32-bit too).
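(A rough sketch of the cache described above, with illustrative types standing in for nsINode/nsIFrame; the buffer size and the linear-scan lookup are assumptions, and a real patch would store the flag on the node/frame itself rather than only in the buffer.)

```cpp
// Illustrative ring-buffer cache for recently computed selectedness.
#include <array>
#include <cstddef>

enum class Selectedness { NotSelected, Selected, PartiallySelected };

struct CachedEntry {
  void* mNode = nullptr;  // stand-in for nsINode* / nsIFrame*
  Selectedness mState = Selectedness::NotSelected;
};

class SelectionCache {
  static constexpr size_t kCapacity = 64;  // "next x nodes"; size is a guess
  std::array<CachedEntry, kCapacity> mRing;
  size_t mNext = 0;

 public:
  void Insert(void* aNode, Selectedness aState) {
    // When the buffer is full, the oldest entry is overwritten; a real
    // implementation would also clear the flag stored on that node here.
    mRing[mNext] = CachedEntry{aNode, aState};
    mNext = (mNext + 1) % kCapacity;
  }

  bool Lookup(void* aNode, Selectedness* aOut) const {
    for (const CachedEntry& e : mRing) {
      if (e.mNode == aNode) {
        *aOut = e.mState;
        return true;  // fast path; a miss falls back to the slow check
      }
    }
    return false;
  }
};
```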
(In reply to Alexis Beingessner [:Gankro] from comment #3)
> Even if we clobber it on every paint? Cache might be the wrong word here --
> I just want to compute it all at once, but actually recompute it on every
> frame. We could later make that mechanism into an actually invalidated
> cache, but it's not clear to me that that's totally necessary.

Adding a frame bit to mark it as selected during one painting pass is fine with me as long as it's clobbered at the start of every paint (and the bit is used for painting only, obviously). I'm skeptical about a persistent selection state on frames that needs to be updated on DOM changes though, since we had that in the past and it was buggy, slow and hard to maintain.

(In reply to Alexis Beingessner [:Gankro] from comment #4)
> However z-sorting is a pass that is done in chunks: we build the sub-DL for
> a stacking context, z-sort it, then concatenate it onto the full DL. As such
> we have a point where we have a chunk of contiguous items in DOM-order.

I don't think you can infer anything about DOM order from DisplayLists because grid/flex items may have CSS 'order': https://drafts.csswg.org/css-flexbox-1/#propdef-order (and XUL elements can have an 'ordinal' attribute)
Smaug was asking about this, so I did a quick check-in with jrmuizel to clarify my understanding of this. Why are vanilla Gecko and WR so different here, and where are they the same?

Both Vanilla and WR make use of the "retained (gecko) display list" feature, which reuses parts of the Gecko DL whenever the scene changes (rather than recomputing it from scratch from the frame tree every time). Also, async changes such as scrolling or animations don't count as a scene change here; they skip over to later subsystems.

We now determine that we would like the Gecko DL to be painted. Both WR and Vanilla take the Gecko DL, which is on the main thread, and as quickly as possible convert it into a representation that can be taken off the main thread to unblock it. This is a process of running over every display item, gathering up all the style and layout information for it, and producing lower-level painting commands. Both Vanilla and WR largely use the same code for this, since the style and layout calculations are the same regardless; it's just a matter of what is done with these computed values. Of note for this issue, the computation of selection style is completely identical.

For Vanilla, we use DrawTargetCapture to capture all the item's drawing commands and replay them on the painting thread. For WR, we push WR display items onto the WR display list, which is then sent over IPC to the WR process to be further processed.

**However, there is one key difference:** Vanilla has FrameLayerBuilder, which implements *display-list-based painting invalidation*. FrameLayerBuilder is (somehow) able to know when a change to the Gecko DL might have invalidated the painting commands (DrawTargetCapture) for a Gecko display item, and can skip recomputing them. WR doesn't have FrameLayerBuilder or any kind of display-list-based invalidation. It just always recomputes all the WR display items for every scene change.

So if you select some text with Vanilla, we only rerun the selection code on text items that have *just* become selected or unselected for that paint. But with WR, we rerun the selection code on all text items every time your selection area changes. So WR is much more exposed to the high cost of checking if a text item is selected.

This implies two "obvious" solutions:

* Make text-selection fast (through any of the approaches described in this thread)
* Reimplement and/or abstract enough of FrameLayerBuilder for WR to be able to determine if the WR display items for a text item need to be recomputed, and also add a way to retain WR DL fragments between scenes.

I was naturally inclined towards the former solution because making text selection checks faster is a win for Vanilla *anyway*, but if it's deemed infeasible, we'll need to look into invalidating the WR display list. There *has* been some desire to have a bit of this anyway, just because sending all the text-runs over IPC all the time is pretty expensive.
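(In caricature, the key difference described above looks something like this; none of these names are the real Gecko/WebRender entry points, it just shows which loop gets to skip work.)

```cpp
// Illustrative only: Vanilla can skip items whose drawing commands are still
// valid thanks to display-list-based invalidation; WebRender currently cannot.
#include <vector>

struct DisplayItem {
  bool mDirty = false;                    // set by FrameLayerBuilder-style invalidation
  void ComputeDrawingCommands() const {}  // expensive: includes the selection lookup
};

void PaintVanilla(const std::vector<DisplayItem>& aItems) {
  for (const DisplayItem& item : aItems) {
    if (item.mDirty) {                    // most items skipped when only the
      item.ComputeDrawingCommands();      // selection endpoints changed
    }
  }
}

void PaintWebRender(const std::vector<DisplayItem>& aItems) {
  for (const DisplayItem& item : aItems) {
    item.ComputeDrawingCommands();        // every item, every scene change
  }
}
```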
(I still believe webrender should utilize invalidation information and repaint less if possible, but selection handling can certainly be a lot faster too.)
Component: Graphics: WebRender → Selection
Another problem is that even if checking whether some node is selected were fast, painting a selected node isn't: https://searchfox.org/mozilla-central/source/layout/generic/nsTextFrame.cpp#6477,6482
Since this is actually about a couple of different things, splitting this into sub-bugs. (But I still wonder if webrender could just somehow utilize the setup non-webrender uses, so that it wouldn't need to paint so much.)
Assignee: bugs → nobody
See Also: → 1590379