Closed Bug 1725555 Opened 2 months ago Closed 2 months ago

100% CPU load and hang of content process when searching in large documents

Categories

(Core :: Layout: Text and Fonts, defect, P4)

defect

Tracking

()

VERIFIED FIXED
93 Branch
Tracking Status
firefox93 --- verified

People

(Reporter: whimboo, Assigned: jfkthame)

References

(Blocks 1 open bug, Regressed 1 open bug)

Details

(Keywords: power)

Crash Data

Attachments

(3 files)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:92.0) Gecko/20100101 Firefox/92.0 ID:20210804214554

Take the following log file as example:

https://community.taskcluster-artifacts.net/It905wBLQGOFhFaIUiURXA/0/public/logs/live_backing.log

Searching for some content will immediately let the CPU usage jump to 100% for the given content process, which will also block other pages using the same process from working at all. Removing the search term and closing the find toolbar doesn't make a difference.

Here a recorded Gecko profile: https://share.firefox.dev/3sfe5ws

The high CPU load most likely also causes the page not being updated when scrolling up and down. Areas stay completely blank (stopped after some minutes). This behavior cannot be seen with Chrome, whereby it's also not fast.

nsTextFrame::GetContentEnd() const
GetPartialTextRect(mozilla::RectCallback*, mozilla::dom::Sequence<nsTString<char16_t> >*, nsIContent*, int, int, bool, bool)
nsRange::CollectClientRectsAndText(mozilla::RectCallback*, mozilla::dom::Sequence<nsTString<char16_t> >*, nsRange*, nsINode*, unsigned int, nsINode*, unsigned int, bool, bool)
nsRange::GetBoundingClientRect(bool, bool)
mozilla::dom::Range_Binding::getBoundingClientRect(JSContext*, JS::Handle<JSObject*>, void*, JSJitMethodCallArgs const&)
Range.getBoundingClientRect
bool mozilla::dom::binding_detail::GenericMethod<mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*)
0x3a6901848c00
updateScrollMarks
Interpret(JSContext*, js::RunState&)
js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason)
js::RunScript
js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason)
js::CallSelfHostedFunction(JSContext*, JS::Handle<js::PropertyName*>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>)
js::jit::InterpretResume(JSContext*, JS::Handle<JSObject*>, JS::Value*, JS::MutableHandle<JS::Value>)
0x3a69017bca00
AsyncFunctionNext

Looks like we have a lot of updateScrollMarks calls, which seem to even continue happening after the search bar has been closed.

(not really a power usage issue. This is plain simple performance problem).

This between DOM and Layout, but I think the core fix needs to happen on layout side.
DOM uses whatever functionality Layout provides, but I think it doesn't provide anything too useful here, unless I'm
missing something. If there was an array of continuation frames, one could do faster search.
Or am I missing some API?
The issue is, I think, https://searchfox.org/mozilla-central/rev/36904ac58d2528fc59f640db57cc9429103368d3/dom/base/nsRange.cpp#2650-2651

Component: DOM: Core & HTML → Layout: Text and Fonts
Flags: needinfo?(jfkthame)

(not really a power usage issue. This is plain simple performance problem).

Searching a 41MB plain text document containing 773k lines isn't really typical usage for normal users IMHO.

That said, is there any DOM/HTML spec that says we MUST put all that text into the same Text node?
This use case, and many other, would be much more performant if we could cap DOM Text node size to say ~4kiB (which would limit the length of continuation chain drastically).

Severity: -- → S3
Priority: -- → P4

BTW, this issue is related to updateScrollMarks, which needs to know the layout position of the search matches.
Is there a way to enable that feature on Linux? (I don't see these marks in my local build.)

Yeah, this is pretty bad. An array of continuation frames would let us binary-search it, which would be much faster in a case like this (where there are hundreds of thousands of them). The risk, of course, is that maintaining such an array would come with a significant cost that might regress other aspects of performance. Seems worth at least some investigation, though.

The real performance problem here arises not as a result of a single search of the chain of continuations, but because there may be lots of selected ranges and we re-search the entire continuation chain for every one of them. So one possibility is to not attempt to maintain an array of continuations on nsTextFrame all the time, but only create it when needed to support the nsRange::CollectClientRectsAndText method involved here (and ensure it gets discarded if the continuation chain is subsequently mutated).

I've pushed a try build at https://treeherder.mozilla.org/jobs?repo=try&revision=c93002e08e06a6268e2bb008bc0d0ffcde56ecc2 with some experimental patches that should help here; Henrik, if you'd care to test a build from there it'd be good to know how it performs for you. I'll put patches up for review after a bit more testing.

Flags: needinfo?(jfkthame)

OK, I've collected some profiles using a local opt build (on Linux) to show the benefit of caching an array of continuations in the primary text frame. The sequence I'm using to test here is:

  1. Load a local copy of the log file mentioned in comment 0
  2. Wait for it to fully load and reflow (check by dragging the scroll-bar all the way to the end and back)
  3. Open the browser Find bar
  4. Confirm that the Highlight All option is checked
  5. Start the profiler
  6. Paste the search text "TEST_END" into the Find bar
  7. Wait for the Find bar to update to show 1 of 823 matches and the scroll-bar marks to appear
  8. Stop the profiler

Baseline (local opt build of current mozilla-central):

  • https://share.firefox.dev/3ATyJFu
    This shows a 17s jank period, almost entirely under nsRange::CollectClientRectsAndText
    Overall elapsed time for the test sequence is about 24s.

With patch to optimize nsRange::CollectClientRectsAndText by using a cached continuations array:

  • https://share.firefox.dev/3mfziFd
    The 17-second jank is gone; now, the worst offender left is nsTextFrame::SelectionStateChanged, which takes 1+ seconds, multiple times.
    Elapsed time for the test sequence is now about 7s.

With added patch to optimize nsTextFrame::SelectionStateChanged using the same cached array:

This is helpful when we have extremely long continuation chains, to avoid having to
follow all the back-pointers to find the primary frame. By itself it makes little
difference to the testcase here, but is a basis for the patch that follows.

Assignee: nobody → jfkthame
Status: NEW → ASSIGNED

This allows us to binary-search the continuations from nsRange::ColletClientRectsAndText,
instead of linear-searching the linked list for every range we need to look up.

Depends on D122998

After CollectClientRectsAndText is eliminated from the profiles here, SelectionStateChanged
is the next obvious hotspot, and it can similarly be accelerated by binary-searching the continuations.

Depends on D122999

Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/62ba1322b82b
patch 1 - Cache first-continuation pointer in nsContinuingTextFrame so that accessing FirstContinuation() is constant-time. r=emilio
https://hg.mozilla.org/integration/autoland/rev/054669fb1640
patch 2 - Enable the primary nsTextFrame to provide, and cache, an array of all the continuations in the chain. r=emilio
https://hg.mozilla.org/integration/autoland/rev/2b74d92f7be2
patch 3 - Also use continuations array to accelerate nsTextFrame::SelectionStateChanged. r=emilio

Backed out for causing crashes [@ nsTextFrame::FindContinuationForOffset], e.g. bp-67956f27-9e28-4490-8284-edb610210820

https://hg.mozilla.org/mozilla-central/rev/61c9f3dcf85d1a2ae660931e4fbf0a50b7c38e83

Status: RESOLVED → REOPENED
Crash Signature: [@ nsTextFrame::FindContinuationForOffset ]
Flags: needinfo?(jfkthame)
Resolution: FIXED → ---
Target Milestone: 93 Branch → ---
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6abaa3b17be8
patch 1 - Cache first-continuation pointer in nsContinuingTextFrame so that accessing FirstContinuation() is constant-time. r=emilio
https://hg.mozilla.org/integration/autoland/rev/35fbd6aaada3
patch 2 - Enable the primary nsTextFrame to provide, and cache, an array of all the continuations in the chain. r=emilio
https://hg.mozilla.org/integration/autoland/rev/a0c5d86c771b
patch 3 - Also use continuations array to accelerate nsTextFrame::SelectionStateChanged. r=emilio
Status: REOPENED → RESOLVED
Closed: 2 months ago2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch

The crash signature from comment 11 has resurfaced, e.g. bp-3766f818-be8e-4946-927b-67c1d0210824 - should the changes be reverted?

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 93 Branch → ---

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #14)

The crash signature from comment 11 has resurfaced, e.g. bp-3766f818-be8e-4946-927b-67c1d0210824 - should the changes be reverted?

Ugh - looks like this time it's somehow IME-related. Yes, best to back out (again) while I try to reproduce & investigate it locally. Thanks!

Flags: needinfo?(jfkthame)

OK, figured out what was broken: RemoveEmptyInFlows could wipe out the mFirstContinuation pointers in the continuation frames following the range to be removed, because it did aFrame->SetPrevInFlow(nullptr) before disconnecting the end of the range with lastRemoved->SetNextInFlow(nullptr). That leads to problems later on when one of the subsequent continuations wants to tell the primary frame to discard a cached continuations array, but its mFirstContinuations pointer has been cleared.

Simply reversing those chain-breaking operations should resolve the issue. I'm also adding a MOZ_DIAGNOSTIC_ASSERT that would flag this closer to the root of the problem, should such an issue occur again sometime.

Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f2357e055668
patch 1 - Cache first-continuation pointer in nsContinuingTextFrame so that accessing FirstContinuation() is constant-time. r=emilio
https://hg.mozilla.org/integration/autoland/rev/22b941581212
patch 2 - Enable the primary nsTextFrame to provide, and cache, an array of all the continuations in the chain. r=emilio
https://hg.mozilla.org/integration/autoland/rev/d0864a5c8e90
patch 3 - Also use continuations array to accelerate nsTextFrame::SelectionStateChanged. r=emilio
Flags: needinfo?(jfkthame)

Ah, nsSplittableFrame::RemoveFromFlow has the same issue as RemoveEmptyInFlows, and needs a similar fix.

Flags: needinfo?(jfkthame)
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2095f5391c82
patch 1 - Cache first-continuation pointer in nsContinuingTextFrame so that accessing FirstContinuation() is constant-time. r=emilio
https://hg.mozilla.org/integration/autoland/rev/1a476d9ea9c3
patch 2 - Enable the primary nsTextFrame to provide, and cache, an array of all the continuations in the chain. r=emilio
https://hg.mozilla.org/integration/autoland/rev/c36347bc9a83
patch 3 - Also use continuations array to accelerate nsTextFrame::SelectionStateChanged. r=emilio
Regressions: 1727942
Status: REOPENED → RESOLVED
Closed: 2 months ago2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
Regressions: 1728265

(In reply to Jonathan Kew (:jfkthame) from comment #4)

I've pushed a try build at https://treeherder.mozilla.org/jobs?repo=try&revision=c93002e08e06a6268e2bb008bc0d0ffcde56ecc2 with some experimental patches that should help here; Henrik, if you'd care to test a build from there it'd be good to know how it performs for you. I'll put patches up for review after a bit more testing.

Sorry, but I was on vacation and got back yesterday. I tried your fix in the most recent Nightly build and it works like a charm! Thanks a lot for taking care of it! It makes investigating issues by having to search in these large log files way more pleasant.

One thing I noticed is that when stepping through instances of the search term by quickly clicking the up/down arrows the highlights in the scrollbar are getting invalidated and redrawn. Do you know if there is a bug for that? If not I can file one.

Flags: needinfo?(jfkthame)

Glad it works for you!

I don't know of a bug about the invalidation/redrawing of the highlights, so probably worth filing one. (I noticed that happening, but didn't try to follow up on it at all. I think it'll be more of a toolkit/front-end issue, rather than core platform.)

Flags: needinfo?(jfkthame)
Flags: qe-verify+

(In reply to Jonathan Kew (:jfkthame) from comment #24)

I don't know of a bug about the invalidation/redrawing of the highlights, so probably worth filing one. (I noticed that happening, but didn't try to follow up on it at all. I think it'll be more of a toolkit/front-end issue, rather than core platform.)

Thanks. I filed bug 1729633.

Regressions: 1729578

Noticed as well some ugly hangs on Win10 with 93.0a1 (2021-08-13).
With 90.3 it's a bit better, staying at @ 60-80% load on Win10/Ubuntu 16.

Colleague checked with macOS 10.15 and the PC is still usable, ~ improvment noticed on 93.0/94.0a1.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.