Open Bug 1854770 Opened 2 years ago Updated 6 months ago

Hangs on pages with huge (many thousands) number of links, mostly due to building large IPC messages and HashSets

Categories

(Toolkit :: Places, defect, P3)

defect

Tracking

()

Performance Impact low

People

(Reporter: mayankleoboy1, Unassigned, NeedInfo)

References

()

Details

(Keywords: perf, Whiteboard: [sng-scrubbed])

Attachments

(2 files)

See profile : https://share.firefox.dev/3Rtd9mJ , https://share.firefox.dev/3ZxdfLV

I was doing some testing on a demo-type page.
Go to https://random-items.erosson.org/
Enter 1000000 as the number.
Click on roll

I get this 42 second hang (once the js stuff is over and the content is generated) . The browser froze, and I got the "Mouse cursor is spinning as the program is not responding" error message on Windows.

My guess is that because the demo page generates a lot of URLs, the browser takes a long time to check if these URLs have been visited before.

Attached file about:support

Actually, the parent process jank also occurs for smaller values of inputs.

Enter 100000
https://share.firefox.dev/3LAPVY2

And a profile with all threads and File I/O : https://share.firefox.dev/463GBE4

Shows some File I/O activity on on sqlitedb:places:sqlite

Afaict this seems due to a large number of runnables dispatched across threads.

Severity: -- → N/A
Component: Storage → Places
Depends on: 1594368
Keywords: perf
Priority: -- → P3
See Also: → 1854771

(In reply to Marco Bonardo [:mak] from comment #4)

Afaict this seems due to a large number of runnables dispatched across threads.

In particular, I see Document::FlushPendingLinkUpdates doing a lot of work, and if we look at a diagram of calls to BaseHistory::ScheduleVisitedQuery that's on the places stack we can see that FlushPendingLinkUpdates can be causally responsible for that.

It seems like DOM and/or Places should handle this better, if only by entering an overload mode that stops doing link visited requests for a page.

Given the amount of time spent in layout, I wonder if part of the problem could be massive churn in the DOM such that a VDOM implementation is not providing continuity to DOM elements and so there might be an additional multiplicative factor.

This will probably need input from Emilio and the DOM team.

Whiteboard: [sng-scrubbed]
Performance Impact: --- → ?

cannot reproduce on linux, 118. (granted 64GB ram). But similar configs show on mac.

Performance Impact: ? → low

I am able to reproduce similar hang on an hg.m.o URL as well.

URL to repro: https://hg.mozilla.org/try/rev/ec8f980f95c16f401eb49f12d315d4795190d452
Corresponding profile: https://share.firefox.dev/47qE8oc

(In reply to Mayank Bansal from comment #8)

I am able to reproduce similar hang on an hg.m.o URL as well.

URL to repro: https://hg.mozilla.org/try/rev/ec8f980f95c16f401eb49f12d315d4795190d452
Corresponding profile: https://share.firefox.dev/47qE8oc

This shows a 20s hang but almost all the samples are in profiler code itself. Apparently https://share.firefox.dev/3QMQJe7 has profiler-only code that tracks mDispatchTimes but calling Count() appears to be slow. Though I also don't understand why there are only ~1400 samples when the sample interval is 1ms and there's a 23 second hang. Florian, who could look into this / is there some way to get a more useful profile out of this?

(Reporter, I assume the hang happens even if you disable the profiler?)

Flags: needinfo?(mayankleoboy1)
Flags: needinfo?(florian)

(In reply to :Gijs (he/him) from comment #9)

(Reporter, I assume the hang happens even if you disable the profiler?)

Now that I try again, with the profiler disabled, the hang is much less compared to when the profiler is enabled.

Though I also don't understand why there are only ~1400 samples when the sample interval is 1ms and there's a 23 second hang

AFAIK, on Windows the profiler is actually sampling at 2ms frequency when it is set at 1ms in the Profiler UI. So the number of samples are half of actual time spent.

Another profile if the previous one was not useful : https://share.firefox.dev/47DeZ9g

Flags: needinfo?(mayankleoboy1)

(In reply to :Gijs (he/him) from comment #9)

This shows a 20s hang but almost all the samples are in profiler code itself.

This is the event delay code that have been planning to remove for a while. Moving the needinfo to Markus to clarify what the next steps are to go ahead and remove this code.

Though I also don't understand why there are only ~1400 samples when the sample interval is 1ms and there's a 23 second hang.

On Windows the default timer resolution is 1/64s (ie 15.6ms). If we request high precision timers (which the profiler does, when the requested sampling interval is smaller than 10ms), we can sample up to every 2ms. In https://share.firefox.dev/3QSkudC we have high resolution timers for the profiler sampler until 30.18s, and after that it goes back to low resolution timers. It's not the first time I see this happening in a profile, but I don't have an explanation for why it happens. Maybe some other code does mismatched timeBeginPeriod/timeEndPeriod calls.

Flags: needinfo?(florian) → needinfo?(mstange.moz)
Severity: N/A → S3
Type: enhancement → defect
See Also: → 1909883

The link no longer loads.

This testcase will reproduce the hang. The duration of hang increases with the increase in the loopcount.

Profile with the attached testcase: https://share.firefox.dev/3yfFB47
Edit: On second look, the profile in parent-process looks to be heavily skewed due to the profiler issue described above.

An alternate msthod would be to go to : https://wolcendb.erosson.org/loot?tier=0
Let it load once, then close the tab. Then go to the link again.

Profile: https://share.firefox.dev/3MzVoyt

See Also: 18547711917511

Massive improvements.

  • Profile from latest Nightly (containing the fix from bug 1594368)
  1. Profile of the attached testcase: https://share.firefox.dev/3YHsWBC
  2. Profile of opening the URL in comment #15: https://share.firefox.dev/3YnQTfK
  3. Profile of opening the hg.mo link from comment #8: https://share.firefox.dev/3YjuTCK

Based on a profile, I think the remaining cost is mostly about preparing the IPC messages.
I wonder if we could in the return message just include URIs whose state changed, to build a much smaller one.
That said, I think the most compelling part is done.

Leaving this open to evaluate shrinking the IPC data exchange in the future.

Summary: 42 second Browser hang. Profiler indicates this may be BaseHistory/VisitedHistory/AsyncexecuteStatement on the main thread. → Hangs on pages with huge (many thousands) number of links, mostly due to building large IPC messages and HashSets
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: