Open Bug 1854770 Opened 2 years ago Updated 8 months ago

Hangs on pages with huge (many thousands) number of links, mostly due to building large IPC messages and HashSets

Status:

NEW

Project Flags:

Performance Impact

low

People

(Reporter: mayankleoboy1, Unassigned)

References

(
URL
)

Details

(Keywords: perf, Whiteboard: [sng-scrubbed])

Attachments

(2 files)

about:support 2 years ago Mayank Bansal 39.93 KB, text/plain		Details
Too Many Links_Testcase.html 1 year ago Mayank Bansal 419 bytes, text/html		Details

Mayank Bansal

Reporter

Description

•

2 years ago

•

Edited

See profiles: https://share.firefox.dev/3Rtd9mJ and https://share.firefox.dev/3ZxdfLV

I was doing some testing on a demo-type page.
Go to https://random-items.erosson.org/
Enter 1000000 as the number.
Click on roll.

I get this 42 second hang (once the JS stuff is over and the content is generated). The browser froze, and I got the "Mouse cursor is spinning as the program is not responding" error message on Windows.

My guess is that because the demo page generates a lot of URLs, the browser takes a long time to check if these URLs have been visited before.

Mayank Bansal

Reporter

Comment 1

•

2 years ago

Attached file about:support — Details

Mayank Bansal

Reporter

Comment 2

•

2 years ago

•

Edited

Actually, the parent process jank also occurs for smaller input values.

For example, enter 100000:
https://share.firefox.dev/3LAPVY2

Mayank Bansal

Reporter

Comment 3

•

2 years ago

•

Edited

And a profile with all threads and File I/O: https://share.firefox.dev/463GBE4

It shows some File I/O activity on sqlitedb:places:sqlite

Marco Bonardo [:mak]

Comment 4

•

2 years ago

Afaict this seems due to a large number of runnables dispatched across threads.

Severity: -- → N/A

Component: Storage → Places

Depends on: 1594368

Keywords: perf

Priority: -- → P3

Mayank Bansal

Reporter

Updated

•

2 years ago

Comment 5

•

2 years ago

(In reply to Marco Bonardo [:mak] from comment #4)

> Afaict this seems due to a large number of runnables dispatched across threads.

In particular, I see Document::FlushPendingLinkUpdates doing a lot of work, and if we look at a diagram of calls to BaseHistory::ScheduleVisitedQuery that's on the places stack we can see that FlushPendingLinkUpdates can be causally responsible for that.

It seems like DOM and/or Places should handle this better, if only by entering an overload mode that stops doing link visited requests for a page.
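
To make the "overload mode" idea concrete, here is a minimal sketch of a per-document budget for visited lookups. It is illustrative only and not how Gecko is structured: the class name `VisitedQueryBudget`, the method `try_schedule`, and the 10,000 default are all assumptions invented for this example.

```python
class VisitedQueryBudget:
    """Hypothetical per-document cap on link-visited lookups."""

    def __init__(self, max_queries: int = 10_000) -> None:
        # The default limit is an arbitrary placeholder, not a measured value.
        self.max_queries = max_queries
        self.issued = 0

    @property
    def overloaded(self) -> bool:
        """True once the document has used up its lookup budget."""
        return self.issued >= self.max_queries

    def try_schedule(self, uri: str, pending: list[str]) -> bool:
        """Queue `uri` for a visited lookup unless the budget is exhausted."""
        if self.overloaded:
            # Overload mode: skip the lookup; the link is treated as unvisited.
            return False
        pending.append(uri)
        self.issued += 1
        return True
```

With a budget like this, a page with a million links would only ever queue the first `max_queries` lookups, at the cost of the remaining links never being styled as visited.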

Given the amount of time spent in layout, I wonder if part of the problem could be massive churn in the DOM such that a VDOM implementation is not providing continuity to DOM elements and so there might be an additional multiplicative factor.

Stephanie Cunnane [:scunnane]

Comment 6

•

2 years ago

This will probably need input from Emilio and the DOM team.

Whiteboard: [sng-scrubbed]

Jira Integration Bot

Updated

•

2 years ago

See Also: → https://mozilla-hub.atlassian.net/browse/SNG-1001

:Gijs (he/him)

Updated

•

2 years ago

Performance Impact: --- → ?

Benjamin De Kosnik [:bdekoz]

Comment 7

•

2 years ago

Cannot reproduce on Linux, 118 (granted, with 64GB RAM). But similar configs show it on Mac.

Benjamin De Kosnik [:bdekoz]

Updated

•

2 years ago

Performance Impact: ? → low

Mayank Bansal

Reporter

Comment 8

•

2 years ago

•

Edited

I am able to reproduce a similar hang on an hg.m.o URL as well.

URL to repro: https://hg.mozilla.org/try/rev/ec8f980f95c16f401eb49f12d315d4795190d452
Corresponding profile: https://share.firefox.dev/47qE8oc

:Gijs (he/him)

Comment 9

•

2 years ago

(In reply to Mayank Bansal from comment #8)

> I am able to reproduce a similar hang on an hg.m.o URL as well.
>
> URL to repro: https://hg.mozilla.org/try/rev/ec8f980f95c16f401eb49f12d315d4795190d452
> Corresponding profile: https://share.firefox.dev/47qE8oc

This shows a 20s hang but almost all the samples are in profiler code itself. Apparently https://share.firefox.dev/3QMQJe7 has profiler-only code that tracks mDispatchTimes but calling Count() appears to be slow. Though I also don't understand why there are only ~1400 samples when the sample interval is 1ms and there's a 23 second hang. Florian, who could look into this / is there some way to get a more useful profile out of this?

(Reporter, I assume the hang happens even if you disable the profiler?)

Flags: needinfo?(mayankleoboy1)

Flags: needinfo?(florian)

Mayank Bansal

Reporter

Comment 10

•

2 years ago

•

Edited

(In reply to :Gijs (he/him) from comment #9)

> (Reporter, I assume the hang happens even if you disable the profiler?)

Now that I try again with the profiler disabled, the hang is much shorter than with the profiler enabled.

> Though I also don't understand why there are only ~1400 samples when the sample interval is 1ms and there's a 23 second hang

AFAIK, on Windows the profiler actually samples every 2ms when the interval is set to 1ms in the Profiler UI, so the number of samples is half the elapsed time in milliseconds.

Another profile, in case the previous one was not useful: https://share.firefox.dev/47DeZ9g

Flags: needinfo?(mayankleoboy1)

Florian Quèze [:florian]

Comment 11

•

2 years ago

(In reply to :Gijs (he/him) from comment #9)

> This shows a 20s hang but almost all the samples are in profiler code itself.

This is the event delay code that we have been planning to remove for a while. Moving the needinfo to Markus to clarify what the next steps are to go ahead and remove this code.

> Though I also don't understand why there are only ~1400 samples when the sample interval is 1ms and there's a 23 second hang.

On Windows the default timer resolution is 1/64s (i.e. 15.6ms). If we request high precision timers (which the profiler does when the requested sampling interval is smaller than 10ms), we can sample up to every 2ms. In https://share.firefox.dev/3QSkudC we have high resolution timers for the profiler sampler until 30.18s, and after that it goes back to low resolution timers. It's not the first time I have seen this happen in a profile, but I don't have an explanation for why it happens. Maybe some other code makes mismatched timeBeginPeriod/timeEndPeriod calls.
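
A quick back-of-the-envelope check of these figures, assuming (as a simplification) that the whole 23 second hang was sampled at a single fixed interval. The helper name `expected_samples` is made up for this example.

```python
def expected_samples(duration_s: float, interval_ms: float) -> int:
    """Samples taken in `duration_s` by a timer that fires every `interval_ms`."""
    return round(duration_s * 1000 / interval_ms)


hang_s = 23.0                                   # hang length quoted in comment 9
high_res = expected_samples(hang_s, 2.0)        # high precision timers: every 2 ms
low_res = expected_samples(hang_s, 1000 / 64)   # default resolution: 1/64 s
print(high_res, low_res)                        # 11500 1472
```

The low resolution figure is close to the ~1400 samples seen in the profile, which would fit the sampler having fallen back to the default timer resolution for most of the hang.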

Flags: needinfo?(florian) → needinfo?(mstange.moz)

Marco Bonardo [:mak]

Updated

•

2 years ago

Severity: N/A → S3

Type: enhancement → defect

Mayank Bansal

Reporter

Updated

•

1 year ago

Comment 12

•

1 year ago

The link no longer loads.

Mayank Bansal

Reporter

Comment 13

•

1 year ago

Attached file Too Many Links_Testcase.html — Details

This testcase will reproduce the hang. The duration of the hang increases as the loopcount increases.
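
For anyone who cannot open the attachment, a page of the same general shape can be generated with a small script. This is only a sketch and not the attached file: the function name `build_links_page`, the `example.invalid` host, and the default link count are assumptions chosen for illustration.

```python
def build_links_page(link_count: int = 100_000) -> str:
    """Return an HTML document containing `link_count` distinct links.

    Hypothetical stand-in for the attached testcase: every anchor points to a
    different URL, so no two links share a visited state.
    """
    anchors = "\n".join(
        f'<a href="https://example.invalid/item/{i}">item {i}</a>'
        for i in range(link_count)
    )
    return f"<!DOCTYPE html>\n<html><body>\n{anchors}\n</body></html>\n"
```

Writing the returned string to an `.html` file and opening it in the browser gives a static page with that many links; raising `link_count` plays the role of the loopcount.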

Mayank Bansal

Reporter

Comment 14

•

1 year ago

•

Edited

Profile with the attached testcase: https://share.firefox.dev/3yfFB47
Edit: On second look, the profile in parent-process looks to be heavily skewed due to the profiler issue described above.

Mayank Bansal

Reporter

Comment 15

•

1 year ago

An alternate method would be to go to: https://wolcendb.erosson.org/loot?tier=0
Let it load once, then close the tab. Then go to the link again.

Profile: https://share.firefox.dev/3MzVoyt

Mayank Bansal

Reporter

Updated

•

1 year ago

Comment 16

•

1 year ago

•

Edited

Massive improvements.

Profiles from the latest Nightly (containing the fix from bug 1594368):

Profile of the attached testcase: https://share.firefox.dev/3YHsWBC
Profile of opening the URL in comment #15: https://share.firefox.dev/3YnQTfK
Profile of opening the hg.m.o link from comment #8: https://share.firefox.dev/3YjuTCK

Marco Bonardo [:mak]

Comment 17

•

1 year ago

Based on a profile, I think the remaining cost is mostly about preparing the IPC messages.
I wonder if the return message could include just the URIs whose state changed, which would make it much smaller.
That said, I think the most compelling part is done.

Leaving this open to evaluate shrinking the IPC data exchange in the future.
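
One possible shape for such a smaller reply, sketched in Python rather than in the actual IPC code. The function name `changed_visited_states` and the dictionary representation are assumptions for illustration, and the sketch further assumes that the content side treats any URI it has no state for as unvisited.

```python
def changed_visited_states(
    known: dict[str, bool], queried: dict[str, bool]
) -> dict[str, bool]:
    """Return only the URIs whose visited state differs from `known`.

    `known` is what the receiving side already believes; URIs missing from it
    are assumed to be unvisited (an assumption of this sketch).
    """
    return {
        uri: visited
        for uri, visited in queried.items()
        if known.get(uri, False) != visited
    }
```

On a page where almost every link is unvisited, a reply built this way would carry only the handful of visited URIs instead of one entry per link.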

Summary: 42 second Browser hang. Profiler indicates this may be BaseHistory/VisitedHistory/AsyncexecuteStatement on the main thread. → Hangs on pages with huge (many thousands) number of links, mostly due to building large IPC messages and HashSets

Markus Stange [:mstange]

Comment 18

•

8 months ago

(In reply to Florian Quèze [:florian] from comment #11)

> (In reply to :Gijs (he/him) from comment #9)
> > This shows a 20s hang but almost all the samples are in profiler code itself.
>
> This is the event delay code that we have been planning to remove for a while. Moving the needinfo to Markus to clarify what the next steps are to go ahead and remove this code.

I've written down the next steps for this in bug 1951664 comment 2.

Flags: needinfo?(mstange.moz)
