Open Bug 1037984 Opened 10 years ago Updated 3 months ago

Website causes Firefox to completely hang because same script file is getting loaded over and over again

Categories

(Firefox :: New Tab Page, defect)


Tracking


Performance Impact medium
Tracking Status
firefox31 --- affected
firefox32 --- affected
firefox33 --- affected
firefox-esr24 --- affected

People

(Reporter: whimboo, Unassigned, NeedInfo)


Details

(6 keywords, Whiteboard: [necko-triaged][necko-priority-review])

Attachments

(1 file)

Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0 ID:20140706004001 CSet: f53099796238

Seen this today with my Aurora build when loading the referenced web page. The browser was completely locked up even after 1h of doing nothing. I had to force kill it. Checks for other versions have shown that all current versions are affected.
At least OSX is also affected. Generating a sample of this process I see that most of the time is spent in NS_EscapeURL(), which is defined here: http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsEscape.cpp#359

There is also a GC process running that takes up a fair amount of CPU time, but it is minor compared to the time spent in NS_EscapeURL(). Maybe Andrew can look into this and check whether it is a known issue.
Component: General → XPCOM
OS: Linux → All
Product: Firefox → Core
Hardware: x86_64 → All
FWIW, the site also hangs the latest build (today's m-c win32). IE11 and the latest Chrome Dev also hang...

Smells like a badly coded site, maybe?
The profiler reports something like 10-20 bailouts per millisecond, continuously. My understanding is that an excessive number of bailouts indicates a bug on our side.
Saving the page gives me an HTML file of about 9 MB on its own. Tons of script lines are getting added here. Not sure if this is what causes the problem. I will try to create a minimized testcase tomorrow if no one else does it first.
Attached file testcase.zip
This is the minimized testcase. I couldn't minimize it further due to limited time, but I hope it already helps.
Keywords: testcase
Whatever it's doing, it sucks memory, not just CPU
Keywords: mlk

Hey Henrik,
Can you still reproduce this issue or should we close it?

Flags: needinfo?(hskupin)

The attached testcase reproduces the problem perfectly, but it moves the high CPU load and memory allocation into the file process. Running the profiler, I cannot see how to actually inspect that process. Do we currently fail to record it, or which option needs to be set? Greg, could you help me here?

Flags: needinfo?(hskupin) → needinfo?(gtatum)

You can choose the processes and threads you want to record via "about:profiling" and its Threads section. You can also check "Bypass selections above and record all registered threads" if you want to record all threads; that option is a bit slow, but you're guaranteed to get all of the threads the profiler can find. If the thread is still not in that list, then it needs to be registered with the profiler, and you can file a new profiler bug for that case. However, I think this thread is probably already registered with the profiler; it just needs to be enabled.

Flags: needinfo?(gtatum)
QA Whiteboard: [qa-not-actionable]

I finally had the time to have a look, and indeed the Gecko profiler is not able to record the file: process. When I click the profiler icon in about:processes, the profiler is recording, but when capturing the data nothing can be found. Greg, can you verify? If we really don't record that process, I'm happy to file a bug.

Using Instruments to get a profile of the process, I can see that nearly all the time is spent in GeckoMain. A hot path seems to be nsBaseChannel::AsyncOpen(nsIStreamListener*).

Flags: needinfo?(gtatum)

Did it not show up when recording all threads? The profiler has to hook into the thread creation functions, and sometimes third party libraries bypass these hooks.

So if you record all threads and the file thread shows up, then you can request to have its name added to the checkboxes. The threads are matched through simple string matching.

If it's not recorded at all, then you can file a bug to request it to be tracked.

If it's invoked from a third-party library, it's more difficult; that case would be: https://bugzilla.mozilla.org/show_bug.cgi?id=1341811

Flags: needinfo?(gtatum)

The file process is actually recorded. So there is nothing we have to worry about here.

To keep it unrelated to file://, I moved the test case to https://hskupin.info/tmp/hang-on-load. Note: please do not load that page in your daily profile! Even after killing and restarting Firefox, the application will not be responsive until you remove the sessionrestore data.

I actually recorded a Gecko profile: https://share.firefox.dev/3lHXMGF

It looks like the page is infinitely loading the script file count.js. The memory of the parent and web content processes grows to multiple GB within a minute, and due to the high CPU load Firefox is not usable.
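For illustration, the load pattern looks roughly like this (a minimal sketch, not the actual attached testcase; only the count.js URL comes from this bug, the loop structure is an assumption):

// Keep appending <script> tags for the same count.js URL; each finished
// (or failed) load immediately kicks off the next one, so the content
// process never stops opening channels and memory keeps growing.
function loadCountJsForever() {
  const script = document.createElement("script");
  script.src = "https://hskupin.info/tmp/hang-on-load/count.js";
  script.onload = script.onerror = loadCountJsForever;
  document.head.appendChild(script);
}
loadCountJsForever();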

Why do we have to load and keep that many instances of the same script file? Olli, is this something you could help with?

Flags: needinfo?(bugs)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #10)

I finally had the time to have a look, and indeed the Gecko profiler is not able to record the file: process. When I click the profiler icon in about:processes, the profiler is recording, but when capturing the data nothing can be found.

That's usually a symptom of bug 1673513. You can work around it by increasing the value of devtools.performance.recording.child.timeout_s in about:config, or by reducing the value of toolkit.aboutProcesses.profileDuration.
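As a sketch only, the same workaround could be made persistent via a user.js file; the pref names come from the comment above, and the values below are purely illustrative, not recommendations:

// user.js sketch -- illustrative values only; the same prefs can be flipped in about:config.
user_pref("devtools.performance.recording.child.timeout_s", 60); // raise the child-profile collection timeout (assumed meaning from the pref name)
user_pref("toolkit.aboutProcesses.profileDuration", 2);          // shorten the about:processes profiling duration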

We could, yes, have some per-process cache for a script, but one could still easily reproduce the issue by just loading a different script every time, say count.js?<increasingnumber>.
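To make that concrete, here is a tiny variation on the sketch above (hypothetical code, not taken from the testcase): busting the cache with an increasing query string turns every load into a distinct URL, so a per-process script cache would not help.

// Each request uses a unique URL (count.js?0, count.js?1, ...), defeating
// any cache keyed on the URL.
let counter = 0;
function loadUniqueScriptForever() {
  const script = document.createElement("script");
  script.src = `count.js?${counter++}`;
  script.onload = script.onerror = loadUniqueScriptForever;
  document.head.appendChild(script);
}
loadUniqueScriptForever();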

(There are obviously many, many ways to hang a browser. Creating more and more iframes, for example, would increase memory usage both in the child and in the parent processes.)

Flags: needinfo?(bugs)

I see. So does that mean we should close this bug, given that there is nothing we could actually do here?

Flags: needinfo?(bugs)

We could rate-limit a child process somehow, but there are valid reasons why a child process might consume lots of CPU time and cause rather busy IPC (games, for example).

Flags: needinfo?(bugs)

As such, let's update the summary to reflect why this bug has been filed. I'll leave it open for further opinions.

Summary: Loading website causes Firefox to completely hang → Website causes Firefox to completely hang because same script file is getting loaded over and over again

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: critical → --

Is this really about Firefox hanging totally, or only a child process?
Child processes can of course always just keep using 100% CPU, even for valid use cases.

Flags: needinfo?(hskupin)

The whole browser is basically inoperable. You can check that by loading https://hskupin.info/tmp/hang-on-load. Every click on UI elements like the hamburger menu is massively delayed at first, and trying to exit Firefox doesn't work either; it needs to be force-killed.

Olli, are you not able to see that behavior with the above URL?

Flags: needinfo?(hskupin) → needinfo?(smaug)

I can open a new tab and close the old one, and then things normalize again.

Flags: needinfo?(smaug)

I've asked around so that others with a Mac would also try to reproduce, and they can indeed see the same problem. Firefox is completely frozen and needs to be force-killed.

Florian, given that you have a Mac as well, could you maybe try it out yourself? I'm not sure what the next steps here would be.

Flags: needinfo?(florian)
OS: All → macOS

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #22)

I've asked around so that others with a Mac would also try to reproduce, and they can indeed see the same problem. Firefox is completely frozen and needs to be force-killed.

Florian, given that you have a Mac as well, could you maybe try it out yourself? I'm not sure what the next steps here would be.

Here is a profile on my local build: https://share.firefox.dev/3ygOwig I closed the tab after about 1s, but the parent process remained busy for more than 5s, and the content process also remained active for longer than necessary.

The next step would be to spend time looking at the profile and to file actionable bugs based on it. It would be a fun profile to look at during a Joy of Profiling session.

Flags: needinfo?(florian)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #12)

Even after killing and restarting Firefox, the application will not be responsive until you remove the sessionrestore data.

I think this part is worth investigating. I don't know what causes this behavior. The only hint I have so far is that I saw this in my terminal after I killed a content process that was using 100% of a core from about:processes:

console.error: "BackgroundThumbnails remote process crashed - recovering"
console.error: "getScreenshot(https://hskupin.info/tmp/hang-on-load/) failed: Error: page-thumbnail:error"

I don't know why we capture thumbnails of pages after a restart, but that's something we should probably throttle a lot more aggressively.

Performance Impact: --- → ?

I can reproduce on Linux, too (with the URL from comment 20). I can open new empty tabs, but I can't navigate, open devtools, or open the hamburger menu.

Competition-wise: the testcase loads forever in Chrome as well, but it doesn't take down the whole browser.

It looks like the testcase is making a zillion requests to https://hskupin.info/tmp/hang-on-load/count.js, AFAICT (looking at the network pane of the profile in comment 23)? And I guess that's effectively DoS'ing the parent process?

OS: macOS → All

(In reply to Florian Quèze [:florian] from comment #24)

I think this part is worth investigating. I don't know what causes this behavior. The only hint I have so far is that I saw this in my terminal after I killed a content process that was using 100% of a core from about:processes:

console.error: "BackgroundThumbnails remote process crashed - recovering"
console.error: "getScreenshot(https://hskupin.info/tmp/hang-on-load/) failed: Error: page-thumbnail:error"

I don't know why we capture thumbnails of pages after a restart, but that's something we should probably throttle a lot more aggressively.

I can actually work around it by setting browser.pagethumbnails.capturing_disabled to true. The hang makes sense, given that it basically also loads the same page to capture the thumbnail. Maybe this is the same as or related to bug 1333961?
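For reference, the same workaround as a user.js line (the pref name is taken from this comment; note this disables thumbnail capturing globally, not just for this page):

// Blunt workaround: disable page-thumbnail capturing entirely.
user_pref("browser.pagethumbnails.capturing_disabled", true);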

Flags: needinfo?(dharvey)

So the whole-browser-takedown seems to be the same underlying issue (1) when you load the testcase and (2) after you restart. We're essentially letting a content process consume all of the parent process's resources (initially in the content process for a tab, and then later in a content process for a thumbnail).

We should probably fix this by addressing that underlying issue (e.g. throttling the rate at which we generate HTTP requests, or something like that?).

Let's tentatively reclassify this as networking, since network requests seem to be the dominant source of trouble here, and I imagine the fix would be some sort of mitigation in networking code.

(In Florian's profile from comment 23, he only had the tab open for 1 second, but that gave the testcase an opportunity to generate over 12,000 HTTP requests for https://hskupin.info/tmp/hang-on-load/count.js, which then locked up the parent process for a while, as he noted. Presumably the site's JS is explicitly making those requests; but hopefully there's something we can do to limit the impact on other tabs & on the browser frontend, by making the parent process service those requests less eagerly. Maybe we'd only apply a mitigation after some threshold is reached, too, so that there'd only be a noticeable change in behavior for pathological cases like the attached testcase?)

Component: XPCOM → Networking
Version: 31 Branch → Trunk

I think there should also be opportunities to unblock the parent much faster when the tab is closed. Currently we wait for the content process's main thread to become available to process the tab-close event, and only then cancel all the network requests. I think the parent should be able to drop the network requests for a closed tab without waiting for an answer from the (potentially unresponsive) content process.

Hi Henrik

Apologies for the confusing situation. I am currently assigned as the triage owner but am not actually involved with or very familiar with this code, so I have forwarded the request to someone who hopefully knows a bit more. We are currently working on transitioning the triage owner role to someone who is familiar with the code.

Scott, is this something you know anything about?

Cheers
Dale

Flags: needinfo?(dharvey) → needinfo?(scott.downe)

Redirect a needinfo that is pending on an inactive user to the triage owner.
:jesup, since the bug has recent activity, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(scott.downe) → needinfo?(rjesup)
Flags: needinfo?(mconley)

From my read of this, it seems like the PageThumbs / BackgroundPageThumbs issue is kind of a red herring, in that it's basically hitting the same issue as loading the page "intentionally" in a tab, and the many, many network requests end up consuming all resources in the parent.

I scanned around, and it looks like the only places where the BackgroundPageThumbs code might kick in are via Places Preview: https://searchfox.org/mozilla-central/rev/90c4ec56977adf756c921a16131076f5cf2b74a5/toolkit/components/places/PlacesPreviews.sys.mjs#348-355 or via the Recent Activity section in about:home / about:newtab: https://searchfox.org/mozilla-central/rev/90c4ec56977adf756c921a16131076f5cf2b74a5/browser/components/newtab/lib/Screenshots.jsm#47-49 (which, I believe, is off by default).

Flags: needinfo?(mconley)

This is a DoS issue, either intentional or unintentional. There may be reasons other than blocking a DoS to throttle large numbers of HTTP requests from a given content process, so I'm prioritizing mostly based on that.

@valentin - I don't think this is important enough for the priority review queue, but I may be wrong - thoughts?

Severity: -- → S3
Flags: needinfo?(rjesup) → needinfo?(valentin.gosu)
Keywords: csectype-dos
Priority: -- → P3
Whiteboard: [necko-triaged]

This seems like a fairly problematic use case. The content process is triggering an unlimited number of requests that end up overwhelming the main process and stalling everything.
I'm thinking we want some sort of backpressure/IPC congestion control so that going over a certain limit simply makes opening channels fail. There may be even more elegant solutions. I'm making this a P2 and putting it in the [necko-priority-review] to discuss potential solutions.

Flags: needinfo?(valentin.gosu)
Priority: P3 → P2
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-review]

(In reply to Valentin Gosu [:valentin] (he/him) from comment #34)

I'm thinking we want some sort of backpressure/IPC congestion control so that going over a certain limit simply makes opening channels fail. There may be even more elegant solutions. I'm making this a P2 and putting it in the [necko-priority-review] to discuss potential solutions.

Hi Valentin, I wanted to ask whether the network team has already had time within the last 3 months to discuss potential solutions? Thanks!

The team had a chat about this bug.
I think we've had some IPC flood issues in the past that had a chance of locking up the entire browser. If you have a while loop that executes something that causes an IPC message, the browser is not going to be very happy 🙂.
Normally this wouldn't be too bad, but this example is exacerbated by BackgroundPageThumbs also loading the page in the background after a restart and triggering the issue again.

Mike, would it be reasonable for the BackgroundPageThumbs load to disable running JS on the loaded page?

That said, I'm not totally opposed to adding some sort of rate control. In fact, I can confirm that making HttpChannelChild::AsyncOpen fail when there are more than 100 pending channels lets us avoid the hang. Bumping that limit to 1000-10000, however, means that there will still be a lot of channels active in the background, which will soon overwhelm the system's resources.
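(To make the idea concrete: the experiment above lives in C++, in HttpChannelChild::AsyncOpen. The following is only an illustration of the same fail-fast backpressure pattern in page-level JavaScript; limitedFetch and its threshold are assumptions, not Necko code.)

// Reject new requests once too many are already in flight, instead of
// letting an unbounded number of channels pile up.
const MAX_PENDING = 100;
let pending = 0;

function limitedFetch(url, options) {
  if (pending >= MAX_PENDING) {
    return Promise.reject(new Error("too many pending requests"));
  }
  pending++;
  return fetch(url, options).finally(() => {
    pending--;
  });
}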

In short, I think this is an interesting test case, and Chrome handles it a bit better. If we get rid of the persistent issue caused by BackgroundPageThumbs triggering the bug again, we could potentially add some guardrails so this specific kind of IPC flood doesn't happen. Thoughts?

Flags: needinfo?(mconley)

Hi mconley, do you have any feedback regarding Valentin's needinfo request? Thanks!

Sorry, this one fell off the radar.

Mike, would it be reasonable for the BackgroundPageThumbs load to disable running JS on the loaded page?

In certain cases, I suspect this would mean that a page thumbnail would not be captured properly. Sites that rely on JS to render (basically any that use modern frameworks like React, Vue, Angular, etc.) would likely present a blank page, or a very small subset of the page, if JS were disabled.

Visiting YouTube, for example, with JS disabled (via DevTools) looks like this: https://i.imgur.com/QygwMC8.png

What does make sense to me, however, is having BackgroundPageThumbs / PageThumbs be a lot stricter about how long they're willing to let a page loaded in the background process its script before it gets killed. I'd certainly be for clamping down a bit on that and having some timeouts / thresholds that cause the thumbnail request to be aborted.
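A minimal sketch of that kind of timeout clamp (captureThumbnail below is a hypothetical stand-in; the real BackgroundPageThumbs code is not part of this sketch):

// Race the capture against a timer and give up if the background page is
// still busy after `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("thumbnail capture timed out")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical usage:
// withTimeout(captureThumbnail(url), 30000).catch(() => { /* skip this thumbnail */ });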

Flags: needinfo?(mconley)

Mike, should we then move this to the New Tab Page component, which seems to be the right one for BackgroundPageThumbs.sys.mjs?

Component: Networking → New Tab Page
Product: Core → Firefox

Resetting priority/severity so the bug can be re-triaged in the new component.

Severity: S3 → --
Priority: P2 → --

The severity field is not set for this bug.
:amy, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(achurchwell)