Open Bug 1790500 Opened 2 years ago Updated 2 months ago

Browser crash due to GC not collecting FDs used in XHR

Categories

(Core :: DOM: Networking, defect, P3)

Firefox 104
defect

Tracking

()

UNCONFIRMED

People

(Reporter: s+mozb, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(3 files, 1 obsolete file)

Attached file fd-no-gc.html (obsolete)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:104.0) Gecko/20100101 Firefox/104.0

Steps to reproduce:

When using XHR to upload several thousand files, Firefox may enter a state where it stops closing file descriptors and eventually crashes. Heavy DOM load appears to make this worse.

An MRE is attached which should reproduce the issue after a few attempts. Screenshot: https://ocv.me/stuff/bugs/firefox/fd-no-gc.png

Possibly a Linux-only bug; very brief testing on Windows did not reproduce it.

Actual results:

Firefox runs out of file descriptors and crashes in various ways.

Expected results:

The file descriptors should be closed in a timely manner once the objects are nulled in JavaScript.

The Bugbug bot thinks this bug should belong to the 'Core::DOM: Networking' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → DOM: Networking
Product: Firefox → Core
Attached file fd-no-gc.html

Prevents accidental POSTing to bmoattachments.org; please host the MRE locally instead.

Attachment #9294337 - Attachment is obsolete: true

Some observations I forgot to mention:

  • Clicking "GC" or "Measure" in about:memory will close the FDs.
    • Sometimes this also makes the GC start running at a reasonable pace, closing FDs periodically as expected.
    • Usually, though, the number of open FDs just starts climbing again.
  • There have been cases where, after all the files have finished uploading, the final handful of FDs are never closed. This permanently blocks safe removal of USB flash drives until the browser tab is closed or a GC is performed manually.
    • I'll file a separate issue for that problem once I find a way to reproduce it reliably.
Severity: -- → S4
Priority: -- → P2
Whiteboard: [necko-triaged]

The related issue regarding the final handful of FDs "never" getting closed has been reported as issue 1792598.

While far less severe than this bug (just a user annoyance rather than a full browser crash), it should be easier to reproduce.

Blocks: xhr
Priority: P2 → P3

I noticed that an attempt was made to solve this (or a similar) bug in Firefox 128, but please note that this bug is NOT fixed; it still crashes the entire browser (all tabs) when it occurs.

This is very unfortunate, as this bug is the only reason my software (https://github.com/9001/copyparty) is forced to say "Firefox is not supported, please use Chrome instead". I'd much prefer recommending Firefox, both out of principle and because Chrome has some comparatively minor issues which Firefox does not. But since this bug causes Firefox to crash whenever someone tries to upload a large collection of files, which is the primary use case of copyparty, it leaves me no other choice.

The change introduced in Firefox 128 raises the FD soft limit at startup to match the hard limit. But Firefox appears to be using int16 for FDs internally, and Firefox 128.0.2 easily hits this new limit of 65536 and crashes just like before, using the previously attached MRE (fd-no-gc.html).

Due to the increased FD limit, this bug is now harder to hit, but I have successfully reproduced it with 128.0.2. For some reason, it is particularly easy to hit in Firefox 123.0.x and 126.0.1.
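
To see which soft and hard limits a running process actually ended up with, the values can be read from /proc on Linux. This is only a minimal sketch: the helper name fd_limits is made up for illustration, and the process name passed to pgrep in the comment is an assumption that may differ per distribution.

```shell
#!/usr/bin/env bash
# Print the soft and hard open-file limits of a running process (Linux only).
# /proc/<pid>/limits has one row per limit: name, soft limit, hard limit, units.
# fd_limits is a hypothetical helper name, not part of any existing tool.
fd_limits() {
    local pid="$1"
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "/proc/$pid/limits"
}

# Shown here for the current shell. For Firefox, pass the browser's PID instead,
# e.g. fd_limits "$(pgrep -o firefox)" (the process name is an assumption).
fd_limits "$$"
```

If the Firefox 128 change applies, the two values reported for the Firefox process should be equal.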

When I hit the bug in Firefox 128.0.2, the graph of open FDs was a perfectly straight line going up, so the FD GC did not run at all for over 100 seconds. Sometimes, though, the GC suddenly kicks in after a large number of FDs have accumulated, cleans everything up, and then keeps running at a 4.2-second interval, which is enough to keep everything running smoothly given the generous FD limit as of Firefox 128. Suspiciously, this tends to happen at either 16384 or 32768 open FDs. I recommend using Netdata's File Descriptors graph (see earlier screenshot) to see how the GC is doing.
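
For anyone without Netdata, a rough equivalent of that graph can be produced by sampling /proc directly. The sketch below is Linux-only, and the function names count_fds and sample_fds are invented for illustration. Which Firefox process holds the leaked FDs is not established in this report, so it may be necessary to try several PIDs.

```shell
#!/usr/bin/env bash
# Count the file descriptors currently open in a process (Linux only).
# Each entry in /proc/<pid>/fd is one open descriptor.
count_fds() {
    local pid="$1"
    ls "/proc/$pid/fd" 2>/dev/null | wc -l
}

# Print "<unix-timestamp> <open-fd-count>" for a fixed number of samples,
# sleeping <interval> seconds (default 1) between samples.
sample_fds() {
    local pid="$1" samples="$2" interval="${3:-1}"
    local i
    for ((i = 0; i < samples; i++)); do
        printf '%s %s\n' "$(date +%s)" "$(count_fds "$pid")"
        if ((i + 1 < samples)); then
            sleep "$interval"
        fi
    done
}

# Example (commented out; the process name is an assumption):
# sample_fds "$(pgrep -o firefox)" 120 1 > fds.log
```

A count that only ever rises across the samples would match the straight-line graph described above, while periodic drops would indicate that FDs are being closed.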

I still haven't looked at the Firefox source code, but I want to share a theory: when the bug hits, it looks like the GC component in charge of closing FDs simply isn't running. However, this only happens with the current approach of passing the file slice directly to XHR. With a slight variation (see fd-gc-ok.html attached to this comment), the FDs are closed immediately after each file upload. So could the issue be that the FD GC never gets "woken up" when XHR is given a file slice to upload rather than a buffer? Alas, the approach in fd-gc-ok.html is not applicable to real-world scenarios, since it introduces insane GC stuttering and uses gigabytes of RAM.

Given the increased FD limit in Firefox 128, at least 524288 files are now necessary to hit the bug. The following bash one-liner builds a suitable folder for reproducing it:

for d in {000..524}; do echo $d; mkdir -p tinyfiles/$d; for f in {000..999}; do echo $d$f >tinyfiles/$d/$f; done; done
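
Since the reproduction depends on exceeding the FD limit, it may be worth confirming that the generated tree is large enough before starting an upload. A minimal sketch follows; the helper names count_files and has_enough_files are hypothetical.

```shell
#!/usr/bin/env bash
# Count regular files below a directory.
count_files() {
    find "$1" -type f | wc -l
}

# Succeed only if the directory holds at least <min> regular files.
has_enough_files() {
    local dir="$1" min="$2"
    [ "$(count_files "$dir")" -ge "$min" ]
}

# Example (commented out so the snippet has no side effects):
# has_enough_files tinyfiles 524288 && echo "tree is large enough"
```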

When reproduced in Firefox 128.0.2, the following may be printed to stderr:

IPDL protocol Error: Received an invalid file descriptor
[20813] Sandbox: Unexpected EOF, op 0 flags 00 path /proc/20813/statm
[20500] Sandbox: bad read from pid 20813: EMFILE
[20500] Sandbox: bad read from pid 20676: EMFILE
[20676] Sandbox: Unexpected EOF, op 0 flags 00 path /proc/20676/statm
Attached file fd-gc-ok.html

slightly modified version of the initial MRE which avoids the FD leak, in exchange for being unusable in real-world scenarios due to GC thrashing

Attached file fd-no-gc-fetchmod.html

This bug is not unique to XMLHttpRequest; it also happens with fetch. Attached is another MRE, fd-no-gc-fetchmod.html, which replaces XMLHttpRequest with fetch(), yet still crashes Firefox with the same behavior.

There is one difference: When using fetch, the server must accept the PUT/POST and return a 200. If the server returns an error, the FD does not leak. So the attached MRE (fd-no-gc-fetchmod.html) must be hosted from a server which accepts PUT. Below is a suggested approach to reproducing the bug using copyparty as the webserver, but feel free to replace it with any other httpd that fits the bill.

curl -LO https://github.com/9001/copyparty/releases/latest/download/copyparty-sfx.py
python3 copyparty-sfx.py -nw
firefox http://127.0.0.1:3923/fd-no-gc-fetchmod.html
Summary: File-descriptors are not closed under heavy load → Browser crash due to GC not collecting FDs used in XHR