Open Bug 934084 Opened 11 years ago Updated 2 years ago

Firefox does not sanitize special characters in page titles for Save As...

Categories

(Firefox :: File Handling, defect)

26 Branch
x86_64
Linux
defect


People

(Reporter: from_bugzilla3, Unassigned)

Details

(Keywords: dupeme)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0 (Beta/Release)
Build ID: 20131009092955

Steps to reproduce:

1. Visit a page where the <title> contains characters disallowed by the filesystem
2. Hit Ctrl+S
3. Note the suggested filename
4. Try to accept the suggestion by pressing Enter

Test cases:
- http://www.fimfiction.net/download_chapter.php?chapter=432177&html
- http://www.fimfiction.net/download_chapter.php?chapter=155561&html


Actual results:

If the title contains a double quote, the suggested filename is truncated before it.

If the title contains another special character, it's included verbatim, even if the filesystem doesn't support it.

For example, on Linux with a POSIX-compliant filesystem (the most liberal kind, allowing anything but NUL and the path separator), saving a page with a slash in the title (e.g. "1/2" or "fight/flight") puts that slash in the suggested filename. The GTK+ save dialog then shows the following confusing error:

The folder contents could not be displayed
Error when getting information for file 'PATH_UP_TO_BEFORE_THE_SLASH': No such file or directory

I've confirmed this in Aurora 26 on Ubuntu Linux 12.04 LTS but it's been around since at least Firefox 3.6. I just kept forgetting to report it.


Expected results:

Firefox should replace special characters with a placeholder before handing the suggested filename off to the GTK+ dialog. Every program I know of (youtube-dl, for example) uses the underscore.

Given that there's no well-known way to rename the _files directory from a "Web Page, Complete" pair without either breaking the saved page or manually loading and re-saving the page in a browser, it makes sense to sanitize for the more restrictive NTFS Win32 namespace even on Linux.

That'd mean replacing the following characters:
    NUL / \ : * ? " < > |

I don't have experience with privileged JavaScript but, if there's a regex API equivalent to the one in the unprivileged JavaScript standard library, something like this should do the trick:
    // Regex literal: in the string form, '\\/' collapses to an escaped slash and the backslash is never matched.
    var filename_sanitizer_re = /[\0\\\/:*?"<>|]+/g;
    title = title.replace(filename_sanitizer_re, '_');
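As a quick sanity check of the replacement idea, here is a minimal self-contained sketch in plain, unprivileged JavaScript. The function name `sanitizeTitle` is hypothetical and is not anything in the Firefox codebase. It assumes the character list given above and collapses each run of forbidden characters into a single underscore.

```javascript
// Hypothetical helper, not Firefox code: replace each run of characters
// that the Win32 namespace forbids with a single underscore.
function sanitizeTitle(title) {
  // \0 = NUL, \\ = backslash, \/ = slash; the rest are literal inside a class.
  const forbidden = /[\0\\\/:*?"<>|]+/g;
  return title.replace(forbidden, '_');
}

console.log(sanitizeTitle('fight/flight'));                   // fight_flight
console.log(sanitizeTitle('LibreOffice 5.4: Release Notes')); // LibreOffice 5.4_ Release Notes
console.log(sanitizeTitle('a\\b'));                           // a_b
```

Whether a run of several bad characters should become one underscore or one per character is a design choice; the sketch follows the `+` quantifier used in the suggestion above.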
Both Win7 and my Mint Linux virtual box handled the file names okay, but I agree that passing a colon through on Linux is a bit dicey.
Status: UNCONFIRMED → NEW
Ever confirmed: true
New information which may or may not be relevant to identifying the problem:

1. I just noticed that those URLs don't actually have <title> elements. Firefox seems to be falling back to scraping the first <h1> to suggest a filename.

2. On this URL, the contents of the first <h1> begin with a quote, so Firefox suggests download_chapter.php.html as the filename.

http://www.fimfiction.net/download_chapter.php?chapter=327395&html
Whiteboard: [dupeme]
Confirmed bug on the following specs:
Mozilla/5.0 (X11; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0
Build: 20160114060719
Cleaning up the Untriaged list. Triaging as Firefox > File Handling. If this isn't the correct location, please triage this to the correct area. Thanks.
Component: Untriaged → File Handling
Problem still present in 52.6: Trying to save https://wiki.documentfoundation.org/ReleaseNotes/5.4 (for example) suggests "LibreOffice 5.4: Release Notes – The Document Foundation Wiki.html" as the file name on Linux. The problem arises when I want to copy the saved files to NTFS for use on Windows: the colon is not allowed in file names there, and fixing the links is quite some work...
As suggested initially, the default file names should try to avoid special characters (BTW: We have the code to URL-encode any character, don't we?)
Apart from that, the proposed file name could also be shortened (perhaps using an ellipsis character) if the title of the page is very long.
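For illustration, the URL-encoding idea could look roughly like the sketch below. The function name `percentEncodeForbidden` is hypothetical, and the sketch assumes the character list from the original report plus `%` itself, so that the result can be decoded again without ambiguity. It is not a claim about what Firefox's existing encoding code does.

```javascript
// Hypothetical helper, not Firefox code: percent-encode only the characters
// that are unsafe in file names, leaving everything else readable.
function percentEncodeForbidden(title) {
  // '%' is included so that decodeURIComponent() can reverse the result.
  return title.replace(/[\0\\\/:*?"<>|%]/g, function (ch) {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return '%' + hex.padStart(2, '0');
  });
}

console.log(percentEncodeForbidden('LibreOffice 5.4: Release Notes'));
// LibreOffice 5.4%3A Release Notes
```

Compared with the underscore placeholder, this keeps the original title recoverable, at the cost of less readable file names.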
I also noticed that this is a Chrome parity issue, with Chrome already doing at least some escaping that Firefox still does not.

As for ellipsizing, also a good idea. I have, on occasion, had the GTK+ Save dialog reject a suggested filename from a *booru image site because the tag list in the page title is too long for the filesystem.

I recommend using the limitations of Microsoft's Joliet extension to ISO9660 (Microsoft's answer to long filenames before ISO 9660:1999 and UDF, the former of which the K3b frontend doesn't support and the latter of which can only be selected in addition to the base ISO9660 filesystem when using tools based on mkisofs/genisoimage backends).

K3b's "May I truncate filenames in the Joliet tree?" dialog is by far the most common place I see filenames exceeding what a filesystem allows. In other words, limit filenames to 103 16-bit codepoints, which is the "seems to work safely" number mkisofs identified by experimentation, despite being more than the 64 in the spec and less than the 110 that Microsoft claims it can use.
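A rough sketch of what such a limit could look like in JavaScript follows. The function name `truncateFilename`, the choice of an ellipsis character as the marker, and the "at most 10 units after the last dot counts as an extension" rule are all assumptions made for illustration. JavaScript string lengths are counted in 16-bit units, which is why the 103 figure maps onto `length` directly.

```javascript
// Hypothetical helper, not Firefox code: shorten a file name to at most
// maxUnits 16-bit units, keeping a short extension such as ".html".
function truncateFilename(name, maxUnits = 103) {
  if (name.length <= maxUnits) return name;

  // Assumption: a short tail after the last dot is the extension.
  const dot = name.lastIndexOf('.');
  const ext = dot > 0 && name.length - dot <= 10 ? name.slice(dot) : '';
  const ellipsis = '\u2026';

  let keep = maxUnits - ext.length - ellipsis.length;
  // Do not cut between the two halves of a surrogate pair.
  const last = name.charCodeAt(keep - 1);
  if (last >= 0xd800 && last <= 0xdbff) keep -= 1;

  return name.slice(0, keep) + ellipsis + ext;
}

console.log(truncateFilename('a'.repeat(200) + '.html').length); // 103
```

The sketch does not handle very small limits (where the extension alone would not fit); a real implementation would need to decide what to do in that case.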
Severity: normal → S3
Keywords: dupeme
Whiteboard: [dupeme]