Support loading BOMless UTF-8 text/plain files from file: URLs
Categories: Core :: DOM: HTML Parser, defect
People: Reporter: vincent-moz; Assigned: hsivonen
References: Depends on 1 open bug
Attachments: 2 files, 5 obsolete files
Comment 83 • 5 years ago
See duplicate bug 1477983 for additional discussion about standards, about how Firefox perpetuates confusion around character sets, about the need for good character-set conversion, and about this error. These problems with UTF-8 support have gone on for too many years.
Comment 84 • 5 years ago
The spec forbids sniffing for UTF-8; it's not at our discretion. There is even a web-platform test to ensure that UTF-8 is not sniffed:
https://github.com/web-platform-tests/wpt/pull/14455
Comment 85 • 5 years ago
I guess this makes sense, as strings should be private (they could be encoded in some secret way that Firefox has no business knowing about). All the more reason that UTF-8 should be assumed and bugs like this one should be fixed promptly, not allowed to exist 5 years later, in my opinion.
Reporter
Comment 86 • 5 years ago
(In reply to Masatoshi Kimura [:emk] from comment #84)
> The spec forbids sniffing for UTF-8; it's not at our discretion. There is even a web-platform test to ensure that UTF-8 is not sniffed:
> https://github.com/web-platform-tests/wpt/pull/14455

Do you mean that the WHATWG has decided that all text files are expected to be in windows-1252? Wow!
Comment 87 • 5 years ago
(In reply to Masatoshi Kimura [:emk] from comment #84)
> The spec forbids sniffing for UTF-8; it's not at our discretion. There is even a web-platform test to ensure that UTF-8 is not sniffed:
> https://github.com/web-platform-tests/wpt/pull/14455

To clarify, does that test specifically apply to content from file: URLs, rather than just to content served over the web?
Assignee
Comment 88 • 5 years ago
(In reply to Smylers from comment #87)
> To clarify, does that test specifically apply to content from file: URLs, rather than just to content served over the web?

I believe it's meant to apply to non-file URLs only.

(In reply to Vincent Lefevre from comment #86)
> Do you mean that the WHATWG has decided that all text files are expected to be in windows-1252? Wow!

Unlabeled files are expected to be legacy files in legacy encodings (just like files that don't opt into standards-mode behavior with a doctype are expected to be legacy and quirky). Considering the configuration that test cases are run in (a generic TLD and the en-US browser localization), windows-1252 is the applicable legacy encoding.

Newly-created files are expected to be UTF-8 and labeled (just like newly-created files are expected to be non-quirky and to have the HTML5 doctype to say so). The file: URL case is different when it comes to the labeling expectation, because a) one of the labeling mechanisms (HTTP headers) is not available, and b) we don't need to support incremental loading of local files.
Comment 89 • 5 years ago
You can't label a text file. And it makes no sense to apply a different encoding to local files (which on most computers shipped today are exclusively UTF-8) than to the same files carried over the net. The latter, even including legacy stuff, is already 94% UTF-8.
So if one of the competing standards bodies declares it wants Windows-1252, what about having a config option WHATWGLY_CORRECT that defaults to off, and doing the sane thing otherwise?
Assignee
Comment 90 • 5 years ago
(In reply to Adam Borowski from comment #89)
> You can't label a text file.

On common non-file transports you can: with "; charset=utf-8" appended to the text/plain type. (Also, the BOM is an option on any transport, but has other issues.)
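To make that labeling concrete, here is a minimal sketch using only Python's standard library, showing a server that sends the charset parameter; the file name "notes.txt" and the port are made-up illustrations, not anything from this bug:

```python
# Hypothetical sketch: serving a text file with an explicit charset label.
from http.server import BaseHTTPRequestHandler, HTTPServer

class LabeledTextHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        with open("notes.txt", "rb") as f:  # hypothetical local file
            body = f.read()
        self.send_response(200)
        # The "; charset=utf-8" parameter is the label; without it the type
        # is bare text/plain and the receiver has to guess the encoding.
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), LabeledTextHandler).serve_forever()
```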
> And it makes no sense to apply a different encoding to local files (which on most computers shipped today are exclusively UTF-8) than to the same files carried over the net.

It does when a file carried over the net loses its Content-Type header when saved locally, and sniffing for UTF-8 locally is feasible in a way that doesn't apply to network streams in the context of the Web's incremental rendering requirements.
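As an illustration of why whole-file sniffing is tractable for local files, here is a hedged sketch of the validate-then-fall-back idea in Python. It is only a model of the approach, not Firefox's actual detector, and the windows-1252 fallback is an assumption for the sake of the example:

```python
# Sketch of "sniff by validating": because the whole local file is
# available up front (no incremental network loading), we can check
# whether it decodes as UTF-8 and fall back to a legacy encoding only
# if it does not. Not Firefox's actual detector.
def guess_encoding(path: str, fallback: str = "windows-1252") -> str:
    with open(path, "rb") as f:
        data = f.read()  # reading the whole file is feasible for file: URLs
    try:
        data.decode("utf-8")  # strict by default; raises on invalid bytes
        return "utf-8"
    except UnicodeDecodeError:
        return fallback  # treat as legacy content

# Example: guess_encoding("example.txt") returns "utf-8" for valid UTF-8 input.
```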
> So if one of the competing standards bodies declares it wants Windows-1252, what about having a config option WHATWGLY_CORRECT that defaults to off, and doing the sane thing otherwise?

Do I understand correctly that you'd want to assume UTF-8 for unlabeled content carried over the network, and break unlabeled legacy content, in order to give newly-authored content the convenience of not having to declare UTF-8?
Comment 91 • 5 years ago
Adding that ";charset=utf-8" is not possible within most common user interfaces. And even if the content's author can actually access the web server's configuration, that requirement is an obscure detail that's not well publicized. And it shouldn't need to be: an assumption of plain ASCII or Windows-1252 might have been reasonable in the 1980s, but today it is well overdue to make such basics work out of the box.

> Do I understand correctly that you'd want to assume UTF-8 for unlabeled content carried over the network, and break unlabeled legacy content, in order to give newly-authored content the convenience of not having to declare UTF-8?

I'm afraid so, yes. It is nasty to potentially break historic content, but the alternative is to 1. educate users and 2. require them to do manual steps; 2. is unnecessary work, while 1. is not feasible.
Today's Node/PHP developer doesn't even know what "encoding" is.
Assignee
Comment 92 • 5 years ago
> today it is well overdue to make such basics work out of the box

Doing this on the browser side is incompatible with not breaking existing content. If legacy content expects something unreasonable, reasonable behavior needs to be opt-in, resulting in boilerplate for all new content. UTF-8 isn't the only case; other obvious cases are the standards mode and viewport behavior. So newly-authored HTML needs to start with <!DOCTYPE html><meta charset="utf-8"><meta content="width=device-width, initial-scale=1" name="viewport">. It's sad that new content bears this burden instead of old content, but that's how backward compatibility works.

If you are already at peace with putting <!DOCTYPE html> and <meta content="width=device-width, initial-scale=1" name="viewport"> in your template, I suggest just treating <meta charset="utf-8"> as yet another template bit and not trying to fight it.
It doesn't work for text/plain, which is something of an afterthought in the Web Platform compared to text/html. However, text/plain is also significantly less common than text/html, so it kinda works out in the aggregate, even though it's annoying for people who actually do serve text/plain.
Web servers have tried to change their out-of-the-box experience. For example, it takes special effort to get nginx not to state any charset for text/html. This has its own set of problems when the Web server is upgraded without making the corresponding content changes. (Previously with Apache, similar issues led to browsers not trusting server-claimed text/plain to actually be text.)
In any case, this is off-topic for this bug.
> Today's Node/PHP developer doesn't even know what "encoding" is.
At least with Node or PHP they are in control of their HTTP headers.
Comment 93 • 5 years ago
In response to comment 88, "Unlabeled files are expected to be legacy files in legacy encodings": this is an unrealistic expectation. While it certainly may apply to many novice programmers or users stuck with old data or applications, it most certainly does not apply to modern users or modern data, both of which already use UTF-8 as a de facto convention or standard. Clearly, as time goes on, support for legacy data must fall on the users of such data. They must be responsible for declaring it as using a Windows code page or whatever other nonstandard encoding it perpetuates.
Perhaps over-reaching assumptions like this one by senior Firefox developers and/or by standards organizations are at the root of Firefox's problems in the area of character encoding.
Comment 94 • 5 years ago
"What problems, plural, (as opposed to the single issue of whether unlabeled should mean UTF-8) are you referring to?" Henri, I don't normally deal with Firefox character encoding problems in my software work, so you need to excuse me if I can't remember all the issues, but one of them is the large body of complaints about Firefox > Menu > View > Text Encoding, and the submenus of Text Encoding that sometimes are available. You might try some Web searching to find other issues with character encoding. I would be very surprised if my memory is wrong and there are no other problems in this area.
Comment 95 • 5 years ago
I deal with text/plain files daily (local and remote), and every single one of them is UTF-8 and has been for at least a decade. It's long past time that Firefox defaulted to this modern encoding (like other browsers do).
It is not practical to convert these text/plain files into HTML or to encode them using a legacy encoding, as that would (a) require significant added effort daily (by several people), and (b) break at least three other parts of the workflow.
Comment 96 • 5 years ago
(In reply to Chris McKenna from comment #95)
> I deal with text/plain files daily (local and remote), and every single one of them is UTF-8 and has been for at least a decade. It's long past time that Firefox defaulted to this modern encoding (like other browsers do).

This bug is about local files only, and it has been closed because the encoding is now detected by sniffing (see changeset 5a6f372f62c1 and the Firefox 66 release notes, under HTML). This makes sense for a whole lot of reasons cited above, most prominently that the transport does not convey the charset and that the full file is available. So your problem should be fixed in Firefox 66+.
For other transports, I'm sure another bug (bug 81551) is more appropriate, though the key arguments against using UTF-8 by default remain:
- the charset can be specified:
  - either by the transport (i.e. the Content-Type HTTP header),
  - or in the file, in the case of HTML (<meta charset="utf-8">);
- forcing new content to be more specific than old content is by design; that's how backwards compatibility works. In other words, bluntly defaulting to UTF-8 breaks legacy content;
- (in a different thread) magically guessing UTF-8 encourages people not to specify the charset;
- autodetecting is not desirable over the network, for a number of reasons including:
  - incremental loading,
  - interoperability,
  - the fact that, for HTML, the WHATWG forbids it (see the dedicated web-platform test and the note "User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network" in the specs).

If you can address these points (maybe specifically for text/plain?), I suggest you open a new bug about it.
Comment 97 • 5 years ago
While I hold to a more general point of view, that of not hesitating to move on to the most stable version of the best technology (and doing so as close to automatically as possible, given constraints such as dependencies on standards, hardware, affordability, and other software), I also see some very good reasons for maintaining backward compatibility with a given set of software versions. One excellent use case is the company, charity, or organization whose primary function is something other than technology and which simply cannot afford to port its current working software to a new platform, or even to a new set of software versions. We can even assume that such organizations are running donated software on donated hardware and are providing a vital service to the world. Given no expertise and no money to hire expertise, such organizations can easily get stuck with versions that stop working.
While I do not subscribe to the outmoded point of view that software should be automatically backward compatible forever, because new features will always eventually be needed, I do recognize (as should anyone in software) that many organizations and individuals simply must continue using their current hardware and software long past its supposed end of life. Windows XP still has many users, in spite of the fact that many new software products will not run on XP.
An intelligent and balanced point of view recognizes the validity of the points made in comment 95, but adds that workarounds must always be provided for those who use obsolete technologies. Such workarounds can be as simple as an extension to a Firefox web-developer tool (or some other semi-hidden part of Firefox) that allows the manual (and possibly programmatic) selection of a now-nonstandard character encoding for a web page, including obsolete code pages and the most-used obsolete general encodings. While casual users of Mozilla-based tools would not see such an option (since it won't matter and would just confuse them), it should be available somewhere, to spare organizations and individuals a porting expense they may not be able to afford.
Thus, Firefox should always provide reasonable, optional compatibility features to help its less knowledgeable and less wealthy users, even if those features make absolutely no sense to cutting-edge developers eager for the world to discover the latest elegant and functional features that technology and standards can offer.
Reporter
Comment 98 • 5 years ago
(In reply to Cimbali from comment #96)
> (in a different thread) magically guessing UTF-8 encourages people not to specify the charset

Since this has come up before, note that there are contexts where the user cannot specify the charset (or the MIME type), e.g. when viewing a VCS repository (though with some VCSes, such as Subversion, one can specify the MIME type and the charset). However, I think it's more the job of the server to guess the MIME type and the charset: it has the local file, it can cache the result, and this ensures the same behavior with all clients. So, nothing to change in Firefox.
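For illustration, here is a hedged Python sketch of that server-side guessing; the helper name is made up, and the windows-1252 fallback is an assumption rather than any particular server's behavior:

```python
# Guess both the MIME type (from the file name) and the charset (by
# validating the bytes) on the server, so every client sees the same
# Content-Type. A real server would cache the result per file.
import mimetypes

def content_type_for(path: str) -> str:
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    if mime.startswith("text/"):
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")
            charset = "utf-8"
        except UnicodeDecodeError:
            charset = "windows-1252"  # assumed legacy fallback
        return f"{mime}; charset={charset}"
    return mime

# e.g. content_type_for("README.txt") -> "text/plain; charset=utf-8"
```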