Open Bug 261263 Opened 20 years ago Updated 2 years ago

validate charset (esp. UTF-8) of text/plain downloads (first N bytes) before accepting as text/plain

Categories

(Firefox :: File Handling, defect)

People

(Reporter: darin.moz, Unassigned)

validate charset of text/plain downloads (first N bytes) before accepting as
text/plain.

this is derived from bug 261207, in which a binary file is incorrectly served as
"text/plain; charset=UTF-8"
This can presumably only be done for a very limited set of character sets, most
notably UTF-8 -- and mislabelling of UTF-8 data is most likely to mean it's
really ISO-8859-1 or similar. What exactly were you suggesting we should do?
(i.e. what are the steps to reproduce, and what change in behaviour would you
consider to mean the bug was fixed?)
At the moment, we only do our type sniffing if the server delivers "text/plain"
or "text/plain; charset=ISO-8859-1" data.

Darin's suggestion is that for "text/plain; charset=FOO" data we check whether
it looks like data in the FOO charset, and if it does not that we sniff for the
content type.
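
To make the suggestion concrete, here is a minimal Python sketch of the decision. It is not Mozilla's actual implementation. The names `should_sniff`, `prefix_is_valid` and `SNIFF_PREFIX_LEN` are hypothetical, the 1024-byte prefix is borrowed from a figure mentioned later in this bug, and Python's codec registry stands in for whatever charset decoders the browser really uses.

```python
import codecs

# Assumption: how many leading bytes to inspect (the "first N bytes").
SNIFF_PREFIX_LEN = 1024


def prefix_is_valid(data: bytes, charset: str, limit: int = SNIFF_PREFIX_LEN) -> bool:
    """Return True if the first `limit` bytes decode cleanly in `charset`."""
    try:
        decoder = codecs.getincrementaldecoder(charset)(errors="strict")
    except LookupError:
        # Unknown charset label: we cannot validate it, so do not second-guess it.
        return True
    prefix = data[:limit]
    try:
        # final=False tolerates a multibyte sequence cut off at the prefix
        # boundary; the end is only treated as final if we saw the whole body.
        decoder.decode(prefix, final=len(data) <= limit)
    except UnicodeDecodeError:
        return False
    return True


def should_sniff(content_type: str, data: bytes) -> bool:
    """Decide whether to content-sniff a response labelled text/plain."""
    parts = [p.strip() for p in content_type.split(";")]
    if parts[0].lower() != "text/plain":
        return False
    charset = None
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower()
    # Current behaviour as described above: sniff only for a bare
    # text/plain or an explicit ISO-8859-1 label.
    if charset is None or charset == "iso-8859-1":
        return True
    # Proposed addition: also sniff when the data is not valid in the
    # declared charset.
    return not prefix_is_valid(data, charset)
```

Under these assumptions, a body that starts with bytes that are not valid UTF-8 but is labelled "text/plain; charset=UTF-8" would go on to content sniffing, while real UTF-8 text would still be rendered as text/plain.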

The steps to reproduce are to load a non-text file sent as "text/plain;
charset=FOO" where FOO is not ISO-8859-1 by the web server and check whether
Mozilla tries to render it as plaintext or whether we sniff it as
application/octet-stream.

The slope is a slippery one.  ;)
For most character sets, any byte-stream is valid. I would strongly recommend we
don't go down this path...
(In reply to comment #3)
> For most character sets, any byte-stream is valid.

Which means that they won't be affected by anything we change here (would render
as text/plain, as they do now).  Or am I missing something?
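
A small Python illustration of that point, using Python's own codecs as a stand-in for the browser's decoders (an assumption; a browser's handling of the same label may differ). The helper `is_valid` is hypothetical. A single-byte charset such as ISO-8859-1 accepts every byte sequence, so a validity check on it can never reject anything, whereas UTF-8 rejects many sequences.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# ISO-8859-1 maps each byte to exactly one code point, so decoding never fails.
latin1_text = all_bytes.decode("iso-8859-1")


def is_valid(data: bytes, charset: str) -> bool:
    """Return True if `data` decodes without error in `charset`."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


print(is_valid(all_bytes, "iso-8859-1"))  # any byte stream passes
print(is_valid(all_bytes, "utf-8"))       # bytes such as 0x80 alone are malformed
```

So a check of this kind only ever changes behaviour for charsets with invalid byte sequences, UTF-8 being the obvious one.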

> I would strongly recommend we don't go down this path...

Is that still the case given the previous statement I made in this comment? 
(Note that I don't feel particularly strongly about this, but I do think it's a
pretty safe heuristic).
Are there really enough pages sent as UTF-8 that aren't text files to warrant
this? (Does IE in XPSP2 do _any_ sniffing of text/plain content?)

Special casing of certain character sets to do something that is explicitly the
opposite of what the server said just seems like a really weird thing to do,
especially given that we are effectively promoting content from the least
dangerous content (text/plain) to something that will be processed by an
application that is known to be rather buggy (Windows Media Player).

Here's an idea, though: Why don't we display the content, but, if we think it
might be binary data, display one of those cool "info bars" and say "This
document appears to be an MPEG Video File. _Open_with_Media_Player_ [Close]" or
similar? That would mean we did the right thing and were safe, but allowed the
user to easily get to the content as if it was correctly labelled.
(In reply to comment #5)
> (Does IE in XPSP2 do _any_ sniffing of text/plain content?)

This will need to be answered by someone who has a Windows system somewhere nearby.

> especially given that we are effectively promoting content from the least
> dangerous content (text/plain) to something that will be processed by an
> application that is known to be rather buggy (Windows Media Player).

I agree that this is an issue.

> Here's an idea, though: Why don't we display the content

This on its own is enough to hang Mozilla or the OS or crash Mozilla in many
cases (buggy fonts that give garbage for certain codepoints, bugs in xft, issues
with font servers, trying to look up bogus codepoints in every single font of
the system for millions of characters (since binary files can easily get to be
megabytes in size), etc, etc).

We could _not_ display the content and show the little info bar, but that's what
we already do (except we show the helper app dialog).  What we do need to do is
add a "view as text" option on said helper app dialog.
> This on its own is enough to hang Mozilla or the OS or crash Mozilla in many
> cases (buggy fonts that give garbage for certain codepoints,

We shouldn't crash if we get bogus font data. That should be fixed.


> bugs in xft, issues with font servers,

Same as above, of course. Frankly I'd rather take my chances with xft bugs than
Windows Media Player bugs.


> trying to look up bogus codepoints in every single font of the system for 
> millions of characters (since binary files can easily get to be megabytes in 
> size), etc, etc).

Why are we looking up codepoints for characters that aren't being painted?

If there are really millions of bogus codepoints, does that mean there are fewer
than millions of invalid bytes? If so, detecting it as invalid would be a
problem anyway, no?


> We could _not_ display the content and show the little info bar, but that's 
> what we already do (except we show the helper app dialog).  What we do need to
> do is add a "view as text" option on said helper app dialog.

I thought this was for the "we really think it is text/plain, although now that
you mention it..." case, not the "they said it was text/plain, but we didn't
believe them for a minute" case. (We do need that feature too, but that's
another bug.)
(In reply to comment #7)
> We shouldn't crash if we get bogus font data. That should be fixed.

A lot of the crashing is upstream (not in our control).

> > bugs in xft, issues with font servers,
> 
> Same as above, of course. Frankly I'd rather take my chances with xft bugs than
> Windows Media Player bugs.

The latter don't crash the app.

> Why are we looking up codepoints for characters that aren't being painted?

If we're loading the document as text/plain, we do try to paint it and all.

> If there are really millions of bogus codepoints, does that mean there are
> less than millions of invalid bytes?

No, there would be millions of _total_ bogus bytes in the file.  We'd only look
at the first 1024 bytes, though; if there's nothing bogus in there, chances are
the whole file is ok.

> I thought this was for the "we really think it is text/plain, although now
> that you mention it..." case, not the "they said it was text/plain, but we
> didn't believe them for a minute" case.

This is for the "they said it was text/plain, so chances are they're lying,
because Apache just lies by default; let's do a quick test to see whether it
could conceivably be text/plain" case.
In that case I don't understand bug 261207 comment 4, from which this bug was
apparently derived.

BTW, for text files, we really shouldn't be painting characters if they are
off-screen. Should I file a bug on that? Seems like that would be an easy win
and a definite perf advantage in cases like this. (Or did I misunderstand you?)

(And the crashes in upstream code obviously should be fixed too (whether by us
or others); if they aren't then Mozilla won't be the only crashing app, it'll
make the entire system unstable.)
(In reply to comment #9)
> In that case I don't understand bug 261207 comment 4, from which this bug was
> apparently derived.

This bug is a suggestion to change our current "we don't think it's text/plain,
so we'll check" criteria (which are hinted at in bug 261207 comment 4).

> BTW, for text files, we really shouldn't be painting characters if they are
> off-screen

We don't as far as I know.  But we have to get glyph info for them anyway, since
it affects layout.

> (And the crashes in upstream code obviously should be fixed too

It's being worked on, but the upstream code's buggy versions are widely
installed in numerous Linux distributions...
(In reply to comment #5)
> Are there really enough pages sent as UTF-8 that aren't text files to warrant
> this? (Does IE in XPSP2 do _any_ sniffing of text/plain content?)

Yes, nearly all the files I've seen that are sent with the wrong MIME type that
Mozilla doesn't currently sniff correctly are sent as text/plain UTF-8.

Yes, as far as I can tell, by default the latest IE does all the sniffing that
its predecessors do. There's an option to turn off sniffing, but unsurprisingly
not many people choose to do that.
It looks like Apache on Redhat/Fedora is set to send text/plain pages as UTF-8
by default now: https://www.redhat.com/archives/fedora-list/2005-March/msg03022.html
We're running into this problem with Mozilla/Firefox themes now, which are
distributed as .jar files.  Most of our FTP mirrors are serving them with a
text/plain filetype.  We didn't notice before because most of them were also
sending charset=ISO-8859-1, which this content-sniffing apparently kicks in on.
A couple of them recently started serving them as UTF-8, and we started getting
complaints.  The correct answer is to badger the servers into fixing the mime
types.  I sent out an email today to all of our FTP mirrors asking them to set
the mime type for jar files.
Summary: validate charset of text/plain downloads (first N bytes) before accepting as text/plain → validate charset (esp. UTF-8) of text/plain downloads (first N bytes) before accepting as text/plain
(In reply to comment #0)
> ...a binary file is incorrectly served as "text/plain; charset=UTF-8"

This Content-Type appears to be an increasingly common problem. According to http://www.hardforum.com/showpost.php?p=1027768766&postcount=9 the exact header line causing the problem is:
Content-Type: text/plain; charset=UTF-8

I've found a few other examples with this exact Content-Type header line, and all are running Apache 2 on Fedora. See also http://forums.mozillazine.org/viewtopic.php?t=248906
Has anyone filed a bug on Fedora about this?  This is totally a bug in either Apache (long filed on them) or the settings Fedora sets up for Apache by default.
(In reply to comment #15)
> This is totally a bug in either Apache (long filed on them) or the
> settings Fedora sets up for Apache by default.

Apache bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=13986
Fedora bug: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197840
Assignee: file-handling → nobody
QA Contact: ian → file-handling
This has become less of a problem as Firefox has become more popular and Fedora installs Apache to serve WMV files with the correct MIME type. Apache is even removing the default MIME type in the next version of httpd. I think this should be a WONTFIX at this point, as the two bugs I linked to in comment #17 are fixed.
Product: Core → Firefox
Version: Trunk → unspecified
Severity: normal → S3