[FIX]Prompt user about invalid text/plain content - solving most incorrect MIME type issues

RESOLVED FIXED in mozilla1.7alpha


16 years ago
3 years ago


(Reporter: email, Assigned: bzbarsky)


(Blocks 2 bugs)

Dependency tree / graph

Firefox Tracking Flags

(Not tracked)



(1 attachment, 7 obsolete attachments)



16 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6a) Gecko/20030923 Firebird/0.7+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6a) Gecko/20030923 Firebird/0.7+

When Firebird receives content which the server has set to the MIME type
"text/plain", but the data contains invalid ASCII control characters (as can be
identified from IETF RFC 2046), Firebird should prompt the user as to what to do
with the data rather than attempt to display it immediately.

This behaviour is intended to avoid the common problem of attempting to download
binary data from mis-configured web-servers; and solves it *without* breaking
Firebird's compliance with MIME type handling.

The content of a "text/plain" document, according to IETF RFC 2046 (Section 4.1.2): 

  Note that the control characters including DEL (0-31, 127) 
  have no defined meaning in apart from the combination CRLF 
  (US-ASCII values 13 and 10) indicating a new line. Two of 
  the characters have de facto meanings in wide use: FF (12) 
  ... and TAB or HT (9) ... Aside from these conventions, 
  any use of the control characters or DEL in a body must 
  either occur
    (1) because a subtype of text other than "plain" 
        specifically assigns some additional meaning, or
    (2) within the context of a private agreement between 
        the sender and recipient. Such private agreements 
        are discouraged and should be replaced by the other 
        capabilities of this document.
  [RFC 2046]

Thus, the detection can be made from a stream sent as the "text/plain"
MIME-type, yet the first few bytes contain one or more ASCII values in the range
0-9 or 14-31 (or, possibly, 127 - but I'm sure some codepages use it when they

Reproducible: Always

Steps to Reproduce:
1. Find a website with, e.g. a .rar file which has no MIME type 
   associated with .rar files
2. Click on the link.

Actual Results:  
Firebird starts displaying the erroneous text/plain stream in the browser -- the
user sees complete garbage on their screen.

Expected Results:  
(The server doesn't recognise the file type so, typically, claims it is

This stream could be identified (from the first few bytes) as containing invalid
text/plain characters (in accordance with IETF RFC 2046), and the end user could
be prompted with this information.

The end-user could then be prompted as to what to do with this file - e.g. save
it to disk (or display it anyway)

I believe the FTP component of Firbird has a mechanism to detect binary data.  

Example Problem:

An example where this is a common problem is: .RAR archive files stored on a
website where the webmaster has no direct control over MIME type association.


Many veteran users say "Just right-click and Save-As" or "Wait until the
document loads fully, then save it all".  The former is impossible if the
download was accessed by filling in a form or as the result of a browser
redirect.  The latter is impractical to expect a novice user to understand and
be satisfied with.


The problem is not with Firebird, but with mis-configured servers.  However, it
is important to bear in mind:

 * most servers appear to default to text/plain for unknown file types

 * new binary file types will be unknown

 * these unknown files will be sent as text/plain

 * in many cases, the webmaster cannot or will not directly control the MIME
type association to correct this problem (either the site is hosted and they
have no access, or may feel the majority of their (IE) users have no problem as
other browsers use MIME-type detection)


It's a similar situation to if the server sent random bytes claiming they were
"image/gif".  You wouldn't expect the browser to display a load of rubbish,
you'd expect it to notice that something's wrong and perhaps display a "broken
image" icon and the ALT text instead.  This proposal isn't all that dissimilar:
FB should cope with invalid text/plain, and the way it should cope should be by
offering to save it instead -- this has the enormous benefit of also being a
workaround for the many misconfigured servers out there.

Multi-lingual notes:

The default codepage for text/plain is "US-ASCII".  Other languages may specify
a different codepage or use UTF-8 encoding of Unicode.  However, these facts
make no difference to the interpretation of the control characters.

I don't know whether, for example, Unicode UTF-16 would actually be valid for
"text/plain".  If it is, it would be easy to check for the FF EF (? - or
vice-versa) unicode header.

Suggested UI behaviour:

If it appears likely that the stream is not "text/plain", the user should be
prompted with a dialog similar to one that would be displayed if the content
were "application/octet-stream":
   |  You have chosen to open
   |    filename.rar
   |  from: http://www.website.com/
   |  which has been sent as plain text, but it appears to be binary data.
   |  What should Mozilla Firebird do with this file?
   |    (*) Display as plain text
   |    ( ) Open with [UnRAR    |V]
   |    ( ) Save to Disk
   |    [ ] Do this automatically for files like this from now on.
   |                     [OK]  [CANCEL]

- the changes being the "Display as plain text" option, and that the "open with"
application could come from any applications on the user's system that are
registered for that file extension.

The "Do this automatically" option is undoubtedly one for debate.  It could be
used to always force this behaviour for that file extension, or perhaps only for
future false text/plain detections.  The user could then manage their choices as
additions in the Tools|Options|Downloads panel -- but, in this case, by managing
from the file extension rather than the MIME type.  (Aside: why doesn't
Tools|Options|Downloads currently display extensions *instead of* MIME types?)

In short:

The problem of misconfigured servers sending binary data as text/plain is
unlikely to go away for several years.

Many people agree something needs to be done to make Firebird more usable in
this respect

However, most binary files sent as text/plain are invalid for that format,
Firebird could allow the user to choose what to do with these files.

This is the best solution as far as I can see it, without violating the specs.

Comment 1

16 years ago
I just want to post to ask the devlopers to seriously consider this instead of
just calling it another Mimetype bug and WONTFIX'ing it. This specific
implementation does not technically deviate from WC3 standards, could be added
as something that can be enabled/disabled by user preference, and would give
users a way to handle misconfigured webservers (which would make a *LOT* of
people happy);

Comment 2

16 years ago
See also bug 57342, Add "View as Text" option for unknown mime content-type.

Comment 3

16 years ago
Also see Camino bug 221877.  The only problem I forsee is the Firebird proposal
uses sniffing to tell the user that text-plain is erroneous.  This appears to
violate the standards, since mozilla.org has said we should trust the server
(see  bug 175848 and RFC 2068, section 7.2.1).

A different approach is to have the dialog state the file is indeed text-plain,
so the server is trusted, but then give the user some options other than
rendering in the window.  It is a subtle change, but could be important for
How does the proposed approach handle text/plain content in non-ASCII encodings?
 Say in UTF-16?

Comment 5

16 years ago
Regarding Comment #3

WRT calling it "content sniffing", I still believe the analogy of loading an
erroneous image is valid; you don't expect it a browser to display garbage on
your screen if the server claims a file is an image but it doesn't actually
contain an image header.  I guess the only concern is *when* the start of the
"text/plain" stream is checked.  If you look at it as checking the first few
bytes to see if it can be displayed before actually displaying it, it really
doesn't seem any different to the handling of images.

However, the idea of always prompting about text/plain is a very good
alternative to the proposal.  

There aren't *that* many text/plain files that a user does want to see per day,
but to prevent it annoying people, perhaps there can be a list of file
extensions that the user always wants to be displayed in the browser
(pre-installed with ".txt" etc.)

Regarding Comment #4

I'm no expert, but I'm not even sure the specs I read are valid for UTF-16. 
Regardless, it (and UTF-32) are easily identified by the byte-order mark (BOM)
at the start of the file (e.g. somthing like 'FF EF').  

If UTF-16 is valid for text/plain, then the "invalid text/plain" check wouldn't
be performed on a text/plain stream if it starts with a valid BOM.

If UTF-16 isn't valid for text/plain, and the check on the first few bytes
determined a stream was invalid text/plain, then the BOM could be checked-for
and, if present, the file could be displayed anyway.

Small correction to original post: 

At the end of "Suggested UI behaviour": I was suppose to be asking, as an aside,
"why *DOES* Tools|Options|Downloads currently display (file) extensions instead
of MIME types"?  This placement almost makes it appear as if those actions will
be performed on files with those extensions irrespective of MIME type.

> it (and UTF-32) are easily identified by the byte-order mark (BOM)

that does not need to be there

Comment 7

16 years ago
Could I ask on the developer opinion on this one? There are quite some votes,
and it hasn't been WONTFIXed yet. However, there's also not been any response.
Is this normal? (sorry, I've just tracked a few bugs so far.)
The developers responsible for the MIME dispatch back end raised an issue with
the proposed approach that has not been addressed (comment 4).

At the same time, this bug is filed as a UI bug on the Firebird browser, not as
a core bug.  Which means that it should only be resolved by a Firebird developer.

Comment 9

16 years ago
Can this solution be implemented only for text/plain documents in the ASCII
Or the solution could still be implemented for all cases and just change the
wording on the dialog to say something like "... which has been sent as plain
text, but appears to be either binary data or non-ascii encoding."

The user could still choose to handle it whichever way they prefer and
optionally choose to have future occurences handled the same way.
>"... which has been sent as plain
>text, but appears to be either binary data or non-ascii encoding."

are you aware that there are very few languages for which ASCII suffices?

Comment 11

16 years ago
Now I'm just a common user, so I don't know all that much about this stuff, but
in the situation where the text/plain is being specified as the default MIME
type because of a misconfigured server, isn't it normal that the encoding is
either unspecifed or us-ascii? If there is any other encoding or character-set
wouldn't those be specifically identified in the header? This feature of
detecting possible mislabeled MIME types should still be possible in the cases
of us-ascii or non-specified character sets. While using this only in the case
of text/plain + us-ascii may not correct all of the occurences, I think it would
go a long way towards solving the problem.

Even if finding a safe way to determine if a file might be binary or not is
blocking this feature, I think it should still be implemented as an option. For
example, if I did not want my web browser to handle text/plain content, but
would rather such files open in my text-editor, I should be able to do that
either by selecting the default action the first time I navigate to such a file,
or by changing the default behavior in the preferences dialog.
> For example, if I did not want my web browser to handle text/plain content

That's a separate issue from this bug...

The problem with comment 11 is the same as with the previous comments.  Say I am
serving up a Japanese plaintext document.  I server it up as text/plain.  I do
not send a charset header (usual operating procedure and all).  It renders just
fine on everyone's browsers because of charset autodetection.

How do we avoid breaking this scenario?

Comment 13

16 years ago
According to this source: http://www.faqs.org/rfcs/rfc2046.html
"The default character set, which must be assumed in the absence of a charset
parameter, is US-ASCII." and also "messages containing characters coded
according to other versions of ISO 646 than US-ASCII and the 1991 IRV, or using
code-switching procedures (e.g., those of ISO 2022), as well as 8bit or multiple
octet character encodings MUST use an appropriate character set specification to
be consistent with MIME."
So in the case of US-ASCII (or nothing sepcified, which is probably the most
common case in the case of a misconfigured server) we check the data to see if
it is binary and prompt the user. In all other cases we can trust the
character-set specified in the header.
> According to this source: http://www.faqs.org/rfcs/rfc2046.html

Which is ignored by as many, if not more, people as the the HTTP RFCs....
By the way, I think we're all agreed that the only document that should get
special treatment of any sort are text/plain documents coming in over HTTP with
no charset specified.  The remaining questions are:

1)  What fraction of such documents are actually ASCII plaintext?
2)  What fraction of such documents are actually random "binary" data?
3)  What fraction of such documents are non-ascii plaintext?

If the answers to these show that category #3 is not too big, we could try to
make some changes to URILoader to handle this ("could" as in "I won't mind if
someone does").  Have to put some thought into how the whole dialog-posing mess
would work... That's sort of tied in to the general channel-suspendability
stuff, unless we plan to force the dialog modal and hope for the best.
re comment 13: I seem to recall that HTTP specifies a default charset of

>That's sort of tied in to the general channel-suspendability stuff

that needs to be fixed anyway...

Comment 17

16 years ago
If you serve a Japanese (or other non-ASCII) document as text/plain with no
charset header, then you're relying on browser error-correction to properly
render it.  However, from a strict standards standpoint, charset "sniffing" is
just as evil as content-type sniffing, isn't it?

On another tangent, is it proper by the standards to serve a data stream as
"text/plain" if it intentionally uses ASCII control characters in their defined
(though little used these days) meanings in the original 1960s-era ASCII
standards?  I remember from my '80s college days common use of "cute tricks"
like carriage returns without linefeeds, backspaces, and the bell character (#7)
to do gimmicky animations and special effects, for instance in "plan" files
displayed from the "finger" command on timesharing computers.  Then, once dialup
BBSs became commonplace, they used lots of ANSI escape codes, but I guess those
wouldn't count as "plain text".  I myself sometimes use the delimiter characters
from #28 to #31 (FS, GS, RS, US) to separate fields in formatted data sent as
the output of one program for the purpose of being processed by another; what
MIME type would be appropriate for such a thing?
> However, from a strict standards standpoint, charset "sniffing" is just as evil

Sure.   Which is why I'm willing to break it if the other case is a lot more common.

> what MIME type would be appropriate for such a thing?

Probably a text/x-something mimetype unless you register a subtype of text/.

Comment 19

16 years ago
I think that the problem of, for example, UTF-16 text formats is not that
important. Since the problem usually occurs only on a very limited number of
file types, the browser could be programmed only to scan the text for files with
.rar, .ace and a few of the other problematic file types. 
The file would only be declared application/octet stream if both the extension
AND the file sniffling AND the user interaction suggests that the MIME type is
specified wrongly. So you have

MIME type                   | Action
text/plain                  | display

octet stream                | save

text/plain with octet       | 'guessed' text -> display
extension (e.g. .rar)       | 'guessed' octet & user wants display -> Display
                            | 'guessed' octet & user wants save -> save

So even rar-files with UTF-16 text would be displayed properly, although that
would require user interaction. I doubt there many UTF-16-rar-plain-text-files
though, I ever doub there is one such file on the internet. And even that file
would be handled correctly by the intelligent user. 

In bug 126782, a detection method has been programmed in the fix
(http://bugzilla.mozilla.org/attachment.cgi?id=135573&action=view). Could that
one be used here?
> I think that the problem of, for example, UTF-16 text formats is not that
> important.

You're also very clearly not East Asian...  Please don't impose your
ethnocentric biases on the whole world.  There are a lot more people to whom
that problem is important than live in all of Europe.  I'd still like some
numbers in response to my question in comment 15.  Sweeping generalizations are
not numbers.

> Since the problem usually occurs only on a very limited number of
> file types, the browser could be programmed only to scan the text for files
> with .rar, .ace and a few of the other problematic file types. 

No.  There will be no hardcoding of extensions in this code.

The bug 126782 mechanism is exactly what would be used here. That is, if we
decide the "text/plain" type is "unreliable" we would simply run the unknown
type detection code on it instead.

Comment 21

16 years ago
Boris Zbarsky, 

With the hardcoding of extensions, UTF-16 would not be a big issue. I am not
'ethnocentrically biased', that's why I made this comment. If you hardcode
extensions, you wouldn't have to program all non-ASCII encodings and still have
a solution working perfectly. As I said, only plain text files .RAR files from
asia would require user interaction. I thought everyone could live with that, as
I could even live with automatically declaring .rar files octet/stream (but
since developers won't, I won't bother with that). 

Besides, if you refuse to hardcode anything, all files, e.g. .html, would have
to be scanned in ASCII, UTF-7, UTF-8, UTF-16, etc. until a fitting encoding is
found. I can't see why that would be a good and fast solution. 
> Besides, if you refuse to hardcode anything

as I understood bz, the text/plain MIME type would be hardcoded. so .html would
not be scanned unless they were sent as such a type.

let me add that I would love to mark this bug wontfix.
Dominik, why are you so focused on .rar files?  This is a problem with a lot
more than those...

The reason I refuse to hardcode extensions is that when a new extension appears
(like .dmg, for example), none of the web servers will know about it so will
send it as text/plain AND Mozilla will treat it as text/plain.  Not useful,
really.  Yeah, we could play whack-a-mole and have older Mozillas not be able to
deal with these new extensions, but that's not a good alternative in my mind.

Comment 24

16 years ago
Just to clarify: the "bug" described would work perfectly in correctly identify
invalid "text/plain".  The text/plain MIME type is only supposed to be valid for
ASCII-compatible octet streams (e.g. any ASCII codepage or Unicode as UTF-8).

The only issue we have, is that some servers *incorrectly* serve
non-ASCII-compatible text as text/plain (when it's supposed to be
text/something-else) and with no "charset" part in the MIME-type.

However, it is the case that many other character encodings are actually
upwardly-compatible with ASCII, in that they don't use (at least most of) the
ASCII control codes to encode characters.  

For example, if I understand Shift-JIS encoding properly, then just by accepting
the escape character (27) as a permitted control character, Shift-JIS would also
work with this proposed scheme.  I believe many other character sets would too.

So the issue is with non-ASCII compatible character sets that make use of the
octet values 0-8, 14-26, or 28-31.  Which character sets are actually affected
by this?

UTF-16 or UTF-32 *incorrectly* sent as text/plain would indeed use ASCII control
codes, but I thought these encodings typically had a byte-order-mark at the
start of the file.  If they don't, then what would the browser do with them
anyway - it should really display them as US-ASCII (and I can't see a manual
"UTF-16" encoding option in Firebird anyway).

The real question is:  How often is this actually a problem? 

 - Plain text files are only a small fraction of what most users view

 - then the fraction of those files in a charset that uses octet 
   values {0-8, 14-26, 28-31} (or no BOM if UTF-16/32)

 - then the fraction of those files incorrectly sent as "text/plain" 
   and with no "charset" header

If it's quite rare, then would it really be a problem if the user was
(correctly) informed the content is invalid "text/plain", and they can select
"view as text" anyway.  (I assume they'll also have to select an encoding
manually as the server didn't tell them what the file was, perhaps they could
have a combobox for the encoding next to "view as text", or even perform any
charset auto-detection wizardry if, and only if, it's got this far).

One other mitigating feature could be the UI option:

  "[ ] Do this automatically for files with this extension from now on."

(this could probably integrate nicely into the list of file types under Tools |
Options | Download)

For example, this could allow the user to build up their own personal list of
preferences to:

  * Always "Save" files that are incorrectly sent as "text/plain" 
    which have the extension ".rar"

  * Always "View as text" files that are incorrectly sent as "text/plain" 
    which have the extension ".txt"

(Note: Absolutely no "hard-coding" of extensions, just allowing the user to have
a per-extension preference for dealing with files incorrectly marked as text/plain).

Comment 25

16 years ago
> 1)  What fraction of such documents are actually ASCII plaintext?
According to the standard, all such documents should be. And for the most part
this is true, although I've been watching lately, and I see ISO-8859-1 instead
of US-ASCII most of the time.
> 2)  What fraction of such documents are actually random "binary" data?
This is a small percentage of misconfigured servers.
> 3)  What fraction of such documents are non-ascii plaintext?
None. As is defined by the standard. I've looked and I could not find a single
situation like this. If anyone else has ever seen it, please correct me.

I don't know of anyway to collect scientific data on these kind of numbers,
otherwise I would do it. The biggest point about this idea, as Daniel is trying
to express I believe, is that it prompts the user in cases of ambiguity so that
the user can decide how to handle the situation. It shouldn't matter so much how
often these cases happen because this feature would give the user an option of
how to handle it in the future. It doesn't break any functionality, the worst it
could do is bother the user once with a dialog to choose "handle this way
always". The default action could even be to handle it as it is handled now.
> The default action could even be to handle it as it is handled now.

Not any time in the near future... that would depend on a "view in browser"
option being available in the dialog.

>> 3)  What fraction of such documents are non-ascii plaintext?
>None. As is defined by the standard. I've looked

If I may ask, where did you look?

I agree this is hard to solidly quantify -- hence the quandary.

Comment 27

16 years ago
I sometimes browse various Japanese websites and I have never seen such an
occurence. In addition, I did some google searches on .jp domains for txt, text,
etc and came up with several text/plain in Shift-JIS such as:



And even some English on Japanese servers in ISO-8859-1:

but wasnt able to find any as described. Ofcourse, I didn't search for other
languages since I can't read them, so I could be overlooking a huge sample of data.

Comment 28

16 years ago
The nice thing about this approach in general (good idea!) is like bz said: If
you have a file with MIME-type text/plain and no text-encoding set, you're out
of the standards zone and into the Guess Zone. That means that any and all kinds
of sniffing are *permissable* (according to spec).

We could at that point special-case things. For example, we have very commonly
this problem with .dmg files on the Mac. So I would definitely think that if the
file is in the Guess Zone, sniffing for certain features would definitely be OK.
Is there any way to detect a UTF-8 or UTF-16 file? Can we sniff for a DMG or an
ASF file (those are commonly unset on servers).

My point is that this bug defines a Guess Zone. So now we can finally
special-case certain highly-complained-about types like DMGs without breaking
the specs.
> If you have a file with MIME-type text/plain and no text-encoding set, you're
> out of the standards zone and into the Guess Zone.

Actually, that's not true.  The standards clearly say that this means the text
is ASCII-encoded.

In any case, per discussion with biesi and darin, the following is the plan:

If incoming data meets the following criteria:

1)  HTTP channel
2)  "text/plain" request header (literal, case-sensitive comparison)
3)  This is the initial content dispatch

Then reset the content-type on the channel to "application/x-maybe-text" (or
some such).  Necko will provide a stream converter (probably just subclass or
superclass of nsUnknownDecoder) that will take that type as input and apply the
following heuristics:

A)  If the data starts with a BOM, treat it as text/plain
B)  Otherwise, apply the "last ditch sniff" algorithm from nsUnknownDecoder to
    detect it as either text/plain or application/octet-stream.  Note that this
    will NOT detect the data as any other types (in particular as any types we
    would render inline).

Embeddors are of course free to register their own converters for the relevant
contractids and override the converter Necko registers.

I may get to this by the 1.7a cycle if no one posts a patch earlier.

There are other various bugs in bugzilla that should probably be dependent on
this one... (eg there is a similar Camino bug).

Comment 30

16 years ago
bz: That's great news!

Although, may I suggest also allowing char "27" in the "IS_TEXT_CHAR" macro from
nsUnknownDecoder as this would then allow Shift-JIS (and probably other
non-ASCII, yet ASCII-compatible) encodings to be treated as text, and not binary.  
(I know they will not be strictly text/plain, but that will be what was intended)

Yeah, we should do that.
Posted patch Ok, I got bored (obsolete) — Splinter Review
Comment on attachment 136882 [details] [diff] [review]
Ok, I got bored

biesi?	Darin?	Thoughts?
Attachment #136882 - Flags: superreview?(darin)
Attachment #136882 - Flags: review?(cbiesinger)
I applied this patch then attempted to load kerz's "Gold Standard" "Idiot on
Silverado" wmv test:


It still displays as text. 

Comment 35

16 years ago
For the good of the web, we must be able to display this movie.  Thank's for
your work on this so far...
Yep.  That server sends:

  Content-Type: text/plain; charset=ISO-8859-1

which fails test #2 in comment 29.  _If_ this is the default Apache setup, I
would be willing to change test #2 to also check for this literal string....
To make it clear, if that's the default misconfiguration Apache ships with, I'm
willing to work around it. If this server has gone out of its way to be more
misconfigured than the average server, then that's a different issue entirely.
So, if it meets criteria #2 (sending text/plain) should that not mean your "Last
Ditch Sniff" should detect binary data? What does server configuration have to
do with that?
> So, if it meets criteria #2 (sending text/plain)

It does NOT mean criterion #2.  Please read what I said again.  Server
configuration has everything to do with this.  I am implementing a workaround
for a bug in the default configuration of the most popular web server out there.
 I am _not_ attempting to solve all possible ways in which someone can fuck up a
web server to lie to web browsers.

Now I suspect that the default config of Apache may end up sending Content-Type
headers in some cases.  If that's the case, we need to rethink criterion #2 to
account for that a bit...
Oh, I didn't realize your test #2 was explicit "text/plain" (and thus would not
catch "text/plain *" 
Right.  The whole point is that "text/plain; charset=Big5", eg, should not go
through the sniffer, since it would be incorrectly detected as binary...

Comment 42

16 years ago
i can't say for sure that it's the *default*, but all testcases I can find do
indeed send:

Content-Type: text/plain; charset=ISO-8859-1
I ran:


through the http://www.rexswain.com/ 's HTTP viewer, and it reports a
Content-Type field of "text/plain" but a build bz's patch still shows raw data. 

I don't know enough about Apache to say if this affects anything, but my bog
standard Apache install on this machine has this line in httpd.conf:

AddDefaultCharset ISO-8859-1
> http://www.bengoodger.com/video/funny_cats.wmv

My patched build is detecting this as application/octet-stream.... you did
rebuild in all the relevant places (uriloader/, netwerk/, docshell/build/), right?
ooh, not docshell. thanks. 

kerz points out this bug in the apache database:

ah, I see.  So that's a server version issue... sounds like we need to check for
both of those literal strings.
OK, I've properly tried your patch now and it works like a charm, except for the
old OSHelperAppService bugs which I can look at separately (on Windows it's
showing the helper for WMV files as "WMVFile" which I think implies it's looking
for the pretty app name in the wrong spot in the windows registry). 
ben: yes... the oshelperappservice uses HKCR\.<ext>\@ as the helper app
description, and HKCR\<thatvalue>\@ as mime type description (where @ means
"default value")

do you have a better suggestion?
Comment on attachment 136882 [details] [diff] [review]
Ok, I got bored

+      httpChannel->SetContentType(

could you maybe add a log comment around here, that we're checking if this is
really text?

also... is application/vnd.mozilla.maybe-text a good type to use? It's not a
registered type per 
ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/application/ (hardly
surprising :) ), I think  you should use some x- type.

+  void DetermineContentType(nsIRequest* aRequest);

hm... can you add a "virtual" here, to make clear that this is a virtual

+  (((unsigned char)ch) > 31
thank you :)

+  // First, check for a BOM.
you have no check for a UTF-32 BOM... doesn't mozilla support that?

+CreateNewBinaryDetectorFactory(nsISupports *aOuter, REFNSIID aIID, void
is there a reason to avoid NS_GENERIC_FACTORY_CONSTRUCTOR?
> could you maybe add a log comment 

err, I meant "a LOG statement"
> could you maybe add a log comment around here,

Sure thing.

> is application/vnd.mozilla.maybe-text a good type to use?

Good question.  I'll change it to a x- type that should still not cause random
namespace collisions.

> can you add a "virtual" here,


> you have no check for a UTF-32 BOM

True.  I'll add it -- I finally convinced myself this can be done in a safe way.

> is there a reason to avoid NS_GENERIC_FACTORY_CONSTRUCTOR?

Following the style of the file... no other reason.

putting this finally in browser:File handling where it belongs; also -> bz
Assignee: blake → bz-vacation
Component: General → File Handling
Product: Firebird → Browser
Version: unspecified → Trunk
Attachment #136882 - Flags: superreview?(darin)
Posted patch Updated to comments (obsolete) — Splinter Review
Attachment #136882 - Attachment is obsolete: true
Attachment #136945 - Flags: superreview?(darin)
Attachment #136945 - Flags: review?(cbiesinger)
Comment on attachment 136945 [details] [diff] [review]
Updated to comments

+	 (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0) || //
this will already match the UTF-16LE check, so this line can be removed

+	 (buf[0] == 0xFE && buf[1] == 0xFF && buf[2] == 0 && buf[3] == 0)) { //

this will match the UTF1-6BE check, so it can also be removed

with the lines removed, r=me
Attachment #136945 - Flags: review?(cbiesinger) → review+
Priority: -- → P2
Summary: Prompt user about invalid text/plain content - solving most incorrect MIME type issues → [FIX]Prompt user about invalid text/plain content - solving most incorrect MIME type issues
Target Milestone: --- → mozilla1.7alpha

Comment 55

16 years ago
> ben: yes... the oshelperappservice uses HKCR\.<ext>\@ as the helper app
> description, and HKCR\<thatvalue>\@ as mime type description (where @ means
> "default value")
> do you have a better suggestion?

filetype = HKCR\.<ext>\@
defaultapp = HKCR\<filetype>\shell\@
helperapp = HKCR\<filetype>\shell\<defaultapp>
helperappdescription = HKCU\Software\Microsoft\Windows\ShellNoRoam\<helperapp>

If helperappdescription does not exist, then we should display helperapp.  This
is the default windows behavior.  Displaying either the application name if
available or the application filename if not is much more intuitive than
displaying the internal filetype name and saying that's the program you're
opening it with.
Let's have a separate bug on the Windows thing, ok?  ;)
Christian, I'll do some research and file another bug. 
ben: scratch filed bug 227671 about the application name question
OK, this new patch works great with the "Idiot on Silverado" video
(Content-Type: text/plain; charset=ISO-8859-1). Please don't attempt to load the
link from my comment above anymore, it now seems to display undesirable content

So... what does this patch do to our results on these tests?:

What does this faking of the MIME type do to Page Info and so forth? I really
hate how we sometimes show our autodetected MIME types as if it was the server type.

Anyway. Interesting solution. Separate from that though we do IMHO still need the
   View | Content Type
...submenu (analogous to the View | Encoding submenu for the analogous problem).
Ian: the behaviour on those files does not change with this patch

For files where behaviour does change, the helper app dialog will be shown and
indicate a type of application/octet-stream.
(hence, page info never comes into play)

hmm... why does mozilla display application/x-view-source files?
> why does mozilla display application/x-view-source files?

Because we have a handler registered for that content type -- that's how we do
view-source, y'now.  ;)
QA Contact: ian
Posted patch bz's patch plus a little more (obsolete) — Splinter Review
This adds to bz's patch an extension-based lookup in
nsExternalHelperAppService::DoContent as a last ditch attempt to locate the One
True Type from the OS. 

bz/biesi - I've included the MIME type application/x-vnd.moz.maybe-binary as a
literal string in both cases, please advise if there's a common header or
something where it'd be better defined.
Attachment #136945 - Attachment is obsolete: true
Attachment #137227 - Flags: superreview?(bz-vacation)
Attachment #137227 - Flags: review?(cbiesinger)
Ben, did you base that on the patch _before_ I addressed some review comments?

The type could go in nsMimeTypes.h, I suppose.
Comment on attachment 137227 [details] [diff] [review]
bz's patch plus a little more

this does not address most parts of comment 49...

+  if (aMimeContentType == "application/x-vnd.mozilla.maybe-binary") {

aMimeContentType is "const char*". this comparison does therefore not do what
you want.

+    nsXPIDLCString mimeType;
+    if (mimeInfo)
+      mimeInfo->GetMIMEType(getter_Copies(mimeType));

can you move this into the "if (!fileExtension.IsEmpty())" block?

(except the nsXPIDLCString that should stay outside of course)

I would also like a LOG statement to show what mime type the nsIMIMEInfo has
Attachment #137227 - Flags: review?(cbiesinger) → review-
oh, about the mimetype question... yes, you could put it in nsMimeTypes.h as bz
suggested, or just use the literal string... personally, I don't care either way
Huh. I'm pretty sure I reversed the original patch and then applied the newer
one. I must have messed something up. *tries again* 
Attachment #137227 - Flags: superreview?(bz-vacation) → superreview-
Posted patch newer patch (obsolete) — Splinter Review
This one is really based on bz's second patch, with biesi's comments rolled in.
Attachment #137227 - Attachment is obsolete: true
Comment on attachment 137481 [details] [diff] [review]
newer patch

oops. one moment. disregard this patch.
Attachment #137481 - Attachment is obsolete: true
Comment on attachment 137485 [details] [diff] [review]
new patch

+  if (!nsCRT::strcmp(aMimeContentType, APPLICATION_MAYBE_BINARY)) {

do we guarantee anywhere that aMimeContentType is lowercase? it doesn't look
like it.

please use nsCRT::strcasecmp.


huh? will that do any good?
Christian - I replaced the place where the content type was set to octet-stream
with maybe-binary. I assumed that bz's original patch relied on this code to set
to octet-stream so that the UCT dialog would be presented, and so switched to

bz - did I misinterpret your origianl patch?
oh. hmmm.

well I fear that at that point, this will cause another streamconverter to get
instantiated, so you get into an infinite loop...

bz knows this code better than me, though, I'll let him comment.
(maybe we need two mimetypes - one to trigger the check for non-ascii
characters, and one to tell the helper app dialog to get the type from the
Attachment #136945 - Flags: superreview?(darin)
Comment on attachment 137485 [details] [diff] [review]
new patch

In the future, patches using the -p switch and more context (-u8 at least) are
much easier to review....  And there was no need for -w here, either, imo.

>+  (((unsigned char)ch) > 31 || (9 <= ch && ch <= 13) || ch == 27)

I know I wrote this code... but could you put () around "ch" everywhere in that
line?  So:

(((unsigned char)(ch)) > 31 || (9 <= (ch) && (ch) <= 13 || (ch) == 27)


biesi, this part is correct.  We do have two types involved here -- maybe-text
and maybe-binary.  The former triggers this stream converter, the latter is
magic to the external helper app handler.  Someone could indeed create a stream
converter for maybe-binary, but that's ok if they want to do that....

>+  // the type off the channel.
>     nextLink->mContentType = aOutContentType;
>-  }

This is indented correctly, right?  Damn -w.  ;)

>+        LOG(("OS-Provided mime type '%s' for extension '%s'\n", 
>+             fileExtension.get(), mimeType.get()));

That should pass mimeType.get() first, no?

sr=bzbarsky on Ben's additional changes with that addressed.  The rest of the
patch still needs sr (and moa from darin), so not marking the sr flag myself...
Attachment #137485 - Flags: superreview?(darin)
>We do have two types involved here -- maybe-text
>and maybe-binary.

er, oops - I missed that. :(
but, why is maybe-binary called maybe-binary? don't we know that it is binary
(because it failed the "is text" check)?
Better name suggestions?  maybe-octet-stream?

Comment 78

16 years ago
There's no guarantee that there will be an extension.  Streamed videos or
content management systems, for example.  I suppose we could do
application/x-guess-type-from-extension and fall into x-maybe-octet-stream if
that fails, too.
*** Bug 221877 has been marked as a duplicate of this bug. ***
> There's no guarantee that there will be an extension.

indeed. please read the patch. the question is just how to name that thing.
I've rolled in biesi and bz's comments except the content type name change into
this patch which I plan to check in on the 0.8 Branch. We can continue to
discuss the content type and check the further modified patch in for 1.7a
Christian... should that be application/x-vnd.mozilla.guess-type-from-extension ?

Comment 83

16 years ago
I don't think any MIME type needs both a "x-" prefix and a "vnd." one... "x-"
means "unregistered", and no further structure is necessary after that.  ("vnd."
is for registered MIME types in the vendor subtree.)
> don't think any MIME type needs both a "x-" prefix and a "vnd." one... "x-"
> means "unregistered", and no further structure is necessary after that.

Except if you're trying to be a good citizen and ensure that your x- crap won't
collide with anyone else's x- crap.  I made the type what I made it very

I don't really care what we name the type, as long as we clearly document in
nsMIMETypes.h what its "magic" effect is and as long as it starts with
are we really sure using magic MIME types is the way to go here? :-)

/me starts planning tests that use those MIME types to see what happens
> application/x-vnd.mozilla.guess-type-from-extension

that sounds good.
> are we really sure using magic MIME types is the way to go here? :-)

Yes, we are.  It's the simplest way to do it, and they are not very magic (you
could set up a plugin or other internal handler for any of the types involved
any time you wanted to, eg; you just can't set up a helper for the
application/x-vnd.mozilla.guess-type-from-extension type).  And the name was
carefully chosen such that you better not expect it to work in any particular way.

Comment 88

16 years ago
I've read through the entire discussion and it looks like things are shaping up
alright.  I would highly recommend taking steps preventing any active content
from loading automatically based on mime-type sniffing (offering to download is
okay, however).  

Fortunately, the patch here seems to only apply by making files sent as
text/plain appear with a download dialog.  

Here are some articles from the Freenet-devel mailing list discussing how
mime-type autodetection can affect privacy: 


It's impossible to use IE in any sort of anonymous way on Freenet (and I assume
other anonymous networks) due to "mime type sniffing".  You can easily provide
an anonymity-destroying extension file and pass that as an image!

Just a couple of points to keep in mind.
Comment 88 has nothing to do with this bug.  This bug is about the main content
area load.  The issues mentioned in comment 88 have to do with subsidiary
script/plugin loads to which the code discussed in this bug does not apply.

Please don't make irrelevant comments like that; file separate bugs on separate
issues.  If you're not sure it's a separate issue, file a separate bug and add a

Comment 90

16 years ago
There appears to be a case sensitivity issue with the charset.


Yeah, yeah.  See what I wrote about what this patch is aiming to do (quoted in
that thread).  The case-sensitive comparison is very much purposeful.

Comment 92

16 years ago
Yes, I see that you're doing it case-sensitive because you're handling it
exactly as Apache 2.x sends it.  And my question is:  why not make it case
insensitive?  Forgive me for being dumb, but how do we handle TEXT/plain or
text/Plain?  I'm not trying to say we should handle every single possibility.
the point is: this is only done to work around the default apache
misconfiguration, and not against anything else.

Comment 94

16 years ago
That may well be the point, but if you're going to the lengths to implement a
fix like this, surely you might as well add a simple case conversion statement,
and then everyone is even happier.
Not everyone. This entire codepath is a standards violation, we absolutely want
to keep the violating to the strict minimum required to hit the most common case.

If people go out of their way to misconfigure their servers even further, then
tough -- it won't work.

Comment 96

16 years ago
Well, if we limit the check to the first 10 characters we can ignore the charset
The simple reason to NOT make this a case-insensitive check is that I'd rather
like to avoid detecting actual valid ISO-8859-1 text content as
application/octet-stream.  Sorry, but that check is staying case-sensitive.

alanjstr, if we limit the check to the first ten chars, then we will fail on
type strings like in comment 41.  See also the first 20 or so comments in this
bug for more background.

One more comment from someone who clearly hasn't bothered to carefully read the
bug, and I'm reassigning this to whoever did the commenting.

Comment 98

16 years ago
Please add the charset "iso-8859-1" (lowercased) as it is the default on my
Debian/Apache server as well. The only mention of the the charset in httpd.conf is:

# Default charset to iso-8859-1 (http://www.apache.org/info/css-security/).
AddDefaultCharset on

HTTP/1.1 200 OK
Date: Thu, 25 Dec 2003 20:41:32 GMT
Server: Apache/1.3.29 (Debian GNU/Linux)
Last-Modified: Sat, 22 Nov 2003 23:38:46 GMT
ETag: "1ed36-28e0-3fbff386"
Accept-Ranges: bytes
Content-Length: 10464
Connection: close
Content-Type: text/plain; charset=iso-8859-1
Yeah, that also seems to match the bug report mentioned in comment 45.  Can do that.

Comment 100

16 years ago
*** Bug 209343 has been marked as a duplicate of this bug. ***


16 years ago
Blocks: 135411

Comment 101

16 years ago
I do not think that this she Block the evangelism bugs, since there are probably
quite a few of those and fixing this won't fix those servers, merely hide the
Attachment #137485 - Attachment is obsolete: true
Attachment #137485 - Flags: superreview?(darin)
Attachment #137535 - Attachment is obsolete: true
Comment on attachment 138491 [details] [diff] [review]
Patch for trunk, updated to all comments

biesi, darin, could you review?
Attachment #138491 - Flags: superreview?(darin)
Attachment #138491 - Flags: review?(cbiesinger)
Attachment #138491 - Flags: review?(cbiesinger) → review+

Comment 104

16 years ago
Comment on attachment 138491 [details] [diff] [review]
Patch for trunk, updated to all comments

>Index: netwerk/mime/public/nsMimeTypes.h

>+#define APPLICATION_GUESS_TYPE_FROM_EXTENSION "application/x-vnd.mozilla.guess-type-from-extension"

i think this mime type is excessively verbose.	how about
one of these instead:


>Index: netwerk/streamconv/converters/nsUnknownDecoder.h

i think this class deserves some decription.  please explain
why you are overriding DetermineContentType, etc.

>+#define NS_BINARYDETECTOR_CID                        \
>+{ /* a2027ec6-ba0d-4c72-805d-148233f5f33c */         \
>+    0xa2027ec6,                                      \
>+    0xba0d,                                          \
>+    0x4c72,                                          \
>+    {0x80, 0x5d, 0x14, 0x82, 0x33, 0xf5, 0xf3, 0x3c} \
>+class nsBinaryDetector : public nsUnknownDecoder
>+  nsBinaryDetector()
>+  {}
>+  virtual ~nsBinaryDetector()
>+  {}
>+  virtual void DetermineContentType(nsIRequest* aRequest);

seems like you should be able to compact the class
declaration down to this:

  class nsBinaryDetector : public nsUnknownDecoder
    virtual void DetermineContentType(nsIRequest* aRequest);

>Index: uriloader/base/nsURILoader.cpp

>+  if (httpChannel && mContentType.IsEmpty()) {
>+    // This is our initial dispatch, and this is an HTTP channel.  Check for
>+    // the text/plain mess.
>+    nsCAutoString contentType;
>+    httpChannel->GetResponseHeader(NS_LITERAL_CSTRING("Content-Type"),
>+                                   contentType);
>+    // Make sure to do a case-sensitive exact match comparison here.  Apache
>+    // 1.x just sends text/plain for "unknown", while Apache 2.x sends
>+    // text/plain with a ISO-8859-1 charset.  Debian's Apache version, just to
>+    // be different, sends text/plain with iso-8859-1 charset.  Don't do
>+    // general case-insensitive comparison, since we really want to apply this
>+    // crap as rarely as we can.
>+    if (contentType.Equals(NS_LITERAL_CSTRING("text/plain")) ||
>+        contentType.Equals(
>+             NS_LITERAL_CSTRING("text/plain; charset=ISO-8859-1")) ||
>+        contentType.Equals(
>+             NS_LITERAL_CSTRING("text/plain; charset=iso-8859-1"))) {

i'm not sure i follow why you wouldn't want to just do a case-insensitive
compare against "text/plain; charset=iso-8859-1" 
the only difference is that you would accept "Text/Plain..." as well.
that doesn't seem like a big deal, right?

Attachment #138491 - Flags: superreview?(darin) → superreview+
> application/x-vnd.mozilla.guess-from-ext

Sounds good.  Will do that.

> i think this class deserves some decription.

Done.  Also compacted the class decl as suggested.

> i'm not sure i follow why you wouldn't want to just do a case-insensitive
> compare against "text/plain; charset=iso-8859-1" 

Because if the type is _not_ "text/plain", it's not just Apache being dumb --
it's someone purposefully sending that type.  In which case we really should
follow what the server said.
Attachment #138491 - Attachment is obsolete: true
Checked in that patch.
Last Resolved: 16 years ago
Resolution: --- → FIXED
Blocks: 226024
*** Bug 207154 has been marked as a duplicate of this bug. ***
*** Bug 218372 has been marked as a duplicate of this bug. ***
*** Bug 233465 has been marked as a duplicate of this bug. ***
*** Bug 233678 has been marked as a duplicate of this bug. ***

Comment 112

16 years ago
Is there an option to disable this behaviour ? Even a hidden pref ...
(This is important for example if we actually want to see the data....)
There is no such option.  I would rather implement a "view in browser" option on
the helper app dialog (which we need to do anyway) than waste time on a pref for
Blocks: 233047

Comment 114

15 years ago
I just sent myself a mail with a RTF attachment for which my Windows XP system
did not have a Content-Type set in the registry. The MUA sent the file with a
Content-Type of application/x-vnd.mozilla.guess-from-ext; I would have expected
that the Content-Type were application/octet-stream, since this is the
Content-Type for "unknown"/unclassified data.

Using Thunderbird Nightly 2004-03-21-00-trunk on Windows XP.

Requesting to reopen bug.

Comment 115

15 years ago
(In reply to comment #114)
> Requesting to reopen bug.

This bug is fixed but has caused a regression so therefore it would be better to
file a new bug.
(In reply to comment #114)
> I just sent myself a mail with a RTF attachment for which my Windows XP system
> did not have a Content-Type set in the registry. The MUA sent the file with a
> Content-Type of application/x-vnd.mozilla.guess-from-ext; 

heh. please file a new bug for that, and mention the bug# here.

Comment 117

15 years ago
I submitted bug 238706.

Comment 118

15 years ago
It is still sometimes opening .rar files in a new window for me. I also cannot
right-click and save some .rar files. 

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113

If I go to this site http://www.dustydunes.com/ for example and click on the
link for: http://ygs.pwp.blueyonder.co.uk/matt/maps/de_seaside2k.rar 

then Mozilla opens the contents in a new window instead of downloading them.

If I go to http://zveris007.boom.ru/ for example and click the link for:

the Mozilla opens the contents in a new window instead of downloading them.
Furthermore, if I right-click on the link and click "Save Link Target As..." a
box pops up that says:

"The link could not be saved. The web page might have been removed or had its
name changed."

But obviously it is there, since Mozilla could get it in text format. The
availability of those files is questionable (for some reason?). Sometimes IE6
goes to the website's 404 not found page if I click directly on the link (and
sometimes it downloads). IE6 also flags it as unavailable if you right-click and
go to "Save Target as..."

*I make no claims to the security of those sites or files, I just searched for
.rar files to demonstrate.

Yarrrr I'm a n00b at this reporting business!

Comment 119

15 years ago
this is the intended behavior, see the "solving most incorrect MIME type
issues".  see comment #39.

Comment 120

15 years ago
(In reply to comment #119)
> this is the intended behavior, see the "solving most incorrect MIME type
> issues".  see comment #39.

The intended behaviour is "display garbage". Great.

Comment 121

15 years ago
"Display garbage" is better than "Open a gaping security hole". In any case,
this bug is resolved. If you want to suggest that we break standards, discuss it
on http://www.mozillazine.org or file a new bug.

Comment 123

15 years ago
I think I've found another "safe to assume without breaking standards" case.  If
the server sends text/plain but the A HREF link contains a
TYPE="application/java-archive", then would it not be safe to assume that 1) the
linked file is a java archive, or 2) the server and web page disagree, so it's a
situation where you should be prompted?

The reason I'm asking is that in a hosted environment web page authors may not
have control over the server MIME types, and they may wish to use type="xxx" in
cases where the server is wrong.

This could also be a totally different bug, but I think it's a valid request for
improvement.  Mozilla 1.7 RC1 currently always takes the server as gospel
regardless of the TYPE indicated in the link.
Please file a separate bug on that.  That may be worth doing... Note that we
already support type in general, so this could be done without too much pain,


15 years ago
Blocks: 266410
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.