Last Comment Bug 220807 - [FIX]Prompt user about invalid text/plain content - solving most incorrect MIME type issues
: [FIX]Prompt user about invalid text/plain content - solving most incorrect MI...
Status: RESOLVED FIXED
:
Product: Core Graveyard
Classification: Graveyard
Component: File Handling (show other bugs)
: Trunk
: All All
: P2 enhancement with 14 votes (vote)
: mozilla1.7alpha
Assigned To: Boris Zbarsky [:bz] (still a bit busy)
: Hixie (not reading bugmail)
:
Mentors:
: 207154 209343 218372 221877 233465 fe (view as bug list)
Depends on:
Blocks: 233047 266410 135411 226024 285517
  Show dependency treegraph
 
Reported: 2003-09-30 10:02 PDT by Daniel Jackson
Modified: 2016-06-22 12:16 PDT (History)
46 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
Ok, I got bored (14.49 KB, patch)
2003-12-05 10:47 PST, Boris Zbarsky [:bz] (still a bit busy)
no flags Details | Diff | Splinter Review
Updated to comments (15.35 KB, patch)
2003-12-06 12:25 PST, Boris Zbarsky [:bz] (still a bit busy)
cbiesinger: review+
Details | Diff | Splinter Review
bz's patch plus a little more (10.41 KB, patch)
2003-12-10 21:40 PST, Ben Goodger (use ben at mozilla dot org for email)
cbiesinger: review-
bzbarsky: superreview-
Details | Diff | Splinter Review
newer patch (12.12 KB, patch)
2003-12-15 14:37 PST, Ben Goodger (use ben at mozilla dot org for email)
no flags Details | Diff | Splinter Review
new patch (12.13 KB, patch)
2003-12-15 15:00 PST, Ben Goodger (use ben at mozilla dot org for email)
no flags Details | Diff | Splinter Review
Patch for FB_0_8_BRANCH *ONLY* (12.14 KB, patch)
2003-12-16 16:22 PST, Ben Goodger (use ben at mozilla dot org for email)
no flags Details | Diff | Splinter Review
Patch for trunk, updated to all comments (18.73 KB, patch)
2004-01-06 10:20 PST, Boris Zbarsky [:bz] (still a bit busy)
cbiesinger: review+
darin.moz: superreview+
Details | Diff | Splinter Review
Updated to darin's comments (20.67 KB, patch)
2004-01-07 19:51 PST, Boris Zbarsky [:bz] (still a bit busy)
no flags Details | Diff | Splinter Review

Description Daniel Jackson 2003-09-30 10:02:49 PDT
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6a) Gecko/20030923 Firebird/0.7+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6a) Gecko/20030923 Firebird/0.7+

When Firebird receives content which the server has set to the MIME type
"text/plain", but the data contains invalid ASCII control characters (as can be
identified from IETF RFC 2046), Firebird should prompt the user as to what to do
with the data rather than attempt to display it immediately.

This behaviour is intended to avoid the common problem of attempting to download
binary data from mis-configured web-servers; and solves it *without* breaking
Firebird's compliance with MIME type handling.


The content of a "text/plain" document, according to IETF RFC 2046 (Section 4.1.2): 

  ...
  Note that the control characters including DEL (0-31, 127) 
  have no defined meaning in apart from the combination CRLF 
  (US-ASCII values 13 and 10) indicating a new line. Two of 
  the characters have de facto meanings in wide use: FF (12) 
  ... and TAB or HT (9) ... Aside from these conventions, 
  any use of the control characters or DEL in a body must 
  either occur
    (1) because a subtype of text other than "plain" 
        specifically assigns some additional meaning, or
    (2) within the context of a private agreement between 
        the sender and recipient. Such private agreements 
        are discouraged and should be replaced by the other 
        capabilities of this document.
  [RFC 2046]


Thus, the detection can be made from a stream sent as the "text/plain"
MIME-type, yet the first few bytes contain one or more ASCII values in the range
0-9 or 14-31 (or, possibly, 127 - but I'm sure some codepages use it when they
shouldn't).  


Reproducible: Always

Steps to Reproduce:
1. Find a website with, e.g. a .rar file which has no MIME type 
   associated with .rar files
2. Click on the link.

Actual Results:  
Firebird starts displaying the erroneous text/plain stream in the browser -- the
user sees complete garbage on their screen.


Expected Results:  
(The server doesn't recognise the file type so, typically, claims it is
"text/plain".)

This stream could be identified (from the first few bytes) as containing invalid
text/plain characters (in accordance with IETF RFC 2046), and the end user could
be prompted with this information.

The end-user could then be prompted as to what to do with this file - e.g. save
it to disk (or display it anyway)


I believe the FTP component of Firbird has a mechanism to detect binary data.  


Example Problem:

An example where this is a common problem is: .RAR archive files stored on a
website where the webmaster has no direct control over MIME type association.



Alternatives:

Many veteran users say "Just right-click and Save-As" or "Wait until the
document loads fully, then save it all".  The former is impossible if the
download was accessed by filling in a form or as the result of a browser
redirect.  The latter is impractical to expect a novice user to understand and
be satisfied with.



Background:

The problem is not with Firebird, but with mis-configured servers.  However, it
is important to bear in mind:

 * most servers appear to default to text/plain for unknown file types

 * new binary file types will be unknown

 * these unknown files will be sent as text/plain

 * in many cases, the webmaster cannot or will not directly control the MIME
type association to correct this problem (either the site is hosted and they
have no access, or may feel the majority of their (IE) users have no problem as
other browsers use MIME-type detection)



Precident:

It's a similar situation to if the server sent random bytes claiming they were
"image/gif".  You wouldn't expect the browser to display a load of rubbish,
you'd expect it to notice that something's wrong and perhaps display a "broken
image" icon and the ALT text instead.  This proposal isn't all that dissimilar:
FB should cope with invalid text/plain, and the way it should cope should be by
offering to save it instead -- this has the enormous benefit of also being a
workaround for the many misconfigured servers out there.



Multi-lingual notes:

The default codepage for text/plain is "US-ASCII".  Other languages may specify
a different codepage or use UTF-8 encoding of Unicode.  However, these facts
make no difference to the interpretation of the control characters.

I don't know whether, for example, Unicode UTF-16 would actually be valid for
"text/plain".  If it is, it would be easy to check for the FF EF (? - or
vice-versa) unicode header.



Suggested UI behaviour:

If it appears likely that the stream is not "text/plain", the user should be
prompted with a dialog similar to one that would be displayed if the content
were "application/octet-stream":
   +------------------------------------
   |
   |  You have chosen to open
   |    filename.rar
   |
   |  from: http://www.website.com/
   |
   |  which has been sent as plain text, but it appears to be binary data.
   |
   |  What should Mozilla Firebird do with this file?
   |
   |    (*) Display as plain text
   |    ( ) Open with [UnRAR    |V]
   |    ( ) Save to Disk
   |
   |    [ ] Do this automatically for files like this from now on.
   |
   |                     [OK]  [CANCEL]
   |
   +------------------------------------

- the changes being the "Display as plain text" option, and that the "open with"
application could come from any applications on the user's system that are
registered for that file extension.

The "Do this automatically" option is undoubtedly one for debate.  It could be
used to always force this behaviour for that file extension, or perhaps only for
future false text/plain detections.  The user could then manage their choices as
additions in the Tools|Options|Downloads panel -- but, in this case, by managing
from the file extension rather than the MIME type.  (Aside: why doesn't
Tools|Options|Downloads currently display extensions *instead of* MIME types?)



In short:

The problem of misconfigured servers sending binary data as text/plain is
unlikely to go away for several years.

Many people agree something needs to be done to make Firebird more usable in
this respect

However, most binary files sent as text/plain are invalid for that format,
Firebird could allow the user to choose what to do with these files.

This is the best solution as far as I can see it, without violating the specs.
Comment 1 Josh 2003-09-30 10:11:51 PDT
I just want to post to ask the devlopers to seriously consider this instead of
just calling it another Mimetype bug and WONTFIX'ing it. This specific
implementation does not technically deviate from WC3 standards, could be added
as something that can be enabled/disabled by user preference, and would give
users a way to handle misconfigured webservers (which would make a *LOT* of
people happy);
Comment 2 Jesse Ruderman 2003-10-02 16:04:15 PDT
See also bug 57342, Add "View as Text" option for unknown mime content-type.
Comment 3 Warren TenBrook 2003-10-15 08:40:23 PDT
Also see Camino bug 221877.  The only problem I forsee is the Firebird proposal
uses sniffing to tell the user that text-plain is erroneous.  This appears to
violate the standards, since mozilla.org has said we should trust the server
(see  bug 175848 and RFC 2068, section 7.2.1).

A different approach is to have the dialog state the file is indeed text-plain,
so the server is trusted, but then give the user some options other than
rendering in the window.  It is a subtle change, but could be important for
compliance.
Comment 4 Boris Zbarsky [:bz] (still a bit busy) 2003-10-16 15:37:20 PDT
How does the proposed approach handle text/plain content in non-ASCII encodings?
 Say in UTF-16?
Comment 5 Daniel Jackson 2003-10-16 16:17:23 PDT
Regarding Comment #3

WRT calling it "content sniffing", I still believe the analogy of loading an
erroneous image is valid; you don't expect it a browser to display garbage on
your screen if the server claims a file is an image but it doesn't actually
contain an image header.  I guess the only concern is *when* the start of the
"text/plain" stream is checked.  If you look at it as checking the first few
bytes to see if it can be displayed before actually displaying it, it really
doesn't seem any different to the handling of images.

However, the idea of always prompting about text/plain is a very good
alternative to the proposal.  

There aren't *that* many text/plain files that a user does want to see per day,
but to prevent it annoying people, perhaps there can be a list of file
extensions that the user always wants to be displayed in the browser
(pre-installed with ".txt" etc.)



Regarding Comment #4

I'm no expert, but I'm not even sure the specs I read are valid for UTF-16. 
Regardless, it (and UTF-32) are easily identified by the byte-order mark (BOM)
at the start of the file (e.g. somthing like 'FF EF').  

If UTF-16 is valid for text/plain, then the "invalid text/plain" check wouldn't
be performed on a text/plain stream if it starts with a valid BOM.

If UTF-16 isn't valid for text/plain, and the check on the first few bytes
determined a stream was invalid text/plain, then the BOM could be checked-for
and, if present, the file could be displayed anyway.



Small correction to original post: 

At the end of "Suggested UI behaviour": I was suppose to be asking, as an aside,
"why *DOES* Tools|Options|Downloads currently display (file) extensions instead
of MIME types"?  This placement almost makes it appear as if those actions will
be performed on files with those extensions irrespective of MIME type.

Comment 6 Christian :Biesinger (don't email me, ping me on IRC) 2003-10-16 16:26:20 PDT
> it (and UTF-32) are easily identified by the byte-order mark (BOM)

that does not need to be there
Comment 7 Dominik Geisler 2003-11-12 09:38:22 PST
Could I ask on the developer opinion on this one? There are quite some votes,
and it hasn't been WONTFIXed yet. However, there's also not been any response.
Is this normal? (sorry, I've just tracked a few bugs so far.)
Comment 8 Boris Zbarsky [:bz] (still a bit busy) 2003-11-12 09:50:00 PST
The developers responsible for the MIME dispatch back end raised an issue with
the proposed approach that has not been addressed (comment 4).

At the same time, this bug is filed as a UI bug on the Firebird browser, not as
a core bug.  Which means that it should only be resolved by a Firebird developer.
Comment 9 Jeff Van Boxtel 2003-11-14 15:49:02 PST
Can this solution be implemented only for text/plain documents in the ASCII
encoding?
Or the solution could still be implemented for all cases and just change the
wording on the dialog to say something like "... which has been sent as plain
text, but appears to be either binary data or non-ascii encoding."

The user could still choose to handle it whichever way they prefer and
optionally choose to have future occurences handled the same way.
Comment 10 Christian :Biesinger (don't email me, ping me on IRC) 2003-11-14 15:53:42 PST
>"... which has been sent as plain
>text, but appears to be either binary data or non-ascii encoding."

are you aware that there are very few languages for which ASCII suffices?
Comment 11 Jeff Van Boxtel 2003-11-14 19:56:04 PST
Now I'm just a common user, so I don't know all that much about this stuff, but
in the situation where the text/plain is being specified as the default MIME
type because of a misconfigured server, isn't it normal that the encoding is
either unspecifed or us-ascii? If there is any other encoding or character-set
wouldn't those be specifically identified in the header? This feature of
detecting possible mislabeled MIME types should still be possible in the cases
of us-ascii or non-specified character sets. While using this only in the case
of text/plain + us-ascii may not correct all of the occurences, I think it would
go a long way towards solving the problem.

Even if finding a safe way to determine if a file might be binary or not is
blocking this feature, I think it should still be implemented as an option. For
example, if I did not want my web browser to handle text/plain content, but
would rather such files open in my text-editor, I should be able to do that
either by selecting the default action the first time I navigate to such a file,
or by changing the default behavior in the preferences dialog.
Comment 12 Boris Zbarsky [:bz] (still a bit busy) 2003-11-14 20:13:59 PST
> For example, if I did not want my web browser to handle text/plain content

That's a separate issue from this bug...

The problem with comment 11 is the same as with the previous comments.  Say I am
serving up a Japanese plaintext document.  I server it up as text/plain.  I do
not send a charset header (usual operating procedure and all).  It renders just
fine on everyone's browsers because of charset autodetection.

How do we avoid breaking this scenario?
Comment 13 Jeff Van Boxtel 2003-11-15 01:00:36 PST
According to this source: http://www.faqs.org/rfcs/rfc2046.html
"The default character set, which must be assumed in the absence of a charset
parameter, is US-ASCII." and also "messages containing characters coded
according to other versions of ISO 646 than US-ASCII and the 1991 IRV, or using
code-switching procedures (e.g., those of ISO 2022), as well as 8bit or multiple
octet character encodings MUST use an appropriate character set specification to
be consistent with MIME."
So in the case of US-ASCII (or nothing sepcified, which is probably the most
common case in the case of a misconfigured server) we check the data to see if
it is binary and prompt the user. In all other cases we can trust the
character-set specified in the header.
Comment 14 Boris Zbarsky [:bz] (still a bit busy) 2003-11-15 01:05:49 PST
> According to this source: http://www.faqs.org/rfcs/rfc2046.html

Which is ignored by as many, if not more, people as the the HTTP RFCs....
Comment 15 Boris Zbarsky [:bz] (still a bit busy) 2003-11-15 01:15:54 PST
By the way, I think we're all agreed that the only document that should get
special treatment of any sort are text/plain documents coming in over HTTP with
no charset specified.  The remaining questions are:

1)  What fraction of such documents are actually ASCII plaintext?
2)  What fraction of such documents are actually random "binary" data?
3)  What fraction of such documents are non-ascii plaintext?

If the answers to these show that category #3 is not too big, we could try to
make some changes to URILoader to handle this ("could" as in "I won't mind if
someone does").  Have to put some thought into how the whole dialog-posing mess
would work... That's sort of tied in to the general channel-suspendability
stuff, unless we plan to force the dialog modal and hope for the best.
Comment 16 Christian :Biesinger (don't email me, ping me on IRC) 2003-11-15 06:24:48 PST
re comment 13: I seem to recall that HTTP specifies a default charset of
ISO-8859-1...

>That's sort of tied in to the general channel-suspendability stuff

that needs to be fixed anyway...
Comment 17 Dan Tobias 2003-11-15 08:46:50 PST
If you serve a Japanese (or other non-ASCII) document as text/plain with no
charset header, then you're relying on browser error-correction to properly
render it.  However, from a strict standards standpoint, charset "sniffing" is
just as evil as content-type sniffing, isn't it?

On another tangent, is it proper by the standards to serve a data stream as
"text/plain" if it intentionally uses ASCII control characters in their defined
(though little used these days) meanings in the original 1960s-era ASCII
standards?  I remember from my '80s college days common use of "cute tricks"
like carriage returns without linefeeds, backspaces, and the bell character (#7)
to do gimmicky animations and special effects, for instance in "plan" files
displayed from the "finger" command on timesharing computers.  Then, once dialup
BBSs became commonplace, they used lots of ANSI escape codes, but I guess those
wouldn't count as "plain text".  I myself sometimes use the delimiter characters
from #28 to #31 (FS, GS, RS, US) to separate fields in formatted data sent as
the output of one program for the purpose of being processed by another; what
MIME type would be appropriate for such a thing?
Comment 18 Boris Zbarsky [:bz] (still a bit busy) 2003-11-15 09:35:43 PST
> However, from a strict standards standpoint, charset "sniffing" is just as evil

Sure.   Which is why I'm willing to break it if the other case is a lot more common.

> what MIME type would be appropriate for such a thing?

Probably a text/x-something mimetype unless you register a subtype of text/.
Comment 19 Dominik Geisler 2003-11-22 11:05:03 PST
I think that the problem of, for example, UTF-16 text formats is not that
important. Since the problem usually occurs only on a very limited number of
file types, the browser could be programmed only to scan the text for files with
.rar, .ace and a few of the other problematic file types. 
The file would only be declared application/octet stream if both the extension
AND the file sniffling AND the user interaction suggests that the MIME type is
specified wrongly. So you have

MIME type                   | Action
==================================================================
text/plain                  | display

octet stream                | save

text/plain with octet       | 'guessed' text -> display
extension (e.g. .rar)       | 'guessed' octet & user wants display -> Display
                            | 'guessed' octet & user wants save -> save

So even rar-files with UTF-16 text would be displayed properly, although that
would require user interaction. I doubt there many UTF-16-rar-plain-text-files
though, I ever doub there is one such file on the internet. And even that file
would be handled correctly by the intelligent user. 

In bug 126782, a detection method has been programmed in the fix
(http://bugzilla.mozilla.org/attachment.cgi?id=135573&action=view). Could that
one be used here?
Comment 20 Boris Zbarsky [:bz] (still a bit busy) 2003-11-22 11:12:38 PST
> I think that the problem of, for example, UTF-16 text formats is not that
> important.

You're also very clearly not East Asian...  Please don't impose your
ethnocentric biases on the whole world.  There are a lot more people to whom
that problem is important than live in all of Europe.  I'd still like some
numbers in response to my question in comment 15.  Sweeping generalizations are
not numbers.

> Since the problem usually occurs only on a very limited number of
> file types, the browser could be programmed only to scan the text for files
> with .rar, .ace and a few of the other problematic file types. 

No.  There will be no hardcoding of extensions in this code.

The bug 126782 mechanism is exactly what would be used here. That is, if we
decide the "text/plain" type is "unreliable" we would simply run the unknown
type detection code on it instead.
Comment 21 Dominik Geisler 2003-11-23 06:10:53 PST
Boris Zbarsky, 

With the hardcoding of extensions, UTF-16 would not be a big issue. I am not
'ethnocentrically biased', that's why I made this comment. If you hardcode
extensions, you wouldn't have to program all non-ASCII encodings and still have
a solution working perfectly. As I said, only plain text files .RAR files from
asia would require user interaction. I thought everyone could live with that, as
I could even live with automatically declaring .rar files octet/stream (but
since developers won't, I won't bother with that). 

Besides, if you refuse to hardcode anything, all files, e.g. .html, would have
to be scanned in ASCII, UTF-7, UTF-8, UTF-16, etc. until a fitting encoding is
found. I can't see why that would be a good and fast solution. 
Comment 22 Christian :Biesinger (don't email me, ping me on IRC) 2003-11-23 06:51:57 PST
> Besides, if you refuse to hardcode anything

as I understood bz, the text/plain MIME type would be hardcoded. so .html would
not be scanned unless they were sent as such a type.

let me add that I would love to mark this bug wontfix.
Comment 23 Boris Zbarsky [:bz] (still a bit busy) 2003-11-23 09:13:13 PST
Dominik, why are you so focused on .rar files?  This is a problem with a lot
more than those...

The reason I refuse to hardcode extensions is that when a new extension appears
(like .dmg, for example), none of the web servers will know about it so will
send it as text/plain AND Mozilla will treat it as text/plain.  Not useful,
really.  Yeah, we could play whack-a-mole and have older Mozillas not be able to
deal with these new extensions, but that's not a good alternative in my mind.
Comment 24 Daniel Jackson 2003-11-24 06:27:42 PST
Just to clarify: the "bug" described would work perfectly in correctly identify
invalid "text/plain".  The text/plain MIME type is only supposed to be valid for
ASCII-compatible octet streams (e.g. any ASCII codepage or Unicode as UTF-8).

The only issue we have, is that some servers *incorrectly* serve
non-ASCII-compatible text as text/plain (when it's supposed to be
text/something-else) and with no "charset" part in the MIME-type.

However, it is the case that many other character encodings are actually
upwardly-compatible with ASCII, in that they don't use (at least most of) the
ASCII control codes to encode characters.  

For example, if I understand Shift-JIS encoding properly, then just by accepting
the escape character (27) as a permitted control character, Shift-JIS would also
work with this proposed scheme.  I believe many other character sets would too.

So the issue is with non-ASCII compatible character sets that make use of the
octet values 0-8, 14-26, or 28-31.  Which character sets are actually affected
by this?

UTF-16 or UTF-32 *incorrectly* sent as text/plain would indeed use ASCII control
codes, but I thought these encodings typically had a byte-order-mark at the
start of the file.  If they don't, then what would the browser do with them
anyway - it should really display them as US-ASCII (and I can't see a manual
"UTF-16" encoding option in Firebird anyway).


The real question is:  How often is this actually a problem? 

 - Plain text files are only a small fraction of what most users view

 - then the fraction of those files in a charset that uses octet 
   values {0-8, 14-26, 28-31} (or no BOM if UTF-16/32)

 - then the fraction of those files incorrectly sent as "text/plain" 
   and with no "charset" header


If it's quite rare, then would it really be a problem if the user was
(correctly) informed the content is invalid "text/plain", and they can select
"view as text" anyway.  (I assume they'll also have to select an encoding
manually as the server didn't tell them what the file was, perhaps they could
have a combobox for the encoding next to "view as text", or even perform any
charset auto-detection wizardry if, and only if, it's got this far).


One other mitigating feature could be the UI option:

  "[ ] Do this automatically for files with this extension from now on."

(this could probably integrate nicely into the list of file types under Tools |
Options | Download)


For example, this could allow the user to build up their own personal list of
preferences to:

  * Always "Save" files that are incorrectly sent as "text/plain" 
    which have the extension ".rar"

  * Always "View as text" files that are incorrectly sent as "text/plain" 
    which have the extension ".txt"

(Note: Absolutely no "hard-coding" of extensions, just allowing the user to have
a per-extension preference for dealing with files incorrectly marked as text/plain).
Comment 25 Jeff Van Boxtel 2003-12-03 13:00:02 PST
> 1)  What fraction of such documents are actually ASCII plaintext?
According to the standard, all such documents should be. And for the most part
this is true, although I've been watching lately, and I see ISO-8859-1 instead
of US-ASCII most of the time.
> 2)  What fraction of such documents are actually random "binary" data?
This is a small percentage of misconfigured servers.
> 3)  What fraction of such documents are non-ascii plaintext?
None. As is defined by the standard. I've looked and I could not find a single
situation like this. If anyone else has ever seen it, please correct me.

I don't know of anyway to collect scientific data on these kind of numbers,
otherwise I would do it. The biggest point about this idea, as Daniel is trying
to express I believe, is that it prompts the user in cases of ambiguity so that
the user can decide how to handle the situation. It shouldn't matter so much how
often these cases happen because this feature would give the user an option of
how to handle it in the future. It doesn't break any functionality, the worst it
could do is bother the user once with a dialog to choose "handle this way
always". The default action could even be to handle it as it is handled now.
Comment 26 Boris Zbarsky [:bz] (still a bit busy) 2003-12-03 13:15:11 PST
> The default action could even be to handle it as it is handled now.

Not any time in the near future... that would depend on a "view in browser"
option being available in the dialog.

>> 3)  What fraction of such documents are non-ascii plaintext?
>None. As is defined by the standard. I've looked

If I may ask, where did you look?

I agree this is hard to solidly quantify -- hence the quandary.
Comment 27 Jeff Van Boxtel 2003-12-03 13:38:46 PST
I sometimes browse various Japanese websites and I have never seen such an
occurence. In addition, I did some google searches on .jp domains for txt, text,
etc and came up with several text/plain in Shift-JIS such as:

http://rawfish.at.infoseek.co.jp/txt/log0000/00000001.txt
http://www.rtpro.yamaha.co.jp/RT/docs/relnote/Rev.05.01/relnote_05_01_14.txt
http://www.iodata.jp/lib/product/c/cdv_104.txt
http://www.iodata.jp/lib/product/c/cdv_102.txt

Or EUC:
http://sky-mue.jp/linux/plamo/plamo-HOWTO.txt
http://kwatch.at.infoseek.co.jp/web/http.txt
http://www.suplex.gr.jp/~hourin/signature.txt

And even some English on Japanese servers in ISO-8859-1:
http://geta.ex.nii.ac.jp/getaN2002/release/drep_200202.txt

but wasnt able to find any as described. Ofcourse, I didn't search for other
languages since I can't read them, so I could be overlooking a huge sample of data.
Comment 28 S Woodside 2003-12-04 22:10:37 PST
The nice thing about this approach in general (good idea!) is like bz said: If
you have a file with MIME-type text/plain and no text-encoding set, you're out
of the standards zone and into the Guess Zone. That means that any and all kinds
of sniffing are *permissable* (according to spec).

We could at that point special-case things. For example, we have very commonly
this problem with .dmg files on the Mac. So I would definitely think that if the
file is in the Guess Zone, sniffing for certain features would definitely be OK.
Is there any way to detect a UTF-8 or UTF-16 file? Can we sniff for a DMG or an
ASF file (those are commonly unset on servers).

My point is that this bug defines a Guess Zone. So now we can finally
special-case certain highly-complained-about types like DMGs without breaking
the specs.
Comment 29 Boris Zbarsky [:bz] (still a bit busy) 2003-12-04 22:39:59 PST
> If you have a file with MIME-type text/plain and no text-encoding set, you're
> out of the standards zone and into the Guess Zone.

Actually, that's not true.  The standards clearly say that this means the text
is ASCII-encoded.

In any case, per discussion with biesi and darin, the following is the plan:

If incoming data meets the following criteria:

1)  HTTP channel
2)  "text/plain" request header (literal, case-sensitive comparison)
3)  This is the initial content dispatch

Then reset the content-type on the channel to "application/x-maybe-text" (or
some such).  Necko will provide a stream converter (probably just subclass or
superclass of nsUnknownDecoder) that will take that type as input and apply the
following heuristics:

A)  If the data starts with a BOM, treat it as text/plain
B)  Otherwise, apply the "last ditch sniff" algorithm from nsUnknownDecoder to
    detect it as either text/plain or application/octet-stream.  Note that this
    will NOT detect the data as any other types (in particular as any types we
    would render inline).

Embeddors are of course free to register their own converters for the relevant
contractids and override the converter Necko registers.

I may get to this by the 1.7a cycle if no one posts a patch earlier.

There are other various bugs in bugzilla that should probably be dependent on
this one... (eg there is a similar Camino bug).
Comment 30 Daniel Jackson 2003-12-05 08:31:36 PST
bz: That's great news!

Although, may I suggest also allowing char "27" in the "IS_TEXT_CHAR" macro from
nsUnknownDecoder as this would then allow Shift-JIS (and probably other
non-ASCII, yet ASCII-compatible) encodings to be treated as text, and not binary.  
(I know they will not be strictly text/plain, but that will be what was intended)

Comment 31 Boris Zbarsky [:bz] (still a bit busy) 2003-12-05 08:50:13 PST
Yeah, we should do that.
Comment 32 Boris Zbarsky [:bz] (still a bit busy) 2003-12-05 10:47:20 PST
Created attachment 136882 [details] [diff] [review]
Ok, I got bored
Comment 33 Boris Zbarsky [:bz] (still a bit busy) 2003-12-05 10:48:04 PST
Comment on attachment 136882 [details] [diff] [review]
Ok, I got bored

biesi?	Darin?	Thoughts?
Comment 34 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 00:33:08 PST
I applied this patch then attempted to load kerz's "Gold Standard" "Idiot on
Silverado" wmv test:

http://static1.stileproject.com/jump/4e46800c366751035c3a771fb7f50137/help6.wmv

It still displays as text. 
Comment 35 Jason Kersey 2003-12-06 00:42:28 PST
For the good of the web, we must be able to display this movie.  Thank's for
your work on this so far...
Comment 36 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 00:43:49 PST
Yep.  That server sends:

  Content-Type: text/plain; charset=ISO-8859-1

which fails test #2 in comment 29.  _If_ this is the default Apache setup, I
would be willing to change test #2 to also check for this literal string....
Comment 37 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 00:49:06 PST
To make it clear, if that's the default misconfiguration Apache ships with, I'm
willing to work around it. If this server has gone out of its way to be more
misconfigured than the average server, then that's a different issue entirely.
Comment 38 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 00:55:04 PST
So, if it meets criteria #2 (sending text/plain) should that not mean your "Last
Ditch Sniff" should detect binary data? What does server configuration have to
do with that?
Comment 39 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 00:58:08 PST
> So, if it meets criteria #2 (sending text/plain)

It does NOT mean criterion #2.  Please read what I said again.  Server
configuration has everything to do with this.  I am implementing a workaround
for a bug in the default configuration of the most popular web server out there.
 I am _not_ attempting to solve all possible ways in which someone can fuck up a
web server to lie to web browsers.

Now I suspect that the default config of Apache may end up sending Content-Type
headers in some cases.  If that's the case, we need to rethink criterion #2 to
account for that a bit...
Comment 40 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 01:03:44 PST
Oh, I didn't realize your test #2 was explicit "text/plain" (and thus would not
catch "text/plain *" 
Comment 41 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 01:11:27 PST
Right.  The whole point is that "text/plain; charset=Big5", eg, should not go
through the sniffer, since it would be incorrectly detected as binary...
Comment 42 scratch 2003-12-06 01:24:31 PST
i can't say for sure that it's the *default*, but all testcases I can find do
indeed send:

Content-Type: text/plain; charset=ISO-8859-1
Comment 43 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 01:30:04 PST
I ran:

 http://www.bengoodger.com/video/funny_cats.wmv

through the http://www.rexswain.com/ 's HTTP viewer, and it reports a
Content-Type field of "text/plain" but a build bz's patch still shows raw data. 

I don't know enough about Apache to say if this affects anything, but my bog
standard Apache install on this machine has this line in httpd.conf:

AddDefaultCharset ISO-8859-1
Comment 44 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 01:42:38 PST
> http://www.bengoodger.com/video/funny_cats.wmv

My patched build is detecting this as application/octet-stream.... you did
rebuild in all the relevant places (uriloader/, netwerk/, docshell/build/), right?
Comment 45 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 01:50:29 PST
ooh, not docshell. thanks. 

kerz points out this bug in the apache database:

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23421
Comment 46 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 01:58:53 PST
ah, I see.  So that's a server version issue... sounds like we need to check for
both of those literal strings.
Comment 47 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 02:03:20 PST
OK, I've properly tried your patch now and it works like a charm, except for the
old OSHelperAppService bugs which I can look at separately (on Windows it's
showing the helper for WMV files as "WMVFile" which I think implies it's looking
for the pretty app name in the wrong spot in the windows registry). 
Comment 48 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-06 10:46:07 PST
ben: yes... the oshelperappservice uses HKCR\.<ext>\@ as the helper app
description, and HKCR\<thatvalue>\@ as mime type description (where @ means
"default value")

do you have a better suggestion?
Comment 49 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-06 10:56:16 PST
Comment on attachment 136882 [details] [diff] [review]
Ok, I got bored

+      httpChannel->SetContentType(
+		     
NS_LITERAL_CSTRING("application/vnd.mozilla.maybe-text"));

could you maybe add a log comment around here, that we're checking if this is
really text?

also... is application/vnd.mozilla.maybe-text a good type to use? It's not a
registered type per 
ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/application/ (hardly
surprising :) ), I think  you should use some x- type.

+  void DetermineContentType(nsIRequest* aRequest);

hm... can you add a "virtual" here, to make clear that this is a virtual
function?

+  (((unsigned char)ch) > 31
thank you :)

+  // First, check for a BOM.
you have no check for a UTF-32 BOM... doesn't mozilla support that?

+CreateNewBinaryDetectorFactory(nsISupports *aOuter, REFNSIID aIID, void
**aResult)
is there a reason to avoid NS_GENERIC_FACTORY_CONSTRUCTOR?
Comment 50 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-06 11:06:03 PST
> could you maybe add a log comment 

err, I meant "a LOG statement"
Comment 51 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 11:20:51 PST
> could you maybe add a log comment around here,

Sure thing.

> is application/vnd.mozilla.maybe-text a good type to use?

Good question.  I'll change it to a x- type that should still not cause random
namespace collisions.

> can you add a "virtual" here,

Yep.

> you have no check for a UTF-32 BOM

True.  I'll add it -- I finally convinced myself this can be done in a safe way.

> is there a reason to avoid NS_GENERIC_FACTORY_CONSTRUCTOR?

Following the style of the file... no other reason.

Comment 52 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-06 12:14:19 PST
putting this finally in browser:File handling where it belongs; also -> bz
Comment 53 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 12:25:52 PST
Created attachment 136945 [details] [diff] [review]
Updated to comments
Comment 54 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-06 12:37:16 PST
Comment on attachment 136945 [details] [diff] [review]
Updated to comments

+	 (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0) || //
UCS-4LE
this will already match the UTF-16LE check, so this line can be removed

+	 (buf[0] == 0xFE && buf[1] == 0xFF && buf[2] == 0 && buf[3] == 0)) { //
UCS-4

this will match the UTF1-6BE check, so it can also be removed

with the lines removed, r=me
Comment 55 scratch 2003-12-06 13:31:33 PST
> ben: yes... the oshelperappservice uses HKCR\.<ext>\@ as the helper app
> description, and HKCR\<thatvalue>\@ as mime type description (where @ means
> "default value")
> 
> do you have a better suggestion?

filetype = HKCR\.<ext>\@
defaultapp = HKCR\<filetype>\shell\@
helperapp = HKCR\<filetype>\shell\<defaultapp>
helperappdescription = HKCU\Software\Microsoft\Windows\ShellNoRoam\<helperapp>

If helperappdescription does not exist, then we should display helperapp.  This
is the default windows behavior.  Displaying either the application name if
available or the application filename if not is much more intuitive than
displaying the internal filetype name and saying that's the program you're
opening it with.
Comment 56 Boris Zbarsky [:bz] (still a bit busy) 2003-12-06 13:33:30 PST
Let's have a separate bug on the Windows thing, ok?  ;)
Comment 57 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 14:35:29 PST
Christian, I'll do some research and file another bug. 
Comment 58 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-06 15:24:29 PST
ben: scratch filed bug 227671 about the application name question
Comment 59 Ben Goodger (use ben at mozilla dot org for email) 2003-12-06 16:54:45 PST
OK, this new patch works great with the "Idiot on Silverado" video
(Content-Type: text/plain; charset=ISO-8859-1). Please don't attempt to load the
link from my comment above anymore, it now seems to display undesirable content
o_o;;

Comment 60 Hixie (not reading bugmail) 2003-12-07 03:59:14 PST
So... what does this patch do to our results on these tests?:
   http://www.hixie.ch/tests/adhoc/http/content-type/

What does this faking of the MIME type do to Page Info and so forth? I really
hate how we sometimes show our autodetected MIME types as if it was the server type.

Anyway. Interesting solution. Separate from that though we do IMHO still need the
   View | Content Type
...submenu (analogous to the View | Encoding submenu for the analogous problem).
Comment 61 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-07 05:17:39 PST
Ian: the behaviour on those files does not change with this patch

For files where behaviour does change, the helper app dialog will be shown and
indicate a type of application/octet-stream.
(hence, page info never comes into play)

hmm... why does mozilla display application/x-view-source files?
Comment 62 Boris Zbarsky [:bz] (still a bit busy) 2003-12-07 08:46:19 PST
> why does mozilla display application/x-view-source files?

Because we have a handler registered for that content type -- that's how we do
view-source, y'now.  ;)
Comment 63 Ben Goodger (use ben at mozilla dot org for email) 2003-12-10 21:40:22 PST
Created attachment 137227 [details] [diff] [review]
bz's patch plus a little more

This adds to bz's patch an extension-based lookup in
nsExternalHelperAppService::DoContent as a last ditch attempt to locate the One
True Type from the OS. 

bz/biesi - I've included the MIME type application/x-vnd.moz.maybe-binary as a
literal string in both cases, please advise if there's a common header or
something where it'd be better defined.
Comment 64 Boris Zbarsky [:bz] (still a bit busy) 2003-12-11 01:18:29 PST
Ben, did you base that on the patch _before_ I addressed some review comments?

The type could go in nsMimeTypes.h, I suppose.
Comment 65 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-11 05:40:14 PST
Comment on attachment 137227 [details] [diff] [review]
bz's patch plus a little more

this does not address most parts of comment 49...

also:
+  if (aMimeContentType == "application/x-vnd.mozilla.maybe-binary") {

aMimeContentType is "const char*". this comparison does therefore not do what
you want.

+    nsXPIDLCString mimeType;
+    if (mimeInfo)
+      mimeInfo->GetMIMEType(getter_Copies(mimeType));

can you move this into the "if (!fileExtension.IsEmpty())" block?

(except the nsXPIDLCString that should stay outside of course)

I would also like a LOG statement to show what mime type the nsIMIMEInfo has
stored.
Comment 66 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-11 14:18:33 PST
oh, about the mimetype question... yes, you could put it in nsMimeTypes.h as bz
suggested, or just use the literal string... personally, I don't care either way
Comment 67 Ben Goodger (use ben at mozilla dot org for email) 2003-12-11 14:39:23 PST
Huh. I'm pretty sure I reversed the original patch and then applied the newer
one. I must have messed something up. *tries again* 
Comment 68 Ben Goodger (use ben at mozilla dot org for email) 2003-12-15 14:37:17 PST
Created attachment 137481 [details] [diff] [review]
newer patch

This one is really based on bz's second patch, with biesi's comments rolled in.
Comment 69 Ben Goodger (use ben at mozilla dot org for email) 2003-12-15 14:43:09 PST
Comment on attachment 137481 [details] [diff] [review]
newer patch

oops. one moment. disregard this patch.
Comment 70 Ben Goodger (use ben at mozilla dot org for email) 2003-12-15 15:00:49 PST
Created attachment 137485 [details] [diff] [review]
new patch
Comment 71 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-15 15:24:48 PST
Comment on attachment 137485 [details] [diff] [review]
new patch

+  if (!nsCRT::strcmp(aMimeContentType, APPLICATION_MAYBE_BINARY)) {

do we guarantee anywhere that aMimeContentType is lowercase? it doesn't look
like it.

please use nsCRT::strcasecmp.

+    mContentType = APPLICATION_MAYBE_BINARY;

huh? will that do any good?
Comment 72 Ben Goodger (use ben at mozilla dot org for email) 2003-12-15 15:56:41 PST
Christian - I replaced the place where the content type was set to octet-stream
with maybe-binary. I assumed that bz's original patch relied on this code to set
to octet-stream so that the UCT dialog would be presented, and so switched to
maybe-binary. 

bz - did I misinterpret your origianl patch?
Comment 73 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-15 16:02:57 PST
oh. hmmm.
hmmmmm.

well I fear that at that point, this will cause another streamconverter to get
instantiated, so you get into an infinite loop...

bz knows this code better than me, though, I'll let him comment.
(maybe we need two mimetypes - one to trigger the check for non-ascii
characters, and one to tell the helper app dialog to get the type from the
extension)
Comment 74 Boris Zbarsky [:bz] (still a bit busy) 2003-12-15 18:37:13 PST
Comment on attachment 137485 [details] [diff] [review]
new patch

In the future, patches using the -p switch and more context (-u8 at least) are
much easier to review....  And there was no need for -w here, either, imo.

>+  (((unsigned char)ch) > 31 || (9 <= ch && ch <= 13) || ch == 27)

I know I wrote this code... but could you put () around "ch" everywhere in that
line?  So:

(((unsigned char)(ch)) > 31 || (9 <= (ch) && (ch) <= 13 || (ch) == 27)

>+    mContentType = APPLICATION_MAYBE_BINARY;

biesi, this part is correct.  We do have two types involved here -- maybe-text
and maybe-binary.  The former triggers this stream converter, the latter is
magic to the external helper app handler.  Someone could indeed create a stream
converter for maybe-binary, but that's ok if they want to do that....

>+  // the type off the channel.
>     nextLink->mContentType = aOutContentType;
>-  }

This is indented correctly, right?  Damn -w.  ;)

>+        LOG(("OS-Provided mime type '%s' for extension '%s'\n", 
>+             fileExtension.get(), mimeType.get()));

That should pass mimeType.get() first, no?

sr=bzbarsky on Ben's additional changes with that addressed.  The rest of the
patch still needs sr (and moa from darin), so not marking the sr flag myself...
Comment 75 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-16 07:01:11 PST
>We do have two types involved here -- maybe-text
>and maybe-binary.

er, oops - I missed that. :(
but, why is maybe-binary called maybe-binary? don't we know that it is binary
(because it failed the "is text" check)?
Comment 76 Boris Zbarsky [:bz] (still a bit busy) 2003-12-16 08:09:39 PST
Better name suggestions?  maybe-octet-stream?
Comment 77 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-16 13:54:47 PST
application/x-guess-type-from-extension?
Comment 78 alanjstr 2003-12-16 14:01:08 PST
There's no guarantee that there will be an extension.  Streamed videos or
content management systems, for example.  I suppose we could do
application/x-guess-type-from-extension and fall into x-maybe-octet-stream if
that fails, too.
Comment 79 Mike Pinkerton (not reading bugmail) 2003-12-16 14:16:29 PST
*** Bug 221877 has been marked as a duplicate of this bug. ***
Comment 80 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-16 14:35:22 PST
> There's no guarantee that there will be an extension.

indeed. please read the patch. the question is just how to name that thing.
Comment 81 Ben Goodger (use ben at mozilla dot org for email) 2003-12-16 16:22:42 PST
Created attachment 137535 [details] [diff] [review]
Patch for FB_0_8_BRANCH *ONLY*

I've rolled in biesi and bz's comments except the content type name change into
this patch which I plan to check in on the 0.8 Branch. We can continue to
discuss the content type and check the further modified patch in for 1.7a
Comment 82 Ben Goodger (use ben at mozilla dot org for email) 2003-12-16 16:25:21 PST
Christian... should that be application/x-vnd.mozilla.guess-type-from-extension ?
Comment 83 Dan Tobias 2003-12-16 17:53:13 PST
I don't think any MIME type needs both a "x-" prefix and a "vnd." one... "x-"
means "unregistered", and no further structure is necessary after that.  ("vnd."
is for registered MIME types in the vendor subtree.)
Comment 84 Boris Zbarsky [:bz] (still a bit busy) 2003-12-16 20:06:47 PST
> don't think any MIME type needs both a "x-" prefix and a "vnd." one... "x-"
> means "unregistered", and no further structure is necessary after that.

Except if you're trying to be a good citizen and ensure that your x- crap won't
collide with anyone else's x- crap.  I made the type what I made it very
purposefully.

I don't really care what we name the type, as long as we clearly document in
nsMIMETypes.h what its "magic" effect is and as long as it starts with
"x-vnd.mozilla."
Comment 85 Hixie (not reading bugmail) 2003-12-17 05:11:47 PST
are we really sure using magic MIME types is the way to go here? :-)

/me starts planning tests that use those MIME types to see what happens
Comment 86 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-17 06:01:37 PST
> application/x-vnd.mozilla.guess-type-from-extension

that sounds good.
Comment 87 Boris Zbarsky [:bz] (still a bit busy) 2003-12-17 06:16:59 PST
> are we really sure using magic MIME types is the way to go here? :-)

Yes, we are.  It's the simplest way to do it, and they are not very magic (you
could set up a plugin or other internal handler for any of the types involved
any time you wanted to, eg; you just can't set up a helper for the
application/x-vnd.mozilla.guess-type-from-extension type).  And the name was
carefully chosen such that you better not expect it to work in any particular way.
Comment 88 Matthew Mastracci 2003-12-18 06:48:02 PST
I've read through the entire discussion and it looks like things are shaping up
alright.  I would highly recommend taking steps preventing any active content
from loading automatically based on mime-type sniffing (offering to download is
okay, however).  

Fortunately, the patch here seems to only apply by making files sent as
text/plain appear with a download dialog.  

Here are some articles from the Freenet-devel mailing list discussing how
mime-type autodetection can affect privacy: 

http://article.gmane.org/gmane.network.freenet.devel/4939/match=internet+explorer 
http://article.gmane.org/gmane.network.freenet.devel/1580/match=autodetect+mime

It's impossible to use IE in any sort of anonymous way on Freenet (and I assume
other anonymous networks) due to "mime type sniffing".  You can easily provide
an anonymity-destroying extension file and pass that as an image!

Just a couple of points to keep in mind.
Comment 89 Boris Zbarsky [:bz] (still a bit busy) 2003-12-18 07:15:27 PST
Comment 88 has nothing to do with this bug.  This bug is about the main content
area load.  The issues mentioned in comment 88 have to do with subsidiary
script/plugin loads to which the code discussed in this bug does not apply.

Please don't make irrelevant comments like that; file separate bugs on separate
issues.  If you're not sure it's a separate issue, file a separate bug and add a
dependency.
Comment 90 alanjstr 2003-12-21 10:42:08 PST
There appears to be a case sensitivity issue with the charset.

http://forums.mozillazine.org/viewtopic.php?t=41893

Comment 91 Boris Zbarsky [:bz] (still a bit busy) 2003-12-21 13:17:52 PST
Yeah, yeah.  See what I wrote about what this patch is aiming to do (quoted in
that thread).  The case-sensitive comparison is very much purposeful.
Comment 92 alanjstr 2003-12-21 13:24:14 PST
Yes, I see that you're doing it case-sensitive because you're handling it
exactly as Apache 2.x sends it.  And my question is:  why not make it case
insensitive?  Forgive me for being dumb, but how do we handle TEXT/plain or
text/Plain?  I'm not trying to say we should handle every single possibility.
Comment 93 Christian :Biesinger (don't email me, ping me on IRC) 2003-12-21 13:31:54 PST
the point is: this is only done to work around the default apache
misconfiguration, and not against anything else.
Comment 94 Ben Basson 2003-12-21 16:33:05 PST
That may well be the point, but if you're going to the lengths to implement a
fix like this, surely you might as well add a simple case conversion statement,
and then everyone is even happier.
Comment 95 Hixie (not reading bugmail) 2003-12-21 16:45:22 PST
Not everyone. This entire codepath is a standards violation, we absolutely want
to keep the violating to the strict minimum required to hit the most common case.

If people go out of their way to misconfigure their servers even further, then
tough -- it won't work.
Comment 96 alanjstr 2003-12-21 16:47:48 PST
Well, if we limit the check to the first 10 characters we can ignore the charset
regardless.
Comment 97 Boris Zbarsky [:bz] (still a bit busy) 2003-12-23 20:51:07 PST
The simple reason to NOT make this a case-insensitive check is that I'd rather
like to avoid detecting actual valid ISO-8859-1 text content as
application/octet-stream.  Sorry, but that check is staying case-sensitive.

alanjstr, if we limit the check to the first ten chars, then we will fail on
type strings like in comment 41.  See also the first 20 or so comments in this
bug for more background.

One more comment from someone who clearly hasn't bothered to carefully read the
bug, and I'm reassigning this to whoever did the commenting.
Comment 98 Donnie Roberts 2003-12-25 12:49:09 PST
Please add the charset "iso-8859-1" (lowercased) as it is the default on my
Debian/Apache server as well. The only mention of the the charset in httpd.conf is:

# Default charset to iso-8859-1 (http://www.apache.org/info/css-security/).
AddDefaultCharset on

HTTP/1.1 200 OK
Date: Thu, 25 Dec 2003 20:41:32 GMT
Server: Apache/1.3.29 (Debian GNU/Linux)
Last-Modified: Sat, 22 Nov 2003 23:38:46 GMT
ETag: "1ed36-28e0-3fbff386"
Accept-Ranges: bytes
Content-Length: 10464
Connection: close
Content-Type: text/plain; charset=iso-8859-1
Comment 99 Boris Zbarsky [:bz] (still a bit busy) 2003-12-25 13:00:16 PST
Yeah, that also seems to match the bug report mentioned in comment 45.  Can do that.
Comment 100 Joe Infla 2003-12-31 08:54:39 PST
*** Bug 209343 has been marked as a duplicate of this bug. ***
Comment 101 alanjstr 2003-12-31 11:59:25 PST
I do not think that this she Block the evangelism bugs, since there are probably
quite a few of those and fixing this won't fix those servers, merely hide the
symptoms.
Comment 102 Boris Zbarsky [:bz] (still a bit busy) 2004-01-06 10:20:16 PST
Created attachment 138491 [details] [diff] [review]
Patch for trunk, updated to all comments
Comment 103 Boris Zbarsky [:bz] (still a bit busy) 2004-01-06 10:21:39 PST
Comment on attachment 138491 [details] [diff] [review]
Patch for trunk, updated to all comments

biesi, darin, could you review?
Comment 104 Darin Fisher 2004-01-07 11:57:24 PST
Comment on attachment 138491 [details] [diff] [review]
Patch for trunk, updated to all comments

>Index: netwerk/mime/public/nsMimeTypes.h

>+#define APPLICATION_GUESS_TYPE_FROM_EXTENSION "application/x-vnd.mozilla.guess-type-from-extension"

i think this mime type is excessively verbose.	how about
one of these instead:

  application/x-vnd.mozilla.guess-from-ext
  application/x-vnd.mozilla.from-extension
  application/x-vnd.mozilla.from-ext


>Index: netwerk/streamconv/converters/nsUnknownDecoder.h

i think this class deserves some decription.  please explain
why you are overriding DetermineContentType, etc.

>+#define NS_BINARYDETECTOR_CID                        \
>+{ /* a2027ec6-ba0d-4c72-805d-148233f5f33c */         \
>+    0xa2027ec6,                                      \
>+    0xba0d,                                          \
>+    0x4c72,                                          \
>+    {0x80, 0x5d, 0x14, 0x82, 0x33, 0xf5, 0xf3, 0x3c} \
>+}
>+
>+class nsBinaryDetector : public nsUnknownDecoder
>+{
>+public:
>+  nsBinaryDetector()
>+  {}
>+
>+protected:
>+  virtual ~nsBinaryDetector()
>+  {}
>+  
>+  virtual void DetermineContentType(nsIRequest* aRequest);
>+};

seems like you should be able to compact the class
declaration down to this:

  class nsBinaryDetector : public nsUnknownDecoder
  {
  protected:
    virtual void DetermineContentType(nsIRequest* aRequest);
  };



>Index: uriloader/base/nsURILoader.cpp

>+  if (httpChannel && mContentType.IsEmpty()) {
>+    // This is our initial dispatch, and this is an HTTP channel.  Check for
>+    // the text/plain mess.
>+    nsCAutoString contentType;
>+    httpChannel->GetResponseHeader(NS_LITERAL_CSTRING("Content-Type"),
>+                                   contentType);
>+    // Make sure to do a case-sensitive exact match comparison here.  Apache
>+    // 1.x just sends text/plain for "unknown", while Apache 2.x sends
>+    // text/plain with a ISO-8859-1 charset.  Debian's Apache version, just to
>+    // be different, sends text/plain with iso-8859-1 charset.  Don't do
>+    // general case-insensitive comparison, since we really want to apply this
>+    // crap as rarely as we can.
>+    if (contentType.Equals(NS_LITERAL_CSTRING("text/plain")) ||
>+        contentType.Equals(
>+             NS_LITERAL_CSTRING("text/plain; charset=ISO-8859-1")) ||
>+        contentType.Equals(
>+             NS_LITERAL_CSTRING("text/plain; charset=iso-8859-1"))) {

i'm not sure i follow why you wouldn't want to just do a case-insensitive
compare against "text/plain; charset=iso-8859-1" 
the only difference is that you would accept "Text/Plain..." as well.
that doesn't seem like a big deal, right?


sr=darin
Comment 105 Boris Zbarsky [:bz] (still a bit busy) 2004-01-07 19:49:33 PST
> application/x-vnd.mozilla.guess-from-ext

Sounds good.  Will do that.

> i think this class deserves some decription.

Done.  Also compacted the class decl as suggested.

> i'm not sure i follow why you wouldn't want to just do a case-insensitive
> compare against "text/plain; charset=iso-8859-1" 

Because if the type is _not_ "text/plain", it's not just Apache being dumb --
it's someone purposefully sending that type.  In which case we really should
follow what the server said.
Comment 106 Boris Zbarsky [:bz] (still a bit busy) 2004-01-07 19:51:41 PST
Created attachment 138583 [details] [diff] [review]
Updated to darin's comments
Comment 107 Boris Zbarsky [:bz] (still a bit busy) 2004-01-07 21:22:45 PST
Checked in that patch.
Comment 108 Christian :Biesinger (don't email me, ping me on IRC) 2004-01-15 02:26:00 PST
*** Bug 207154 has been marked as a duplicate of this bug. ***
Comment 109 Christian :Biesinger (don't email me, ping me on IRC) 2004-01-28 17:15:54 PST
*** Bug 218372 has been marked as a duplicate of this bug. ***
Comment 110 Boris Zbarsky [:bz] (still a bit busy) 2004-02-08 17:50:11 PST
*** Bug 233465 has been marked as a duplicate of this bug. ***
Comment 111 Christian :Biesinger (don't email me, ping me on IRC) 2004-02-10 11:27:17 PST
*** Bug 233678 has been marked as a duplicate of this bug. ***
Comment 112 Xanthor Sccas 2004-02-24 11:19:48 PST
Is there an option to disable this behaviour ? Even a hidden pref ...
(This is important for example if we actually want to see the data....)
Comment 113 Boris Zbarsky [:bz] (still a bit busy) 2004-02-24 11:20:58 PST
There is no such option.  I would rather implement a "view in browser" option on
the helper app dialog (which we need to do anyway) than waste time on a pref for
this.
Comment 114 Alexander Skwar 2004-03-25 08:22:01 PST
I just sent myself a mail with a RTF attachment for which my Windows XP system
did not have a Content-Type set in the registry. The MUA sent the file with a
Content-Type of application/x-vnd.mozilla.guess-from-ext; I would have expected
that the Content-Type were application/octet-stream, since this is the
Content-Type for "unknown"/unclassified data.

Using Thunderbird Nightly 2004-03-21-00-trunk on Windows XP.

Requesting to reopen bug.
Comment 115 Alex Bishop 2004-03-25 12:48:08 PST
(In reply to comment #114)
> Requesting to reopen bug.

This bug is fixed but has caused a regression so therefore it would be better to
file a new bug.
Comment 116 Christian :Biesinger (don't email me, ping me on IRC) 2004-03-25 14:21:23 PST
(In reply to comment #114)
> I just sent myself a mail with a RTF attachment for which my Windows XP system
> did not have a Content-Type set in the registry. The MUA sent the file with a
> Content-Type of application/x-vnd.mozilla.guess-from-ext; 

heh. please file a new bug for that, and mention the bug# here.


Comment 117 Alexander Skwar 2004-03-25 15:20:08 PST
Okay.
I submitted bug 238706.
Comment 118 Ben Apgar 2004-03-29 13:34:57 PST
It is still sometimes opening .rar files in a new window for me. I also cannot
right-click and save some .rar files. 

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113

If I go to this site http://www.dustydunes.com/ for example and click on the
link for: http://ygs.pwp.blueyonder.co.uk/matt/maps/de_seaside2k.rar 

then Mozilla opens the contents in a new window instead of downloading them.

If I go to http://zveris007.boom.ru/ for example and click the link for:
http://zveris007.boom.ru/Plan2.part01.rar

the Mozilla opens the contents in a new window instead of downloading them.
Furthermore, if I right-click on the link and click "Save Link Target As..." a
box pops up that says:

"The link could not be saved. The web page might have been removed or had its
name changed."

But obviously it is there, since Mozilla could get it in text format. The
availability of those files is questionable (for some reason?). Sometimes IE6
goes to the website's 404 not found page if I click directly on the link (and
sometimes it downloads). IE6 also flags it as unavailable if you right-click and
go to "Save Target as..."

*I make no claims to the security of those sites or files, I just searched for
.rar files to demonstrate.

Yarrrr I'm a n00b at this reporting business!
Comment 119 scratch 2004-03-29 14:46:00 PST
this is the intended behavior, see the "solving most incorrect MIME type
issues".  see comment #39.
Comment 120 Ben Apgar 2004-04-04 11:25:23 PDT
(In reply to comment #119)
> this is the intended behavior, see the "solving most incorrect MIME type
> issues".  see comment #39.

The intended behaviour is "display garbage". Great.
Comment 121 Ben Basson 2004-04-04 11:30:15 PDT
"Display garbage" is better than "Open a gaping security hole". In any case,
this bug is resolved. If you want to suggest that we break standards, discuss it
on http://www.mozillazine.org or file a new bug.
Comment 122 Christian :Biesinger (don't email me, ping me on IRC) 2004-04-05 09:14:55 PDT
(this caused Bug 238706)
Comment 123 Keith Briscoe 2004-04-25 22:49:08 PDT
I think I've found another "safe to assume without breaking standards" case.  If
the server sends text/plain but the A HREF link contains a
TYPE="application/java-archive", then would it not be safe to assume that 1) the
linked file is a java archive, or 2) the server and web page disagree, so it's a
situation where you should be prompted?

The reason I'm asking is that in a hosted environment web page authors may not
have control over the server MIME types, and they may wish to use type="xxx" in
cases where the server is wrong.

This could also be a totally different bug, but I think it's a valid request for
improvement.  Mozilla 1.7 RC1 currently always takes the server as gospel
regardless of the TYPE indicated in the link.
Comment 124 Boris Zbarsky [:bz] (still a bit busy) 2004-04-25 22:54:00 PDT
Please file a separate bug on that.  That may be worth doing... Note that we
already support type in general, so this could be done without too much pain,
probably.

Note You need to log in before you can comment on or make changes to this bug.