Closed Bug 220807 Opened 21 years ago Closed 21 years ago

[FIX]Prompt user about invalid text/plain content - solving most incorrect MIME type issues

Categories

(Core Graveyard :: File Handling, enhancement, P2)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.7alpha

People

(Reporter: email, Assigned: bzbarsky)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 7 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6a) Gecko/20030923 Firebird/0.7+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6a) Gecko/20030923 Firebird/0.7+

When Firebird receives content for which the server has set the MIME type "text/plain", but the data contains invalid ASCII control characters (as can be identified from IETF RFC 2046), Firebird should prompt the user as to what to do with the data rather than attempt to display it immediately. This behaviour is intended to avoid the common problem of attempting to download binary data from mis-configured web servers, and solves it *without* breaking Firebird's compliance with MIME type handling.

The content of a "text/plain" document, according to IETF RFC 2046 (Section 4.1.2):

  ... Note that the control characters including DEL (0-31, 127) have no
  defined meaning apart from the combination CRLF (US-ASCII values 13 and 10)
  indicating a new line. Two of the characters have de facto meanings in wide
  use: FF (12) ... and TAB or HT (9) ... Aside from these conventions, any use
  of the control characters or DEL in a body must either occur (1) because a
  subtype of text other than "plain" specifically assigns some additional
  meaning, or (2) within the context of a private agreement between the sender
  and recipient. Such private agreements are discouraged and should be replaced
  by the other capabilities of this document. [RFC 2046]

Thus, the detection can be made when a stream is sent with the "text/plain" MIME type, yet its first few bytes contain one or more ASCII values in the range 0-8 or 14-31 (or, possibly, 127 -- but I'm sure some codepages use it when they shouldn't).

Reproducible: Always

Steps to Reproduce:
1. Find a website hosting, e.g., a .rar file on a server which has no MIME type associated with .rar files.
2. Click on the link.

Actual Results:
Firebird starts displaying the erroneous text/plain stream in the browser -- the user sees complete garbage on their screen.

Expected Results:
(The server doesn't recognise the file type, so it typically claims the file is "text/plain".) This stream could be identified (from the first few bytes) as containing invalid text/plain characters (in accordance with IETF RFC 2046), and the end user could be given this information and prompted as to what to do with the file -- e.g. save it to disk (or display it anyway). I believe the FTP component of Firebird has a mechanism to detect binary data.

Example Problem:
An example where this is a common problem is .RAR archive files stored on a website where the webmaster has no direct control over MIME type association.

Alternatives:
Many veteran users say "Just right-click and Save As" or "Wait until the document loads fully, then save it all". The former is impossible if the download was accessed by filling in a form or as the result of a browser redirect. The latter is impractical to expect a novice user to understand and be satisfied with.

Background:
The problem is not with Firebird, but with mis-configured servers.
However, it is important to bear in mind that:
* most servers appear to default to text/plain for unknown file types
* new binary file types will be unknown
* these unknown files will be sent as text/plain
* in many cases, the webmaster cannot or will not directly control the MIME type association to correct this problem (either the site is hosted and they have no access, or they may feel the majority of their (IE) users have no problem, as other browsers use MIME-type detection)

Precedent:
It's a similar situation to the server sending random bytes while claiming they are "image/gif". You wouldn't expect the browser to display a load of rubbish; you'd expect it to notice that something's wrong and perhaps display a "broken image" icon and the ALT text instead. This proposal isn't all that dissimilar: Firebird should cope with invalid text/plain, and the way it should cope is by offering to save it instead -- this has the enormous benefit of also being a workaround for the many misconfigured servers out there.

Multi-lingual notes:
The default codepage for text/plain is "US-ASCII". Other languages may specify a different codepage or use UTF-8 encoding of Unicode. However, these facts make no difference to the interpretation of the control characters. I don't know whether, for example, Unicode UTF-16 would actually be valid for "text/plain". If it is, it would be easy to check for the FF FE (or FE FF) Unicode byte-order header.

Suggested UI behaviour:
If it appears likely that the stream is not "text/plain", the user should be prompted with a dialog similar to the one that would be displayed if the content were "application/octet-stream":

+------------------------------------
| You have chosen to open
|    filename.rar
| from: http://www.website.com/
|
| which has been sent as plain text, but it appears to be binary data.
|
| What should Mozilla Firebird do with this file?
|   (*) Display as plain text
|   ( ) Open with [UnRAR     |V]
|   ( ) Save to Disk
|
|   [ ] Do this automatically for files like this from now on.
|
|                        [OK]  [CANCEL]
+------------------------------------

The changes are the "Display as plain text" option, and that the "Open with" application could come from any applications on the user's system that are registered for that file extension.

The "Do this automatically" option is undoubtedly one for debate. It could be used to always force this behaviour for that file extension, or perhaps only for future false text/plain detections. The user could then manage their choices as additions in the Tools|Options|Downloads panel -- but, in this case, by managing from the file extension rather than the MIME type. (Aside: why doesn't Tools|Options|Downloads currently display extensions *instead of* MIME types?)

In short:
The problem of misconfigured servers sending binary data as text/plain is unlikely to go away for several years, and many people agree something needs to be done to make Firebird more usable in this respect. Since most binary files sent as text/plain are invalid for that format, Firebird could allow the user to choose what to do with these files. This is the best solution as far as I can see, and it does not violate the specs.
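[Editorial illustration -- not part of any attached patch.] A minimal standalone C++ sketch of the control-character test proposed above: scan the first bytes of the stream and treat the data as "probably not plain text" if it contains octets outside the allowed set. The function name and the idea of scanning only an initial window are assumptions made for this sketch.

  #include <cstddef>

  // Sketch only: returns true if the first `len` octets are consistent with
  // the RFC 2046 reading above -- printable characters and 8-bit bytes, plus
  // TAB, LF, VT, FF and CR.  NUL and the other C0 controls mark the data as
  // binary (DEL is tolerated, matching the reporter's caveat about codepages).
  static bool LooksLikePlainText(const unsigned char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
      const unsigned char ch = data[i];
      const bool allowed = (ch > 31) ||             // printable ASCII, high bytes
                           (ch >= 9 && ch <= 13);   // TAB, LF, VT, FF, CR
      if (!allowed) {
        return false;  // e.g. the NUL bytes near the start of RAR or ZIP data
      }
    }
    return true;
  }

The check that eventually landed lives in nsUnknownDecoder (see the review comments further down) and additionally tolerates ESC (27), so that escape-sequence based encodings are not misclassified.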
I just want to post to ask the developers to seriously consider this instead of just calling it another MIME-type bug and WONTFIX'ing it. This specific implementation does not technically deviate from W3C standards, could be added as something that can be enabled/disabled by user preference, and would give users a way to handle misconfigured web servers (which would make a *LOT* of people happy).
See also bug 57342, Add "View as Text" option for unknown mime content-type.
Also see Camino bug 221877. The only problem I foresee is that the Firebird proposal uses sniffing to tell the user that text/plain is erroneous. This appears to violate the standards, since mozilla.org has said we should trust the server (see bug 175848 and RFC 2068, section 7.2.1). A different approach is to have the dialog state that the file is indeed text/plain, so the server is trusted, but then give the user some options other than rendering in the window. It is a subtle change, but could be important for compliance.
How does the proposed approach handle text/plain content in non-ASCII encodings? Say in UTF-16?
Regarding Comment #3:
WRT calling it "content sniffing", I still believe the analogy of loading an erroneous image is valid; you don't expect a browser to display garbage on your screen if the server claims a file is an image but it doesn't actually contain an image header. I guess the only concern is *when* the start of the "text/plain" stream is checked. If you look at it as checking the first few bytes to see if the content can be displayed before actually displaying it, it really doesn't seem any different to the handling of images.

However, the idea of always prompting about text/plain is a very good alternative to the proposal. There aren't *that* many text/plain files that a user does want to see per day, but to prevent it annoying people, perhaps there can be a list of file extensions that the user always wants to be displayed in the browser (pre-installed with ".txt" etc.).

Regarding Comment #4:
I'm no expert, but I'm not even sure the specs I read are valid for UTF-16. Regardless, it (and UTF-32) are easily identified by the byte-order mark (BOM) at the start of the file (e.g. something like 'FF FE'). If UTF-16 is valid for text/plain, then the "invalid text/plain" check wouldn't be performed on a text/plain stream if it starts with a valid BOM. If UTF-16 isn't valid for text/plain, and the check on the first few bytes determined a stream was invalid text/plain, then the BOM could be checked for and, if present, the file could be displayed anyway.

Small correction to original post:
At the end of "Suggested UI behaviour" I was supposed to be asking, as an aside, "why *DOES* Tools|Options|Downloads currently display (file) extensions instead of MIME types?" This placement almost makes it appear as if those actions will be performed on files with those extensions irrespective of MIME type.
> it (and UTF-32) are easily identified by the byte-order mark (BOM)

that does not need to be there
Could I ask for a developer opinion on this one? There are quite a few votes, and it hasn't been WONTFIXed yet. However, there's also not been any response. Is this normal? (Sorry, I've only tracked a few bugs so far.)
The developers responsible for the MIME dispatch back end raised an issue with the proposed approach that has not been addressed (comment 4). At the same time, this bug is filed as a UI bug on the Firebird browser, not as a core bug, which means that it should only be resolved by a Firebird developer.
Can this solution be implemented only for text/plain documents in the ASCII encoding? Or could the solution still be implemented for all cases, just changing the wording of the dialog to say something like "... which has been sent as plain text, but appears to be either binary data or a non-ASCII encoding"? The user could still choose to handle it whichever way they prefer, and optionally choose to have future occurrences handled the same way.
>"... which has been sent as plain >text, but appears to be either binary data or non-ascii encoding." are you aware that there are very few languages for which ASCII suffices?
Now I'm just a common user, so I don't know all that much about this stuff, but in the situation where text/plain is being specified as the default MIME type because of a misconfigured server, isn't it normal that the encoding is either unspecified or us-ascii? If there is any other encoding or character set, wouldn't it be specifically identified in the header? This feature of detecting possibly mislabeled MIME types should still be possible in the cases of us-ascii or non-specified character sets. While using this only in the case of text/plain + us-ascii may not correct all of the occurrences, I think it would go a long way towards solving the problem.

Even if finding a safe way to determine whether a file might be binary is blocking this feature, I think it should still be implemented as an option. For example, if I did not want my web browser to handle text/plain content, but would rather such files open in my text editor, I should be able to do that either by selecting the default action the first time I navigate to such a file, or by changing the default behavior in the preferences dialog.
> For example, if I did not want my web browser to handle text/plain content

That's a separate issue from this bug...

The problem with comment 11 is the same as with the previous comments. Say I am serving up a Japanese plaintext document. I serve it up as text/plain. I do not send a charset header (usual operating procedure and all). It renders just fine in everyone's browsers because of charset autodetection. How do we avoid breaking this scenario?
According to this source: http://www.faqs.org/rfcs/rfc2046.html

"The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII."

and also:

"messages containing characters coded according to other versions of ISO 646 than US-ASCII and the 1991 IRV, or using code-switching procedures (e.g., those of ISO 2022), as well as 8bit or multiple octet character encodings MUST use an appropriate character set specification to be consistent with MIME."

So in the case of US-ASCII (or nothing specified, which is probably the most common case with a misconfigured server) we check the data to see if it is binary and prompt the user. In all other cases we can trust the character set specified in the header.
> According to this source: http://www.faqs.org/rfcs/rfc2046.html

Which is ignored by as many, if not more, people as the HTTP RFCs....
By the way, I think we're all agreed that the only documents that should get special treatment of any sort are text/plain documents coming in over HTTP with no charset specified. The remaining questions are:

1) What fraction of such documents are actually ASCII plaintext?
2) What fraction of such documents are actually random "binary" data?
3) What fraction of such documents are non-ASCII plaintext?

If the answers to these show that category #3 is not too big, we could try to make some changes to URILoader to handle this ("could" as in "I won't mind if someone does"). We'd have to put some thought into how the whole dialog-posing mess would work... That's sort of tied in to the general channel-suspendability stuff, unless we plan to force the dialog modal and hope for the best.
re comment 13: I seem to recall that HTTP specifies a default charset of ISO-8859-1...

> That's sort of tied in to the general channel-suspendability stuff

that needs to be fixed anyway...
If you serve a Japanese (or other non-ASCII) document as text/plain with no charset header, then you're relying on browser error-correction to properly render it. However, from a strict standards standpoint, charset "sniffing" is just as evil as content-type sniffing, isn't it?

On another tangent, is it proper by the standards to serve a data stream as "text/plain" if it intentionally uses ASCII control characters in their defined (though little used these days) meanings in the original 1960s-era ASCII standards? I remember from my '80s college days common use of "cute tricks" like carriage returns without linefeeds, backspaces, and the bell character (#7) to do gimmicky animations and special effects, for instance in "plan" files displayed from the "finger" command on timesharing computers. Then, once dialup BBSs became commonplace, they used lots of ANSI escape codes, but I guess those wouldn't count as "plain text".

I myself sometimes use the delimiter characters from #28 to #31 (FS, GS, RS, US) to separate fields in formatted data sent as the output of one program for the purpose of being processed by another; what MIME type would be appropriate for such a thing?
> However, from a strict standards standpoint, charset "sniffing" is just as evil

Sure. Which is why I'm willing to break it if the other case is a lot more common.

> what MIME type would be appropriate for such a thing?

Probably a text/x-something mimetype unless you register a subtype of text/.
I think that the problem of, for example, UTF-16 text formats is not that important. Since the problem usually occurs only on a very limited number of file types, the browser could be programmed to scan the text only for files with .rar, .ace and a few of the other problematic file types. The file would only be declared application/octet-stream if the extension AND the file sniffing AND the user interaction all suggest that the MIME type is specified wrongly. So you have:

MIME type                  | Action
===========================|====================================================
text/plain                 | display
octet stream               | save
text/plain with octet      | 'guessed' text -> display
extension (e.g. .rar)      | 'guessed' octet & user wants display -> display
                           | 'guessed' octet & user wants save -> save

So even .rar files containing UTF-16 text would be displayed properly, although that would require user interaction. I doubt there are many UTF-16-rar-plain-text files though; I even doubt there is one such file on the internet. And even that file would be handled correctly by the intelligent user.

In bug 126782, a detection method has been programmed in the fix (http://bugzilla.mozilla.org/attachment.cgi?id=135573&action=view). Could that one be used here?
> I think that the problem of, for example, UTF-16 text formats is not that
> important.

You're also very clearly not East Asian... Please don't impose your ethnocentric biases on the whole world. There are a lot more people to whom that problem is important than live in all of Europe.

I'd still like some numbers in response to my question in comment 15. Sweeping generalizations are not numbers.

> Since the problem usually occurs only on a very limited number of
> file types, the browser could be programmed only to scan the text for files
> with .rar, .ace and a few of the other problematic file types.

No. There will be no hardcoding of extensions in this code.

The bug 126782 mechanism is exactly what would be used here. That is, if we decide the "text/plain" type is "unreliable" we would simply run the unknown type detection code on it instead.
Boris Zbarsky,

With the hardcoding of extensions, UTF-16 would not be a big issue. I am not 'ethnocentrically biased'; that's why I made this comment. If you hardcode extensions, you wouldn't have to program all non-ASCII encodings and would still have a solution working perfectly. As I said, only plain-text .RAR files from Asia would require user interaction. I thought everyone could live with that, as I could even live with automatically declaring .rar files application/octet-stream (but since the developers won't, I won't bother with that).

Besides, if you refuse to hardcode anything, all files, e.g. .html, would have to be scanned in ASCII, UTF-7, UTF-8, UTF-16, etc. until a fitting encoding is found. I can't see why that would be a good and fast solution.
> Besides, if you refuse to hardcode anything

as I understood bz, the text/plain MIME type would be hardcoded. so .html files would not be scanned unless they were sent with that type.

let me add that I would love to mark this bug wontfix.
Dominik, why are you so focused on .rar files? This is a problem with a lot more than those...

The reason I refuse to hardcode extensions is that when a new extension appears (like .dmg, for example), none of the web servers will know about it, so they will send it as text/plain AND Mozilla will treat it as text/plain. Not useful, really. Yeah, we could play whack-a-mole and have older Mozillas not be able to deal with these new extensions, but that's not a good alternative in my mind.
Just to clarify: the "bug" described would work perfectly in correctly identifying invalid "text/plain". The text/plain MIME type is only supposed to be valid for ASCII-compatible octet streams (e.g. any ASCII codepage, or Unicode as UTF-8). The only issue we have is that some servers *incorrectly* serve non-ASCII-compatible text as text/plain (when it's supposed to be text/something-else) and with no "charset" part in the MIME type.

However, many other character encodings are actually upwardly compatible with ASCII, in that they don't use (at least most of) the ASCII control codes to encode characters. For example, if I understand Shift-JIS encoding properly, then just by accepting the escape character (27) as a permitted control character, Shift-JIS would also work with this proposed scheme. I believe many other character sets would too. So the issue is with non-ASCII-compatible character sets that make use of the octet values 0-8, 14-26, or 28-31. Which character sets are actually affected by this?

UTF-16 or UTF-32 *incorrectly* sent as text/plain would indeed use ASCII control codes, but I thought these encodings typically had a byte-order mark at the start of the file. If they don't, then what would the browser do with them anyway -- it should really display them as US-ASCII (and I can't see a manual "UTF-16" encoding option in Firebird anyway).

The real question is: how often is this actually a problem?
- Plain text files are only a small fraction of what most users view
- then take the fraction of those files in a charset that uses the octet values {0-8, 14-26, 28-31} (or has no BOM, if UTF-16/32)
- then the fraction of those files incorrectly sent as "text/plain" and with no "charset" header

If it's quite rare, then would it really be a problem if the user was (correctly) informed that the content is invalid "text/plain" and could select "View as text" anyway? (I assume they'll also have to select an encoding manually, as the server didn't tell them what the file was; perhaps they could have a combobox for the encoding next to "View as text", or even perform any charset auto-detection wizardry if, and only if, it's got this far.)

One other mitigating feature could be the UI option "[ ] Do this automatically for files with this extension from now on." (This could probably integrate nicely into the list of file types under Tools | Options | Downloads.) For example, this could allow the user to build up their own personal list of preferences to:
* always "Save" files with the extension ".rar" that are incorrectly sent as "text/plain"
* always "View as text" files with the extension ".txt" that are incorrectly sent as "text/plain"

(Note: absolutely no "hard-coding" of extensions, just allowing the user to have a per-extension preference for dealing with files incorrectly marked as text/plain.)
> 1) What fraction of such documents are actually ASCII plaintext?

According to the standard, all such documents should be. And for the most part this is true, although I've been watching lately, and I see ISO-8859-1 instead of US-ASCII most of the time.

> 2) What fraction of such documents are actually random "binary" data?

This is a small percentage of misconfigured servers.

> 3) What fraction of such documents are non-ascii plaintext?

None. As is defined by the standard. I've looked and I could not find a single situation like this. If anyone else has ever seen it, please correct me. I don't know of any way to collect scientific data on these kinds of numbers, otherwise I would do it.

The biggest point about this idea, as Daniel is trying to express I believe, is that it prompts the user in cases of ambiguity so that the user can decide how to handle the situation. It shouldn't matter so much how often these cases happen, because this feature would give the user an option of how to handle it in the future. It doesn't break any functionality; the worst it could do is bother the user once with a dialog to choose "handle this way always". The default action could even be to handle it as it is handled now.
> The default action could even be to handle it as it is handled now.

Not any time in the near future... that would depend on a "view in browser" option being available in the dialog.

>> 3) What fraction of such documents are non-ascii plaintext?
> None. As is defined by the standard. I've looked

If I may ask, where did you look? I agree this is hard to solidly quantify -- hence the quandary.
I sometimes browse various Japanese websites and I have never seen such an occurrence. In addition, I did some Google searches on .jp domains for txt, text, etc. and came up with several text/plain documents in Shift-JIS, such as:

http://rawfish.at.infoseek.co.jp/txt/log0000/00000001.txt
http://www.rtpro.yamaha.co.jp/RT/docs/relnote/Rev.05.01/relnote_05_01_14.txt
http://www.iodata.jp/lib/product/c/cdv_104.txt
http://www.iodata.jp/lib/product/c/cdv_102.txt

or EUC:

http://sky-mue.jp/linux/plamo/plamo-HOWTO.txt
http://kwatch.at.infoseek.co.jp/web/http.txt
http://www.suplex.gr.jp/~hourin/signature.txt

and even some English on Japanese servers in ISO-8859-1:

http://geta.ex.nii.ac.jp/getaN2002/release/drep_200202.txt

but I wasn't able to find any as described. Of course, I didn't search for other languages since I can't read them, so I could be overlooking a huge sample of data.
The nice thing about this approach in general (good idea!) is, like bz said: if you have a file with MIME-type text/plain and no text-encoding set, you're out of the standards zone and into the Guess Zone. That means that any and all kinds of sniffing are *permissible* (according to spec). We could at that point special-case things. For example, we very commonly have this problem with .dmg files on the Mac. So I would think that if the file is in the Guess Zone, sniffing for certain features would definitely be OK.

Is there any way to detect a UTF-8 or UTF-16 file? Can we sniff for a DMG or an ASF file (those are commonly unset on servers)?

My point is that this bug defines a Guess Zone. So now we can finally special-case certain highly-complained-about types like DMGs without breaking the specs.
> If you have a file with MIME-type text/plain and no text-encoding set, you're
> out of the standards zone and into the Guess Zone.

Actually, that's not true. The standards clearly say that this means the text is ASCII-encoded.

In any case, per discussion with biesi and darin, the following is the plan. If incoming data meets the following criteria:

1) HTTP channel
2) "text/plain" Content-Type response header (literal, case-sensitive comparison)
3) This is the initial content dispatch

then reset the content type on the channel to "application/x-maybe-text" (or some such). Necko will provide a stream converter (probably just a subclass or superclass of nsUnknownDecoder) that will take that type as input and apply the following heuristics:

A) If the data starts with a BOM, treat it as text/plain.
B) Otherwise, apply the "last ditch sniff" algorithm from nsUnknownDecoder to detect it as either text/plain or application/octet-stream. Note that this will NOT detect the data as any other type (in particular, not as any type we would render inline).

Embedders are of course free to register their own converters for the relevant contractids and override the converter Necko registers.

I may get to this by the 1.7a cycle if no one posts a patch earlier. There are various other bugs in Bugzilla that should probably be made dependent on this one... (e.g. there is a similar Camino bug).
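[Editorial sketch -- not the attached patch, and not real Necko/XPCOM calls.] To make criteria 1-3 and heuristic A above concrete, here is a standalone C++ approximation; the helper names are invented for illustration, and the literal header string is the one named in the plan.

  #include <cstddef>
  #include <string>

  // Criteria 1-3: only an HTTP channel, on its initial dispatch, whose
  // Content-Type response header is exactly "text/plain" gets re-typed
  // (to a placeholder such as "application/x-maybe-text") for sniffing.
  static bool ShouldRetypeAsMaybeText(bool isHttpChannel,
                                      bool isInitialDispatch,
                                      const std::string& contentTypeHeader) {
    return isHttpChannel &&                    // 1) HTTP channel
           isInitialDispatch &&                // 3) initial content dispatch
           contentTypeHeader == "text/plain";  // 2) literal, case-sensitive
  }

  // Heuristic A: a Unicode byte-order mark at the start of the body means the
  // data really is text; the UTF-16 BOMs are shown here, others work the same.
  static bool StartsWithBOM(const unsigned char* buf, std::size_t len) {
    return len >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||   // UTF-16BE
                        (buf[0] == 0xFF && buf[1] == 0xFE));    // UTF-16LE
  }

Heuristic B (the "last ditch sniff") is the per-octet test sketched after comment 0; the review comments below settle which BOMs the real converter checks and which literal header strings end up being accepted.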
bz: That's great news! Although, may I suggest also allowing char "27" in the "IS_TEXT_CHAR" macro from nsUnknownDecoder as this would then allow Shift-JIS (and probably other non-ASCII, yet ASCII-compatible) encodings to be treated as text, and not binary. (I know they will not be strictly text/plain, but that will be what was intended)
Yeah, we should do that.
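[Editorial note.] For readers who have not seen it, the IS_TEXT_CHAR macro in nsUnknownDecoder classifies individual octets during the last-ditch sniff. With ESC (27) admitted as requested above, its shape is approximately the following (the exact definition is in nsUnknownDecoder.h; the parenthesisation follows the review comment further down):

  // Approximation of the nsUnknownDecoder test: an octet counts as "text" if
  // it is outside the C0 control range, or is TAB, LF, VT, FF, CR or ESC (the
  // last so that escape-sequence based encodings are not flagged as binary).
  #define IS_TEXT_CHAR(ch) \
    (((unsigned char)(ch)) > 31 || ((ch) >= 9 && (ch) <= 13) || (ch) == 27)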
Attached patch Ok, I got bored (obsolete) — Splinter Review
Comment on attachment 136882 [details] [diff] [review] Ok, I got bored biesi? Darin? Thoughts?
Attachment #136882 - Flags: superreview?(darin)
Attachment #136882 - Flags: review?(cbiesinger)
I applied this patch then attempted to load kerz's "Gold Standard" "Idiot on Silverado" wmv test: http://static1.stileproject.com/jump/4e46800c366751035c3a771fb7f50137/help6.wmv It still displays as text.
For the good of the web, we must be able to display this movie. Thanks for your work on this so far...
Yep. That server sends:

Content-Type: text/plain; charset=ISO-8859-1

which fails test #2 in comment 29. _If_ this is the default Apache setup, I would be willing to change test #2 to also check for this literal string....
To make it clear, if that's the default misconfiguration Apache ships with, I'm willing to work around it. If this server has gone out of its way to be more misconfigured than the average server, then that's a different issue entirely.
So, if it meets criteria #2 (sending text/plain) should that not mean your "Last Ditch Sniff" should detect binary data? What does server configuration have to do with that?
> So, if it meets criteria #2 (sending text/plain)

It does NOT meet criterion #2. Please read what I said again.

Server configuration has everything to do with this. I am implementing a workaround for a bug in the default configuration of the most popular web server out there. I am _not_ attempting to solve all possible ways in which someone can fuck up a web server to lie to web browsers.

Now I suspect that the default config of Apache may end up sending such Content-Type headers in some cases. If that's the case, we need to rethink criterion #2 to account for that a bit...
Oh, I didn't realize your test #2 was an explicit "text/plain" (and thus would not catch "text/plain *").
Right. The whole point is that "text/plain; charset=Big5", eg, should not go through the sniffer, since it would be incorrectly detected as binary...
I can't say for sure that it's the *default*, but all testcases I can find do indeed send:

Content-Type: text/plain; charset=ISO-8859-1
I ran http://www.bengoodger.com/video/funny_cats.wmv through http://www.rexswain.com/ 's HTTP viewer, and it reports a Content-Type field of "text/plain", but a build with bz's patch still shows raw data.

I don't know enough about Apache to say if this affects anything, but my bog-standard Apache install on this machine has this line in httpd.conf:

AddDefaultCharset ISO-8859-1
> http://www.bengoodger.com/video/funny_cats.wmv

My patched build is detecting this as application/octet-stream.... you did rebuild in all the relevant places (uriloader/, netwerk/, docshell/build/), right?
ooh, not docshell. thanks.

kerz points out this bug in the Apache database: http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23421
ah, I see. So that's a server version issue... sounds like we need to check for both of those literal strings.
OK, I've properly tried your patch now and it works like a charm, except for the old OSHelperAppService bugs which I can look at separately (on Windows it's showing the helper for WMV files as "WMVFile" which I think implies it's looking for the pretty app name in the wrong spot in the windows registry).
ben: yes... the oshelperappservice uses HKCR\.<ext>\@ as the helper app description, and HKCR\<thatvalue>\@ as mime type description (where @ means "default value")

do you have a better suggestion?
Comment on attachment 136882 [details] [diff] [review]
Ok, I got bored

+ httpChannel->SetContentType(
+ NS_LITERAL_CSTRING("application/vnd.mozilla.maybe-text"));

could you maybe add a log comment around here, that we're checking if this is really text?

also... is application/vnd.mozilla.maybe-text a good type to use? It's not a registered type per ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/application/ (hardly surprising :) ), I think you should use some x- type.

+ void DetermineContentType(nsIRequest* aRequest);

hm... can you add a "virtual" here, to make clear that this is a virtual function?

+ (((unsigned char)ch) > 31

thank you :)

+ // First, check for a BOM.

you have no check for a UTF-32 BOM... doesn't mozilla support that?

+CreateNewBinaryDetectorFactory(nsISupports *aOuter, REFNSIID aIID, void **aResult)

is there a reason to avoid NS_GENERIC_FACTORY_CONSTRUCTOR?
> could you maybe add a log comment

err, I meant "a LOG statement"
> could you maybe add a log comment around here,

Sure thing.

> is application/vnd.mozilla.maybe-text a good type to use?

Good question. I'll change it to an x- type that should still not cause random namespace collisions.

> can you add a "virtual" here,

Yep.

> you have no check for a UTF-32 BOM

True. I'll add it -- I finally convinced myself this can be done in a safe way.

> is there a reason to avoid NS_GENERIC_FACTORY_CONSTRUCTOR?

Following the style of the file... no other reason.
putting this finally in browser:File handling where it belongs; also -> bz
Assignee: blake → bz-vacation
Component: General → File Handling
Product: Firebird → Browser
Version: unspecified → Trunk
Attachment #136882 - Flags: superreview?(darin)
Attached patch Updated to comments (obsolete) — Splinter Review
Attachment #136882 - Attachment is obsolete: true
Attachment #136945 - Flags: superreview?(darin)
Attachment #136945 - Flags: review?(cbiesinger)
Comment on attachment 136945 [details] [diff] [review]
Updated to comments

+ (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0 && buf[3] == 0) || // UCS-4LE

this will already match the UTF-16LE check, so this line can be removed

+ (buf[0] == 0xFE && buf[1] == 0xFF && buf[2] == 0 && buf[3] == 0)) { // UCS-4

this will match the UTF-16BE check, so it can also be removed

with the lines removed, r=me
Attachment #136945 - Flags: review?(cbiesinger) → review+
Priority: -- → P2
Summary: Prompt user about invalid text/plain content - solving most incorrect MIME type issues → [FIX]Prompt user about invalid text/plain content - solving most incorrect MIME type issues
Target Milestone: --- → mozilla1.7alpha
> ben: yes... the oshelperappservice uses HKCR\.<ext>\@ as the helper app
> description, and HKCR\<thatvalue>\@ as mime type description (where @ means
> "default value")
>
> do you have a better suggestion?

filetype = HKCR\.<ext>\@
defaultapp = HKCR\<filetype>\shell\@
helperapp = HKCR\<filetype>\shell\<defaultapp>
helperappdescription = HKCU\Software\Microsoft\Windows\ShellNoRoam\<helperapp>

If helperappdescription does not exist, then we should display helperapp. This is the default Windows behavior. Displaying either the application name if available, or the application filename if not, is much more intuitive than displaying the internal filetype name and saying that's the program you're opening it with.
Let's have a separate bug on the Windows thing, ok? ;)
Christian, I'll do some research and file another bug.
ben: scratch that -- filed bug 227671 about the application name question
OK, this new patch works great with the "Idiot on Silverado" video (Content-Type: text/plain; charset=ISO-8859-1). Please don't attempt to load the link from my comment above anymore, it now seems to display undesirable content o_o;;
So... what does this patch do to our results on these tests?
http://www.hixie.ch/tests/adhoc/http/content-type/

What does this faking of the MIME type do to Page Info and so forth? I really hate how we sometimes show our autodetected MIME types as if they were the server type.

Anyway. Interesting solution. Separate from that, though, we do IMHO still need the View | Content Type... submenu (analogous to the View | Encoding submenu for the analogous problem).
Ian: the behaviour on those files does not change with this patch.

For files where behaviour does change, the helper app dialog will be shown and will indicate a type of application/octet-stream (hence, Page Info never comes into play).

hmm... why does mozilla display application/x-view-source files?
> why does mozilla display application/x-view-source files?

Because we have a handler registered for that content type -- that's how we do view-source, y'know. ;)
QA Contact: ian
Attached patch bz's patch plus a little more (obsolete) — Splinter Review
This adds to bz's patch an extension-based lookup in nsExternalHelperAppService::DoContent as a last ditch attempt to locate the One True Type from the OS. bz/biesi - I've included the MIME type application/x-vnd.moz.maybe-binary as a literal string in both cases, please advise if there's a common header or something where it'd be better defined.
Attachment #136945 - Attachment is obsolete: true
Attachment #137227 - Flags: superreview?(bz-vacation)
Attachment #137227 - Flags: review?(cbiesinger)
Ben, did you base that on the patch _before_ I addressed some review comments? The type could go in nsMimeTypes.h, I suppose.
Comment on attachment 137227 [details] [diff] [review]
bz's patch plus a little more

this does not address most parts of comment 49...

also:

+ if (aMimeContentType == "application/x-vnd.mozilla.maybe-binary") {

aMimeContentType is "const char*". this comparison does therefore not do what you want.

+ nsXPIDLCString mimeType;
+ if (mimeInfo)
+   mimeInfo->GetMIMEType(getter_Copies(mimeType));

can you move this into the "if (!fileExtension.IsEmpty())" block? (except the nsXPIDLCString, that should stay outside of course)

I would also like a LOG statement to show what mime type the nsIMIMEInfo has stored.
Attachment #137227 - Flags: review?(cbiesinger) → review-
oh, about the mimetype question... yes, you could put it in nsMimeTypes.h as bz suggested, or just use the literal string... personally, I don't care either way
Huh. I'm pretty sure I reversed the original patch and then applied the newer one. I must have messed something up. *tries again*
Attachment #137227 - Flags: superreview?(bz-vacation) → superreview-
Attached patch newer patch (obsolete) — Splinter Review
This one is really based on bz's second patch, with biesi's comments rolled in.
Attachment #137227 - Attachment is obsolete: true
Comment on attachment 137481 [details] [diff] [review] newer patch oops. one moment. disregard this patch.
Attachment #137481 - Attachment is obsolete: true
Comment on attachment 137485 [details] [diff] [review]
new patch

+ if (!nsCRT::strcmp(aMimeContentType, APPLICATION_MAYBE_BINARY)) {

do we guarantee anywhere that aMimeContentType is lowercase? it doesn't look like it. please use nsCRT::strcasecmp.

+ mContentType = APPLICATION_MAYBE_BINARY;

huh? will that do any good?
Christian - I replaced the place where the content type was set to octet-stream with maybe-binary. I assumed that bz's original patch relied on this code to set it to octet-stream so that the UCT dialog would be presented, and so I switched it to maybe-binary.

bz - did I misinterpret your original patch?
oh. hmmm. hmmmmm. well I fear that at that point, this will cause another streamconverter to get instantiated, so you get into an infinite loop... bz knows this code better than me, though, I'll let him comment. (maybe we need two mimetypes - one to trigger the check for non-ascii characters, and one to tell the helper app dialog to get the type from the extension)
Attachment #136945 - Flags: superreview?(darin)
Comment on attachment 137485 [details] [diff] [review]
new patch

In the future, patches using the -p switch and more context (-u8 at least) are much easier to review.... And there was no need for -w here, either, imo.

>+ (((unsigned char)ch) > 31 || (9 <= ch && ch <= 13) || ch == 27)

I know I wrote this code... but could you put () around "ch" everywhere in that line? So:

(((unsigned char)(ch)) > 31 || (9 <= (ch) && (ch) <= 13) || (ch) == 27)

>+ mContentType = APPLICATION_MAYBE_BINARY;

biesi, this part is correct. We do have two types involved here -- maybe-text and maybe-binary. The former triggers this stream converter, the latter is magic to the external helper app handler. Someone could indeed create a stream converter for maybe-binary, but that's ok if they want to do that....

>+ // the type off the channel.
>  nextLink->mContentType = aOutContentType;
>- }

This is indented correctly, right? Damn -w. ;)

>+ LOG(("OS-Provided mime type '%s' for extension '%s'\n",
>+ fileExtension.get(), mimeType.get()));

That should pass mimeType.get() first, no?

sr=bzbarsky on Ben's additional changes with that addressed. The rest of the patch still needs sr (and moa from darin), so not marking the sr flag myself...
Attachment #137485 - Flags: superreview?(darin)
>We do have two types involved here -- maybe-text
>and maybe-binary.

er, oops - I missed that. :(

but, why is maybe-binary called maybe-binary? don't we know that it is binary (because it failed the "is text" check)?
Better name suggestions? maybe-octet-stream?
application/x-guess-type-from-extension?
There's no guarantee that there will be an extension. Streamed videos or content management systems, for example. I suppose we could do application/x-guess-type-from-extension and fall back to x-maybe-octet-stream if that fails, too.
*** Bug 221877 has been marked as a duplicate of this bug. ***
> There's no guarantee that there will be an extension.

indeed. please read the patch. the question is just how to name that thing.
Attached patch Patch for FB_0_8_BRANCH *ONLY* (obsolete) — Splinter Review
I've rolled biesi's and bz's comments, except the content-type name change, into this patch, which I plan to check in on the 0.8 branch. We can continue to discuss the content type and check the further-modified patch in for 1.7a.
Christian... should that be application/x-vnd.mozilla.guess-type-from-extension ?
I don't think any MIME type needs both a "x-" prefix and a "vnd." one... "x-" means "unregistered", and no further structure is necessary after that. ("vnd." is for registered MIME types in the vendor subtree.)
> don't think any MIME type needs both a "x-" prefix and a "vnd." one... "x-"
> means "unregistered", and no further structure is necessary after that.

Except if you're trying to be a good citizen and ensure that your x- crap won't collide with anyone else's x- crap. I made the type what I made it very purposefully.

I don't really care what we name the type, as long as we clearly document in nsMIMETypes.h what its "magic" effect is and as long as it starts with "x-vnd.mozilla."
are we really sure using magic MIME types is the way to go here? :-) /me starts planning tests that use those MIME types to see what happens
> application/x-vnd.mozilla.guess-type-from-extension

that sounds good.
> are we really sure using magic MIME types is the way to go here? :-)

Yes, we are. It's the simplest way to do it, and they are not very magic (you could set up a plugin or other internal handler for any of the types involved any time you wanted to, eg; you just can't set up a helper for the application/x-vnd.mozilla.guess-type-from-extension type). And the name was carefully chosen such that you better not expect it to work in any particular way.
I've read through the entire discussion and it looks like things are shaping up alright. I would highly recommend taking steps to prevent any active content from loading automatically based on MIME-type sniffing (offering to download is okay, however). Fortunately, the patch here only seems to make files sent as text/plain appear with a download dialog.

Here are some articles from the Freenet-devel mailing list discussing how MIME-type autodetection can affect privacy:

http://article.gmane.org/gmane.network.freenet.devel/4939/match=internet+explorer
http://article.gmane.org/gmane.network.freenet.devel/1580/match=autodetect+mime

It's impossible to use IE in any sort of anonymous way on Freenet (and I assume other anonymous networks) due to "MIME type sniffing". You can easily provide an anonymity-destroying extension file and pass that as an image! Just a couple of points to keep in mind.
Comment 88 has nothing to do with this bug. This bug is about the main content area load. The issues mentioned in comment 88 have to do with subsidiary script/plugin loads to which the code discussed in this bug does not apply. Please don't make irrelevant comments like that; file separate bugs on separate issues. If you're not sure it's a separate issue, file a separate bug and add a dependency.
There appears to be a case sensitivity issue with the charset. http://forums.mozillazine.org/viewtopic.php?t=41893
Yeah, yeah. See what I wrote about what this patch is aiming to do (quoted in that thread). The case-sensitive comparison is very much purposeful.
Yes, I see that you're doing it case-sensitive because you're handling it exactly as Apache 2.x sends it. And my question is: why not make it case insensitive? Forgive me for being dumb, but how do we handle TEXT/plain or text/Plain? I'm not trying to say we should handle every single possibility.
the point is: this is only done to work around the default apache misconfiguration, and not against anything else.
That may well be the point, but if you're going to the lengths to implement a fix like this, surely you might as well add a simple case conversion statement, and then everyone is even happier.
Not everyone. This entire codepath is a standards violation, we absolutely want to keep the violating to the strict minimum required to hit the most common case. If people go out of their way to misconfigure their servers even further, then tough -- it won't work.
Well, if we limit the check to the first 10 characters we can ignore the charset regardless.
The simple reason to NOT make this a case-insensitive check is that I'd rather like to avoid detecting actual valid ISO-8859-1 text content as application/octet-stream. Sorry, but that check is staying case-sensitive.

alanjstr, if we limit the check to the first ten chars, then we will fail on type strings like in comment 41. See also the first 20 or so comments in this bug for more background.

One more comment from someone who clearly hasn't bothered to carefully read the bug, and I'm reassigning this to whoever did the commenting.
Please add the charset "iso-8859-1" (lowercased) as it is the default on my Debian/Apache server as well. The only mention of the charset in httpd.conf is:

# Default charset to iso-8859-1 (http://www.apache.org/info/css-security/).
AddDefaultCharset on

HTTP/1.1 200 OK
Date: Thu, 25 Dec 2003 20:41:32 GMT
Server: Apache/1.3.29 (Debian GNU/Linux)
Last-Modified: Sat, 22 Nov 2003 23:38:46 GMT
ETag: "1ed36-28e0-3fbff386"
Accept-Ranges: bytes
Content-Length: 10464
Connection: close
Content-Type: text/plain; charset=iso-8859-1
Yeah, that also seems to match the bug report mentioned in comment 45. Can do that.
*** Bug 209343 has been marked as a duplicate of this bug. ***
Blocks: 135411
I do not think that this should block the evangelism bugs, since there are probably quite a few of those, and fixing this won't fix those servers, merely hide the symptoms.
Attachment #137485 - Attachment is obsolete: true
Attachment #137485 - Flags: superreview?(darin)
Attachment #137535 - Attachment is obsolete: true
Comment on attachment 138491 [details] [diff] [review] Patch for trunk, updated to all comments biesi, darin, could you review?
Attachment #138491 - Flags: superreview?(darin)
Attachment #138491 - Flags: review?(cbiesinger)
Attachment #138491 - Flags: review?(cbiesinger) → review+
Comment on attachment 138491 [details] [diff] [review]
Patch for trunk, updated to all comments

>Index: netwerk/mime/public/nsMimeTypes.h

>+#define APPLICATION_GUESS_TYPE_FROM_EXTENSION "application/x-vnd.mozilla.guess-type-from-extension"

i think this mime type is excessively verbose. how about one of these instead:

application/x-vnd.mozilla.guess-from-ext
application/x-vnd.mozilla.from-extension
application/x-vnd.mozilla.from-ext

>Index: netwerk/streamconv/converters/nsUnknownDecoder.h

i think this class deserves some description. please explain why you are overriding DetermineContentType, etc.

>+#define NS_BINARYDETECTOR_CID \
>+{ /* a2027ec6-ba0d-4c72-805d-148233f5f33c */ \
>+ 0xa2027ec6, \
>+ 0xba0d, \
>+ 0x4c72, \
>+ {0x80, 0x5d, 0x14, 0x82, 0x33, 0xf5, 0xf3, 0x3c} \
>+}
>+
>+class nsBinaryDetector : public nsUnknownDecoder
>+{
>+public:
>+ nsBinaryDetector()
>+ {}
>+
>+protected:
>+ virtual ~nsBinaryDetector()
>+ {}
>+
>+ virtual void DetermineContentType(nsIRequest* aRequest);
>+};

seems like you should be able to compact the class declaration down to this:

class nsBinaryDetector : public nsUnknownDecoder
{
protected:
  virtual void DetermineContentType(nsIRequest* aRequest);
};

>Index: uriloader/base/nsURILoader.cpp

>+ if (httpChannel && mContentType.IsEmpty()) {
>+ // This is our initial dispatch, and this is an HTTP channel. Check for
>+ // the text/plain mess.
>+ nsCAutoString contentType;
>+ httpChannel->GetResponseHeader(NS_LITERAL_CSTRING("Content-Type"),
>+ contentType);
>+ // Make sure to do a case-sensitive exact match comparison here. Apache
>+ // 1.x just sends text/plain for "unknown", while Apache 2.x sends
>+ // text/plain with a ISO-8859-1 charset. Debian's Apache version, just to
>+ // be different, sends text/plain with iso-8859-1 charset. Don't do
>+ // general case-insensitive comparison, since we really want to apply this
>+ // crap as rarely as we can.
>+ if (contentType.Equals(NS_LITERAL_CSTRING("text/plain")) ||
>+ contentType.Equals(
>+ NS_LITERAL_CSTRING("text/plain; charset=ISO-8859-1")) ||
>+ contentType.Equals(
>+ NS_LITERAL_CSTRING("text/plain; charset=iso-8859-1"))) {

i'm not sure i follow why you wouldn't want to just do a case-insensitive compare against "text/plain; charset=iso-8859-1". the only difference is that you would accept "Text/Plain..." as well. that doesn't seem like a big deal, right?

sr=darin
Attachment #138491 - Flags: superreview?(darin) → superreview+
> application/x-vnd.mozilla.guess-from-ext

Sounds good. Will do that.

> i think this class deserves some description.

Done. Also compacted the class decl as suggested.

> i'm not sure i follow why you wouldn't want to just do a case-insensitive
> compare against "text/plain; charset=iso-8859-1"

Because if the type is _not_ "text/plain", it's not just Apache being dumb -- it's someone purposefully sending that type. In which case we really should follow what the server said.
Attachment #138491 - Attachment is obsolete: true
Checked in that patch.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Blocks: 226024
*** Bug 207154 has been marked as a duplicate of this bug. ***
*** Bug 218372 has been marked as a duplicate of this bug. ***
*** Bug 233465 has been marked as a duplicate of this bug. ***
*** Bug 233678 has been marked as a duplicate of this bug. ***
Is there an option to disable this behaviour? Even a hidden pref... (This is important, for example, if we actually want to see the data.)
There is no such option. I would rather implement a "view in browser" option on the helper app dialog (which we need to do anyway) than waste time on a pref for this.
Blocks: 233047
I just sent myself a mail with an RTF attachment for which my Windows XP system did not have a Content-Type set in the registry. The MUA sent the file with a Content-Type of application/x-vnd.mozilla.guess-from-ext; I would have expected the Content-Type to be application/octet-stream, since that is the Content-Type for "unknown"/unclassified data.

Using Thunderbird Nightly 2004-03-21-00-trunk on Windows XP.

Requesting to reopen bug.
(In reply to comment #114)
> Requesting to reopen bug.

This bug is fixed but has caused a regression, so it would be better to file a new bug.
(In reply to comment #114)
> I just sent myself a mail with an RTF attachment for which my Windows XP system
> did not have a Content-Type set in the registry. The MUA sent the file with a
> Content-Type of application/x-vnd.mozilla.guess-from-ext;

heh. please file a new bug for that, and mention the bug# here.
Okay. I submitted bug 238706.
It is still sometimes opening .rar files in a new window for me. I also cannot right-click and save some .rar files.

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113

If I go to this site http://www.dustydunes.com/ for example and click on the link for http://ygs.pwp.blueyonder.co.uk/matt/maps/de_seaside2k.rar then Mozilla opens the contents in a new window instead of downloading them.

If I go to http://zveris007.boom.ru/ for example and click the link for http://zveris007.boom.ru/Plan2.part01.rar then Mozilla opens the contents in a new window instead of downloading them. Furthermore, if I right-click on the link and click "Save Link Target As..." a box pops up that says: "The link could not be saved. The web page might have been removed or had its name changed." But obviously it is there, since Mozilla could get it in text format.

The availability of those files is questionable (for some reason?). Sometimes IE6 goes to the website's 404 not found page if I click directly on the link (and sometimes it downloads). IE6 also flags it as unavailable if you right-click and go to "Save Target as...".

*I make no claims to the security of those sites or files, I just searched for .rar files to demonstrate. Yarrrr I'm a n00b at this reporting business!
this is the intended behavior, see the "solving most incorrect MIME type issues". see comment #39.
(In reply to comment #119)
> this is the intended behavior, see the "solving most incorrect MIME type
> issues". see comment #39.

The intended behaviour is "display garbage". Great.
"Display garbage" is better than "Open a gaping security hole". In any case, this bug is resolved. If you want to suggest that we break standards, discuss it on http://www.mozillazine.org or file a new bug.
I think I've found another "safe to assume without breaking standards" case. If the server sends text/plain but the A HREF link contains a TYPE="application/java-archive", then would it not be safe to assume that 1) the linked file is a java archive, or 2) the server and web page disagree, so it's a situation where you should be prompted? The reason I'm asking is that in a hosted environment web page authors may not have control over the server MIME types, and they may wish to use type="xxx" in cases where the server is wrong. This could also be a totally different bug, but I think it's a valid request for improvement. Mozilla 1.7 RC1 currently always takes the server as gospel regardless of the TYPE indicated in the link.
Please file a separate bug on that. That may be worth doing... Note that we already support type in general, so this could be done without too much pain, probably.
Blocks: 266410
Product: Core → Core Graveyard