Closed Bug 126782 Opened 22 years ago Closed 20 years ago
[FIX]Binary file with unknown type displayed as text/plain rather than saved
For this one particular URL, Mozilla loads the file in the browser window, so it never makes it to my hard disk, and there is no workaround. I finally pasted the URL into IE to download the file. All the other files on this page downloaded as expected.
Win98SE, 2002022203, this file loads into the window for me too. Reporter: You can then save the file with File->Save page as... Observing the output from wget, there doesn't seem to be any indication in the server headers of what this file actually is (i.e. no mime type). Given that the extension ".w02" is hardly well-known, and given that making assumptions based on the extension is A Bad Thing, what _should_ moz do with it?
Severity: major → normal
I thought the File-->Save As might mess up the contents of the file, regarding text to binary conversion. If you view the directory on the site that the file is stored in, you will see that there are files named .W01 .W02 .W03, which are compressed files. All of the other files load properly. There are even many other .W02 files which download fine! It is just that one link that loads wrong. Weird, huh?
> I thought the File-->Save As might mess up the contents of the file, > regarding text to binary conversion. I've done it many times with several formats (notably .asf, .wmv, both binary formats) without a hitch. However, you're correct. Other files with the same extension elsewhere on the site immediately pop up a "save file" dialog, but this one loads straight to the browser window. The page info dialog shows that moz thinks the file is text/plain. So, a question for the developers: how does moz decide what to do with these files? And how come it's doing different things with similar files from the same site?
Assignee: bbaetz → law
Status: UNCONFIRMED → NEW
Component: Networking: FTP → File Handling
Ever confirmed: true
QA Contact: benc → sairuh
Summary: MIME bug maybe ? → Binary file with unknown type displayed as text/plain rather than saved
ftp doesn't have content type, so we guess. bz, is the mime service getting this wrong here?
This is sort of funny, actually... The mime service says nothing about this file (since it has no useful extension), so it gets passed on to the unknown content decoder. The way the unknown content decoder tells text/plain apart from application/octet-stream is by looking for null bytes. The first null byte in this file is the 1168th byte. The unknown content decoder only looks at the first 1024 bytes of the file (since 99% of the time that's enough to determine what needs to be determined). In fact we're considering decreasing that 1024 to something like 512 or 256 so it won't be so eager to decide things are HTML... Over to rpotts... I'm not sure what a good solution is here, exactly. No matter what we do, unless we sniff the entire file there is no way to tell whether it's text or binary data (one can always come up with a more pathological case). Maybe we should special-case FTP somehow or something?
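For illustration, a minimal Python sketch of the null-byte check described in the comment above. This is not the nsUnknownDecoder code; `SNIFF_WINDOW` and `looks_like_text` are hypothetical names, and the sample data is synthetic. It only shows why a file whose first null is the 1168th byte slips past a 1024-byte window.

```python
SNIFF_WINDOW = 1024  # bytes examined, per the comment above


def looks_like_text(data: bytes, window: int = SNIFF_WINDOW) -> bool:
    """Return True if no NUL byte appears in the first `window` bytes."""
    return b"\x00" not in data[:window]


# Synthetic stand-in for the reported file: the first NUL is the
# 1168th byte, i.e. index 1167 (hypothetical content, real offset).
sample = b"A" * 1167 + b"\x00" + b"B" * 100

print(looks_like_text(sample))               # NUL lies outside the window
print(looks_like_text(sample, window=2048))  # NUL lies inside the window
```

With the 1024-byte window the sample is taken for text; only a window that reaches past byte 1168 catches it, which is the "more pathological case" problem the comment raises.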
Assignee: law → rpotts
Component: File Handling → Networking
> The unknown content decoder only looks at the first 1024 bytes of the file > (since 99% of the time that's enough to determine what needs to be determined) I'm going to be nitpicky here, and say that unless I've completely forgotten what I was taught about statistics, looking at the first 1024 bytes is only going to work about 98% of the time, or 49 times out of every 50. Which isn't actually very certain. Cutting down to the first 256 bytes will cut that to 63%, less than 2 times out of three. Not good. I'm assuming a totally random distribution of the individual bytes in the range 0-255 for the purposes of this calculation. This isn't always going to be the case of course, and if a file format has a bias AGAINST null characters, things are going to get worse.
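The figures in the comment above can be reproduced with a few lines of Python, under the same stated assumption that every byte is uniformly random over 0-255. The function name `p_null_seen` is made up for this sketch.

```python
def p_null_seen(window: int) -> float:
    """Chance that at least one NUL occurs in `window` uniformly random bytes."""
    return 1.0 - (255 / 256) ** window


for window in (256, 512, 1024):
    print(window, f"{p_null_seen(window):.1%}")
```

Under that assumption, 1024 bytes gives roughly 98% and 256 bytes roughly 63%, matching the comment. Real binary formats are not uniformly random, so these numbers are only a rough guide.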
suggest a better approach, given that we have to make the decision before we have all the data and the decision is irreversible...
I ain't got one. I agree, there isn't much that can realistically be done in these circumstances, since we're essentially blindfolded in a dark room and someone's stolen our torch batteries.
I think that the bottom line is that ALL the unknown decoder can do is *guess*!! By the time we get to the unknown decoder, we've exhausted ALL other (more accurate) options for determining the content-type of the data... So, all we have left is a collection of heuristics that we use to 'guess' the content-type... Sometimes we guess wrong :-( Is there some way that we can modify these heuristics to guess better? Currently, I believe that our least reliable heuristic is that for detecting 'text/plain'... Initially, I chose to *only* key off of embedded NULLs because various character set encodings use the 8th bit... Maybe this isn't an issue?? Since we have NO character encoding information available, we can't deal with these characters very well anyway (all we can use is the 'default' encoding)... So, maybe we should modify the code to disallow *anything* in the 8th bit... The argument for doing so is that it would limit false positive 'text/plain' hits. It may very well reject streams that 'could' be rendered as text/plain using the default character encoding... I guess the question is which is more desirable: 1. occasionally rendering binary data in a window... or 2. occasionally bringing up the 'Save As' dialog box for text files... Once we decide which is the desired behavior, we can fine tune our heuristics... -- rick
I _may_ be able to help come up with something better, but I'm gonna need to clarify a few things first: 1) what groups are we categorising files into? From the comments above, we're after at least text/html, text/plain and [everything else] - any more? If it's a fairly short list, then we can see about knocking up a list of conditions for each of them. 2) presumably we have to worry about every language/alphabet under the sun, which is where the 8-bit stuff comes from. I can only claim to know anything about languages that use the latin alphabet, so to be really thorough we'll need some input from the i18n guys. > I guess the question is which is more desirable: > 1. occasionally rendering binary data in a window, or > 2. occasionally bringing up the 'Save As' dialog box for text files... Personally, I'd prefer 2. But then I'm not a typical user. Can we get anyone to go out into the world and mercilessly interrogate a couple of thousand typical users? :) Seriously, users without any technical knowledge are just going to run away screaming when they see "garbage" in the browser window, and many slightly-technically-savvy users "know" that opening a binary file in a text viewer, then saving it, is a quick way to break the binary file, and won't bother trying. In many cases, they're right. Moz is unusual here. At least if we offer to save to disc, the user can save it with a .txt extension (or whatever) and open it in their favourite text editor. Random thought: at the point where this code gets invoked, _we_ have *absolutely no idea* what the incoming file is. How likely is it that the user is as clueless as we are? Assuming that there's a short(ish) list of file categories to worry about, a few ideas: - text/plain. What about whitespace? How many text files are going to have no whitespace (space/tab/cr/lf) characters _at all_ in the first 256 bytes, let alone the first 1024? 
If it's got less than about one whitespace character per 60 bytes in the first 1024 bytes, it almost certainly isn't plain text (probably not HTML, either). That'll stand for just about every latin-alphabet language, I think. If it isn't a human language (e.g. base 64 encoded, or whatever), then the user is probably going to want to save it anyway, since mozilla's not going to be able to do much useful with it. Course, if it's an ASCII-art kitchen sink, we're in trouble :-D. - text/html. It's gonna have tags in it, surely? Can't we go looking for "<html", "<head", "<body", or even <...>...<...>...<...> patterns? On the other hand, how often does this code actually get handed an HTML file? To get here, it's got to be coming in without any content headers (which I believe means it's probably not coming via HTTP[S]?), and it's not got any kind of recognised HTML file name extension. It'd be really nice if we could get some kind of data on what files actually hit this code. Not likely, I know, but it would be really nice. > Initially, I chose to *only* key off of embedded NULLs because various > character set encoding use the 8th bit... Why only nulls? What about the other control characters, ascii 01-31? OK, there will be CR, LF, and TAB floating around, but what about some of the others? 05 (enquiry), 06 (Acknowledge), 07 (bell), and several others I don't even know the purpose of, are going to be frightfully rare in text files, aren't they? OK, enough wibbling from me.
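A rough Python sketch of the whitespace-density idea from the comment above: treat the data as a text candidate only if the sniffed prefix averages at least one whitespace byte per 60 bytes. The function name, the parameter names, and the choice to reject empty input are assumptions made for this sketch; the thread only gives the "one per 60 bytes in the first 1024" rule of thumb.

```python
WHITESPACE = b" \t\r\n"  # space, tab, CR, LF, as listed in the comment


def has_enough_whitespace(data: bytes, window: int = 1024,
                          bytes_per_space: int = 60) -> bool:
    """True if the prefix averages >= 1 whitespace byte per `bytes_per_space` bytes."""
    prefix = data[:window]
    if not prefix:
        return False  # assumption: nothing to judge, so do not call it text
    count = sum(1 for b in prefix if b in WHITESPACE)
    return count * bytes_per_space >= len(prefix)


prose = b"The quick brown fox jumps over the lazy dog.\n" * 30
packed = bytes(range(33, 127)) * 20  # printable ASCII with no whitespace at all

print(has_enough_whitespace(prose), has_enough_whitespace(packed))
```

As the comment notes, this would also turn away base64 bodies and other dense 7-bit data, which is arguably the desired outcome.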
> 1) what groups are we categorising files into? At the moment we detect: application/pdf, application/postscript, text/html, all the image types Mozilla supports, text/plain, application/octet-stream > Can't we go looking for "<html", "<head", "<body", or even > <...>...<...>...<...> patterns? We do. http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#333 > On the other hand, how often does this code actually get handed an HTML file? A lot. 90% of the ad servers out there don't send any content-type. More to the point, every single ebay URL goes through this code (ebay seems to feel it's above sending content-type headers). I think I agree that I'd rather err on the side of letting the user save than on the side of showing in browser. Especially if we ever get a "view as text" option hooked up for the helper app dialog. :)
>> 1) what groups are we categorising files into? > > At the moment we detect: application/pdf, application/postscript, text/html, > all the image types Mozilla supports, text/plain, application/octet-stream OK. Most of those have headers that are being explicitly sniffed, which makes life easier. From personal experience, I'd say it's probably worth adding .asf (http://www.microsoft.com/windows/windowsmedia/WM7/format/asfspec11300e.asp) and .wmv (which has the same internal format as .asf, according to http://support.microsoft.com/default.aspx?scid=kb;EN-US;q284094). Yes, they're MS-proprietary, but they're out there in substantial numbers, and they're the formats that give me the most grief. The spec linked on the above page appears to be in office 2000 format, so I can't read it, but I'd be surprised if there wasn't a sniffable header in there. I know there's a limit to how many types we can reasonably be expected to sniff, but presumably PDF/PS get in because they're common on the net? What about other things that are common? Can we get data on what file types are out there? Data that's more meaningful than me going "i wanna .asf and a .wmv and a .exe and a .zip and a .tar and a ....." >> Can't we go looking for "<html", "<head", "<body", or even >> <...>...<...>...<...> patterns? > > We do. [snip] And a few more I hadn't thought of. Jolly good. Looking at the code, it's basically:
1. [PDF or Postscript headers] -> appropriate types
2. [local file] -> go to step 4 for security reasons
3. [html tags?] -> HTML
4. [known image headers] -> appropriate types
5. [No nulls in it?] -> plain text
6. [everything else] -> octet-stream
Apart from quibbles about other explicitly sniffable types, I've little to add beyond the possible improvements to plain text sniffing I listed above. > I think I agree that I'd rather err on the side of letting the user save > than on the side of showing in browser. Any chance we can ping some usability gurus on this?
> Especially if we ever get a "view as text" option hooked up for the helper app > dialog. :) Yeah, that would help. The more I think about it, the more I think that if a file makes it down to step 5, the user probably has a better idea of what it is than we do, so the best solution might be to just ask them, not least because we're completely clueless at this point.
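The six-step order summarised in the comment above can be sketched in Python as follows. This is an illustration of the ordering only, not a transcription of nsUnknownDecoder.cpp: the signature tables, the tag list, and every name here are assumptions chosen for the example (the byte signatures themselves are the well-known PDF, PostScript, GIF, PNG and JPEG prefixes).

```python
DOC_MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"%!PS-Adobe-", "application/postscript"),
]
IMAGE_MAGIC = [
    (b"GIF8", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]
HTML_TAGS = (b"<html", b"<head", b"<body")  # illustrative subset


def sniff(data: bytes, is_local_file: bool = False, window: int = 1024) -> str:
    prefix = data[:window]
    for magic, mime in DOC_MAGIC:            # step 1: PDF / PostScript headers
        if prefix.startswith(magic):
            return mime
    if not is_local_file:                    # step 2: local files skip HTML sniffing
        lowered = prefix.lower()
        if any(tag in lowered for tag in HTML_TAGS):  # step 3: HTML tags
            return "text/html"
    for magic, mime in IMAGE_MAGIC:          # step 4: known image headers
        if prefix.startswith(magic):
            return mime
    if b"\x00" not in prefix:                # step 5: no nulls -> plain text
        return "text/plain"
    return "application/octet-stream"        # step 6: everything else


print(sniff(b"%PDF-1.4\n"), sniff(b"<HTML><body>hi"), sniff(b"abc\x00def"))
```

Step 5 is the weak link discussed in this bug: anything that survives the earlier checks and has no null in the window is declared text.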
mpt, what do you think about comment #9?
*** Bug 129918 has been marked as a duplicate of this bug. ***
From comment 4, above: > The mime service says nothing about this file (since it has no useful > extension) Hang on a minute. Does this mean that if the file has a recognised file extension, moz should figure out whether the file can be displayed or not? So the unknown decoder only kicks in if the extension isn't recognised? If so, it looks like .asf and .wmv aren't on that list. Adding them to that list would best be filed as a different bug, since this one is rapidly heading in the direction of "what we should do with files in the unknown content decoder", which is a different issue from preventing them hitting the decoder in the first place. If someone can confirm the above, give me a shout and I'll spin off a separate bug for that. [sorry, brain go slow, should have spotted this earlier]
> So the unknown decoder only kicks in if the extension isn't recognised? Correct. If nothing else ever uses those extensions then we can just add them to our "extensions we know" list at http://lxr.mozilla.org/seamonkey/source/uriloader/exthandler/nsExternalHelperAppService.cpp#124
>> Adding [.asf, .wmv] to [the list of known extensions] would best be filed as >> a different bug. > > If nothing else ever uses those extensions then we can just add them to our > "extensions we know" list [...] Well, www.wotsit.org doesn't know any other uses of .asf. And it's not even heard of .wmv or .wma (.wmv's audio cousin). Dunno if that's a good sign or a bad sign :-/ Anyhow, logged as bug 129982.
Ok... so it sounds like tightening up our text/plain detection is desirable. Let me summarize what i'm hearing... 1. In addition to NUL, check for other 'low ascii' control characters to reject text/plain. 2. Add a whitespace heuristic... Some amount of <SP> and/or <TAB> should be present (ideally one or more per line :-) ) any other suggestions to sniff out text/plain ?? I suppose we could add explicit detection of base64 encoding to limit the number of text/plain misses because of this encoding too.. -- rick
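On the base64 suggestion in the comment above: raw base64 has no fixed signature to look for, so any detection has to be a heuristic on the alphabet. The Python sketch below is one possible approach, not something proposed in the thread: it drops the last (possibly truncated) line of the sniff window and checks whether the remaining lines decode as strict base64. The function name, `min_chars`, and the truncation handling are all assumptions.

```python
import base64


def looks_like_base64(data: bytes, window: int = 1024, min_chars: int = 16) -> bool:
    """True if the complete lines inside the sniff window decode as strict base64."""
    lines = data[:window].splitlines()
    if len(data) > window:
        lines = lines[:-1]  # the final line may be cut off by the window
    body = b"".join(line.strip() for line in lines)
    if len(body) < min_chars or len(body) % 4 != 0:
        return False
    try:
        # validate=True rejects any character outside the base64 alphabet
        base64.b64decode(body, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return True


encoded = base64.encodebytes(bytes(range(256)) * 4)  # 76-character lines
print(looks_like_base64(encoded))
print(looks_like_base64(b"Hello, world! This is plain text.\n" * 10))
```

A long run of letters with no spaces would also pass, so this would only make sense combined with the other heuristics, such as the whitespace check.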
That's my best shot for now. The ASF/WMV thing should be covered by bug 129982. The only other thing is the suggestion to switch from: if (known binary) [octet stream] else [plaintext] to: if (known plaintext) [plaintext] else [octet stream] So that the "unknown" cases get saved to disc rather than loaded into the browser window. That's my preferred behaviour and Boris's, too, I believe. But, of course, Boris and I aren't typical users, so PDT and MPT might have different ideas.
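The proposed switch of the default can be made concrete with a small Python sketch. The `Evidence` enum and both function names are hypothetical; the point is only that the two policies agree whenever a heuristic is conclusive and differ in the "no idea" case.

```python
from enum import Enum


class Evidence(Enum):
    BINARY = "binary"  # a heuristic positively identified binary data
    TEXT = "text"      # a heuristic positively identified plain text
    NONE = "none"      # nothing conclusive either way


OCTET = "application/octet-stream"
PLAIN = "text/plain"


def current_policy(evidence: Evidence) -> str:
    """if (known binary) [octet stream] else [plaintext]"""
    return OCTET if evidence is Evidence.BINARY else PLAIN


def proposed_policy(evidence: Evidence) -> str:
    """if (known plaintext) [plaintext] else [octet stream]"""
    return PLAIN if evidence is Evidence.TEXT else OCTET


for e in Evidence:
    print(e.name, current_policy(e), proposed_policy(e))
```

Only the `NONE` row changes: it moves from being rendered in the window to being offered for saving.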
According to comment #11 and others, it would be great to add these extensions to Mozilla: .ace -> Ace archive files (http://www.winace.com/), .rar -> Rar archive files (http://www.rarsoft.com/). Is this possible? Not all the archives we can download are .zip :-)
Files that are .ISO always end up in my window. I don't know if already discussed solutions will fix this too.
Frederic, shwag, those issues are probably best covered by logging separate bugs for those extensions (similar to my bug 129982 for windows media), since this bug is covering what happens once moz decides it's got no idea what it's dealing with.
Would the following work as a fix for this bug? First, add several known binary file extensions, including ISO, bz2, and others to the mime service. Second, set the unknown content decoder to look at 0.05% of any file it gets for null characters. For a one million byte file, it would look at 50,000 bytes.
Removing bogus dependency that was added by a non-driver.
No longer blocks: 138000
In response to comment #23 -- yes, that could be doable.... Rick, what do you think? We probably want to do PR_MAX(512, something*datasize) (otherwise for a small file we'd only look at a few chars). I'm assuming you meant 5%, not 0.05%, since 0.05% of 10^6 is 500, not 50000.... I think 5% is a little big. That would be on the order of 500000 bytes (that would need to be allocated in memory!) for downloading Mozilla, and would be on the order of 20-30 megabytes (that would need to be allocated in memory) for ISO images.... But the general approach could certainly be tried; I'd like to see whether that approach has any more success with the various file types listed in this bug. Perhaps something more like: PR_MIN(PR_MAX(512, something*datasize), 20000) would be a thought? That way ridiculously huge files are capped....
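The clamped expression PR_MIN(PR_MAX(512, something*datasize), 20000) from the comment above, written out in Python. The function and parameter names are hypothetical, and `fraction` stands in for the unspecified "something"; the thread does not settle on a value, so the caller must supply one.

```python
def sniff_buffer_size(content_length: int, fraction: float,
                      floor: int = 512, cap: int = 20000) -> int:
    """min(max(floor, fraction * content_length), cap), as an integer byte count."""
    return min(max(floor, int(fraction * content_length)), cap)


# Small files hit the floor, huge files hit the cap.
for size in (1_000, 40_000, 10_000_000):
    print(size, sniff_buffer_size(size, fraction=0.25))
```

The floor keeps tiny files from being judged on a handful of bytes, and the cap bounds the memory that would otherwise be allocated for something like an ISO image.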
That would only be valid for ftp, or the unknown content type. We have to trust the server; if it lies, it's a server issue, and not our problem.
That sounds good. Once implemented, we could fine tune it, if necessary.
Having a variable-length buffer based on the content-length (clamped as Boris suggests) sounds fine to me. However, this is exactly the opposite of what bug #119942 is all about :-) It suggests that a *smaller* buffer be used ;-) Let's decide on a strategy... and mark bug #119942 as either a dup of this bug... or invalid... -- rick
Buffer size: Firstly, let's keep things sane for those on slow connections. In europe, most people are still on dial-up. If they're downloading things from a slow server, on the other side of the world, even 1024 bytes can take a few seconds. A 20,000 byte buffer could mean clicking the "save this link" option, then waiting *15-20 seconds* for the filename dialog to come up. Even from a fast server, with a fast modem, they're gonna be waiting 4-5 seconds with no sign that their click did anything. That's too long. It'll confuse users, and make them think moz is glacially slow at downloading. Ideally, we could use something like the "getting file information" intermediate dialog IE6/Win has, but that's probably gonna be loads of work, and best covered by another bug. Conversely, the buffer's got to be big enough so that, statistically, it is going to correctly figure out binary/text _most_ of the time by whatever method is being used. Obviously, 100% would be good, but that ain't gonna happen. The present method is good for about 98% with a 1024 byte buffer, but a 256 byte buffer will cut that to under 70%, which is terrible. If we improve the detection method, as discussed above, we can probably get better detection, with a smaller buffer than is currently being used, especially if we can catch some of the common culprits via other methods (e.g. windows media, bug 129982 ) So, summary of what I think needs doing: 1. Improve plain text detection heuristics as discussed here. 2. consider adding other sniffable headers to those checked 3. amend default to [save] rather than [display] (i.e. if we can't figure it out, treat it as binary, not as text) 4. reconsider buffer size given improved heuristics.
> A 20,000 byte buffer could mean clicking the "save this link" option This code is never called for that option. The _only_ time this code is called is when you actually load a url (click on a link, type in URL bar, submit form, etc). Any "save link", "mail link", etc. options do not use it.
>> A 20,000 byte buffer could mean clicking the "save this link" option > > This code is never called for that option. Doh! Of course, at that point, they're ASKING to save it, aren't they? So much for that objection. [mental note to self: WAKE UP!] If the heuristics are improved, however, would we really _need_ a bigger buffer? If we remove a couple of the worst-offending filetypes by checking for headers and/or extensions, add a whitespace check, and add a check for half-a-dozen different ascii 0-31 characters, we could get our accuracy better than 99.99%, all with a 1024 character buffer. We could probably even get better than 99.7% with only the 256-character buffer proposed in bug 119942 - which is to say, a quarter of the error rate of the current system with a 1024-byte buffer. I think that extending the null check to cover other characters may be the best single improvement, if we can do so. Even extending it to check for 2/3 characters, rather than just the one, out of the 8-bit ascii range, will make a huge difference to our accuracy. *all my statistics are assuming random distribution of characters, yada, yada.
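To see how quickly the miss rate falls as more byte values are flagged, here is a small Python sketch under the commenter's own assumption of uniformly random bytes. The function name is made up, and reading "half-a-dozen" extra control characters plus NUL as 7 flagged values is an interpretation, not something the comment spells out.

```python
def miss_rate(window: int, flagged: int) -> float:
    """Chance that none of `flagged` distinct byte values occurs in
    `window` uniformly random bytes."""
    return ((256 - flagged) / 256) ** window


for window in (256, 1024):
    for flagged in (1, 3, 7):
        print(window, flagged, f"{miss_rate(window, flagged):.2e}")
```

Each extra flagged value multiplies the per-byte survival chance by a factor below one, so the miss rate shrinks geometrically with both window size and the number of flagged values.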
I don't know if it is related, but every file with an unknown extension (from groups.yahoo.com) is saved as an .exe file (in the 2002043010 nightly trunk build). Strange?!
Totally unrelated bug (bug 120327)
Of ASCII 0-31, which characters are valid in text/plain files? 9 = \t (tab) 10 = \n (linefeed aka newline) 12 = \f (formfeed, is this actually used in text files?) 13 = \r (carriage return) Did I miss any? A file should only be considered text if there are no characters in the 0-31 range other than these. IIRC 127 isn't printable either, so should also identify a binary file. So, we should check for 0-8,11,14-31,127 (adjust as needed) and only if none of those characters are present, AND there are spaces or \t or \n or \r scattered appropriately, then it's text, otherwise it's binary. Right? Re: Comment #18, rpotts: are you saying base64-encoded files *should* be displayed as text? Why? Seems to me that displaying them as text is useless; I can't read base64, but if I save the file I can extract it with StuffIt Expander or whatever.
> 12 = \f (formfeed, is this actually used in text files?) It sure is. Newsgroup posts, for example. You forgot 11 -- Vertical Tab (\v) Comment 18 meant that we will currently detect base64 as plaintext (since it's 7-bit-clean printable ascii). We should therefore attempt to detect it as non-text/plain, for best results. :) Any idea what the magic numbers that identify a base64-encoded file are?
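The character classes discussed in the last two comments can be sketched in Python like this. It is an illustration, not the patch that later landed: the names are hypothetical, and `allow_del` exists only because the thread disagrees on byte 127 (comment above proposes treating it as binary, while the eventual IS_TEXT_CHAR described later in this bug treats it as text). Bytes with the 8th bit set are accepted as text here, since codepages may use them.

```python
ALLOWED_CONTROLS = {9, 10, 11, 12, 13}  # \t \n \v \f \r


def is_text_byte(b: int, allow_del: bool = False) -> bool:
    """Classify one byte value as plausible text or not."""
    if b < 32:
        return b in ALLOWED_CONTROLS
    if b == 127:
        return allow_del
    return True  # printable ASCII and 8-bit bytes


def first_binary_offset(data: bytes, window: int = 1024, allow_del: bool = False):
    """Index of the first non-text byte in the sniff window, or None."""
    for i, b in enumerate(data[:window]):
        if not is_text_byte(b, allow_del):
            return i
    return None


print(first_binary_offset(b"line one\r\n\tline two\x0c\x0b"))
print(first_binary_offset(b"PK\x03\x04rest of a zip-like header"))
```

Because most binary formats put control bytes in their first few bytes, a check like this tends to fire very early in the window.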
Let me add a voice for user control. Specifically: Let the user specify a preferred handler for an unknown type. (BTW, _is_ there a MIME type for "unknown"?) Once loaded (or loading), let the user hand the URL to a specific handler. For example, bring the URL in as app/octet-stream (my conservative preference) and in the Save dialogue, offer a "recast to type and handler" option. Also, can the .ext -> mime/type mapping be exposed and manually extensible? To be used only in the guess-this-type code of course, since the server's MIME claims should be respected. I'd also like to second the vote for Gavin Long's comment #19, to change: if (known binary) [octet stream] else [plaintext] to: if (known plaintext) [plaintext] else [octet stream] It seems much safer and saner to me.
> BTW, _is_ there a MIME type for "unknown" application/octet-stream is it. The definition is "unknown data of some sort". The rest of what you suggest is already covered in 3 or 4 different RFEs. The extension to type mapping is extensible through helper app preferences already.
*** Bug 119942 has been marked as a duplicate of this bug. ***
Proposed relnote: Mozilla will sometimes not detect that an opened file is binary, and will attempt to display it as a web page. To download such a file, right-click on the link and select "Save Link Target As."
Can't you also take the filesize into account? I mean, if a file is larger than 1 or 2 MB, I'm pretty sure users want to save that file (or open it with another application) rather than read it in the browser window. And I doubt there are that many large textfiles around...
We could, but large logfiles or message archives are actually very common.. Easily multi-megabyte.
Plugins also have this problem on Win32, for example: http://slip.mcom.com/shrir/edittext4.swf Should we not be looking at the extensions? Nominating nsbeta1.
-> ftp (may end up in File Handling) peter: In FTP, yes. For the example you give, what does that extension map to?
Component: Networking → Networking: FTP
My testcase works in FTP mode. It does not work in HTTP. That extension is only mapped to a mime type in plugin code. Calling |nsIPluginHost::IsPluginEnabledForExtension| will check for a mapping.
For HTTP, if the server tells us it's text/plain then we should not be looking at extension.
okay ->file handling, if I'm reading this correctly.
Component: Networking: FTP → File Handling
QA Contact: benc → sairuh
*** Bug 152203 has been marked as a duplicate of this bug. ***
*** Bug 156020 has been marked as a duplicate of this bug. ***
this occurs on Linux as well OS=>All
OS: Windows XP → All
according to bug #156020 this is true on Mac (OS X and 9 ) as well. ( http ) Now, referring to that bug as well, this happens on the .gz format as well, and that format is well recognized with a .gz file ending, And has a very applyable header. Though this seems to barf quite hard with things like this as well: spider@Darkmere spider $ wget http://www.mitzpettel.com/download/IcyJuice0.9d2.dmg.gz --18:02:37-- http://www.mitzpettel.com/download/IcyJuice0.9d2.dmg.gz => `IcyJuice0.9d2.dmg.gz' Resolving www.mitzpettel.com... done. Connecting to www.mitzpettel.com[184.108.40.206]:80... connected. HTTP request sent, awaiting response... 200 OK Length: 606,600 [text/plain] The text/plain would suggest a malconfigured (unconfigured?) http server, but how come it gets attached as text/plain with mozilla? why do we trust the server in this case?
We trust the server because that's what the HTTP specification says we MUST do. Let's keep this bug focused on the issue at hand, please...
I see this in Chimera too, so it hits embedding apps as well. Yet another testcase: <http://ftp.mozilla.org/pub/chimera/nightly/2002-07-22-05/Chimera.dmg.gz>
Hardware: PC → All
Simon: Mozilla/Chimera use the HTTP protocol for this URL and the server sends: text/plain....
This bug hasn't been touched in months? It's marked mozilla1.0? Anyone care to make a patch for assuming save as and providing view as text in the save as options?
> make a patch for assuming save as What does that have to do with this bug? > providing view as text This part is a large piece of work... (trust me, I've tried two or three times). Is there a comment after comment 18 that actually has a useful suggestion other than the banter about buffer sizes?
>> I think I agree that I'd rather err on the side of letting the user save >> than on the side of showing in browser. >Any chance we can ping some usability gurus on this? I wouldn't claim to be a usability guru, but I'm certainly a user. Why not simply add an option to force-save the file in raw format, regardless of the mime type sent, to the "save as type" menu. That way, if mozilla incorrectly identifies a binary file as text, or the server erroneously sends a text mime type for a binary file (like with those RAR archives), the user has some control over how the data is saved -- if they know the file is binary, they have a means of safely saving it as binary data that doesn't involve pasting the address into IE. The same could be added in reverse: on the off-chance that mozilla, for whatever reason, interprets a text file as binary data, the user can force-save as text if s/he so desires.
>Why not simply add an option to force-save the file in raw format imho, saving a file should ALWAYS save it in raw format (unless "web page complete" is chosen, of course)
In case you all missed it, saving in Mozilla _is_ in raw format. we don't even do newline conversion (though we should, imo, in some cases).
Here is another file that does the same ol' thing we've all seen for months. http://220.127.116.11/peng/linusq-a.ogg
Bad example -- that one the server claims to be text/plain. Fix the buggy server, please.
It's not my server to fix, and since there are other servers out there that are also likely misconfigured, it would be foolish to say that it is not worth looking at a way to have Mozilla detect files by extension. Workaround: open the URL up in IE.
No, you do not understand. Doing what you suggest would be a gross and blatant violation of the spec that _no_ browser other than IE does (I've tested Mozilla, Opera, Konqueror, Netscape 4, Mosaic, lynx, links, w3m). We _can_ detect these files by extension or even data sniffing. However we will _not_ be doing it. Please stop spamming this bug with rehashes of discussions that have happened in the newsgroups many times over.
Workaround is to save it with File->Save or Ctrl-S in the window.
*** Bug 210973 has been marked as a duplicate of this bug. ***
OK, taking. We've talked a lot, and lots of good ideas here, and I'm going to implement the simplest one -- filtering out known-not-text chars.
Assignee: rpotts → bz-vacation
For the curious, with this patch we detect the file in the URL field as binary three bytes in.
Comment on attachment 135573 [details] [diff] [review] Proposed patch Er, ignore that first hunk; I've not updated this tree to tip in a few days... ;) IS_TEXT_CHAR treats 127 and 8-bit chars as text for now, because various codepages may use them (though they probably should not be using 127, I can't guarantee that they are not). Thoughts?
Priority: -- → P1
Summary: Binary file with unknown type displayed as text/plain rather than saved → [FIX]Binary file with unknown type displayed as text/plain rather than saved
Target Milestone: --- → mozilla1.6beta
Comment on attachment 135573 [details] [diff] [review] Proposed patch this is better than nothing. i agree that matching 127 here might be risky. i think this is a good heuristic that should help catch a lot of cases. r+sr=darin
Checked in. The next step is to add sniffers for common formats, per comment 29 (which I think has a good summary of the situation). Please file bugs on those and assign them to me? So far we have base64 on the list, right?
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED