Closed Bug 175848 Opened 19 years ago Closed 19 years ago
Can't implicitly trust text/plain content type
It is becoming more and more common for Mac users to be downloading files that have .dmg.gz as an extension. Unfortunately the .dmg extension isn't a generally known type, so we're ending up with the default Apache (and maybe other servers) content type of text/plain, causing us to try and display the file rather than download it. Rather than blindly assume the text/plain type is valid we should, at a minimum, use the buffer null sniff from nsUnknownDecoder::DetermineContentType() to see if application/octet-stream would be more appropriate.
I've seen reports of servers serving up .dmg.gz as "text/html" too. So maybe it's not just text/plain.
> use the buffer null sniff from nsUnknownDecoder::DetermineContentType() Sounds like a good idea. What's the perf hit for doing that for every text/html or text/plain load? Is it going to skew the pageloader #'s?
It should be decently cheap, actually... especially if we fix the html-check in there to not suck. Two concerns I have: 1) Does text/plain imply us-ascii or equivalent encoding? Or are things like UCS-2 valid text/plain documents? 2) We should really not autodetect html served as text/plain.... there are far too many cases (eg crasher testcases in bugzilla, demos of various sorts, etc) where that's being done completely on purpose. I'm basically looking for something similar to what IE does -- take the original type, the detected type, info from the extension, put them all together, and mix.... (with us weighting original type a _lot_ more than IE does).
text/plain is not limited to any encoding. BOMs can be used to determine Unicode encodings, but other than that it is left to the server or the default app to determine encoding. Why not just equip Mozilla with a reasonably comprehensive magic number file and utilize magic numbers to determine content whenever necessary (like the BSD "file" command)? Apache can of course be configured to use magic numbers, but this is unfortunately not the default.
Ah... that is unfortunate... for one thing, that means that text/plain can contain embedded nulls... (I've considered just shelling out 'file' in the past, btw.... there are issues with that, however... )
I've been warned in bug 169991 not to spam this one with "this is evil" comments, so I won't, even though it is... :) I just hope that, if some sort of MIME-overriding based on analysis of data and/or URL extensions (etc.) is done, in order to increase ability of Mozilla users to access sites with poorly configured servers, that this be done as a preference setting that allows Mozilla to still be run in a complete standards-compliance mode by users who wish it that way. Preferably, the standards-compliance mode should be the default, with departure from standards requiring a specific decision on the part of the user.
Plugins have this problem all over the place too, on all platforms. See bug 169991, bug 175368. testcase (sorry, internal): http://slip.mcom.com/shrir/Dice.swf (works on browsers with fix from bug 163568)
A suggestion from the Chimera mailing list: 1. IF (the MIME type is "application/octet-stream" OR "text/plain") AND (the file extension is listed in our /private/etc/httpd/mime.types) THEN use the locally referenced mime type. 2. ELSE trust the server MIME type. I think this gives us a fairly cheap, easy-to-implement, and well-specified solution that roughly says: use the best information that the server or client can come up with. I don't think it does contradict any RFCs. Can someone who thinks otherwise give us a proper reference, please? -- Ian Eiloart - University of Sussex Computing Service <http://www.sussex.ac.uk/USCS/Staff/staff3.cfm?ID=iane> <http://Eiloart.com/> Other than re-checking the extension when the served type is application/octet-stream, I think it's a reasonable approach, since the null-in-buffer test has been shot down.
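The two-step rule in the suggestion above can be sketched as follows. This is only an illustration under stated assumptions: `MimeTable`, `ExtensionOf` and `ResolveType` are hypothetical names, and the table stands in for whatever local source (such as a parsed mime.types file) would actually be consulted.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical stand-in for a locally configured extension-to-type table.
using MimeTable = std::map<std::string, std::string>;

// Returns the lower-cased text after the last '.', or "" if there is none.
std::string ExtensionOf(const std::string& fileName) {
  const std::string::size_type dot = fileName.rfind('.');
  if (dot == std::string::npos || dot + 1 == fileName.size()) return "";
  std::string ext = fileName.substr(dot + 1);
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return ext;
}

// Step 1: for the two "generic" server types, prefer the local table.
// Step 2: otherwise trust the server.
std::string ResolveType(const std::string& serverType,
                        const std::string& fileName,
                        const MimeTable& localTypes) {
  const bool generic = serverType == "text/plain" ||
                       serverType == "application/octet-stream";
  if (generic) {
    const auto it = localTypes.find(ExtensionOf(fileName));
    if (it != localTypes.end()) return it->second;
  }
  return serverType;
}
```

Note that with this definition the extension of a name ending in .dmg.gz is just "gz", so the table would need a sensible entry for that as well.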
The relevant RFC reference is RFC 2068, section 7.2.1: Any HTTP/1.1 message containing an entity-body SHOULD include a Content-Type header field defining the media type of that body. If and only if the media type is not given by a Content-Type field, the recipient MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URL used to identify the resource. If the media type remains unknown, the recipient SHOULD treat it as type "application/octet-stream". So, in fact, doing anything but blindly trusting the content type (if it is present) is a violation of the RFC... That said, the suggestion is not completely unreasonable.. I would still rather sniff the data than the extension if we have to do _something_
I really don't think that Chimera should be looking at /private/etc/httpd/mime.types. Internet Config maybe, but not that file.
Extensions...? <a type="video/mp4" href="mymovie.rm">Looky here!</a> will then result in what...? RealOne loading!? <a type="plain/text" href="app.gz.dmg">Download</a> will have what effect...? Approximately 0.00001% of servers are configured to handle MPEG-4, so this will always yield garbage with any Gecko browser, since Gecko trusts the dumb server and not the author of the document (the only intelligent being in this chain).
> type="plain/text" we should trust that?
With Apache servers, you can set up custom mime types using .htaccess files, directory by directory. Thus, saying "the dumb server isn't configured for that type" is no excuse to not serve it correctly in your own site.
Most people don't have a site server of their own. They rely on ISP:s. The default configuration for Apache does not allow .htaccess, and I have yet to see an ISP that actually does allow this. I just managed to convince my ISP at a university to support application/xhtml+xml, video/mp4, audio/mp4, and also .xhtml as a DirectoryIndex item. I guess these people are supposed to be more enlightened than others (yet they hadn't implemented this until now, when explicitly asked about it), but there are several commercial ISP:s that ignore such requests and don't give a damn. There are many evangelism bugs covering that, leading nowhere. So to sum up, web server makers don't configure their servers properly when shipping (what exactly is text/x-point-plus and what is its apparently extreme relevance to the web?), and they also configure for benchmarks, stripping out all good features. They certainly do not have a comprehensive set of MIME types defined, so it will be up to the website admins to actually do the grunt work. They might or might not. Whenever new types are registered, there will be some serious delay before web servers implement this. Thus, this will conserve rather than help the net evolve. All servers I have come across also define a default type and generally don't allow for magic number support, which means that this ridiculous DOS-extension based meta-data method will prevail. So much for platform independence. The server is a machine. It doesn't know more than it has been configured to know, and that is often not too much. HTML documents are created by intelligent beings, who know first-hand what their material really is. It should be their choice that comes first in the list of determining what a file really is. 
If that was so, there would be no bugs like this and others, and everyone would be happy (except the high HTTP priests that for some obscure and completely metaphysical and irrelevant reason think it really matters if the useragent actually cares about that type attribute defined in HTML). [And actually, not even Mozilla is consistent about this; it will load any extensionless file if enclosed in an image tag: <image src="picture" />.] The current dogmatic stance taken by the Mozilla group has the following practical consequences: 1) People will refrain from using strict doctypes when developing sites, and 2) people in general will avoid using Gecko browsers "since they can't even download properly and render ugly pages without style". The HTML spec clearly allows for overriding the HTTP server's MIME type, if there is any doubt about the content's validity, but Mozilla has opted to ignore this. And that is the worst mistake Mozilla ever made and will make.
My hosting provider, Dreamhost, supports .htaccess files. Perhaps many other hosting providers do, and their users just have never even thought to try them. I don't know of anything in the specs that allow browsers to second-guess server-supplied content types, and I know of specific prohibitions on doing so in the specs. W3C has said on this subject: The architecture of the Web depends on applications making dispatching and security decisions for resources based on their Internet Media Types and other MIME headers. It is a serious error for the response body to be inconsistent with the assertions made about it by the MIME headers. Web software SHOULD NOT attempt to recover from such errors by guessing, but SHOULD report the error to the user to allow intelligent corrective action. http://www.w3.org/2001/tag/2002/0129-mime MSIE's disregarding of these standards result in the fact that certain perfectly standards-compliant things don't work reliably in that browser; most notably, trying to get the plain-text output of a CGI script to display correctly is a crapshoot. Here are some test examples: http://webtips.dan.info/cgi-bin/plaintext.pl http://webtips.dan.info/cgi-bin/plaintext.pl/test.html http://webtips.dan.info/cgi-bin/plaintext.pl/test.jpg http://webtips.dan.info/cgi-bin/plaintext.pl/test.exe http://webtips.dan.info/cgi-bin/plaintext.pl/test.swf Another test suite: http://entropymine.com/jason/testbed/mime/ More comments: http://ppewww.ph.gla.ac.uk/~flavell/www/content-type.html
*** Bug 176459 has been marked as a duplicate of this bug. ***
The Mozilla policy has been against this sort of thing. Wouldn't it make more sense to file a bug against Apache for a) defaulting to text/plain b) not including a type for .dmg ? (I've added garbage to a text file in order to prevent IE from parsing it as XML and showing well-formedness errors...)
Dan and Niklas -- stop the ranting. It is claimed that we have a real problem. Suggest a real solution instead of speaking in vague and useless generalities.
To steer this bug back to the immediate problem with .dmg files and what we can do about it in Chimera (where we care about the user experience at least as much as we care about following RFCs to the letter)... Buffer sniffing for NULLs is out. Inspection of several .dmg files does not reveal any pattern that could be used to identify them so a magic number is out. This pretty much leaves sniffing the .dmg extension (and the .dmg.gz extension/encoding variant) if we hit text/plain (and maybe text/html as sfraser encountered). No it's not pretty and no it's not 100% foolproof but unless I missed something other than a miracle causing every web server in existence to set a mapping for .dmg I see no other solution. This doesn't mean Mozilla should solve the problem in the same way so perhaps we should diverge that discussion to a different bug.
Why is buffer sniffing out? All the .dmgs I looked at had lots of zero bytes near the beginning.
Comment #4 indicates NULLs may be present with multibyte text
Actually, now that I think about it some more.... _If_ the type is text/plain _and_ there's no charset specified in the headers then we can sniff for null without fear, since we would almost certainly mis-decode that content anyway.... If we do not find nulls, we leave it as text/plain (that's what the unknown decoder will default to anyway, if it finds no nulls and matches nothing else). This would allow us to at least fall back to application/octet-stream for .dmg.gz, maybe... thoughts?
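A minimal sketch of the rule proposed in this comment, using plain standard C++ rather than the real nsUnknownDecoder code. `SniffPlainText` and its parameters are hypothetical; the caller is assumed to have already parsed the Content-Type header and buffered the first bytes of the body.

```cpp
#include <cstddef>
#include <string>

// Decides the effective type from the server-declared type, whether a
// charset parameter was present, and the first bytes of the body.
std::string SniffPlainText(const std::string& declaredType,
                           bool hasCharset,
                           const unsigned char* data,
                           std::size_t length) {
  // Only second-guess bare text/plain; an explicit charset means the
  // server made a deliberate claim, and zero bytes may then be legitimate.
  if (declaredType != "text/plain" || hasCharset) return declaredType;
  for (std::size_t i = 0; i < length; ++i) {
    if (data[i] == 0) return "application/octet-stream";
  }
  // No nulls found: leave the declared type alone.
  return declaredType;
}
```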
In addition, multibyte text should never have 2 zero bytes in a row. You could sniff for that.
It should if it's UCS-4...
Another possibility that was suggested by someone on Mozillazine, more or less. We make it possible to override specific MIME-type-and-filename combinations via a pref (eg "text/plain" and "ends in .dmg.gz"). We don't even need UI for this pref yet -- Chimera could just set it... (though UI would be nice eventually). I say "filename" instead of "extension" because the extension of .dmg.gz is just "gz"....
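One way such a pref-driven override could look, sketched in standard C++. The `TypeOverride` record and `ApplyOverrides` function are hypothetical, and how the entries would be stored in or read from prefs is not shown. Matching is on a filename suffix rather than "the extension", for the reason given in the comment above.

```cpp
#include <string>
#include <vector>

// One override entry: served type + filename suffix -> replacement type.
struct TypeOverride {
  std::string servedType;
  std::string nameSuffix;
  std::string replacementType;
};

bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the replacement type of the first matching entry, or the served
// type unchanged when nothing matches.
std::string ApplyOverrides(const std::string& servedType,
                           const std::string& fileName,
                           const std::vector<TypeOverride>& overrides) {
  for (const TypeOverride& o : overrides) {
    if (o.servedType == servedType && EndsWith(fileName, o.nameSuffix)) {
      return o.replacementType;
    }
  }
  return servedType;
}
```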
Ok, don't look Hixie cause you'll get ill :-) This patch checks for a .dmg in the url if the content type is text/plain or text/html and changes it to application/octet-stream to force a download.
I just wanted to verify it worked. I'll post a cleaner version with the suggested fixes after I catch up on my sleep :-)
Now I can go to sleep
Attachment #104250 - Attachment is obsolete: true
Comment on attachment 104252 [details] [diff] [review] w/bz's suggestions sr=bzbarsky for Chimera only (not trunk), if I'm allowed to do this. ;)
Attachment #104252 - Flags: superreview+
Comment on attachment 104252 [details] [diff] [review] w/bz's suggestions + if (nsCRT::strcasecmp(contentType.get(), "text/plain") || is contentType a nsACString derivative? In that case, should you use contentType.Equals(NS_LITERAL_CSTRING("text/plain")) here?
Oh, good catch.... I should have gotten sleep before reviewing too. ;)
Except that should be Equals(NS_LITERAL_CSTRING("text/plain"), nsCaseInsensitiveCStringComparator())
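The reason the quoted `strcasecmp` line is a problem, assuming nsCRT::strcasecmp follows the C convention of returning 0 on a match: used directly as a condition, it is true when the strings differ. A standalone sketch of the intended case-insensitive equality test, in standard C++ with hypothetical helper names (not the Mozilla string API):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive equality for ASCII media type strings.
bool EqualsIgnoreCase(const std::string& a, const std::string& b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i]))) {
      return false;
    }
  }
  return true;
}

// The two served types the patch description says it is willing to override.
bool IsOverridableTextType(const std::string& contentType) {
  return EqualsIgnoreCase(contentType, "text/plain") ||
         EqualsIgnoreCase(contentType, "text/html");
}
```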
the post-sleep version of the fix
Attachment #104252 - Attachment is obsolete: true
Re comment #17: > Wouldn't it make more sense to file a bug against Apache for > a) defaulting to text/plain Yes absolutely. Apache should default to application/octet-stream, this is the MIME type that is defined for "unknown file, let the user agent/user figure it out". text/plain is maybe/probably a ghost from the distant past when unix files were all pretty much text. Nowadays the opposite is true. If apache changed to application/octet-stream then 99% of the time this bug would go away, yes? ...with no need for chimera to install hacks, etc. to deal with text/plain files that aren't.
Actually, Apache should just send no type at all if it has no idea what the type is. We've filed a bug to that effect and attached a patch against the current Apache tree that implements that behavior.
Checked in Chimera only
> We've filed a bug to that effect ok, I now tried to find that bug on http://bugs.apache.org, without success. could you give a link to it please?
> I just hope that...this be done as a preference setting that allows Mozilla to still be run in a complete standards-compliance mode by users who wish it that way. So in your evangelism about MIME types of all things, you are suggesting MORE UI!?!?!? I don't think so. Second guessing the MIME types may be evil to you , but superfluous UI is satan himself.
Maybe to you, but to me one of the strong points of Mozilla is that it is highly configurable, and isn't just "one-size-fits-all", or "Where do we want to force you to go today?" I don't mind the configuration section being large and complex as a result. That makes for a power-user's browser, not a dumbed-down thing aimed at the stupid and ignorant. Maybe Chimera has a different philosophy; as long as this bug's fix stays in that browser's code and doesn't find its way into the Mozilla trunk, I won't object further as I don't use Chimera (or even the platform it runs on).
Re: Comment #41 From Dan Tobias 2002-10-27 12:17 > Maybe to you, but to me one of the strong points of Mozilla is that it is > highly configurable, and isn't just "one-size-fits-all", or "Where do we want > to force you to go today?" I don't mind the configuration section being large > and complex as a result. That makes for a power-user's browser, not a > dumbed-down thing aimed at the stupid and ignorant. That's true; that's what options are there for. Unfortunately, options often add unnecessary bloat. Also, Mozilla already *has* way too many options. And it's difficult to sort them. > Maybe Chimera has a different philosophy; It does. As its release notes say, it's supposed to be a clean and simple native OS X Gecko browser. > as long as this bug's fix stays in > that browser's code and doesn't find its way into the Mozilla trunk, I won't > object further as I don't use Chimera (or even the platform it runs on). Well, .dmg doesn't really matter to non-OSX systems anyway. Okay, maybe GNUStep etc.
It's an extremely hypothetical situation, but would this patch incorrectly match to something like www.dmg.ws? I'm not aware of any actual websites with domain names of the form dmg.xx, but I just wanted to bring it up as a possible (although very unlikely) problem with this solution. It's certainly not something I'm actually worried about. P.S. - Thanks for the link to the Apache bug. I spent 3 votes on it :)
> but would this patch incorrectly match No. That's what GetFileName is for, you know.
You would, however, get false positives if any site used a CGI script with a .dmg extension that generated text/plain or text/html output. That's the sort of problem you run into once you start doing MIME-second-guessing.
Dan: your point is well taken, but I have to say I've run into far more sites serving .dmgs with incorrect mimetypes than sites using .dmg CGI scripts that generate text/plain or text/html output. Thanks to Micro$oft (and to a lesser degree Netscape 4), we have to do dumb stuff like quirks mode and file extension sniffing. If you haven't already, please vote for the Apache bug and email Microsoft about this. The sooner we can get either of them to change their behavior, the sooner we can scrap this patch. (My money's on Apache.)
Looks like we may have broken something: bug 177020.
+ if (extPosition > 0 && + ((extPosition == nameLength - 4) || (extPosition == nameLength - 7))) What is this mysterious '7' here? I assume it's for .dmg.gz, but what about .dmg.sit or .dmg.tgz ?
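The arithmetic behind the two offsets can be spelled out: ".dmg" is 4 characters and ".dmg.gz" is 7, so the quoted condition accepts names where ".dmg" is either the final 4 characters or is followed by exactly 3 more. The sketch below reproduces that test in standard C++. It assumes `extPosition` comes from a forward, case-sensitive search for ".dmg", which the quoted fragment does not show, and `LooksLikeDiskImage` is a hypothetical name.

```cpp
#include <string>

bool LooksLikeDiskImage(const std::string& fileName) {
  const std::string::size_type extPosition = fileName.find(".dmg");
  // Mirrors "extPosition > 0": not found, or a name that starts with ".dmg".
  if (extPosition == std::string::npos || extPosition == 0) return false;
  const std::string::size_type nameLength = fileName.size();
  // Written as additions to avoid unsigned underflow on short names.
  return extPosition + 4 == nameLength ||  // "name.dmg"
         extPosition + 7 == nameLength;    // "name.dmg" + any 3 characters
}
```

As written, the second branch checks only the length of the tail, not its content, so "Fire.dmg.gz" and any other two-letter suffix pass while "Fire.dmg.sit" and "Fire.dmg.tgz" do not.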
re comment #36 - boris, I agree with you that sending nothing is ultimately the correct behaviour for apache. That gives the UA total latitude to figure out what to do. However, simply asking apache to change the default DefaultType to application/octet-stream would achieve an almost as good result (IMHO) since: (RFC 2046) > The recommended action for an implementation that receives an >"application/octet-stream" entity is to simply offer to put the > data in a file, with any Content-Transfer-Encoding undone, > or perhaps to use it as input to a user-specified process." This is just as good, in almost all cases. The only case where I can see it falling down is if the content actually should be displayed inline or automatically using a plug-in, in which case following the MIME spec would be broken because it would pop up a file save dialog. ... except that the MIME spec here is only "recommended ... or perhaps" so I think you can do content sniffing and whatever anyway. In any case I appreciate your effort to patch apache but I think this simpler change will accomplish the same thing. Also this change could be back-propagated to users of older apaches by advising them to change the config file (it's a small change).
> but what about .dmg.sit or .dmg.tgz ? Apache comes with a reasonable default type for .sit. Using .dmg.tgz makes no sense (and I think it resolves to a reasonable type as well, but I haven't tested with a vanilla installation of Apache).
Sourceforge breaks the current 'fix' as it has links like this: <http://prdownloads.sourceforge.net/fire/Fire.app.0.31.e.dmg?download> Planned workaround here is to not mangle the content type when the URL contains ?foo
I'm not sure how hard it would be to port the fix for bug 177026 over to Chimera, but that may be worth looking at... That would allow you to restrict this code to resetting the content-type to an empty string (or even application/x-unknown-content-type) and allowing the content-sniffer to perform type detection on it... In fact, you may want to try that anyway, even if you don't port that fix over. If .dmg.something always contains nulls, that would make it be detected as application/octet-stream anyway.... and if you limit the clearing to the cases where you do it now it shouldn't affect most content...
now check to see if URL contains a query before we try and look for a .dmg extension on the file name. Also slightly reorg of the code to not check for the length of the file name if no .dmg is found
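A rough sketch of the query check described here, on a plain URL string. `HasQuery` is a hypothetical helper; the actual patch presumably uses the URL object's own accessors, which are not shown in this bug.

```cpp
#include <string>

// True when the URL has a '?' that appears before any '#' fragment marker.
bool HasQuery(const std::string& url) {
  const std::string::size_type fragment = url.find('#');
  const std::string::size_type query = url.find('?');
  return query != std::string::npos &&
         (fragment == std::string::npos || query < fragment);
}
```

With this in place, the extension check would only run when `HasQuery(url)` is false, so the SourceForge-style link ending in .dmg?download is left alone.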
Attachment #104281 - Attachment is obsolete: true
Didn't I tell ya there'd be problems with CGI output from URLs with .dmg? :-) This newest patch is only a Band-Aid [tm]... next somebody'll find another site that uses .dmg CGI scripts without any parameters (maybe as the destination of a form post), and they'll still break. I tell ya, this MIME-second-guessing stuff is a tar baby... you'll keep getting stuck deeper and deeper in it the more you try to extricate yourself.
Checked in version that handles a query in a .dmg URL. Experimented with bz's suggestion for application/x-unknown-content-type but since Chimera is based off of a 1.0 branch it doesn't have his spiffy new code and doesn't work. Unless an infinite recursion somehow fits the definition of work :-) Dan, you can say "I told ya so" all you want :-) That doesn't change the fact the default Apache configuration makes this a major issue for Mac users. I care a helluva lot more about their experience with Chimera than following the letter of the spec which doesn't address what to do when the content type is so blatantly wrong and only set because of a bad default in the first place.
You can override Apache defaults with .htaccess files, you know. While it's true that some hosting providers may disable this feature, there are a lot of webmasters who are actually capable of setting such configurations correctly in their own sites but just don't know it.
> there are a lot of webmasters who are actually capable of setting such configurations correctly in their own sites but just don't know it. Yeah, but that doesn't do Joe Surfer any good at all. Sure a few people might get a nice feeling of superiority knowing that the webmaster at some site isn't as on the ball as they could be. But most users are just going to say, "What the heck is all this garbage on my screen? Where's my file?" While it would be nice to live in a perfect world, I think that the chances are pretty low that all the webmasters out there are going to wake up one day aware that they are not making life convenient for a small portion of their users, and do something about it. In the mean time, why not work around it, hack or not?
Well, it's even less likely the webmasters will ever do anything about fixing their incorrect site configuration if browser makers pander to their incorrectness by hacking around it... at least, when the incorrect configuration causes the download to fail, there's something specific to point to when evangelizing the webmasters to fix the problem. If it's been hacked around at the browser level, all anybody would be able to say is "Your site is breaking the standard... it still works in browsers, but it's incorrect anyway on an abstract intellectual level," which wouldn't give them much incentive to ever fix it, so the temporary hack would have to become a permanent "feature". On the other hand, saying "Your site's incorrect configuration causes it not to work in [fill in browser name here]" gives them more reason to fix it. Meanwhile, those sites that incorrectly trigger the "hack" (e.g., CGI scripts with .dmg extensions) are just out of luck, collateral damage in this "war" between misconfigured sites and browsers that hack around them.
As has been pointed out, this really needs to be fixed at the apache level, not the individual webmaster level. I don't think anyone is suggesting this as a permanent hack; only until apache is fixed. This patch has been called a "Band-Aid"... I agree. But I don't see that as a reason not to apply it; when you have a cut, you don't say, "Well, it'll heal eventually on its own, so I'll leave it alone and bleed everywhere for a while." You put a Band-Aid on it until it heals. How much incentive do webmasters have now to do anything about the problem? Chimera doesn't have that much of the market share, and if there's anything the web has taught us, it's that if something works in IE, that's good enough for most webmasters. Saying that we won't play the game of making up for shortcomings in the web may make a statement in support of the standards, and make us feel good, but it also makes Chimera a less usable browser until the specs are revised/followed. It would be great if non-IE browsers had enough clout to influence webmasters, but they don't yet. And they never will if they don't give a good enough user experience to get people to switch.
Is there any reason this is being contemplated for Chimera alone and not trunk? It gives me the hives to see differential standards support in different Gecko-based browsers, and Evangelism has been dealing quite successfully with bugs of this nature on the trunk.
Could we please take all the pointless philosophical discussion the hell out of this _CHIMERA_ bug to the newsgroups? If people tell me which newsgroup they have started the discussion in, I'll follow it. If nothing else, this gives the rest of the Mozilla community a chance to follow the argument and participate. Choess: this is not being contemplated for the trunk, if nothing else, because it's a "must fix now" hack. IF we ever do this on the trunk (big if, since as you mention evang has been doing a good job) it would be done totally differently (for one thing, it would rely a lot more on content-sniffing than on extensions). I'm not even happy that Chimera's doing it, but I understand their reasons, and I can't really stop them, can I? Again, newsgroups please.
Marking this _Chimera_ bug as fixed hoping it'll end further commentary here. As Boris sez the mozilla newsgroups are where this discussion should be continued.
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
I filed http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14095 on apache to change the DefaultType to either application/octet-stream or application/x-unknown-content-type . This second option is a new one to me, but again, will not require a source code change to apache, just a default config change, thus is more likely to be implemented sooner.
Why? There already is a bug filed with Apache, namely to not send any god damn type at all if it doesn't know for sure what that document's type is. That way the DESIGNER can bypass **** up servers and **** up web server admins with that little type attribute defined in HTML, that attribute which currently is next to meaningless. Sending an octet-stream type instead of text/plain for an unrecognized XHTML document solves exactly nothing. Geez.
Nick, did you bother to read the spec before you flamed me? Based on your confusion (what does XHTML have to do with this?) I'm guessing the answer is no. text/plain is a bad default type. application/octet-stream is a less bad default type (and if you read the spec you'll find out why, it saves binaries to disk instead of spewing them as text on the screen). no type at all is ideal. However, sending no type at all requires code changes to apache, while sending application/octet-stream requires merely a default config file change. Therefore, it's a good workaround until the code change is in place.
Re-opening - get a 404 on a .dmg.gz file and we try to save the 404 message to disk rather than displaying it :-( Before the "I told ya so" pundits chime in we know simply extension sniffing isn't the ideal way to solve this problem. That's why we're going to start sniffing for the .gz/.bzip2 magic number or nulls in the beginning of the datastream if we get a hit on the extension.
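For reference, a sketch of what a magic-number check for the two compression formats mentioned here might look like, given the first bytes of the body. gzip streams begin with the bytes 0x1f 0x8b and bzip2 streams with the ASCII characters "BZh". The enum and function names are hypothetical and this is not code from the patch.

```cpp
#include <cstddef>

enum class Compressed { None, Gzip, Bzip2 };

// Inspects the leading bytes of a buffer for a gzip or bzip2 signature.
Compressed DetectCompression(const unsigned char* data, std::size_t length) {
  if (length >= 2 && data[0] == 0x1f && data[1] == 0x8b) {
    return Compressed::Gzip;
  }
  if (length >= 3 && data[0] == 'B' && data[1] == 'Z' && data[2] == 'h') {
    return Compressed::Bzip2;
  }
  return Compressed::None;
}
```

An HTML 404 page served under a .dmg.gz URL would come back as `Compressed::None`, which is the case the re-open is about.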
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
hmm... does the 1.0 branch have the nsIHttpChannel changes to get the "succeeded" boolean? If not, you're gonna have to get the response status and do "status / 100 == 2" explicitly....
For folks following this soap opera... It turns out that where we're checking the file extension we can't sniff the channel's data stream because it doesn't actually contain any data yet. And we can't move the sniffing down into the http channel code as that would mean we'd already have gone through the .gz decompression. Any Joseph Heller fans here? We can however determine if we got a 404 and not mangle the content type (which is what bz's comment referred to) so that's what we're gonna do. Next!
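The resulting guard can be sketched like this, using the explicit "status / 100 == 2" test suggested earlier for branches without a "succeeded" boolean. Both function names are hypothetical, and the `nameLooksLikeDmg` flag stands for the outcome of the existing extension check.

```cpp
#include <string>

// True for any 2xx HTTP response status.
bool RequestSucceeded(unsigned int httpStatus) {
  return httpStatus / 100 == 2;
}

// Only force a download when the request succeeded; error pages such as a
// 404 keep the served type so they are displayed rather than saved.
std::string EffectiveType(unsigned int httpStatus,
                          const std::string& servedType,
                          bool nameLooksLikeDmg) {
  const bool textType = servedType == "text/plain" || servedType == "text/html";
  if (RequestSucceeded(httpStatus) && textType && nameLooksLikeDmg) {
    return "application/octet-stream";
  }
  return servedType;
}
```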
nsHttpChannel trunk changes for GetRequestSucceeded() moved to Chimera branch and checked in uriloader before mangling content type
Status: REOPENED → RESOLVED
Closed: 19 years ago → 19 years ago
Resolution: --- → FIXED
Verified in the 2002-12-04-04 build.
Status: RESOLVED → VERIFIED