Closed Bug 175848 Opened 22 years ago Closed 22 years ago

Can't implicitly trust text/plain content type


(Camino Graveyard :: Downloading, defect)




(Reporter: sdagley, Assigned: sdagley)




(1 file, 3 obsolete files)

It is becoming more and more common for Mac users to download files that
have .dmg.gz as an extension.  Unfortunately the .dmg extension isn't a
generally known type, so we end up with the default Apache (and maybe other
servers) content type of text/plain, which causes us to try to display the file
rather than download it.  Rather than blindly assume the text/plain type is
valid we should, at a minimum, use the buffer null sniff from
nsUnknownDecoder::DetermineContentType() to see if application/octet-stream
would be more appropriate.
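
For readers unfamiliar with the idea, a null sniff can be sketched roughly as follows. This is a standalone illustration, not the code in nsUnknownDecoder::DetermineContentType(); the function names and the 1024-byte limit are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if any of the first `limit` bytes is NUL.
// The 1024-byte limit is an arbitrary choice for this sketch.
bool hasNullByte(const std::string& buf, std::size_t limit = 1024) {
    std::size_t n = buf.size() < limit ? buf.size() : limit;
    for (std::size_t i = 0; i < n; ++i) {
        if (buf[i] == '\0') return true;
    }
    return false;
}

// Hypothetical decision: only second-guess text/plain, and only when the
// start of the body contains a NUL byte.
std::string sniffPlainText(const std::string& servedType, const std::string& buf) {
    if (servedType == "text/plain" && hasNullByte(buf)) {
        return "application/octet-stream";
    }
    return servedType;
}
```

Any other served type is passed through untouched, which keeps the check limited to the one default that is causing trouble here.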
I've seen reports of servers serving up .dmg.gz as "text/html" too. So maybe
it's not just text/plain.
> use the buffer null sniff from nsUnknownDecoder::DetermineContentType()

Sounds like a good idea. What's the perf hit for doing that for every text/html
or text/plain load? Is it going to skew the pageloader #'s?
It should be decently cheap, actually...  especially if we fix the html-check in
there to not suck.

Two concerns I have: 

1) Does text/plain imply us-ascii or equivalent encoding?  Or are things like
   UCS-2 valid text/plain documents?
2) We should really not autodetect html served as text/plain.... there are far
   too many cases (eg crasher testcases in bugzilla, demos of various sorts,
   etc) where that's being done completely on purpose.

I'm basically looking for something similar to what IE does -- take the original
type, the detected type, info from the extension, put them all together, and
mix....  (with us weighting original type a _lot_ more than IE does).

text/plain is not limited to any encoding. BOMs can be used to determine Unicode
encodings, but other than that it is left to the server or the default app to
determine encoding.

Why not just equip Mozilla with a reasonably comprehensive magic number file and
utilize magic numbers to determine content whenever necessary (like the BSD
"file" command)? Apache can of course be configured to use magic numbers, but
this is unfortunately not the default.
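
A magic number lookup of the kind suggested above might look something like this sketch. The table is deliberately tiny and the names (MagicEntry, kMagicTable, typeFromMagic) are hypothetical; the byte signatures for gzip, bzip2, ZIP and PDF are the well-known ones, while the MIME labels attached to them are illustrative choices.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// One row of a (hypothetical) magic number table.
struct MagicEntry {
    const char* bytes;      // leading bytes that identify the format
    std::size_t length;     // number of bytes to compare
    const char* mimeType;   // type to report on a match
};

static const MagicEntry kMagicTable[] = {
    {"\x1f\x8b", 2, "application/x-gzip"},
    {"BZh", 3, "application/x-bzip2"},
    {"PK\x03\x04", 4, "application/zip"},
    {"%PDF", 4, "application/pdf"},
};

// Returns the matching type, or an empty string if nothing matches.
std::string typeFromMagic(const std::string& buf) {
    for (const MagicEntry& e : kMagicTable) {
        if (buf.size() >= e.length &&
            std::memcmp(buf.data(), e.bytes, e.length) == 0) {
            return e.mimeType;
        }
    }
    return "";
}
```

An empty result means "no opinion", so a caller would fall back to the served type rather than guess.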
Ah... that is unfortunate... for one thing, that means that text/plain can
contain embedded nulls... (I've considered just shelling out 'file' in the past,
btw.... there are issues with that, however... )
I've been warned in bug 169991 not to spam this one with "this is evil"
comments, so I won't, even though it is... :)

I just hope that, if some sort of MIME-overriding based on analysis of data
and/or URL extensions (etc.) is done, in order to increase ability of Mozilla
users to access sites with poorly configured servers, that this be done as a
preference setting that allows Mozilla to still be run in a complete
standards-compliance mode by users who wish it that way.  Preferably, the
standards-compliance mode should be the default, with departure from standards
requiring a specific decision on the part of the user.
Plugins have this problem all over the place too, on all platforms.
See bug 169991, bug 175368.
testcase (sorry, internal):
(works on browsers with fix from bug 163568)
A suggestion from the Chimera mailing list:

1. IF (the MIME type is "application/octet-stream" OR "text/plain") AND (the
file extension is listed in our /private/etc/httpd/mime.types) THEN use the
locally referenced mime type.

2. ELSE trust the server MIME type.

I think this gives us a fairly cheap, easy-to-implement, and well-specified
solution that roughly says: use the best information that the server or
client can come up with. I don't think it contradicts any RFCs. Can someone
who thinks otherwise give us a proper reference, please?

Ian Eiloart - University of Sussex Computing Service
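
The two-step rule in this suggestion can be sketched as below. The local table is passed in as a plain map rather than read from /private/etc/httpd/mime.types, and the function names are hypothetical; this is only meant to make the rule concrete.

```cpp
#include <map>
#include <string>

// Hypothetical helper: text after the last '.', or "" if there is none.
std::string extensionOf(const std::string& fileName) {
    std::string::size_type dot = fileName.rfind('.');
    if (dot == std::string::npos || dot + 1 == fileName.size()) return "";
    return fileName.substr(dot + 1);
}

// Step 1: for the two "generic" served types, prefer a locally known type.
// Step 2: otherwise trust the server.
std::string resolveType(const std::string& servedType,
                        const std::string& fileName,
                        const std::map<std::string, std::string>& localTypes) {
    bool generic = servedType == "application/octet-stream" ||
                   servedType == "text/plain";
    if (generic) {
        auto it = localTypes.find(extensionOf(fileName));
        if (it != localTypes.end()) return it->second;
    }
    return servedType;
}
```

Note that the rule only fires for the two generic types, so a server that sends a specific type is always believed.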

Other than re-checking the extension when the served type is
application/octet-stream I think it's a reasonable approach since the null in
buffer test has been shot down.
The relevant RFC reference is RFC 2068, section 7.2.1:

   Any HTTP/1.1 message containing an entity-body SHOULD include a
   Content-Type header field defining the media type of that body. If
   and only if the media type is not given by a Content-Type field, the
   recipient MAY attempt to guess the media type via inspection of its
   content and/or the name extension(s) of the URL used to identify the
   resource. If the media type remains unknown, the recipient SHOULD
   treat it as type "application/octet-stream".

So, in fact, doing anything but blindly trusting the content type (if it is
present) is a violation of the RFC...

That said, the suggestion is not completely unreasonable.. I would still rather
sniff the data than the extension if we have to do _something_
I really don't think that Chimera should be looking at
/private/etc/httpd/mime.types. Internet Config maybe, but not that file.

<a type="video/mp4" href="mymovie.rm">Looky here!</a> will then result in
what...? RealOne loading!?

<a type="plain/text" href="app.gz.dmg">Download</a> will have what effect...?

Approximately 0.00001% of servers are configured to handle MPEG-4, so this will
always yield garbage with any Gecko browser, since Gecko trusts the dumb server
and not the author of the document (the only intelligent being in this chain).
> type="plain/text"

we should trust that?
With Apache servers, you can set up custom mime types using .htaccess files,
directory by directory.  Thus, saying "the dumb server isn't configured for that
type" is no excuse to not serve it correctly in your own site.
Most people don't have a site server of their own. They rely on ISP:s.

The default configuration for Apache does not allow .htaccess, and I have yet to
see an ISP that actually does allow this. I just managed to convince my ISP at a
university to support application/xhtml+xml, video/mp4, audio/mp4, and also
.xhtml as a DirectoryIndex item. I guess these people are supposed to be more
enlightened than others (yet they hadn't implemented this until now, when
explicitly asked about it), but there are several commercial ISP:s that ignore
such requests and don't give a damn. There are many evangelism bugs covering
that, leading nowhere.

So to sum up, web server makers don't configure their servers properly when
shipping (what exactly is text/x-point-plus and what is its apparently extreme
relevance to the web?), and they also configure for benchmarks, stripping out
all good features. They certainly do not have a comprehensive set of MIME types
defined, so it will be up to the website admins to actually do the grunt work.
They might or might not. Whenever new types are registered, there will be some
serious delay before web servers implement this. Thus, this will conserve rather
than help the net evolve.

All servers I have come across also define a default type and generally don't
allow for magic number support, which means that this ridiculous DOS-extension
based meta-data method will prevail. So much for platform independence.

The server is a machine. It doesn't know more than it has been configured to
know, and that is often not too much. HTML documents are created by intelligent
beings, who know first-hand what their material really is. It should be their
choice that comes first in the list of determining what a file really is. If
that was so, there would be no bugs like this and others, and everyone would be
happy (except the high HTTP priests that for some obscure and completely
metaphysical and irrelevant reason think it really matters if the useragent
actually cares about that type attribute defined in HTML). [And actually, not
even Mozilla is consistent about this; it will load any extensionless file if
enclosed in an image tag: <image src="picture" />.]

The current dogmatic stance taken by the Mozilla group has the following
practical consequences: 1) People will refrain from using strict doctypes when
developing sites, and 2) people in general will avoid using Gecko browsers
"since they can't even download properly and render ugly pages without style".

The HTML spec clearly allows for overriding the HTTP server's MIME type, if
there is any doubt about the content's validity, but Mozilla has opted to ignore
this. And that is the worst mistake Mozilla ever made and will make.
My hosting provider, Dreamhost, supports .htaccess files.  Perhaps many other
hosting providers do, and their users just have never even thought to try them.

I don't know of anything in the specs that allows browsers to second-guess
server-supplied content types, and I know of specific prohibitions on doing so
in the specs.  W3C has said on this subject:

The architecture of the Web depends on applications making 
dispatching and security decisions for resources based on their 
Internet Media Types and other MIME headers. It is a serious 
error for the response body to be inconsistent with the 
assertions made about it by the MIME headers. Web software SHOULD 
NOT attempt to recover from such errors by guessing, but SHOULD 
report the error to the user to allow intelligent corrective 

MSIE's disregard of these standards means that certain perfectly
standards-compliant things don't work reliably in that browser; most notably,
trying to get the plain-text output of a CGI script to display correctly is a
crapshoot.  Here are some test examples:

Another test suite:

More comments:
*** Bug 176459 has been marked as a duplicate of this bug. ***
The Mozilla policy has been against this sort of thing. 

Wouldn't it make more sense to file a bug against Apache for 
a) defaulting to text/plain
b) not including a type for .dmg

(I've added garbage to a text file in order to prevent IE from parsing it as XML
and showing well-formedness errors...)
Dan and Niklas -- stop the ranting.  It is claimed that we have a real problem.
 Suggest a real solution instead of speaking in vague and useless generalities.
To steer this bug back to the immediate problem with .dmg files and what we can
do about it in Chimera (where we care about the user experience at least as much
as we care about following RFCs to the letter)...

Buffer sniffing for NULLs is out.  Inspection of several .dmg files does not
reveal any pattern that could be used to identify them so a magic number is out.
 This pretty much leaves sniffing the .dmg extension (and the .dmg.gz
extension/encoding variant) if we hit text/plain (and maybe text/html as sfraser
encountered).  No it's not pretty and no it's not 100% foolproof but unless I
missed something other than a miracle causing every web server in existence to
set a mapping for .dmg I see no other solution.  This doesn't mean Mozilla
should solve the problem in the same way so perhaps we should diverge that
discussion to a different bug.
Why is buffer sniffing out? All the .dmgs I looked at had lots of zero bytes
near the beginning.
Comment #4 indicates NULLs may be present with multibyte text
Actually, now that I think about it some more....

_If_ the type is text/plain _and_ there's no charset specified in the headers
then we can sniff for null without fear, since we would almost certainly
mis-decode that content anyway....  If we do not find nulls, we leave it as
text/plain (that's what the unknown decoder will default to anyway, if it finds
no nulls and matches nothing else).

This would allow us to at least fall back to application/octet-stream for
.dmg.gz, maybe... thoughts?
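
A rough sketch of that guarded check, assuming the raw Content-Type header value is available as a string: sniff for nulls only when the media type is text/plain and no charset parameter is present. The function names are hypothetical and the header parsing is intentionally simplistic.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: ASCII-lowercase a copy of the string.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns "application/octet-stream" only for text/plain without a charset
// whose body contains a NUL byte; otherwise returns the header unchanged.
std::string sniffIfNoCharset(const std::string& contentTypeHeader,
                             const std::string& buf) {
    std::string header = toLower(contentTypeHeader);
    // The media type is everything before the first ';', minus trailing spaces.
    std::string mediaType = header.substr(0, header.find(';'));
    while (!mediaType.empty() && mediaType.back() == ' ') mediaType.pop_back();

    bool hasCharset = header.find("charset=") != std::string::npos;
    bool hasNull = buf.find('\0') != std::string::npos;
    if (mediaType == "text/plain" && !hasCharset && hasNull) {
        return "application/octet-stream";
    }
    return contentTypeHeader;
}
```

With an explicit charset such as UTF-16 the nulls are expected, so the served type is left alone.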
In addition, multibyte text should never have 2 zero bytes in a row. You could
sniff for that.
It should if it's UCS-4...
Another possibility that was suggested by someone on Mozillazine, more or less.
We make it possible to override specific MIME-type-and-filename combinations via
a pref (eg "text/plain" and "ends in .dmg.gz").  We don't even need UI for this
pref yet -- Chimera could just set it... (though UI would be nice eventually). 
I say "filename" instead of "extension" because the extension of .dmg.gz is just
Attached patch First cut (obsolete)
Ok, don't look Hixie cause you'll get ill :-)
This patch checks for a .dmg in the url if the content type is text/plain or
text/html and changes it to application/octet-stream to force a download.
> +    if (theFileName.RFind(".dmg") > 0)

Could we make this check that the end is ".dmg.gz" or ".dmg"?  The code
as-written will match

Also, there are _many_ nsIURI's that are not nsIURLs.  Please null-check that
pointer; otherwise the next javascript: or data: or view-source: uri you click
on (eg in the JS console) will crash.  ;)

The rest looks fine, assuming this patch is meant for Chimera only, of course.
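
To illustrate the suffix check being asked for: a search for ".dmg" anywhere past position 0 accepts names that merely contain it, whereas an ends-with test accepts only names that finish in ".dmg" or ".dmg.gz". The sketch below uses std::string rather than Mozilla's string classes, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: case-insensitive "ends with" for ASCII names.
bool endsWithIgnoreCase(const std::string& s, const std::string& suffix) {
    if (suffix.size() > s.size()) return false;
    return std::equal(suffix.rbegin(), suffix.rend(), s.rbegin(),
                      [](unsigned char a, unsigned char b) {
                          return std::tolower(a) == std::tolower(b);
                      });
}

// True only when the file name ends in ".dmg" or ".dmg.gz".
bool looksLikeDiskImage(const std::string& fileName) {
    return endsWithIgnoreCase(fileName, ".dmg") ||
           endsWithIgnoreCase(fileName, ".dmg.gz");
}
```

A name like "my.dmg.notes.txt" contains ".dmg" but does not end with it, so it is rejected here.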
I just wanted to verify it worked.  I'll post a cleaner version with the
suggested fixes after I catch up on my sleep :-)
Attached patch w/bz's suggestions (obsolete)
Now I can go to sleep
Attachment #104250 - Attachment is obsolete: true
Comment on attachment 104252
w/bz's suggestions

sr=bzbarsky for Chimera only (not trunk), if I'm allowed to do this.  ;)
Attachment #104252 - Flags: superreview+
Comment on attachment 104252
w/bz's suggestions

+  if (nsCRT::strcasecmp(contentType.get(), "text/plain") ||

is contentType a nsACString derivative? In that case, should you use
contentType.Equals(NS_LITERAL_CSTRING("text/plain")) here?
Oh, good catch.... I should have gotten sleep before reviewing too.  ;)
Except that should be 

Equals(NS_LITERAL_CSTRING("text/plain"), nsCaseInsensitiveCStringComparator())
the post-sleep version of the fix
Attachment #104252 - Attachment is obsolete: true
Re comment #17:

> Wouldn't it make more sense to file a bug against Apache for 
> a) defaulting to text/plain

Yes absolutely. Apache should default to application/octet-stream, this is the
MIME type that is defined for "unknown file, let the user agent/user figure it
out". text/plain is maybe/probably a ghost from the distant past when unix files
were all pretty much text. Nowadays the opposite is true.

If apache changed to application/octet-stream then 99% of the time this bug
would go away, yes?

...with no need for chimera to install hacks, etc. to deal with text/plain files
that aren't.
Actually, Apache should just send no type at all if it has no idea what the type
is.  We've filed a bug to that effect and attached a patch against the current
Apache tree that implements that behavior.
Checked in Chimera only
> We've filed a bug to that effect

ok, I now tried to find that bug on, without success.
could you give a link to it please?
> I just hope that...this be done as a
preference setting that allows Mozilla to still be run in a complete
standards-compliance mode by users who wish it that way.

So in your evangelism about MIME types of all things, you are suggesting MORE
UI!?!?!?   I don't think so.  Second guessing the MIME types may be evil to you,
but superfluous UI is satan himself.
Maybe to you, but to me one of the strong points of Mozilla is that it is highly
configurable, and isn't just "one-size-fits-all", or "Where do we want to force
you to go today?"  I don't mind the configuration section being large and
complex as a result.  That makes for a power-user's browser, not a dumbed-down
thing aimed at the stupid and ignorant.

Maybe Chimera has a different philosophy; as long as this bug's fix stays in
that browser's code and doesn't find its way into the Mozilla trunk, I won't
object further as I don't use Chimera (or even the platform it runs on).
Re: Comment #41 From Dan Tobias 2002-10-27 12:17
> Maybe to you, but to me one of the strong points of Mozilla is that it is
> highly configurable, and isn't just "one-size-fits-all", or "Where do we want
> to force you to go today?"  I don't mind the configuration section being large
> and complex as a result.  That makes for a power-user's browser, not a
> dumbed-down thing aimed at the stupid and ignorant.

That's true; that's what options are there for. Unfortunately, options often add
unnecessary bloat. Also, Mozilla already *has* way too many options. And it's
difficult to sort them.

> Maybe Chimera has a different philosophy;

It does. As its release notes say, it's supposed to be a clean and simple native
OS X Gecko browser.

> as long as this bug's fix stays in
> that browser's code and doesn't find its way into the Mozilla trunk, I won't
> object further as I don't use Chimera (or even the platform it runs on).

Well, .dmg doesn't really matter to non-OSX systems anyway. Okay, maybe GNUStep etc.
It's an extremely hypothetical situation, but would this patch incorrectly match
to something like I'm not aware of any actual websites with domain
names of the form dmg.xx, but I just wanted to bring it up as a possible
(although very unlikely) problem with this solution. It's certainly not
something I'm actually worried about.

P.S. - Thanks for the link to the Apache bug. I spent 3 votes on it :)
> but would this patch incorrectly match

No.  That's what GetFileName is for, you know.
You would, however, get false positives if any site used a CGI script with a
.dmg extension that generated text/plain or text/html output.  That's the sort
of problem you run into once you start doing MIME-second-guessing.
your point is well taken, but I have to say I've run into far more sites serving
.dmgs with incorrect mimetypes than sites using .dmg CGI scripts that generate
text/plain or text/html output. Thanks to Micro$oft (and to a lesser degree
Netscape 4), we have to do dumb stuff like quirks mode and file extension sniffing.

If you haven't already, please vote for the Apache bug and email Microsoft about
this. The sooner we can get either of them to change their behavior, the sooner
we can scrap this patch. (My money's on Apache.)
Looks like we may have broken something: bug 177020.
+      if (extPosition > 0 &&
+          ((extPosition == nameLength - 4) || (extPosition == nameLength - 7)))

What is this mysterious '7' here? I assume it's for .dmg.gz, but what about
.dmg.sit or .dmg.tgz ?
re comment #36 - boris, I agree with you that sending nothing is ultimately the
correct behaviour for apache. That gives the UA total latitude to figure out
what to do. However, simply asking apache to change the default DefaultType to
application/octet-stream would achieve an almost as good result (IMHO) since:

(RFC 2046)
> The recommended action for an implementation that receives an 
>"application/octet-stream" entity is to simply offer to put the 
> data in a file, with any Content-Transfer-Encoding undone, 
> or perhaps to use it as input to a user-specified process."

This is just as good, in almost all cases. The only case where I can see it
falling down is if the content actually should be displayed inline or
automatically using a plug-in, in which case following the MIME spec would be
broken because it would pop up a file save dialog.  ... except that the MIME
spec here is only "recommended ... or perhaps" so I think you can do content
sniffing and whatever anyway.

In any case I appreciate your effort to patch apache but I think this simpler
change will accomplish the same thing. Also this change could be back-propagated
to users of older apaches by advising them to change the config file (it's a
small change).
> but what about .dmg.sit or .dmg.tgz ?

Apache comes with a reasonable default type for .sit. Using .dmg.tgz makes no
sense (and I think it resolves to a reasonable type as well, but I haven't
tested with a vanilla installation of Apache).
Sourceforge breaks the current 'fix' as it has links like this:

Planned workaround here is to not mangle the content type when the URL contains ?foo
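
A minimal sketch of that query test, assuming the URL is available as a plain string: treat a '?' before any '#' fragment as a query. The function name and the example.org URLs in the usage below are hypothetical; the real patch would presumably ask the URL object for its query instead.

```cpp
#include <string>

// Hypothetical helper: true if the URL has a '?' before any '#' fragment.
bool hasQuery(const std::string& url) {
    // substr(0, npos) keeps the whole string when there is no fragment.
    std::string beforeFragment = url.substr(0, url.find('#'));
    return beforeFragment.find('?') != std::string::npos;
}
```

For example, hasQuery("http://example.org/get/app.dmg?mirror=1") is true, so the content type would be left as served, while hasQuery("http://example.org/files/app.dmg") is false.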
I'm not sure how hard it would be to port the fix for bug 177026 over to
Chimera, but that may be worth looking at... That would allow you to restrict
this code to resetting the content-type to an empty string (or even
application/x-unknown-content-type) and allowing the content-sniffer to perform
type detection on it...

In fact, you may want to try that anyway, even if you don't port that fix over.
 If .dmg.something always contains nulls, that would make it be detected as
application/octet-stream anyway....  and if you limit the clearing to the cases
where you do it now it shouldn't affect most content...
now check to see if URL contains a query before we try and look for a .dmg
extension on the file name.  Also slightly reorg of the code to not check for
the length of the file name if no .dmg is found
Attachment #104281 - Attachment is obsolete: true
Didn't I tell ya there'd be problems with CGI output from URLs with .dmg?  :-)

This newest patch is only a Band-Aid [tm]... next somebody'll find another site
that uses .dmg CGI scripts without any parameters (maybe as the destination of a
form post), and they'll still break.

I tell ya, this MIME-second-guessing stuff is a tar baby... you'll keep getting
stuck deeper and deeper in it the more you try to extricate yourself.
Checked in version that handles a query in a .dmg URL
Experimented with bz's suggestion for application/x-unknown-content-type, but
since Chimera is based off of a 1.0 branch it doesn't have his spiffy new code,
so the suggestion doesn't work.  Unless an infinite recursion somehow fits the
definition of work :-)

Dan, you can say "I told ya so" all you want :-)  That doesn't change the fact
the default Apache configuration makes this a major issue for Mac users.  I care
a helluva lot more about their experience with Chimera than following the letter
of the spec which doesn't address what to do when the content type is so
blatantly wrong and only set because of a bad default in the first place.
You can override Apache defaults with .htaccess files, you know.  While it's
true that some hosting providers may disable this feature, there are a lot of
webmasters who are actually capable of setting such configurations correctly in
their own sites but just don't know it.
> there are a lot of webmasters who are actually capable of setting such
configurations correctly in their own sites but just don't know it.

Yeah, but that doesn't do Joe Surfer any good at all. Sure a few people might
get a nice feeling of superiority knowing that the webmaster at some site isn't
as on the ball as they could be.  But most users are just going to say, "What
the heck is all this garbage on my screen? Where's my file?"

While it would be nice to live in a perfect world, I think that the chances are
pretty low that all the webmasters out there are going to wake up one day aware
that they are not making life convenient for a small portion of their users, and
do something about it.  In the mean time, why not work around it, hack or not?
Well, it's even less likely the webmasters will ever do anything about fixing
their incorrect site configuration if browser makers pander to their
incorrectness by hacking around it... at least, when the incorrect configuration
causes the download to fail, there's something specific to point to when
evangelizing the webmasters to fix the problem.  If it's been hacked around at
the browser level, all anybody would be able to say is "Your site is breaking
the standard... it still works in browsers, but it's incorrect anyway on an
abstract intellectual level," which wouldn't give them much incentive to ever
fix it, so the temporary hack would have to become a permanent "feature".  On
the other hand, saying "Your site's incorrect configuration causes it not to
work in [fill in browser name here]" gives them more reason to fix it.

Meanwhile, those sites that incorrectly trigger the "hack" (e.g., CGI scripts
with .dmg extensions) are just out of luck, collateral damage in this "war"
between misconfigured sites and browsers that hack around them.
As has been pointed out, this really needs to be fixed at the apache level, not
the individual webmaster level.  I don't think anyone is suggesting this as a
permanent hack; only until apache is fixed.  This patch has been called a
"Band-Aid"... I agree.  But I don't see that as a reason not to apply it; when
you have a cut, you don't say, "Well, it'll heal eventually on its own, so I'll
leave it alone and bleed everywhere for a while."  You put a Band-Aid on it
until it heals.

How much incentive do webmasters have now to do anything about the problem? 
Chimera doesn't have that much of the market share, and if there's anything
the web has taught us, it's that if something works in IE, that's good enough
for most webmasters.  Saying that we won't play the game of making up for
shortcomings in the web may make a statement in support of the standards, and
make us feel good, but it also makes Chimera a less usable browser until the
specs are revised/followed.

It would be great if non-IE browsers had enough clout to influence webmasters,
but they don't yet.  And they never will if they don't give a good enough user
experience to get people to switch.
Is there any reason this is being contemplated for Chimera alone and not 
trunk? It gives me the hives to see differential standards support in 
different Gecko-based browsers, and Evangelism has been dealing quite 
successfully with bugs of this nature on the trunk.
Could we please take all the pointless philosophical discussion the hell out of
this _CHIMERA_ bug to the newsgroups?  If people tell me which newsgroup they
have started the discussion in, I'll follow it.  If nothing else, this gives the
rest of the Mozilla community a chance to follow the argument and participate.

Choess: this is not being contemplated for the trunk, if nothing else, because
it's a "must fix now" hack.  IF we ever do this on the trunk (big if, since as
you mention evang has been doing a good job) it would be done totally
differently (for one thing, it would rely a lot more on content-sniffing than on
extensions).  I'm not even happy that Chimera's doing it, but I understand their
reasons, and I can't really stop them, can I?  Again, newsgroups please.
Marking this _Chimera_ bug as fixed hoping it'll end further commentary here. 
As Boris sez the mozilla newsgroups are where this discussion should be continued.
Closed: 22 years ago
Resolution: --- → FIXED
I filed on apache to
change the DefaultType to either application/octet-stream or
application/x-unknown-content-type . This second option is a new one to me, but
again, will not require a source code change to apache, just a default config
change, thus is more likely to be implemented sooner.
Why? There already is a bug filed with Apache, namely to not send any god damn
type at all if it doesn't know for sure what that document's type is. That way
the DESIGNER can bypass **** up servers and **** up web server admins with
that little type attribute defined in HTML, that attribute which currently is
next to meaningless.

Sending an octet-stream type instead of text/plain for an unrecognized XHTML
document solves exactly nothing.

Nick, did you bother to read the spec before you flamed me? Based on your
confusion (what does XHTML have to do with this?) I'm guessing the answer is no.

text/plain is a bad default type. application/octet-stream is a less bad default
type (and if you read the spec you'll find out why, it saves binaries to disk
instead of spewing them as text on the screen). no type at all is ideal.

However, sending no type at all requires code changes to apache, while sending
application/octet-stream requires merely a default config file change.
Therefore, it's a good workaround until the code change is in place.
Re-opening - get a 404 on a .dmg.gz file and we try to save the 404 message to
disk rather than displaying it :-(

Before the "I told ya so" pundits chime in, we know that simply sniffing the
extension isn't the ideal way to solve this problem.  That's why we're going to
start sniffing for the .gz/.bzip2 magic number or nulls in the beginning of the
datastream if we get a hit on the extension.
Resolution: FIXED → ---
hmm... does the 1.0 branch have the nsIHttpChannel changes to get the
"succeeded" boolean?  If not, you're gonna have to get the response status and
do "status / 100 == 2" explicitly....
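
The explicit status check could be sketched like this. It is a standalone illustration with hypothetical function names, not the nsIHttpChannel API: the content type is overridden only when the response is in the 2xx range, so an error page for a missing .dmg.gz is still displayed rather than saved.

```cpp
#include <string>

// True for any 2xx status; integer division drops the last two digits.
bool requestSucceeded(int httpStatus) {
    return httpStatus / 100 == 2;
}

// Hypothetical decision: `extensionMatched` stands in for whatever
// extension test the caller has already performed.
std::string typeAfterStatusCheck(int httpStatus,
                                 const std::string& servedType,
                                 bool extensionMatched) {
    if (!requestSucceeded(httpStatus)) return servedType;  // e.g. a 404 page
    bool suspect = servedType == "text/plain" || servedType == "text/html";
    if (suspect && extensionMatched) return "application/octet-stream";
    return servedType;
}
```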
For folks following this soap opera...

It turns out that where we're checking the file extension we can't sniff the
channel's data stream because it doesn't actually contain any data yet.  And we
can't move the sniffing down into the http channel code as that would mean we'd
already have gone through the .gz decompression.  Any Joseph Heller fans here? 
We can however determine if we got a 404 and not mangle the content type (which
is what bz's comment referred to) so that's what we're gonna do. Next!
Target Milestone: --- → Chimera0.7
nsHttpChannel trunk changes for GetRequestSucceeded() moved to Chimera branch
and checked in uriloader before mangling content type
Closed: 22 years ago
Resolution: --- → FIXED
Verified in the 2002-12-04-04 build.