Closed Bug 126782 Opened 23 years ago Closed 21 years ago

[FIX]Binary file with unknown type displayed as text/plain rather than saved

Categories

(Core Graveyard :: File Handling, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.6beta

People

(Reporter: shwag, Assigned: bzbarsky)

Details

Attachments

(1 file)

Going to this particular URL, mozilla will load the file in the browser
window, so it never makes it to my hard disk, with no workaround.  I finally
pasted the URL into IE to download the file.

All the other files on this page came as expected.
Win98SE, 2002022203, this file loads into the window for me too.  

Reporter:  You can then save the file with File->Save page as...

Observing the output from wget, there doesn't seem to be any indication in the
server headers of what this file actually is (i.e. no mime type).  Given that
the extension ".w02" is hardly well-known, and given that making assumptions
based on the extension is A Bad Thing, what _should_ moz do with it?
Severity: major → normal
I thought the File-->Save As might mess up the contents of the file with
respect to text-to-binary conversion.

If you view the directory on the site that the file is stored in, you will see
that there are files named .W01 .W02 .W03, which are compressed files.  All of
the other files load properly.  There are even many other .W02 files which
download fine!  It is just that one link that loads wrong.  Weird, huh?
> I thought the File-->Save As might mess up the contents of the file, 
> regarding text to binary conversion.  

I've done it many times with several formats (notably .asf, .wmv, both binary
formats) without a hitch.

However, you're correct.  Other files with the same extension elsewhere on the
site immediately pop up a "save file" dialog, but this one loads straight to the
browser window.  The page info dialog shows that moz thinks the file is text/plain.

So, a question for the developers:  how does moz decide what to do with these
files?  And how come it's doing different things with similar files from the
same site?

Assignee: bbaetz → law
Status: UNCONFIRMED → NEW
Component: Networking: FTP → File Handling
Ever confirmed: true
QA Contact: benc → sairuh
Summary: MIME bug maybe ? → Binary file with unknown type displayed as text/plain rather than saved
ftp doesn't have content type, so we guess. bz, is the mime service getting this
wrong here?
This is sort of funny, actually...  The mime service says nothing about this
file (since it has no useful extension), so it gets passed on to the unknown
content decoder.

The way the unknown content decoder tells text/plain apart from
application/octet-stream is by looking for null bytes.  The first null byte in
this file is the 1168th byte.  The unknown content decoder only looks at the
first 1024 bytes of the file (since 99% of the time that's enough to determine
what needs to be determined).  In fact we're considering decreasing that 1024 to
something like 512 or 256 so it won't be so eager to decide things are HTML...
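
(For the curious, a stripped-down sketch of that text-vs-binary step --
illustrative only; the real code in nsUnknownDecoder.cpp also sniffs PDF/PS,
HTML and image headers first, and the function name here is made up:)

    // Simplified sketch: call the data binary as soon as a NUL byte
    // shows up in the sniff window, otherwise guess text/plain.
    static const char* GuessTextOrBinary(const char* aData, PRUint32 aLength)
    {
      PRUint32 max = PR_MIN(aLength, 1024);   // current sniff window
      for (PRUint32 i = 0; i < max; ++i) {
        if (aData[i] == '\0')
          return "application/octet-stream";
      }
      return "text/plain";  // no NULs seen in the first 1024 bytes
    }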

Over to rpotts... I'm not sure what a good solution is here, exactly.  No matter
what we do, unless we sniff the entire file there is no way to tell whether it's
text or binary data (one can always come up with a more pathological case).

Maybe we should special-case FTP somehow or something?
Assignee: law → rpotts
Component: File Handling → Networking
> The unknown content decoder only looks at the first 1024 bytes of the file 
> (since 99% of the time that's enough to determine what needs to be determined)

I'm going to be nitpicky here, and say that unless I've completely forgotten
what I was taught about statistics, looking at the first 1024 bytes is only
going to work about 98% of the time, or 49 times out of every 50.  Which isn't
actually very certain.  Cutting down to the first 256 bytes will cut that to
63%, less than 2 times out of three.  Not good.

I'm assuming a totally random distribution of the individual bytes in the range
0-255 for the purposes of this calculation.  This isn't always going to be the
case of course, and if a file format has a bias AGAINST null characters, things
are going to get worse.
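
(For the record, the arithmetic behind those figures, under the same
uniform-random-bytes assumption:

    P(no NUL in n random bytes) = (255/256)^n
    n = 1024:  (255/256)^1024 ~ e^-4.0 ~ 0.018  ->  ~98% detection
    n =  256:  (255/256)^256  ~ e^-1.0 ~ 0.37   ->  ~63% detection
)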
suggest a better approach, given that we have to make the decision before we
have all the data and the decision is irreversible...
I ain't got one.  I agree, there isn't much that can realistically be done in
these circumstances, since we're essentially blindfolded in a dark room and
someone's stolen our torch batteries.

I think that the bottom line is that ALL the unknown decoder can do is *guess*!!

By the time we get to the unknown decoder, we've exhausted ALL other (more
accurate) options for determining the content-type of the data...  So, all we
have left is a collection of heuristics that we use to 'guess' the
content-type...  Sometimes we guess wrong :-(

Is there some way that we can modify these heuristics to guess better? 
Currently, I believe that our least reliable heuristic is the one for detecting
'text/plain'... 

Initially, I chose to *only* key off of embedded NULLs because various character
set encodings use the 8th bit...  Maybe this isn't an issue??

Since we have NO character encoding information available, we can't deal with
these characters very well anyways (all we can use is the 'default' encoding)...

So, maybe we should modify the code to disallow *anything* with the 8th bit
set...  The argument for doing so is that it would limit false positive
'text/plain' hits.  It may very well reject streams that 'could' be rendered as
text/plain using the default character encoding...

I guess the question is which is more desirable:
1. occasionally rendering binary data in a window...
or
2. occasionally bringing up the 'Save As' dialog box for text files...

Once we decide which is the desired behavior, we can fine tune our heuristics...

-- rick
I _may_ be able to help come up with something better, but I'm gonna need to
clarify a few things first:

1) what groups are we categorising files into?  From the comments above, we're
after at least text/html, text/plain and [everything else] - any more?  If it's
a fairly short list, then we can see about knocking up a list of conditions for
each of them.

2) presumably we have to worry about every language/alphabet under the sun,
which is where the 8-bit stuff comes from.  I can only claim to know anything
about languages that use the latin alphabet, so to be really thorough we'll need
some input from the i18n guys.

> I guess the question is which is more desirable:
> 1. occasionally rendering binary data in a window, or
> 2. occasionally bringing up the 'Save As' dialog box for text files...

Personally, I'd prefer 2.  But then I'm not a typical user.  Can we get anyone
to go out into the world and mercilessly interrogate a couple of thousand
typical users?  :)

Seriously, users without any technical knowledge are just going to run away
screaming when they see "garbage" in the browser window, and many
slightly-technically-savvy users "know" that opening a binary file in a text
viewer, then saving it, is a quick way to break the binary file, and won't
bother trying.  In many cases, they're right.  Moz is unusual here.  At least if
we offer to save to disc, the user can save it with a .txt extension (or
whatever) and open it in their favourite text editor.  

Random thought: at the point where this code gets invoked, _we_ have *absolutely
no idea* what the incoming file is.  How likely is it that the user is as
clueless as we are?


Assuming that there's a short(ish) list of file categories to worry about, a few
ideas:

- text/plain.  What about whitespace?  How many text files are going to have no
whitespace (space/tab/cr/lf) characters _at all_ in the first 256 bytes, let
alone the first 1024?  If it's got less than about one whitespace character per
60 bytes in the first 1024 bytes, it almost certainly isn't plain text (probably
not HTML, either).  That'll stand for just about every latin-alphabet language,
I think.  If it isn't a human language (e.g. base 64 encoded, or whatever), then
the user is probably going to want to save it anyway, since mozilla's not going
to be able to do much useful with it.  Course, if it's an ASCII-art kitchen
sink, we're in trouble :-D.

- text/html.  It's gonna have tags in it, surely?  Can't we go looking for
"<html", "<head", "<body", or even <...>...<...>...<...> patterns?  On the other
hand, how often does this code actually get handed an HTML file?  To get here,
it's got to be coming in without any content headers (which I believe means it's
probably not coming via HTTP[S]?), and it's not got any kind of recognised HTML
file name extension.  It'd be really nice if we could get some kind of data on
what files actually hit this code.  Not likely, I know, but it would be really nice.

> Initially, I chose to *only* key off of embedded NULLs because various 
> character set encoding use the 8th bit...

Why only nulls?  What about the other control characters, ascii 01-31?  OK,
there will be CR, LF, and TAB floating around, but what about some of the
others?  05 (enquiry), 06 (acknowledge), 07 (bell), and several others I don't
even know the purpose of, are going to be frightfully rare in text files, aren't
they?

OK, enough wibbling from me. 
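
(A rough sketch of the whitespace idea above, for concreteness.  The
one-per-60-bytes threshold is the guess from this comment, not a measured
value, and the function name is made up:)

    // Hypothetical whitespace-density check: reject text/plain if the
    // sniff buffer has fewer than one whitespace byte per 60 bytes.
    static PRBool HasPlausibleWhitespace(const char* aData, PRUint32 aLen)
    {
      PRUint32 ws = 0;
      for (PRUint32 i = 0; i < aLen; ++i) {
        char c = aData[i];
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
          ++ws;
      }
      return ws * 60 >= aLen;  // at least one whitespace per 60 bytes
    }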
> 1) what groups are we categorising files into?

At the moment we detect:  application/pdf, application/postscript, text/html,
                          all the image types Mozilla supports, text/plain,
                          application/octet-stream

> Can't we go looking for "<html", "<head", "<body", or even
> <...>...<...>...<...> patterns? 

We do.
http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#333
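
(Roughly speaking, that code searches the sniff buffer for known tags.  An
abbreviated illustration, not the actual implementation, assuming NSPR's
PL_strncasestr for the length-limited case-insensitive search:)

    // Abbreviated tag sniff; the linked code checks many more tags.
    static PRBool LooksLikeHTML(const char* aData, PRUint32 aLen)
    {
      return PL_strncasestr(aData, "<html", aLen) != nsnull ||
             PL_strncasestr(aData, "<head", aLen) != nsnull ||
             PL_strncasestr(aData, "<body", aLen) != nsnull;
    }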


> On the other hand, how often does this code actually get handed an HTML file? 

A lot.  90% of the ad servers out there don't send any content-type.  More to
the point, every single ebay URL goes through this code (ebay seems to feel it's
above sending content-type headers).

I think I agree that I'd rather err on the side of letting the user save than on
the side of showing in browser.  Especially if we ever get a "view as text"
option hooked up for the helper app dialog.  :)
QA Contact: sairuh → benc
>> 1) what groups are we categorising files into?
>
> At the moment we detect:  application/pdf, application/postscript, text/html,
> all the image types Mozilla supports, text/plain, application/octet-stream

OK.  Most of those have headers that are being explicitly sniffed, which makes
life easier.  

From personal experience, I'd say it's probably worth adding .asf
(http://www.microsoft.com/windows/windowsmedia/WM7/format/asfspec11300e.asp) and
.wmv (which has the same internal format as .asf, according to
http://support.microsoft.com/default.aspx?scid=kb;EN-US;q284094).

Yes, they're MS-proprietary, but they're out there in substantial numbers, and
they're the formats that give me the most grief.  The spec linked on the above
page appears to be in office 2000 format, so I can't read it, but I'd be
surprised if there wasn't a sniffable header in there.

I know there's a limit to how many types we can reasonably be expected to
sniff, but presumably PDF/PS get in because they're common on the net?  What
about other things that are common?  Can we get data[1] on what file types are
out there?

[1] data that's more meaningful than me going "i wanna .asf and a .wmv and a
.exe and a .zip and a .tar and a ....."


>> Can't we go looking for "<html", "<head", "<body", or even
>> <...>...<...>...<...> patterns? 
>
> We do. [snip]

And a few more I hadn't thought of.  Jolly good.

Looking at the code, it's basically:
1. [PDF or Postscript headers] -> appropriate types
2. [local file] -> go to step 4 for security reasons
3. [html tags?] -> HTML
4. [known image headers] -> appropriate types
5. [No nulls in it?] -> plain text
6. [everything else] -> octet-stream

Apart from quibbles about other explicitly sniffable types, I've little to add
beyond the possible improvements to plain text sniffing I listed above.

> I think I agree that I'd rather err on the side of letting the user save
> than on the side of showing in browser.

Any chance we can ping some usability gurus on this?

> Especially if we ever get a "view as text" option hooked up for the helper app
> dialog.  :)

Yeah, that would help.  The more I think about it, the more I think that if a
file makes it down to step 5, the user probably has a better idea of what it
is[2] than we do, so the best solution might be to just ask them.

[2] not least because we're completely clueless at this point.
mpt, what do you think about comment #9?
*** Bug 129918 has been marked as a duplicate of this bug. ***
From comment 4, above:

> The mime service says nothing about this file (since it has no useful 
> extension)

Hang on a minute.  Does this mean that if the file has a recognised file
extension,  moz should figure out whether the file can be displayed or not?  So
the unknown decoder only kicks in if the extension isn't recognised?  If so, it
looks like .asf and .wmv aren't on that list.  Adding them to that list would
best be filed as a different bug, since this one is rapidly heading in the
direction of "what we should do with files in the unknown content decoder",
which is a different issue from preventing them hitting the decoder in the first
place.

If someone can confirm the above, give me a shout and I'll spin off a separate
bug for that.

[sorry, brain go slow, should have spotted this earlier]
> So the unknown decoder only kicks in if the extension isn't recognised?

Correct.  If nothing else ever uses those extensions then we can just add them
to our "extensions we know" list at
http://lxr.mozilla.org/seamonkey/source/uriloader/exthandler/nsExternalHelperAppService.cpp#124
>> Adding [.asf, .wmv] to [the list of known extensions] would best be filed as 
>> a different bug.
>
> If nothing else ever uses those extensions then we can just add them to our 
> "extensions we know" list [...]

Well, www.wotsit.org doesn't know any other uses of .asf.  And it hasn't even
heard of .wmv or .wma (.wmv's audio cousin).  Dunno if that's a good sign or a
bad sign :-/

Anyhow, logged as bug 129982.
Ok... so it sounds like tightening up our text/plain detection is desirable. 
Let me summarize what i'm hearing...

1. In addition to NUL, check for other 'low ascii' control characters to reject
text/plain.

2. Add a whitespace heuristic...  Some amount of <SP> and/or <TAB> should be
present (ideally one or more per line :-) )

any other suggestions to sniff out text/plain ??

I suppose we could add explicit detection of base64 encoding to limit the number
of text/plain misses because of this encoding too..

-- rick


That's my best shot for now.  The ASF/WMV thing should be covered by bug 129982.

The only other thing is the suggestion to switch from:
if (known binary) [octet stream]
else [plaintext]

to:
if (known plaintext) [plaintext]
else [octet stream]

So that the "unknown" cases get saved to disc rather than loaded into the
browser window.

That's my preferred behaviour and Boris's, too, I believe.  But, of course,
Boris and I aren't typical users, so PDT and MPT might have different ideas.
According to comment #11 and others, it would be great to add these extensions
to mozilla:

.ace -> Ace archive files (http://www.winace.com/)
.rar -> Rar archive files (http://www.rarsoft.com/)

Is this possible?  Not all the archives we can download are .zip :-)
Files that are .ISO always end up in my window.  
I don't know if the solutions already discussed will fix this too.
Frederic, shwag, those issues are probably best covered by logging separate bugs
for those extensions (similar to my bug 129982 for windows media), since this
bug is covering what happens once moz decides it's got no idea what it's dealing
with.
Blocks: 138000
Would the following work as a fix for this bug? First, add several known binary
file extensions, including ISO, bz2, and others to the mime service. Second, set
the unknown content decoder to look at 0.05% of any file it gets for null
characters. For a one million byte file, it would look at 50,000 bytes.
Removing bogus dependency that was added by a non-driver.
No longer blocks: 138000
In response to comment #23 -- yes, that could be doable....  Rick, what do you
think?  We probably want to do PR_MAX(512, something*datasize) (otherwise for a
small file we'd only look at a few chars)...

I'm assuming you meant 5%, not 0.05%, since 0.05% of 10^6 is 500, not 50,000....

I think 5% is a little big.  That would be on the order of 500000 bytes (that
would need to be allocated in memory!) for downloading Mozilla, and would be on
the order of 20-30 megabytes (that would need to be allocated in memory) for ISO
images....  But the general approach could certainly be tried; I'd like to see
whether that approach has any more success with the various file types listed in
this bug.

Perhaps something more like:

PR_MIN(PR_MAX(512, something*datasize), 20000) 

would be a thought?  That way ridiculously huge files are capped....
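
(In code form, with the 5% scale factor and the variable names as placeholders:)

    // Hypothetical sniff-buffer sizing: scale with the advertised
    // content length, but never below 512 bytes or above 20000.
    PRUint32 sniffSize = PR_MIN(PR_MAX(512, contentLength / 20), 20000);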
That would only be valid for ftp, or the unknown content type.  We have to
trust the server; if it lies, it's a server issue, and not our problem.
That sounds good. Once implemented, we could fine tune it, if necessary.
having a variable length buffer based on the content-length (that is clamped as
boris suggests) sounds fine to me.

However, this is exactly the opposite of what bug #119942 is all about :-)  It
suggests that a *smaller* buffer be used ;-)

let's decide on a strategy... and mark bug #119942 as either a dup of this bug...
or invalid...

-- rick
Buffer size:

Firstly, let's keep things sane for those on slow connections.  In Europe, most
people are still on dial-up.  If they're downloading things from a slow server
on the other side of the world, even 1024 bytes can take a few seconds.

A 20,000 byte buffer could mean clicking the "save this link" option, then
waiting *15-20 seconds* for the filename dialog to come up.  Even from a fast
server, with a fast modem, they're gonna be waiting 4-5 seconds with no sign
that their click did anything.  That's too long.  It'll confuse users, and make
them think moz is glacially slow at downloading.
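
(Back of the envelope: a 56k modem moves about 4-5 KB/s in practice, so 20,000
bytes / 4.5 KB/s ~ 4-5 seconds; from a slow or distant server at ~1 KB/s the
same 20,000 bytes take the quoted 15-20 seconds.)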

Ideally, we could use something like the "getting file information" intermediate
dialog IE6/Win has, but that's probably gonna be loads of work, and best covered
by another bug.

Conversely, the buffer's got to be big enough so that, statistically, it is
going to correctly figure out binary/text _most_ of the time by whatever method
is being used.  Obviously, 100% would be good, but that ain't gonna happen.  The
present method is good for about 98% with a 1024 byte buffer, but a 256 byte
buffer will cut that to under 70%, which is terrible.  If we improve the
detection method, as discussed above, we can probably get better detection, with
a smaller buffer than is currently being used, especially if we can catch some
of the common culprits via other methods (e.g. windows media, bug 129982 )

So, summary of what I think needs doing:

1. Improve plain text detection heuristics as discussed here.
2. consider adding other sniffable headers to those checked
3. amend default to [save] rather than [display] (i.e. if we can't figure it
out, treat it as binary, not as text)
4. reconsider buffer size given improved heuristics.


> A 20,000 byte buffer could mean clicking the "save this link" option

This code is never called for that option.  The _only_ time this code is called
is when you actually load a url (click on a link, type in URL bar, submit form,
etc).  Any "save link", "mail link", etc. options do not use it.
>> A 20,000 byte buffer could mean clicking the "save this link" option
>
> This code is never called for that option.  

Doh!  Of course, at that point, they're ASKING to save it, aren't they?  So much
for that objection.  [mental note to self: WAKE UP!]

If the heuristics are improved, however, would we really _need_ a bigger buffer?

If we remove a couple of the worst-offending filetypes by checking for headers
and/or extensions, add a whitespace check, and add a check for half-a-dozen
different ascii 0-31 characters, we could get our accuracy better than 99.99%,
all with a 1024 character buffer.  We could probably even get better than 99.7%
with only the 256-character buffer proposed in bug 119942 - which is to say, a
quarter of the error rate of the current system with a 1024-byte buffer.

I think that extending the null check to cover other characters may be the best
single improvement, if we can do so.  Even extending it to check for 2 or 3
characters out of the ASCII control range, rather than just the one, will make
a huge difference to our accuracy.

*all my statistics are assuming random distribution of characters, yada, yada.
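
(Under that same model: if k byte values are forbidden, the miss rate is
((256-k)/256)^n.  At n = 1024, going from k = 1 (NUL only) to k = 3 drops the
miss rate from ~1.8% to ~0.0006%, and forbidding all of the non-text control
codes makes a random miss essentially impossible -- which is why even a couple
of extra characters helps so much.)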
I don't know if it is related, but every file with an unknown extension (from
groups.yahoo.com) is saved like a .exe file (in 2002043010 nightly trunk build).

Strange ?!
Totally unrelated bug (bug 120327)
Of ASCII 0-31, which characters are valid in text/plain files?

9  = \t (tab)
10 = \n (linefeed aka newline)
12 = \f (formfeed, is this actually used in text files?)
13 = \r (carriage return)

Did I miss any?  A file should only be considered text if there are no
characters in the 0-31 range other than these.

IIRC 127 isn't printable either, so should also identify a binary file.  So, we
should check for 0-8,11,14-31,127 (adjust as needed) and only if none of those
characters are present, AND there are spaces or \t or \n or \r scattered
appropriately, then it's text, otherwise it's binary.  Right?

Re: Comment #18, rpotts: are you saying base64-encoded files *should* be
displayed as text?  Why?  Seems to me that displaying them as text is useless; I
can't read base64, but if I save the file I can extract it with StuffIt Expander
or whatever.
> 12 = \f (formfeed, is this actually used in text files?)

It sure is.  Newsgroup posts, for example.

You forgot

11 -- Vertical Tab (\v)

Comment 18 meant that we will currently detect base64 as plaintext (since it's
7-bit-clean printable ascii).  We should therefore attempt to detect it as
non-text/plain, for best results.  :) Any idea what the magic numbers that
identify a base64-encoded file are?
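
(Base64 has no magic number as such; a sniffer would have to key off the
restricted alphabet and uniform line lengths instead.  As for the corrected
control-character rule above, it might be sketched like this -- illustrative
code with a made-up name, not a patch:)

    // Hypothetical per-byte text test: allow \t \n \v \f \r, printable
    // ASCII, and 8-bit values (128-255, since some encodings use them);
    // treat the remaining control codes (0-8, 14-31) and DEL (127) as
    // markers of binary data.
    static PRBool IsTextByte(unsigned char c)
    {
      if (c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r')
        return PR_TRUE;
      return c >= 32 && c != 127;
    }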
Let me add a voice for user control.

Specifically:
Let the user specify a preferred handler for an unknown type. (BTW, _is_ there a
MIME type for "unknown"?)
Once loaded (or loading), let the user hand the URL to a specific handler. For
example, bring the URL in as app/octet-stream (my conservative preference) and
in the Save dialogue, offer a "recast to type and handler" option.

Also, can the .ext -> mime/type mapping be exposed and manually extensible?
To be used only in the guess-this-type code of course, since the server's MIME
claims should be respected.

I'd also like to second the vote for Gavin Long's comment #19, to change:

    if (known binary) [octet stream]
    else [plaintext]
to:
    if (known plaintext) [plaintext]
    else [octet stream]

It seems much safer and saner to me.
> BTW, _is_ there a MIME type for "unknown"

application/octet-stream is it.  The definition is "unknown data of some sort".

The rest of what you suggest is already covered in 3 or 4 different RFEs.  The
extension to type mapping is extensible through helper app preferences already.
*** Bug 119942 has been marked as a duplicate of this bug. ***
Proposed relnote: Mozilla will sometimes not detect that an opened file is
binary, and will attempt to display it as a web page. To download such a file,
right-click on the link and select "Save Link Target As."
Keywords: relnote
Can't you also take the filesize into account? I mean, if a file is larger than
1 or 2 MB, I'm pretty sure users want to save that file (or open it with another
application) rather than read it in the browser window. And I doubt there are
that many large textfiles around...
We could, but large logfiles or message archives are actually very common...
Easily multi-megabyte.
Keywords: mozilla1.0
Keywords: mozilla1.1
Plugins also have this problem on Win32, for example:
http://slip.mcom.com/shrir/edittext4.swf

Should we not be looking at the extensions?

Nominating nsbeta1.
Keywords: nsbeta1
-> ftp (may end up in File Handling)
peter: In FTP, yes. For the example you give, what does that extension map to?
Component: Networking → Networking: FTP
My testcase works in FTP mode. It does not work in HTTP.

That extension is only mapped to a mime type in plugin code.  Calling
|nsIPluginHost::IsPluginEnabledForExtension| will check for a mapping.
For HTTP, if the server tells us it's text/plain then we should not be looking 
at extension.
okay ->file handling, if I'm reading this correctly.
Component: Networking: FTP → File Handling
QA Contact: benc → sairuh
*** Bug 152203 has been marked as a duplicate of this bug. ***
*** Bug 156020 has been marked as a duplicate of this bug. ***
this occurs on Linux as well
OS=>All
OS: Windows XP → All
according to bug #156020 this is true on Mac (OS X and 9) as well.
( http )
Now, referring to that bug as well, this happens with the .gz format too, and
that format is well recognized by its .gz file ending and has an easily
identifiable header.

Though this seems to barf quite hard with things like this as well:

spider@Darkmere spider $ wget
http://www.mitzpettel.com/download/IcyJuice0.9d2.dmg.gz
--18:02:37--  http://www.mitzpettel.com/download/IcyJuice0.9d2.dmg.gz
           => `IcyJuice0.9d2.dmg.gz'
Resolving www.mitzpettel.com... done.
Connecting to www.mitzpettel.com[161.58.237.23]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 606,600 [text/plain]



The text/plain would suggest a misconfigured (unconfigured?) http server, but
how come it gets attached as text/plain with mozilla?  Why do we trust the
server in this case?


We trust the server because that's what the HTTP specification says we MUST do.
 Let's keep this bug focused on the issue at hand, please...
I see this in Chimera too, so it hits embedding apps as well. Yet another
testcase: <http://ftp.mozilla.org/pub/chimera/nightly/2002-07-22-05/Chimera.dmg.gz>
Hardware: PC → All
Simon: Mozilla/chimera use the HTTP protocol for this URL and the server sends:
text/plain....
No longer blocks: 150046
This bug hasn't been touched in months?  It's marked mozilla1.0?
Anyone care to make a patch for assuming save-as and providing view-as-text in
the save-as options?
> make a patch for assuming save as

What does that have to do with this bug?

> providig view as text

This part is a large piece of work...  (trust me, I've tried two or three times).

Is there a comment after comment 18 that actually has a useful suggestion other
than the banter about buffer sizes?
>> I think I agree that I'd rather err on the side of letting the user save
>> than on the side of showing in browser.

>Any chance we can ping some usability gurus on this?

I wouldn't claim to be a usability guru, but I'm certainly a user.

Why not simply add an option to force-save the file in raw format, regardless of
the mime type sent, to the "save as type" menu. That way, if mozilla incorrectly
identifies a binary file as text, or the server erroneously sends a text mime
type for a binary file (like with those RAR archives), the user has some control
over how the data is saved -- if they know the file is binary, they have a means
of safely saving it as binary data that doesn't involve pasting the address into IE.

The same could be added in reverse: on the off-chance that mozilla, for whatever
reason, interprets a text file as binary data, the user can force-save as text
if s/he so desires.
>Why not simply add an option to force-save the file in raw format

imho, saving a file should ALWAYS save it in raw format (unless "web page
complete" is chosen, of course)
In case you all missed it, saving in Mozilla _is_ in raw format.  we don't even do 
newline conversion (though we should, imo, in some cases).
QA Contact: sairuh → petersen
Here is another file that does the same ol' thing we've all seen for months.

http://205.122.23.229/peng/linusq-a.ogg

Bad example -- that one the server claims to be text/plain.  Fix the buggy
server, please.
It's not my server to fix, and since there are other servers out there that are
also likely misconfigured, it would be foolish to say that it is not worth
looking at a way to have mozilla detect files by extension.

Workaround: open the URL up in IE.
No, you do not understand.  Doing what you suggest would be a gross and blatant
violation of the spec that _no_ browser other than IE commits (I've tested
Mozilla, Opera, Konqueror, Netscape 4, Mosaic, lynx, links, w3m).

We _can_ detect these files by extension or even data sniffing.  However we will
_not_ be doing it.

Please stop spamming this bug with rehashes of discussions that have happened in
the newsgroups many times over.
Workaround is to save it with File->Save or Ctrl-S in the window.
adt: nsbeta1-
Keywords: nsbeta1nsbeta1-
*** Bug 210973 has been marked as a duplicate of this bug. ***
OK, taking.  We've talked a lot, and lots of good ideas here, and I'm going to
implement the simplest one -- filtering out known-not-text chars.
Assignee: rpotts → bz-vacation
Attached patch: Proposed patch
For the curious, with this patch we detect the file in the URL field as binary
three bytes in.
Comment on attachment 135573 [details] [diff] [review]
Proposed patch

Er, ignore that first hunk; I've not updated this tree to tip in a few days...
;)

IS_TEXT_CHAR treats 127 and 8-bit chars as text for now, because various
codepages may use them (though they probably should not be using 127, I can't
guarantee that they are not).

Thoughts?
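
(For readers without the patch handy, a reconstruction of IS_TEXT_CHAR from the
description above -- not the patch text itself:)

    // Anything >= 32 counts as text -- which deliberately includes DEL
    // (127) and all 8-bit values, since various codepages may use them
    // -- plus the usual whitespace/control characters.
    #define IS_TEXT_CHAR(ch)                                           \
      (((unsigned char)(ch)) >= 32 || (ch) == '\t' || (ch) == '\n' ||  \
       (ch) == '\v' || (ch) == '\f' || (ch) == '\r')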
Attachment #135573 - Flags: superreview?(darin)
Attachment #135573 - Flags: review?(darin)
Priority: -- → P1
Summary: Binary file with unknown type displayed as text/plain rather than saved → [FIX]Binary file with unknown type displayed as text/plain rather than saved
Target Milestone: --- → mozilla1.6beta
Comment on attachment 135573 [details] [diff] [review]
Proposed patch

this is better than nothing.  i agree that matching 127 here might be risky.  i
think this is a good heuristic that should help catch a lot of cases.

r+sr=darin
Attachment #135573 - Flags: superreview?(darin)
Attachment #135573 - Flags: superreview+
Attachment #135573 - Flags: review?(darin)
Attachment #135573 - Flags: review+
Checked in.  The next step is to add sniffers for common formats, per comment 29
(which I think has a good summary of the situation).  Please file bugs on those
and assign them to me?  So far we have base64 on the list, right?
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Keywords: relnote
Product: Core → Core Graveyard