Open Bug 986924 Opened 10 years ago Updated 2 years ago

binary gzip content-encoded contents sent as text/plain shows as gibberish

Categories

(Core :: Networking, defect, P3)

People

(Reporter: moumny, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-backlog])

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (Beta/Release)
Build ID: 20140314220517

Steps to reproduce:

go to
http://popcorn.cdnjd.com/
click download


Actual results:

it opens a new tab
and displays the binary as plain text


Expected results:

handle like other archives: ask/open/save
Reproduced on ubuntu 13.10 64-bits.
Component: Untriaged → File Handling
Do any other browsers handle this differently? This looks like a server error, since there's a text/plain Content-Type header being sent.
Assignee: nobody → english-us
Component: File Handling → English US
Product: Firefox → Tech Evangelism
Version: 28 Branch → unspecified
Yes, Chrome downloads the file as you would expect.
Safari works too.

Best regards.
Huh. Chrome appears to be sending the same request headers and receiving the same response headers as us, but apparently it chooses to interpret the result internally as application/x-gzip. That's interesting.
Assignee: english-us → nobody
Component: English US → File Handling
Product: Tech Evangelism → Core
Chromium has code that sniffs out gzip headers: https://code.google.com/p/chromium/codesearch#chromium/src/net/base/mime_sniffer.cc&q=application/x-gzip&sq=package:chromium&dr=C&l=155. We have similar code in nsUnknownDecoder, but don't include such an entry: http://mxr.mozilla.org/mozilla-central/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#266
Component: File Handling → Networking
Not only that. The sniffing part of nsUnknownDecoder::DetermineContentType() only runs if mContentType is empty, which in the case shown here it isn't. See: http://mxr.mozilla.org/mozilla-central/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#301

The server says it is text/plain while in reality it is gzip compressed content.

Which behavior is "correct" here is hard to tell.
It can be noted that if we followed the simple guidelines in the w3c document on mime sniffing for detecting binary vs text (http://mimesniff.spec.whatwg.org/#binary-data-byte), we would only need to read the first byte (0x1f) to determine that this is a binary response...

My vote is that we start sniffing for binary in text/* responses. I'm sure it'll bring some other "interesting" side-effects though and I'm not in a position to say what they are and if they're worth this extra detection.
Please search for "check-for-apache-bug flag" in the mimesniff spec.
Blocks: mimesniff
" Platform: 	x86 Mac OS X "
Not only: same problem on Ubuntu 12.04 (Linux x86_64)
The other problem with opening a binary file as plain text is that it can crash/hang the browser. It happened to me twice (4GB ram).

While a power user can easily circumvent the problems, most users will be just confused/annoyed. So I think my vote goes for sniffing for binary.
OS: Mac OS X → All
Hardware: x86 → All
I can't reproduce this on Aurora 29 on Ubuntu 10.13; I get prompted to download a file of "unknown" type.

However, if the resource is reported as 'text/plain', then the fact that it has binary data bytes should cause it to be sniffed as 'application/octet-stream', per the following algorithms:

http://mimesniff.spec.whatwg.org/#supplied-mime-type-detection-algorithm
http://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm
http://mimesniff.spec.whatwg.org/#rules-for-text-or-binary
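
A minimal sketch of the text-or-binary rules linked above, as I read them in the MIME Sniffing Standard: a byte order mark means text, otherwise any "binary data byte" in the sniffed prefix means application/octet-stream. The names IsBinaryDataByte and LooksBinary and the sample arrays are made up for illustration and are not Gecko code; the sketch assumes C++14 or later and leaves the choice of how many bytes to inspect to the caller.

```cpp
#include <cstddef>

// Binary data bytes as listed in the MIME Sniffing Standard:
// 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F.
constexpr bool IsBinaryDataByte(unsigned char b) {
  return b <= 0x08 || b == 0x0B || (b >= 0x0E && b <= 0x1A) ||
         (b >= 0x1C && b <= 0x1F);
}

// Returns true if a resource labelled text/plain should instead be treated
// as application/octet-stream. Hypothetical helper, not a Gecko function.
constexpr bool LooksBinary(const unsigned char* data, std::size_t len) {
  // A UTF-16 byte order mark means text, even though NUL bytes may follow.
  if (len >= 2 && ((data[0] == 0xFE && data[1] == 0xFF) ||
                   (data[0] == 0xFF && data[1] == 0xFE))) {
    return false;
  }
  // A UTF-8 byte order mark also means text.
  if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
    return false;
  }
  // Otherwise a single binary data byte is enough to call it binary.
  for (std::size_t i = 0; i < len; ++i) {
    if (IsBinaryDataByte(data[i])) {
      return true;
    }
  }
  return false;
}

// Illustrative inputs: the start of a gzip stream, UTF-16LE "hi" with a
// byte order mark, and plain ASCII.
constexpr unsigned char kGzipStart[] = {0x1F, 0x8B, 0x08};
constexpr unsigned char kUtf16Hi[] = {0xFF, 0xFE, 'h', 0x00, 'i', 0x00};
constexpr unsigned char kAsciiHi[] = {'h', 'i', '\n'};
```

With these helpers, LooksBinary(kGzipStart, 3) is true because 0x1f is a binary data byte, while the UTF-16 sample is treated as text because of its byte order mark, despite containing NUL bytes.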
The original URL doesn't work anymore, but I have some further details to shed light on this behavior. I've set up a test URL (http://daniel.haxx.se/dump2.cgi) that only serves the beginning of the file the original URL provided: ~2K out of the original 49MB. I hope that I mimic the original problem closely enough here.

The response headers it sends are these:

  HTTP/1.1 200 OK
  Date: Thu, 03 Apr 2014 09:57:50 GMT
  Server: Apache/2.4.6 (Debian)
  Vary: Accept-Encoding
  Transfer-Encoding: chunked
  Content-Type: text/plain

The beginning of the response body contains the three magic bytes 1f 8b 08 that can identify it as gzip data, but note that there's nothing that says it is.
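
For illustration, a check for those three bytes could look like the sketch below. In the gzip format the first two bytes are the fixed ID bytes and the third is the compression method (8 for deflate). StartsWithGzipMagic and the sample arrays are hypothetical names, not taken from Gecko or Chromium.

```cpp
#include <cstddef>

// Returns true when the buffer starts with 1f 8b 08: the two gzip ID bytes
// followed by compression method 8 (deflate). Hypothetical helper.
constexpr bool StartsWithGzipMagic(const unsigned char* data,
                                   std::size_t len) {
  return len >= 3 && data[0] == 0x1F && data[1] == 0x8B && data[2] == 0x08;
}

// Illustrative inputs: a gzip-like prefix and plain ASCII text.
constexpr unsigned char kGzipPrefix[] = {0x1F, 0x8B, 0x08, 0x00};
constexpr unsigned char kPlainPrefix[] = {'h', 'e', 'l', 'l', 'o'};
```

A match only says the bytes look like gzip data; it does not say whether gzip is the content type of the resource or a content encoding wrapped around something else.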

Using Firefox network tools to inspect the response headers, it claims there's a "Content-Encoding: gzip" header (but that wasn't actually present over the wire). I figure that has been sniffed (and added) somewhere earlier in the function call chain. I'm afraid I don't know yet exactly where that's done.

Then, in nsBinaryDetector::DetermineContentType() the code tries to determine if the content is truly text/plain or possibly binary. That check is *aborted* if Content-Encoding is set! If I just edit out that check (http://mxr.mozilla.org/mozilla-central/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp#626) the function will successfully detect the contents as binary and pop up a dialogue with an offer to download it...
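
The effect of that guard can be sketched roughly as follows. This is not the nsUnknownDecoder.cpp code: CheckTextPlain, ContainsNul, the Verdict enum and the honorEncodingGuard switch are all hypothetical, and "contains a NUL byte" is only a stand-in for whatever heuristic the real detector applies. The sketch assumes C++14 or later.

```cpp
#include <cstddef>

enum class Verdict { KeepDeclaredType, TreatAsBinary };

// Stand-in heuristic: any NUL byte counts as binary. The real detector's
// rules are different; this only makes the control flow visible.
constexpr bool ContainsNul(const unsigned char* data, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    if (data[i] == 0x00) {
      return true;
    }
  }
  return false;
}

// Models the described behavior: when the response carries a
// Content-Encoding and the guard is honored, the text-vs-binary check is
// skipped and the declared text/plain type is kept.
constexpr Verdict CheckTextPlain(bool hasContentEncoding,
                                 bool honorEncodingGuard,
                                 const unsigned char* data,
                                 std::size_t len) {
  if (hasContentEncoding && honorEncodingGuard) {
    return Verdict::KeepDeclaredType;  // check aborted, nothing inspected
  }
  return ContainsNul(data, len) ? Verdict::TreatAsBinary
                                : Verdict::KeepDeclaredType;
}

// Illustrative body bytes (not captured from the test URL).
constexpr unsigned char kSampleBody[] = {0x1F, 0x8B, 0x08, 0x00, 0x00};
```

With the guard honored, a response that carries a Content-Encoding keeps its declared type no matter what the bytes look like; with the guard removed, the same bytes are classified as binary, which matches the download prompt seen after editing the check out.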
(In reply to Daniel Stenberg [:bagder] from comment #13)
>   Vary: Accept-Encoding

> Using Firefox network tools to inspect the response headers, it claims
> there's a "Content-Encoding: gzip" header (but that wasn't actually present
> over the wire).

How did you inspect the response header on the wire? "Vary: Accept-Encoding" means that the response will vary depending on the Accept-Encoding header. Did you send "Accept-Encoding: gzip" on the request?
Argh. Sorry, I messed up. The headers on the wire do indeed include "Content-Encoding: gzip". I looked at the wrong request. This is the request:

GET /dump2.cgi HTTP/1.1
Host: daniel.haxx.se
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

which actually gets this response:

HTTP/1.1 200 OK
Date: Thu, 03 Apr 2014 11:07:45 GMT
Server: Apache/2.4.6 (Debian)
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 2787
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/plain

The cgi on my server just does this, so I'm a bit surprised it delivers a Content-Length:

#!/bin/sh

echo "Content-Type: text/plain"
echo ""
cat file

So, yeah the server does indeed say this is Content-Encoding gzip and text/plain.
Assignee: nobody → daniel
What's missing here is that the main sniffing is done _before_ the decompressing of the content. The decompressed content is delivered without sniffing (except for trying to figure out charset) and the content-type is trusted.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Summary: Firefox considers .tgz archive as plain text → binary gzip content-encoded contents sent as text/plain shows as gibberish
Continued: so the stream parser decompresses the content and calls nsHTTPCompressConv::do_OnDataAvailable() for each decompressed chunk. This is used for all sorts of data and not just html or text to render, so we can sniff there. This function then calls mListener->OnDataAvailable() to deliver the data.

In our problematic case, that function is nsHtml5StreamParser::DoDataAvailable() which calls nsHtml5StreamParser::SniffStreamBytes() in which I've played with adding detection logic for binary contents like below.

1 - I think the detection isn't good enough for UTF16 contents

2 - I don't know what to do if we truly detect that the content is binary and not text at all!

--- a/parser/html/nsHtml5StreamParser.cpp
+++ b/parser/html/nsHtml5StreamParser.cpp
@@ -737,10 +737,21 @@ nsHtml5StreamParser::SniffStreamBytes(const uint8_t* aFromSegment,
         mTreeBuilder->SetDocumentCharset(mCharset, mCharsetSource);
         return SetupDecodingAndWriteSniffingBufferAndCurrentSegment(
           aFromSegment, aCount, aWriteCount);
       }
     }
+    else if (mMode == PLAIN_TEXT) {
+      uint32_t i;
+      for(i=0; i<countToSniffingLimit; i++) {
+        if(!aFromSegment[i]) {
+          fprintf(stderr, "***************** found zero at index %u\n", i);
+          break;
+        }
+      }
+    }
     if (mCharsetSource == kCharsetFromParentForced ||
         mCharsetSource == kCharsetFromUserForced) {
       // meta not found, honor override
       return SetupDecodingAndWriteSniffingBufferAndCurrentSegment(
         aFromSegment, aCount, aWriteCount);
This bug is set to be blocking bug 808593, but possibly it is the other way around and this bug would not even exist if bug 808593 was implemented...
Please implement the algorithm from the MIME Sniffing Standard [1] instead of inventing yet another own algorithm.

[1] http://mimesniff.spec.whatwg.org/#sniffing-a-mislabeled-binary-resource
1 - I didn't invent a new algorithm, I was testing where I could detect binary.
2 - See bug 808593
Whiteboard: [necko-backlog]
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Possible duplicate of bug 864851?
Assignee: daniel → nobody
Status: ASSIGNED → NEW
Severity: normal → S3