Open Bug 1470011 Opened 6 years ago Updated 2 days ago

.tar.gz files mangled on download by double compression

Categories

(Core :: Networking, defect, P2)

60 Branch
defect

People

(Reporter: RossBoylan, Unassigned)

References

Details

(Whiteboard: [necko-triaged][necko-priority-next])

User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Build ID: 20180605171542

Steps to reproduce:

I added new information to https://bugzilla.mozilla.org/show_bug.cgi?id=1414459 but am not sure anyone will see it, since that bug is marked as resolved incomplete.  So I'm opening this item as a reference to it.

In brief, when using FF to read email hosted at office365.com, downloading an attached .tar.gz does not yield the original file.  It appears to be the result of gzipping the tar.gz a second time.  At any rate, the result is not usable unless one undoes the extra zip, and it is larger than the original.



Actual results:

See the http log recently attached to the original bug report.


Expected results:

The attachment is downloaded as is.  MSIE and Chrome both behave as expected.
Hello Ross,

 The archives with the extension ".tar.gz" need to be decompressed twice on a Windows OS: first from "archive.tar.gz" to "archive.tar", and then from "archive.tar" to the original files. I should mention that I used the 7-Zip File Manager to decompress the archives.

If this is not the case, then I am missing some important bit of information, because I cannot reproduce the problem. Any ".tar.gz" archive downloaded with either Chrome or Firefox Release from any location (including an attachment from Office 365) needs to be decompressed twice on Windows using the 7-Zip File Manager.

If you think this still needs investigation, please provide a file with which the issue occurs on Firefox and not in Chrome and more detailed steps to reproduce.

P.S. Sometimes, it is normal for a compressed file to be larger than the original file, because it is in a format that's faster to upload/download.
Flags: needinfo?(RossBoylan)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0
Build ID: 20180621125625

Hi all, 

I'm encountering this issue in the context of downloading .tar.gz software artifacts: those downloaded via Firefox fail a checksum while those obtained via Chrome or cURL pass. 

Artifacts downloaded from Firefox (load URL, choose "Save File") can be gunzipped twice. After the first gunzip, the files still show their gzip magic numbers as verified through an `od -x <file>`. After executing gunzip a second time on the file downloaded via Firefox, the resulting .tar does *not* have the gzip header magic numbers.  Note that this is taking place on OSX 10.13 and not a Windows machine. 
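
For anyone who wants to script this check, here is a minimal Python sketch of the magic-number test described above (the file name is just an example):

    import gzip

    GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream

    def gzip_layers(path):
        """Count how many gzip layers wrap the file at `path`."""
        with open(path, "rb") as f:
            data = f.read()
        layers = 0
        while data[:2] == GZIP_MAGIC:
            layers += 1
            data = gzip.decompress(data)
        return layers

    # A correct download of a .tar.gz reports 1; the mangled
    # Firefox download described above reports 2.
    print(gzip_layers("ddev_linux.v0.20.0.tar.gz"))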

The following is a URL to a public artifact that will consistently reproduce the issue:
https://3009-80669528-gh.circle-artifacts.com/0/artifacts/ddev_linux.v0.20.0.tar.gz
Hi Gary,

In bug 1414459, you requested an HTTP log. The reporter provided the HTTP log after that issue was marked Resolved Incomplete, and then opened this bug to replace it. 
Furthermore, I can't reproduce this issue on Windows 10 or Mac OS X, with this archive or any other.

Does the HTTP log from the previous bug help you? Can we move on from there?
Component: Untriaged → Networking
Flags: needinfo?(xeonchen)
Product: Firefox → Core
(In reply to Bodea Daniel [:danibodea] from comment #1)
I've been on vacation.  See below.


> Hello Ross,
> 
>  The archives with the extension ".tar.gz" need to be decompressed twice on
> a Windows OS. Firstly, they are decompressed from "archive.tar.gz" to
> "archive.tar" and then, from "archive.tar" to the original files. I have to
> mention that I have used the 7-Zip File Manager to decompress the archives.
> 

My remark about needing to decompress twice did NOT refer to the layering of tar and gzip.  It referred to a second level of gzipping.

The problem Andrew reported in comment 2 does appear to be the same problem I am experiencing.


> If this is not the case, then I am missing some important bit of information
> because I cannot reproduce it. Both Chrome and Firefox Release browsers will
> download a ".tar.gz" archive from any place (including an attached file from
> Office 365), will need to be decompressed twice on Windows OS using 7-Zip
> File Manager.

Perhaps what you are missing is that by running 7-Zip twice you are removing the 2 levels of gzipping, as well as the tarring.

The file that is downloaded as foo.tar.gz should be called foo.tar.gz.gz.  That is, the original foo.tar.gz has been gzipped again on download. So if you unzip it once you get a file called "foo.tar", but in format it is "foo.tar.gz".  So if you try to read it with, e.g., tar, it doesn't work (maybe it would with tar's -z option).  I don't know if this misbehavior is limited to MS Windows; Andrew's experience suggests it is not.

If one knows all this, one can work around it by unzipping the file, adding .gz back to the resulting file, and proceeding as normal. But this is a complete hack, and FF has no business applying a second zip.
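
For reference, the workaround just described can be scripted; a rough Python sketch, with placeholder file names (not actual tooling from this bug):

    import gzip
    import shutil

    # Firefox saved a double-gzipped file under the single-gzip name.
    downloaded = "foo.tar.gz"        # content is really foo.tar.gz.gz
    repaired = "foo.repaired.tar.gz"

    # gzip.open() strips the one extra layer added on download;
    # what is written out is the original foo.tar.gz, byte for byte.
    with gzip.open(downloaded, "rb") as src, open(repaired, "wb") as dst:
        shutil.copyfileobj(src, dst)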

> 
> If you think this still needs investigation, 

YES

> please provide a file with
> which the issue occurs on Firefox and not in Chrome and more detailed steps
> to reproduce.
> 
I provided the requested telemetry capture on the old bug.

Andrew provided an example that seems better than mine, since it a) is public and b) doesn't involve MS-hosted email.


> P.S. Sometimes, it is normal for a compressed file to be larger than the
> original file, because it is in a format that's faster to upload/download.

The result is significantly larger not because of any optimization; it's larger because the only thing gzipping twice can do is add overhead.

Ross
Flags: needinfo?(RossBoylan)
Hi Valentin,

Can you help (find someone to) check this?
Flags: needinfo?(xeonchen) → needinfo?(valentin.gosu)
The problem is that some ill-configured servers will return a double-gzipped .tar.gz with `Content-Type: application/gzip` + `Content-Encoding: gzip`, while other servers will return a single-gzipped .tar.gz with the same headers. We can't handle both cases simultaneously.
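
To illustrate the ambiguity in a hedged sketch (plain Python, not Necko code; save_body is a hypothetical helper): after decoding the advertised Content-Encoding, the two server behaviours can only be told apart by sniffing the decoded bytes.

    import gzip

    GZIP_MAGIC = b"\x1f\x8b"

    def save_body(body: bytes, content_encoding: str) -> bytes:
        # Both the broken server (double-gzipped payload) and the
        # correct server (single-gzipped payload) send
        # Content-Type: application/gzip + Content-Encoding: gzip,
        # so the headers alone cannot disambiguate.
        if content_encoding == "gzip":
            decoded = gzip.decompress(body)
            if decoded[:2] == GZIP_MAGIC:
                # Server really double-gzipped: decoding once yields
                # the .tar.gz the user asked for.
                return decoded
            # Encoding header was bogus: decoding would hand the user
            # a bare tar under a .gz name (the bug 35956 situation).
            return body
        return body

    # Note this is only a heuristic: a tar whose first two bytes
    # happen to be 0x1f 0x8b would be misclassified.
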
I can confirm there seems to be a problem here.
I reproduced on Linux Ubuntu - downloading with Firefox seems to produce a smaller file than you get with Chrome.
After extracting, the files seem to be the same, so the issue is with how the file is processed, rather than with the download itself.
wget also produces an archive identical to what Chrome downloads. So unless the server is sending something different to Chrome/Firefox, I think we should change the behaviour here.
Flags: needinfo?(valentin.gosu)
The response has a "Content-Encoding: gzip" header, and for some reason we don't decompress it. The content stored in the cache (where it should be kept exactly as received from the server) and the content saved as a file are the same.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P3
Whiteboard: [necko-triaged]
We don't do content conversion because the content encoding header is dropped at https://searchfox.org/mozilla-central/rev/97d488a17a848ce3bebbfc83dc916cf20b88451c/netwerk/protocol/http/nsHttpChannel.cpp#5463. But in this case it's not a bogus header. We're receiving a gzipped gzipped tar, and we need to gunzip it once.
See Also: → 35956
See also bug 448240 and bug 426273. Do we no longer have to worry about the Apache bug?
(In reply to Masatoshi Kimura [:emk] from comment #6)
> The problem is that some ill-configured servers will return a double-gzipped
> .tar.gz with `Content-Type: application/gzip` + `Content-Encoding: gzip`,
> while other servers will return a single-gzipped .tar.gz with the same
> headers. We can't handle both cases simultaneously.

First, I'm skeptical that the problem is the server since MSIE and Chrome both download the file as it was originally.

Second, even if the servers (there are two mentioned so far in this bug) are responsible for the extra layer of zipping, the fact that the other 2 browsers both handle it OK indicates that both cases clearly can be handled simultaneously.
Oh, I think I understand: the theory is that the server is doing an extra gzip (which is dumb, but not necessarily a violation of internet standards) and indicating that properly in the headers.  But FF ignores the header information because it is sometimes returned incorrectly when there was not an extra layer of zipping.

If so, the other browsers either have a way of distinguishing when headers are telling the truth about an extra layer of zipping, or else they fail in the situations where FF presumably succeeds (one level of gzip, but headers indicate 2 levels).

I'm finding this same issue when I'm trying to download the Linux tar.gz file from here:
https://ootrandomizer.com/downloads

Only Firefox downloads the tar.gz double-gzipped, and you have to gunzip it once before you can decompress and extract it with tar. Chrome works fine for this, as does wget.

Using file I get this for the working downloads:
gzip compressed data, was "0.tar", last modified: Wed Oct 16 16:28:21 2019, max compression, from Unix, original size modulo 2^32 183767040

For the double gzipped files downloaded via Firefox:
gzip compressed data, original size modulo 2^32 66735979
Then after gunzipping once:
gzip compressed data, was "0.tar", last modified: Wed Oct 16 16:28:21 2019, max compression, from Unix, original size modulo 2^32 183767040
as expected.

I am experiencing this with 72.0.1 on Linux. It is really an issue. We have to change our server software to avoid gzip transport compression on files that are already gzip'd.

Still having this problem with Linux on 79.0. I understand that it is arguably the fault of the web server for gzipping something that is already gzipped, but that should not create these kinds of problems.

Ah, and Chromium handles it fine :-)

This is still occurring in Firefox 94.0. It has been confirmed by multiple people.

First of all, a server should guess the content-type and content-encoding not only by file extension but by using the magic number found at the start of the file (see also the standard IANA MIME types: https://www.iana.org/assignments/media-types/media-types.xhtml).
Of course, looking at the magic number at the start of each file is something that almost no web server does (for performance reasons, etc.); instead, this job is usually done by browsers.

If a server is sending a file *.tar.gz with the following HTTP headers, i.e.:

Content-Length: nnnnnn
Content-Type: application/gzip
Content-Encoding: gzip
    or
Content-Length: nnnnnn
Content-Type: application/x-gzip
Content-Encoding: gzip

then the server is sending WRONG HTTP headers, because the innermost extension must be used for Content-Type and only the outer extensions should be used for Content-Encoding; so, for a file *.tar.gz, it should use these HTTP headers:

Content-Length: nnnnnn
Content-Type: application/x-tar
Content-Encoding: gzip

If the file were named *.tar.gz.Z, the HTTP headers should be:

Content-Length: nnnnnn
Content-Type: application/x-tar
Content-Encoding: gzip, compress

The content encodings must be listed in the order they were applied to the representation (file) (See also: RFC-7231 3.1.2.2. Content-Encoding).

If the file were named *.gz (with only one extension), these HTTP headers should be used (without Content-Encoding):

Content-Length: nnnnnn
Content-Type: application/gzip

See also: mime types (https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types).

If the server served a *.tar file and tried to compress the content on the fly, and therefore did not know in advance the final length of the requested representation, it should use compressed chunked encoding (for an HTTP/1.1 client), i.e.:

Content-Type: application/x-tar
Transfer-Encoding: gzip, chunked

without Content-Length and without Content-Encoding headers (see also: RFC-7230 3.3.1. Transfer-Encoding).

These kinds of things are basic rules that have applied since RFC-2616 (1999).

Nobody in their right mind would ever compress a file twice, when the only result is a bigger file.

HINT

I think Firefox should intercept these cases (bad content-type/content-encoding HTTP headers) and log a warning for each one, so that web administrators can verify whether their servers are properly configured.

My conclusions are the following.

  1. The best strategy when handling file name extensions and encodings is to start from the end of the file name and walk backwards, collecting all known extensions associated with encodings (i.e. .gz, .Z, .uu, etc.); then pick the first known extension which is not an encoding. If there is no known non-encoding extension, there are two options (see the sketch after this list):
    A) if the last known extension was .gz (an encoding), the web server can use:
    Content-Type: application/x-gzip
    or
    B) Content-Type: application/octet-stream
    Content-Encoding: gzip

  2. Firefox should never transform the received web resource when storing it to disk, unless there is a "Transfer-Encoding" header naming a compression transformation (i.e. chunked, gzip / compress). Right now it looks like, when receiving the above-mentioned HTTP headers, FF compresses the received file before storing it to disk in order to make its content compliant with the HTTP headers; but this is wrong, because the web server may cheat on content-type and content-encoding, and in any case the file content should never be re-compressed by the user agent (browser).
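
To make strategy 1 concrete, here is a minimal Python sketch of the extension walk described above (the helper headers_for and both extension tables are illustrative assumptions, not server code):

    # Hypothetical tables; a real server would use its full MIME map.
    ENCODING_EXTS = {".gz": "gzip", ".z": "compress"}
    TYPE_EXTS = {".tar": "application/x-tar", ".txt": "text/plain"}

    def headers_for(filename):
        name = filename.lower()
        encodings = []
        # Walk backwards from the end of the name, peeling off
        # extensions that denote encodings.
        while True:
            for ext, enc in ENCODING_EXTS.items():
                if name.endswith(ext):
                    # Peeled outermost-first; Content-Encoding must
                    # list them in the order they were applied.
                    encodings.insert(0, enc)
                    name = name[: -len(ext)]
                    break
            else:
                break
        # Pick the first known non-encoding extension for the type.
        for ext, ctype in TYPE_EXTS.items():
            if name.endswith(ext):
                content_type = ctype
                break
        else:
            # No known non-encoding extension: options A/B above.
            if encodings and encodings[-1] == "gzip":
                encodings.pop()
                content_type = "application/gzip"  # or x-gzip, per A
            else:
                content_type = "application/octet-stream"
        headers = {"Content-Type": content_type}
        if encodings:
            headers["Content-Encoding"] = ", ".join(encodings)
        return headers

    # headers_for("backup.tar.gz.Z") returns
    #   {"Content-Type": "application/x-tar",
    #    "Content-Encoding": "gzip, compress"}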

Errata: in comment 18, the following sentence:

"If the file would be named: *.tar.gz.Z the HTTP headers should be":

Content-Length: nnnnnn
Content-Type: application/x-tar
Content-Encoding: gzip, compress

should be read as:

"If the file would be named: *.tar.gz.Z the HTTP headers should be":

Content-Length: nnnnnn
Content-Type: application/x-tar
Content-Encoding: gzip

Actually we had the same problem over here.
The web page owner reported: "My CentOS server has gzip compression set. That's a fairly basic option. Looks like Firefox's downloader is too dumb to realize the server is using compression. I added a line to the htaccess file to disable gzip in the dev directory."
This indeed resolved the problem we were discussing.
:D

Severity: normal → S3

It happened to me too; the bug still exists in 112.0 (64-bit).
Steps to reproduce:
Try to download the file with wget and with Firefox; Firefox still double-compresses it.
It would be appreciated if you could fix this bug.

Sorry, I forgot to add the link:
https://www.hdsentinel.com/hdslin/hdsentinel-019b.gz
(In reply to 3409769 from comment #24)

> It happened to me too; the bug still exists in 112.0 (64-bit).
> Steps to reproduce:
> Try to download the file with wget and with Firefox; Firefox still double-compresses it.
> It would be appreciated if you could fix this bug.

Fixing this bug will regress bug 35956 and Chrome has that bug. I uploaded the same file to my server: https://emk.name/test/hdsentinel-019b.gz
Chrome will download and decompress this file, but it does not remove the .gz extension. That is exactly the problem we fixed in bug 35956.

We should either

  1. WONTFIX this bug, or
  2. fix this bug and WONTFIX bug 35956.

It would be the product owner's decision.

I see. Could we make decompression optional in settings and make the default behaviour not to decompress .gz files? I think that should fix both bugs, as I understand it. I'm not experienced, so feel free to correct me.

(In reply to Masatoshi Kimura [:emk] from comment #26)

> Fixing this bug will regress bug 35956 and Chrome has that bug. I uploaded the same file to my server: https://emk.name/test/hdsentinel-019b.gz
> Chrome will download and decompress this file, but it does not remove the .gz extension. That is exactly the problem we fixed in bug 35956.
> 
> We should either
> 
>   1. WONTFIX this bug, or
>   2. fix this bug and WONTFIX bug 35956.
> 
> It would be the product owner's decision.

See Also: → 1846117
Duplicate of this bug: 1846117

Given that we got so many bug reports and comments, I think we might want to give this a higher priority.

Priority: P3 → P2
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-review]
Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-new]
Whiteboard: [necko-triaged][necko-priority-new] → [necko-triaged][necko-priority-review]

Ni'ing jesup to talk to Google.

Flags: needinfo?(rjesup)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #31)

> Filed https://bugs.chromium.org/p/chromium/issues/detail?id=1473207

That got duped and then https://bugs.chromium.org/p/chromium/issues/detail?id=1484221 got wontfixed. Does this mean we should adjust necko/gecko's behaviour to match chrome/wget?

Flags: needinfo?(rjesup)
Flags: needinfo?(kershaw)

Dup of bug 610679?

(In reply to Vincent Lefevre from comment #33)

> Dup of bug 610679?

Probably, but there is useful context here that is not in that bug, and vice versa. Let's see what Jesup/Kershaw want to do.

See Also: → 610679

(In reply to :Gijs (he/him) from comment #32)

> (In reply to Randell Jesup [:jesup] (needinfo me) from comment #31)
> 
> > Filed https://bugs.chromium.org/p/chromium/issues/detail?id=1473207
> 
> That got duped and then https://bugs.chromium.org/p/chromium/issues/detail?id=1484221 got wontfixed. Does this mean we should adjust necko/gecko's behaviour to match chrome/wget?

Yes, I think we should match chrome's behavior.

Flags: needinfo?(kershaw)

We should put this change behind a pref and see if we can land this sometime soon.

Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-next]
Flags: needinfo?(rjesup)