Closed Bug 1416270 (opened 7 years ago, closed 6 years ago)
Queue: artifacts served with gzip encoding regardless of Accept-Encoding
Categories: Taskcluster :: Services (enhancement)
Tracking: (Not tracked)
Status: RESOLVED WORKSFORME
People: Reporter: Pike; Assigned: pmoore
When downloading the artifact in the URL, sometimes we get the installer.exe as-is, and sometimes we get the same file gzip'ed.
Looks like the gzip encoding is sometimes applied and sometimes not, or not advertised.
Example of gzip'ed output:
[axel@l10n-merge1.private.scl3 ~]$ wget https://queue.taskcluster.net/v1/task/TuT79epdQpSH033sMQrY6Q/runs/0/artifacts/public/build/de/target.installer.exe
--2017-11-10 07:52:06-- https://queue.taskcluster.net/v1/task/TuT79epdQpSH033sMQrY6Q/runs/0/artifacts/public/build/de/target.installer.exe
Resolving queue.taskcluster.net... 23.23.74.217, 54.197.255.25, 174.129.218.85
Connecting to queue.taskcluster.net|23.23.74.217|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe [following]
--2017-11-10 07:52:06-- https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
Resolving public-artifacts.taskcluster.net... 54.230.53.225
Connecting to public-artifacts.taskcluster.net|54.230.53.225|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41579315 (40M) [application/x-msdownload]
Saving to: “target.installer.exe”
100%[======================================>] 41,579,315 25.3M/s in 1.6s
2017-11-10 07:52:08 (25.3 MB/s) - “target.installer.exe” saved [41579315/41579315]
[axel@l10n-merge1.private.scl3 ~]$ file target.installer.exe
target.installer.exe: gzip compressed data, was "target.installer.exe"
[axel@l10n-merge1.private.scl3 ~]$ wget --version
GNU Wget 1.12 built on linux-gnu.
+digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl
-iri
Wgetrc:
/etc/wgetrc (system)
Locale: /usr/share/locale
Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
-DLOCALEDIR="/usr/share/locale" -I. -I../lib -O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic -fno-strict-aliasing
Link: gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic
-fno-strict-aliasing -Wl,-z,relro -lssl -lcrypto
/usr/lib64/libssl.so /usr/lib64/libcrypto.so -ldl -lrt ftp-opie.o
openssl.o http-ntlm.o gen-md5.o ../lib/libgnu.a
Copyright © 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Originally written by Hrvoje Nikšić <hniksic@xemacs.org>.
Currently maintained by Micah Cowan <micah@cowan.name>.
Please send bug reports and questions to <bug-wget@gnu.org>.
Example of non-gzip'ed output:
Fuchsia:zzz axelhecht$ wget https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
--2017-11-10 17:40:48-- https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
Resolving public-artifacts.taskcluster.net... 52.84.158.254
Connecting to public-artifacts.taskcluster.net|52.84.158.254|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41579315 (40M) [application/x-msdownload]
Saving to: 'target.installer.exe'
target.installer.ex 100%[===================>] 39.65M 5.79MB/s in 6.9s
2017-11-10 17:40:55 (5.78 MB/s) - 'target.installer.exe' saved [41583193]
Fuchsia:zzz axelhecht$ file target.installer.exe
target.installer.exe: PE32 executable (GUI) Intel 80386, for MS Windows, UPX compressed
Fuchsia:zzz axelhecht$ wget --version
GNU Wget 1.19.2 built on darwin16.7.0.
-cares +digest -gpgme +https +ipv6 -iri +large-file -metalink -nls
+ntlm +opie -psl +ssl/openssl
Wgetrc:
/usr/local/etc/wgetrc (system)
Compile:
clang -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/usr/local/etc/wgetrc"
-DLOCALEDIR="/usr/local/Cellar/wget/1.19.2/share/locale" -I.
-I../lib -I../lib -I/usr/local/opt/openssl@1.1/include -DNDEBUG
Link:
clang -DNDEBUG -L/usr/local/opt/openssl@1.1/lib -lssl -lcrypto -ldl
-lz ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a -liconv
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Originally written by Hrvoje Niksic <hniksic@xemacs.org>.
Please send bug reports and questions to <bug-wget@gnu.org>.
Reporter
Comment 1 • 7 years ago
More detail with headers: ran wget --save-headers, then viewed the output with less.
Working:
HTTP/1.1 200 OK
Content-Type: application/x-msdownload
Content-Length: 41579315
Connection: keep-alive
Date: Fri, 10 Nov 2017 14:30:54 GMT
Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
ETag: "137e76feda1ad0a8803f40634e72e16d"
Content-Encoding: gzip
x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
Accept-Ranges: bytes
Server: AmazonS3
Age: 8261
X-Cache: Hit from cloudfront
Via: 1.1 81c085110a4ab1cc157a3023ea302f38.cloudfront.net (CloudFront)
X-Amz-Cf-Id: tBIF4GSN5IUBCA8YNFCsR93rqulX64S34_21U-uSVB4_k6z9EctYIg==
MZ<90>^@^C^@^@^@^D^@^@^@<FF><FF>^@^@<B8>^@^@^@^@^@^@^@@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@<E0>^@^@^@^N^_<BA>^N^@<B4> <CD>!
<B8>^AL<CD>!This program cannot be run in DOS mode.
Non-working (saved file is still gzip'ed on disk):
HTTP/1.1 200 OK
Content-Type: application/x-msdownload
Content-Length: 41579315
Connection: keep-alive
Date: Fri, 10 Nov 2017 14:58:06 GMT
Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
ETag: "137e76feda1ad0a8803f40634e72e16d"
Content-Encoding: gzip
x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
Accept-Ranges: bytes
Server: AmazonS3
Age: 6669
X-Cache: Hit from cloudfront
Via: 1.1 ec7268fa1110683dbc457e57c2be1475.cloudfront.net (CloudFront)
X-Amz-Cf-Id: rf8ZvNoZWH2SCkTtqEY4Fl8zGJHMBclBdt5SvHnq_1sVu0wD3ZjoPw==
^_<8B>^H^H^@^@^@^@^@<FF>target.installer.exe^@<EC><B2>y4<9B><FF><FB>><F8>d^Q!!A
^P^DQ<A1>A<90><92>ڢ^Z<B1><A5><B5>%v<B5>+<AA><C4>R^R<B4><B5><EF>^QA7<A5>U<D5>ֻ+
Assignee
Updated • 7 years ago
Assignee: nobody → pmoore
Assignee
Updated • 7 years ago
Component: Queue → Generic-Worker
Reporter
Comment 2 • 7 years ago
Got it: we always gzip, regardless of whether the client supports it:
Fuchsia:zzz axelhecht$ curl -vO https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 52.84.158.254...
* TCP_NODELAY set
* Connected to public-artifacts.taskcluster.net (52.84.158.254) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: auth.taskcluster.net
* Server certificate: DigiCert SHA2 Secure Server CA
* Server certificate: DigiCert Global Root CA
> GET /TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe HTTP/1.1
> Host: public-artifacts.taskcluster.net
> User-Agent: curl/7.54.0
> Accept: */*
>
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0< HTTP/1.1 200 OK
< Content-Type: application/x-msdownload
< Content-Length: 41579315
< Connection: keep-alive
< Date: Fri, 10 Nov 2017 17:07:30 GMT
< Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
< ETag: "137e76feda1ad0a8803f40634e72e16d"
< Content-Encoding: gzip
< x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< Via: 1.1 eefd24fb23003934ecf16bb607089417.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: Rr_cO8rFprjU9wtR4X06sYWY6Zhb_ngT1Dox5L3gQ2MamxARljkP6A==
<
{ [16384 bytes data]
100 39.6M 100 39.6M 0 0 2652k 0 0:00:15 0:00:15 --:--:-- 4280k
* Connection #0 to host public-artifacts.taskcluster.net left intact
Fuchsia:zzz axelhecht$ shasum -a 256 target.installer.exe
ba9d1b01a9dd59ce669301bee880b1affa84a9b01563bf7206e0b38f1f7c07b2 target.installer.exe
Fuchsia:zzz axelhecht$ curl --compressed -vO https://public-artifacts.taskcluster.net/TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 52.84.158.254...
* TCP_NODELAY set
* Connected to public-artifacts.taskcluster.net (52.84.158.254) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: auth.taskcluster.net
* Server certificate: DigiCert SHA2 Secure Server CA
* Server certificate: DigiCert Global Root CA
> GET /TuT79epdQpSH033sMQrY6Q/0/public/build/de/target.installer.exe HTTP/1.1
> Host: public-artifacts.taskcluster.net
> User-Agent: curl/7.54.0
> Accept: */*
> Accept-Encoding: deflate, gzip
>
< HTTP/1.1 200 OK
< Content-Type: application/x-msdownload
< Content-Length: 41579315
< Connection: keep-alive
< Date: Fri, 10 Nov 2017 14:30:54 GMT
< Last-Modified: Fri, 10 Nov 2017 14:15:42 GMT
< ETag: "137e76feda1ad0a8803f40634e72e16d"
< Content-Encoding: gzip
< x-amz-version-id: lZSnyJAvR6LTWJLrT.PvdoSPWlKyMKRd
< Accept-Ranges: bytes
< Server: AmazonS3
< Age: 9438
< X-Cache: Hit from cloudfront
< Via: 1.1 635d6b64075ae1410e6cbc26907c7141.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: 3pikUAk7s-t4ht6ikVhrOeiuf6FbmepAM0IX_4EguyQAXciQJFfE5A==
<
{ [5792 bytes data]
100 39.6M 100 39.6M 0 0 5863k 0 0:00:06 0:00:06 --:--:-- 5907k
* Connection #0 to host public-artifacts.taskcluster.net left intact
Fuchsia:zzz axelhecht$ shasum -a 256 target.installer.exe
112e6c8ee7579f43d582933f34e5cf81c8d93d4b76c74ac7437542eaf1696ddd target.installer.exe
Summary: artifacts served with and without gzip encoding → artifacts served with gzip encoding regardless of Accept-Encoding
Assignee
Comment 3 • 7 years ago
I think the problem here is that we are serving compressed content from our origin:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html#compressed-content-custom-origin
As I see it, CloudFront doesn't provide a mechanism to *decompress* content that it serves; since we only provide a compressed version, that is all that is available.
I'm not sure of the mechanics of the interaction between our CloudFront web front ends and our S3 buckets, but maybe this is something that can be resolved in the Queue?
As far as the worker is concerned, it publishes content to S3 with the correct encoding headers, so I think it is up to something in Queue/CloudFront/S3 to handle decompressing content as appropriate.
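For reference, the upload side looks roughly like this: a minimal Go sketch (not the actual generic-worker code), assuming a pre-signed S3 PUT URL of the kind the queue's createArtifact endpoint hands back. S3 stores the Content-Encoding header as object metadata and replays it on every GET, which is where the "always gzip" behaviour gets baked in:

// Minimal sketch, not the actual generic-worker code.
// putURL is assumed to be a pre-signed S3 PUT URL obtained from the queue.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"net/http"
	"os"
)

func uploadGzipped(putURL string, payload []byte) error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPut, putURL, &buf)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/x-msdownload")
	// Stored as S3 object metadata and replayed verbatim on every GET,
	// regardless of the downloader's Accept-Encoding.
	req.Header.Set("Content-Encoding", "gzip")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("upload failed: %s", resp.Status)
	}
	return nil
}

func main() {
	data, err := os.ReadFile("target.installer.exe")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := uploadGzipped(os.Getenv("PUT_URL"), data); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}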
@John/Jonas, what are your thoughts on this? Is there a way some service in the chain can decompress, or should we just return an HTTP 406 if gzip compression is not listed in the request's Accept-Encoding field (as per https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.3)?
Flags: needinfo?(jopsen)
Flags: needinfo?(jhford)
Comment 4 • 7 years ago
So, what's happening is that S3 really takes Simple to extremes. If you upload a resource with Content-Encoding: gzip, it gets sent out with Content-Encoding: gzip regardless of whether you specify Accept-Encoding: gzip, Accept-Encoding: ??? or no Accept-Encoding.
This is done because S3 doesn't understand how to do gzip compression dynamically based on content negotiation. I believe CloudFront does support automatic gzip and content negotiation, but that would likely require not uploading with a content-encoding. Even then, it might end up doing double gzip encoding; I'm not sure, and I'd need to do some tests to confirm. The docs aren't super clear to me: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
We do this so that we can serve gzip content inside the AWS regions and direct from S3 to browsers in a way that they will automatically decompress (think logs, etc.). This mainly has an impact on non-browser consumers. It's definitely in violation of the spec, but it is what we need to do to serve gzip artifacts directly to the browser.
There's good news coming, though! The new artifact API and tooling that I've been working on will help address this. I've written a JavaScript library (https://github.com/taskcluster/taskcluster-lib-artifact) and a Go library and CLI tool (https://github.com/taskcluster/taskcluster-lib-artifact-go) which automate all of the logic around uploading and downloading artifacts in a safe and reliable way. These libraries are about to be deployed into workers; the JS library is complete, and the Go one is undergoing final review. It turns out that downloading artifacts safely is not a simple task, and it requires a decent amount of work to do correctly. Because of this, the supported interface to blob artifacts will become one of these libraries or the CLI tool. These libraries do a bunch of extra verifications to ensure that the file you download contains the exact bytes that were created originally.
In the meantime, I would suggest using an HTTP library to make a HEAD request against the queue (following redirects) and then checking the Content-Encoding. To do this, use one of the library clients (JS, Python, Go, Java) to generate a signed URL for the method, then run the HEAD request against that. I know it's not ideal, but we're working to resolve this!
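Until then, a client can cope with the unconditional gzip by checking the response header itself. A rough Go sketch (this is not taskcluster-lib-artifact-go; the hard-coded URL is just the artifact from this bug, and a signed URL from a client library would work the same way):

// Rough sketch of a client that tolerates unconditional gzip encoding.
package main

import (
	"compress/gzip"
	"io"
	"log"
	"net/http"
	"os"
)

func download(url, dest string) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	// Setting Accept-Encoding explicitly stops Go's transport from
	// transparently decompressing and hiding the Content-Encoding header.
	req.Header.Set("Accept-Encoding", "gzip")
	resp, err := http.DefaultClient.Do(req) // follows the queue's 303 redirect
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var body io.Reader = resp.Body
	if resp.Header.Get("Content-Encoding") == "gzip" {
		gz, err := gzip.NewReader(resp.Body)
		if err != nil {
			return err
		}
		defer gz.Close()
		body = gz // decompress on the fly
	}
	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, body)
	return err
}

func main() {
	url := "https://queue.taskcluster.net/v1/task/TuT79epdQpSH033sMQrY6Q/runs/0/artifacts/public/build/de/target.installer.exe"
	if err := download(url, "target.installer.exe"); err != nil {
		log.Fatal(err)
	}
}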
Flags: needinfo?(jhford)
Comment 5 • 7 years ago
> @John/Jonas, what are your thoughts on this? Is there a way some service in the chain can decompress, or should we just return an HTTP 406 if gzip compression is not listed in the request's Accept-Encoding field?
At some point early on I think we agreed that we require all clients to support redirects and gzip.
I guess 406'ing the requests if they don't support gzip would be nice, but we don't currently have the content-encoding in the queue datastore. This might be something we can explore.
@pmoore, note however, that except for generic-worker all other worker implementations will only gzip the logs.
In many cases gzipping isn't worth the overhead, and we might ask ourselves if it's worth the trouble here?
Flags: needinfo?(jopsen)
Comment 6 • 7 years ago
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #5)
> At some point early on I think we agreed that we require all clients to
> support redirects and gzip.
I'm only aware of such a specific agreement with regard to the new artifact api.
> I guess 406'ing the requests if they don't support gzip would be nice. But
> we don't currently have the content-encoding in
> the queue datastore. But this might be something we can explore.
We explored it, and it's part of the new artifact api. I think the path forward is to integrate taskcluster-lib-artifact{-go,} into workers and use the existing support for that.
> @pmoore, note however, that except for generic-worker all other worker
> implementations will only gzip the logs.
> In many cases gzipping isn't worth the overhead, and we might ask ourselves
> if it's worth the trouble here?
In the context of the new artifact api, content-encoding support was considered a blocking feature. Since we now have tools specifically to facilitate easy uploading and downloading of artifacts, with basically transparent gzip encoding/decoding, I think there's no reason not to support it. We should be suggesting the taskcluster-lib-artifact{-go,} libraries and CLI as the way to get artifacts, since doing it manually is a huge source of errors.
Reporter
Comment 7 • 7 years ago
I don't agree that it's OK for our download assets to break spec-compliant HTTP clients. Regardless of what any kind of library does on top, we shouldn't break the web.
Assignee
Comment 8 • 7 years ago
(In reply to Axel Hecht [:Pike] from comment #7)
> I don't agree that it's OK for our download assets to break spec-compliant
> HTTP clients. Regardless of what any kind of library does on top, we
> shouldn't break the web.
I agree: any public-facing URLs we expose should be HTTP compliant.
As I understand it (disclaimer: mostly based on guesswork), when you download an artifact, you hit the queue service, which redirects you to a cloudfront url, which acts as a proxy front end to S3 buckets. In a perfect world, all three services would be HTTP compliant and respect the Accept-Encoding request header.
I think it should be relatively straightforward for the queue to return an HTTP 406 if the client doesn't accept gzip content encoding, and this would be compliant with the spec.
Whether we can (or should) do this at all three levels (the queue, cloudfront, and S3) is a good question. Certainly it would be nice if all services did this, but it could be argued that only the queue url is client-facing. However, in reality the user sees the cloudfront url - so it arguably should also be compliant, because it is a discoverable url.
My two cents - I welcome others to pitch in with comments/ideas.
Assignee
Comment 9 • 7 years ago
(based on the assumption that the S3 URL that cloudfront proxies is discoverable/reachable for clients)
Comment 10 • 7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
> (In reply to Axel Hecht [:Pike] from comment #7)
> I think it should be relatively straightforward for the queue to return an
> HTTP 406 if the client doesn't accept gzip content encoding, and this would
> be compliant with the spec.
That's easy at the Queue level, but impossible at the S3 URL level. S3 URLs have their headers and content set once, at creation. The only way to do this would be to turn off public access to the S3 objects, do the content negotiation in the Queue, and change the queue to redirect to a *signed* S3 URL. The S3 URL would still behave as it does now, but we'd have done the content negotiation in the queue and would not allow creating the URL for clients which do not set Accept-Encoding: gzip. The thing which follows redirects would still need to understand how to parse a resource with Content-Encoding: gzip without an Accept-Encoding request header set.
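That flow might look something like the following, as an illustrative Go sketch only (the real queue is Node.js): signedS3URL is a hypothetical stand-in for AWS SDK URL pre-signing, and the Accept-Encoding check is deliberately simplified (a fuller version is sketched under comment 12 below):

// Illustrative only: content negotiation in the queue, then a redirect
// to a signed S3 URL. The real queue is Node.js; signedS3URL is a
// hypothetical placeholder for AWS SDK URL pre-signing.
package main

import (
	"net/http"
	"strings"
)

func signedS3URL(objectPath string) string {
	// Placeholder: a real implementation would pre-sign a short-lived
	// GET URL for the private S3 object.
	return "https://public-artifacts.taskcluster.net" + objectPath + "?X-Amz-Signature=..."
}

func acceptsGzip(r *http.Request) bool {
	vals, present := r.Header["Accept-Encoding"]
	if !present {
		return true // no preference expressed (RFC 7231 section 5.3.4)
	}
	// Simplified: a production check would parse q-values (see comment 12).
	ae := strings.ToLower(strings.Join(vals, ","))
	return strings.Contains(ae, "gzip") || strings.Contains(ae, "*")
}

func artifactHandler(w http.ResponseWriter, r *http.Request) {
	if !acceptsGzip(r) {
		http.Error(w, "artifact is stored gzip-encoded", http.StatusNotAcceptable) // 406
		return
	}
	http.Redirect(w, r, signedS3URL(r.URL.Path), http.StatusSeeOther) // 303
}

func main() {
	http.HandleFunc("/", artifactHandler)
	http.ListenAndServe(":8080", nil)
}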
> Whether we can (or should) do this at all three levels (the queue,
> cloudfront, and S3) is a good question. Certainly it would be nice if all
> services did this, but it could be argued that only the queue url is
> client-facing. However, in reality the user sees the cloudfront url - so it
> arguably should also be compliant, because it is a discoverable url.
>
> My two cents - I welcome others to pitch in with comments/ideas.
(In reply to Axel Hecht [:Pike] from comment #7)
> I don't agree that it's OK for our download assets to break spec-compliant
> HTTP clients. Regardless of what any kind of library does on top, we
> shouldn't break the web.
Reading through the Accept-Encoding/Content-Encoding sections of the HTTP spec, I don't see anywhere that says the use of Content-Encoding in a response requires an Accept-Encoding in the request, or that Accept-Encoding is anything more than a preference.
https://tools.ietf.org/html/rfc7231#section-3.1.2.2
If one or more encodings have been applied to a representation, the
sender that applied the encodings MUST generate a Content-Encoding
header field that lists the content codings in the order in which
they were applied. Additional information about the encoding
parameters can be provided by other header fields not defined by this
specification.
https://tools.ietf.org/html/rfc7231#section-5.3.4
A request without an Accept-Encoding header field implies that the
user agent has no preferences regarding content-codings. Although
this allows the server to use any content-coding in a response, it
does not imply that the user agent will be able to correctly process
all encodings.
I guess I was premature in saying we're breaking the spec. I couldn't actually find anywhere in the HTTP spec that requires the Content-Encoding header to match one of the specified Accept-Encoding options. Content-Encoding has to be set if the content has a content coding applied, but the Accept-Encoding header just states a preference.
Assignee
Comment 11 • 7 years ago
From https://tools.ietf.org/html/rfc7231#section-5.3.4, I think the two exceptions would be:
1) An Accept-Encoding header that doesn't include '*' or 'gzip' (either an empty value, or a list of things that includes neither of these)
2) An Accept-Encoding header that contains 'gzip;q=0'
So technically the behaviour in Axel's example is compliant, but I believe a modified form of it could demonstrate non-compliance (e.g. "Accept-Encoding: gzip;q=0", an empty "Accept-Encoding:", or "Accept-Encoding: foo").
Assignee
Updated • 7 years ago
QA Contact: pmoore
Comment 12 • 7 years ago
Suggestion:
If the request doesn't carry "Accept-Encoding: gzip" and the resource is gzipped, we reply 406 "Not Acceptable". From Wikipedia:
The requested resource is capable of generating only content not acceptable according to the Accept headers sent in the request.
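A self-contained sketch of that check in Go (illustrative only; the queue itself is Node.js), also honouring the exceptions Pete lists in comment 11: a missing header means no preference, while an empty header, a list without gzip or *, or gzip;q=0 should trigger the 406:

// Sketch of the proposed Accept-Encoding check, with q-value handling.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// acceptsGzip reports whether a gzip response is acceptable.
// headerPresent distinguishes a missing Accept-Encoding header (no
// preference, so gzip is fine) from an empty one (gzip unacceptable).
func acceptsGzip(acceptEncoding string, headerPresent bool) bool {
	if !headerPresent {
		return true
	}
	for _, part := range strings.Split(acceptEncoding, ",") {
		fields := strings.Split(strings.TrimSpace(part), ";")
		coding := strings.ToLower(strings.TrimSpace(fields[0]))
		if coding != "gzip" && coding != "*" {
			continue
		}
		q := 1.0 // default quality when no q parameter is given
		for _, param := range fields[1:] {
			param = strings.TrimSpace(param)
			if strings.HasPrefix(param, "q=") {
				if v, err := strconv.ParseFloat(param[2:], 64); err == nil {
					q = v
				}
			}
		}
		if q > 0 {
			return true
		}
	}
	return false // header present but gzip not acceptable: reply 406
}

func main() {
	for _, h := range []string{"gzip", "deflate, gzip", "gzip;q=0", "foo", "*", ""} {
		fmt.Printf("%q -> %v\n", h, acceptsGzip(h, true))
	}
}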
Comment 13 • 7 years ago
We can't do this without breaking existing clients that handle gzip but don't specify the Accept-Encoding header.
But we could probably do this for the new blob artifact type.
Comment 14 • 7 years ago
PR for this fix: https://github.com/taskcluster/taskcluster-queue/pull/264
(credit to jhford for originally suggesting this in Berlin)
Note:
This will only affect the 'blob' type, so as we switch workers to use this artifact type, we are liable to see some breakages.
That might be acceptable, since we want consumers to start verifying hashes anyway.
Comment 15 • 6 years ago
This will start working as we transition more artifacts to the blob type.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Updated • 6 years ago
Component: Generic-Worker → Workers
Assignee
Updated • 6 years ago
Component: Workers → Services
Summary: artifacts served with gzip encoding regardless of Accept-Encoding → Queue: artifacts served with gzip encoding regardless of Accept-Encoding