Bug 1347956 - Some public-artifacts.taskcluster.net files are not served gzipped
Opened 8 years ago; Closed 7 years ago
Categories: Taskcluster :: Services (enhancement)
Status: RESOLVED FIXED
People: Reporter: emorley; Assignee: unassigned
Compare:
curl -IL --compressed "https://queue.taskcluster.net/v1/task/dTgEO-dOQ4KOnc-ZJ6FLLQ/runs/0/artifacts/public/logs/live_backing.log"
curl -IL --compressed "https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log"
The former's response (after the HTTP 303) has `Content-Encoding: gzip` whereas the latter does not:
$ curl -IL --compressed "https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log"
HTTP/1.1 303 See Other
...
Location: https://public-artifacts.taskcluster.net/EnHkkQP8RN-7mJf9ZHo4Ug/0/public/test_info//reftest-no-accel_errorsummary.log
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 3457715
Connection: keep-alive
Date: Thu, 16 Mar 2017 14:50:42 GMT
Last-Modified: Thu, 16 Mar 2017 00:39:28 GMT
ETag: "2fb988221e25255927d1964f1d728ece"
x-amz-version-id: f2EfsMJdWE56i2PLwJrUhZDkhV8s0f8f
Accept-Ranges: bytes
Server: AmazonS3
Age: 1467
X-Cache: Hit from cloudfront
Via: 1.1 d2458a4f6c40dc75a7c4afbe573a1387.cloudfront.net (CloudFront)
X-Amz-Cf-Id: 3QbcTOOnG8RzGsktdrVXJmwGXeEN4VElEUii54oZV9CGJO75Qvf1fw==
This file is uploaded via the `MOZ_UPLOAD_DIR` mechanism, so possibly by this script?
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/blob_upload.py
Whilst I know that in some places we compress before uploading to S3, CloudFront can automatically apply gzip too:
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html#compressed-content-cloudfront-file-types
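For illustration, that is roughly a single flag on the distribution's default cache behaviour. A hedged boto3 sketch (the distribution ID is a placeholder, and per the linked docs CloudFront only compresses eligible responses when the client sends Accept-Encoding: gzip):

```python
# Hedged sketch: flip CloudFront's automatic compression on for the
# distribution fronting public-artifacts.taskcluster.net.
# "EXXXXXXXXXXXXX" is a placeholder distribution ID.
import boto3

cf = boto3.client("cloudfront")
dist_id = "EXXXXXXXXXXXXX"

resp = cf.get_distribution_config(Id=dist_id)
config = resp["DistributionConfig"]
config["DefaultCacheBehavior"]["Compress"] = True  # gzip eligible responses when the client asks

cf.update_distribution(
    Id=dist_id,
    IfMatch=resp["ETag"],  # required so concurrent config edits aren't clobbered
    DistributionConfig=config,
)
```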
Greg, I don't suppose you could redirect to the best person to look at this? :-)
(It would help with the slow transfer performance in eg bug 1347945)
Flags: needinfo?(garndt)
Comment 1•8 years ago
So the links to live_backing.log are compressed; the worker does that because it's a file the worker maintains and knows to upload compressed.
The worker makes no assumptions about the compression to use when uploading files that a task creates. We could probably just assume that all text/plain artifacts should be gzipped.
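For illustration only (not the worker's actual code), a minimal Python sketch of what uploading a pre-compressed artifact could look like, assuming direct boto3/S3 access rather than the queue's signed PUT URLs:

```python
# Illustrative only: gzip an artifact up front and upload it so that S3
# (and CloudFront in front of it) serve it with Content-Encoding: gzip.
import gzip
import boto3

def upload_gzipped(path, bucket, key, content_type="text/plain"):
    with open(path, "rb") as f:
        body = gzip.compress(f.read())
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType=content_type,
        ContentEncoding="gzip",  # clients that honor the header decode transparently
    )
```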
Jonas, thoughts?
Flags: needinfo?(garndt) → needinfo?(jopsen)
Comment 2•8 years ago
We should look into this at some point.
But compression is a storage vs. CPU-cycle trade-off.
Flags: needinfo?(jopsen)
Comment 3•8 years ago (Reporter)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #2)
> But compression is a storage vs. cpu cycle cost.
If we were talking about archival cost only, yes.
But there's the network transfer for the multiple consumers of the file to consider.
Comment 4•8 years ago
(In reply to Greg Arndt [:garndt] from comment #1)
> The worker makes no assumption about the compression to use when uploading
> files that a task creates. We could probably just assume that all
> plain/text artifact should be gzipped.
What about all files, regardless of mime type?
Comment 5•8 years ago
I had a chat with Ed on IRC here: http://logs.glob.uno/?c=ateam#c1107025
09:37 pmoore emorley: jonasfj: do you see any reason why we shouldn't use gzip content encoding for *all* artifacts (not just text/plain ones)?
09:39 pmoore .... we could have a sanity check that if the compressed version is larger than the uncompressed version, we don't - but maybe that never even happens
09:41 pmoore .... i also can't imagine that compression / decompression costs a lot in wall-time - and if it does, probably it is because of enormous artifacts which would then be served more quickly as a result ...
09:41 emorley pmoore: for a point of reference, whitenoise takes the "blacklist rather than whitelist" and "check compression ratio <95%" approach github.com/evansd/whitenoise/blob/…88a/whitenoise/compress.py#L21-L29
09:42 emorley and logs cases where files were skipped due to compression ratio not being good enough
09:42 emorley (where those logs then allow for figuring out which cases to add to the blacklist)
09:42 pmoore ah, nice
09:42 pmoore that seems like a sane approach to me
10:03 pmoore emorley: in the absence of any objections, i'll implement like this in generic-worker - the same blacklist, and the same <95% compression ratio requirement
10:04 emorley sgtm
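For reference, a rough Python sketch of the approach agreed above (the actual generic-worker change is in Go, and the extension blacklist here is illustrative rather than the exact list used by whitenoise or generic-worker):

```python
# Sketch of the "blacklist + <95% compression ratio" approach discussed above.
import gzip
import os

# Hypothetical blacklist of already-compressed formats.
SKIP_EXTENSIONS = {".gz", ".zip", ".png", ".jpg", ".jpeg", ".woff", ".woff2"}

def maybe_gzip(path):
    """Return (data, content_encoding) to use when uploading `path` as an artifact."""
    with open(path, "rb") as f:
        raw = f.read()
    if os.path.splitext(path)[1].lower() in SKIP_EXTENSIONS:
        return raw, None                    # already compressed: upload as-is
    compressed = gzip.compress(raw)
    if len(compressed) >= 0.95 * len(raw):  # saves less than 5%: not worth it
        return raw, None
    return compressed, "gzip"
```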
Comment 6•8 years ago
It's worth noting that some clients (curl, notably) do not automatically decode HTTP responses, so this may break a lot of things.
Comment 7•8 years ago
urllib2 being another example
Comment 8•8 years ago (Reporter)
Good point, we'll need to add `--compressed` to the relevant curl invocations.
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> urllib2 being another example
Another reason for switching to requests :-)
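For anyone hitting this, a quick illustration of the difference (the URL is a placeholder): requests decodes Content-Encoding: gzip transparently, whereas urllib2/urllib.request hands back the raw gzipped bytes and you have to decompress them yourself.

```python
# Illustration only; the artifact URL below is a placeholder.
import gzip
import urllib.request

import requests

url = "https://queue.taskcluster.net/v1/task/<taskId>/runs/0/artifacts/public/logs/live_backing.log"

# requests: the body is already decompressed by the time you read it.
text = requests.get(url).text

# urllib (Python 3 equivalent of urllib2): check the header and decompress by hand.
with urllib.request.urlopen(url) as resp:
    data = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        data = gzip.decompress(data)
```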
Comment 9•8 years ago
There are probably hundreds of those across lots of projects using TC, so I'm not sure we could do so effectively.
Comment 10•8 years ago (Reporter)
Given that, the simpler alternative might be to do compression on demand:
(In reply to Ed Morley [:emorley] from comment #0)
> Whilst I know in some places we compress before upload to S3, Cloudfront can
> automatically use gzip too:
> https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/
> ServingCompressedFiles.html
> https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/
> ServingCompressedFiles.html#compressed-content-cloudfront-file-types
Comment 11•8 years ago
Commits pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/87ab14f1770706cd95eef630a351ecaaa96b828b
Bug 1347956 - blacklist certain file extensions from useing gzip content encoding
https://github.com/taskcluster/generic-worker/commit/f4381839feb46103e1e3eef64172425ea5e29b2e
Merge pull request #49 from taskcluster/gzip-more-artifacts
Bug 1347956 - gzip more artifacts
Comment 12•8 years ago
I've added the blacklist in generic-worker, which should get rolled out to all worker types (OS X, Windows) in the next few days. We'll still need to do this in taskcluster-worker and docker-worker (and maybe some releng workers too - buildbot bridge / signing worker / script worker etc.).
At the moment I don't check for the <95% compression ratio - I've just implemented the blacklist part and gzip-compress everything else.
Comment 13•8 years ago
Just seen the comments between Ed and Dustin. To be honest, since we are already gzipping some artifacts and not others, I expect that anything reading artifacts is probably already handling this. Mostly I guess people will fetch logs, which already had gzip encoding.
I also think that the first time someone runs something and it pulls back binary content, they will quickly work out that it is compressed. So I'm hoping this change won't break too many things.
Comment 14•8 years ago
Another angle on this: Content-Encoding is not an obscure thing, it is a standard header that clients should respect. If it were something more obscure I would tend to agree with the concern, but I think it is reasonable to expect clients to respect the well-defined semantics of the Content-Encoding header.
I say let's roll with it, and back stuff out if it causes pain, but not let the worry that something somewhere might be affected prevent us from making progress. I think a disruptive approach has the potential to have a more positive impact.
Comment 15•8 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> urllib2 being another example
Funny you should mention that! I discovered mozharness is using urllib2 and therefore doesn't support Content-Encoding when downloading artifacts - so I'm adding a fix for that in bug 1358142.
Comment 16•8 years ago (Reporter)
Hi Pete!
Do you know if the changes here have been deployed?
I'd somewhat forgotten about this bug since I thought it had been deployed, but recent slow transaction traces for Treeherder's store-failure-lines task showed 89% of the profile spent in requests fetching files such as:
https://queue.taskcluster.net/v1/task/VQaPi7msQIGTK9HF1PnPcw/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log
...which isn't served gzipped:
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 3479668
Connection: keep-alive
Date: Wed, 07 Jun 2017 10:10:44 GMT
Last-Modified: Wed, 07 Jun 2017 09:35:50 GMT
ETag: "0a41399ab3767f1b27799b47f50277cb"
x-amz-version-id: TiK42srJM9OeCa6AFuZ6VASC6HgkRQ5l
Accept-Ranges: bytes
Server: AmazonS3
X-Cache: Miss from cloudfront
Via: 1.1 c771900addaa417be1d0b79ff157a3f9.cloudfront.net (CloudFront)
X-Amz-Cf-Id: tOhx8tonD9be2JSFd9pjgYqtk_QnN-F8IWNSqyiVxjeAP-z8cwRmuw==
Flags: needinfo?(pmoore)
Comment 17•8 years ago
Changes for this have been deployed for generic-worker, but not docker-worker (which ran the task linked to). I thought there might have been a discussion about whether this is behavior we wanted everywhere, but I cannot find a link to it. Perhaps Pete recalls that discussion.
Comment 18•8 years ago
I think that we haven't discussed docker-worker yet, but my opinion is that it would make sense to implement in docker-worker, if it isn't a huge piece of work. Otherwise we should certainly make sure we have it in taskcluster-worker.
@Jonas,
how far off replacing docker-worker with taskcluster-worker are we, at a guess (6 months/1 year/...)? Do we still intend to replace it? My thinking here is that if we are close to replacing it, it may be better just to roll this out in taskcluster-worker.
@Ed,
Do you have an idea how much cost this might save overall, or if it would have a big performance impact? My guess is, including picking up the pieces if they break, we're talking a couple of days for making the fix(es), testing and reviewing them, and then rolling out to all our worker types, taking care of bustage etc. Maybe 3 days in total, if things don't go smoothly?
Flags: needinfo?(pmoore)
Flags: needinfo?(jopsen)
Flags: needinfo?(emorley)
Comment 19•8 years ago (Reporter)
Thank you for the further details.
Just to check I understand, we have:
* generic-worker -> gzip support added in this bug (comment 11)
* docker-worker -> doesn't yet use gzip, but may be replaced by taskcluster-worker, though the timeline is unclear
* taskcluster-worker -> appears to already gzip (added by bug 1305707), is already used for certain tasks (and might be used for more if it replaces docker-worker)
(In reply to Pete Moore [:pmoore][:pete] from comment #18)
> Do you have an idea how much cost this might save overall, or if it would
> have a big performance impact?
From Treeherder's point of view, the concern is not the cost (we don't pay for the S3 bucket or data transfer) but reducing the proportion of time spent fetching the JSON error-summary files, particularly when S3 is being slow. Looking at the last 7 days, the store-failure-lines task profile shows 81% of the profile spent in the request to queue.taskcluster.net (though some of that is due to things like bug 1348071 / bug 1348072).
I'd imagine there are cost savings (storage + transfer) to be had for the Taskcluster team - for example this file as served is 236 MB but only 9.5 MB using default CLI gzip options:
https://queue.taskcluster.net/v1/task/DBYbDF8RR8CrQL9tREtDaA/runs/0/artifacts/public/test_info/reftest-no-accel_errorsummary.log
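(A rough way to reproduce that measurement in Python; note that gzip.compress defaults to a higher compression level than the gzip CLI, so the numbers will differ slightly:)

```python
# Back-of-the-envelope check of the saving described above.
import gzip
import requests

url = ("https://queue.taskcluster.net/v1/task/DBYbDF8RR8CrQL9tREtDaA/runs/0/"
       "artifacts/public/test_info/reftest-no-accel_errorsummary.log")

raw = requests.get(url).content    # ~236 MB as served, per the figures above
compressed = gzip.compress(raw)    # ~9.5 MB per the figures above
print(len(raw), len(compressed), "%.1f%%" % (100 * len(compressed) / len(raw)))
```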
> My guess is, including picking up the pieces
> if they break, we're talking a couple of days for making the fix(es),
> testing and reviewing them, and then rolling out to all our worker types,
> taking care of bustage etc. Maybe 3 days in total, if things don't go
> smoothly?
Yeah, I agree it's not going to be as straightforward as it might seem. I'm happy to wait until docker-worker is replaced if that's not years away.
My main concern is that the lack of gzipping is something that seems to continually bite us. It would be great to have it considered an essential feature the next time a new worker project is created, so we don't need to try to add it retrospectively :-)
Flags: needinfo?(emorley)
Comment 20•8 years ago
(In reply to Ed Morley [:emorley] from comment #19)
> Thank you for the further details.
>
> Just to check I understand, we have:
> * generic-worker -> gzip support added in this bug (comment 11)
Yes, and that landed in generic-worker 8.3.0. That means the fix is currently running on our Windows builders (win2012), and our Windows 10 machines, but not yet on Windows 7 testers.
> * docker-worker -> doesn't yet use gzip, but may be replaced soon by
> taskcluster-worker but timeline unclear
I believe it uses it for log files, but not for other artifacts. garndt can confirm.
> * taskcluster-worker -> appears to already gzip (added by bug 1305707), is
> already used for certain tasks (and might be used for more if it replaces
> docker-worker)
I believe that is only for the log file. Currently taskcluster-worker isn't used for anything yet in production, as far as I know.
>
> (In reply to Pete Moore [:pmoore][:pete] from comment #18)
> > Do you have an idea how much cost this might save overall, or if it would
> > have a big performance impact?
>
> From Treeherder's point of view, the concern is not the cost (we don't pay
> for the S3 bucket or data transfer) but about reducing the proportion of the
> time spent fetching the json error summary files, particularly when S3 is
> being slow. Looking at the last 7 days, the store-failure-lines task profile
> shows 81% of the profile spent in the request to queue.taskcluster.net
> (though some of that is due to things like bug 1348071 / bug 1348072).
>
> I'd imagine there are cost savings (storage + transfer) to be had for the
> Taskcluster team - for example this file as served is 236 MB but only 9.5 MB
> using default CLI gzip options:
> https://queue.taskcluster.net/v1/task/DBYbDF8RR8CrQL9tREtDaA/runs/0/
> artifacts/public/test_info/reftest-no-accel_errorsummary.log
>
> > My guess is, including picking up the pieces
> > if they break, we're talking a couple of days for making the fix(es),
> > testing and reviewing them, and then rolling out to all our worker types,
> > taking care of bustage etc. Maybe 3 days in total, if things don't go
> > smoothly?
>
> Yeah I agree it's not going to be as straight forwards as it might seem. I'm
> happy to wait until docker-worker is replaced if that's not years away.
>
> My main concern is that lack of gzipping is something that seems to
> continually bite us. It would be great to have it considered as an essential
> feature the next time a new worker project is created, so we don't need to
> try and add it retrospectively :-)
Hindsight is a blessing, hey! :) But yes, indeed....
Thanks for the clarifications.
Note, I hope to roll out a newer generic-worker to the Windows 7 machines over the next week if I don't find any major problems (I can't promise), in which case all generic-worker workers will be uploading gzip-compressed artifacts (when the artifact is not blacklisted as already being compressed).
Comment 21•8 years ago
tc-worker won't be replacing docker-worker anytime soon; we're likely focusing on other uses first.
Note, we should probably declare somewhere that some artifacts are gzipped and others aren't, and that consumers of the API are required to support Content-Encoding: gzip.
Flags: needinfo?(jopsen)
Comment 22•8 years ago
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #21)
> Note, we should probably declare somewhere that some artifacts are gzipped
> and others aren't, and that consumers of the API are required to support
> Content-Encoding: gzip.
That's in the new manual.
Comment 23•7 years ago
Found in triage.
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #22)
> (In reply to Jonas Finnemann Jensen (:jonasfj) from comment #21)
> > Note, we should probably declare somewhere that some artifacts are gzipped
> > and others aren't, and that consumers of the API are required to support
> > Content-Encoding: gzip.
>
> That's in the new manual.
Is documentation enough to close this out in this case?
Flags: needinfo?(dustin)
Comment 24•7 years ago
Sure?
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(dustin)
Resolution: --- → FIXED
Comment 25•7 years ago
Updated•6 years ago (Assignee)
Component: Platform and Services → Services