Bug 1580992 (Closed) - Opened 5 years ago, Closed 3 years ago

taskcluster-artifacts.net returns gzipped content regardless of client Accept-Encoding

Categories

(Taskcluster :: Workers, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: rhelmer, Unassigned)

References

Details

https://taskcluster-artifacts.net seems to always send files gzipped, with Content-Encoding: gzip, even if the client doesn't send an Accept-Encoding: gzip header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding).

I ran into this because I am trying to test the Firefox stub installer against a build I did on tryserver; the stub installer does not support gzip.

STR (steps to reproduce):

  curl -o target.installer.exe 'https://taskcluster-artifacts.net/KrDpB_bYQSGZortUtl-f4A/0/public/build/install/sea/target.installer.exe' && file target.installer.exe

As an aside, it turns out this .exe is already compressed, so gzipping actually makes it larger... this might not be worth dealing with, but it is something to consider.

This is a consequence of how we store artifacts, which is to just stick them in S3 (with a CDN in front). Unfortunately, S3 does not support content-encoding negotiation, and just returns the bytes (and headers) as set on the stored object. Our workers make a guess as to whether something should be compressed and set Content-Encoding accordingly, in hopes that any tool downloading artifacts does accept gzip encoding. Unfortunately, curl doesn't.

The object service, which is proposed as a TC RFC but as yet unimplemented, aims to address this issue - see Content Negotiation. That solution isn't great either, since it would mean your curl requests get a 406 instead of blindly downloading gzipped data, rather than "just working".

But I would argue that curl is the wrong tool here, as I've seen similar issues downloading other sorts of files from services that aren't Taskcluster. A recent revision of curl switched from filling my terminal with mojibake to warning "this looks like binary content", but in either case the fix is curl | gunzip -c. I think curl is functioning as designed: its job is to make an HTTP request and write the response body to stdout. That just isn't quite what users want in this case.
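(For reference, both workarounds look like this - a sketch, using the artifact URL from comment 0:)

  # Let curl advertise gzip support and transparently decompress the response:
  curl --compressed -o target.installer.exe \
    'https://taskcluster-artifacts.net/KrDpB_bYQSGZortUtl-f4A/0/public/build/install/sea/target.installer.exe'

  # Or pipe the raw gzipped body through gunzip by hand:
  curl 'https://taskcluster-artifacts.net/KrDpB_bYQSGZortUtl-f4A/0/public/build/install/sea/target.installer.exe' \
    | gunzip -c > target.installer.exe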

I think we'd be open to alternative solutions, but storing every object once for each possible content-encoding is probably not a good one, and we want to avoid our servers handling artifact data (for example by proxying the download from S3 and gunzipping) as that doesn't scale cheaply. Do you have ideas?

See Also: → 1416270

> (In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)
>
> This is a consequence of how we store artifacts, which is to just stick them in S3 (with a CDN in front). Unfortunately, S3 does not support content-encoding negotiation, and just returns the bytes (and headers) as set on the stored object. Our workers make a guess as to whether something should be compressed and set Content-Encoding accordingly, in hopes that any tool downloading artifacts does accept gzip encoding. Unfortunately, curl doesn't.
>
> The object service, which is proposed as a TC RFC but as yet unimplemented, aims to address this issue - see Content Negotiation. That solution isn't great either, since it would mean your curl requests get a 406 instead of blindly downloading gzipped data, rather than "just working".
>
> But I would argue that curl is the wrong tool here, as I've seen similar issues downloading other sorts of files from services that aren't Taskcluster. A recent revision of curl switched from filling my terminal with mojibake to warning "this looks like binary content", but in either case the fix is curl | gunzip -c. I think curl is functioning as designed: its job is to make an HTTP request and write the response body to stdout. That just isn't quite what users want in this case.

Oh right, sorry -- I was just using curl as an example. The real reason this came up for me is that the Firefox Windows installer (which uses NSIS) does not support gzip, and for the stub installer especially we want to keep it as tiny and simple as possible: https://searchfox.org/mozilla-central/source/browser/installer/windows/nsis

> I think we'd be open to alternative solutions, but storing every object once for each possible content-encoding is probably not a good one, and we want to avoid our servers handling artifact data (for example by proxying the download from S3 and gunzipping) as that doesn't scale cheaply. Do you have ideas?

I'll think about it and research/talk to some folks. Off the top of my head, I think we could fix this narrow case with what you've said above:

> Our workers make a guess as to whether something should be compressed and set Content-Encoding accordingly, in hopes that any tool downloading artifacts does accept gzip encoding. Unfortunately, curl doesn't.

Running file on one of our installer exes will print something like PE32 executable (GUI) Intel 80386, for MS Windows, UPX compressed, the key part being the "UPX compressed". Gzipping such files will actually make them slightly larger, so it's not really helping in any case :)
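(That's easy to check locally - a sketch, with the filename assumed:)

  file target.installer.exe      # ... UPX compressed
  gzip -k target.installer.exe   # keep the original, write target.installer.exe.gz
  ls -l target.installer.exe target.installer.exe.gz   # expect the .gz to be slightly larger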

I think that if workers could just not compress these files, it would not only save a few CPU cycles, it'd also let us use Taskcluster to test the stub installer (I hit this bug because I wanted to point the stub installer at the result of my try run, which would be a pretty nice workflow :) )

In any case, I don't think this should be a high-priority bug at all; it'd just be a small quality-of-life improvement for the very few people who touch the installer code!

I suspect that would just entail adding "exe" to this list (note we do not run file). Pete, what do you think?

Component: Services → Workers
Flags: needinfo?(pmoore)

We could add .exe to that list, although for uncompressed executables, that might not be ideal.

Bug 1536780 is in progress (I heard from Max yesterday) so that might offer an alternative solution that won't affect other .exe artifacts.

I suspect that will land sometime in September, so it's probably best for us to hold out for that.

Depends on: 1536780
Flags: needinfo?(pmoore)
See Also: → 1577785

Also see bug 1416270 comment 15 - it looks like blob artifacts might save us here, if we migrate to using them.


Looking at the task definition, I see a directory artifact is used rather than a list of file artifacts, and so with the current feature, you would only be able to specify a single Content-Encoding header for all published artifacts of the directory.

So I think the two options available with generic-worker 16.2.0 are:

  1. Change task definition to list all published artifacts as file artifacts (rather than a single directory artifact), and specify "contentEncoding": "identity" for artifact public/build/install/sea/target.installer.exe.
  2. In the existing directory artifact, specify "contentEncoding": "identity" so that all artifacts public/build/* will have identity encoding.

Both options have major downsides. Either you now need to maintain a list of artifacts that are published, rather than just including all files inside a directory, or you have to sacrifice the ability to compress any of the artifacts, rather than just the one you want to not compress.
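As a concrete sketch of option 1, the artifacts section of the task payload might look something like this (the paths and names here are illustrative, not taken from the actual task definition):

  "artifacts": [
    {
      "type": "file",
      "path": "public/build/install/sea/target.installer.exe",
      "name": "public/build/install/sea/target.installer.exe",
      "contentEncoding": "identity"
    },
    {
      "type": "file",
      "path": "public/build/target.zip",
      "name": "public/build/target.zip"
    }
  ]

Artifacts without an explicit contentEncoding keep the worker's default behaviour.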

Rob, are either of these options worthwhile?

We could consider modifying generic-worker to allow a file artifact to take precedence over a directory artifact, such that you could override the contentEncoding just for a single file within a directory artifact, but that doesn't exist as a feature yet, and we'd need to think about whether that is really desirable.

Ultimately we are probably solving this in the wrong place - the platform serving the content should respect the Accept-Encoding request header, and unfortunately anything we do in the worker only avoids the issue rather than solving it.

I wonder if we should revisit this issue to see if there is a way we can achieve the storage gains of compressed artifacts, but have them served in a way that respects the Accept-Encoding request header.
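(The current behaviour is easy to demonstrate - a sketch, using the artifact URL from comment 0; the expected output is what this bug reports:)

  # Explicitly ask for identity encoding; the CDN returns gzip anyway:
  curl -sI -H 'Accept-Encoding: identity' \
    'https://taskcluster-artifacts.net/KrDpB_bYQSGZortUtl-f4A/0/public/build/install/sea/target.installer.exe' \
    | grep -i '^content-encoding'
  # expected, per this bug: Content-Encoding: gzip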

Flags: needinfo?(rhelmer)
Blocks: 1585578

> (In reply to Pete Moore [:pmoore][:pete] (PTO - return 7 October) from comment #6)
>
> Looking at the task definition, I see a directory artifact is used rather than a list of file artifacts, and so with the current feature, you would only be able to specify a single Content-Encoding header for all published artifacts of the directory.
>
> So I think the two options available with generic-worker 16.2.0 are:
>
>   1. Change task definition to list all published artifacts as file artifacts (rather than a single directory artifact), and specify "contentEncoding": "identity" for artifact public/build/install/sea/target.installer.exe.
>   2. In the existing directory artifact, specify "contentEncoding": "identity" so that all artifacts public/build/* will have identity encoding.
>
> Both options have major downsides. Either you now need to maintain a list of artifacts that are published, rather than just including all files inside a directory, or you have to sacrifice the ability to compress any of the artifacts, rather than just the one you want to not compress.
>
> Rob, are either of these options worthwhile?

I think #1 is fine, given that we have a pretty small number of artifacts. Out of curiosity, which artifacts do we publish that aren't already compressed? Log files? It seems like the actual build artifacts tend to be compressed (zip, exe, etc.)

> We could consider modifying generic-worker to allow a file artifact to take precedence over a directory artifact, such that you could override the contentEncoding just for a single file within a directory artifact, but that doesn't exist as a feature yet, and we'd need to think about whether that is really desirable.
>
> Ultimately we are probably solving this in the wrong place - the platform serving the content should respect the Accept-Encoding request header, and unfortunately anything we do in the worker only avoids the issue rather than solving it.
>
> I wonder if we should revisit this issue to see if there is a way we can achieve the storage gains of compressed artifacts, but have them served in a way that respects the Accept-Encoding request header.

This would probably be ideal.

Flags: needinfo?(rhelmer)

> (In reply to Robert Helmer [:rhelmer] from comment #7)
>
> I think #1 is fine, given that we have a pretty small number of artifacts. Out of curiosity, which artifacts do we publish that aren't already compressed? Log files? It seems like the actual build artifacts tend to be compressed (zip, exe, etc.)

Hi Rob,

There is a blacklist of file extensions that by default are not compressed. All other file extensions, by default, will be served compressed with Content-Encoding: gzip. Since generic-worker 16.2.0 it is now possible to override the default behaviour.
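(For reference, the directory-artifact form of the override - option 2 from comment 6 - would look roughly like this; the path is illustrative:)

  "artifacts": [
    {
      "type": "directory",
      "path": "public/build",
      "contentEncoding": "identity"
    }
  ]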

Noted in https://github.com/taskcluster/scrum/issues/37 for inclusion in the object service

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → MOVED