Upload uncompressed runnable-jobs.json artifact instead of gzipped version.

RESOLVED FIXED in mozilla65

Status

defect
RESOLVED FIXED
2 years ago
5 months ago

People

(Reporter: emorley, Assigned: KWierso)

Tracking

unspecified
mozilla65
Dependency tree / graph

Details

Attachments

(2 attachments)

In bug 1422133 the runnable jobs metadata was split out of the 40MB task graph and also saved to disk gzipped, to improve performance/avoid timeouts when fetched by Treeherder in response to people using the "add new jobs" feature.

However the uploaded file doesn't have a `Content-Encoding` or `Content-Type` header that includes `gzip`, so automatic decompression breaks.

For example:

$ curl -sSfL --compressed 'https://public-artifacts.taskcluster.net/f8vydm9GSW2M88FWKjmruw/0/public/runnable-jobs.json.gz' | jq .
parse error: Invalid numeric literal at line 1, column 32

$ python
>>> import requests
>>> r = requests.get('https://public-artifacts.taskcluster.net/f8vydm9GSW2M88FWKjmruw/0/public/runnable-jobs.json.gz')
>>> j = r.json()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Compare:

$ curl -IL --compressed https://public-artifacts.taskcluster.net/f8vydm9GSW2M88FWKjmruw/0/public/runnable-jobs.json.gz
HTTP/1.1 200 OK
Content-Type: application/octet-stream
...

$ curl -IL --compressed https://secure.pub.build.mozilla.org/builddata/buildjson/builds-4hr.js.gz
HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Encoding: gzip
...

Now since S3 doesn't support dynamic serving of content (CloudFront does, but anything in AWS hits S3 directly) I'm presuming taskcluster needs to set the correct Content-Encoding (or Content-Type, but not both) when uploading any files whose filename ends with `.gz`?

For the case above, is this taskcluster-worker, generic-worker, docker-worker, ...?
Flags: needinfo?(pmoore)
(In reply to Ed Morley [:emorley] from comment #0)

> For the case above, is this taskcluster-worker, generic-worker,
> docker-worker, ...?

gecko-3-decision is docker-worker...

Luckily we have a docker-worker expert on the team ;)
Component: Worker → Docker-Worker
Flags: needinfo?(pmoore) → needinfo?(wcosta)
It's just a binary blob which happens to be gzip encoded, so the encoding is correct.  There isn't anything the worker could do to figure this out other than guessing based on the file extension -- and that got IE into a lot of trouble back in the day.  Even if it did guess, it would be to set the content-type, not the content-encoding.

So basically this file is gzipped *by convention* not by anything involving HTTP.
Flags: needinfo?(wcosta)
Ok, but set the Content-Type instead then :-)

It's annoying to just not have this file usable out of the box.
This just seems like another case of Taskcluster idealism over practicality fwiw.
Anyway, sorry for being short, we just seem to be accumulating workarounds for issues when it comes to downloads of files from S3:
https://github.com/mozilla/treeherder/blob/f360621541b9cf95d8c1e03a8228eeedb6cc803c/treeherder/etl/common.py#L37-L39
https://github.com/mozilla/treeherder/blob/f360621541b9cf95d8c1e03a8228eeedb6cc803c/treeherder/etl/common.py#L59-L67

...it would be good if we could move the fixes for some of these upstream.
The new artifact API does, in fact, allow specification of a content-encoding.  The question is, how does that get communicated from the task (which is just writing files to disk) to the worker (which uploads it).  Basically we lose metadata on the filesystem.

I think the only option is to guess based on file extension and/or content sniffing.  And we know how well that turned out for IE.  I think if anything's going to do content sniffing, it should be the consumer of the artifact (where any errors in sniffing can be handled appropriately) rather than in the worker (where suddenly an artifact named /index/by-email/info@somedomain.gz fails to download correctly, yet all other similarly-named artifacts download correctly).

Probably the best solution is to invent some means of uploading artifacts for a task that does not involve writing them to disk.  But how to do that securely is an as-yet unsolved problem.  One you are welcome to work on a fix for :)
I agree that sniffing content type based on URL is in most cases a bad idea.

However for uploads from a filesystem, particularly for files created by the Taskcluster build process, the set of expected filenames and extensions is much smaller. Whilst someone could deliberately go out of their way to create a file named `info@somedomain.gz`, it's much less likely than in the URL based example of `/index/by-email/info@somedomain.gz`.

(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> Probably the best solution is to invent some means of uploading artifacts
> for a task that does not involve writing them to disk.

Agreed :-)
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> The new artifact API does, in fact, allow specification of a
> content-encoding.  The question is, how does that get communicated from the
> task (which is just writing files to disk) to the worker (which uploads it).
> Basically we lose metadata on the filesystem.

We don't need to lose this metadata - the task payload already declares artifact metadata including the location of the file storing the content, the artifact name, the artifact type, and artifact expiry. It could also comfortably declare a content type. This could be optional for backwards compatibility, whereby either the worker or the queue would choose/guess/sniff etc if not provided a type.
Note, the above solution solves the issue for file artifacts, but not for directory artifacts.

Some solution could also be possible for directory artifacts too, but then it would be more complicated. Alternatively, we could warn against using directory artifacts, if we decide we don't want to support custom content-type settings for directory artifacts.
Note, this allows explicitly setting ContentType in generic worker jobs - this doesn't help docker-worker or taskcluster-worker at this point in time, so we should leave this bug open so patches can be made to them too.

Requested review in the PR, so I won't duplicate the review flag here.
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/764a9bd321c2d49925e7e140adcdef1bb57a8d46
Bug 1423215 - Make contentType an optional property of file artifact in task payload

https://github.com/taskcluster/generic-worker/commit/e6b09b419cfd244655d4f50ee81e8efc290c595e
Merge pull request #70 from taskcluster/bug1423215

Bug 1423215 - Make contentType an optional property of file artifact in task payload
Duplicate of this bug: 1274078
Generic Worker
==============
Released in generic-worker 10.4.0. Not globally rolled out to all our workers yet.

See "contentType" property in the artifacts section of:
  https://docs.taskcluster.net/reference/workers/generic-worker/payload

In this update, content encoding is still decided by the worker (and it does a pretty good job of deciding if it is worth the encoding cost or not) but content type can be explicitly specified, if the system default for the filename extension is not wanted.


Taskcluster Worker
==================
Still a TODO for taskcluster-worker.


Docker Worker
=============

I'm guessing we might not want to port this to docker-worker at this point, since we are planning to migrate to using taskcluster-worker instead. However, that might depend how long the migration takes. The task in comment 0 is a docker-worker task, so it certainly is a problem there.
Filed https://github.com/taskcluster/taskcluster-worker/issues/376 for the tc-worker TODO..


Otherwise, we close this as content-encoding should never be sniffed.
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → INVALID
To be clear, generic-worker doesn't sniff content-encoding -- it *performs* content-encoding and uses heuristics to decide when to do so.

The cause of much of the issue here is that we have uploaded an artifact that the task has pre-gzipped (what I was getting to in comment 2).  It would have been better to output that artifact as runnable-jobs.json and let the various pieces of infrastructure decide whether it's efficient to ship those bits around bracketed with gzip encoding/decoding (and without changing the filename).  That's what we have done with task-graph.json, actions.json, etc.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Summary: Uploaded gzipped job artifacts (such as runnable-jobs.json.gz) have incorrect Content-Encoding/Type → Upload uncompressed runnable-jobs.json artifact instead of gzipped version.
This should not be landed until the treeherder Part 1 PR from bug 1494750 is landed and deployed, otherwise Add New Jobs will stop working for jobs built on this patch, since the fallback to the gzip artifact will not be in place.
Blocks: 1494750
No longer blocks: 1494750
Depends on: 1494750
Attachment #9029099 - Attachment description: Bug 1423215 - Upload only an uncompressed runnable-jobs.json artifact r?dustin → Bug 1423215 - Upload an uncompressed runnable-jobs.json artifact in addition to the gz version r?dustin
Pushed by wkocher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/88d304c633b6
Upload an uncompressed runnable-jobs.json artifact in addition to the gz version r=dustin
I changed the patch to upload both the gzip and the uncompressed version of the artifact. This way, it can land independently of the treeherder patches to support the non-gzip version, and we can get to uplifting it right away. 

Once the patch uplifts to all of the repos we care about, and once we stop having to care about old try pushes, we can make a new patch that stops uploading the gzip artifact, and that can just ride the trains.
Assignee: nobody → wkocher
https://hg.mozilla.org/mozilla-central/rev/88d304c633b6
Status: REOPENED → RESOLVED
Closed: Last year7 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
(In reply to Wes Kocher (:KWierso) from comment #18)
> This way, it can land independently of the treeherder patches
> to support the non-gzip version, and we can get to uplifting it right away. 

Is there a treeherder bug/issue to cover this part? While these artifacts aren't particularly large, I'd rather not double-upload indefinitely.

> Once the patch uplifts to all of the repos we care about, and once we stop
> having to care about old try pushes, we can make a new patch that stops
> uploading the gzip artifact, and that can just ride the trains.

Tom has (once again) taken care of the uplift piece. Thanks, Tom.

What's the horizon on caring about old Try pushes? Is it the same as our two month Try retention threshold?
(In reply to Chris Cooper [:coop] pronoun: he from comment #21)
> Is there a treeherder bug/issue to cover this part? While these artifacts
> aren't particularly large, I'd rather not double-upload indefinitely.
Bug 1494750 covers the TH patches. Now that all trees upload the uncompressed artifacts, I was thinking I could write a patch to stop uploading the gzipped version and letting it ride the trains. (It could also be uplifted once we're sure nothing is depending on the gzip artifacts if you don't want double artifacts for the life of esr60) 


> What's the horizon on caring about old Try pushes? Is it the same as our two
> month Try retention threshold?
Yeah, once old pushes expire so that all non-expired pushes have the uncompressed artifacts. Or until someone makes an executive decision to stop supporting "add new jobs" for non-expired try pushes that don't have that artifact...
Blocks: 1494750
No longer depends on: 1494750
Blocks: 1513480
I've filed bug 1513480 for stopping the creation of the old file (to land only after the Treeherder parts have been deployed).
Component: Docker-Worker → Workers
You need to log in before you can comment on or make changes to this bug.