1423215 - Upload uncompressed runnable-jobs.json artifact instead of gzipped version.

Reporter

Description

•

7 years ago

In bug 1422133 the runnable jobs metadata was split out of the 40MB task graph and also saved to disk gzipped, to improve performance/avoid timeouts when fetched by Treeherder in response to people using the "add new jobs" feature.

However the uploaded file doesn't have a `Content-Encoding` or `Content-Type` header that includes `gzip`, so automatic decompression breaks.

For example:

$ curl -sSfL --compressed 'https://public-artifacts.taskcluster.net/f8vydm9GSW2M88FWKjmruw/0/public/runnable-jobs.json.gz' | jq .
parse error: Invalid numeric literal at line 1, column 32

$ python
>>> import requests
>>> r = requests.get('https://public-artifacts.taskcluster.net/f8vydm9GSW2M88FWKjmruw/0/public/runnable-jobs.json.gz')
>>> j = r.json()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Compare:

$ curl -IL --compressed https://public-artifacts.taskcluster.net/f8vydm9GSW2M88FWKjmruw/0/public/runnable-jobs.json.gz
HTTP/1.1 200 OK
Content-Type: application/octet-stream
...

$ curl -IL --compressed https://secure.pub.build.mozilla.org/builddata/buildjson/builds-4hr.js.gz
HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Encoding: gzip
...

Now since S3 doesn't support dynamic serving of content (CloudFront does, but anything in AWS hits S3 directly) I'm presuming taskcluster needs to set the correct Content-Encoding (or Content-Type, but not both) when uploading any files whose filename ends with `.gz`?

For the case above, is this taskcluster-worker, generic-worker, docker-worker, ...?

Flags: needinfo?(pmoore)

Pete Moore [:pmoore][:pete]

Comment 1

•

7 years ago

(In reply to Ed Morley [:emorley] from comment #0)

> For the case above, is this taskcluster-worker, generic-worker,
> docker-worker, ...?

gecko-3-decision is docker-worker...

Luckily we have a docker-worker expert on the team ;)

Component: Worker → Docker-Worker

Flags: needinfo?(pmoore) → needinfo?(wcosta)

Dustin J. Mitchell [:dustin] (he/him)

Comment 2

•

7 years ago

It's just a binary blob which happens to be gzip encoded, so the encoding is correct.  There isn't anything the worker could do to figure this out other than guessing based on the file extension -- and that got IE into a lot of trouble back in the day.  Even if it did guess, it would be to set the content-type, not the content-encoding.

So basically this file is gzipped *by convention* not by anything involving HTTP.

Flags: needinfo?(wcosta)

Ed Morley [:emorley]

Reporter

Comment 3

•

7 years ago

Ok, but set the Content-Type instead then :-)

It's annoying to just not have this file usable out of the box.

Ed Morley [:emorley]

Reporter

Comment 4

•

7 years ago

This just seems like another case of Taskcluster idealism over practicality fwiw.

Ed Morley [:emorley]

Reporter

Comment 5

•

7 years ago

Anyway, sorry for being short, we just seem to be accumulating workarounds for issues when it comes to downloads of files from S3:
https://github.com/mozilla/treeherder/blob/f360621541b9cf95d8c1e03a8228eeedb6cc803c/treeherder/etl/common.py#L37-L39
https://github.com/mozilla/treeherder/blob/f360621541b9cf95d8c1e03a8228eeedb6cc803c/treeherder/etl/common.py#L59-L67

...it would be good if we could move the fixes for some of these upstream.

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

7 years ago

The new artifact API does, in fact, allow specification of a content-encoding.  The question is, how does that get communicated from the task (which is just writing files to disk) to the worker (which uploads it).  Basically we lose metadata on the filesystem.

I think the only option is to guess based on file extension and/or content sniffing.  And we know how well that turned out for IE.  I think if anything's going to do content sniffing, it should be the consumer of the artifact (where any errors in sniffing can be handled appropriately) rather than in the worker (where suddenly an artifact named /index/by-email/info@somedomain.gz fails to download correctly, yet all other similarly-named artifacts download correctly).

Probably the best solution is to invent some means of uploading artifacts for a task that does not involve writing them to disk.  But how to do that securely is an as-yet unsolved problem.  One you are welcome to work on a fix for :)

Ed Morley [:emorley]

Reporter

Comment 7

•

7 years ago

I agree that sniffing content type based on URL is in most cases a bad idea.

However for uploads from a filesystem, particularly for files created by the Taskcluster build process, the set of expected filenames and extensions is much smaller. Whilst someone could deliberately go out of their way to create a file named `info@somedomain.gz`, it's much less likely than in the URL based example of `/index/by-email/info@somedomain.gz`.

(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> Probably the best solution is to invent some means of uploading artifacts
> for a task that does not involve writing them to disk.

Agreed :-)

Pete Moore [:pmoore][:pete]

Comment 8

•

7 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> The new artifact API does, in fact, allow specification of a
> content-encoding.  The question is, how does that get communicated from the
> task (which is just writing files to disk) to the worker (which uploads it).
> Basically we lose metadata on the filesystem.

We don't need to lose this metadata - the task payload already declares artifact metadata including the location of the file storing the content, the artifact name, the artifact type, and artifact expiry. It could also comfortably declare a content type. This could be optional for backwards compatibility, whereby either the worker or the queue would choose/guess/sniff etc if not provided a type.

Pete Moore [:pmoore][:pete]

Comment 9

•

7 years ago

Note, the above solution solves the issue for file artifacts, but not for directory artifacts.

Some solution could also be possible for directory artifacts too, but then it would be more complicated. Alternatively, we could warn against using directory artifacts, if we decide we don't want to support custom content-type settings for directory artifacts.

Pete Moore [:pmoore][:pete]

Comment 10

•

6 years ago

Attached file Github Pull Request for generic-worker — Details

Note, this allows explicitly setting ContentType in generic worker jobs - this doesn't help docker-worker or taskcluster-worker at this point in time, so we should leave this bug open so patches can be made to them too.

Requested review in the PR, so I won't duplicate the review flag here.

[github robot]

Comment 11

•

6 years ago

Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/764a9bd321c2d49925e7e140adcdef1bb57a8d46
Bug 1423215 - Make contentType an optional property of file artifact in task payload

https://github.com/taskcluster/generic-worker/commit/e6b09b419cfd244655d4f50ee81e8efc290c595e
Merge pull request #70 from taskcluster/bug1423215

Bug 1423215 - Make contentType an optional property of file artifact in task payload

Pete Moore [:pmoore][:pete]

Comment 13

•

6 years ago

Generic Worker
==============
Released in generic-worker 10.4.0. Not globally rolled out to all our workers yet.

See "contentType" property in the artifacts section of:
  https://docs.taskcluster.net/reference/workers/generic-worker/payload

In this update, content encoding is still decided by the worker (and it does a pretty good job of deciding if it is worth the encoding cost or not) but content type can be explicitly specified, if the system default for the filename extension is not wanted.


Taskcluster Worker
==================
Still a TODO for taskcluster-worker.


Docker Worker
=============

I'm guessing we might not want to port this to docker-worker at this point, since we are planning to migrate to using taskcluster-worker instead. However, that might depend how long the migration takes. The task in comment 0 is a docker-worker task, so it certainly is a problem there.

Jonas Finnemann Jensen (:jonasfj)

Comment 14

•

6 years ago

Filed https://github.com/taskcluster/taskcluster-worker/issues/376 for the tc-worker TODO..


Otherwise, we close this as content-encoding should never be sniffed.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → INVALID

Dustin J. Mitchell [:dustin] (he/him)

Comment 15

•

6 years ago

To be clear, generic-worker doesn't sniff content-encoding -- it *performs* content-encoding and uses heuristics to decide when to do so.

The cause of much of the issue here is that we have uploaded an artifact that the task has pre-gzipped (what I was getting to in comment 2).  It would have been better to output that artifact as runnable-jobs.json and let the various pieces of infrastructure decide whether it's efficient to ship those bits around bracketed with gzip encoding/decoding (and without changing the filename).  That's what we have done with task-graph.json, actions.json, etc.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Updated

•

6 years ago

Status: RESOLVED → REOPENED

Resolution: INVALID → ---

Summary: Uploaded gzipped job artifacts (such as runnable-jobs.json.gz) have incorrect Content-Encoding/Type → Upload uncompressed runnable-jobs.json artifact instead of gzipped version.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Comment 16

•

6 years ago

Attached file Bug 1423215 - Upload an uncompressed runnable-jobs.json artifact in addition to the gz version r?dustin — Details

This should not be landed until the treeherder Part 1 PR from bug 1494750 is landed and deployed, otherwise Add New Jobs will stop working for jobs built on this patch, since the fallback to the gzip artifact will not be in place.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Updated

•

6 years ago

Blocks: 1494750

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Updated

•

6 years ago

No longer blocks: 1494750

Depends on: 1494750

Phabricator Automation

Updated

•

5 years ago

Attachment #9029099 - Attachment description: Bug 1423215 - Upload only an uncompressed runnable-jobs.json artifact r?dustin → Bug 1423215 - Upload an uncompressed runnable-jobs.json artifact in addition to the gz version r?dustin

Pulsebot

Comment 17

•

5 years ago

Pushed by wkocher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/88d304c633b6
Upload an uncompressed runnable-jobs.json artifact in addition to the gz version r=dustin

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Comment 18

•

5 years ago

I changed the patch to upload both the gzip and the uncompressed version of the artifact. This way, it can land independently of the treeherder patches to support the non-gzip version, and we can get to uplifting it right away. 

Once the patch uplifts to all of the repos we care about, and once we stop having to care about old try pushes, we can make a new patch that stops uploading the gzip artifact, and that can just ride the trains.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Updated

•

5 years ago

Assignee: nobody → wkocher

Cristian Brindusan [:cbrindusan]

Comment 19

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/88d304c633b6

Status: REOPENED → RESOLVED

Closed: 6 years ago → 5 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla65

Tom Prince [:tomprince]

Comment 20

•

5 years ago

https://hg.mozilla.org/releases/mozilla-release/rev/d44db55d8b564a8117d985e353b1470ebfbaa703
https://hg.mozilla.org/releases/mozilla-esr60/rev/53b49526917f873423d08d86712b000ff9f564d3

Chris Cooper [:coop] (he/him)

Comment 21

•

5 years ago

(In reply to Wes Kocher (:KWierso) from comment #18)
> This way, it can land independently of the treeherder patches
> to support the non-gzip version, and we can get to uplifting it right away. 

Is there a treeherder bug/issue to cover this part? While these artifacts aren't particularly large, I'd rather not double-upload indefinitely.

> Once the patch uplifts to all of the repos we care about, and once we stop
> having to care about old try pushes, we can make a new patch that stops
> uploading the gzip artifact, and that can just ride the trains.

Tom has (once again) taken care of the uplift piece. Thanks, Tom.

What's the horizon on caring about old Try pushes? Is it the same as our two month Try retention threshold?

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Comment 22

•

5 years ago

(In reply to Chris Cooper [:coop] pronoun: he from comment #21)
> Is there a treeherder bug/issue to cover this part? While these artifacts
> aren't particularly large, I'd rather not double-upload indefinitely.
Bug 1494750 covers the TH patches. Now that all trees upload the uncompressed artifacts, I was thinking I could write a patch to stop uploading the gzipped version and letting it ride the trains. (It could also be uplifted once we're sure nothing is depending on the gzip artifacts if you don't want double artifacts for the life of esr60) 


> What's the horizon on caring about old Try pushes? Is it the same as our two
> month Try retention threshold?
Yeah, once old pushes expire so that all non-expired pushes have the uncompressed artifacts. Or until someone makes an executive decision to stop supporting "add new jobs" for non-expired try pushes that don't have that artifact...

Ed Morley [:emorley]

Reporter

Updated

•

5 years ago

Blocks: 1494750

No longer depends on: 1494750

Ed Morley [:emorley]

Reporter

Updated

•

5 years ago

Blocks: 1513480

Ed Morley [:emorley]

Reporter

Comment 23

•

5 years ago

I've filed bug 1513480 for stopping the creation of the old file (to land only after the Treeherder parts have been deployed).

Nobody; OK to take it and work on it

Updated

•

5 years ago

Component: Docker-Worker → Workers

Github Pull Request for generic-worker 6 years ago Pete Moore [:pmoore][:pete] 53 bytes, text/x-github-pull-request		Details \| Review
Bug 1423215 - Upload an uncompressed runnable-jobs.json artifact in addition to the gz version r?dustin 6 years ago Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed) 47 bytes, text/x-phabricator-request		Details \| Review