Closed Bug 1190016 Opened 9 years ago Closed 7 years ago

Use SHA256 to verify uploads

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: jhford)

References

Details

Attachments

(1 file)

We should force workers to specify content-md5 and content-sha256, then ensure that content-md5 is specified when uploading to S3, to guard against data corruption.

Additionally, we should add a custom header:
  x-amz-meta-content-sha256: <base64-encoded SHA-256>
that downloaders can use to verify the integrity of the data.

Note: S3 does support ETag, which is the MD5 hash of the object, unless the object is uploaded with multipart upload. So setting the checksum ourselves seems to make sense.

Additionally, we can store the SHA-256 in Azure Table Storage too, so security-sensitive artifacts can be validated with the hash from two sources (it's another factor that can optionally be used to protect us).

---------------------------
# Implementation Plan:

1) Optional support:
   Let contentMD5 and contentSHA256 be optional properties for queue.createArtifact.
   Uploaders only specify them to S3 if they were given to the queue.

2) Required support:
   Require contentMD5 and contentSHA256 to be specified for queue.createArtifact, leaving uploaders with no option to not specify them.
   Step (1) gives us time to move all artifact uploaders over before we break support.
   Additionally, we'll return a cacheControl property from queue.createArtifact that should be specified by uploaders.
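For illustration, a minimal Node.js sketch of the worker-side checksum step described in the plan. The contentMD5/contentSHA256 payload properties are the fields proposed in this bug, not an existing queue API, and the file path and artifact name are placeholders.

  // Minimal sketch: compute the checksums the plan asks for.
  const crypto = require('crypto');
  const fs = require('fs');

  function checksums(path) {
    const data = fs.readFileSync(path); // fine for a sketch; stream for large files
    return {
      // S3's Content-MD5 header expects a base64-encoded MD5 digest
      contentMD5: crypto.createHash('md5').update(data).digest('base64'),
      // value proposed for x-amz-meta-content-sha256
      contentSHA256: crypto.createHash('sha256').update(data).digest('base64'),
    };
  }

  // Hypothetical usage against the *proposed* createArtifact payload:
  // const {contentMD5, contentSHA256} = checksums('artifact.tar.gz');
  // await queue.createArtifact(taskId, runId, 'public/artifact.tar.gz', {
  //   storageType: 's3',
  //   contentType: 'application/x-tar',
  //   expires: taskcluster.fromNow('1 year'),
  //   contentMD5, contentSHA256,
  // });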
Attached file Github PR Stage 1
So I'm tempted to change the return value from queue.createArtifact into:
  storageType: 's3'
  putUrl: '...'
  headers: {
    'content-type': ...
    'content-md5': ...
    'cache-control': ...
  }
Tell me if that's crazy or not?

The idea is that if we ever need to, we can always specify additional headers. In fact we could specify "authorization" as a header and not rely on signed URLs. I don't want to go there yet, because it's complicated... but we would have the option of doing so, as the docs would say to send the headers listed in result.headers regardless of what they are.
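For context, a sketch (assuming Node 18+ with built-in fetch) of the worker side under this shape: the worker simply forwards result.headers to the PUT, so the queue can add or change required headers later without worker changes. `result` here is the hypothetical createArtifact response sketched above.

  // Sketch: apply whatever headers the queue signed, without interpreting them.
  const fs = require('fs');

  async function uploadArtifact(result, filePath) {
    const body = fs.readFileSync(filePath); // buffer the file for simplicity
    const res = await fetch(result.putUrl, {
      method: 'PUT',
      headers: result.headers, // e.g. content-type, content-md5, cache-control
      body,
    });
    if (!res.ok) {
      throw new Error('artifact upload failed: ' + res.status);
    }
  }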
Attachment #8641970 - Flags: review?(garndt)
Attachment #8641970 - Flags: feedback?(pmoore)
Actually, I suspect providing the "authorization" header would allow us to control content-length too, and that it could be done with https://github.com/mhart/aws4. It might even allow us to force people to specify cache-control, content-encoding: gzip, etc.
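A rough sketch of that idea, assuming the queue signed the request itself with aws4 rather than handing out a pre-signed URL. The bucket, region, credentials and object key are placeholders, and exactly which headers end up covered by the signature (content-length in particular) would need to be verified against aws4 and S3 rather than taken from this sketch.

  // Sketch: queue-side signing with aws4 (https://github.com/mhart/aws4).
  const aws4 = require('aws4');

  const signed = aws4.sign({
    host: 'artifact-bucket.s3.amazonaws.com',        // placeholder bucket
    path: '/some-task-id/0/public/artifact.tar.gz',  // placeholder object key
    service: 's3',
    region: 'us-west-2',                             // placeholder region
    method: 'PUT',
    headers: {
      'content-type': 'application/x-tar',
      'content-length': '1234567',                   // declared by the worker
      'cache-control': 'public, max-age=3600',
      'content-encoding': 'gzip',
      // assuming aws4 honours a pre-set payload hash instead of hashing a body:
      'X-Amz-Content-Sha256': '<hex sha256 declared by the worker>',
    },
  }, {accessKeyId: '...', secretAccessKey: '...'});

  // The queue would return signed.headers (including Authorization) to the worker,
  // which must send them unchanged for S3 to accept the upload.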
Comment on attachment 8641970 [details] [review] Github PR Stage 1 Please see questions/comments in the PR.
Attachment #8641970 - Flags: feedback?(pmoore) → feedback+
Comment on attachment 8641970 [details] [review] Github PR Stage 1 comments in PR
Attachment #8641970 - Flags: review?(garndt) → review+
@pmoore, @garndt,

A) Is this a pressing issue? Do we have issues with corrupted artifacts? (I think I was wrong about that)

B) Should we do this along with implementing the publish step, so we don't have to refactor workers twice?
   I propose createArtifact as POST and publishArtifact as PATCH (same URL), other ideas are welcome too.

C) How would you feel about the response being on the following form?
     storageType: 's3'
     putUrl: '...'
     headers: {
       'content-type': ...
       'content-md5': ...
       'cache-control': ...
       'content-length': ...
     }
   This would give us the ability to put 'content-length' under the signature too.
   So content-length must be supplied to the queue, and it can't be faked; the main benefit being that the queue can require scopes for creating big artifacts.
   ie. scopes:
     queue:artifact-size:10G
     queue:artifact-size:5G
     queue:artifact-size:2G
     queue:artifact-size:1G   // Allows you to create artifacts less than 1GB
     queue:artifact-size:500M
     queue:artifact-size:200M
     queue:artifact-size:100M
     queue:artifact-size:50M
     queue:artifact-size:10M
     queue:artifact-size:1M
   So if you upload an artifact of 1.2G we allow it if you have one of:
     queue:artifact-size:10G
     queue:artifact-size:5G
     queue:artifact-size:2G
   (a sketch of this check follows below)

---
If you agree with (A) that this isn't a pressing issue, then take your time and think about this a bit.
IMO, we don't want to refactor artifact upload too many times.

---
@garndt, docker-worker can still ignore missing artifacts and not fail the task if it wants to. It's just that docker-worker will have to not call createArtifact until it is sure there is a file it can upload. Note, this would only be temporary; long term, missing artifacts => task failed, we have to get the architecture right :)
Flags: needinfo?(pmoore)
Flags: needinfo?(garndt)
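For illustration, a sketch of how such a size-based scope check could work in the queue. The queue:artifact-size:* scopes are the ones proposed above, and the helper and bucket table are hypothetical.

  // Sketch: map proposed queue:artifact-size:<bucket> scopes to byte limits and
  // require any bucket that is >= the declared content-length.
  const SIZE_BUCKETS = {
    '1M': 1e6, '10M': 1e7, '50M': 5e7, '100M': 1e8, '200M': 2e8, '500M': 5e8,
    '1G': 1e9, '2G': 2e9, '5G': 5e9, '10G': 1e10,
  };

  // Returns the scope sets (satisfying any one is enough) for a given artifact size.
  function requiredScopes(contentLength) {
    return Object.entries(SIZE_BUCKETS)
      .filter(([, bytes]) => contentLength <= bytes)
      .map(([bucket]) => ['queue:artifact-size:' + bucket]);
  }

  // e.g. requiredScopes(1.2e9) =>
  //   [['queue:artifact-size:2G'], ['queue:artifact-size:5G'], ['queue:artifact-size:10G']]
  // The queue would then check that the caller satisfies at least one of these sets.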
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #5)
> @pmoore, @garndt,
> A) Is this a pressing issue? Do we have issues with corrupted artifacts? (I think I was wrong about that)

I've not seen any issues, although I haven't used it extensively.

> B) Should we do this along with implementing the publish step, so we don't have to refactor workers twice?

YES

> I propose createArtifact as POST and publishArtifact as PATCH (same URL), other ideas are welcome too.

What about createArtifact as PUT? Also maybe we should call it defineArtifact, to be similar to defineTask (i.e. after the PUT has completed, the resource is not yet available).

PATCH seems the correct choice for publishArtifact. Until this has been called, it seems reasonable for the queue to return an HTTP 204 to getArtifact (and also not to include it in responses to listArtifacts, listLatestArtifacts, getLatestArtifact).

I wonder if the Queue is the right component for managing the publishing of artifacts, though. For simplicity it is nice that the queue offers a single API endpoint for workers, but then really it is more than just a queue. Maybe it is a naming matter, rather than needing to change the architecture. "Task Administrator" or something similar might be more reflective of its function.

> C) How would you feel about the response being on the following form?
>   storageType: 's3'
>   putUrl: '...'
>   headers: {
>     'content-type': ...
>     'content-md5': ...
>     'cache-control': ...
>     'content-length': ...
>   }
> This would give us the ability to put 'content-length' under the signature too.

That's brilliant. YES. Also enforcing cache-control headers in relation to artifact expiry is ingenious. Give the queue the power! ++

> So content-length must be supplied to the queue, and it can't be faked; the main benefit being that the queue can require scopes for creating big artifacts.
> ie. scopes:
>   queue:artifact-size:10G
>   queue:artifact-size:5G
>   queue:artifact-size:2G
>   queue:artifact-size:1G   // Allows you to create artifacts less than 1GB
>   queue:artifact-size:500M
>   queue:artifact-size:200M
>   queue:artifact-size:100M
>   queue:artifact-size:50M
>   queue:artifact-size:10M
>   queue:artifact-size:1M
>
> So if you upload an artifact of 1.2G we allow it if you have one of:
>   queue:artifact-size:10G
>   queue:artifact-size:5G
>   queue:artifact-size:2G
>
> ---
> If you agree with (A) that this isn't a pressing issue, then take your time and think about this a bit.
> IMO, we don't want to refactor artifact upload too many times.

+1

> ---
> @garndt, docker-worker can still ignore missing artifacts and not fail the task if it wants to.
> It's just that docker-worker will have to not call createArtifact until it is sure there is a file it can upload. Note, this would only be temporary; long term, missing artifacts => task failed, we have to get the architecture right :)
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> (In reply to Jonas Finnemann Jensen (:jonasfj) from comment #5)
> > C) How would you feel about the response being on the following form?
> >   storageType: 's3'
> >   putUrl: '...'
> >   headers: {
> >     'content-type': ...
> >     'content-md5': ...
> >     'cache-control': ...
> >     'content-length': ...
> >   }
> > This would give us the ability to put 'content-length' under the signature too.
>
> That's brilliant. YES. Also enforcing cache-control headers in relation to artifact expiry is ingenious. Give the queue the power! ++

I should add, once we start storing this metadata with the artifacts, we should have endpoints for returning it. We already have:

http://schemas.taskcluster.net/queue/v1/list-artifacts-response.json#

for returning an array of artifact metadata, but we probably also want to be able to return data for a single artifact. We could have something like getArtifactMetadata and getLatestArtifactMetadata, which return the artifact metadata, also for artifacts that have not yet been published. "state" could be part of the metadata, with values "defined", "published", "expired", "deleted".

You could even have comments attached to artifacts, and bookkeeping about them (e.g. releases could be artifacts that go through an approval chain, and the artifact metadata could track this). I think when releases go through Taskcluster this will be more relevant and important.
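For illustration, a hypothetical shape for such a per-artifact metadata response; none of these fields is an existing queue API, they just combine the proposals in this thread.

  // Hypothetical getArtifactMetadata response:
  const exampleMetadata = {
    name: 'public/artifact.tar.gz',
    storageType: 's3',
    contentType: 'application/x-tar',
    contentLength: 1234567,
    contentSHA256: '<sha256 of the artifact>',
    state: 'defined',            // 'defined' | 'published' | 'expired' | 'deleted'
    expires: '2016-08-01T00:00:00.000Z',
  };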
Maybe we should have a new component for artifact handling: "Librarian", "Banker", "Secretary", "Book Keeper", "Store Keeper", "Distributor", "Mum", "House Keeper". Or maybe the "Library", "Bank", "Store". The idea is that you basically hand stuff over to this component for safe keeping, and later on you can interact with it, and others can too. The Queue seems more about handing out tasks to workers.

Maybe it's better to keep fewer components, I can't decide. But then at some point we should rename the Queue to reflect its real purpose.
> A) Is this a pressing issue? Do we have issues with corrupted artifacts? (I think I was wrong about that)

I have not investigated this, so I'm not entirely sure. I am going to be doing some debugging in tc-vcs soon to see why it can't extract a tarball that it downloaded. Maybe it's corrupted, maybe not.

> B) Should we do this along with implementing the publish step, so we don't have to refactor workers twice?
> I propose createArtifact as POST and publishArtifact as PATCH (same URL), other ideas are welcome too.

I would definitely be in favor of not having to work on this twice in the worker; it leaves less room for error when refactoring.

I think POST (since it asks the queue to create something) feels the most natural. PUT to me seems like storing something, which we're not doing in this request.

> C) How would you feel about the response being on the following form?
>   storageType: 's3'
>   putUrl: '...'
>   headers: {
>     'content-type': ...
>     'content-md5': ...
>     'cache-control': ...
>     'content-length': ...
>   }
> This would give us the ability to put 'content-length' under the signature too.
> So content-length must be supplied to the queue, and it can't be faked; the main benefit being that the queue can require scopes for creating big artifacts.
> ie. scopes:
>   queue:artifact-size:10G
>   queue:artifact-size:5G
>   queue:artifact-size:2G
>   queue:artifact-size:1G   // Allows you to create artifacts less than 1GB
>   queue:artifact-size:500M
>   queue:artifact-size:200M
>   queue:artifact-size:100M
>   queue:artifact-size:50M
>   queue:artifact-size:10M
>   queue:artifact-size:1M
>
> So if you upload an artifact of 1.2G we allow it if you have one of:
>   queue:artifact-size:10G
>   queue:artifact-size:5G
>   queue:artifact-size:2G
>

Yup, that would be awesome to return. I'm a little less excited about those scopes. Do we really have the need to restrict artifact upload size? I guess this is good to have in case we decide we need it later.

> ---
> If you agree with (A) that this isn't a pressing issue, then take your time and think about this a bit.
> IMO, we don't want to refactor artifact upload too many times.
> ---

Absolutely agree (re: not wanting to refactor too many times)
Flags: needinfo?(garndt)
(In reply to Greg Arndt [:garndt] from comment #9)
> > B) Should we do this along with implementing the publish step, so we don't have to refactor workers twice?
> > I propose createArtifact as POST and publishArtifact as PATCH (same URL), other ideas are welcome too.
> .....
> I think POST (since it asks the queue to create something) feels the most natural. PUT to me seems like storing something, which we're not doing in this request.

I think POST is intended for when you want to publish something but do not have a fixed URL for where it is going to reside (e.g. http://foo/create/artifact); when the URL you use is the resource location (i.e. the same URL as the associated GET request, /task/<taskId>/runs/<runId>/artifacts/<name>), then PUT is more appropriate. With PUT we also gain idempotency.

I would argue that if we consider the create to be defining the metadata, and the GET request would return the same metadata, then this is a regular resource publication which can conform to regular REST mechanics. The 30x redirect endpoint for getting the resource from S3 could be at a different URL - and we may want to consider dropping that entirely, if the client is internally going to handle the streaming. Getting the metadata should be enough for the client to manage the downloading without intervention from the Queue.
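To make the PUT/PATCH lifecycle discussed here concrete, a sketch using express-style routing; this is not the queue's real API declaration framework, the handler names are hypothetical stubs, and real artifact names contain slashes so actual routing would need a wildcard.

  const express = require('express');
  const app = express();

  // Hypothetical handler stubs, just to show the shape of the lifecycle:
  const defineArtifact  = (req, res) => res.sendStatus(501);
  const publishArtifact = (req, res) => res.sendStatus(501);
  const getArtifact     = (req, res) => res.sendStatus(501);

  // define metadata; idempotent, same URL as the eventual GET
  app.put('/task/:taskId/runs/:runId/artifacts/:name', defineArtifact);
  // mark the (already uploaded) artifact as published
  app.patch('/task/:taskId/runs/:runId/artifacts/:name', publishArtifact);
  // metadata and/or a 30x redirect to S3 once published
  app.get('/task/:taskId/runs/:runId/artifacts/:name', getArtifact);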
@jhford, I think you are actually working on this :)
Assignee: jopsen → jhford
So we'll be doing this, except that we'll use SHA-256 instead of MD5. We'll be setting x-amz-content-sha256 for all S3 API requests for artifacts, including uploads. Because of internal mechanics, the x-amz-content-sha256 header will not persist after creation of the object, so we'll instead be setting a metadata property, x-amz-meta-content-sha256, which will contain the complete file's SHA-256 checksum.
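A rough sketch of what that looks like with the AWS SDK for JavaScript (v2); the bucket, key and digest encoding are placeholders and this is not the queue's actual implementation.

  // Sketch: store the file's SHA-256 as object metadata at upload time and
  // verify it on download.
  const AWS = require('aws-sdk');
  const crypto = require('crypto');
  const fs = require('fs');

  const s3 = new AWS.S3({signatureVersion: 'v4'}); // v4 signing sends x-amz-content-sha256 on requests

  async function uploadAndVerify() {
    const body = fs.readFileSync('artifact.tar.gz');
    const sha256 = crypto.createHash('sha256').update(body).digest('hex');

    // The 'content-sha256' metadata key is stored by S3 as x-amz-meta-content-sha256.
    await s3.putObject({
      Bucket: 'artifact-bucket',
      Key: 'some-task-id/0/public/artifact.tar.gz',
      Body: body,
      Metadata: {'content-sha256': sha256},
    }).promise();

    // A downloader can recompute the hash and compare it against the metadata.
    const obj = await s3.getObject({
      Bucket: 'artifact-bucket',
      Key: 'some-task-id/0/public/artifact.tar.gz',
    }).promise();
    const actual = crypto.createHash('sha256').update(obj.Body).digest('hex');
    if (actual !== obj.Metadata['content-sha256']) {
      throw new Error('artifact failed sha256 verification');
    }
  }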
Summary: force S3 artifact upload to specify content-md5 → Use SHA256 to verify uploads
I believe this work is completed now. Please reopen if not.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Queue → Services