Closed Bug 1380978 Opened 7 years ago Closed 7 years ago

generic-worker: chain of trust artifacts should be indexed by artifact name, not artifact path

Categories

(Taskcluster :: Workers, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

In the past, generic-worker didn't distinguish between the artifact name (the name you give the artifact in calls to the queue, e.g. to download it) and the artifact path (the location of the artifact file relative to the task directory once the task has completed, which the worker uses to find the file so it can upload it to the queue).

At a certain point, support was added for specifying an artifact name different from the artifact path; the chain of trust feature had already been implemented by then, so it slipped through the net that chain of trust was indexing artifacts by artifact path rather than by artifact name in the chain of trust artifact it creates.

We should fix the chain of trust feature to use the artifact name, and add a few tests to make sure this works as expected.
In other words:

https://github.com/taskcluster/generic-worker/blob/b61f403c75ce181876797e55412053835342bf90/chain_of_trust.go#L101

-                       artifactHashes[a.CanonicalPath] = ArtifactHash{
+                       artifactHashes[a.Name] = ArtifactHash{
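
For illustration, here is a minimal standalone sketch of the idea behind the one-line fix (this is not the actual generic-worker source; the Artifact type, field names and helper are assumptions): the file is still read from its on-disk path, but the resulting hash is indexed by the queue-facing artifact name.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io/ioutil"
)

// ArtifactHash mirrors the shape of an entry in the chain of trust artifact.
type ArtifactHash struct {
	SHA256 string `json:"sha256"`
}

// Artifact is an illustrative stand-in for the worker's artifact type.
type Artifact struct {
	Name          string // name registered with the queue, e.g. "public/build/mk/target.installer.exe"
	CanonicalPath string // location on disk relative to the task directory, e.g. "public/build/target.installer.exe"
}

// artifactHashesByName reads each artifact from its on-disk path, but indexes
// the resulting hash by the artifact name, which is what consumers of the
// chain of trust artifact look up in the queue.
func artifactHashesByName(artifacts []Artifact) (map[string]ArtifactHash, error) {
	hashes := map[string]ArtifactHash{}
	for _, a := range artifacts {
		data, err := ioutil.ReadFile(a.CanonicalPath)
		if err != nil {
			return nil, err
		}
		sum := sha256.Sum256(data)
		hashes[a.Name] = ArtifactHash{SHA256: hex.EncodeToString(sum[:])}
	}
	return hashes, nil
}

func main() {
	hashes, err := artifactHashesByName([]Artifact{
		{Name: "public/build/mk/target.installer.exe", CanonicalPath: "public/build/target.installer.exe"},
	})
	fmt.Println(hashes, err)
}
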
Created a bugfix release for this: https://github.com/taskcluster/generic-worker/releases/tag/v10.1.2

Still need to roll it out.
Deploying to gecko-1-b-win2012-beta in https://tools.taskcluster.net/groups/eNGimqKsQ-qbPXxLtDwRIg/tasks/WyDmrrczTje1rrGO7Yak-w/runs/0 (via OpenCloudConfig)
See Also: → 1380976
Retrying task CKeH8b-cTgKISyQ6FoT5Eg from bug 1380976:

  * https://tools.taskcluster.net/groups/VM5Dp2Q3SZu3tn5Dch7d5w/tasks/VM5Dp2Q3SZu3tn5Dch7d5w/details

Not sure this will work, as I am trying on a level 3 worker type, whereas the original task ran on a level 1 worker type.

In addition, I'm running a regular try push using the win2012 beta worker type that I upgraded in comment 4:

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc4202a42083989ef820d750032377a35d917523

I'm only testing Windows 2012 builders for now; we can upgrade the other worker types separately, since only the win2012 builders use the chain of trust feature at the moment.
There was no signing key on gecko-1-b-win2012-beta so I've added one:

  * https://github.com/mozilla-releng/OpenCloudConfig/commit/eea29a8a081102e83952f849a834fec426bdae9a

Will need to wait for this to roll out before retriggering the individual taskcluster task and the try push from comment 5.

Rollout can be followed in:

  * https://tools.taskcluster.net/groups/R7Xq0TBFSk6g8QSv78Q8HQ/tasks/CkZUbiq9QHeJJZzigv0ihg/runs/0/logs/public%2Flogs%2Flive.log
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> Retrying task CKeH8b-cTgKISyQ6FoT5Eg from bug 1380976:
> 
>   * https://tools.taskcluster.net/groups/VM5Dp2Q3SZu3tn5Dch7d5w/tasks/VM5Dp2Q3SZu3tn5Dch7d5w/details
>

Running again, now as:

  * https://tools.taskcluster.net/groups/EJS7nJ8EQjWO6CpAIQkHDg/tasks/EJS7nJ8EQjWO6CpAIQkHDg/details

> Not sure this will work, as I am trying on a level 3 worker type, whereas
> the original task ran on a level 1 worker type.
> 
> In addition, I'm running a regular try push using the win2012 beta worker
> type that I upgraded in comment 4:
> 
>   * https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc4202a42083989ef820d750032377a35d917523

I've retriggered the decision task, now that the worker type has been updated.
It looks like rerunning the decision task went OK, but the individual task rerun (EJS7nJ8EQjWO6CpAIQkHDg) consistently produced a worker exception, including on subsequent automatic retries.

In the task logs (from run 0) we see:

[taskcluster 2017-07-16T10:44:20.264Z] Uploading file public/chainOfTrust.json.asc as artifact public/chainOfTrust.json.asc
[taskcluster 2017-07-16T10:44:20.710Z] Uploading file public/logs/live.log as artifact public/logs/live.log
[taskcluster 2017-07-16T10:44:21.171Z] goroutine 1 [running]:
[taskcluster 2017-07-16T10:44:21.171Z] runtime/debug.Stack(0xc0427a4a78, 0x921100, 0xc0424b09c0)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/go/src/runtime/debug/stack.go:24 +0x80
[taskcluster 2017-07-16T10:44:21.171Z] main.(*TaskRun).Run.func3(0xc042582008, 0xc04222b380, 0xc042582010)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:994 +0x261
[taskcluster 2017-07-16T10:44:21.171Z] panic(0x921100, 0xc0424b09c0)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/go/src/runtime/panic.go:458 +0x251
[taskcluster 2017-07-16T10:44:21.171Z] main.(*TaskRun).uploadArtifact(0xc04222b380, 0xc95e60, 0xc042331620, 0x8)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/artifacts.go:430 +0xa35
[taskcluster 2017-07-16T10:44:21.171Z] main.(*TaskRun).Run.func5(0xc04222b380, 0xc042582008)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1060 +0xce
[taskcluster 2017-07-16T10:44:21.171Z] main.(*TaskRun).Run(0xc04222b380, 0x0)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1103 +0x12e1
[taskcluster 2017-07-16T10:44:21.171Z] main.FindAndRunTask(0x12a05f200)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:673 +0x502
[taskcluster 2017-07-16T10:44:21.171Z] main.RunWorker(0x0)
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:572 +0x5ad
[taskcluster 2017-07-16T10:44:21.171Z] main.main()
[taskcluster 2017-07-16T10:44:21.171Z] 	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:335 +0x7fe
[taskcluster 2017-07-16T10:44:21.171Z] 
[taskcluster 2017-07-16T10:44:21.171Z] &errors.errorString{s:"WORKER EXCEPTION due to response code 400 from Queue when uploading artifact &main.S3Artifact{BaseArtifact:(*main.BaseArtifact)(0xc042118300), MimeType:\"application/x-msdownload\", ContentEncoding:\"\"}"}


At this time the worker logs show the same exception, but something strange is going on: public/logs/live_backing.log is uploaded successfully (we get a 200 response code), although we didn't see that in the task logs. Note also that at https://tools.taskcluster.net/groups/EJS7nJ8EQjWO6CpAIQkHDg/tasks/EJS7nJ8EQjWO6CpAIQkHDg/runs/0/artifacts both public/logs/live_backing.log and public/logs/live.log exist and return content. The worker logs:

2017/07/16 10:44:20 1 put requests issued to https://taskcluster-public-artifacts.s3-us-west-2.amazonaws.com/EJS7nJ8EQjWO6CpAIQkHDg/0/public/chainOfTrust.json.asc?AWSAccessKeyId=AKIAJQESBGXODWDRTZUA&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1500203671&Signature=xEEJn2mxFicvgLT1c2hkWxqxic8%3D 
2017/07/16 10:44:20 Response 
2017/07/16 10:44:20 HTTP/1.1 200 OK
2017/07/16 10:44:21 1 put requests issued to https://taskcluster-public-artifacts.s3-us-west-2.amazonaws.com/EJS7nJ8EQjWO6CpAIQkHDg/0/public/logs/live_backing.log?AWSAccessKeyId=AKIAJQESBGXODWDRTZUA&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1500203672&Signature=rjAvk6MeMo5OwOyLVRJJHw0cbc4%3D 
2017/07/16 10:44:21 Response 
2017/07/16 10:44:21 HTTP/1.1 200 OK
2017/07/16 10:44:21 goroutine 1 [running]: 
runtime/debug.Stack(0x0, 0xc0427a4758, 0x40c237) 
	/home/travis/go/src/runtime/debug/stack.go:24 +0x80 
main.HandleCrash(0x921100, 0xc0424b09c0) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:499 +0x3b 
main.RunWorker.func1(0xc0427a5d10) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:508 +0x59 
panic(0x921100, 0xc0424b09c0) 
	/home/travis/go/src/runtime/panic.go:458 +0x251 
main.(*TaskRun).Run.func1(0xc042582008, 0xc04222b380) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:968 +0x202 
panic(0x921100, 0xc0424b09c0) 
	/home/travis/go/src/runtime/panic.go:458 +0x251 
main.(*TaskRun).Run.func3(0xc042582008, 0xc04222b380, 0xc042582010) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1001 +0x138 
panic(0x921100, 0xc0424b09c0) 
	/home/travis/go/src/runtime/panic.go:458 +0x251 
main.(*TaskRun).uploadArtifact(0xc04222b380, 0xc95e60, 0xc042331620, 0x8) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/artifacts.go:430 +0xa35 
main.(*TaskRun).Run.func5(0xc04222b380, 0xc042582008) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1060 +0xce 
main.(*TaskRun).Run(0xc04222b380, 0x0) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1103 +0x12e1 
main.FindAndRunTask(0x12a05f200) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:673 +0x502 
main.RunWorker(0x0) 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:572 +0x5ad 
main.main() 
	/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:335 +0x7fe 
2017/07/16 10:44:21  *********** PANIC occurred! ***********  
2017/07/16 10:44:21 WORKER EXCEPTION due to response code 400 from Queue when uploading artifact &main.S3Artifact{BaseArtifact:(*main.BaseArtifact)(0xc042118300), MimeType:"application/x-msdownload", ContentEncoding:""} 



At this time in the queue service logs we see (with a bit of reformatting, and some base64 and slug decoding):

Jul 16 12:44:22 taskcluster-queue app/web.19: Sun, 16 Jul 2017 10:44:21 GMT base:entity Failed to insert entity err:
---
name: EntityAlreadyExistsError
code: EntityAlreadyExists
statusCode: 409
payload:
  odata.error:
    code: EntityAlreadyExists
    message:
      lang: en-US
      value: |-
        The specified entity already exists.
        RequestId:9b0bd203-0002-00ea-7a20-fe45c2000000
        Time:2017-07-16T10:44:22.3356586Z
method: POST
path: "/QueueArtifacts"
query:
  timeout: 7
headers:
  prefer: return-no-content
  x-ms-date: Sun, 16 Jul 2017 10:44:21 GMT
  x-ms-version: '2014-02-14'
  dataserviceversion: '3.0'
  x-ms-client-request-id: fast-azure-storage
  content-length: 592
  content-type: application/json
  accept: application/json;odata=fullmetadata
requestBody:
  PartitionKey: EJS7nJ8EQjWO6CpAIQkHDg~0
  RowKey: public!2Flogs!2Flive.log
  Version: 1
  taskId@odata.type: Edm.Guid
  taskId: EJS7nJ8EQjWO6CpAIQkHDg
  runId: 0
  name: public/logs/live.log
  storageType: reference
  contentType: text/plain; charset=utf8
  __buf0_details@odata.type: Edm.Binary
  __buf0_details:
    url: https://queue.taskcluster.net/v1/task/EJS7nJ8EQjWO6CpAIQkHDg/runs/0/artifacts/public/logs/live_backing.log
  __bufchunks_details: 1
  expires@odata.type: Edm.DateTime
  expires: '2017-09-16T09:37:24.290Z'
retries: 0
I suspect what is going on here is that the redirect of public/logs/live.log to public/logs/live_backing.log succeeds, but then, for some reason, there is an attempt to upload public/logs/live.log as an S3 artifact, which fails because the redirect artifact already exists. This would explain:

1) why the Azure error the queue reports includes the URL of the backing log, since that might just be showing the current value, rather than the value the request attempted to set
2) why the HTTP 400 status code is reported for &main.S3Artifact{...} rather than &main.RedirectArtifact{...}
3) why "Uploading file public/logs/live.log as artifact public/logs/live.log" is the last entry in the task log before the crash

What I haven't yet investigated is:

1) This seems unrelated to my change, so what changed in order for this to start happening now on gecko-1-b-win2012-beta?
2) Why did we not see these exceptions in the retriggered decision task (this seems unrelated to chain of trust, and more like a general problem), given it also ran on gecko-1-b-win2012-beta?

Looking into this now....
Having looked more deeply into the code, I see that the artifacts not listed in the payload (e.g. log files, chain of trust) are uploaded in a defer block, so I suspect the error comes from:

Uploading file public/build/target.installer.exe as artifact public/build/mk/target.installer.exe

This would also explain why we see MimeType:"application/x-msdownload" in the error message!
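
To illustrate the reasoning, here is a toy sketch (made-up function names, not the worker's actual code): deferred uploads still run while a panic unwinds, so the last "Uploading file ..." lines in the task log can come from the deferred log/chain-of-trust uploads rather than from the payload artifact whose upload actually failed.

package main

import "fmt"

// upload stands in for the worker's artifact upload; it logs a line and,
// if asked to, panics the way a 400 from the queue would.
func upload(name string, fail bool) {
	fmt.Println("Uploading file", name)
	if fail {
		panic(fmt.Sprintf("response code 400 from Queue when uploading %s", name))
	}
}

func runTask() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("WORKER EXCEPTION:", r)
		}
	}()
	// The log and chain of trust uploads are deferred, so they run during
	// panic unwinding and their log lines appear *after* the line for the
	// payload artifact that actually failed.
	defer upload("public/logs/live.log", false)
	defer upload("public/chainOfTrust.json.asc", false)

	// Payload artifact upload: this is the one that panics.
	upload("public/build/mk/target.installer.exe", true)
}

func main() { runTask() }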

I'm not sure why we see ContentEncoding:"", as I would expect the content encoding to be gzip.

I'm also not sure why we get EntityAlreadyExists, unless the mozharness task explicitly publishes this artifact, but I'm not sure how it would get credentials to do that.

So the plot thickens. Will keep on investigating.
OK, this is probably due to bug 1380976: the payload artifacts include the same artifacts more than once!

Although I'm not sure why we don't see the "Uploading file ..." line in the task logs twice.

To (quickly) test this hypothesis, I'm rerunning with the duplicate artifact entries removed from the payload:

https://tools.taskcluster.net/groups/UwqROpOcQ-Gmnrr8PnAx3g/tasks/UwqROpOcQ-Gmnrr8PnAx3g/runs/0

Let's see how that goes.
So it looks like the "EntityAlreadyExistsError" we saw in the queue happens all the time - when I searched for it in the logs, I found new error messages like this appearing constantly - so I think this was a red herring! Gahhh!!
So the 400 error probably isn't because an artifact already existed, since the queue error was a red herring. Instead, for some reason, this upload caused a 400 error from the queue:

Uploading file public/build/target.installer.exe as artifact public/build/mk/target.installer.exe

We can also see that this artifact didn't get uploaded, so this really must be the cause:

https://tools.taskcluster.net/groups/EJS7nJ8EQjWO6CpAIQkHDg/tasks/EJS7nJ8EQjWO6CpAIQkHDg/runs/0/artifacts

I suspect the fact that ContentEncoding didn't get set to "gzip" is related. Not that an encoding is required, but the fact that it didn't get set suggests something went wrong.
Earlier in the queue logs, I now see the 400 status code referenced:

Jul 16 12:44:19 taskcluster-queue heroku/router: at=info method=POST path="/v1/task/EJS7nJ8EQjWO6CpAIQkHDg/runs/0/artifacts/public%2Fbuild%2Fmk%2Ftarget.installer.exe" host=queue.taskcluster.net request_id=d9ae38d2-2f0b-4dd9-842d-2d7f7a3df164 fwd="54.144.178.186" dyno=web.15 connect=0ms service=157ms status=400 bytes=1231 protocol=https

...

Jul 16 12:44:19 taskcluster-queue app/web.15: Authorized task-client/EJS7nJ8EQjWO6CpAIQkHDg/0/on/us-east-1/i-09e14b5f47bf34bca/until/1500202617.389 for POST access to /v1/task/EJS7nJ8EQjWO6CpAIQkHDg/runs/0/artifacts/public%2Fbuild%2Fmk%2Ftarget.installer.exe
In order to get better data about the failure, I've improved the logging so that, when an upload fails, the payload submitted to the queue is shown, in the hope that this provides key insight into the cause. I've created release 10.1.3 for this, and am currently deploying it to gecko-1-b-win2012-beta via OpenCloudConfig (which could take 90 minutes or so). After that, I'll retrigger the task to see what the new information tells us. In short, we'll now get the full payload of the queue createArtifact API call[1].

--

[1] https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#createArtifact
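
For reference, a minimal sketch of the kind of extra logging meant here (the createArtifact function below is a placeholder, not the real taskcluster-client-go signature): on failure, dump the JSON payload that was submitted to the queue.

package main

import (
	"encoding/json"
	"errors"
	"log"
)

// createArtifact is a placeholder standing in for the real queue API call;
// here it simply fails the way the queue did.
func createArtifact(taskID string, runID int, name string, payload interface{}) error {
	return errors.New("(400) Bad Request from queue")
}

// uploadWithDebugLogging calls createArtifact and, on failure, logs the exact
// payload that was submitted, so a 400 can be diagnosed from the worker log.
func uploadWithDebugLogging(taskID string, runID int, name string, payload interface{}) error {
	err := createArtifact(taskID, runID, name, payload)
	if err != nil {
		b, _ := json.MarshalIndent(payload, "", "  ")
		log.Printf("createArtifact for %q failed: %v\npayload submitted:\n%s", name, err, b)
	}
	return err
}

func main() {
	_ = uploadWithDebugLogging("EJS7nJ8EQjWO6CpAIQkHDg", 0,
		"public/build/mk/target.installer.exe",
		map[string]string{"storageType": "s3", "contentType": "application/x-msdownload"})
}
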
Nooooooooo!!!!

I see the problem already - the artifact expiry was in the past:

https://tools.taskcluster.net/groups/EJS7nJ8EQjWO6CpAIQkHDg/tasks/EJS7nJ8EQjWO6CpAIQkHDg/details

I was pretty sure there was a check that artifact expiry was not before deadline - so I'll investigate why that check didn't occur or did not execute correctly!!
(In reply to Pete Moore [:pmoore][:pete] from comment #17)
> Nooooooooo!!!!
> 
> I see the problem already - the artifact expiry was in the past:
> 
> https://tools.taskcluster.net/groups/EJS7nJ8EQjWO6CpAIQkHDg/tasks/
> EJS7nJ8EQjWO6CpAIQkHDg/details
> 
> I was pretty sure there was a check that artifact expiry was not before
> deadline - so I'll investigate why that check didn't occur or did not
> execute correctly!!

I hadn't spotted that the artifacts expire in 2018 - so actually this was not the cause.

However, the artifact expiry is after the task expiry - I suspect this is the issue! I don't believe the worker checks for this condition, but I suspect it is the cause of the 400 HTTP response status code from the queue.

So I probably need to add a check for this when the task is validated, and resolve with malformed-payload if any artifact expiry is after the task expiry. At the same time, I'll add a test for it, after validating that this really is the cause of the 400 status code.
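
A minimal sketch of the check described above (type and function names are assumptions, not the actual generic-worker code): during task validation, reject any payload where an artifact expiry falls after the task expiry, so the task resolves with malformed-payload instead of a worker exception.

package main

import (
	"fmt"
	"time"
)

// PayloadArtifact is an illustrative stand-in for an artifact entry in the task payload.
type PayloadArtifact struct {
	Name    string
	Expires time.Time
}

// validateArtifactExpiries returns a malformed-payload style error if any
// artifact expires after the task itself expires.
func validateArtifactExpiries(taskExpires time.Time, artifacts []PayloadArtifact) error {
	for _, a := range artifacts {
		if a.Expires.After(taskExpires) {
			return fmt.Errorf("malformed-payload: artifact %q expires at %s, which is after the task expiry %s",
				a.Name, a.Expires.Format(time.RFC3339), taskExpires.Format(time.RFC3339))
		}
	}
	return nil
}

func main() {
	taskExpires := time.Date(2017, 9, 16, 9, 37, 24, 0, time.UTC)
	err := validateArtifactExpiries(taskExpires, []PayloadArtifact{
		{Name: "public/build/target.zip", Expires: taskExpires.Add(24 * time.Hour)},
	})
	fmt.Println(err) // prints the malformed-payload error
}
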
Bingo!

https://github.com/taskcluster/taskcluster-queue/blob/850359f099eeb1b784a1a5c540084b2594e3571b/src/artifacts.js#L149-L158

Note, the useful error message which the queue returns is unfortunately swallowed by taskcluster-client-go (prior to https://github.com/taskcluster/taskcluster-client-go/commit/73f0ac7ae49547837f4e1311bdc075a3ad989e58 the HTTP requests/responses were returned by the API methods, but it was requested that this feature be removed).

I'm tempted to either add back *tcclient.CallSummary to the returned values of the generated API calls, or maybe just wrap the httpbackoff.BadHttpResponseCode error in tcclient.APICall(...) in a new error type which also contains the *tcclient.CallSummary.

In other words, the error returned from an APICall could then include the *tcclient.CallSummary, even though the call summary wouldn't otherwise be available when no error is raised.
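
A sketch of that second option (the names here are illustrative stand-ins, not the actual tcclient types): wrap the underlying HTTP error in a new error type that also carries the call summary, so callers can log the queue's full response body when an API call fails.

package sketch

// CallSummary is a stand-in for tcclient.CallSummary; it records the request
// and response of the most recent API call, including the response body that
// currently gets swallowed on error.
type CallSummary struct {
	HTTPRequestBody  string
	HTTPResponseBody string
}

// APICallError wraps the error returned from an API call together with the
// call summary, so callers can still read the queue's error message even
// though CallSummary is no longer part of the method return values.
type APICallError struct {
	Err         error
	CallSummary *CallSummary
}

func (e *APICallError) Error() string { return e.Err.Error() }
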
Commit pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/e4a56acf286dd972ed328789dbc1c6473e3b4486
Bug 1380978 - Resolve with malformed-payload exception if artifact expires after task expiry
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> There was no signing key on gecko-1-b-win2012-beta so I've added one:

I believe in docker-worker, if there is no signing key but chainOfTrust is enabled, we upload an unsigned chain of trust artifact.

I'm ok with either that approach or a dummy keypair for workerTypes outside of level3.
(In reply to Aki Sasaki [:aki] from comment #21)
> (In reply to Pete Moore [:pmoore][:pete] from comment #6)
> > There was no signing key on gecko-1-b-win2012-beta so I've added one:
> 
> I believe in docker-worker, if there is no signing key but chainOfTrust is
> enabled, we upload an unsigned chain of trust artifact.
> 
> I'm ok with either that approach or a dummy keypair for workerTypes outside
> of level3.

I think this is a great idea. I'll create a bug for it. Thanks!
(In reply to Pete Moore [:pmoore][:pete] from comment #22)
> (In reply to Aki Sasaki [:aki] from comment #21)
> > (In reply to Pete Moore [:pmoore][:pete] from comment #6)
> > > There was no signing key on gecko-1-b-win2012-beta so I've added one:
> > 
> > I believe in docker-worker, if there is no signing key but chainOfTrust is
> > enabled, we upload an unsigned chain of trust artifact.
> > 
> > I'm ok with either that approach or a dummy keypair for workerTypes outside
> > of level3.
> 
> I think this is a great idea. I'll create a bug for it. Thanks!

Created bug 1381785.
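
For reference, a rough sketch of the fallback suggested in comment 21 (this is an assumption about the proposed behaviour, not existing generic-worker or docker-worker code, and the unsigned artifact name is illustrative): if chain of trust is enabled but no signing key is configured, publish the chain of trust data unsigned rather than failing.

package main

import "fmt"

// Signer is an illustrative stand-in for whatever signs the chain of trust data.
type Signer interface {
	Sign(data []byte) []byte
}

// chainOfTrustArtifact returns the bytes to upload and the artifact name to
// use. If no signing key is available (signer == nil), the certified JSON is
// published unsigned rather than the task failing.
func chainOfTrustArtifact(certified []byte, signer Signer) ([]byte, string) {
	if signer == nil {
		return certified, "public/chainOfTrust.json"
	}
	return signer.Sign(certified), "public/chainOfTrust.json.asc"
}

func main() {
	data, name := chainOfTrustArtifact([]byte(`{"artifacts":{}}`), nil)
	fmt.Println(name, len(data))
}
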
(In reply to Pete Moore [:pmoore][:pete] from comment #7)
> (In reply to Pete Moore [:pmoore][:pete] from comment #5)
> Running again, now as:
>   * https://tools.taskcluster.net/groups/EJS7nJ8EQjWO6CpAIQkHDg/tasks/EJS7nJ8EQjWO6CpAIQkHDg/details

Running again, now on generic-worker 10.1.5, as:
  * https://tools.taskcluster.net/groups/LgaLdDSwQ9yjNowXLhbi2g/tasks/LgaLdDSwQ9yjNowXLhbi2g/details
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> In addition, I'm running a regular try push using the win2012 beta worker
> type that I upgraded in comment 4:
> 
>   * https://treeherder.mozilla.org/#/jobs?repo=try&revision=bc4202a42083989ef820d750032377a35d917523
> 
> I'm only testing windows 2012 builders for now - we can upgrade the other
> worker types separately - only the win2012 builders use the chain of trust
> feature at the moment.

To be safe, running a new try with generic-worker 10.1.5:

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=719b7fe18ef050cb603c88fc16935f35699a2cbf
(In reply to Pete Moore [:pmoore][:pete] from comment #25)
> To be safe, running a new try with generic-worker 10.1.5:
> 
>   * https://treeherder.mozilla.org/#/jobs?repo=try&revision=719b7fe18ef050cb603c88fc16935f35699a2cbf

This actually failed due to a new problem, which I've reported in bug 1381801. I've made a new try push with a fix for that here:

  * https://treeherder.mozilla.org/#/jobs?repo=try&revision=e9dc567b73e3423a6002c83d59099a95466eb51b
I've also made a worker-side fix for this (so we don't have to wait for it to ride all the trains, or land it on all branches); besides, it isn't a bad thing to have a worker-side fix as well.

That made it into generic-worker 10.1.6, which I've rolled out and am testing in this try push:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7bceb03cb6912502d50921406ca97b75ea5ab9b

Note: I somewhat foolishly based this try push on the mozilla-inbound head, which hadn't finished building, so if the builds don't pass, we should check whether the revision it was based on was OK:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=092390ed4c628d13875ddf3f9412e3de2b29a98c
Attachment #8887619 - Flags: review?(rthijssen)
This is now in production.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers