Closed Bug 1436712 Opened 2 years ago Closed 2 years ago

Linux and OS X nightlies failed with error: Chain of Trust verification error!

Categories

(Release Engineering :: Release Automation: Other, defect, blocker)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1420482

People

(Reporter: aryx, Assigned: aki)

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=0ac953fcddf10132eaecdb753d72b2ba5a43c32a&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable

Log: https://tools.taskcluster.net/groups/HMQt9pRvSXOfagdouqHNqg/tasks/Sz26vcTyQkqH-FmxZYtEnw/runs/0/logs/public%2Flogs%2Fchain_of_trust.log

signing:build:docker-image G-nqPXUNR2K--jN0UDp4Ng workerType differs!
 graph: "gecko-t-linux-xlarge"
 task: "gecko-3-images"
signing:build:docker-image G-nqPXUNR2K--jN0UDp4Ng routes differs!
 graph: [
  "tc-treeherder.v2.autoland.2ebcb42faee25c8eb2dd34d01c062bebcbc7b1ea.59266",
  "coalesce.v1.autoland.79de2402eaedaded334a"
]
 task: [
  "index.gecko.cache.level-3.docker-images.v1.debian7-amd64-build.hash.24b13257acc45f20f04f8ad54180c53623000735232b2d7a81f914aca9c0f8fc",
  "index.gecko.cache.level-3.docker-images.v1.debian7-amd64-build.latest",
  "index.gecko.cache.level-3.docker-images.v1.debian7-amd64-build.pushdate.2018.02.07.20180207214729",
  "tc-treeherder.v2.autoland.2ebcb42faee25c8eb2dd34d01c062bebcbc7b1ea.59266"
]
signing:build:docker-image G-nqPXUNR2K--jN0UDp4Ng scopes differs!
 graph: [
  "secrets:get:project/taskcluster/gecko/hgfingerprint",
  "docker-worker:feature:allowPtrace",
  "docker-worker:cache:level-3-autoland-test-workspace-v2-bc7e1a7ad01a345394f1",
  "docker-worker:cache:level-3-checkouts-v2-bc7e1a7ad01a345394f1"
]
 task: [
  "secrets:get:project/taskcluster/gecko/hgfingerprint",
  "docker-worker:cache:level-3-imagebuilder-sparse-bc7e1a7ad01a345394f1-v2-bc7e1a7ad01a345394f1"
]
signing:build:docker-image G-nqPXUNR2K--jN0UDp4Ng tags differs!
 graph: {
  "createdForUser": "mbanner@mozilla.com",
  "kind": "test",
  "label": "test-linux32-stylo-disabled/debug-web-platform-tests-e10s-12",
  "os": "linux",
  "worker-implementation": "docker-worker"
}
 task: {
  "createdForUser": "mbanner@mozilla.com",
  "kind": "docker-image",
  "label": "build-docker-image-debian7-amd64-build"
}
2018-02-08T11:48:27 CRITICAL - Can't find task signing:build:docker-image G-nqPXUNR2K--jN0UDp4Ng in signing:build:docker-image:parent TkH1qBmxTvaFAgZqDal4Pw task-graph.json!
2018-02-08T11:48:27 CRITICAL - Chain of Trust verification error!
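For readers unfamiliar with the "differs!" lines above, here is a rough sketch of the kind of comparison that could produce them: the definition recorded for a task ID in the parent's task-graph.json is compared key by key with the definition of the task that actually ran. This is an illustration only, not scriptworker's actual verification code; `diff_task_definitions` and the trimmed sample dicts are hypothetical, with values copied from the log.

```python
import json


def diff_task_definitions(graph_defn, task_defn, ignore_keys=()):
    """Hypothetical helper: list top-level keys whose values differ.

    graph_defn: the entry recorded in the parent's task-graph.json
    task_defn:  the definition of the task that actually ran
    """
    errors = []
    for key in sorted(set(graph_defn) | set(task_defn)):
        if key in ignore_keys:
            continue
        if graph_defn.get(key) != task_defn.get(key):
            errors.append(
                "{} differs!\n graph: {}\n task: {}".format(
                    key,
                    json.dumps(graph_defn.get(key), indent=2, sort_keys=True),
                    json.dumps(task_defn.get(key), indent=2, sort_keys=True),
                )
            )
    return errors


# Trimmed sample data taken from the log in this bug.
graph_defn = {
    "workerType": "gecko-t-linux-xlarge",
    "tags": {"kind": "test", "os": "linux"},
}
task_defn = {
    "workerType": "gecko-3-images",
    "tags": {"kind": "docker-image"},
}

errors = diff_task_definitions(graph_defn, task_defn)
for error in errors:
    print(error)
```

Any non-empty result would be treated as a verification failure. Here the graph entry describes a test task while the task that ran is a docker-image build, so every compared field mismatches.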
Component: General → Release Automation
Product: Taskcluster → Release Engineering
QA Contact: catlee
I looked through the taskcluster/ changes in the regression range and there's not much.
https://bugzilla.mozilla.org/show_bug.cgi?id=1436283 added a new package to taskcluster/ci/docker-image/kind.yml, and https://bugzilla.mozilla.org/show_bug.cgi?id=1398799 made some changes to a release-only docker image. I don't see how either could cause CoT errors, but I only have limited understanding of this.

cc'ing glandium for the first bug, and I'll keep looking into the second bug (which is mine).
The fact that Windows is not broken makes me wonder whether something on the workers has changed. I think both of these platforms build on similar Linux images, whereas Windows builds on Windows?
15:03 < aki-away> guessing https://hg.mozilla.org/mozilla-central/rev/7eb47d8f194d and the next commit triggered a rebuild
15:04 < aki-away> we probably need to back that out until we kill cotv1
07:10 <aki-away> sure. the short explanation is docker-image is one of the two highly sensitive docker-worker task types, along with the decision task, so we have extra checks around both. the above changes break the assumptions in those checks
If our current theory (autoland retriggers broke docker-image CoT verifiability) holds true, bug 1420482 would help us prevent old-style retriggers.
Depends on: 1420482
09:26 <aki> i *think* the retriggers fixed the docker-image bustage
09:27 <aki> `verify_cot --task-type build --cleanup -- Z1QpdX6IQym4fLWmubDSkA` breaks, for https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=161110628
09:27 <aki> `verify_cot --task-type build --cleanup Cv31u_kTQAaBsEXpVjm16w` works for https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=161121336

I essentially reran the gecko-3-images tasks in TkH1qBmxTvaFAgZqDal4Pw's label-to-taskid.json, which required a few passes to get right. Thanks Tom and Dustin!
Assignee: nobody → aki
(In reply to Aki Sasaki [:aki] from comment #6)
> If our current theory (autoland retriggers broke docker-image cot
> verifiability) holds true, bug 1420482 would help us prevent old-style
> retriggers.

I think comment #7 verifies that the retriggers broke cot.
10:15 <aki> crap, repackage-signing cot issues on birch after a merge. hoping it's a one-off
10:16 <bhearsum> looks like we have it on maple too, after i merged central
10:16 <bhearsum> mozilla-beta seems fine
10:18 <aki> yeah. cot assumes docker-image tasks have an allowlisted dockerhub image sha. i think something changed docker-image to be built from another docker-image
10:19 <aki> guessing we still need to back something out or make an emergency cot fix
New theory: the reruns in comment #7 mean we have new shas for the docker-images. These finished before the graphs started on birch and maple, but a delay (somewhere?) could have meant that the builds used the old sha and verification used the new sha, breaking verification.

Trying a new push to birch. If the new theory holds up, the next round of m-c nightlies should be green.
11:37 <aki> rerunning the repackage-signing task on birch fixed it

which suggests we could be in good shape... we need to verify with a full nightly graph.
is bug 1436960 related?
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1420482
(In reply to Chris AtLee [:catlee] from comment #12)
> is bug 1436960 related?

Yup. Also an old-style retrigger, caused by not fixing bug 1420482.
This bug was caused by Glandium old-style retriggering the docker-image tasks in https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=2ebcb42faee25c8eb2dd34d01c062bebcbc7b1ea&filter-searchStr=docker-image , which broke Chain of Trust verification across all trees with that revision merged in.

The fix was to rerun those tasks via the taskcluster `rerun` command (taskcluster cli or tc-filter.py).
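
For anyone who has to repeat this, a minimal sketch of the rerun pass, assuming the decision task's label-to-taskid.json has already been downloaded and parsed, and that `queue` behaves like the Taskcluster Python client's `Queue` (which provides `rerunTask(taskId)`). The helper name, the label-prefix filter, the `DryRunQueue` stand-in, and the placeholder test task ID are all hypothetical; the actual fix selected tasks by the gecko-3-images workerType and took a few passes.

```python
def rerun_docker_image_tasks(queue, label_to_taskid, prefix="build-docker-image-"):
    """Hypothetical helper: rerun every task whose label starts with `prefix`.

    `queue` is anything exposing rerunTask(task_id), e.g. an authenticated
    taskcluster.Queue instance. Returns the task IDs that were rerun.
    """
    rerun = []
    for label, task_id in sorted(label_to_taskid.items()):
        if label.startswith(prefix):
            queue.rerunTask(task_id)
            rerun.append(task_id)
    return rerun


class DryRunQueue:
    """Stand-in for a real Queue client; records calls instead of making them."""

    def __init__(self):
        self.calls = []

    def rerunTask(self, task_id):
        self.calls.append(task_id)


# Sample mapping: the docker-image entry comes from the log in this bug,
# the test entry uses a placeholder task ID.
label_to_taskid = {
    "build-docker-image-debian7-amd64-build": "G-nqPXUNR2K--jN0UDp4Ng",
    "test-linux32-stylo-disabled/debug-web-platform-tests-e10s-12": "PLACEHOLDER-TASK-ID",
}

queue = DryRunQueue()
rerun = rerun_docker_image_tasks(queue, label_to_taskid)
print(rerun)
```

Swapping `DryRunQueue` for a real, authenticated Queue client would issue the actual reruns. A rerun keeps the original task ID and definition, which is why it stays consistent with the existing task-graph.json.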

I'm duping against bug 1420482 because although the short-term fix was to overwrite the index with properly run (read: not old-style retriggered) docker-image tasks, the deeper issue is that the treeherder retrigger defaults to old-style retriggers.
Flags: needinfo?(aki)
Also, the old graphs pointing at the broken [old-style retriggered] docker-image tasks had broken task definitions, so we had to launch new nightly graphs to pick up the fix. I waited for the afternoon nightlies.

(In reply to Aki Sasaki [:aki] from comment #16)
> The fix was to rerun those tasks via the taskcluster `rerun` command
> (taskcluster cli or tc-filter.py).

Alternately, using the new-style custom-action retriggers works here.