Closed Bug 1624887 Opened 5 years ago Closed 5 years ago

worker-shutdown decision task leads to duplicate tasks with failed chain of trust checks

Categories

(Release Engineering :: Release Automation, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox76 fixed)

RESOLVED FIXED

People

(Reporter: malexandru, Assigned: mozilla)

References

(Blocks 1 open bug)

Attachments

(1 file)

Failure log: https://firefox-ci-tc.services.mozilla.com/tasks/WDVM8HohRB-YS8sA6_krEA/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FWDVM8HohRB-YS8sA6_krEA%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Fchain_of_trust.log#L142

Raw log: https://firefoxci.taskcluster-artifacts.net/WDVM8HohRB-YS8sA6_krEA/0/public/logs/chain_of_trust.log

2020-03-25T15:35:41    DEBUG -  /app/workdir/cot/MiZXK0OPQoW4J6qviiSVrQ/public/actions.json matches the expected sha256 37faabb63a6d4301fdd90730a626d61ac3a5c6a186e39256a03e8e19108278c4
2020-03-25T15:35:41    DEBUG -  /app/workdir/cot/SsjuZRBYShui3gWu_akvFA/public/actions.json matches the expected sha256 626f150e268f8325ce965317f92824cf23eb6369c9f7c8cf704f5922c08daa55
2020-03-25T15:35:41     INFO - Done
2020-03-25T15:35:41    DEBUG -  /app/workdir/cot/T4KkXxALRXWnjFKUtJ-g4A/public/parameters.yml matches the expected sha256 94a77c6056d7ca9dd61f0dad868d7dc7ad276f7a639c6f2c4d8339e498f3ff65
2020-03-25T15:35:41    DEBUG -  makedirs(/app/workdir/cot/U8zp1ZfrS--R1NJ8SNtMuw/./public/build)
2020-03-25T15:35:41     INFO - Done
2020-03-25T15:35:41    DEBUG -  /app/workdir/cot/T4KkXxALRXWnjFKUtJ-g4A/public/actions.json matches the expected sha256 fffd2f2a5829a80a599631d693f49ef37942f1f67b88da317a34e295d76ec443
2020-03-25T15:35:42     INFO - Done
2020-03-25T15:35:42     INFO - Done
2020-03-25T15:35:42    DEBUG -  /app/workdir/cot/U8zp1ZfrS--R1NJ8SNtMuw/public/build/setup.exe matches the expected sha256 0b92b3c2b581a1ce2925fd7a950d7290183e2506f4ee9ef6a1b7c05077c996a5
2020-03-25T15:35:42    DEBUG -  /app/workdir/cot/SsjuZRBYShui3gWu_akvFA/public/task-graph.json matches the expected sha256 45bc93d91b29c2def28cf161d076a38f3cb97e7898ffbe573e9c62f8240e83c0
2020-03-25T15:35:42     INFO - Done
2020-03-25T15:35:42    DEBUG -  /app/workdir/cot/T4KkXxALRXWnjFKUtJ-g4A/public/task-graph.json matches the expected sha256 538cef34fd6a086960a72241e7d82db13733b68d1379e23c3278d8ede41bbc44
2020-03-25T15:35:42     INFO - Done
2020-03-25T15:35:42    DEBUG -  /app/workdir/cot/MiZXK0OPQoW4J6qviiSVrQ/public/task-graph.json matches the expected sha256 c68d2c5c22f2caf38899533893972d583c35ac1a0fc4708c276b4714822bf255
2020-03-25T15:35:45     INFO - Done
2020-03-25T15:35:45    DEBUG -  /app/workdir/cot/U8zp1ZfrS--R1NJ8SNtMuw/public/build/target.zip matches the expected sha256 df34b17b7ee243489e917d1dbf321ac36ca1c7dc18ee2bc510c3cdd73b4532d2
2020-03-25T15:35:45     INFO - Verifying scriptworker WDVM8HohRB-YS8sA6_krEA as a scriptworker task...
2020-03-25T15:35:45     INFO - Verifying scriptworker:parent MiZXK0OPQoW4J6qviiSVrQ as a decision task...
2020-03-25T15:35:45     INFO - Verifying the scriptworker WDVM8HohRB-YS8sA6_krEA task definition is part of the scriptworker:parent MiZXK0OPQoW4J6qviiSVrQ task graph...
2020-03-25T15:35:45 CRITICAL - Can't find task scriptworker WDVM8HohRB-YS8sA6_krEA in scriptworker:parent MiZXK0OPQoW4J6qviiSVrQ task-graph.json!
2020-03-25T15:35:45 CRITICAL - Chain of Trust verification error!
Traceback (most recent call last):
  File "/app/lib/python3.8/site-packages/scriptworker/cot/verify.py", line 1909, in verify_chain_of_trust
    await verify_task_types(chain)
  File "/app/lib/python3.8/site-packages/scriptworker/cot/verify.py", line 1669, in verify_task_types
    await valid_task_types[task_type](chain, obj)
  File "/app/lib/python3.8/site-packages/scriptworker/cot/verify.py", line 1591, in verify_parent_task
    verify_link_in_task_graph(chain, link, target_link)
  File "/app/lib/python3.8/site-packages/scriptworker/cot/verify.py", line 922, in verify_link_in_task_graph
    raise_on_errors(["Can't find task {} {} in {} {} task-graph.json!".format(task_link.name, task_link.task_id, decision_link.name, decision_link.task_id)])
  File "/app/lib/python3.8/site-packages/scriptworker/cot/verify.py", line 302, in raise_on_errors
    raise CoTError("\n".join(errors))
scriptworker.exceptions.CoTError: "Can't find task scriptworker WDVM8HohRB-YS8sA6_krEA in scriptworker:parent MiZXK0OPQoW4J6qviiSVrQ task-graph.json!"
2020-03-25T15:35:45    ERROR - Hit ScriptWorkerException: "Can't find task scriptworker WDVM8HohRB-YS8sA6_krEA in scriptworker:parent MiZXK0OPQoW4J6qviiSVrQ task-graph.json!"
2020-03-25T15:35:45    DEBUG -  "/app/artifacts/public/logs/chain_of_trust.log" is encoded with "None" and has mime/type "text/plain"
2020-03-25T15:35:45     INFO - "/app/artifacts/public/logs/chain_of_trust.log" can be gzip'd. Compressing...

Looking at https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=294700116&revision=c3bcf36e28be539d85a106f3c32a80fb2dd05886 , there are 2 Bs and 2 Instrs.
My guess is someone retriggered these without using an action? If so, that probably means we need to mark these tasks as rerun-only.

Flags: needinfo?(malexandru)

Ah, I do see a rerun decision task. That may explain all the duplicates. All the exceptions in signing tasks from the original, failed decision task run are expected. https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=294689070&revision=c3bcf36e28be539d85a106f3c32a80fb2dd05886
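
To illustrate why those exceptions are expected, here is a rough sketch. It assumes task-graph.json is a mapping keyed by taskId and that the verifier reads the artifact produced by the decision task's latest run. It is a simplified stand-in, not scriptworker's actual verify_link_in_task_graph, and the graph contents below are made up.

```python
def find_missing_tasks(task_ids, task_graph):
    """Return the taskIds that do not appear as keys in a task-graph.json mapping."""
    return [task_id for task_id in task_ids if task_id not in task_graph]


# Hypothetical data: run 0 of the decision task created a signing task before
# the worker shut down; the rerun regenerated the graph with fresh taskIds.
created_by_run_0 = ["WDVM8HohRB-YS8sA6_krEA"]
task_graph_from_rerun = {"hypothetical-new-task-id": {"label": "signing"}}

missing = find_missing_tasks(created_by_run_0, task_graph_from_rerun)
print(missing)
```

Under those assumptions, any task submitted by the first run is absent from the rerun's graph, which matches the "Can't find task" error in the log.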

Summary: Intermittent scriptworker.exceptions.CoTError: "Can't find task scriptworker WDVM8HohRB-YS8sA6_krEA in scriptworker:parent MiZXK0OPQoW4J6qviiSVrQ task-graph.json!" → worker-shutdown decision task leads to duplicate tasks with failed chain of trust checks

Perhaps the decision task should

  1. check to see if there are any unexpected existing tasks in the taskGroup
  2. if so, cancel them

before submitting any tasks?
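
A minimal sketch of that check, assuming the decision task knows the full set of taskIds it intends to create (plus its own). The helper name is hypothetical; `queue` is assumed to behave like the Taskcluster Python client's Queue, whose listTaskGroup returns pages with "tasks" and an optional "continuationToken", and whose cancelTask cancels a single task.

```python
def cancel_unexpected_tasks(queue, task_group_id, expected_task_ids):
    """Cancel tasks in the group that this decision run did not plan to create.

    `expected_task_ids` should include the decision task's own taskId, since
    the decision task is itself a member of the task group.
    Returns the list of taskIds that were cancelled.
    """
    expected = set(expected_task_ids)
    cancelled = []
    query = {}
    while True:
        page = queue.listTaskGroup(task_group_id, query=query)
        for entry in page.get("tasks", []):
            task_id = entry["status"]["taskId"]
            if task_id not in expected:
                queue.cancelTask(task_id)
                cancelled.append(task_id)
        token = page.get("continuationToken")
        if not token:
            return cancelled
        # Follow pagination until the whole group has been inspected.
        query = {"continuationToken": token}
```

This only narrows the race: a task left over from an earlier run may already have been claimed before it is cancelled.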

Flags: needinfo?(malexandru)

Other suggestions we've brainstormed:

  • set retries: 0 in decision tasks
  • set a worker-level flag that specifies which error to use on worker-shutdown. This doesn't cover claim-expired though.
  • queue changes to rerun with a new taskId for decision tasks
    • queue flag to specify "I'm doing something that shouldn't be retried"
  • dependencies can specify a runId: they only start if that specific taskId/runId goes green.
  • atomic taskGroup submission

> set retries: 0 in decision tasks

I think that's the right fix here.
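
For reference, a small sketch of what that amounts to at the task-definition level. `retries` is the queue's task-definition field for automatic retries after infrastructure failures such as worker-shutdown; the helper and the trimmed definition below are hypothetical, and the real change is the pushed changeset linked in this bug, which is not reproduced here.

```python
import copy


def with_no_retries(task_definition):
    """Return a copy of a task definition with automatic retries disabled."""
    updated = copy.deepcopy(task_definition)
    updated["retries"] = 0
    return updated


# Hypothetical, heavily trimmed decision task definition.
decision_task = {"metadata": {"name": "Decision Task"}, "retries": 5}
print(with_no_retries(decision_task)["retries"])
```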

Assignee: nobody → aki
Status: NEW → ASSIGNED

I suppose we could set this unconditionally to 0, in case we hit this race condition in actions or cron... I'm on the fence.

Another idea, a breakpoint task: 1) it depends on the decision task, and 2) it calls the queue's status endpoint for the decision task and verifies that the latest runId matches the runId we were scheduled with, and that the run was successful. It fails if either check doesn't hold and succeeds if both do.
We can do that with the Python client, without any platform changes.
Payload: python verify_decision_task_status.py --runId 0 --taskId aaaaaaa
Or even verify_task_status.py, since it doesn't have to be a decision task.

Unspoken, but implied: all tasks would depend on the breakpoint task, other than the breakpoint task, which would depend on the decision task.
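
A rough sketch of the core check such a script might perform. The function name follows the script name suggested above, but everything here is an assumption: `queue` is any object whose status(taskId) returns the queue's task status structure, with a "runs" list carrying "runId" and "state".

```python
def verify_task_status(queue, task_id, run_id):
    """Return True only if `run_id` is the latest run of `task_id` and it completed."""
    runs = queue.status(task_id)["status"]["runs"]
    if not runs:
        return False
    # Pick the latest run explicitly rather than relying on list order.
    latest = max(runs, key=lambda run: run["runId"])
    return latest["runId"] == run_id and latest["state"] == "completed"
```

With the real Python client, `queue` would presumably be a taskcluster.Queue instance built with the deployment's rootUrl, and the script would exit non-zero when this returns False so that the breakpoint task fails and its dependents never start.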

Blocks: 1373013
Pushed by asasaki@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f3a28fba69d4 set decision task retries to 0. r=Callek
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Release Automation: Signing → Release Automation