Closed Bug 1431110 Opened 6 years ago Closed 6 years ago

balrog submission of complete updates fails on Windows

Categories

(Firefox Build System :: Task Configuration, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Unassigned)

Details

(Whiteboard: [stockwell infra])

Attachments

(1 file)

See https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=4e429d313fd2e0f9202271ee8f3fb798817ec3e7&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable where all those "complete update" jobs on Windows are exceptions without logs.

<Callek> this sounds like the issue that was plaguing releases (on multiple branches) yesterday evening, which was triggered by some docker-worker/docker-image change glandium made
Related to bug 1430037?
Flags: needinfo?(mh+mozilla)
Is this the thing :glandium fixed last night?
Component: General → Task Configuration
Flags: needinfo?(aki)
2018-01-17T13:24:55 CRITICAL - Chain of Trust verification error!
Traceback (most recent call last):
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1557, in verify_chain_of_trust
    await build_task_dependencies(chain, chain.task, chain.name, chain.task_id)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 650, in build_task_dependencies
    await build_task_dependencies(chain, task_defn, task_name, task_id)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 650, in build_task_dependencies
    await build_task_dependencies(chain, task_defn, task_name, task_id)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 650, in build_task_dependencies
    await build_task_dependencies(chain, task_defn, task_name, task_id)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 650, in build_task_dependencies
    await build_task_dependencies(chain, task_defn, task_name, task_id)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 650, in build_task_dependencies
    await build_task_dependencies(chain, task_defn, task_name, task_id)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 650, in build_task_dependencies
    await build_task_dependencies(chain, task_defn, task_name, task_id)
  File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 635, in build_task_dependencies
    raise CoTError("Too deep recursion!\n{}".format(name))
scriptworker.exceptions.CoTError: 'Too deep recursion!\nbalrog:beetmover:signing:partials:docker-image:action:decision'

This is a different error than yesterday, which referred to task.extra.chainOfTrust.inputs being missing. It's possible they both were addressed with the same fix. I haven't verified the fix, so it's also possible this is the new manifestation of bustage.
Flags: needinfo?(aki)
The actually interesting part of the log is:
2018-01-17T13:11:10    DEBUG -               signing:beetmover:signing:partials:docker-image:action:decision BVEKMafXQgW3IP7BTrsYsw is docker-worker
2018-01-17T13:11:10     INFO - build_task_dependencies signing:beetmover:signing:partials:docker-image:action:decision BVEKMafXQgW3IP7BTrsYsw

And that decision task is, coincidentally, the one from the push that backed out bug 1430037.

That's picked up because the previous verification was for:

2018-01-17T13:11:10    DEBUG -               signing:beetmover:signing:partials:docker-image C_Qz11BdRqmguq5Z6Qk7VA is docker-worker

which is a docker-image that was generated on that same push, and had a dependency on that decision task above.

BUT. On that push, the docker images didn't trigger automatically, because of the task graph optimizations. So I triggered them manually, through treeherder's "Add new jobs". My guess is that the cot verification code doesn't like the resulting state.

I guess beta is similarly affected?
Flags: needinfo?(mh+mozilla)
This blocks nightly updates for Windows, so raising severity to blocker.
Severity: major → blocker
They have been rebuilt, but that failure is still using the previous images. Are they retriggers? They would need a fresh decision task doing new optimizations to get the new images.
(In reply to Aki Sasaki [:aki] from comment #8)
> Ok. I manually triggered new nightlies via the hook.
> 
> win64 https://tools.taskcluster.net/groups/LiJS01NsQJCQZDReIDYL4A
> win32 https://tools.taskcluster.net/groups/Y1oCqQbrQxqR920jDS9LNA

They're still busted on the decision task from that push... Maybe what needs to happen is for those docker images to be attached to another decision task for the cot code to be happy with the situation...

One way we can fix this for good is to update the index for all the docker images to make them point to their last value before bug 1430037 landed. I can do that quite trivially.
Landed https://hg.mozilla.org/build/puppet/rev/7e3d4c5c5bb0 and am forcing balrog+signing scriptworker puppet runs.
Once those finish, I'll rerun the exception tasks from the above graphs. Fingers crossed.
I also had to puppetize beetmover scriptworkers, for beetmover-checksums.

So I'm guessing the new docker-image tasks have more tasks in the chain than before: decision + action + docker-image. Combined with the already long Windows chain, that exceeded the hardcoded maximum chain length of 5 we used to have for cot verification. The new configurable maximum of 20 should hold us over for some indeterminate length of time.
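
For anyone reading this later, here is a minimal sketch of how a chain-length limit like this can trip. It is not the scriptworker code: `build_dependencies`, `ChainTooDeepError`, `GRAPH`, and the colon-counting depth check are all assumptions made for illustration, based only on the colon-separated name in the traceback above. The task names are the ones from the error message.

```python
class ChainTooDeepError(Exception):
    """Raised when a dependency chain is longer than the allowed maximum."""


def build_dependencies(graph, task, name, max_chain_length):
    """Walk `graph` depth-first and return every qualified link name.

    `name` is the colon-separated path from the root task to `task`.
    Assumption: depth is measured as the number of colons in `name`.
    """
    if name.count(":") > max_chain_length:
        raise ChainTooDeepError("Too deep recursion!\n{}".format(name))
    names = [name]
    for dep in graph.get(task, []):
        child_name = "{}:{}".format(name, dep)
        names.extend(build_dependencies(graph, dep, child_name, max_chain_length))
    return names


# Hypothetical, linearized version of the Windows chain from the traceback.
GRAPH = {
    "balrog": ["beetmover"],
    "beetmover": ["signing"],
    "signing": ["partials"],
    "partials": ["docker-image"],
    "docker-image": ["action"],
    "action": ["decision"],
    "decision": [],
}
```

Under these assumptions, `build_dependencies(GRAPH, "balrog", "balrog", 5)` raises on the final `decision` link because its name contains six colons, while the same call with a limit of 20 returns all seven link names.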

https://tools.taskcluster.net/groups/LiJS01NsQJCQZDReIDYL4A is now green except for a non-cot-related test failure.
https://tools.taskcluster.net/groups/Y1oCqQbrQxqR920jDS9LNA is now green except for 2 non-cot-related test failures.

Thanks everyone. Sorry for the ugliness here. I think we've identified some areas to clean up.
Resolving.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
Product: TaskCluster → Firefox Build System