It seems the last decision task ran about 6 hours ago, so something is wrong.
try, mozilla-inbound and fx-team are closed
The problem is that new tasks are not getting scheduled, which points to mozilla-taskcluster. The provisioner shows a pending count of 0. Looking at mozilla-taskcluster, there is a Redis addon config change (release v134 -> v135) that updates REDIS_URL. After installing redis-cli, I can confirm that the URL set in v135 points to a valid database and that the URL from v134 no longer does, so the config value applied by the addon is not a mistake. I also checked the Redis databases for both the staging and production envs, in case staging was somehow involved. Both work fine, and only production's REDIS_URL was updated. The URL is correct, yet the timing of the change matches when decision tasks stopped getting scheduled (around 7 hours ago), so my next guess is that the database was migrated before the config update and some configuration got lost along the way. I also tried restarting mozilla-taskcluster, but this did not help. That is not so surprising, as release v135 should have caused a restart anyway. Our mozilla-taskcluster expert is in Chicago, sleeping peacefully. The question now is whether we should wake him up. :)
There will be a post-mortem emailed out to the taskcluster tools list and the sheriffs once I fully wake up. But the quick story is that the Redis addon used by mozilla-taskcluster was automatically updated, which also regenerated the auth password. mozilla-taskcluster did not pick up this change and kept trying to use the previous value, causing an auth error. Redis is used as a job queue for mozilla-taskcluster. I updated the configuration and restarted mozilla-taskcluster, and decision tasks started flowing again.
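For anyone debugging something similar: a quick way to spot this kind of drift is to compare the URL the service is actually using against the one the addon currently publishes, component by component, without printing the secret itself. The following is only a rough sketch in Python using the standard library. The function name `diff_redis_urls` and both example URLs are made up for illustration; they are not taken from mozilla-taskcluster or its real configuration.

```python
from urllib.parse import urlsplit


def diff_redis_urls(old_url, new_url):
    """Return the names of the URL components that differ.

    Only component names are returned, never their values, so the
    result is safe to paste into a bug comment or a log.
    """
    old, new = urlsplit(old_url), urlsplit(new_url)
    fields = ("scheme", "username", "password", "hostname", "port", "path")
    return [name for name in fields if getattr(old, name) != getattr(new, name)]


# Hypothetical values: same host and port, but the password was regenerated.
old_url = "redis://h:oldsecret@redis.example.com:6379"
new_url = "redis://h:newsecret@redis.example.com:6379"

print(diff_redis_urls(old_url, new_url))  # ['password']
```

If only `password` shows up in the result, the host is reachable as before but the credentials are stale, which would match the auth error described above.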
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED