Closed Bug 1206536 Opened 9 years ago Closed 9 years ago

gecko-decision tasks not being scheduled

Product/Component: Taskcluster :: General
Type: defect
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: philor
Assignee: Unassigned
Whiteboard: [infra-heroku]

Apparently AWS had an issue earlier today (visible in the graphs on http://status.taskcluster.net/ as the period when response times were in the tens of seconds). At about the end of it, several pushes, such as https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=0783f2174228 and https://treeherder.mozilla.org/#/jobs?repo=try&revision=eaa118b0c27a, had multiple gecko-decision tasks run on them. Those were the last gecko-decision tasks to be scheduled; pushes since then on mozilla-inbound, fx-team, and try have not had one.

All trees with taskcluster jobs are closed.
I opened a ticket with Heroku support to investigate issues connecting to a Redis database that we have. After some investigation and a change to some timeouts, our dyno for processing work appears to have recovered and started submitting decision tasks. I'm not currently aware of why these timeouts were not an issue before, as nothing has changed on our end.
See Also: → 1206527
I got the following email from heroku at 01:15 local time (16:15 Pacific Time on 20 Sep 2015):

======

Our monitoring app has been unable to reach your HA standby (REDIS on mozilla-taskcluster-staging) since 2015-09-20 23:13:34 UTC. This is likely due to an underlying hardware or network failure and not something caused by your application.

We're attempting to bring it back online automatically. If we can't, we'll page an engineer to help. We will shortly post an update on the ticket at https://help.heroku.com/tickets/281440.

======

Then I received the following email one minute later:

======

Your HA standby (REDIS on mozilla-taskcluster-staging) is back online as of at 2015-09-20 23:15:58 UTC.

We expect operations to continue normally, but feel free to notify us on the ticket at https://help.heroku.com/tickets/281440 if there are any outstanding issues with your HA standby (REDIS on mozilla-taskcluster-staging)

======
When I go to the ticket hyperlink provided (https://help.heroku.com/tickets/281440) I get "The page you were looking for does not exist. You may have mistyped the address or the page has moved."
Yeah, I got the same thing when trying to go to the ticket, so I opened a new ticket with them about this. Somewhere between them looking at it and adjusting the timeout on the Redis connection from no timeout to 60 seconds, things started working. The support team was not sure why this fixed the issue, but they are going to start making it the default when people first use a Redis add-on.
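For anyone wondering why going from "no timeout" to 60 seconds could matter, here is a rough sketch of one possible reading. It assumes the setting behaves like an idle timeout, where 0 means connections are never closed for inactivity (Redis's own `timeout` directive works that way); nothing in this bug confirms that this is the setting Heroku changed or that this is what actually unblocked things. The `Connection` type, the function name, and the numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    """Hypothetical record of one client connection (not a real Redis type)."""
    name: str
    last_active: float  # seconds on an arbitrary clock


def stale_connections(conns, now, idle_timeout):
    """Return the connections that an idle timeout would close.

    idle_timeout == 0 models "no timeout": nothing is ever closed, so a
    connection whose peer silently went away is kept indefinitely.
    """
    if idle_timeout == 0:
        return []
    return [c for c in conns if now - c.last_active > idle_timeout]


# Illustrative data: one connection last active before an outage, one recent.
conns = [
    Connection("worker-before-outage", last_active=0.0),
    Connection("worker-recent", last_active=290.0),
]

# With no timeout, the long-idle connection is never dropped.
no_timeout = stale_connections(conns, now=300.0, idle_timeout=0)

# With a 60-second timeout, only the long-idle connection is dropped.
with_timeout = stale_connections(conns, now=300.0, idle_timeout=60)
print([c.name for c in with_timeout])
```

If the stuck dyno was holding on to a connection that had died during the outage, a nonzero idle timeout would give that connection a way to be cleaned up and re-established. That is only a guess consistent with what was observed here.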
I'll leave this up to philor, but I think this bug could be closed. The issue was with some default settings for the Redis add-on provided by Heroku, and after making some changes suggested by Heroku support, Redis was available again.
Sweet action.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [infra-heroku]