Closed Bug 1315932 Opened 8 years ago Closed 8 years ago

Cloud mirror is giving HTTP 503s for log downloads

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)


https://dashboard.heroku.com/apps/cloud-mirror/metrics/web?starting=24-hours-ago shows the app crashing.

https://dashboard.heroku.com/apps/cloud-mirror/activity shows the `REDIS` environment variable changed 26 minutes ago.

Related from an email to all Heroku admins (sent 11 days ago):

"""
Your database (REDIS on cloud-mirror) must undergo maintenance.

Several vendors recently disclosed some vulnerabilities (CVE-2016-5195) in the Linux Kernel. You can find more details here. As a result, we need to perform maintenance to your database as soon as possible to apply the patch for this vulnerability. You can find FAQ about this maintenance here, please check it out if you have any questions regarding this maintenance.

We plan on performing this maintenance at 2016-11-08 11:00:00 +0000 during your set maintenance window of Tuesdays 11:00 to 15:00 UTC.

At that time, we will fail you over to your HA standby and recreate any followers you may have.

You may choose a new or more appropriate maintenance window, as needed. For example, you may run: heroku redis:maintenance --window="Tuesday 14:30" REDIS to set a maintenance window for Tuesdays at 2:30pm UTC.

You can also run this maintenance directly, using heroku redis:maintenance --run REDIS.
"""

And then this morning:

"""
We are currently replacing your database (REDIS on cloud-mirror) with the high availability standby.

When this is complete, the REDIS_URL config var on cloud-mirror will be changed, and your application will be restarted.

If you've copied and pasted that database's credentials to other apps, you'll need to update those manually.
"""
So the cloud-mirror app also has `REDIS_HOST` and `REDIS_PASS` environment variables, which were presumably copy-pasted from `REDIS_URL`.

However this is bad practice, since they won't be automatically updated for Heroku maintenance events, unlike `REDIS_URL`.

The correct fix is to extract password and hostname from the URL by parsing it. For now I've just updated `REDIS_HOST` manually.

However there still appear to be some errors - looking further.
Ah, there was `REDIS_PORT` too, which was similarly hardcoded based on an old value from `REDIS_URL` (it appears they use random ports too).

I've updated that and now cloud-mirror is fine, e.g.:
https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FHOCuxd1SSoKf52hLRQnGwA%2F0%2Fpublic%2Ftest_info%2F%2Fmochitest-gl_errorsummary.log

John, could you:
* file a bug blocking this one to handle switching to parsing `REDIS_URL` rather than having the manual `REDIS_(HOST|PASS|PORT)` environment variables (the Node.js redis client can actually be passed the URL directly: https://github.com/NodeRedis/node_redis#rediscreateclient)
* add a system/group to papertrail for cloud-mirror, since I struggled to filter the logs for cloud-mirror only
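
For illustration, parsing the URL as requested in the first item could look roughly like the sketch below. It is not taken from the cloud-mirror code base: `parseRedisUrl` is a hypothetical helper, the example URL is made up, and the fallback to port 6379 is an assumption about the desired behaviour when a URL has no port. It relies only on the `URL` class built into Node.js.

```javascript
// Sketch only: derive Redis connection settings from a single URL so that
// host, port and password never need to be stored as separate variables.
// `parseRedisUrl` is a hypothetical helper, not part of cloud-mirror.
function parseRedisUrl(redisUrl) {
  const parsed = new URL(redisUrl);
  if (parsed.protocol !== 'redis:' && parsed.protocol !== 'rediss:') {
    // Report only the protocol so the password never ends up in logs.
    throw new Error(`Unsupported protocol for Redis URL: ${parsed.protocol}`);
  }
  return {
    host: parsed.hostname,
    // Assumption: fall back to the default Redis port when none is given.
    port: parsed.port ? Number(parsed.port) : 6379,
    // The URL API keeps credentials percent-encoded, so decode them.
    password: parsed.password ? decodeURIComponent(parsed.password) : undefined,
  };
}

// Made-up example value; on Heroku this would come from process.env.REDIS_URL.
const exampleUrl = 'redis://h:s3cr%40t@redis.example.com:12345';
const config = parseRedisUrl(exampleUrl);
console.log(config.host, config.port);
```

With something like this in place, a change to `REDIS_URL` during a Heroku failover would be picked up on the next restart without anyone editing host, port or password by hand. If the redis client accepts the URL directly, as the link above suggests, the parsing step may not be needed at all.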
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Also, given:

> 11:45 <•jhford> but because we're using herkou redis outside of heroku, the docker cloud config needed to change

...I'm presuming this can never be fully hands-free (though fixing the web dyno usage of the environment variables would still fix access to existing objects, even if the copier fails for new ones).

As such, perhaps it would be good to plan to always proactively perform maintenance failovers rather than waiting for the scheduled maintenance time? (The initial email from Heroku about it was 11 days ago).
(In reply to Ed Morley [:emorley] from comment #3)
> So the cloud-mirror app also has `REDIS_HOST` and `REDIS_PASS` environment
> variables, which were presumably copy-pasted from `REDIS_URL`.

Yes, it is.

> However this is bad practice, since they won't be automatically updated for
> Heroku maintenance events, unlike `REDIS_URL`.

Actually, the issue here is that the redis parameters are shared between Heroku and Docker Cloud. Heroku doesn't have any sort of hook for this, so we end up in a situation where changes on the Heroku side break the Docker Cloud side.

> The correct fix is to extract password and hostname from the URL by parsing
> it. For now I've just updated `REDIS_HOST` manually.

Yes, I agree parsing `REDIS_URL` is the right thing to do, but until we can move the copier nodes into a Heroku private space in us-west-2, there's little to no value in making this change.

> However there still appear to be some errors - looking further.

(In reply to Ed Morley [:emorley] from comment #5)
> Ah there was REDIS_PORT too, which was similarly hardcoded based on an old
> value from REDIS_URL (it appears they use random ports too).

Yes.

> I've updated that and now cloud-mirror is fine, eg:
> https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FHOCuxd1SSoKf52hLRQnGwA%2F0%2Fpublic%2Ftest_info%2F%2Fmochitest-gl_errorsummary.log

Actually, the fix was me setting the correct values in the docker-cloud configuration file.  I just did that.

> John, could you:
> * file a bug blocking this one to handle switching to parsing REDIS_URL
> rather than having the manual REDIS_(HOST|PASS|PORT) environment variables
> (the nodejs redis client can actually be passed the URL directly:
> https://github.com/NodeRedis/node_redis#rediscreateclient)

https://github.com/taskcluster/cloud-mirror/issues/32

> * add a system/group to papertrail for cloud-mirror, since I struggled to
> filter the logs for cloud-mirror only

We all do.  There's no good way to hook cloud-mirror copiers (where the real issue was) into papertrail, but moving to a private space in us-west-2 should get this for us automatically.
Flags: needinfo?(jhford)
Many thanks :-)

The web dynos were giving 503s too though, whereas presumably they would have fallen back to the canonical source after 30 seconds if the copiers were non-responsive?
Component: Platform and Services → Services