1315932 - Cloud mirror is giving HTTP 503s for log downloads

Assignee

Description

•

8 years ago

eg:

requests.exceptions:HTTPError: 503 Server Error: Service Unavailable for url: https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FHOCuxd1SSoKf52hLRQnGwA%2F0%2Fpublic%2Ftest_info%2F%2Fmochitest-gl_errorsummary.log

(https://rpm.newrelic.com/accounts/677903/applications/14179757/filterable_errors?tw%5Bend%5D=1478603821&tw%5Bstart%5D=1478602021#/table?top_facet=transactionUiName&barchart=barchart&_k=k9qm8d)

Started approx 15 minutes ago.

Flags: needinfo?(jhford)

Ed Morley [:emorley]

Assignee

Comment 2

•

8 years ago

https://dashboard.heroku.com/apps/cloud-mirror/metrics/web?starting=24-hours-ago shows the app crashing.

https://dashboard.heroku.com/apps/cloud-mirror/activity shows the `REDIS` environment variable changed 26 minutes ago.

Related from an email to all Heroku admins (sent 11 days ago):

"""
Your database (REDIS on cloud-mirror) must undergo maintenance.

Several vendors recently disclosed some vulnerabilities (CVE-2016-5195) in the Linux Kernel. You can find more details here. As a result, we need to perform maintenance to your database as soon as possible to apply the patch for this vulnerability. You can find FAQ about this maintenance here, please check it out if you have any questions regarding this maintenance.

We plan on performing this maintenance at 2016-11-08 11:00:00 +0000 during your set maintenance window of Tuesdays 11:00 to 15:00 UTC.

At that time, we will fail you over to your HA standby and recreate any followers you may have.

You may choose a new or more appropriate maintenance window, as needed. For example, you may run: heroku redis:maintenance --window="Tuesday 14:30" REDIS to set a maintenance window for Tuesdays at 2:30pm UTC.

You can also run this maintenance directly, using heroku redis:maintenance --run REDIS.
"""

And then this morning:

"""
We are currently replacing your database (REDIS on cloud-mirror) with the high availability standby.

When this is complete, the REDIS_URL config var on cloud-mirror will be changed, and your application will be restarted.

If you've copied and pasted that database's credentials to other apps, you'll need to update those manually.
"""

Ed Morley [:emorley]

Assignee

Comment 3

•

8 years ago

So the cloud-mirror app also has `REDIS_HOST` and `REDIS_PASS` environment variables, which were presumably copy-pasted from `REDIS_URL`.

However this is bad practice, since they won't be automatically updated for Heroku maintenance events, unlike `REDIS_URL`.

The correct fix is to extract password and hostname from the URL by parsing it. For now I've just updated `REDIS_HOST` manually.

However there still appear to be some errors - looking further.
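
For reference, the URL parsing suggested above might look roughly like the sketch below. It is written in Python with only the standard library for brevity (cloud-mirror itself is a Node.js app), and the function name `redis_settings_from_url` and the placeholder URL are assumptions for illustration, not anything taken from the cloud-mirror codebase.

```python
import os
from urllib.parse import urlsplit, unquote


def redis_settings_from_url(url):
    """Split a redis:// URL into host, port and password."""
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        # Fall back to the default Redis port if the URL omits one.
        "port": parts.port or 6379,
        # Passwords in URLs may be percent-encoded, so decode them.
        "password": unquote(parts.password) if parts.password else None,
    }


if __name__ == "__main__":
    # Hypothetical placeholder, used only when REDIS_URL is not set.
    url = os.environ.get("REDIS_URL", "redis://h:secret@example.invalid:12345")
    settings = redis_settings_from_url(url)
    print(settings["host"], settings["port"])
```

Deriving all three values from `REDIS_URL` at startup means a Heroku failover that rewrites `REDIS_URL` is picked up on the automatic restart, with no separately stored copies to go stale.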

Ed Morley [:emorley]

Assignee

Comment 5

•

8 years ago

Ah there was REDIS_PORT too, which was similarly hardcoded based on an old value from REDIS_URL (it appears they use random ports too).

I've updated that and now cloud-mirror is fine, eg:
https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FHOCuxd1SSoKf52hLRQnGwA%2F0%2Fpublic%2Ftest_info%2F%2Fmochitest-gl_errorsummary.log

John, could you:
* file a bug blocking this one to handle switching to parsing REDIS_URL rather than having the manual REDIS_(HOST|PASS|PORT) environment variables (the nodejs redis client can actually be passed the URL directly: https://github.com/NodeRedis/node_redis#rediscreateclient)
* add a system/group to papertrail for cloud-mirror, since I struggled to filter the logs for cloud-mirror only

Assignee: nobody → emorley

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Ed Morley [:emorley]

Assignee

Comment 6

•

8 years ago

Also, given:

> 11:45 <•jhford> but because we're using herkou redis outside of heroku, the docker cloud config needed to change

...I'm presuming this can never be fully hands-free (though fixing the web dyno usage of the environment variables would still fix access to existing objects, even if the copier fails for new ones).

As such, perhaps it would be good to plan to always proactively perform maintenance failovers rather than waiting for the scheduled maintenance time? (The initial email from Heroku about it was 11 days ago).

John Ford [:jhford] CET/CEST Berlin Time

Comment 7

•

8 years ago

(In reply to Ed Morley [:emorley] from comment #3)
> So the cloud-mirror app also has `REDIS_HOST` and `REDIS_PASS` environment
> variables, which were presumably copy-pasted from `REDIS_URL`.

Yes, they were.

> However this is bad practice, since they won't be automatically updated for
> Heroku maintenance events, unlike `REDIS_URL`.

Actually, the issue here is that the redis parameters are shared between heroku and docker cloud. Heroku doesn't have any sort of hook for this, so we end up in a situation where changes on the heroku side break the docker cloud side.
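
As an illustration of the failure mode described here, the following Python sketch compares separately stored `REDIS_HOST`, `REDIS_PORT` and `REDIS_PASS` values against the current `REDIS_URL` and reports which ones have gone stale. The function name `find_drift` and the idea of running such a check are assumptions for illustration only; nothing like it is proposed in this bug or known to exist in cloud-mirror.

```python
from urllib.parse import urlsplit, unquote


def find_drift(env):
    """Return the names of REDIS_* variables that disagree with REDIS_URL."""
    parts = urlsplit(env["REDIS_URL"])
    expected = {
        "REDIS_HOST": parts.hostname,
        # Environment variables are strings, so compare the port as text.
        "REDIS_PORT": str(parts.port) if parts.port else None,
        "REDIS_PASS": unquote(parts.password) if parts.password else None,
    }
    # Only report variables that are actually set and differ.
    return sorted(
        name for name, value in expected.items()
        if name in env and env[name] != value
    )
```

Called with a mapping such as `dict(os.environ)` on a host that has both the URL and the copied values, an empty list would mean the copies still match.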

> The correct fix is to extract password and hostname from the URL by parsing
> it. For now I've just updated `REDIS_HOST` manually.

Yes, I agree parsing `REDIS_URL` is the right thing to do, but until we can move the copier nodes into a Heroku private space in us-west-2, there's little to no value in making this change.

> However there still appear to be some errors - looking further.

(In reply to Ed Morley [:emorley] from comment #5)
> Ah there was REDIS_PORT too, which was similarly hardcoded based on an old
> value from REDIS_URL (it appears they use random ports too).

Yes.

> I've updated that and now cloud-mirror is fine, eg:
> https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FHOCuxd1SSoKf52hLRQnGwA%2F0%2Fpublic%2Ftest_info%2F%2Fmochitest-gl_errorsummary.log

Actually, the fix was me setting the correct values in the docker-cloud configuration file.  I just did that.

> John, could you:
> * file a bug blocking this one to handle switching to parsing REDIS_URL
> rather than having the manual REDIS_(HOST|PASS|PORT) environment variables
> (the nodejs redis client can actually be passed the URL directly:
> https://github.com/NodeRedis/node_redis#rediscreateclient)

https://github.com/taskcluster/cloud-mirror/issues/32

> * add a system/group to papertrail for cloud-mirror, since I struggled to
> filter the logs for cloud-mirror only

We all do.  There's no good way to hook cloud-mirror copiers (where the real issue was) into papertrail, but moving to a private space in us-west-2 should get this for us automatically.

Flags: needinfo?(jhford)

Ed Morley [:emorley]

Assignee

Comment 8

•

8 years ago

Many thanks :-)

The web dynos were giving 503s too, though. If only the copiers had been unresponsive, presumably the web dynos would have fallen back to the canonical source after 30 seconds?

Comment hidden (Intermittent Failures Robot)

930 automation job failures were associated with this bug yesterday.

Repository breakdown:
* autoland: 930

Platform breakdown:
* linux64: 648
* android-4-3-armv7-api15: 251
* windows7-32-vm: 9
* windows10-64-vm: 8
* android-api-15-gradle: 4
* android-4-2-x86: 3
* android-4-0-armv7-api15: 3
* osx-10-7: 2
* linux32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1315932&startday=2016-11-08&endday=2016-11-08&tree=all

Comment hidden (Intermittent Failures Robot)

933 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 933

Platform breakdown:
* linux64: 652
* android-4-3-armv7-api15: 248
* windows7-32-vm: 10
* windows10-64-vm: 8
* android-api-15-gradle: 4
* android-4-0-armv7-api15: 4
* android-4-2-x86: 3
* osx-10-7: 2
* linux32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1315932&startday=2016-11-07&endday=2016-11-13&tree=all


Updated

•

5 years ago

Component: Platform and Services → Services


Categories

(Taskcluster :: Services, defect)

Tracking

(Not tracked)

People

(Reporter: emorley, Assigned: emorley)


Security

(public)
