Closed Bug 1343294 Opened 7 years ago Closed 7 years ago

AWS reporting issues with S3

Categories

(Infrastructure & Operations :: MOC: Problems, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: achavez, Unassigned)

References

Details

(Whiteboard: [stockwell infra])

digi reported S3 issues in #moc at 9:55 AM

Checking here for updates: https://status.aws.amazon.com/
Group: infra
This seems to have affected https://shipit.mozilla-releng.net and https://mozilla-releng.net/treestatus, which forced a tree closure by making hg read-only.
Group: infra
Also affecting:

Smartsheet - Amazon S3 Incident affecting Smartsheet Functionality
Amazon S3 Incident affecting Smartsheet Functionality
Incident Report for Smartsheet
New Incident Status: Identified
Smartsheet customers attempting to upload files to Smartsheet, access files previously uploaded to Smartsheet, and access Published HTML sheets will currently experience issues. Smartsheet uses a secure proxy to Amazon S3 for file storage, and at this time the Amazon S3 US-EAST-1 Region is experiencing an increased error rate. Our operations team is continuing to monitor the Amazon S3 incident, and we apologize for the inconvenience.

https://status.aws.amazon.com/
Feb 28, 10:24 PST
Adding travis so he can cc the appropriate people on his team, too.
Pretty sure this impacts product delivery. Looks like bouncer is dead, and most files it would point to are also hosted on S3.
10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well, which means FF downloads & updates are not happening

Also:

Auth0 - Older versions of Auth0 lock widget are failing because of S3 outage in us-east-1
Older versions of Auth0 lock widget are failing because of S3 outage in us-east-1
Incident Report for Auth0
New Incident Status: Identified
Older versions of lock pull assets directly from s3 in the us-east-1 region for the us region. These assets can't be reached. We are currently evaluating workarounds. Note that this impacts the lock widget only, authentication services continue to operate.
Feb 28, 10:44 PST
Other sites affected:
* New Relic APM (e.g. the error analytics view, which makes it hard to tell what else is broken)
* Heroku dashboard/platform API/builds/slug downloads
(In reply to Ashlee Chavez [:ashlee] from comment #5)
> 10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well,
> which means FF downloads & updates are not happening

Any more information on what the errors are?

Traffic levels to bouncer look normal, and error rates (non-200) are also trending the same. 
Bouncer 302s to download.cdn.mozilla.net, which is ultimately S3-backed. If something doesn't exist in the CDN cache and it tries to pull from S3, it may be unavailable until S3 returns to normal.
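Elevated S3 error rates like the ones described here are typically absorbed on the client side with retries and exponential backoff. A minimal, generic sketch (not Mozilla's or AWS's actual client code; `with_backoff` and its parameters are hypothetical):

```python
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and full jitter.

    fn stands in for any S3 call (e.g. a GET of an object). The sleep
    parameter is injectable so the logic can be exercised without
    actually waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Full jitter: sleep a random fraction of the capped backoff.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Backoff with jitter avoids the thundering-herd effect of many clients retrying in lockstep against an already-degraded service.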
12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also affected and currently some folks aren't able to use the service.
(In reply to Ashlee Chavez [:ashlee] from comment #8)
> 12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also
> affected and currently some folks aren't able to use the service.

12:09 PM <soap> lucidchart themselves are aware and are just awaiting for amazon
12:30 PM <erahm> Not sure who's in charge of it, but https://standu.ps seems to be down.

See #standup

Also:

12:39 PM <rcarroll> Moc please add Airmo VOD to services affected by AWS outage.

12:41 PM <rcarroll> Will confirm service is restored as soon as the AWS issue is resolved.
Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.
(In reply to Benson Wong [:mostlygeek] from comment #7)
> Any more information on what the errors are?

check in #buildduty -- this info was relayed from RelEng
Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.

Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.

Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
(In reply to Ashlee Chavez [:ashlee] from comment #13)
> Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully
> recovered now. We are still working to recover normal operations for adding
> new objects to S3.
> 
> Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals,
> listing and deletions. We continue to work on recovery for adding new
> objects to S3 and expect to start seeing improved error rates within the
> hour.
> 
> Update at 11:35 AM PST: We have now repaired the ability to update the
> service health dashboard. The service updates are below. We continue to
> experience high error rates with S3 in US-EAST-1, which is impacting various
> AWS services. We are working hard at repairing S3, believe we understand
> root cause, and are working on implementing what we believe will remediate
> the issue.

OK, well, from following both issues on the AWS outage (not necessarily Mozilla-related), I found out that much of the service health dashboards provided to consumers like Mozilla rely on a single server in S3. This does not seem very resilient to me. Are we working with Amazon to fix this, or just happy saying it is back up?
It appears that s3 should be fully functioning again:

Update at 2:08 PM PST: As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.
(In reply to Bill Gianopoulos [:WG9s] from comment #14)
> (In reply to Ashlee Chavez [:ashlee] from comment #13)
> > [AWS status updates quoted in comment #13 snipped]
> 
> OK, well, from following both issues on the AWS outage (not necessarily
> Mozilla-related), I found out that much of the service health dashboards
> provided to consumers like Mozilla rely on a single server in S3. This
> does not seem very resilient to me. Are we working with Amazon to fix
> this, or just happy saying it is back up?

I am sure many customers will be reviewing the SLA, response, and notifications from AWS after this outage, and will have suggestions for improvement both on how to avoid a recurrence and on proper incident management.
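The single-region dependency raised in the comments above is commonly mitigated by replicating objects to a second region and reading from the replica when the primary errors. A minimal, hypothetical sketch (region names and fetchers are placeholders, and it assumes cross-region replication is already in place):

```python
def fetch_with_failover(fetchers):
    """Try each region's fetcher in order, returning the first success.

    fetchers: list of (region_name, zero-arg callable) pairs, ordered by
    preference (primary region first). Raises only if every region fails.
    """
    errors = []
    for region, fetch in fetchers:
        try:
            return region, fetch()
        except Exception as exc:
            errors.append((region, exc))
    raise RuntimeError("all regions failed: %r" % errors)
```

The callable-based design keeps the failover logic independent of any particular S3 client, so the same pattern works for a CDN origin, a status dashboard, or any other single-region dependency.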
Heroku is still partly down (affecting builds and dyno scaling) though the API/auth came back up in the last 5-10 mins.
Travis CI is still failing to run builds, process logs, and update PR statuses.

Slowly getting there :-)
Whiteboard: [stockwell infra]
Amazon's summary of the S3 outage: https://aws.amazon.com/message/41926/
Issue resolved, AWS summary posted, closing bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED