AWS reporting issues with S3

Status

RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: achavez, Unassigned)

Tracking

(Blocks: 1 bug)

Details

(Whiteboard: [stockwell infra])

(Reporter)

Description

2 years ago
digi reported S3 issues in #moc at 9:55 AM

checking here for updates https://status.aws.amazon.com/
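
(For anyone else watching the dashboard, a minimal sketch of polling it programmatically instead of refreshing the page. The per-service RSS feed URL is an assumption -- the exact slug that https://status.aws.amazon.com/ uses for S3 in US-EAST-1 may differ -- and the rest is Python standard library only.)

# Minimal sketch: poll an AWS status dashboard RSS feed for S3 updates.
# The feed URL is an assumption; adjust it to whatever the dashboard links to.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://status.aws.amazon.com/rss/s3-us-east-1.rss"  # assumed slug

def latest_status_items(url=FEED_URL, limit=5):
    """Fetch the RSS feed and return (pubDate, title) for the newest items."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        tree = ET.parse(resp)
    items = tree.getroot().findall("./channel/item")
    return [(item.findtext("pubDate"), item.findtext("title"))
            for item in items[:limit]]

if __name__ == "__main__":
    for published, title in latest_status_items():
        print(published, "-", title)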
(Reporter)

Updated

2 years ago
Group: infra
This seems to have affected https://shipit.mozilla-releng.net and https://mozilla-releng.net/treestatus, which forced a tree closure by making hg read-only.
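
(Rough sketch, for reference, of checking tree state programmatically while the UI is flaky. The JSON endpoint and the "result"/"status" field names are assumptions about how https://mozilla-releng.net/treestatus is backed, not a documented API; verify against the real service before relying on it.)

# Rough sketch: query treestatus for the state of a tree, e.g. mozilla-inbound.
# The endpoint and response shape are assumptions; check the real service first.
import json
import urllib.request

TREESTATUS_API = "https://treestatus.mozilla-releng.net/trees"  # assumed endpoint

def tree_status(tree, base=TREESTATUS_API):
    with urllib.request.urlopen(f"{base}/{tree}", timeout=30) as resp:
        data = json.load(resp)
    # Assumed shape: {"result": {"tree": "...", "status": "open" | "closed" | ...}}
    return data.get("result", data).get("status", "unknown")

if __name__ == "__main__":
    print("mozilla-inbound:", tree_status("mozilla-inbound"))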
(Reporter)

Comment 2

2 years ago
Also affecting:

Smartsheet - Amazon S3 Incident affecting Smartsheet Functionality
Incident Report for Smartsheet
New Incident Status: Identified
Smartsheet customers attempting to upload files to Smartsheet, access files previously uploaded to Smartsheet, and access Published HTML sheets will currently experience issues. Smartsheet uses a secure proxy to Amazon S3 for file storage, and at this time the Amazon S3 US-EAST-1 Region is currently experiencing an increased error rate. Our operations team is continuing to monitor the Amazon S3 incident, and we apologize for the inconvenience.

https://status.aws.amazon.com/
Feb 28, 10:24 PST
Adding travis so he can cc the appropriate people on his team, too.
Pretty sure this impacts product delivery. Looks like bouncer is dead, and most files it would point to are also hosted on S3.
(Reporter)

Comment 5

2 years ago
10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well, which means FF downloads & updates are not happening

Also:

Auth0 - Older versions of Auth0 lock widget are failing because of S3 outage in us-east-1
Incident Report for Auth0
New Incident Status: Identified
Older versions of lock pull assets directly from S3 in the us-east-1 region for the US region. These assets can't be reached. We are currently evaluating workarounds. Note that this impacts the lock widget only; authentication services continue to operate.
Feb 28, 10:44 PST

Comment 6

2 years ago
Other sites affected:
* New Relic APM (e.g. the error analytics view, which makes it hard to tell what else is broken)
* Heroku dashboard/platform API/builds/slug downloads
(In reply to Ashlee Chavez [:ashlee] from comment #5)
> 10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well,
> which means FF downloads & updates are not happening

Any more information on what the errors are?

Traffic levels to bouncer look normal and error rates (non-200) are also trending the same. 
Bouncer 302's to download.cdn.mozilla.net, which is ultimately S3-backed. If something doesn't exist in the CDN cache and it tries to pull from S3, it may be unavailable until S3 returns to normal.
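
To make that failure mode concrete, here is a minimal sketch that asks bouncer for its redirect without following it and then probes the CDN/S3-backed URL directly. The download.mozilla.org entry point and the product/os/lang parameters are assumptions for illustration, and it needs the third-party requests package.

# Minimal sketch: take bouncer's 302 by hand, then probe the CDN/S3 origin.
# The bouncer URL and query parameters below are assumptions for illustration.
import requests

BOUNCER = ("https://download.mozilla.org/"
           "?product=firefox-latest&os=win64&lang=en-US")  # assumed parameters

def check_download(url=BOUNCER):
    # Step 1: get the redirect from bouncer without following it.
    hop = requests.get(url, allow_redirects=False, timeout=30)
    location = hop.headers.get("Location")
    print("bouncer:", hop.status_code, "->", location)
    if not location:
        return
    # Step 2: HEAD the CDN URL; a 200 means the object is cached or S3 is
    # serving again, while a 5xx or 404 points at the S3-backed origin.
    cdn = requests.head(location, allow_redirects=True, timeout=30)
    print("cdn/s3:", cdn.status_code,
          "content-length:", cdn.headers.get("Content-Length"))

if __name__ == "__main__":
    check_download()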
(Reporter)

Comment 8

2 years ago
12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also affected and currently some folks aren't able to use the service.
(Reporter)

Comment 9

2 years ago
(In reply to Ashlee Chavez [:ashlee] from comment #8)
> 12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also
> affected and currently some folks aren't able to use the service.

12:09 PM <soap> lucidchart themselves are aware and are just awaiting for amazon
(Reporter)

Comment 10

2 years ago
12:30 PM <erahm> Not sure who's in charge of it, but https://standu.ps seems to be down.

See #standup

Also:

12:39 PM <rcarroll> Moc please add Airmo VOD to services affected by AWS outage.

12:41 PM <rcarroll> Will confirm service is restored as soon as the AWS issue is resolved.
Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.
(In reply to Benson Wong [:mostlygeek] from comment #7)
> Any more information on what the errors are?

check in #buildduty -- this info was relayed from RelEng
(Reporter)

Comment 13

2 years ago
Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.

Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.

Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
(In reply to Ashlee Chavez [:ashlee] from comment #13)
> Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully
> recovered now. We are still working to recover normal operations for adding
> new objects to S3.
> 
> Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals,
> listing and deletions. We continue to work on recovery for adding new
> objects to S3 and expect to start seeing improved error rates within the
> hour.
> 
> Update at 11:35 AM PST: We have now repaired the ability to update the
> service health dashboard. The service updates are below. We continue to
> experience high error rates with S3 in US-EAST-1, which is impacting various
> AWS services. We are working hard at repairing S3, believe we understand
> root cause, and are working on implementing what we believe will remediate
> the issue.

OK, from following both the Mozilla-related and the non-Mozilla-related discussion of the AWS outage, I found out that much of the service health dashboard infrastructure provided to consumers like Mozilla is reliant on a single dependency in S3, which does not seem very resilient to me. Are we working with Amazon to fix this, or just happy saying it is back up?
It appears that S3 should be fully functioning again:

Update at 2:08 PM PST: As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.
(In reply to Bill Gianopoulos [:WG9s] from comment #14)
> (In reply to Ashlee Chavez [:ashlee] from comment #13)
> > Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully
> > recovered now. We are still working to recover normal operations for adding
> > new objects to S3.
> > 
> > Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals,
> > listing and deletions. We continue to work on recovery for adding new
> > objects to S3 and expect to start seeing improved error rates within the
> > hour.
> > 
> > Update at 11:35 AM PST: We have now repaired the ability to update the
> > service health dashboard. The service updates are below. We continue to
> > experience high error rates with S3 in US-EAST-1, which is impacting various
> > AWS services. We are working hard at repairing S3, believe we understand
> > root cause, and are working on implementing what we believe will remediate
> > the issue.
> 
> OK, from following both the Mozilla-related and the non-Mozilla-related
> discussion of the AWS outage, I found out that much of the service health
> dashboard infrastructure provided to consumers like Mozilla is reliant on a
> single dependency in S3, which does not seem very resilient to me. Are we
> working with Amazon to fix this, or just happy saying it is back up?

I am sure many customers will be reviewing the SLA, the response, and the notifications from AWS after this outage, and will have suggestions for improvements on how to avoid it in the future and on proper incident management.
Heroku is still partly down (affecting builds and dyno scaling) though the API/auth came back up in the last 5-10 mins.
Travis CI is failing to run builds/process logs/update PR status still.

Slowly getting there :-)
39 failures in 87 pushes (0.448 failures/push) were associated with this bug yesterday.  
Repository breakdown:
* mozilla-inbound: 21
* autoland: 14
* mozilla-aurora: 4

Platform breakdown:
* windowsxp: 17
* windows8-64: 17
* windows2012-64: 2
* windows2012-32: 2
* osx-10-7: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1343294&startday=2017-02-28&endday=2017-02-28&tree=all
Whiteboard: [stockwell infra]
https://aws.amazon.com/message/41926/ summary of S3 outage from Amazon
Issue resolved, AWS summary posted, closing bug.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
39 failures in 783 pushes (0.05 failures/push) were associated with this bug in the last 7 days. 

This is the #28 most frequent failure this week. 

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. **

Repository breakdown:
* mozilla-inbound: 21
* autoland: 14
* mozilla-aurora: 4

Platform breakdown:
* windowsxp: 17
* windows8-64: 17
* windows2012-64: 2
* windows2012-32: 2
* osx-10-7: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1343294&startday=2017-02-27&endday=2017-03-05&tree=all