1343294 - Aws reporting issues with s3

Also affecting: Smartsheet - Amazon S3 Incident affecting Smartsheet Functionality Amazon S3 Incident affecting Smartsheet Functionality Incident Report for Smartsheet New Incident Status: Identified Smartsheet customers attempting to upload files to Smartsheet, access files previously uploaded to Smartsheet, and access Published HTML sheets will currently experience issues. Smartsheet uses a secure proxy to Amazon S3 for file storage, and at this time the Amazon S3 US-EAST-1 Region is currently experiencing an increased error rate. Our operations team is continuing to monitor the Amazon S3 incident, and we apologize for the inconvenience https://status.aws.amazon.com/ Feb 28, 10:24 PST

Amy Rich [:arr] [:arich]

Comment 3

•

8 years ago

Adding travis so he can cc the appropriate people on his team, too.

Chris AtLee [:catlee]

Comment 4

•

8 years ago

Pretty sure this impacts product delivery. Looks like bouncer is dead, and most files it would point to are also hosted on S3.

Ashlee Nguyen [:ashlee]

Reporter

Comment 5

•

8 years ago

10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well, which means FF downloads & updates are not happening Also: Auth0 - Older versions of Auth0 lock widget are failing because of S3 outage in us-east-1 Older versions of Auth0 lock widget are failing because of S3 outage in us-east-1 Incident Report for Auth0 New Incident Status: Identified Older versions of lock pull assets directly from s3 in the us-east-1 region for the us region. These assets can't be reached. We are currently evaluating workarounds. Note that this impacts the lock widget only, authentication services continue to operate. Feb 28, 10:44 PST

Ed Morley [:emorley]

Comment 6

•

8 years ago

Other sites affected: * New Relic APM (eg error analytics view, which makes it hard to tell what else is broken) * Heroku dashboard/platform API/builds/slug downloads

Benson Wong [:mostlygeek]

Comment 7

•

8 years ago

(In reply to Ashlee Chavez [:ashlee] from comment #5) > 10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well, > which means FF downloads & updates are not happening Any more information on what the errors are? Traffic levels to bouncer looks normal and error rates (non 200) are also trending the same. Bouncer 302's to download.cdn.mozilla.net, which is ultimately S3 backed. If something doesn't exist in the CDN cache and it tried to pull from S3 it may be unavailable until S3 returns to normal.

Ashlee Nguyen [:ashlee]

Reporter

Comment 8

•

8 years ago

12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also affected and currently some folks aren't able to use the service.

Ashlee Nguyen [:ashlee]

Reporter

Comment 9

•

8 years ago

(In reply to Ashlee Chavez [:ashlee] from comment #8) > 12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also > affected and currently some folks aren't able to use the service. 12:09 PM <soap> lucidchart themselves are aware and are just awaiting for amazon

Ashlee Nguyen [:ashlee]

Reporter

Comment 10

•

8 years ago

12:30 PM <erahm> Not sure who's in charge of it, but https://standu.ps seems to be down. See #standup Also: 12:39 PM <rcarroll> Moc please add Airmo VOD to services affected by AWS outage. 12:41 PM <rcarroll> Will confirm service is restored as soon as the AWS issue is resolved.

Linda Ypulong [:unixfairy]

Comment 11

•

8 years ago

Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.

hwine

Comment 12

•

8 years ago

(In reply to Benson Wong [:mostlygeek] from comment #7) > Any more information on what the errors are? check in #buildduty -- this info was relayed from RelEng

Ashlee Nguyen [:ashlee]

Reporter

Comment 13

•

8 years ago

Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3. Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour. Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.

Bill Gianopoulos [:WG9s]

Comment 14

•

8 years ago

(In reply to Ashlee Chavez [:ashlee] from comment #13) > Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully > recovered now. We are still working to recover normal operations for adding > new objects to S3. > > Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, > listing and deletions. We continue to work on recovery for adding new > objects to S3 and expect to start seeing improved error rates within the > hour. > > Update at 11:35 AM PST: We have now repaired the ability to update the > service health dashboard. The service updates are below. We continue to > experience high error rates with S3 in US-EAST-1, which is impacting various > AWS services. We are working hard at repairing S3, believe we understand > root cause, and are working on implementing what we believe will remediate > the issue. OK well from following both issues on the AWS outage not necessarily Mozilla related I found out that much of the service health dashboards provided to consumers like Mozilla are all reliant on a single server in S3 this does not seem extremely resilient to me. Are we working with Amazon to fix this. or just happy saying it is back up?

Greg Arndt [:garndt]

Comment 15

•

8 years ago

It appears that s3 should be fully functioning again: Update at 2:08 PM PST: As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.

Linda Ypulong [:unixfairy]

Comment 16

•

8 years ago

(In reply to Bill Gianopoulos [:WG9s] from comment #14) > (In reply to Ashlee Chavez [:ashlee] from comment #13) > > Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully > > recovered now. We are still working to recover normal operations for adding > > new objects to S3. > > > > Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, > > listing and deletions. We continue to work on recovery for adding new > > objects to S3 and expect to start seeing improved error rates within the > > hour. > > > > Update at 11:35 AM PST: We have now repaired the ability to update the > > service health dashboard. The service updates are below. We continue to > > experience high error rates with S3 in US-EAST-1, which is impacting various > > AWS services. We are working hard at repairing S3, believe we understand > > root cause, and are working on implementing what we believe will remediate > > the issue. > > OK well from following both issues on the AWS outage not necessarily Mozilla > related I found out that much of the service health dashboards provided to > consumers like Mozilla are all reliant on a single server in S3 this does > not seem extremely resilient to me. Are we working with Amazon to fix this. > or just happy saying it is back up? I am sure many customers will be reviewing the SLA, response and notifications from AWS based on this outage and have suggestions for improvements on how to avoid in the future and proper incident management.

Ed Morley [:emorley]

Comment 17

•

8 years ago

Heroku is still partly down (affecting builds and dyno scaling) though the API/auth came back up in the last 5-10 mins. Travis CI is failing to run builds/process logs/update PR status still. Slowly getting there :-)

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

8 years ago

Whiteboard: [stockwell infra]

Linda Ypulong [:unixfairy]

Comment 19

•

8 years ago

https://aws.amazon.com/message/41926/ summary of S3 outage from Amazon

Keegan Ferrando [:fauweh]

Comment 20

•

8 years ago

Issue resolved, AWS summary posted, closing bug.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Comment hidden (Intermittent Failures Robot)

Bugzilla

Aws reporting issues with s3

Categories

(Infrastructure & Operations :: MOC: Problems, task)

Tracking

(Not tracked)

People

(Reporter: achavez, Unassigned)

References

Details

(Whiteboard: [stockwell infra])

Crash Data

Security

(public)

User Story

Description

Updated

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 16

Comment 17

Comment 18

Updated

Comment 19

Comment 20

Comment 21