Closed
Bug 1343294
Opened 7 years ago
Closed 7 years ago
AWS reporting issues with S3
Categories
(Infrastructure & Operations :: MOC: Problems, task)
Infrastructure & Operations
MOC: Problems
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: achavez, Unassigned)
References
Details
(Whiteboard: [stockwell infra])
digi reported in #moc of S3 issues at 9:55 AM; checking here for updates: https://status.aws.amazon.com/
Reporter
Updated•7 years ago
Group: infra
Comment 1•7 years ago

This seems to have affected https://shipit.mozilla-releng.net and https://mozilla-releng.net/treestatus, which forced a tree closure by making hg read-only.
Group: infra
Reporter
Comment 2•7 years ago

Also affecting Smartsheet:

> Amazon S3 Incident affecting Smartsheet Functionality
> Incident Report for Smartsheet
> Status: Identified
> Smartsheet customers attempting to upload files to Smartsheet, access files previously uploaded to Smartsheet, and access Published HTML sheets will currently experience issues. Smartsheet uses a secure proxy to Amazon S3 for file storage, and at this time the Amazon S3 US-EAST-1 Region is currently experiencing an increased error rate. Our operations team is continuing to monitor the Amazon S3 incident, and we apologize for the inconvenience.
> Feb 28, 10:24 PST

https://status.aws.amazon.com/
Comment 3•7 years ago
Adding travis so he can cc the appropriate people on his team, too.
Comment 4•7 years ago
Pretty sure this impacts product delivery. Looks like bouncer is dead, and most files it would point to are also hosted on S3.
Reporter
Comment 5•7 years ago

10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well, which means FF downloads & updates are not happening

Also affecting Auth0:

> Older versions of Auth0 lock widget are failing because of S3 outage in us-east-1
> Incident Report for Auth0
> Status: Identified
> Older versions of lock pull assets directly from S3 in the us-east-1 region for the US region. These assets can't be reached. We are currently evaluating workarounds. Note that this impacts the lock widget only; authentication services continue to operate.
> Feb 28, 10:44 PST
Comment 6•7 years ago

Other sites affected:
* New Relic APM (e.g. the error analytics view, which makes it hard to tell what else is broken)
* Heroku dashboard / platform API / builds / slug downloads
Comment 7•7 years ago

(In reply to Ashlee Chavez [:ashlee] from comment #5)
> 10:43 AM <hwine> unixfairy: fyi: we're seeing errors on bouncer as well,
> which means FF downloads & updates are not happening

Any more information on what the errors are? Traffic levels to bouncer look normal, and error rates (non-200) are also trending the same. Bouncer 302s to download.cdn.mozilla.net, which is ultimately S3-backed. If something doesn't exist in the CDN cache and it tries to pull from S3, it may be unavailable until S3 returns to normal.
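(A note for anyone triaging similar reports: during an elevated-error-rate event, failed S3 requests come back as 5xx responses with a small XML error body, whereas an ordinary CDN-origin miss is a NoSuchKey. The error codes below are real S3 error codes, but the helper itself is an illustrative sketch, not part of bouncer or the CDN.)

```python
import xml.etree.ElementTree as ET

# S3 error responses are small XML documents, e.g.:
#   <Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message></Error>
# Codes such as SlowDown, InternalError, and ServiceUnavailable indicate a
# service-side problem (retry later); NoSuchKey is an ordinary missing object.
RETRYABLE = {"SlowDown", "InternalError", "ServiceUnavailable"}

def classify_s3_error(body: str) -> str:
    """Return 'retryable', 'missing', or 'other' for an S3 error XML body."""
    code = ET.fromstring(body).findtext("Code", default="")
    if code in RETRYABLE:
        return "retryable"
    if code == "NoSuchKey":
        return "missing"
    return "other"
```

During the outage described here, a cache-miss pull from S3 would have surfaced as one of the retryable codes rather than a missing object.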
Reporter
Comment 8•7 years ago

12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also affected and currently some folks aren't able to use the service.
Reporter
Comment 9•7 years ago

(In reply to Ashlee Chavez [:ashlee] from comment #8)
> 12:09 PM <soap> unixfairy: adding more to amazon s3, lucidchart is also
> affected and currently some folks aren't able to use the service.

12:09 PM <soap> lucidchart themselves are aware and are just awaiting for amazon
Reporter
Comment 10•7 years ago

12:30 PM <erahm> Not sure who's in charge of it, but https://standu.ps seems to be down. See #standup

Also:

12:39 PM <rcarroll> Moc please add Airmo VOD to services affected by AWS outage.
12:41 PM <rcarroll> Will confirm service is restored as soon as the AWS issue is resolved.
Comment 11•7 years ago
Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.
Comment 12•7 years ago

(In reply to Benson Wong [:mostlygeek] from comment #7)
> Any more information on what the errors are?

Check in #buildduty -- this info was relayed from RelEng.
Reporter
Comment 13•7 years ago

Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully recovered now. We are still working to recover normal operations for adding new objects to S3.

Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals, listing and deletions. We continue to work on recovery for adding new objects to S3 and expect to start seeing improved error rates within the hour.

Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.
Comment 14•7 years ago

(In reply to Ashlee Chavez [:ashlee] from comment #13)
> Update at 1:12 PM PST: S3 object retrieval, listing and deletion are fully
> recovered now. We are still working to recover normal operations for adding
> new objects to S3.
>
> Update at 12:52 PM PST: We are seeing recovery for S3 object retrievals,
> listing and deletions. We continue to work on recovery for adding new
> objects to S3 and expect to start seeing improved error rates within the
> hour.
>
> Update at 11:35 AM PST: We have now repaired the ability to update the
> service health dashboard. The service updates are below. We continue to
> experience high error rates with S3 in US-EAST-1, which is impacting various
> AWS services. We are working hard at repairing S3, believe we understand
> root cause, and are working on implementing what we believe will remediate
> the issue.

OK, from following both threads on the AWS outage (not all of it Mozilla-related), I found out that many of the service health dashboards provided to customers like Mozilla all rely on a single server in S3, which does not seem very resilient to me. Are we working with Amazon to fix this, or just happy saying it is back up?
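The dashboard-dependency concern above is exactly why some operators run an independent canary: probe the service endpoint directly and classify the result, rather than trusting the provider's status page (which in this outage was itself hosted on the affected S3 region). A minimal sketch, assuming the public S3 endpoint hostname; the thresholds and helper names are illustrative, not an official monitoring recipe:

```python
from urllib import request, error

def classify_status(code: int) -> str:
    """Any well-formed HTTP answer (even 403/404 for an anonymous
    request) proves the service front end is answering; a 5xx
    suggests the kind of outage tracked in this bug."""
    return "degraded" if code >= 500 else "reachable"

def s3_health_probe(url: str = "https://s3.amazonaws.com/", timeout: float = 5.0) -> str:
    """Return 'reachable', 'degraded', or 'unreachable' for one HTTP probe."""
    try:
        with request.urlopen(request.Request(url, method="HEAD"), timeout=timeout) as resp:
            return classify_status(resp.status)
    except error.HTTPError as e:          # non-2xx still proves reachability
        return classify_status(e.code)
    except OSError:                       # DNS failure, timeout, refused connection
        return "unreachable"
```

A cron job logging `s3_health_probe()` every minute would have flagged this incident independently of the AWS status dashboard.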
Comment 15•7 years ago

It appears that S3 should be fully functioning again:

> Update at 2:08 PM PST: As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.
Comment 16•7 years ago

(In reply to Bill Gianopoulos [:WG9s] from comment #14)
> OK well from following both issues on the AWS outage not necessarily Mozilla
> related I found out that much of the service health dashboards provided to
> consumers like Mozilla are all reliant on a single server in S3 this does
> not seem extremely resilient to me. Are we working with Amazon to fix this.
> or just happy saying it is back up?

I am sure many customers will be reviewing the SLA, response, and notifications from AWS based on this outage, and will have suggestions for improvements on how to avoid this in the future and on proper incident management.
Comment 17•7 years ago

Heroku is still partly down (affecting builds and dyno scaling), though the API/auth came back up in the last 5-10 minutes. Travis CI is still failing to run builds, process logs, and update PR status. Slowly getting there :-)
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Whiteboard: [stockwell infra]
Comment 19•7 years ago

Summary of the S3 outage from Amazon: https://aws.amazon.com/message/41926/
Comment 20•7 years ago
Issue resolved, AWS summary posted, closing bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot)