This is likely a dupe of an existing ticket, but I cannot find it, so I am opening this one to make sure the incident is documented. On Saturday, at approximately 1:00pm Central, BSD moved Mozilla's sendto.mozilla.org EOY fundraising page to Akamai. The Akamai page was fast and functional up to the point of submitting the donation, where it hung for anybody paying by credit card. This was caught at approximately 11:30pm Central and resolved at 12:30am Central.
Seth Reznik, Dec 14 09:54: Below is our incident report for yesterday. The Akamai integration is currently off while we spend today and tomorrow working out a better way to make this happen.

Report:

Duration: 11h 10 minutes (2014-12-13 14:08 - 2014-12-14 01:18 EST)

Fault: The cardholder data environment (‘CDE’), responsible for processing contributions, was unable to make a required API call to the primary environment. This was due to the combination of IP whitelisting rules in the CDE and a DNS change made at 14:08.

Resolution: The DNS change made at 14:08 was reverted. Once this propagated, contributions began functioning again.

Analysis: The architecture of our CDE is such that all cardholder data is routed only through the servers in that environment, and any additional information required to process a transaction is requested via API calls to the primary, non-CDE environment. There is an explicit whitelist of the IP addresses with which the CDE is allowed to communicate. The CDE uses the hostname from the initial request to make these API requests. The IP address that this hostname resolved to was changed as part of an effort to mitigate risk to Mozilla’s site due to network issues observed in our data center. However, the whitelist was not updated, nor was an internal override put in place to route the requests to an existing whitelisted IP address. As a result, all contributions on the affected domain, “sendto.mozilla.org”, failed.

The issue persisted because of both a failure to test after the DNS change went live and a lack of sufficient monitoring and alerting for errors in the CDE. Our standard operating procedure after a change such as this is to make a test contribution to validate that the change took effect as expected. However, this procedure was not followed, and therefore the issue was not caught immediately following the change.
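The failure mode described above (a hostname that starts resolving to an address outside the CDE whitelist) can be checked mechanically before or right after a DNS change. The following is only a minimal sketch under stated assumptions, not BSD's actual tooling: the function names, the whitelist entries, and the resolver hook are all hypothetical, and the addresses come from the reserved documentation ranges.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def addresses_outside_whitelist(
    addresses: Iterable[str], whitelist: Iterable[str]
) -> List[str]:
    """Return the addresses that no whitelist entry (single IP or CIDR) covers."""
    networks = [ipaddress.ip_network(entry) for entry in whitelist]
    return [
        address
        for address in addresses
        if not any(ipaddress.ip_address(address) in net for net in networks)
    ]


def check_hostname(
    hostname: str,
    whitelist: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> List[str]:
    """Resolve hostname and return any resulting addresses missing from the whitelist."""
    return addresses_outside_whitelist(resolver(hostname), whitelist)


# Hypothetical whitelist; a stub resolver stands in for real DNS here.
HYPOTHETICAL_WHITELIST = ["192.0.2.0/28", "198.51.100.7"]
violations = check_hostname(
    "forms.example.org",
    HYPOTHETICAL_WHITELIST,
    resolver=lambda _hostname: ["203.0.113.9"],
)
print(violations)
```

An empty result would mean every resolved address is covered. A non-empty result is a signal to update the whitelist or add an override before the change goes live. A check like this complements the test contribution required by the standard operating procedure; it does not replace it.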
Additionally, while monitoring and alerting on error rates is performed in the CDE, it is not tuned to detect errors at this rate and time scale, and it does not bucket errors by client. Because the overall error rate was low, affecting contributions for only one client, the alerting threshold was not met.

Next Steps: We will examine our whitelisting procedures, code and overall architecture to determine if there is a code or architectural change we can make to remove the requirement that domains hosting our contribution forms resolve to a specific set of whitelisted IP addresses. We will also examine how to monitor client-specific error rates and use other approaches, such as synthetic monitoring, to quickly detect this class of error.
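To illustrate why an aggregate threshold can miss this class of failure, here is a minimal sketch of bucketing error rates by client. It is an assumption-laden example, not a description of BSD's monitoring: the `Outcome` record, the function names, the client labels, and the threshold values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Outcome:
    """One contribution attempt (hypothetical record shape)."""
    client: str
    ok: bool


def overall_error_rate(outcomes: Iterable[Outcome]) -> float:
    """Fraction of all attempts that failed, regardless of client."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if not o.ok) / len(outcomes)


def error_rates_by_client(outcomes: Iterable[Outcome]) -> Dict[str, float]:
    """Fraction of failed attempts, computed separately for each client."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for o in outcomes:
        totals[o.client] += 1
        if not o.ok:
            failures[o.client] += 1
    return {client: failures[client] / totals[client] for client in totals}


def clients_over_threshold(
    outcomes: Iterable[Outcome], threshold: float = 0.5, min_attempts: int = 5
) -> List[str]:
    """Clients with enough traffic whose own error rate meets the threshold."""
    outcomes = list(outcomes)
    counts: Dict[str, int] = defaultdict(int)
    for o in outcomes:
        counts[o.client] += 1
    rates = error_rates_by_client(outcomes)
    return sorted(
        client
        for client, rate in rates.items()
        if counts[client] >= min_attempts and rate >= threshold
    )


# 95 successful attempts for one client, 5 failed attempts for another.
sample = [Outcome("client-a", True)] * 95 + [Outcome("client-b", False)] * 5
print(overall_error_rate(sample))
print(clients_over_threshold(sample))
```

In this sample the overall error rate is 5%, which a global threshold tuned for larger outages would likely ignore, while the per-client view flags the one client for whom every attempt fails. The `min_attempts` guard is one possible way to avoid alerting on clients with too little traffic to judge.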
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED