Bug 1126410 — Hello is failing to respond to some requests
Status: RESOLVED FIXED (Closed)
Opened 9 years ago · Closed 9 years ago
Component: Cloud Services :: Operations: Miscellaneous (task)
Tracking: Not tracked
Reporter: standard8
Assignee: bobm
Description

We're seeing various people report that Hello isn't working. This started around 18:25 UTC and is still going on for some people. One report was:

    "Loop hawkRequest error:" Object { error: Exception, message: null, code: null, errno: null, toString: this.HawkClient.prototype._constructError/errorObj.toString() }
    "RoomList error" Error: NS_ERROR_NET_RESET

Other people have not been able to get the room lists. We're also not seeing any failure information on the status monitoring at https://status.services.mozilla.com/
Comment 1 • 9 years ago (Assignee)
This problem was the result of a load test being run against the production environment. It exposed a number of shortfalls that should be addressed in separate bugs. I shall list them here, and close out this bug after creating the bugs for those individual issues.

Monitoring: The lack of anomaly alerting led to this being a user-reported issue. Anomalous fluctuations in traffic should be detected and alerted through normal escalation channels. Also, periods of anomalous traffic activity should be noted on the status dashboard.

Auto-scaling: During this issue, the AWS Auto Scaling Group for the loop-server application servers should have spawned new instances to deal with the increased load. It did not.

IP throttling: A sensible rate-limiting policy on production would have done much to filter out this traffic. See: https://github.com/mozilla/videur

Feature request for Loads V2: :benbangert mentioned that it might make good sense to add IP blacklisting, or another type of failsafe, to Loads V2 to discourage load testing of production environments.
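For the auto-scaling item, here is a minimal sketch of the kind of policy and alarm that could be attached to the group, using boto3. The group name, region, thresholds, and adjustment values are placeholders for illustration, not the actual production configuration:

    # Hypothetical sketch: scale-out policy plus CloudWatch alarm for the
    # loop-server Auto Scaling Group. Names and numbers are placeholders.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Simple scaling policy: add two instances when triggered.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="loop-server-prod",   # assumed ASG name
        PolicyName="loop-server-scale-out",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=2,
        Cooldown=300,
    )

    # Alarm that fires the policy when average CPU stays high for 5 minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="loop-server-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "loop-server-prod"}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=5,
        Threshold=70.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )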
Assignee: nobody → bobm
Status: NEW → ASSIGNED
QA Contact: nayarpapa
Comment 2 • 9 years ago (Assignee)
Fixing QA contact. Bugzilla is _helpful_ sometimes.
QA Contact: nayarpapa → rpappalardo
Comment 3 • 9 years ago (Assignee)
(In reply to Bob Micheletto [:bobm] from comment #1)
> Monitoring: The lack of anomaly alerting led to this being a user reported
> issue. Anomalous fluctuations in traffic should be detected, and alerted
> through normal escalation channels. Also, periods of anomalous traffic
> activity should be noted on the status dashboard.

Filed under Bug 1126589.
Comment 4 • 9 years ago (Assignee)
(In reply to Bob Micheletto [:bobm] from comment #1)
> Auto-scaling: During this issue, the AWS Auto Scaling Group for the
> loop-server application servers should have spawned new instances to deal
> with the increased load. It did not.

Filed in Bug 1126605.
Comment 5 • 9 years ago (Assignee)
(In reply to Bob Micheletto [:bobm] from comment #1)
> IP Throttling: A sensible rate limiting policy on Production would have done
> much to filtering out this traffic. See: https://github.com/mozilla/videur

Filed in Bug 1126611.
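As a rough illustration of the kind of per-IP throttling being proposed (not how videur or the loop-server actually implements it; rates here are arbitrary), a simple token-bucket limiter could look like:

    # Hypothetical per-IP token bucket, purely illustrative of the idea.
    import time
    from collections import defaultdict

    RATE = 5.0    # tokens added per second, per IP (arbitrary)
    BURST = 20.0  # maximum bucket size (arbitrary)

    _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow_request(ip):
        """Return True if this IP may make a request right now."""
        bucket = _buckets[ip]
        now = time.monotonic()
        # Refill tokens for elapsed time, capped at the burst size.
        bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
        bucket["last"] = now
        if bucket["tokens"] >= 1.0:
            bucket["tokens"] -= 1.0
            return True
        return False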
Comment 6 • 9 years ago (Assignee)
(In reply to Bob Micheletto [:bobm] from comment #1)
> Feature request for Loads V2: :benbangert mentioned that it might make good
> sense to add an IP blacklisting, or other type of failsafe in Loads V2 to
> discourage the load testing of production environments.

Added feature request here: https://github.com/loads/loads-broker/issues/19
Comment 7 • 9 years ago (Assignee)
Closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 9 • 9 years ago
Thanks for filing these bugs, Bob. Another one I see (but I don't know whether it's covered already or not) is the monitoring of our response times, and alerting based on that. Are we currently doing that through heka? If not, I believe we should!
Flags: needinfo?(bobm)
Comment 10 • 9 years ago (Assignee)
Alexis, there isn't a specific monitor for it. However, we are logging the request_time for users, and we can add a summary to the Kibana dashboard to get that information.
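As a rough sketch of the query that could back such a dashboard summary (the index pattern and field name are assumptions about the logging schema, not the real one), an Elasticsearch percentiles aggregation over request_time might look like:

    # Hypothetical request_time summary query; index pattern and field name
    # are assumptions, not the actual loop-server logging schema.
    import json
    import urllib.request

    query = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "aggs": {
            "request_time_percentiles": {
                "percentiles": {"field": "request_time", "percents": [50, 95, 99]}
            }
        },
    }

    req = urllib.request.Request(
        "http://localhost:9200/loop-server-*/_search",  # assumed index pattern
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    print(json.load(urllib.request.urlopen(req)))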
Flags: needinfo?(bobm)
Comment 11 • 9 years ago (Assignee)
(In reply to Alexis Metaireau (:alexis) from comment #9)

I should mention, we used response time anomaly detection in Sync 1.5 for a while, and found it hard to filter out the actionable alerts. I think we should establish a baseline model here for what they look like over time before turning those on.
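A minimal sketch of what building such a baseline could mean in practice (window size and the 3-sigma cutoff here are arbitrary illustrations, not a proposed alerting policy): keep rolling statistics of request_time and only flag points far outside them.

    # Illustrative rolling baseline for request_time; all numbers arbitrary.
    from collections import deque
    from statistics import mean, stdev

    class Baseline:
        def __init__(self, window=1440):          # e.g. one sample/minute for a day
            self.samples = deque(maxlen=window)

        def is_anomalous(self, request_time):
            anomalous = False
            if len(self.samples) >= 30:           # require a minimal history first
                mu, sigma = mean(self.samples), stdev(self.samples)
                anomalous = request_time > mu + 3 * sigma
            self.samples.append(request_time)
            return anomalous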