Closed Bug 1126410 Opened 9 years ago Closed 9 years ago

Hello is failing to respond to some requests

Categories

(Cloud Services :: Operations: Miscellaneous, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: standard8, Assigned: bobm)

References

Details

We're seeing various people report that Hello isn't working. This started around 18:25 UTC and is still going on for some people.

One report was:

"Loop hawkRequest error:" Object { error: Exception, message: null, code: null, errno: null, toString: this.HawkClient.prototype._constructError/errorObj.toString() }
"RoomList error" Error: NS_ERROR_NET_RESET

Other people have not been able to get the room lists.

We're also not seeing any failure information on the status monitoring at https://status.services.mozilla.com/
This problem was the result of a load test being run against the production environment.  It exposed a number of shortfalls that should be addressed in separate bugs.  I shall list them here, and close out this bug after creating the bugs for those individual issues.

Monitoring: The lack of anomaly alerting meant this became a user-reported issue.  Anomalous fluctuations in traffic should be detected and alerts sent through the normal escalation channels.  Periods of anomalous traffic activity should also be noted on the status dashboard.
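
As a rough illustration, an anomaly check along these lines could back such an alert; the metric source, window size, and threshold below are hypothetical, not the actual production monitoring configuration:

```python
# Hypothetical sketch: flag anomalous swings in request volume against a
# rolling baseline. The metric source, window size, and threshold are
# assumptions, not the production monitoring setup.
from collections import deque
from statistics import mean, stdev

WINDOW = 60          # number of past intervals kept as the baseline
THRESHOLD_SIGMA = 4  # how far outside the baseline counts as anomalous

history = deque(maxlen=WINDOW)

def check_request_rate(count_this_interval):
    """Return True if the latest interval looks anomalous vs. the baseline."""
    anomalous = False
    if len(history) >= 10:  # need some history before judging
        mu = mean(history)
        sigma = stdev(history) or 1.0  # guard against a perfectly flat baseline
        anomalous = abs(count_this_interval - mu) > THRESHOLD_SIGMA * sigma
    history.append(count_this_interval)
    return anomalous
```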

Auto-scaling: During this issue, the AWS Auto Scaling Group for the loop-server application servers should have spawned new instances to deal with the increased load.  It did not.
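
For illustration only, a scale-out policy of the kind the group was missing might be attached like this with boto3; the group name, adjustment size, and cooldown are placeholders, not the real loop-server settings:

```python
# Hypothetical sketch using boto3: a simple scale-out policy that a CloudWatch
# alarm could trigger under load. The group name, adjustment, and cooldown are
# placeholders, not the real loop-server configuration.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

response = autoscaling.put_scaling_policy(
    AutoScalingGroupName="loop-server-prod",  # placeholder group name
    PolicyName="loop-server-scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,   # add two instances each time the policy fires
    Cooldown=300,          # wait five minutes before scaling again
)
# response["PolicyARN"] would then be set as the action of a CloudWatch alarm
# on, e.g., average CPU or request latency for the group.
```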

IP Throttling: A sensible rate-limiting policy in production would have done much to filter out this traffic.  See: https://github.com/mozilla/videur
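
As a sketch of the idea (not the videur implementation itself), a simple per-IP sliding-window limiter could look like this; the request budget and window length are assumed values:

```python
# Hypothetical sketch of per-IP rate limiting of the kind a videur-style policy
# would enforce in front of the service. The budget and window are assumptions.
import time
from collections import defaultdict, deque

MAX_REQUESTS = 100   # allowed requests per IP ...
WINDOW_SECONDS = 60  # ... per sliding window of this length

_recent = defaultdict(deque)

def allow_request(ip):
    """Return True if this IP is still within its per-window budget."""
    now = time.monotonic()
    window = _recent[ip]
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False  # over budget: caller would answer 429 Too Many Requests
    window.append(now)
    return True
```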

Feature request for Loads V2: :benbangert mentioned that it might make good sense to add IP blacklisting, or another type of failsafe, to Loads V2 to discourage load testing of production environments.
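
A possible shape for such a failsafe, sketched here with a hypothetical blocklist of production hostnames (nothing Loads V2 actually ships):

```python
# Hypothetical sketch of such a failsafe: refuse to start a load-test run whose
# target looks like a production host. The hostnames below are placeholders,
# not anything Loads V2 actually checks.
from urllib.parse import urlparse

PRODUCTION_HOSTS = {
    "loop.services.mozilla.com",  # placeholder production endpoint
}

def assert_safe_target(target_url):
    """Raise if the load-test target appears to be a production environment."""
    host = urlparse(target_url).hostname or ""
    if host in PRODUCTION_HOSTS or host.endswith(".services.mozilla.com"):
        raise RuntimeError(
            "Refusing to load test %s: it looks like a production host" % host
        )
```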
Assignee: nobody → bobm
Status: NEW → ASSIGNED
QA Contact: nayarpapa
Fixing QA contact.  Bugzilla is _helpful_ sometimes.
QA Contact: nayarpapa → rpappalardo
(In reply to Bob Micheletto [:bobm] from comment #1)

> Monitoring: The lack of anomaly alerting led to this being a user reported
> issue.  Anomalous fluctuations in traffic should be detected, and alerted
> through normal escalation channels.  Also, periods of anomalous traffic
> activity should be noted on the status dashboard.

Filed under Bug 1126589.
(In reply to Bob Micheletto [:bobm] from comment #1)

> Auto-scaling: During this issue, the AWS Auto Scaling Group for the
> loop-server application servers should have spawned new instances to deal
> with the increased load.  It did not.

Filed in Bug 1126605.
(In reply to Bob Micheletto [:bobm] from comment #1)
> IP Throttling: A sensible rate limiting policy on Production would have done
> much to filtering out this traffic.  See: https://github.com/mozilla/videur

Filed in Bug 1126611.
(In reply to Bob Micheletto [:bobm] from comment #1)
 
> Feature request for Loads V2: :benbangert mentioned that it might make good
> sense to add an IP blacklisting, or other type of failsafe in Loads V2 to
> discourage the load testing of production environments.

Added feature request here: https://github.com/loads/loads-broker/issues/19
Closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Thanks for filing these bugs, Bob.

Another one I see (though I don't know whether it's already covered) is monitoring our response times, and alerting based on that.

Are we currently doing that through heka? If not, I believe we should!
Flags: needinfo?(bobm)
Alexis, there isn't a specific monitor for it.  However, we are logging the request_time for users and we can add a summary to the Kibana dashboard to get that information.
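
For example, a summary of the kind that could sit behind such a dashboard panel might compute percentiles over the logged request_time values; this is a sketch of the idea, not the actual Kibana configuration:

```python
# Sketch of a summary over logged request_time values (seconds); an
# illustration only, not the Kibana dashboard configuration itself.

def request_time_summary(request_times):
    """Return p50/p95/p99 for a list of request_time samples."""
    if not request_times:
        return {}
    ordered = sorted(request_times)
    def pct(p):
        idx = min(len(ordered) - 1, int(round(p * (len(ordered) - 1))))
        return ordered[idx]
    return {"p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}

# Example: request_time_summary([0.12, 0.10, 0.45, 0.11, 2.3])
```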
Flags: needinfo?(bobm)
(In reply to Alexis Metaireau (:alexis) from comment #9)

I should mention that we used response time anomaly detection in Sync 1.5 for a while, and found it hard to filter out the actionable alerts.  I think we should establish a baseline model here for what response times look like over time before turning those alerts on.
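
One rough way to build that baseline, assuming request_time samples can be bucketed by hour of day (the granularity and storage here are assumptions):

```python
# Hypothetical sketch of baselining: collect request_time samples per hour of
# day so "normal" can be characterised before alerts are enabled. Granularity
# and storage are assumptions.
from collections import defaultdict
from statistics import median

samples_by_hour = defaultdict(list)

def record(hour_of_day, request_time):
    """Accumulate a request_time sample for the given hour (0-23)."""
    samples_by_hour[hour_of_day].append(request_time)

def baseline():
    """Median request_time per hour of day, built from the collected samples."""
    return {hour: median(times) for hour, times in samples_by_hour.items()}
```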