Closed Bug 1126410 Opened 9 years ago Closed 9 years ago

Hello is failing to respond to some requests

Categories

(Cloud Services :: Operations: Miscellaneous, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: standard8, Assigned: bobm)

References

Details

We're seeing various people report that Hello isn't working. This started around 18:25 UTC and is still going on for some people.

One report was:

"Loop hawkRequest error:" Object { error: Exception, message: null, code: null, errno: null, toString: this.HawkClient.prototype._constructError/errorObj.toString() }
"RoomList error" Error: NS_ERROR_NET_RESET

Other people have not been able to get the room lists.

We're also not seeing any failure information on the status monitoring at https://status.services.mozilla.com/
This problem was the result of a load test being run against the production environment.  It exposed a number of shortfalls that should be addressed in separate bugs.  I shall list them here, and close out this bug after creating the bugs for those individual issues.

Monitoring: The lack of anomaly alerting meant this became a user-reported issue.  Anomalous fluctuations in traffic should be detected and alerts sent through the normal escalation channels.  Periods of anomalous traffic activity should also be noted on the status dashboard.
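
As a rough illustration, an anomaly check along these lines could back such an alert; the metric source, window size, and threshold below are hypothetical, not the actual production monitoring configuration:

```python
# Hypothetical sketch: flag anomalous swings in request volume against a
# rolling baseline. The metric source, window size, and threshold are
# assumptions, not the production monitoring setup.
from collections import deque
from statistics import mean, stdev

WINDOW = 60          # number of past intervals kept as the baseline
THRESHOLD_SIGMA = 4  # how far outside the baseline counts as anomalous

history = deque(maxlen=WINDOW)

def check_request_rate(count_this_interval):
    """Return True if the latest interval looks anomalous vs. the baseline."""
    anomalous = False
    if len(history) >= 10:  # need some history before judging
        mu = mean(history)
        sigma = stdev(history) or 1.0  # guard against a perfectly flat baseline
        anomalous = abs(count_this_interval - mu) > THRESHOLD_SIGMA * sigma
    history.append(count_this_interval)
    return anomalous
```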

Auto-scaling: During this issue, the AWS Auto Scaling Group for the loop-server application servers should have spawned new instances to deal with the increased load.  It did not.
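
For illustration only, a scale-out policy of the kind the group was missing might be attached like this with boto3; the group name, adjustment size, and cooldown are placeholders, not the real loop-server settings:

```python
# Hypothetical sketch using boto3: a simple scale-out policy that a CloudWatch
# alarm could trigger under load. The group name, adjustment, and cooldown are
# placeholders, not the real loop-server configuration.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

response = autoscaling.put_scaling_policy(
    AutoScalingGroupName="loop-server-prod",  # placeholder group name
    PolicyName="loop-server-scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,   # add two instances each time the policy fires
    Cooldown=300,          # wait five minutes before scaling again
)
# response["PolicyARN"] would then be set as the action of a CloudWatch alarm
# on, e.g., average CPU or request latency for the group.
```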

IP Throttling: A sensible rate-limiting policy in production would have done much to filter out this traffic.  See: https://github.com/mozilla/videur
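
As a sketch of the idea (not the videur implementation itself), a simple per-IP sliding-window limiter could look like this; the request budget and window length are assumed values:

```python
# Hypothetical sketch of per-IP rate limiting of the kind a videur-style policy
# would enforce in front of the service. The budget and window are assumptions.
import time
from collections import defaultdict, deque

MAX_REQUESTS = 100   # allowed requests per IP ...
WINDOW_SECONDS = 60  # ... per sliding window of this length

_recent = defaultdict(deque)

def allow_request(ip):
    """Return True if this IP is still within its per-window budget."""
    now = time.monotonic()
    window = _recent[ip]
    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False  # over budget: caller would answer 429 Too Many Requests
    window.append(now)
    return True
```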

Feature request for Loads V2: :benbangert mentioned that it might make good sense to add IP blacklisting, or another type of failsafe, to Loads V2 to discourage load testing of production environments.
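
A possible shape for such a failsafe, sketched here with a hypothetical blocklist of production hostnames (nothing Loads V2 actually ships):

```python
# Hypothetical sketch of such a failsafe: refuse to start a load-test run whose
# target looks like a production host. The hostnames below are placeholders,
# not anything Loads V2 actually checks.
from urllib.parse import urlparse

PRODUCTION_HOSTS = {
    "loop.services.mozilla.com",  # placeholder production endpoint
}

def assert_safe_target(target_url):
    """Raise if the load-test target appears to be a production environment."""
    host = urlparse(target_url).hostname or ""
    if host in PRODUCTION_HOSTS or host.endswith(".services.mozilla.com"):
        raise RuntimeError(
            "Refusing to load test %s: it looks like a production host" % host
        )
```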
Assignee: nobody → bobm
Status: NEW → ASSIGNED
QA Contact: nayarpapa
Fixing QA contact.  Bugzilla is _helpful_ sometimes.
QA Contact: nayarpapa → rpappalardo
(In reply to Bob Micheletto [:bobm] from comment #1)

> Monitoring: The lack of anomaly alerting led to this being a user reported
> issue.  Anomalous fluctuations in traffic should be detected, and alerted
> through normal escalation channels.  Also, periods of anomalous traffic
> activity should be noted on the status dashboard.

Filed under Bug 1126589.
(In reply to Bob Micheletto [:bobm] from comment #1)

> Auto-scaling: During this issue, the AWS Auto Scaling Group for the
> loop-server application servers should have spawned new instances to deal
> with the increased load.  It did not.

Filed in Bug 1126605.
(In reply to Bob Micheletto [:bobm] from comment #1)
> IP Throttling: A sensible rate limiting policy on Production would have done
> much to filtering out this traffic.  See: https://github.com/mozilla/videur

Filed in Bug 1126611.
(In reply to Bob Micheletto [:bobm] from comment #1)
 
> Feature request for Loads V2: :benbangert mentioned that it might make good
> sense to add an IP blacklisting, or other type of failsafe in Loads V2 to
> discourage the load testing of production environments.

Added feature request here: https://github.com/loads/loads-broker/issues/19
Closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Thanks for filing these bugs, Bob.

Another one I see (though I don't know whether it's already covered) is monitoring our response times, and alerting based on that.

Are we currently doing that through heka? If not, I believe we should!
Flags: needinfo?(bobm)
Alexis, there isn't a specific monitor for it.  However, we are logging the request_time for users and we can add a summary to the Kibana dashboard to get that information.
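
For example, a summary of the kind that could sit behind such a dashboard panel might compute percentiles over the logged request_time values; this is a sketch of the idea, not the actual Kibana configuration:

```python
# Sketch of a summary over logged request_time values (seconds); an
# illustration only, not the Kibana dashboard configuration itself.

def request_time_summary(request_times):
    """Return p50/p95/p99 for a list of request_time samples."""
    if not request_times:
        return {}
    ordered = sorted(request_times)
    def pct(p):
        idx = min(len(ordered) - 1, int(round(p * (len(ordered) - 1))))
        return ordered[idx]
    return {"p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}

# Example: request_time_summary([0.12, 0.10, 0.45, 0.11, 2.3])
```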
Flags: needinfo?(bobm)
(In reply to Alexis Metaireau (:alexis) from comment #9)

I should mention that we used response time anomaly detection in Sync 1.5 for a while, and found it hard to filter out the actionable alerts.  I think we should establish a baseline model here for what response times look like over time before turning those alerts on.
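
One rough way to build that baseline, assuming request_time samples can be bucketed by hour of day (the granularity and storage here are assumptions):

```python
# Hypothetical sketch of baselining: collect request_time samples per hour of
# day so "normal" can be characterised before alerts are enabled. Granularity
# and storage are assumptions.
from collections import defaultdict
from statistics import median

samples_by_hour = defaultdict(list)

def record(hour_of_day, request_time):
    """Accumulate a request_time sample for the given hour (0-23)."""
    samples_by_hour[hour_of_day].append(request_time)

def baseline():
    """Median request_time per hour of day, built from the collected samples."""
    return {hour: median(times) for hour, times in samples_by_hour.items()}
```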