We should be able to minimize cost and still handle large crash spikes by using the Elastic Load Balancer feature to automatically spin collector EC2 nodes up and down.
It sounds like we're thinking of setting up autoscaling here, rather than an ELB feature. To that end, I looked to verify this was already set up, and found some bugs in the update-infrastructure script which had prevented the scaling policies from being properly applied. I've run it myself, and now we have scaling policies:

---
as-prod-collector-scale-down
Execute policy when: as-prod-collector-CPULow breaches the alarm threshold: CPUUtilization < 20 for 3 consecutive periods of 300 seconds for the metric dimensions AutoScalingGroupName = as-prod-collector
Take the action: Remove 3 instances
And then wait: 300 seconds before allowing another scaling activity

as-prod-collector-scale-up
Execute policy when: as-prod-collector-CPUHigh breaches the alarm threshold: CPUUtilization > 70 for 300 seconds for the metric dimensions AutoScalingGroupName = as-prod-collector
Take the action: Add 6 instances
And then wait: 300 seconds before allowing another scaling activity
---

Do we think we should scale a bit harder? I have it defaulted to adding 6 instances, but am considering (and would like feedback on) adding, say, 12 or 18 instead. https://github.com/mozilla/socorro-infra/pull/182 should fix the script errors and make this an automated part of our infra, as a runnable Jenkins job.
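One way to reason about the "scale harder" question: with a 300-second cooldown between scaling activities, the increment size bounds how fast we can absorb a spike. Here is a rough back-of-envelope sketch (the target of 36 extra instances is a made-up example, not anything from our config; it also ignores alarm-evaluation and instance-launch latency):

```python
COOLDOWN_S = 300  # matches the as-prod-collector policy cooldown

def cooldown_time_to_add(target, increment, cooldown=COOLDOWN_S):
    """Seconds spent waiting on cooldowns before at least `target`
    extra instances have been launched, given a fixed per-activity
    `increment`. The first scaling activity fires immediately."""
    steps = -(-target // increment)  # ceiling division
    return (steps - 1) * cooldown

if __name__ == "__main__":
    # Hypothetical spike needing ~36 extra collectors:
    for inc in (6, 12, 18):
        print(f"add {inc} per activity -> {cooldown_time_to_add(36, inc)} s of cooldown")
```

So going from 6 to 12 per activity roughly halves the cooldown-bound ramp-up time for a big spike, at the cost of more aggressive (and more expensive) overshoot on smaller ones.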
We are in prod, and have been for a while.