Closed
Bug 1387191
Opened 7 years ago
Closed 5 years ago
Alerts for low CPU credits on t2 instances
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: sfraser)
Details
The goal is to make sure we're not using up all our CPU credits when running signing workers.
Updated•7 years ago
Component: General → Buildduty
Priority: -- → P2
QA Contact: catlee
Comment 1•7 years ago
This sounds like something we should look into soon. Bumping priority.
Priority: P2 → P1
Updated•7 years ago
Assignee: nobody → sfraser
Assignee
Comment 2•7 years ago
Short version:

So our options for this are:

1. A signing worker creates its own alarm, which will email us. If we make the signing workers an autoscaling group, this alarm could also add a new instance.
2. Nagios polls this data, and when we make a new signing worker, we manually add it to Nagios. Alarms can be email or IRC.

Both options will need an IAM user with the right permissions: one on the instance, the other in Nagios. The permissions don't appear particularly open to abuse:

ec2:DescribeInstanceStatus
ec2:DescribeInstances
ec2:DescribeInstanceRecoveryAttribute
ec2:RecoverInstances

Is email a sufficient notification for our needs? CloudWatch's alarms only do email, instance actions or autoscaling actions.

Will the signing workers be an autoscaling group in the foreseeable future?

Rationale:

CloudWatch alarms are aimed at either 'ad hoc, manual creation' or 'completely scripted'. We can't say 'Do this for all instances that match this instance name' or even 'all instances'. AWS's model for this seems to be 'add a startup script to your instance template to create the alarm'.

We could do that for the signingworker tools, and set it to run on startup/install. The API reference seems to indicate that the same endpoint both creates and updates an alarm, so re-running the alarm creation should be ok:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html

boto3 does most of the heavy lifting:
https://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.put_metric_alarm

The same API lets us read the data that's being collected, so we could write a Nagios plugin easily enough. This is more work for us if the workers ever start autoscaling or being dynamically deployed.
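For illustration, option 1 might look roughly like the sketch below. It assumes the caller passes in a boto3 CloudWatch client; the function name, the `Minimum` statistic, the 300-second period and the single evaluation period are assumptions made for the example, not settings agreed in this bug. The threshold and the notification target are left to the caller.

```python
def ensure_credit_alarm(cloudwatch, instance_id, alarm_name, threshold, alarm_actions):
    """Create or update a low-CPUCreditBalance alarm for one instance.

    `cloudwatch` is expected to be a boto3 CloudWatch client, e.g.
    boto3.client("cloudwatch", region_name=...). PutMetricAlarm is keyed on
    the alarm name, so calling this again with the same name updates the
    existing alarm rather than creating a second one.
    """
    params = {
        "AlarmName": alarm_name,
        "AlarmDescription": "Low CPU credit balance on %s" % instance_id,
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",      # assumption: alarm on the lowest datapoint
        "Period": 300,               # assumption: five-minute evaluation window
        "EvaluationPeriods": 1,      # assumption: alarm on the first breach
        "Threshold": float(threshold),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": list(alarm_actions),  # e.g. a list of SNS topic ARNs
    }
    cloudwatch.put_metric_alarm(**params)
    return params
```

Taking the client as an argument keeps the function easy to exercise with a stand-in object, without AWS credentials.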
Flags: needinfo?(aki)
Comment 3•7 years ago
I think email works. New signing scriptworker instances require new gpg pubkeys in puppet hiera, as well as the initial puppet setup, so I don't think autoscaling works unless we change those.
Flags: needinfo?(aki)
Comment 4•7 years ago
IIRC we also have some SNS notifications which are relayed into IRC by relengbot.
Assignee
Comment 5•7 years ago
If this sounds like a good plan, I'll add the script to Puppet:

1. We make a new IAM user with an access key, with cloudwatch:PutMetricAlarm permissions
2. We put the access key and secret in Puppet's secrets stash
3. We make a new topic for SNS notification
4. Puppet copies a script to the signing workers, using it as a template to populate the notification destination
5. Periodically, Puppet runs the script with the access keys in its environment, which creates/updates the cloudwatch alarm

If we re-run the script it doesn't break, it just updates the alarm with that existing name. This'll be useful when we change the notification settings.

One gotcha is that we store the useful hostnames in Tags, which are a bit of a faff to get at from the instance itself, so I propose that since Puppet knows the nice hostnames (e.g. signing-linux-1) it passes that value along with the access keys in the environment. This lets us name the alarm with a hostname that we'll recognise. For example, signing_linux_1_CPUCreditBalance
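A rough sketch of the naming and environment handling described above. `SIGNING_HOSTNAME` and `ALARM_TOPIC_ARN` are hypothetical variable names invented for this example; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard variables boto3 reads credentials from. The real script may well be structured differently.

```python
import os


def alarm_name_for(hostname, metric="CPUCreditBalance"):
    """Turn a Puppet hostname into an alarm name we will recognise,
    e.g. signing-linux-1 becomes signing_linux_1_CPUCreditBalance."""
    return "%s_%s" % (hostname.replace("-", "_"), metric)


def settings_from_env(environ=os.environ):
    """Collect what the alarm script needs from the environment Puppet sets.

    SIGNING_HOSTNAME and ALARM_TOPIC_ARN are hypothetical names. The AWS_*
    variables are only checked here; boto3 picks them up by itself.
    """
    required = (
        "SIGNING_HOSTNAME",
        "ALARM_TOPIC_ARN",
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
    )
    missing = [name for name in required if not environ.get(name)]
    if missing:
        # Fail loudly so a misconfigured Puppet run is visible in its report.
        raise SystemExit("missing environment variables: " + ", ".join(missing))
    return {
        "alarm_name": alarm_name_for(environ["SIGNING_HOSTNAME"]),
        "topic_arn": environ["ALARM_TOPIC_ARN"],
    }
```

Because the alarm name is derived only from the hostname, repeated Puppet runs produce the same name and therefore update the same alarm.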
Flags: needinfo?(catlee)
Comment 6•7 years ago
+1

Do you know how to get alerts from the SNS topics?
Flags: needinfo?(catlee)
Assignee
Comment 7•7 years ago
(In reply to Chris AtLee [:catlee] from comment #6)
> +1
>
> do you know how to get alerts from the SNS topics?

I think they just function as email lists, don't they? There's an identifier for each topic that you get from the SNS tools.
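As a hedged illustration of the "email list" idea: with boto3, a topic's identifier is its ARN, and an address is attached to the topic with a subscription. The function name and its arguments below are assumptions for the example, and the SNS client is passed in by the caller.

```python
def subscribe_email(sns, topic_name, address):
    """Create (or look up) an SNS topic and subscribe an email address to it.

    `sns` is expected to be a boto3 SNS client. create_topic returns the ARN
    of the existing topic if one with this name is already owned by the
    account, so calling this more than once is safe.
    """
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    # Email subscriptions stay pending until the recipient confirms them.
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=address)
    return topic_arn
```

The returned ARN is the value a CloudWatch alarm would list as its notification target.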
Assignee
Comment 8•7 years ago
signing-linux-{1..12} now have alarms that notify release+cloudwatch@ when their CPUCreditBalance falls below 5.0. Setup is not automatic yet, due to the blocking bug.
Comment 9•6 years ago
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: catlee → jlund
Comment 10•6 years ago
Hello, any updates on this bug? Can it be marked as FIXED?
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Comment 11•5 years ago
sfraser: can we close this bug? Should we resolve and file elsewhere if we need more worker coverage?
Flags: needinfo?(sfraser)
Assignee
Comment 12•5 years ago
I think we should close it, yes - we've delayed long enough that the architecture will change before we make progress.
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(sfraser)
Resolution: --- → FIXED
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard