Closed Bug 1387191 Opened 7 years ago Closed 5 years ago

Alerts for low CPU credits on t2 instances

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: sfraser)

Details

This is to make sure we aren't using up all our CPU credits when running the signing workers.
Component: General → Buildduty
Priority: -- → P2
QA Contact: catlee
This sounds like something we should look into soon. Bumping priority.
Priority: P2 → P1
Depends on: 1394130
Blocks: 1395001
Assignee: nobody → sfraser
Short version:

So our options for this are:
1. Each signing worker creates its own alarm, which will email us. If we make the signing workers an autoscaling group, this alarm could also add a new instance.
2. Nagios polls this data, and when we make a new signing worker, we manually add it to nagios. Alarms can be sent by email or to IRC.

Both options will need an IAM user with the right permissions - one on the instance, the other in nagios. The permissions don't appear particularly open to abuse: ec2:DescribeInstanceStatus, ec2:DescribeInstances, ec2:DescribeInstanceRecoveryAttribute, ec2:RecoverInstances.
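For reference, a minimal sketch of an IAM policy granting those permissions might look like this (the account-wide "Resource": "*" scoping is an assumption, not something specified in this bug):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceRecoveryAttribute",
        "ec2:RecoverInstances"
      ],
      "Resource": "*"
    }
  ]
}
```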

Is email a sufficient notification for our needs? CloudWatch's alarms only do email, instance actions, or autoscaling actions.
Will the signing workers be an autoscaling group in the foreseeable future?

Rationale:
CloudWatch alarms are aimed at either 'ad hoc, manual creation' or 'completely scripted'. We can't say 'Do this for all instances that match this instance name', or even 'all instances'. AWS's model for this seems to be 'add a startup script to your instance template to create the alarm'. We could do that for the signingworker tools and set it to run on startup/install.

The API reference seems to indicate that the same endpoint both creates and updates an alarm, so re-running the alarm creation should be OK:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html

boto3 does most of the heavy lifting:
https://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.put_metric_alarm
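A minimal boto3 sketch of that idempotent alarm creation (the helper names, topic ARN, and period/threshold values here are illustrative, not from this bug):

```python
def alarm_params(instance_id, hostname, topic_arn, threshold=5.0):
    """Build the kwargs for put_metric_alarm. The same call creates
    or updates an alarm of the same name, so re-running is safe."""
    return {
        "AlarmName": f"{hostname}_CPUCreditBalance",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": threshold,
        "AlarmActions": [topic_arn],  # e.g. an SNS topic ARN
    }


def put_alarm(instance_id, hostname, topic_arn):
    # boto3 picks up credentials from the environment, so this is
    # compatible with running under cron/Puppet with keys exported.
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**alarm_params(instance_id, hostname, topic_arn))
```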

The same API lets us read the data that's being collected, so we could write a nagios plugin easily enough. This is more work for us if the workers ever start autoscaling or being dynamically deployed.
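As a sketch of that nagios-plugin idea, using get_metric_statistics to read the metric back (the thresholds, function names, and exit-code mapping are assumptions, not anything specified here):

```python
import datetime

WARN, CRIT = 20.0, 5.0  # illustrative thresholds, not from this bug


def nagios_status(balance, warn=WARN, crit=CRIT):
    """Map a CPUCreditBalance reading to a nagios (exit_code, message)."""
    if balance < crit:
        return 2, f"CRITICAL: CPUCreditBalance {balance:.1f} < {crit}"
    if balance < warn:
        return 1, f"WARNING: CPUCreditBalance {balance:.1f} < {warn}"
    return 0, f"OK: CPUCreditBalance {balance:.1f}"


def fetch_balance(instance_id):
    """Fetch the most recent CPUCreditBalance datapoint for an instance."""
    import boto3
    cw = boto3.client("cloudwatch")
    now = datetime.datetime.utcnow()
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - datetime.timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None
```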
Flags: needinfo?(aki)
I think email works.
New signing scriptworker instances require new gpg pubkeys in puppet hiera, as well as the initial puppet setup, so I don't think autoscaling works unless we change those.
Flags: needinfo?(aki)
IIRC we also have some SNS notifications which are relayed into IRC by relengbot.
If this sounds like a good plan, I'll add the script to Puppet:

1. We make a new IAM user with an access key, with cloudwatch:PutMetricAlarm permissions
2. We put the access key and secret in Puppet's secrets stash
3. We make a new topic for SNS notification
4. Puppet copies a script to the signing workers, using it as a template to populate the notification destination
5. Periodically, Puppet runs the script with the access keys in its environment, which creates/updates the cloudwatch alarm

If we re-run the script it doesn't break; it just updates the alarm with that existing name. This'll be useful when we change the notification settings.

One gotcha is that we store the useful hostnames in Tags, which are a bit of a faff to get at from the instance itself. So I propose that, since Puppet knows the nice hostnames (e.g. signing-linux-1), it passes that value along with the access keys in the environment. This lets us name the alarm with a hostname that we'll recognise, for example signing_linux_1_CPUCreditBalance.
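A sketch of how the templated script could assemble the alarm name and credentials from the environment Puppet provides (all the environment variable names here are hypothetical):

```python
import os


def config_from_env(environ=None):
    """Read the hostname and AWS keys Puppet exports into the
    environment, and derive the alarm name from the hostname."""
    env = environ if environ is not None else os.environ
    hostname = env["SIGNING_HOSTNAME"]  # e.g. "signing-linux-1"
    # CloudWatch alarm names can contain hyphens; underscores are
    # used here only to match the naming example above.
    alarm_name = hostname.replace("-", "_") + "_CPUCreditBalance"
    return {
        "alarm_name": alarm_name,
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "sns_topic": env["SNS_TOPIC_ARN"],
    }
```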
Flags: needinfo?(catlee)
+1

do you know how to get alerts from the SNS topics?
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #6)
> +1
> 
> do you know how to get alerts from the SNS topics?

I think they just function as email lists, don't they? There's an identifier for each topic that you get from the SNS tools.
Depends on: 1402371
signing-linux-{1..12} now have alarms that notify release+cloudwatch@ when their CPUCreditBalance falls below 5.0. Setup is not automatic yet, due to the blocking bug.
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: catlee → jlund
Hello, any updates on this bug?
Can it be Fixed?
Product: Release Engineering → Infrastructure & Operations

sfraser: can we close this bug? Should we resolve and file elsewhere if we need more worker coverage?

Flags: needinfo?(sfraser)

I think we should close it, yes - we've delayed long enough that the architecture will change before we make progress.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(sfraser)
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard