Closed Bug 1387191 Opened 7 years ago Closed 5 years ago

Alerts for low CPU credits on t2 instances

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: sfraser)

Details

This is to make sure we aren't using up all our CPU credits when running the signing workers.
Component: General → Buildduty
Priority: -- → P2
QA Contact: catlee
This sounds like something we should look into soon. Bumping priority.
Priority: P2 → P1
Depends on: 1394130
Blocks: 1395001
Assignee: nobody → sfraser
Short version:

So our options for this are:
1. Each signing worker creates its own alarm, which will email us. If we make the signing workers an autoscaling group, this alarm could also add a new instance.
2. Nagios polls this data, and when we make a new signing worker, we manually add it to nagios. Alarms can be sent by email or to IRC.

Both options will need an IAM user with the right permissions - one on the instance, the other in nagios. The permissions don't appear particularly open to abuse: ec2:DescribeInstanceStatus, ec2:DescribeInstances, ec2:DescribeInstanceRecoveryAttribute, ec2:RecoverInstances.
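For reference, a minimal sketch of an IAM policy granting those permissions might look like this (the account-wide "Resource": "*" scoping is an assumption, not something specified in this bug):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceRecoveryAttribute",
        "ec2:RecoverInstances"
      ],
      "Resource": "*"
    }
  ]
}
```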

Is email a sufficient notification for our needs? CloudWatch's alarms only do email, instance actions, or autoscaling actions.
Will the signing workers be an autoscaling group in the foreseeable future?

Rationale:
CloudWatch alarms are aimed at either 'ad hoc, manual creation' or 'completely scripted'. We can't say 'Do this for all instances that match this instance name', or even 'all instances'. AWS's model for this seems to be 'add a startup script to your instance template to create the alarm'. We could do that for the signingworker tools and set it to run on startup/install.

The API reference seems to indicate that the same endpoint both creates and updates an alarm, so re-running the alarm creation should be OK:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html

boto3 does most of the heavy lifting:
https://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.put_metric_alarm
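A minimal boto3 sketch of that idempotent alarm creation (the helper names, topic ARN, and period/threshold values here are illustrative, not from this bug):

```python
def alarm_params(instance_id, hostname, topic_arn, threshold=5.0):
    """Build the kwargs for put_metric_alarm. The same call creates
    or updates an alarm of the same name, so re-running is safe."""
    return {
        "AlarmName": f"{hostname}_CPUCreditBalance",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": threshold,
        "AlarmActions": [topic_arn],  # e.g. an SNS topic ARN
    }


def put_alarm(instance_id, hostname, topic_arn):
    # boto3 picks up credentials from the environment, so this is
    # compatible with running under cron/Puppet with keys exported.
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**alarm_params(instance_id, hostname, topic_arn))
```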

The same API lets us read the data that's being collected, so we could write a nagios plugin easily enough. This is more work for us if the workers ever start autoscaling or being dynamically deployed.
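As a sketch of that nagios-plugin idea, using get_metric_statistics to read the metric back (the thresholds, function names, and exit-code mapping are assumptions, not anything specified here):

```python
import datetime

WARN, CRIT = 20.0, 5.0  # illustrative thresholds, not from this bug


def nagios_status(balance, warn=WARN, crit=CRIT):
    """Map a CPUCreditBalance reading to a nagios (exit_code, message)."""
    if balance < crit:
        return 2, f"CRITICAL: CPUCreditBalance {balance:.1f} < {crit}"
    if balance < warn:
        return 1, f"WARNING: CPUCreditBalance {balance:.1f} < {warn}"
    return 0, f"OK: CPUCreditBalance {balance:.1f}"


def fetch_balance(instance_id):
    """Fetch the most recent CPUCreditBalance datapoint for an instance."""
    import boto3
    cw = boto3.client("cloudwatch")
    now = datetime.datetime.utcnow()
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - datetime.timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None
```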
Flags: needinfo?(aki)
I think email works.
New signing scriptworker instances require new gpg pubkeys in puppet hiera, as well as the initial puppet setup, so I don't think autoscaling works unless we change those.
Flags: needinfo?(aki)
IIRC we also have some SNS notifications which are relayed into IRC by relengbot.
If this sounds like a good plan, I'll add the script to Puppet:

1. We make a new IAM user with an access key, with cloudwatch:PutMetricAlarm permissions
2. We put the access key and secret in Puppet's secrets stash
3. We make a new topic for SNS notification
4. Puppet copies a script to the signing workers, using it as a template to populate the notification destination
5. Periodically, Puppet runs the script with the access keys in its environment, which creates/updates the cloudwatch alarm

If we re-run the script it doesn't break; it just updates the alarm with that existing name. This'll be useful when we change the notification settings.

One gotcha is that we store the useful hostnames in Tags, which are a bit of a faff to get at from the instance itself. So I propose that, since Puppet knows the nice hostnames (e.g. signing-linux-1), it passes that value along with the access keys in the environment. This lets us name the alarm with a hostname that we'll recognise, for example signing_linux_1_CPUCreditBalance.
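A sketch of how the templated script could assemble the alarm name and credentials from the environment Puppet provides (all the environment variable names here are hypothetical):

```python
import os


def config_from_env(environ=None):
    """Read the hostname and AWS keys Puppet exports into the
    environment, and derive the alarm name from the hostname."""
    env = environ if environ is not None else os.environ
    hostname = env["SIGNING_HOSTNAME"]  # e.g. "signing-linux-1"
    # CloudWatch alarm names can contain hyphens; underscores are
    # used here only to match the naming example above.
    alarm_name = hostname.replace("-", "_") + "_CPUCreditBalance"
    return {
        "alarm_name": alarm_name,
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "sns_topic": env["SNS_TOPIC_ARN"],
    }
```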
Flags: needinfo?(catlee)
+1

do you know how to get alerts from the SNS topics?
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #6)
> +1
> 
> do you know how to get alerts from the SNS topics?

I think they just function as email lists, don't they? There's an identifier for each topic that you get from the SNS tools.
Depends on: 1402371
signing-linux-{1..12} now have alarms that notify release+cloudwatch@ when their CPUCreditBalance falls below 5.0. Setup is not automatic yet, due to the blocking bug.
Bulk change of QA Contact to :jlund, per https://bugzilla.mozilla.org/show_bug.cgi?id=1428483
QA Contact: catlee → jlund
Hello, any updates on this bug?
Can it be Fixed?
Product: Release Engineering → Infrastructure & Operations

sfraser: can we close this bug? Should we resolve and file elsewhere if we need more worker coverage?

Flags: needinfo?(sfraser)

I think we should close it, yes - we've delayed long enough that the architecture will change before we make progress.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(sfraser)
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard