CloudAMQP CPU alarm for treeherder-stage

RESOLVED FIXED

Status

Product: Tree Management
Component: Treeherder: Infrastructure
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 9 months ago
Modified: 6 months ago

People

(Reporter: emorley, Assigned: emorley)

Description

9 months ago
I've had 5+ alerts over the last few days.

The first was at 3am UTC on 4th Feb:

"""
You are getting this email because you have enabled CPU alarms in the management console.

Your server black-lark-01 has used more than 90% of the available CPU power for at least 600 minutes.

User time: 19
    User time is CPU time spent running user application, in this case RabbitMQ.
    If this is high it probably means you are on the limit of what your server can handle. You should consider upgrading before lack of CPU power becomes a serious issue. 
System time: 3
    System time is CPU time spent running OS tasks. If this is high please please contact us at support@cloudamqp.com. 
I/O wait time 0
    I/O wait is time the CPU is waiting for disk read/writes.
    If this is high you should consider publishing more messages without the persistent flag. Respond to this email for more information. 
Steal time: 79
    Steal time is CPU time "stolen" by the virtualization system.
    If this is high it means you are using to much CPU power. This is seriously impacting the performance of your server. You should probably upgrade to a bigger instance. Respond to this email for more information. 
"""
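
To make the breakdown in that alert concrete, here is a minimal sketch of how user, system, I/O wait and steal percentages can be derived on a Linux host from two samples of the aggregate `cpu` line in `/proc/stat`. This is not how CloudAMQP computes its alarm (that is not documented here); the helper names `parse_cpu_line` and `cpu_percentages` are hypothetical.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse an aggregate 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) != len(FIELDS):
        raise ValueError("cpu line has too few fields")
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Return the share of CPU time per category between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Taking `read_cpu_sample()` twice, a few seconds apart, and passing both results to `cpu_percentages` gives figures comparable to those in the email. A high `steal` share with low `user` and `system`, as above, points at the hypervisor withholding CPU rather than at RabbitMQ itself being busy.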

Comment 1

9 months ago
So whilst the email alerts started at 3am UTC 4th Feb, looking at the graphs, the CPU steal started increasing at 6pm UTC 3rd Feb.

The rabbitmq-server and erlang versions were updated at 2am UTC 3rd Feb.

At first glance it would seem the upgrade was responsible, however:
* it was 16 hours prior, which doesn't seem close enough
* prototype's CloudAMQP instance was updated days before and is working fine

I've restarted the whole CloudAMQP instance for stage, but the problem is still occurring.

I can only think that perhaps:
* on stage we were already only just below the CPU usage at which steal begins (prototype being quieter)
* load under the newer rabbitmq-server/erlang is very slightly higher
* it was only when the working day began (and so number of jobs increased) that we tipped over the threshold for the first time, hence the 16 hour delay

That said, the CloudAMQP docs say we should be able to handle 10k msgs/s burst, 1k msgs/s sustained, using the Big Bunny plan:
https://www.cloudamqp.com/plans.html

...and at least right now we're only at ~200 msgs/s (though perhaps busier later in the day).

First thing for us to try will be to follow the optimisation advice here:
https://www.cloudamqp.com/docs/celery.html

...which is bug 1215102.
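
For context, a minimal sketch of the kind of Celery settings that guide is understood to recommend for reducing broker load. The setting names are the Celery 3.x style ones and the values are assumptions based on the CloudAMQP guide as recalled, not the actual change made in bug 1215102; `WORKER_FLAGS` is a hypothetical name used here only to group the worker command-line options.

```python
# Celery 3.x style settings (assumed values, check against the linked guide).

# Keep at most one pooled broker connection per process.
BROKER_POOL_LIMIT = 1

# Disable AMQP heartbeats from the client side.
BROKER_HEARTBEAT = None

# Give up on a broker connection attempt after 30 seconds.
BROKER_CONNECTION_TIMEOUT = 30

# Do not store task results, which avoids per-task result queues.
CELERY_RESULT_BACKEND = None

# Do not emit task events unless something is consuming them.
CELERY_SEND_EVENTS = False

# Expire unused event queues after 60 seconds.
CELERY_EVENT_QUEUE_EXPIRES = 60

# Worker command-line options that cut down inter-worker chatter via the broker.
WORKER_FLAGS = ["--without-gossip", "--without-mingle", "--without-heartbeat"]
```

Each of these trades some monitoring or coordination feature for fewer connections, queues and messages on the broker, which is the relevant direction if RabbitMQ CPU is the bottleneck.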
Depends on: 1215102

Updated

9 months ago
Blocks: 1334084

Comment 2

9 months ago
There have been multiple alerts a day for this since originally filing. Today seems particularly bad: the store_pulse_jobs queue on stage now has a backlog of 40000 jobs. Am prioritising bug 1215102.

Updated

6 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → FIXED