Use monit to improve the OrangeFactor API uptime

Status: RESOLVED FIXED
Product: Tree Management
Component: OrangeFactor
Priority: P2
Severity: normal
Reporter: emorley
Assigned: emorley

Now that we're going to be using OrangeFactor's API to proxy the Elasticsearch submissions (in bug 1235097 / bug 1153324), it's even more important that the API have a high uptime. So as annoying as it is to have to spend more time on brasstacks devops work (given it will be EOL soon), I think we need to use some kind of process monitoring after all.

Ideally this would:
1) Start the process if it died
2) If the process was hung for whatever reason (as has occurred in recent "API down" bugs), restart it
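
As a rough illustration of requirement 1, the kind of liveness test a process monitor performs against a pidfile can be sketched in shell. This is not monit's implementation; the function name `pid_alive` is hypothetical, and the sketch does not cover the hung-process case in requirement 2.

```shell
#!/bin/sh
# Illustration only, not monit's code.
# pid_alive PIDFILE -> exit status 0 if the pidfile names a running process.
pid_alive() {
    [ -r "$1" ] || return 1              # missing/unreadable pidfile: treat as not running
    pid=$(cat "$1")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;         # empty or non-numeric content
    esac
    # Signal 0 delivers nothing; it only tests that the PID exists.
    # The caller needs permission to signal the process (owner or root).
    kill -0 "$pid" 2>/dev/null
}
```

A wrapper could call `pid_alive` on the service's pidfile and run the init script's `start` action when it returns non-zero.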

After a bit of research it seems like monit (https://mmonit.com/) might be a good fit here:
a) It can re-use the existing /etc/init.d/orangefactor script, so I don't need to touch the existing config (given it's not puppet managed)
b) it also supports restarting a process if it consumes too much RAM

Which brings me onto the next point...

I'm pretty sure the reason that the OrangeFactor API goes down periodically is that it's leaking memory.

For example, RSS was at 2.35GB just now (the process had been running for 2 weeks), but after a restart (and hitting the API a few times by opening the UI in the browser, so it's not completely cold) it's down to 163MB.

[emorley@brasstacks1.dmz.scl3 ~]$ ps uU webtools
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
webtools 24601  0.7 62.8 7959600 2467172 ?     Ssl  Jan21 163:11 /home/webtools/apps/orangefactor/bin/python /

[emorley@brasstacks1.dmz.scl3 ~]$ sudo service orangefactor stop; sudo service orangefactor start
stopping orangefactor                                      [FAILED]
starting orangefactorspawn-fcgi: child spawned successfully: PID: 5147
                                                           [  OK  ]
...

[emorley@brasstacks1.dmz.scl3 ~]$ ps uU webtools
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
webtools  5597  2.8  4.2 745132 166844 ?       Ssl  14:42   0:28 /home/webtools/apps/orangefactor/bin/python /
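
For reference, the RSS column in the ps output above is in KiB, which is where the 2.35GB and 163MB figures come from. A minimal sketch of the conversion (the helper name `rss_kb_to_mb` is hypothetical, and it rounds to the nearest MiB):

```shell
#!/bin/sh
# Hypothetical helper: convert a ps RSS value (KiB) to MiB, rounded to nearest.
rss_kb_to_mb() {
    echo $(( ($1 + 512) / 1024 ))
}

rss_kb_to_mb 2467172   # before the restart: 2409 MiB, i.e. about 2.35 GiB
rss_kb_to_mb 166844    # after the restart: 163 MiB
```

On a live system the input could come from `ps -o rss= -p <PID>`, which prints just the RSS value for one process.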

Monit installed and configured to start at boot:

[emorley@brasstacks1.dmz.scl3 ~]$ sudo yum install monit
Loaded plugins: rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package monit.x86_64 0:5.14-1.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================================================================
 Package                 Arch                     Version                        Repository              Size
==============================================================================================================
Installing:
 monit                   x86_64                   5.14-1.el6                     epel                   261 k

...

Installed:
  monit.x86_64 0:5.14-1.el6


Complete!

[emorley@brasstacks1.dmz.scl3 ~]$ sudo vi /etc/monit.conf
<set a non-default password for the HTTP interface. It only accepts connections from localhost>

[emorley@brasstacks1.dmz.scl3 ~]$ sudo service monit start
Starting monit:                                            [  OK  ]

[emorley@brasstacks1.dmz.scl3 ~]$ sudo chkconfig monit on

[emorley@brasstacks1.dmz.scl3 ~]$ sudo monit status
The Monit daemon 5.14 uptime: 0m

System 'brasstacks1.dmz.scl3.mozilla.com'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.00] [0.03] [0.00]
  cpu                               0.0%us 0.0%sy 0.0%wa
  memory usage                      1.9 GB [49.8%]
  swap usage                        14.5 MB [0.7%]
  data collected                    Thu, 11 Feb 2016 16:28:13

I've made some further modifications to /etc/monit.conf:
* set the mail server as smtp.mozilla.org
* set emorley@moco as the default alert recipient
* enabled the eventqueue, which queues alert messages if they cannot be sent straight away

And created a config file for monitoring orangefactor:

[root@brasstacks1.dmz.scl3 ~]# cat /etc/monit.d/orangefactor
check process orangefactor with pidfile /var/run/orangefactor/orangefactor.pid
  start program = "/etc/init.d/orangefactor start"
  stop  program = "/etc/init.d/orangefactor stop"
  # Each cycle is 30 seconds. The restart action restarts the service and alerts.
  if cpu usage > 95% for 10 cycles then restart
  if mem usage > 80% for 10 cycles then restart
  alert emorley@SNIP

This will automatically restart the orangefactor process if it's not running, or if it stays above either the CPU or memory threshold for 10 consecutive cycles. (The checks run every 30 seconds by default, so that is roughly 5 minutes.)

If monit has to restart the process it will email me.
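
To make the "for 10 cycles" condition concrete, here is a small shell sketch of how such a rule behaves, assuming the cycles must be consecutive. It is an illustration only, not monit's code; the function name `first_trigger` and the sample values are hypothetical.

```shell
#!/bin/sh
# Illustration only: mimics "if <metric> > LIMIT for N cycles then restart".
# Usage: first_trigger LIMIT N sample1 sample2 ...
# Prints the 1-based cycle at which the rule would fire, or "none".
first_trigger() {
    limit=$1
    needed=$2
    shift 2
    streak=0
    cycle=0
    for sample in "$@"; do
        cycle=$((cycle + 1))
        if [ "$sample" -gt "$limit" ]; then
            streak=$((streak + 1))
        else
            streak=0                     # one sample at or under the limit resets the count
        fi
        if [ "$streak" -ge "$needed" ]; then
            echo "$cycle"
            return 0
        fi
    done
    echo "none"
}

# Memory percent sampled once per cycle, limit 80%, 3 cycles required:
first_trigger 80 3 85 90 70 81 82 83    # fires at cycle 6
```

With 30-second cycles and the 10-cycle setting used in the config, a brief spike does not cause a restart; the process has to stay over the limit for about 300 seconds.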

Status can also be found by running:

[root@brasstacks1.dmz.scl3 ~]# monit status
The Monit daemon 5.14 uptime: 1h 2m

Process 'orangefactor'
  status                            Running
  monitoring status                 Monitored
  pid                               701
  parent pid                        1
  uid                               2349
  effective uid                     2349
  gid                               2349
  uptime                            3m
  children                          0
  memory                            19.4 MB
  memory total                      19.4 MB
  memory percent                    0.5%
  memory percent total              0.5%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Thu, 11 Feb 2016 17:29:37

System 'brasstacks1.dmz.scl3.mozilla.com'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.00] [0.02] [0.04]
  cpu                               0.4%us 0.1%sy 0.0%wa
  memory usage                      139.6 MB [3.6%]
  swap usage                        14.5 MB [0.7%]
  data collected                    Thu, 11 Feb 2016 17:29:37


We can always add more checks later (eg HTTP request to https://brasstacks.mozilla.com/orangefactor/api/ , though it 404s even when the API is up, so will need to tweak the check), but for now this should be sufficient :-)
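
One way such an HTTP check could be tweaked is to treat any response from the API, including the 404, as "up" and only the absence of a response or a server-side error as "down". The sketch below is an assumption about what a reasonable check might look like, not something that was deployed; `status_means_up` and `probe` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: decide whether an HTTP status code means the API is answering.
status_means_up() {
    case "$1" in
        000|5??) return 1 ;;   # curl reports 000 when no response arrived; 5xx is a server error
        [1-4]??) return 0 ;;   # any other response counts as up, including the 404
        *)       return 1 ;;   # anything unexpected is treated as down
    esac
}

# Hypothetical wrapper: fetch only the status code with curl and classify it.
probe() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
    status_means_up "$code"
}
```

For example, `probe https://brasstacks.mozilla.com/orangefactor/api/` would succeed on the 404 described above and fail on a timeout.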
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

As a quick update: about every week or two I get an email notification from monit saying memory usage was as high as 80-90% and that it has restarted the service. It seems to be working well at preventing the problems we were seeing before :-)