Now that we're going to be using OrangeFactor's API to proxy the Elasticsearch submissions (in bug 1235097 / bug 1153324), it's even more important that the API has high uptime.

So as annoying as it is to have to spend more time on brasstacks devops work (given it will be EOL soon), I think we need to use some kind of process monitoring after all.

Ideally this would:
1) Start the process if it has died
2) Restart the process if it has hung for whatever reason (as has occurred in recent "API down" bugs)

After a bit of research it seems like monit (https://mmonit.com/) might be a good fit here:
a) It can re-use the existing /etc/init.d/orangefactor script, so I don't need to touch the existing config (given it's not puppet managed)
b) It also supports restarting a process if it consumes too much RAM

Which brings me onto the next point...

I'm pretty sure the reason that the OrangeFactor API goes down periodically is that it's leaking memory. E.g. RSS was at 2.35GB just now (the process had been running for 2 weeks), but after a restart (and hitting the API a few times by opening the UI in the browser, so it's not completely cold) it's down to 163MB.

[email@example.com ~]$ ps uU webtools
USER       PID %CPU %MEM     VSZ     RSS TTY      STAT START   TIME COMMAND
webtools 24601  0.7 62.8 7959600 2467172 ?        Ssl  Jan21 163:11 /home/webtools/apps/orangefactor/bin/python /

[firstname.lastname@example.org ~]$ sudo service orangefactor stop; sudo service orangefactor start
stopping orangefactor                                      [FAILED]
starting orangefactorspawn-fcgi: child spawned successfully: PID: 5147
                                                           [  OK  ]

...

[email@example.com ~]$ ps uU webtools
USER       PID %CPU %MEM     VSZ     RSS TTY      STAT START   TIME COMMAND
webtools  5597  2.8  4.2  745132  166844 ?        Ssl  14:42   0:28 /home/webtools/apps/orangefactor/bin/python /
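For anyone wanting to repeat the before/after comparison, here is a minimal sketch of turning the RSS column from ps (reported in KiB) into MB. The helper name rss_kb_to_mb is made up for illustration, and the two input values are simply copied from the ps listings above.

```shell
#!/bin/sh
# Hypothetical helper: convert an RSS value in KiB (as printed by ps)
# into whole MiB, rounded to the nearest integer.
rss_kb_to_mb() {
    awk -v kb="$1" 'BEGIN { printf "%.0f\n", kb / 1024 }'
}

# RSS values copied from the two ps listings in this comment.
echo "before restart: $(rss_kb_to_mb 2467172) MB"
echo "after restart:  $(rss_kb_to_mb 166844) MB"

# For a live process, the RSS column alone can be read with something like:
#   ps -o rss= -p <pid>
```

The first value works out to roughly 2409 MB (about 2.35GB) and the second to 163 MB, which matches the figures quoted above.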
Monit installed and configured to start at boot:

[firstname.lastname@example.org ~]$ sudo yum install monit
Loaded plugins: rhnplugin, security
This system is receiving updates from RHN Classic or RHN Satellite.
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package monit.x86_64 0:5.14-1.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================================================================
 Package              Arch                Version                   Repository              Size
==============================================================================================================
Installing:
 monit                x86_64              5.14-1.el6                epel                   261 k

...

Installed:
  monit.x86_64 0:5.14-1.el6

Complete!

[email@example.com ~]$ sudo vi /etc/monit.conf
<set a non-default password for the HTTP interface. It only accepts connections from localhost>

[firstname.lastname@example.org ~]$ sudo service monit start
Starting monit:                                            [  OK  ]

[email@example.com ~]$ sudo chkconfig monit on

[firstname.lastname@example.org ~]$ sudo monit status
The Monit daemon 5.14 uptime: 0m

System 'brasstacks1.dmz.scl3.mozilla.com'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.00] [0.03] [0.00]
  cpu                               0.0%us 0.0%sy 0.0%wa
  memory usage                      1.9 GB [49.8%]
  swap usage                        14.5 MB [0.7%]
  data collected                    Thu, 11 Feb 2016 16:28:13
I've made some further modifications to /etc/monit.conf:
* set the mail server as smtp.mozilla.org
* set emorley@moco as the default alert recipient
* enabled the eventqueue, which queues alert messages if they cannot be sent straight away

And created a config file for monitoring orangefactor:

[email@example.com ~]# cat /etc/monit.d/orangefactor
check process orangefactor with pidfile /var/run/orangefactor/orangefactor.pid
  start program = "/etc/init.d/orangefactor start"
  stop program = "/etc/init.d/orangefactor stop"
  # Each cycle is 30 seconds. The restart action restarts the service and alerts.
  if cpu usage > 95% for 10 cycles then restart
  if mem usage > 80% for 10 cycles then restart
  alert emorley@SNIP

This will automatically restart the orangefactor process if it's either not running or exceeds those CPU/memory thresholds. (The checks run every 30 seconds by default). If monit has to restart the process it will email me.

Status can also be found by running:

[firstname.lastname@example.org ~]# monit status
The Monit daemon 5.14 uptime: 1h 2m

Process 'orangefactor'
  status                            Running
  monitoring status                 Monitored
  pid                               701
  parent pid                        1
  uid                               2349
  effective uid                     2349
  gid                               2349
  uptime                            3m
  children                          0
  memory                            19.4 MB
  memory total                      19.4 MB
  memory percent                    0.5%
  memory percent total              0.5%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Thu, 11 Feb 2016 17:29:37

System 'brasstacks1.dmz.scl3.mozilla.com'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.00] [0.02] [0.04]
  cpu                               0.4%us 0.1%sy 0.0%wa
  memory usage                      139.6 MB [3.6%]
  swap usage                        14.5 MB [0.7%]
  data collected                    Thu, 11 Feb 2016 17:29:37

We can always add more checks later (eg HTTP request to https://brasstacks.mozilla.com/orangefactor/api/ , though it 404s even when the API is up, so will need to tweak the check), but for now this should be sufficient :-)
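As a rough illustration of what the "for 10 cycles" rules mean in practice: with 30-second cycles, the process has to stay over a threshold for about 5 minutes before the restart fires, and (as I understand monit's semantics) a single cycle back under the limit resets the count. The sketch below only models that logic; should_restart is a made-up function, not part of monit, and the sample readings are hypothetical.

```shell
#!/bin/sh
# should_restart THRESHOLD CYCLES SAMPLE...
# Hypothetical model of an "if <metric> > THRESHOLD for CYCLES cycles" rule.
# Succeeds (exit 0) once SAMPLE exceeds THRESHOLD for CYCLES samples in a row.
# Samples are whole percentages, one per monitoring cycle.
should_restart() {
    threshold=$1
    needed=$2
    shift 2
    streak=0
    for sample in "$@"; do
        if [ "$sample" -gt "$threshold" ]; then
            streak=$((streak + 1))
        else
            streak=0   # one cycle at or under the limit resets the count
        fi
        if [ "$streak" -ge "$needed" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical memory readings, one per 30-second cycle.
if should_restart 80 10 81 82 83 84 85 86 87 88 89 90; then
    echo "restart after $((10 * 30)) seconds over the threshold"
fi
```

A short spike (say, three cycles over 80% followed by a drop) would not trigger a restart under this model, which is the point of requiring several cycles rather than acting on a single reading.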
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
As a quick update: about every week or two I get an email notification from monit saying memory usage was as high as 80-90% and that it has restarted the service. It seems to be working well at preventing the problems we were seeing before :-)