Closed Bug 837521 Opened 12 years ago Closed 12 years ago

jenkins1.dmz.phx1 nagios alerts on memcached

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ericz, Assigned: ericz)

References

Details

(Not sure if this is webops or dev services.) Got this alert:
<nagios-phx1> | Sun 10:01:40 PST [199] jenkins1.dmz.phx1.mozilla.com:memcached is CRITICAL: Unable to set memcached nagios (http://m.allizom.org/memcached)
Following the documentation, I checked on memcached and it seemed to be running fine. The alert wouldn't clear, though, so I restarted it. The alert still wouldn't clear. 45 minutes later, it cleared on its own for no obvious reason. This alert has flapped four times in the last week. Can we get it to stop flapping, or at least have the documentation explain better what can be done about it?
Supporting info from when it alerted before I restarted memcached:
[eziegenhorn@jenkins1.dmz.phx1 ~]$ service memcached status
memcached (pid 20310) is running...
[eziegenhorn@jenkins1.dmz.phx1 ~]$ ps aux | grep memcached
nobody 20310 0.0 0.0 360368 19232 ? Ssl 2012 8:32 memcached -d -p 11211 -u nobody -m 256 -c 10024 -P /var/run/memcached/memcached.pid
1892 29008 0.0 0.0 103244 860 pts/0 S+ 10:04 0:00 grep memcached
[eziegenhorn@jenkins1.dmz.phx1 ~]$ memcached-tool localhost:11211 stats
#localhost:11211 Field Value
accepting_conns 1
auth_cmds 0
auth_errors 0
bytes 303872
bytes_read 308246457
bytes_written 291362392
cas_badval 0
cas_hits 165948
cas_misses 0
cmd_flush 24837
cmd_get 662191
cmd_set 541256
cmd_touch 0
conn_yields 0
connection_structures 79
curr_connections 10
curr_items 607
decr_hits 0
decr_misses 0
delete_hits 30125
delete_misses 175694
evicted_unfetched 0
evictions 0
expired_unfetched 27706
get_hits 455285
get_misses 206906
hash_bytes 524288
hash_is_expanding 0
hash_power_level 16
incr_hits 9841
incr_misses 5299
libevent 1.4.13-stable
limit_maxbytes 268435456
listen_disabled_num 0
pid 20310
pointer_size 64
reclaimed 99203
reserved_fds 20
rusage_system 286.289477
rusage_user 226.680539
threads 4
time 1359914716
total_connections 159917
total_items 551097
touch_hits 0
touch_misses 0
uptime 13596805
version 1.4.14
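For readers unfamiliar with this kind of check: the alert text ("Unable to set memcached nagios") suggests the Nagios plugin tries to store a test key and goes CRITICAL when that fails. The actual plugin is not shown in this bug, so the following is only a minimal sketch under that assumption. The key name `nagios`, the 60-second TTL, the 5-second timeout, and all function names are hypothetical; the `set` command format and the `STORED` reply come from the memcached text protocol, and 0/2 are the standard Nagios plugin exit codes for OK and CRITICAL.

```python
import socket

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def build_set_command(key: str, value: bytes, ttl: int = 60) -> bytes:
    """Build a memcached text-protocol command:
    set <key> <flags> <exptime> <bytes>\\r\\n<data>\\r\\n
    """
    header = f"set {key} 0 {ttl} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def classify_reply(reply: bytes):
    """Map the server's reply to a (status, message) pair."""
    if reply.startswith(b"STORED"):
        return OK, "memcached OK: test key stored"
    return CRITICAL, f"Unable to set test key (reply: {reply!r})"


def check_memcached(host: str = "localhost", port: int = 11211,
                    timeout: float = 5.0):
    """Try to store a test key; connection errors and timeouts are CRITICAL."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(build_set_command("nagios", b"ping"))
            reply = conn.recv(1024)
    except OSError as exc:  # includes refused connections and timeouts
        return CRITICAL, f"Unable to reach memcached: {exc}"
    return classify_reply(reply)
```

A check shaped like this can fail even while `service memcached status` reports the daemon as running, for example if the connection from the Nagios host times out. Running `check_memcached()` both locally and from the monitoring host would be one way to tell those cases apart.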
It alerted again.
The stats this time:
[eziegenhorn@jenkins1.dmz.phx1 ~]$ memcached-tool localhost:11211 stats
#localhost:11211 Field Value
accepting_conns 1
auth_cmds 0
auth_errors 0
bytes 15346
bytes_read 320191
bytes_written 299773
cas_badval 0
cas_hits 163
cas_misses 0
cmd_flush 9
cmd_get 627
cmd_set 511
cmd_touch 0
conn_yields 0
connection_structures 68
curr_connections 10
curr_items 65
decr_hits 0
decr_misses 0
delete_hits 26
delete_misses 180
evicted_unfetched 0
evictions 0
expired_unfetched 6
get_hits 430
get_misses 197
hash_bytes 524288
hash_is_expanding 0
hash_power_level 16
incr_hits 13
incr_misses 7
libevent 1.4.13-stable
limit_maxbytes 268435456
listen_disabled_num 0
pid 29825
pointer_size 64
reclaimed 28
reserved_fds 20
rusage_system 1.391788
rusage_user 1.411785
threads 4
time 1359928521
total_connections 115
total_items 524
touch_hits 0
touch_misses 0
uptime 13740
version 1.4.14
It resolved itself again (without a restart), but much more quickly this time.
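When comparing two `memcached-tool ... stats` dumps like the ones above, the `pid`, `uptime`, and `time` fields are enough to tell whether the daemon was restarted in between. The helper below is an illustrative sketch, not part of any tooling mentioned in this bug. The function names are hypothetical, and the sample values are copied from the two dumps in this thread.

```python
def parse_stats(text: str) -> dict:
    """Parse 'name value' lines from memcached-tool stats output."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the '#host:port Field Value' header and anything malformed.
        if len(parts) == 2 and not line.startswith("#"):
            stats[parts[0]] = parts[1]
    return stats


def restarted_between(before: dict, after: dict) -> bool:
    """A new pid or a smaller uptime means the daemon was restarted."""
    return (after["pid"] != before["pid"]
            or int(after["uptime"]) < int(before["uptime"]))


def seconds_from_first_snapshot_to_start(before: dict, after: dict) -> int:
    """How long after the first snapshot the current process started."""
    elapsed = int(after["time"]) - int(before["time"])
    return elapsed - int(after["uptime"])


first = parse_stats("pid 20310\ntime 1359914716\nuptime 13596805\n")
second = parse_stats("pid 29825\ntime 1359928521\nuptime 13740\n")
```

With these values, `restarted_between(first, second)` is true and `seconds_from_first_snapshot_to_start(first, second)` is 65, which is consistent with the manual restart made shortly after the first dump. It also suggests memcached was not restarted again before the second alert.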
Assignee: server-ops-webops → bburton
Blocks: 803599
This alerted once last night and once this morning. Both times it looked fine and recovered on its own. I've not seen this alert be useful yet.
Per IRC, let's not have this page oncall, but just show up in IRC for now.
Assignee: bburton → server-ops
Component: Server Operations: Web Operations → Server Operations
QA Contact: nmaul → shyam
This alerted again but as per comment 6, it shows in IRC, and doesn't page. I'm going to note in the documentation that this alert usually clears on its own and consider this closed.
Assignee: server-ops → eziegenhorn
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This actually paged me just now so I'll investigate the nagios config.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Changed to #sysalertsonly.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard