jenkins1.dmz.phx1 nagios alerts on memcached

RESOLVED FIXED

Status

Opened 6 years ago
Updated 4 years ago

People

(Reporter: ericz, Assigned: ericz)

(Assignee)

Description

6 years ago
(Not sure if this is webops or dev services)

Got this alert

 < nagios-phx1> | Sun 10:01:40 PST [199] jenkins1.dmz.phx1.mozilla.com:memcached is CRITICAL: Unable to set memcached nagios (http://m.allizom.org/memcached)

Using the documentation, I checked on memcached and it seemed to be running fine.  The alert wouldn't clear, though, so I restarted it.  The alert still wouldn't clear.  45 minutes later, it cleared on its own for no obvious reason.  This alert has flapped four times in the last week.  Can we get it to not do that, or at least explain better in the documentation what can be done about it?
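
For reference, the alert text suggests the check tries to set a key named "nagios" in memcached. The real check's implementation is not shown in this bug, so the following is only a minimal sketch of such a probe over the memcached text protocol, with a few retries so that a single failed attempt does not count as a failure. The function names, the 60 second TTL, the retry count, and the host and port defaults are all assumptions made for illustration.

```python
import socket
import time


def build_set_command(key: str, value: bytes, ttl: int = 60) -> bytes:
    """Build a memcached text-protocol `set` request (flags fixed at 0)."""
    header = f"set {key} 0 {ttl} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def is_stored(reply: bytes) -> bool:
    """memcached answers a successful `set` with the line STORED."""
    return reply.strip() == b"STORED"


def probe_once(host: str, port: int, timeout: float = 2.0) -> bool:
    """One set attempt; any socket error or timeout counts as a failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(build_set_command("nagios", b"ok"))
            return is_stored(conn.recv(64))
    except OSError:
        return False


def probe(host="localhost", port=11211, attempts=3, delay=1.0, once=probe_once):
    """Report failure only if every attempt fails (hypothetical retry policy)."""
    for attempt in range(attempts):
        if once(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `probe()` on the host itself would show whether a plain set against localhost:11211 succeeds while the alert is firing. If it does, the problem is more likely in the path the check takes than in memcached itself.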
(Assignee)

Comment 1

6 years ago
Supporting info from when it alerted before I restarted memcached:

[eziegenhorn@jenkins1.dmz.phx1 ~]$ service memcached status
memcached (pid 20310) is running...

[eziegenhorn@jenkins1.dmz.phx1 ~]$ ps aux | grep memcached
nobody   20310  0.0  0.0 360368 19232 ?        Ssl   2012   8:32 memcached -d -p 11211 -u nobody -m 256 -c 10024 -P /var/run/memcached/memcached.pid
1892     29008  0.0  0.0 103244   860 pts/0    S+   10:04   0:00 grep memcached

[eziegenhorn@jenkins1.dmz.phx1 ~]$ memcached-tool localhost:11211 stats
#localhost:11211   Field       Value
         accepting_conns           1
               auth_cmds           0
             auth_errors           0
                   bytes      303872
              bytes_read   308246457
           bytes_written   291362392
              cas_badval           0
                cas_hits      165948
              cas_misses           0
               cmd_flush       24837
                 cmd_get      662191
                 cmd_set      541256
               cmd_touch           0
             conn_yields           0
   connection_structures          79
        curr_connections          10
              curr_items         607
               decr_hits           0
             decr_misses           0
             delete_hits       30125
           delete_misses      175694
       evicted_unfetched           0
               evictions           0
       expired_unfetched       27706
                get_hits      455285
              get_misses      206906
              hash_bytes      524288
       hash_is_expanding           0
        hash_power_level          16
               incr_hits        9841
             incr_misses        5299
                libevent 1.4.13-stable
          limit_maxbytes   268435456
     listen_disabled_num           0
                     pid       20310
            pointer_size          64
               reclaimed       99203
            reserved_fds          20
           rusage_system  286.289477
             rusage_user  226.680539
                 threads           4
                    time  1359914716
       total_connections      159917
             total_items      551097
              touch_hits           0
            touch_misses           0
                  uptime    13596805
                 version      1.4.14
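
A few of these counters are enough to sanity-check the server. The sketch below parses the two-column `memcached-tool ... stats` output and derives the hit ratio, the fraction of the memory limit in use, and the eviction and `listen_disabled_num` counts. The helper names and the choice of which counters to look at are assumptions, not part of any existing tooling; the sample values are copied from the output above.

```python
SAMPLE = """\
#localhost:11211   Field       Value
                   bytes      303872
                 cmd_get      662191
               evictions           0
                get_hits      455285
                libevent 1.4.13-stable
          limit_maxbytes   268435456
     listen_disabled_num           0
"""


def parse_stats(text):
    """Turn 'name value' lines into a dict, converting numbers where possible."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, shell prompts, and anything that is not a name/value pair.
        if len(fields) != 2 or fields[0].startswith("#"):
            continue
        name, raw = fields
        for cast in (int, float):
            try:
                stats[name] = cast(raw)
                break
            except ValueError:
                continue
        else:
            stats[name] = raw  # non-numeric, e.g. the libevent version string
    return stats


def summarize(stats):
    """Derive a few health indicators from the raw counters."""
    return {
        "hit_ratio": stats["get_hits"] / stats["cmd_get"],
        "memory_used_fraction": stats["bytes"] / stats["limit_maxbytes"],
        "evictions": stats["evictions"],
        "listen_disabled_num": stats["listen_disabled_num"],
    }


summary = summarize(parse_stats(SAMPLE))
print(summary)
```

On these figures the hit ratio is roughly 69%, memory use is a small fraction of the 256 MB limit, and there are no evictions, which is consistent with memcached looking healthy while the alert was firing.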
(Assignee)

Comment 2

6 years ago
It alerted again.
(Assignee)

Comment 3

6 years ago
The stats this time:

[eziegenhorn@jenkins1.dmz.phx1 ~]$ memcached-tool localhost:11211 stats
#localhost:11211   Field       Value
         accepting_conns           1
               auth_cmds           0
             auth_errors           0
                   bytes       15346
              bytes_read      320191
           bytes_written      299773
              cas_badval           0
                cas_hits         163
              cas_misses           0
               cmd_flush           9
                 cmd_get         627
                 cmd_set         511
               cmd_touch           0
             conn_yields           0
   connection_structures          68
        curr_connections          10
              curr_items          65
               decr_hits           0
             decr_misses           0
             delete_hits          26
           delete_misses         180
       evicted_unfetched           0
               evictions           0
       expired_unfetched           6
                get_hits         430
              get_misses         197
              hash_bytes      524288
       hash_is_expanding           0
        hash_power_level          16
               incr_hits          13
             incr_misses           7
                libevent 1.4.13-stable
          limit_maxbytes   268435456
     listen_disabled_num           0
                     pid       29825
            pointer_size          64
               reclaimed          28
            reserved_fds          20
           rusage_system    1.391788
             rusage_user    1.411785
                 threads           4
                    time  1359928521
       total_connections         115
             total_items         524
              touch_hits           0
            touch_misses           0
                  uptime       13740
                 version      1.4.14
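
Comparing this snapshot with the one in comment 1 confirms the restart: the pid changed, and `time - uptime` gives the moment each process started. A minimal sketch of that comparison follows; the helper names and the two second tolerance are assumptions, and the values are copied from the two stats dumps.

```python
# Fields copied from the stats output in comment 1 and comment 3.
before = {"pid": 20310, "time": 1359914716, "uptime": 13596805}
after = {"pid": 29825, "time": 1359928521, "uptime": 13740}


def started_at(stats):
    """Unix timestamp at which the memcached process started."""
    return stats["time"] - stats["uptime"]


def was_restarted(old, new, tolerance=2):
    """A different pid or a later start time means a new process."""
    return new["pid"] != old["pid"] or started_at(new) - started_at(old) > tolerance


print(was_restarted(before, after))
# Seconds between the first snapshot and the start of the new process.
print(started_at(after) - before["time"])
```

The new process started about 65 seconds after the first snapshot was taken, and the counters in the second dump cover only the 13740 seconds since then, so the two sets of stats are not directly comparable.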
(Assignee)

Comment 4

6 years ago
It resolved itself again (without a restart), but much more quickly this time.
Assignee: server-ops-webops → bburton
Blocks: 803599
(Assignee)

Comment 5

6 years ago
This alerted once last night and once this morning.  Both times it looked fine and recovered on its own.  I've not seen this alert be useful yet.
Per IRC, let's not have this page oncall, but just show up in IRC for now.
Assignee: bburton → server-ops
Component: Server Operations: Web Operations → Server Operations
QA Contact: nmaul → shyam
(Assignee)

Comment 7

6 years ago
This alerted again but as per comment 6, it shows in IRC, and doesn't page.  I'm going to note in the documentation that this alert usually clears on its own and consider this closed.
Assignee: server-ops → eziegenhorn
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Assignee)

Comment 8

6 years ago
This actually paged me just now so I'll investigate the nagios config.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 9

6 years ago
Changed to #sysalertsonly.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard