Closed Bug 1120167 Opened 11 years ago Closed 10 years ago

Check received log volume on metrics-logger1.private.scl3.mozilla.com is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/zlb1.ops.pek1.mozilla.com is low at 49MB compared to running average 54MB

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dgarvey, Unassigned)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/33] [id=nagios1.private.scl3.mozilla.com:504538])

I had to do this a couple times the last couple of weeks.

[root@zlb1.ops.pek1 ~]# bash -x /usr/local/bin/rsync_files.sh
+ LOCKFILE=/tmp/rsync.lock
++ hostname
+ HOSTNAME=zlb1.ops.pek1.mozilla.com
+ DIR=/var/log/zeus
+ LOGSERVER=metrics-logger1.private.scl3.mozilla.com
+ create_lock /tmp/rsync.lock
+ LOCK_NAME=/tmp/rsync.lock
+ [[ -e /tmp/rsync.lock ]]
+ echo 17119
+ return 0
+ cd /var/log/zeus
+ find /var/log/zeus -type f -name '*.access*' -mmin +30 '!' -name '*gz' -print0
+ xargs -0 -r -P 2 -n 1 nice -n 10 gzip
+ find /var/log/zeus -type f -name '*.access*.gz' -print0
+ xargs -0 -r chown logpull
+ RESULTLOGDIR=/var/log/rsync
+ '[' '!' -d /var/log/rsync ']'
++ date +%Y-%m-%d-%H
+ RESULTLOG=/var/log/rsync/rsync-metrics-logger1.private.scl3.mozilla.com-2015-01-11-00
+ sudo -u logpull rsync -av '--include=*.gz' '--exclude=*' --remove-source-files -e 'ssh -o StrictHostKeyChecking=no' /var/log/zeus/ logpull@metrics-logger1.private.scl3.mozilla.com:/data/stats/logs/zlb1.ops.pek1.mozilla.com/
+ '[' 0 -ne 0 ']'
+ rm -f /tmp/rsync.lock
[root@zlb1.ops.pek1 ~]#

I don't think it is critical, but I have to document everything that I do.
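
For readers unfamiliar with the lock-file step visible in the trace (create_lock, the -e test, echo of a PID, and the final rm -f), here is a minimal sketch of how such a guard is commonly written. It is an assumption, not the actual contents of /usr/local/bin/rsync_files.sh: only the name create_lock and the lock path appear in the trace, the redirect of the PID into the lock file is inferred, and remove_lock is a hypothetical helper.

```shell
# Hypothetical reconstruction of a PID lock file guard.
# Only "create_lock" and the lock path are taken from the trace;
# everything else here is an assumption for illustration.

create_lock() {
    # $1 is the lock file path, e.g. /tmp/rsync.lock
    if [ -e "$1" ]; then
        # Another run (or a stale lock from a crashed run) holds the lock.
        return 1
    fi
    # Record our PID so a leftover lock can be traced to a process.
    echo "$$" > "$1"
    return 0
}

remove_lock() {
    # Hypothetical helper; the trace only shows a plain "rm -f".
    rm -f "$1"
}
```

One property of this pattern is that a run that dies before reaching the rm -f leaves the lock behind, and later runs then refuse to start until the file is removed by hand. Whether that happened on zlb1 is not established in this bug.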
Assignee: infra → server-ops-webops
Component: Infrastructure: Other → WebOps: Other
QA Contact: jdow → nmaul
Whiteboard: [id=nagios1.private.scl3.mozilla.com:504538] → [kanban:https://webops.kanbanize.com/ctrl_board/2/33] [id=nagios1.private.scl3.mozilla.com:504538]
I had to run it tonight.

[root@zlb1.ops.pek1 ~]# bash -x /usr/local/bin/rsync_files.sh
+ LOCKFILE=/tmp/rsync.lock
++ hostname
+ HOSTNAME=zlb1.ops.pek1.mozilla.com
+ DIR=/var/log/zeus
+ LOGSERVER=metrics-logger1.private.scl3.mozilla.com
+ create_lock /tmp/rsync.lock
+ LOCK_NAME=/tmp/rsync.lock
+ [[ -e /tmp/rsync.lock ]]
+ echo 13825
+ return 0
+ cd /var/log/zeus
+ find /var/log/zeus -type f -name '*.access*' -mmin +30 '!' -name '*gz' -print0
+ xargs -0 -r -P 2 -n 1 nice -n 10 gzip
+ find /var/log/zeus -type f -name '*.access*.gz' -print0
+ xargs -0 -r chown logpull
+ RESULTLOGDIR=/var/log/rsync
+ '[' '!' -d /var/log/rsync ']'
++ date +%Y-%m-%d-%H
+ RESULTLOG=/var/log/rsync/rsync-metrics-logger1.private.scl3.mozilla.com-2015-01-17-00
+ sudo -u logpull rsync -av '--include=*.gz' '--exclude=*' --remove-source-files -e 'ssh -o StrictHostKeyChecking=no' /var/log/zeus/ logpull@metrics-logger1.private.scl3.mozilla.com:/data/stats/logs/zlb1.ops.pek1.mozilla.com/
+ '[' 0 -ne 0 ']'
+ rm -f /tmp/rsync.lock
[root@zlb1.ops.pek1 ~]#

Tmary, can you take a look?
Size of logs (/data/stats/logs/zlb1.ops.pek1.mozilla.com) for past N days obtained from metrics-logger1:

(date, MB)
2015-01-09 39
2015-01-10 46
2015-01-11 61
2015-01-12 72
2015-01-13 64
2015-01-14 62
2015-01-15 56
2015-01-16 47
2015-01-17 49
2015-01-18 53
2015-01-19 61
2015-01-20 59
2015-01-21 58
2015-01-22 53
2015-01-23 53
2015-01-24 59
2015-01-25 57

David, it looks like your efforts have helped. This is not something the data team can debug any further, but you might want to check the regular cron jobs that are supposed to be doing this work and see if there are any errors in their logfiles. That might point to what's going wrong here.
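
To make the alert wording ("low at 49MB compared to running average 54MB") concrete, the following is a rough sketch of a check that compares one day's size against a running average. It is not the real Nagios plugin: the function names, the averaging window, and the minimum-percentage threshold are all assumptions, since the bug does not state how the check computes either value. The exit status 2 follows the usual Nagios plugin convention for CRITICAL.

```shell
# Hypothetical sketch of a "received log volume" check.
# The window and threshold used by the real check are not given in this bug.

running_average() {
    # Mean of the numbers on stdin (one per line), rounded down.
    awk '{ sum += $1; n++ } END { if (n) print int(sum / n) }'
}

check_volume() {
    # Usage: check_volume TODAY_MB AVG_MB MIN_PCT
    today=$1
    avg=$2
    min_pct=$3
    # Integer math: today is "low" when today/avg falls below min_pct percent.
    if [ $((today * 100)) -lt $((avg * min_pct)) ]; then
        echo "CRITICAL: Daily log size is low at ${today}MB compared to running average ${avg}MB"
        return 2
    fi
    echo "OK: Daily log size is ${today}MB (running average ${avg}MB)"
    return 0
}
```

As a usage example, the sizes for a chosen window from the list above can be piped into running_average, and the result passed to check_volume together with the latest day's size and an assumed threshold such as 90.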
Sun 00:21:24 PST [5784] metrics-logger1.private.scl3.mozilla.com:Check received log volume is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/zlb1.ops.pek1.mozilla.com is low at 49MB compared to running average 56MB

Verified logs and kicked off /usr/local/bin/rsync_files.sh.
I'm not seeing any further actionable work for Webops in this bug (PEK1 zeus is gone, one alert occurred last month, and none since). So I'm going to RESO FIXED this for now, but please reopen if we can help investigate further somehow.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Not sure what to make of this; metrics-logger1.private.scl3.mozilla.com:Check received log volume is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/weave-etl is low at 2MB compared to running average 75MB
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
weave-etl is a CloudOps metrics job from years ago.
Flags: needinfo?(scabral)
It looks like there was an error on the 14th:

Something went wrong, see below for details.

++ set -v
++ JAVA_HOME=/usr/local
++ PATH=/usr/bin:/bin:/usr/local/bin
++ export JAVA_HOME PATH
date '+%Y-%m-%d'
+++ date +%Y-%m-%d
++ DATE_YMD=2015-05-14
++ ERRORS=0
++ cd /opt/pentaho/kettle
++ ./kitchen.sh --level=debugging -file /opt/weave_2/writeFile/j_writeWeaveDataToFile.kjb
++ '[' 0 -ne 0 ']'
++ cd /opt/pentaho/kettle
++ ./kitchen.sh -file /opt/weave_2/writeFile/count_Users.kjb
++ '[' 0 -ne 0 ']'
++ cd /opt/weave_2/output/
++ TRIES=0
++ LAST_RESULT=
++ [[ 0 -lt 2 ]]
++ [[ '' != \0 ]]
++ TRIES=1
++ [[ 1 -gt 1 ]]
++ rsync --verbose --timeout=300 -e 'ssh -i /root/.ssh/id_rsa' --perms --chmod=Fa+w ldap_output_2015-05-14 totalAndActiveUsers_2015-05-14 units_2015-05-14 userCount_2015-05-14 users_2015-05-14 stats@metrics-logger1.private.scl3.mozilla.com:/data/logs/weave-etl/production/
ldap_output_2015-05-14
totalAndActiveUsers_2015-05-14
units_2015-05-14
[sender] io timeout after 300 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(140) [sender=3.0.7]
++ LAST_RESULT=30
++ [[ 1 -lt 2 ]]
++ [[ 30 != \0 ]]
++ TRIES=2
++ [[ 2 -gt 1 ]]
++ sleep 216
++ rsync --verbose --timeout=300 -e 'ssh -i /root/.ssh/id_rsa' --perms --chmod=Fa+w ldap_output_2015-05-14 totalAndActiveUsers_2015-05-14 units_2015-05-14 userCount_2015-05-14 users_2015-05-14 stats@metrics-logger1.private.scl3.mozilla.com:/data/logs/weave-etl/production/
ldap_output_2015-05-14
totalAndActiveUsers_2015-05-14
units_2015-05-14
[sender] io timeout after 300 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(140) [sender=3.0.7]
++ LAST_RESULT=30
++ [[ 2 -lt 2 ]]
++ '[' 30 '!=' 0 ']'
++ echo 'Could not copy the weave output files to metrics-logger1.private.scl3.mozilla.com. Please investigate.'
Could not copy the weave output files to metrics-logger1.private.scl3.mozilla.com. Please investigate.
++ ERRORS=1
++ exit 1

But...we decom'd pentaho back in November, so I'm not even sure this is necessary.

From the log, it looks like the problem is that the cron job (from wp-adm01.phx.weave.mozilla.com) couldn't reach the metrics-logger1 machine. That machine is the cloud services jumphost.

Investigating a bit more, this is a cron script that lives in /etc/cron.d/kitchen, and calls /opt/etl-run.sh

The script calls rsync to copy the file over...specifically, runs, in the dir /opt/weave_2/output/:

rsync --verbose --timeout=300 -e "ssh -i /root/.ssh/id_rsa" --perms --chmod=Fa+w *_${DATE_YMD} stats@${DESTINATION}:/data/logs/weave-etl/production/

When I ran it by hand, it seemed to work fine:

[root@wp-adm01.phx.weave output]# DATE_YMD=`date '+%Y-%m-%d'`
[root@wp-adm01.phx.weave output]# DESTINATION='metrics-logger1.private.scl3.mozilla.com'
[root@wp-adm01.phx.weave output]# rsync --verbose --timeout=300 -e "ssh -i /root/.ssh/id_rsa" --perms --chmod=Fa+w *_${DATE_YMD} stats@${DESTINATION}:/data/logs/weave-etl/production/
ldap_output_2015-05-15
totalAndActiveUsers_2015-05-15
units_2015-05-15
userCount_2015-05-15
users_2015-05-15

sent 69915 bytes received 120772 bytes 76274.80 bytes/sec
total size is 141662596 speedup is 742.91

So...network blip? Because of the zayo issue yesterday?
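
The TRIES / LAST_RESULT lines in the trace suggest a bounded retry loop around rsync: at most two attempts, with a sleep before the second. A minimal sketch of that pattern follows. It is inferred from the trace and is not the actual /opt/etl-run.sh; the function name retry and its arguments are hypothetical, and how the script picks its sleep interval (216 seconds in this run) is not shown in the bug.

```shell
# Hypothetical sketch of the retry pattern suggested by the trace.
# Usage: retry MAX_TRIES DELAY_SECONDS command [args...]

retry() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    last_result=1
    while [ "$tries" -lt "$max_tries" ] && [ "$last_result" -ne 0 ]; do
        tries=$((tries + 1))
        # Wait before every attempt except the first.
        if [ "$tries" -gt 1 ]; then
            sleep "$delay"
        fi
        # Capture the exit status without tripping "set -e" in the caller.
        if "$@"; then
            last_result=0
        else
            last_result=$?
        fi
    done
    # Propagate the status of the final attempt (30 for an rsync I/O timeout).
    return "$last_result"
}
```

With this helper, the copy step would look roughly like retry 2 216 rsync --verbose --timeout=300 followed by the remaining arguments from the trace. In the failed run both attempts hit the 300 second timeout, so a transient outage longer than about 13 minutes would defeat two attempts spaced this way.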
Flags: needinfo?(scabral)
No more issues, or they are only intermittent; closing ;)
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard