Closed
Bug 1120167
Opened 11 years ago
Closed 10 years ago
Check received log volume on metrics-logger1.private.scl3.mozilla.com is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/zlb1.ops.pek1.mozilla.com is low at 49MB compared to running average 54MB
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dgarvey, Unassigned)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/33] [id=nagios1.private.scl3.mozilla.com:504538])
I had to do this a couple times the last couple of weeks.
[root@zlb1.ops.pek1 ~]# bash -x /usr/local/bin/rsync_files.sh
+ LOCKFILE=/tmp/rsync.lock
++ hostname
+ HOSTNAME=zlb1.ops.pek1.mozilla.com
+ DIR=/var/log/zeus
+ LOGSERVER=metrics-logger1.private.scl3.mozilla.com
+ create_lock /tmp/rsync.lock
+ LOCK_NAME=/tmp/rsync.lock
+ [[ -e /tmp/rsync.lock ]]
+ echo 17119
+ return 0
+ cd /var/log/zeus
+ find /var/log/zeus -type f -name '*.access*' -mmin +30 '!' -name '*gz' -print0
+ xargs -0 -r -P 2 -n 1 nice -n 10 gzip
+ find /var/log/zeus -type f -name '*.access*.gz' -print0
+ xargs -0 -r chown logpull
+ RESULTLOGDIR=/var/log/rsync
+ '[' '!' -d /var/log/rsync ']'
++ date +%Y-%m-%d-%H
+ RESULTLOG=/var/log/rsync/rsync-metrics-logger1.private.scl3.mozilla.com-2015-01-11-00
+ sudo -u logpull rsync -av '--include=*.gz' '--exclude=*' --remove-source-files -e 'ssh -o StrictHostKeyChecking=no' /var/log/zeus/ logpull@metrics-logger1.private.scl3.mozilla.com:/data/stats/logs/zlb1.ops.pek1.mozilla.com/
+ '[' 0 -ne 0 ']'
+ rm -f /tmp/rsync.lock
[root@zlb1.ops.pek1 ~]#
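For context, here is a minimal reconstruction of what /usr/local/bin/rsync_files.sh appears to do, pieced together from the xtrace above (the create_lock body and the redirect to $RESULTLOG are assumptions, since set -x does not show function bodies or redirections):

#!/bin/bash
# Reconstructed sketch of /usr/local/bin/rsync_files.sh, from the trace above.
LOCKFILE=/tmp/rsync.lock
HOSTNAME=$(hostname)
DIR=/var/log/zeus
LOGSERVER=metrics-logger1.private.scl3.mozilla.com

create_lock() {
    # Assumed: refuse to run if a previous instance left its lock behind.
    local LOCK_NAME=$1
    [[ -e $LOCK_NAME ]] && return 1
    echo $$ > "$LOCK_NAME"
    return 0
}

create_lock "$LOCKFILE" || exit 1
cd "$DIR"

# Compress access logs older than 30 minutes that are not already gzipped,
# running two gzips in parallel at low priority.
find "$DIR" -type f -name '*.access*' -mmin +30 ! -name '*gz' -print0 |
    xargs -0 -r -P 2 -n 1 nice -n 10 gzip
# Hand the compressed files to the unprivileged pull account.
find "$DIR" -type f -name '*.access*.gz' -print0 | xargs -0 -r chown logpull

RESULTLOGDIR=/var/log/rsync
[ ! -d "$RESULTLOGDIR" ] && mkdir -p "$RESULTLOGDIR"
RESULTLOG=$RESULTLOGDIR/rsync-$LOGSERVER-$(date +%Y-%m-%d-%H)

# --include before --exclude means only *.gz files are shipped;
# --remove-source-files deletes each file locally once transferred.
sudo -u logpull rsync -av --include='*.gz' --exclude='*' --remove-source-files \
    -e 'ssh -o StrictHostKeyChecking=no' \
    "$DIR/" "logpull@$LOGSERVER:/data/stats/logs/$HOSTNAME/" > "$RESULTLOG" 2>&1
[ $? -ne 0 ] && echo "rsync to $LOGSERVER failed, see $RESULTLOG"  # error path assumed

rm -f "$LOCKFILE"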
Reporter
Comment 1 • 11 years ago
I don't think it is critical, but I have documented everything that I do.
Assignee: infra → server-ops-webops
Component: Infrastructure: Other → WebOps: Other
QA Contact: jdow → nmaul
Whiteboard: [id=nagios1.private.scl3.mozilla.com:504538] → [kanban:https://webops.kanbanize.com/ctrl_board/2/33] [id=nagios1.private.scl3.mozilla.com:504538]
Reporter
Comment 2 • 11 years ago
I had to run it tonight.
[root@zlb1.ops.pek1 ~]# bash -x /usr/local/bin/rsync_files.sh
+ LOCKFILE=/tmp/rsync.lock
++ hostname
+ HOSTNAME=zlb1.ops.pek1.mozilla.com
+ DIR=/var/log/zeus
+ LOGSERVER=metrics-logger1.private.scl3.mozilla.com
+ create_lock /tmp/rsync.lock
+ LOCK_NAME=/tmp/rsync.lock
+ [[ -e /tmp/rsync.lock ]]
+ echo 13825
+ return 0
+ cd /var/log/zeus
+ find /var/log/zeus -type f -name '*.access*' -mmin +30 '!' -name '*gz' -print0
+ xargs -0 -r -P 2 -n 1 nice -n 10 gzip
+ find /var/log/zeus -type f -name '*.access*.gz' -print0
+ xargs -0 -r chown logpull
+ RESULTLOGDIR=/var/log/rsync
+ '[' '!' -d /var/log/rsync ']'
++ date +%Y-%m-%d-%H
+ RESULTLOG=/var/log/rsync/rsync-metrics-logger1.private.scl3.mozilla.com-2015-01-17-00
+ sudo -u logpull rsync -av '--include=*.gz' '--exclude=*' --remove-source-files -e 'ssh -o StrictHostKeyChecking=no' /var/log/zeus/ logpull@metrics-logger1.private.scl3.mozilla.com:/data/stats/logs/zlb1.ops.pek1.mozilla.com/
+ '[' 0 -ne 0 ']'
+ rm -f /tmp/rsync.lock
[root@zlb1.ops.pek1 ~]#
Comment 3 • 11 years ago
Tmary, can you take a look?
Comment 4 • 11 years ago
Size of logs (/data/stats/logs/zlb1.ops.pek1.mozilla.com) for the past 17 days, obtained from metrics-logger1:
(date, MB)
2015-01-09 39
2015-01-10 46
2015-01-11 61
2015-01-12 72
2015-01-13 64
2015-01-14 62
2015-01-15 56
2015-01-16 47
2015-01-17 49
2015-01-18 53
2015-01-19 61
2015-01-20 59
2015-01-21 58
2015-01-22 53
2015-01-23 53
2015-01-24 59
2015-01-25 57
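The check plugin itself isn't attached to this bug, but conceptually it sums each day's received bytes and compares the newest day against a trailing average. A hypothetical sketch (the directory is real; the 7-day window and 90% threshold are guesses):

#!/bin/bash
# Hypothetical volume check; not the actual Nagios plugin.
DIR=/data/stats/logs/zlb1.ops.pek1.mozilla.com
# Yesterday's total, in MB (files modified yesterday, relative to day start).
day=$(find "$DIR" -type f -daystart -mtime 1 -printf '%s\n' |
      awk '{s+=$1} END {printf "%d", s/1048576}')
# Average of the 7 days before that, in MB.
avg=$(find "$DIR" -type f -daystart -mtime +1 -mtime -9 -printf '%s\n' |
      awk '{s+=$1} END {printf "%d", s/1048576/7}')
if [ "$day" -lt $((avg * 90 / 100)) ]; then
    echo "CRITICAL: Daily log size from $DIR is low at ${day}MB compared to running average ${avg}MB"
    exit 2
fi
echo "OK: ${day}MB received vs running average ${avg}MB"
exit 0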
Comment 5 • 11 years ago
David, it looks like your efforts have helped. This is not something the data team can debug any further, but you might want to check the regular cron jobs that are supposed to be doing this work and see if there are any errors in their logfiles. That might point to what's going wrong here.
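For example, on the ZLB host, checks along these lines could help (commands are illustrative; /var/log/rsync is the result-log directory the trace above sets up):

grep rsync_files /var/log/cron                      # confirm cron is actually firing the job
ls -lt /var/log/rsync/ | head                       # recent per-run result logs
tail -n 50 "$(ls -t /var/log/rsync/* | head -1)"    # errors from the last run?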
Comment 6 • 11 years ago
Sun 00:21:24 PST [5784] metrics-logger1.private.scl3.mozilla.com:Check received log volume is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/zlb1.ops.pek1.mozilla.com is low at 49MB compared to running average 56MB
Verified logs and kicked off /usr/local/bin/rsync_files.sh.
I'm not seeing any further actionable work for WebOps in this bug (the PEK1 Zeus is gone; one alert occurred last month, and none since), so I'm going to resolve this as FIXED for now. Please reopen if we can help investigate further somehow.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 8 • 10 years ago
Not sure what to make of this:
metrics-logger1.private.scl3.mozilla.com:Check received log volume is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/weave-etl is low at 2MB compared to running average 75MB
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
weave-etl is a CloudOps metrics job from years ago.
Flags: needinfo?(scabral)
Comment 10 • 10 years ago
It looks like there was an error on the 14th:
Something went wrong, see below for details.

++ set -v
++ JAVA_HOME=/usr/local
++ PATH=/usr/bin:/bin:/usr/local/bin
++ export JAVA_HOME PATH
date '+%Y-%m-%d'
+++ date +%Y-%m-%d
++ DATE_YMD=2015-05-14
++ ERRORS=0
++ cd /opt/pentaho/kettle
++ ./kitchen.sh --level=debugging -file /opt/weave_2/writeFile/j_writeWeaveDataToFile.kjb
++ '[' 0 -ne 0 ']'
++ cd /opt/pentaho/kettle
++ ./kitchen.sh -file /opt/weave_2/writeFile/count_Users.kjb
++ '[' 0 -ne 0 ']'
++ cd /opt/weave_2/output/
++ TRIES=0
++ LAST_RESULT=
++ [[ 0 -lt 2 ]]
++ [[ '' != \0 ]]
++ TRIES=1
++ [[ 1 -gt 1 ]]
++ rsync --verbose --timeout=300 -e 'ssh -i /root/.ssh/id_rsa' --perms --chmod=Fa+w ldap_output_2015-05-14 totalAndActiveUsers_2015-05-14 units_2015-05-14 userCount_2015-05-14 users_2015-05-14 stats@metrics-logger1.private.scl3.mozilla.com:/data/logs/weave-etl/production/
ldap_output_2015-05-14
totalAndActiveUsers_2015-05-14
units_2015-05-14
[sender] io timeout after 300 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(140) [sender=3.0.7]
++ LAST_RESULT=30
++ [[ 1 -lt 2 ]]
++ [[ 30 != \0 ]]
++ TRIES=2
++ [[ 2 -gt 1 ]]
++ sleep 216
++ rsync --verbose --timeout=300 -e 'ssh -i /root/.ssh/id_rsa' --perms --chmod=Fa+w ldap_output_2015-05-14 totalAndActiveUsers_2015-05-14 units_2015-05-14 userCount_2015-05-14 users_2015-05-14 stats@metrics-logger1.private.scl3.mozilla.com:/data/logs/weave-etl/production/
ldap_output_2015-05-14
totalAndActiveUsers_2015-05-14
units_2015-05-14
[sender] io timeout after 300 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(140) [sender=3.0.7]
++ LAST_RESULT=30
++ [[ 2 -lt 2 ]]
++ '[' 30 '!=' 0 ']'
++ echo 'Could not copy the weave output files to metrics-logger1.private.scl3.mozilla.com. Please investigate.'
Could not copy the weave output files to metrics-logger1.private.scl3.mozilla.com. Please investigate.
++ ERRORS=1
++ exit 1
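Stripped of the trace noise, the copy step in /opt/etl-run.sh retries once before giving up, roughly as follows (reconstructed from the set -v output above; the exact sleep expression is an assumption, though the 216s seen here suggests a randomized interval):

TRIES=0
LAST_RESULT=
while [[ $TRIES -lt 2 && $LAST_RESULT != 0 ]]; do
    TRIES=$((TRIES + 1))
    # Back off before the second attempt (216s in this run; assumed random).
    [[ $TRIES -gt 1 ]] && sleep $((RANDOM % 300))
    rsync --verbose --timeout=300 -e 'ssh -i /root/.ssh/id_rsa' \
        --perms --chmod=Fa+w *_"$DATE_YMD" \
        "stats@$DESTINATION:/data/logs/weave-etl/production/"
    LAST_RESULT=$?
done
if [ "$LAST_RESULT" != 0 ]; then
    echo "Could not copy the weave output files to $DESTINATION. Please investigate."
    ERRORS=1
    exit 1
fi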
But...we decom'd Pentaho back in November, so I'm not even sure this is necessary. From the log, it looks like the problem is that the cron job (from wp-adm01.phx.weave.mozilla.com) couldn't reach the metrics-logger1 machine.
That machine is the cloud services jumphost. Investigating a bit more, this is a cron script that lives in /etc/cron.d/kitchen, and calls /opt/etl-run.sh
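For reference, that cron entry presumably looks something like this (hypothetical contents; only the script path is confirmed in this bug):

# /etc/cron.d/kitchen
# m  h  dom mon dow  user  command   (schedule is a guess)
5    0  *   *   *    root  /opt/etl-run.sh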
The script calls rsync to copy the files over; specifically, in the dir /opt/weave_2/output/ it runs:
rsync --verbose --timeout=300 -e "ssh -i /root/.ssh/id_rsa" --perms --chmod=Fa+w *_${DATE_YMD} stats@${DESTINATION}:/data/logs/weave-etl/production/
When I ran it by hand, it seemed to work fine:
[root@wp-adm01.phx.weave output]# DATE_YMD=`date '+%Y-%m-%d'`
[root@wp-adm01.phx.weave output]# DESTINATION='metrics-logger1.private.scl3.mozilla.com'
[root@wp-adm01.phx.weave output]# rsync --verbose --timeout=300 -e "ssh -i /root/.ssh/id_rsa" --perms --chmod=Fa+w *_${DATE_YMD} stats@${DESTINATION}:/data/logs/weave-etl/production/
ldap_output_2015-05-15
totalAndActiveUsers_2015-05-15
units_2015-05-15
userCount_2015-05-15
users_2015-05-15
sent 69915 bytes received 120772 bytes 76274.80 bytes/sec
total size is 141662596 speedup is 742.91
So... a network blip? Because of the Zayo issue yesterday?
Flags: needinfo?(scabral)
Reporter
Comment 11 • 10 years ago
No more issues, or only intermittent ones; closing. ;)
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard