1120167 - Check received log volume on metrics-logger1.private.scl3.mozilla.com is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/zlb1.ops.pek1.mozilla.com is low at 49MB compared to running average 54MB

Reporter

Description

•

11 years ago

I had to do this a couple times the last couple of weeks.

[root@zlb1.ops.pek1 ~]# bash -x /usr/local/bin/rsync_files.sh
+ LOCKFILE=/tmp/rsync.lock
++ hostname
+ HOSTNAME=zlb1.ops.pek1.mozilla.com
+ DIR=/var/log/zeus
+ LOGSERVER=metrics-logger1.private.scl3.mozilla.com
+ create_lock /tmp/rsync.lock
+ LOCK_NAME=/tmp/rsync.lock
+ [[ -e /tmp/rsync.lock ]]
+ echo 17119
+ return 0
+ cd /var/log/zeus
+ find /var/log/zeus -type f -name '*.access*' -mmin +30 '!' -name '*gz' -print0
+ xargs -0 -r -P 2 -n 1 nice -n 10 gzip
+ find /var/log/zeus -type f -name '*.access*.gz' -print0
+ xargs -0 -r chown logpull
+ RESULTLOGDIR=/var/log/rsync
+ '[' '!' -d /var/log/rsync ']'
++ date +%Y-%m-%d-%H
+ RESULTLOG=/var/log/rsync/rsync-metrics-logger1.private.scl3.mozilla.com-2015-01-11-00
+ sudo -u logpull rsync -av '--include=*.gz' '--exclude=*' --remove-source-files -e 'ssh -o StrictHostKeyChecking=no' /var/log/zeus/ logpull@metrics-logger1.private.scl3.mozilla.com:/data/stats/logs/zlb1.ops.pek1.mozilla.com/
+ '[' 0 -ne 0 ']'
+ rm -f /tmp/rsync.lock
[root@zlb1.ops.pek1 ~]#
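For reference, the locking step in the trace can be sketched roughly as below. The body of the real create_lock is not shown in this bug; the trace only shows an existence test on /tmp/rsync.lock followed by an echo of the PID. This sketch swaps that test-then-write for bash's noclobber option, so that checking and creating the file happen in one step. The function names and behaviour here are assumptions for illustration, not the contents of rsync_files.sh.

```shell
#!/usr/bin/env bash
# Hypothetical lock helpers; NOT the real create_lock from rsync_files.sh.

# create_lock LOCKFILE
# Writes the current shell's PID into LOCKFILE, failing if it already exists.
create_lock() {
    local lock_name="$1"
    # With noclobber set, the ">" redirection refuses to overwrite an
    # existing file, so the existence check and the write are one operation.
    if ( set -o noclobber; echo "$$" > "$lock_name" ) 2>/dev/null; then
        return 0
    fi
    return 1
}

# remove_lock LOCKFILE
remove_lock() {
    rm -f "$1"
}
```

A caller would bail out early when create_lock fails and call remove_lock at the end of the run, which matches the rm -f /tmp/rsync.lock at the end of the trace.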

david garvey:dgarvey

Reporter

Comment 1

•

11 years ago

I don't think it is critical, but I have to document everything that I do.

:Atoll

Updated

•

11 years ago

Assignee: infra → server-ops-webops

Component: Infrastructure: Other → WebOps: Other

QA Contact: jdow → nmaul

:kanban

Updated

•

11 years ago

Whiteboard: [id=nagios1.private.scl3.mozilla.com:504538] → [kanban:https://webops.kanbanize.com/ctrl_board/2/33] [id=nagios1.private.scl3.mozilla.com:504538]

david garvey:dgarvey

Reporter

Comment 2

•

11 years ago

I had to run it tonight.

[root@zlb1.ops.pek1 ~]# bash -x /usr/local/bin/rsync_files.sh
+ LOCKFILE=/tmp/rsync.lock
++ hostname
+ HOSTNAME=zlb1.ops.pek1.mozilla.com
+ DIR=/var/log/zeus
+ LOGSERVER=metrics-logger1.private.scl3.mozilla.com
+ create_lock /tmp/rsync.lock
+ LOCK_NAME=/tmp/rsync.lock
+ [[ -e /tmp/rsync.lock ]]
+ echo 13825
+ return 0
+ cd /var/log/zeus
+ find /var/log/zeus -type f -name '*.access*' -mmin +30 '!' -name '*gz' -print0
+ xargs -0 -r -P 2 -n 1 nice -n 10 gzip
+ find /var/log/zeus -type f -name '*.access*.gz' -print0
+ xargs -0 -r chown logpull
+ RESULTLOGDIR=/var/log/rsync
+ '[' '!' -d /var/log/rsync ']'
++ date +%Y-%m-%d-%H
+ RESULTLOG=/var/log/rsync/rsync-metrics-logger1.private.scl3.mozilla.com-2015-01-17-00
+ sudo -u logpull rsync -av '--include=*.gz' '--exclude=*' --remove-source-files -e 'ssh -o StrictHostKeyChecking=no' /var/log/zeus/ logpull@metrics-logger1.private.scl3.mozilla.com:/data/stats/logs/zlb1.ops.pek1.mozilla.com/
+ '[' 0 -ne 0 ']'
+ rm -f /tmp/rsync.lock
[root@zlb1.ops.pek1 ~]#

Sheeri Cabral [:sheeri]

Comment 3

•

11 years ago

Tmary, can you take a look?

T [:tmary] Meyarivan

Comment 4

•

11 years ago

Size of logs (/data/stats/logs/zlb1.ops.pek1.mozilla.com) for past N days obtained from metrics-logger1:

(date, MB)
2015-01-09 39
2015-01-10 46
2015-01-11 61
2015-01-12 72
2015-01-13 64
2015-01-14 62
2015-01-15 56
2015-01-16 47
2015-01-17 49
2015-01-18 53
2015-01-19 61
2015-01-20 59
2015-01-21 58
2015-01-22 53
2015-01-23 53
2015-01-24 59
2015-01-25 57
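To make the "low compared to running average" comparison concrete, here is a minimal sketch that takes date/MB pairs like the ones above and compares the most recent day against the mean of the days before it. The window and threshold used by the actual Nagios check are not stated anywhere in this bug, so the check_log_volume name, the min_percent parameter, and the message wording are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical check; the real Nagios plugin's window and threshold are unknown.

# check_log_volume MIN_PERCENT
# Reads "YYYY-MM-DD MB" lines on stdin, oldest first. Compares the last day
# against the mean of all earlier days and uses Nagios-style exit codes.
check_log_volume() {
    local min_percent="$1"   # assumed threshold: minimum % of the average
    awk -v min="$min_percent" '
        # Only accept lines that look like "date size"; skips headers.
        NF == 2 && $1 ~ /^[0-9]+-[0-9]+-[0-9]+$/ { size[++n] = $2; day[n] = $1 }
        END {
            if (n < 2) { print "UNKNOWN: need at least two days of data"; exit 3 }
            for (i = 1; i < n; i++) sum += size[i]
            avg = sum / (n - 1)
            if (size[n] * 100 < avg * min) {
                printf "CRITICAL: %s is low at %dMB compared to average %dMB\n", day[n], size[n], avg
                exit 2
            }
            printf "OK: %s is %dMB, average %dMB\n", day[n], size[n], avg
        }'
}
```

Feeding it the figures above up to a given date, with a guessed threshold such as check_log_volume 90, gives a rough idea of which days would have tripped a check of this shape.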

Sheeri Cabral [:sheeri]

Comment 5

•

11 years ago

David, it looks like your efforts have helped. This is not something the data team can debug any further, but you might want to check the regular cron jobs that are supposed to be doing this work and see if there are any errors in their logfiles. That might point to what's going wrong here.
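One way to do that check, as a rough sketch: the trace in the description shows the script setting RESULTLOGDIR=/var/log/rsync and naming its result logs rsync-<logserver>-<date>-<hour>. Assuming the cron job writes rsync's error output into those files (the trace does not show the redirection, so this is a guess), something like the following would list the runs that failed. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumes rsync's "rsync error: ..." lines end up in the
# per-run result logs under the given directory.

# find_failed_rsync_logs [LOG_DIR]
# Prints the names of result logs that contain an rsync error line.
# Returns non-zero when no log contains one.
find_failed_rsync_logs() {
    local log_dir="${1:-/var/log/rsync}"
    grep -l 'rsync error' "$log_dir"/rsync-* 2>/dev/null
}
```

If it prints nothing, either every run succeeded or the errors are going somewhere else, such as cron mail.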

Ryan C [:ryanc] (UTC-4)

Comment 6

•

11 years ago

Sun 00:21:24 PST [5784] metrics-logger1.private.scl3.mozilla.com:Check received log volume is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/zlb1.ops.pek1.mozilla.com is low at 49MB compared to running average 56MB

Verified logs and kicked off /usr/local/bin/rsync_files.sh.

:Atoll

Comment 7

•

10 years ago

I'm not seeing any further actionable work for Webops in this bug (PEK1 zeus is gone, one alert occurred last month, and none since). So I'm going to RESO FIXE this for now, but please reopen if we can help investigate further somehow.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Ryan C [:ryanc] (UTC-4)

Comment 8

•

10 years ago

Not sure what to make of this; metrics-logger1.private.scl3.mozilla.com:Check received log volume is CRITICAL: CRITICAL: Daily log size from /data/stats/logs/weave-etl is low at 2MB compared to running average 75MB

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

:Atoll

Comment 9

•

10 years ago

weave-etl is a CloudOps metrics job from years ago.

Flags: needinfo?(scabral)

Sheeri Cabral [:sheeri]

Comment 10

•

10 years ago

It looks like there was an error on the 14th:

Something went wrong, see below for details.\n\n
++ set -v
++ JAVA_HOME=/usr/local
++ PATH=/usr/bin:/bin:/usr/local/bin
++ export JAVA_HOME PATH
date '+%Y-%m-%d'
+++ date +%Y-%m-%d
++ DATE_YMD=2015-05-14
++ ERRORS=0
++ cd /opt/pentaho/kettle
++ ./kitchen.sh --level=debugging -file /opt/weave_2/writeFile/j_writeWeaveDataToFile.kjb
++ '[' 0 -ne 0 ']'
++ cd /opt/pentaho/kettle
++ ./kitchen.sh -file /opt/weave_2/writeFile/count_Users.kjb
++ '[' 0 -ne 0 ']'
++ cd /opt/weave_2/output/
++ TRIES=0
++ LAST_RESULT=
++ [[ 0 -lt 2 ]]
++ [[ '' != \0 ]]
++ TRIES=1
++ [[ 1 -gt 1 ]]
++ rsync --verbose --timeout=300 -e 'ssh -i /root/.ssh/id_rsa' --perms --chmod=Fa+w ldap_output_2015-05-14 totalAndActiveUsers_2015-05-14 units_2015-05-14 userCount_2015-05-14 users_2015-05-14 stats@metrics-logger1.private.scl3.mozilla.com:/data/logs/weave-etl/production/
ldap_output_2015-05-14
totalAndActiveUsers_2015-05-14
units_2015-05-14
[sender] io timeout after 300 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(140) [sender=3.0.7]
++ LAST_RESULT=30
++ [[ 1 -lt 2 ]]
++ [[ 30 != \0 ]]
++ TRIES=2
++ [[ 2 -gt 1 ]]
++ sleep 216
++ rsync --verbose --timeout=300 -e 'ssh -i /root/.ssh/id_rsa' --perms --chmod=Fa+w ldap_output_2015-05-14 totalAndActiveUsers_2015-05-14 units_2015-05-14 userCount_2015-05-14 users_2015-05-14 stats@metrics-logger1.private.scl3.mozilla.com:/data/logs/weave-etl/production/
ldap_output_2015-05-14
totalAndActiveUsers_2015-05-14
units_2015-05-14
[sender] io timeout after 300 seconds -- exiting
rsync error: timeout in data send/receive (code 30) at io.c(140) [sender=3.0.7]
++ LAST_RESULT=30
++ [[ 2 -lt 2 ]]
++ '[' 30 '!=' 0 ']'
++ echo 'Could not copy the weave output files to metrics-logger1.private.scl3.mozilla.com. Please investigate.'
Could not copy the weave output files to metrics-logger1.private.scl3.mozilla.com. Please investigate.
++ ERRORS=1
++ exit 1

But...we decom'd pentaho back in November, so I'm not even sure this is necessary.

From the log, it looks like the problem is that the cron job (from wp-adm01.phx.weave.mozilla.com) couldn't reach the metrics-logger1 machine. That machine is the cloud services jumphost.

Investigating a bit more, this is a cron script that lives in /etc/cron.d/kitchen, and calls /opt/etl-run.sh

The script calls rsync to copy the file over...specifically, runs:

in the dir /opt/weave_2/output/:
rsync --verbose --timeout=300 -e "ssh -i /root/.ssh/id_rsa" --perms --chmod=Fa+w *_${DATE_YMD} stats@${DESTINATION}:/data/logs/weave-etl/production/

When I ran it by hand, it seemed to work fine:

[root@wp-adm01.phx.weave output]# DATE_YMD=`date '+%Y-%m-%d'`
[root@wp-adm01.phx.weave output]# DESTINATION='metrics-logger1.private.scl3.mozilla.com'
[root@wp-adm01.phx.weave output]# rsync --verbose --timeout=300 -e "ssh -i /root/.ssh/id_rsa" --perms --chmod=Fa+w *_${DATE_YMD} stats@${DESTINATION}:/data/logs/weave-etl/production/
ldap_output_2015-05-15
totalAndActiveUsers_2015-05-15
units_2015-05-15
userCount_2015-05-15
users_2015-05-15

sent 69915 bytes received 120772 bytes 76274.80 bytes/sec
total size is 141662596 speedup is 742.91

So...network blip? Because of the zayo issue yesterday?
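The retry logic visible in that trace (TRIES, LAST_RESULT, at most two attempts, a sleep before the second one) can be sketched as a small wrapper like the one below. This is a reconstruction from the trace only, not the contents of /opt/etl-run.sh. The retry_cmd name and its arguments are made up, and the 216 second sleep in the trace may well be randomized in the real script, so the pause is left as a parameter here.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the retry pattern seen in the trace;
# NOT the actual code from /opt/etl-run.sh.

# retry_cmd MAX_TRIES PAUSE_SECONDS command [args...]
# Runs the command until it succeeds or MAX_TRIES attempts have been made.
# Returns the exit status of the last attempt.
retry_cmd() {
    local max_tries="$1" pause="$2"
    shift 2
    local tries=0 last_result=""
    while [[ $tries -lt $max_tries ]] && [[ "$last_result" != 0 ]]; do
        tries=$((tries + 1))
        # Pause before every attempt except the first.
        if [[ $tries -gt 1 ]]; then
            sleep "$pause"
        fi
        if "$@"; then
            last_result=0
        else
            last_result=$?
        fi
    done
    # If no attempt was made at all (MAX_TRIES < 1), report failure.
    return "${last_result:-1}"
}
```

With a wrapper of this shape, the copy step would look something like retry_cmd 2 216 rsync --verbose --timeout=300 ... followed by a check of the return value. Both attempts on the 14th hit the same 300 second I/O timeout, so a longer pause or a third attempt might or might not have helped.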

Flags: needinfo?(scabral)

david garvey:dgarvey

Reporter

Comment 11

•

10 years ago

No more issues, or only intermittent ones. Closing ;)

Status: REOPENED → RESOLVED

Closed: 10 years ago → 10 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Tracking

(Not tracked)

People

(Reporter: dgarvey, Unassigned)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/33] [id=nagios1.private.scl3.mozilla.com:504538])

Security

(public)
