Here are some general ideas based on the failure observed today:
1. Reasonable timeout on SCP
2. Log interpolated commands and the results of running them (i.e. bash -xv)
3. E-mail logs on failure
4. Retry logic?
Assignee: server-ops → nobody
Component: Server Operations → Operations
Product: mozilla.org → Mozilla Services
QA Contact: phong → operations
Version: other → unspecified
Specifically, this concerns the one-off script adm1.phx1.svc:/opt/etl-run.sh, which we should probably also puppetize, along with /opt/weave_2/writeFile/, at the same time.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #0)
> 1. Reasonable timeout on SCP

It's actually the kernel permitting up to 2 days before a session dies, when the routers in between don't send any ICMP indicating otherwise. Switching to rsync will permit proper timeout handling.

> 2. Log interpolated commands and the results of running them (i.e. bash -xv)

I don't quite understand "interpolated commands", but set -x; set -v is certainly reasonable.

> 3. E-mail logs on failure

Can do.

> 4. Retry logic?

Can do.
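For reference, "interpolated commands" most likely means commands after variable expansion, which is what set -x traces to stderr; set -v instead echoes each source line as the shell reads it, before expansion. A minimal sketch of the -x behavior follows. The function name trace_demo and the path /tmp/example are made up for illustration and are not taken from etl-run.sh.

```shell
#!/usr/bin/env bash
# Hypothetical demo, not part of etl-run.sh.
# With xtrace on, the shell prints each command to stderr after expanding
# its variables, so the log shows the real value rather than "$target".
trace_demo() {
    local target="/tmp/example"      # made-up value for illustration
    set -x                           # start tracing expanded commands
    echo "copying to $target"
    { set +x; } 2>/dev/null          # stop tracing without logging the 'set +x' itself
}
```

Calling trace_demo prints the message on stdout and the expanded echo command on stderr, so redirecting stderr to a log file records what was actually run.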
Assignee: nobody → rsoderberg
Status: NEW → ASSIGNED
Implemented: rsync instead of scp, with --timeout=300 (seconds), --verbose, and --perms --chmod=Fa+w to replace the extra chmod step, plus 1 retry after a 60-240 second sleep in case of failure. If something goes wrong, the full output of etl-run.sh will be sent to <cron-weave> and to <metrics-alerts>, From: <cron+sync-etl-run>. The etl-run.sh output does NOT include the actual kettle job output, because kettle job output contains plaintext passwords. Instead, it lists the pathname of the job log, so Metrics can ask Svcops to look. Pushed the new etl-run.sh script to adm1.phx1.svc (wp-adm01) and we'll see how this morning's results look.
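The transfer and retry behavior described above could be sketched roughly as follows. This is not the actual etl-run.sh: the function names (retry_delay, run_with_one_retry, push_file) and the SLEEP_CMD override are assumptions made for illustration, and the source and destination arguments are placeholders. Only the rsync flags and the 60-240 second single retry come from the comment above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the rsync-with-one-retry logic; not the real etl-run.sh.

# The sleep command is overridable so the retry path can be exercised without waiting.
SLEEP_CMD="${SLEEP_CMD:-sleep}"

# Pick a random delay between 60 and 240 seconds (inclusive).
retry_delay() {
    echo $(( 60 + RANDOM % 181 ))
}

# Run the given command; if it fails, wait and try exactly one more time.
# The exit status is that of the last attempt.
run_with_one_retry() {
    "$@" && return 0
    local delay
    delay=$(retry_delay)
    echo "command failed; retrying once in ${delay}s: $*" >&2
    "$SLEEP_CMD" "$delay"
    "$@"
}

# Copy one file with rsync. --timeout=300 aborts after 300 seconds without I/O;
# --perms --chmod=Fa+w sets write permission on files, replacing a separate chmod step.
push_file() {
    local src=$1 dest=$2    # placeholders supplied by the caller
    run_with_one_retry rsync --verbose --timeout=300 --perms --chmod=Fa+w "$src" "$dest"
}
```

A caller would invoke push_file with a local path and a remote host:path destination, and treat a non-zero exit status as the trigger for mailing the captured output.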
Results were fine. Closing this as resolved, filing a separate bug to puppetize it.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED