Our current alerting script should be modified to automatically apply the workaround (HUP the hg process) and complain noisily if it cannot. Ideally, it would also update bug 829025 with the details of the hang, but that step may remain manual for a while.
Created attachment 705130
attempt hup of hg in socket wait

The only easy way to test this is to run it live in production, so I want a sanity check first. The script would be started from cron with the --fix option. That should attempt one HUP of the hg process stuck in socket wait (existing messages have shown solid detection of just that case), then report again. Ideally, I'll get to run it manually on a live hang before deploying via cron.
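For readers without the attachment, here is a rough sketch of the "one HUP, then report again" flow described above. It is not the attached patch: `find_hung_pids`, `report`, `fix_and_report`, and `recheck_delay` are hypothetical names, and the detection step is left as a stub because the thread does not say how the script identifies an hg process in socket wait.

```shell
#!/bin/bash
# Sketch only. find_hung_pids and report are hypothetical stand-ins for
# the detection and notification the real script already performs.

attempt_fix=${attempt_fix:-false}

# Hypothetical: print the pids of hg processes detected in socket wait.
find_hung_pids() { :; }

# Hypothetical: notification hook (the real script mails its output).
report() { echo "$*"; }

# Send a single HUP to each hung pid, then re-run detection and report.
# Returns 0 if nothing is hung (any more), 1 otherwise.
fix_and_report() {
    local pids p
    pids=$(find_hung_pids)
    if test -z "$pids"; then
        return 0
    fi
    for p in $pids; do
        report "socket hang on pid $p"
        if $attempt_fix; then
            kill -HUP "$p" 2>/dev/null || report "could not HUP pid $p"
        fi
    done
    # Without --fix we only notify; the hang is still there.
    $attempt_fix || return 1
    # Assumed grace period before checking again.
    sleep "${recheck_delay:-5}"
    pids=$(find_hung_pids)
    if test -n "$pids"; then
        report "still hung after HUP: $pids"
        return 1
    fi
    report "HUP cleared the hang"
}
```

Only one HUP is sent per run, so a cron-driven invocation cannot loop on a process that refuses to die; a persistent hang simply shows up again in the next report.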
Comment on attachment 705130
attempt hup of hg in socket wait

Re-entrant bash doesn't make me nervous, no, not at all.

>diff --git a/check_process_delay b/check_process_delay
>+email_subject="[vcs2vcs] process delays"

You could use this at the end of the on_exit() function that follows, yes?

>+ log "socket hang on pid $p"

Nit, trailing whitespace.

>+# process command line args
>+attempt_fix=false
>+while test $# -gt 0; do
>+    case "$1" in
>+    --fix) attempt_fix=true ;;
>+    -h | --help) usage ;;
>+    -*) usage "unknown option '$1'" ;;
>+    *) break ;;
>+    esac
>+    shift
>+done

If memory serves, the case options can be indented for greater readability.
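As an illustration of that last suggestion, the option loop might look like this with the case patterns indented one level under `case`. This is a sketch, not the landed code: the `usage` body and its usage string are made up here, and the loop is wrapped in a hypothetical `parse_args` function only so it can be exercised on its own.

```shell
#!/bin/bash
# Sketch: same option handling as the quoted patch, with the case
# patterns indented. usage() is a hypothetical stand-in for the
# script's own helper.

usage() {
    if test $# -gt 0; then
        echo "error: $*" >&2
    fi
    echo "usage: ${0##*/} [--fix] [-h|--help]" >&2
    exit 2
}

# process command line args
parse_args() {
    attempt_fix=false
    while test $# -gt 0; do
        case "$1" in
            --fix) attempt_fix=true ;;
            -h | --help) usage ;;
            -*) usage "unknown option '$1'" ;;
            *) break ;;
        esac
        shift
    done
}

parse_args "$@"
```

Bash accepts either layout; the indentation only changes how easily the patterns stand out from the surrounding loop.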
Created attachment 709202
automatically try to fix a hung socket

This has been running successfully on gd3 for a while, and incorporates :nthomas's previous feedback.
Comment on attachment 709202
automatically try to fix a hung socket

> # likely i/o to NFS slowing things down. Notify, but may not
>- # be error (unsubscripted array access is element 0)
>+ # be error

Nit: "be an error".
Comment on attachment 709202
automatically try to fix a hung socket

Landed as http://hg.mozilla.org/users/hwine_mozilla.com/repo-sync-tools/rev/0986499abadc and deployed in production.