Closed Bug 698664 Opened 14 years ago Closed 14 years ago

NTP checks should retry a little bit

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: dustin, Assigned: arich)

Details

jabba is updating NTP configs, which requires restarting ntp. We're seeing

22:30 < nagios-sjc1> [81] test-master1.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [82] ganglia3.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [83] kvm1.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [85] buildbot-master09.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [86] ganglia2.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [87] buildbot-master10.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [89] buildbot-master11.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [90] buildbot-master06.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [91] buildbot-master04.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [92] talos-master:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [93] buildbot-master08.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [95] buildbot-master07.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [96] staging-mobile-master.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [97] buildbot-master13.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [98] buildbot-master17.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [99] buildbot-master14.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:31 < nagios-sjc1> [00] buildbot-master12.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [01] buildbot-master15.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [02] staging-master.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [03] buildbot-master16.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [04] buildbot-master18.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [05] production-mobile-master.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [06] kvm2.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [07] buildbot-master3.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:32 < nagios-sjc1> [08] preproduction-master.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:33 < nagios-sjc1> [10] test-master1.build.mtv1:NTP Time Check is WARNING: NTP WARNING: Offset -11.63593771 secs
22:34 < nagios-sjc1> [11] kvm1.build.mtv1:NTP Time Check is WARNING: NTP WARNING: Offset -11.56783724 secs
22:34 < nagios-sjc1> [12] ganglia3.build.mtv1:NTP Time Check is WARNING: NTP WARNING: Offset -11.56880665 secs
22:34 < nagios-sjc1> [14] buildbot-master10.build.sjc1:NTP Time Check is WARNING: NTP WARNING: Offset -11.61540222 secs
22:34 < nagios-sjc1> [15] buildbot-master09.build.sjc1:NTP Time Check is WARNING: NTP WARNING: Offset -11.6256212 secs
22:34 < nagios-sjc1> [16] ganglia2.build.sjc1:NTP Time Check is WARNING: NTP WARNING: Offset -11.61608267 secs
22:34 < nagios-sjc1> [17] buildbot-master06.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.67369723 secs
22:34 < nagios-sjc1> [18] buildbot-master11.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.68124366 secs
22:34 < nagios-sjc1> [19] buildbot-master04.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.53868377 secs
22:34 < nagios-sjc1> [20] buildbot-master13.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.66663563 secs
22:34 < nagios-sjc1> [21] buildbot-master17.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.91205525 secs
22:34 < nagios-sjc1> [22] buildbot-master12.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.6270802 secs
22:34 < nagios-sjc1> [23] buildbot-master14.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.88885212 secs
22:34 < nagios-sjc1> [24] buildbot-master07.build.sjc1:NTP Time Check is WARNING: NTP WARNING: Offset -11.6226027 secs
22:34 < nagios-sjc1> [25] buildbot-master08.build.sjc1:NTP Time Check is WARNING: NTP WARNING: Offset -11.60948634 secs
22:35 < nagios-sjc1> [26] buildbot-master15.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.67350221 secs
22:35 < nagios-sjc1> [27] staging-master.build.sjc1:NTP Time Check is WARNING: NTP WARNING: Offset -11.61113419 secs
22:35 < nagios-sjc1> [28] talos-master:NTP Time Check is WARNING: NTP WARNING: Offset -11.60472859 secs
22:35 < nagios-sjc1> [29] staging-mobile-master.build.mtv1:NTP Time Check is WARNING: NTP WARNING: Offset -11.63597026 secs
22:35 < nagios-sjc1> [30] buildbot-master16.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.68233609 secs
22:35 < nagios-sjc1> [31] buildbot-master18.build.scl1:NTP Time Check is WARNING: NTP WARNING: Offset -11.76698232 secs
22:35 * dustin holds his breath waiting for the dead fish
22:35 < nagios-sjc1> [32] kvm2.build.mtv1:NTP Time Check is WARNING: NTP WARNING: Offset -11.56496429 secs
22:35 < nagios-sjc1> [33] preproduction-master.build.sjc1:NTP Time Check is WARNING: NTP WARNING: Offset -11.59365162 secs
22:35 < nagios-sjc1> [34] production-mobile-master.build.mtv1:NTP Time Check is WARNING: NTP WARNING: Offset -11.63135057 secs
22:35 < nagios-sjc1> [35] buildbot-master3.build.mtv1:NTP Time Check is WARNING: NTP WARNING: Offset -11.76900314 secs
22:40 < nagios-sjc1> [37] test-master1.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [38] kvm1.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [39] ganglia3.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [40] buildbot-master10.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [41] buildbot-master06.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [42] buildbot-master04.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [43] buildbot-master11.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [44] buildbot-master09.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [45] buildbot-master14.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [46] buildbot-master13.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [47] buildbot-master12.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:40 < nagios-sjc1> [48] buildbot-master17.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [49] ganglia2.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [50] buildbot-master15.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [51] buildbot-master16.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [52] buildbot-master18.build.scl1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [53] buildbot-master07.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [54] buildbot-master08.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [55] staging-master.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [56] talos-master:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [57] staging-mobile-master.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:41 < nagios-sjc1> [58] preproduction-master.build.sjc1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:42 < nagios-sjc1> [59] kvm2.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:42 < nagios-sjc1> [60] buildbot-master3.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown
22:42 < nagios-sjc1> [61] production-mobile-master.build.mtv1:NTP Time Check is CRITICAL: NTP CRITICAL: Offset unknown

which is (we think) due to the checks being too quick to claim failure. Since clocks rarely skew too far too quickly, we should probably have a notification delay on these.
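For illustration only, here is a rough Python sketch of what a time-based notification delay would do. It is not the nagios implementation; the function name, the (minute, state) input format, and the delay value are all made up for the example. Nagios 3 has a `first_notification_delay` directive that works along similar lines.

```python
OK = "OK"


def notifications_with_delay(results, delay_minutes):
    """Return the (minute, state) pairs at which a notification would be sent.

    results: list of (minute, state) check results in time order
             (hypothetical input format for this sketch).
    delay_minutes: how long a problem must persist before notifying.
    """
    problem_since = None  # minute at which the current problem started
    notified = False      # whether the current problem was already reported
    sent = []
    for minute, state in results:
        if state == OK:
            # Recovery: forget the problem and re-arm the notification.
            problem_since = None
            notified = False
            continue
        if problem_since is None:
            problem_since = minute
        if not notified and minute - problem_since >= delay_minutes:
            sent.append((minute, state))
            notified = True
    return sent


if __name__ == "__main__":
    # A blip that clears after two minutes, then a problem that persists.
    history = [(0, "CRITICAL"), (1, "CRITICAL"), (2, "OK")]
    history += [(m, "CRITICAL") for m in range(3, 9)]
    print(notifications_with_delay(history, delay_minutes=5))
```

With a five minute delay the short blip is never reported, while the problem that persists is reported once it has lasted five minutes. A problem that moves between CRITICAL and WARNING without ever returning to OK still counts as one continuous problem in this sketch.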
Jabba changed which hosts the ntp servers in scl1 sync to (rhel instead of the routers in scl1, which were several minutes apart), like the ntp servers in sjc do (due to an ldap issue that needed to be resolved). This caused all ntp clients in scl1 to have check issues until they were able to sync back up. There is already some time built into the checks, based on the nagios defaults of 3 retries a minute apart. I'm not sure we want more than that, since we do want to know when machines go out of sync or when they can't connect to the ntp server at all.
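To make the "3 retries" point concrete, here is a simplified Python sketch of retry-before-alert logic. It assumes one check result per minute and alerts only once a check has been non-OK for `max_check_attempts` results in a row. The function name and input format are hypothetical, and real nagios soft/hard state handling has more cases than this.

```python
def alerts_after_retries(results, max_check_attempts=3):
    """Return (index, state) for each point where an alert would fire.

    results: list of check states in time order, e.g. "OK", "WARNING",
             "CRITICAL" (one entry per check attempt; assumed format).
    An alert fires when the count of consecutive non-OK results first
    reaches max_check_attempts; an OK result resets the count.
    """
    consecutive = 0
    alerts = []
    for index, state in enumerate(results):
        if state == "OK":
            consecutive = 0
            continue
        consecutive += 1
        if consecutive == max_check_attempts:
            alerts.append((index, state))
    return alerts


if __name__ == "__main__":
    # Two failures followed by a recovery do not alert;
    # the later run of failures alerts on its third result.
    states = ["CRITICAL", "CRITICAL", "OK",
              "CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL"]
    print(alerts_after_retries(states))
```

Under these assumptions a failure that clears within two checks stays quiet, but anything that is still failing on the third attempt gets reported, which covers both the "out of sync" and the "can't reach the ntp server" cases.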
Interesting, in light of the vlan75-vlan48 problems we've had, but also fair enough.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INVALID
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations