Closed
Bug 1159440
Opened 9 years ago
Closed 8 years ago
NTP Offset flapping
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Infrastructure & Operations
Infrastructure: Other
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Unassigned)
References
Details
Today we had NTP flap a bunch of times on various buildbot masters, mobile imaging and related... [14:31:39] dividehex Callek|Buildduty: not sure why but it looks like ns1 and ns2 lost their peers for a moment [jwatkins@log-aggregator1.srv.releng.use1.mozilla.com ~]$ ntpq ntpq> pe remote refid st t when poll reach delay offset jitter ============================================================================== ns1.private.rel LOCAL(0) 11 u 786 1024 376 77.955 -492.52 441.405 *ns2.private.rel LOCAL(0) 11 u 245 1024 377 77.777 -190.23 179.332 ntpq> pe remote refid st t when poll reach delay offset jitter ============================================================================== ns1.private.rel 132.163.4.101 2 u 121 1024 7 80.750 34.690 27.478 ns2.private.rel 199.102.46.77 2 u 125 1024 7 77.889 191.867 0.272 ... For comparison around the same time: [ashish@nagios1.private.releng.scl3 ~]$ /usr/lib64/nagios/plugins/check_ntp_time -H ns1.private.releng.scl3.mozilla.com -w 2 -c 6 -v sending request to peer 0 response from peer 0: offset 0.1639742851 sending request to peer 0 response from peer 0: offset 0.1639977694 sending request to peer 0 response from peer 0: offset 0.1640081406 sending request to peer 0 response from peer 0: offset 0.1639726162 discarding peer 0: flags=3 overall average offset: 0 NTP CRITICAL: Offset unknown| [ashish@nagios1.private.scl3 ~]$ /usr/lib64/nagios/plugins/check_ntp_time -H admin1a.private.scl3.mozilla.com -w .5 -c 1 -v sending request to peer 0 response from peer 0: offset -5.483627319e-06 sending request to peer 0 response from peer 0: offset -2.74181366e-06 sending request to peer 0 response from peer 0: offset 4.053115845e-06 sending request to peer 0 response from peer 0: offset 2.372264862e-05 overall average offset: -5.483627319e-06 NTP OK: Offset -5.483627319e-06 secs|offset=-0.000005s;0.500000;1.000000;
Comment 1•9 years ago
|
||
first alerts showed up around 11:12am PT [11:12:07] <nagios-releng> Tue 11:08:46 PDT [4550] buildbot-master110.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:12:07] <nagios-releng> Tue 11:08:46 PDT [4551] buildbot-master107.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:12:07] <nagios-releng> Tue 11:08:46 PDT [4552] buildbot-master100.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:12:07] <nagios-releng> Tue 11:08:46 PDT [4553] mobile-imaging-001.p1.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:12:18] <nagios-releng> Tue 11:08:57 PDT [4554] buildbot-master111.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:12:27] <nagios-releng> Tue 11:09:06 PDT [4555] mobile-imaging-003.p3.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:12:27] <nagios-releng> Tue 11:09:06 PDT [4556] mobile-imaging-006.p6.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
Comment 2•9 years ago
|
||
ntp_time began recovering at 11:23 [11:23:33] <nagios-releng> Tue 11:20:13 PDT [4714] buildbot-master53.bb.releng.usw2.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:23:33] <nagios-releng> Tue 11:20:13 PDT [4715] signing6.srv.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time) [11:23:53] <nagios-releng> Tue 11:20:33 PDT [4716] buildbot-master114.bb.releng.use1.mozilla.com:ntp time is ^C3OK^C^C^C: NTP OK: Offset 0.000626206398 secs (http://m.mozilla.org/ntp+time) [11:23:53] <nagios-releng> Tue 11:20:33 PDT [4717] mac-v2-signing4.srv.releng.scl3.mozilla.com:ntp time is ^C3OK^C^C^C: NTP OK: Offset 0.1207700968 secs (http://m.mozilla.org/ntp+time)
Comment 3•9 years ago
|
||
NTP logs from ns1 and ns2. Times are UTC Apr 28 17:27:13 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0613 03 spike_detect +0.349434 s Apr 28 17:44:32 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 061c 0c clock_step +0.366460 s Apr 28 17:44:32 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0615 05 clock_sync Apr 28 17:44:33 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c618 08 no_sys_peer Apr 28 17:47:49 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0613 03 spike_detect +0.136448 s Apr 28 18:04:29 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 061c 0c clock_step +0.139315 s Apr 28 18:04:29 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0614 04 freq_mode Apr 28 18:04:30 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c618 08 no_sys_peer Apr 28 18:20:10 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c612 02 freq_set kernel 33.922 PPM Apr 28 18:20:10 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c615 05 clock_sync Apr 28 18:07:37 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.190897 s Apr 28 18:10:47 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.192798 s Apr 28 18:10:47 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0615 05 clock_sync Apr 28 18:10:48 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer
Comment 4•9 years ago
|
||
I'm just not sure why ns1/ns2 lost there sync but there is not much else to check here. It seemed to recover quickly anyway. We really need to fix the peering on the servers and probably get rid of the LOCAL ref clock especially since these are running on vm instances where the virt hw clock can drift all over the place. If you lose all external disciplined clock sources, it's best not to use any clock source at all and just know hosts are not syncing.
Comment 5•9 years ago
|
||
We had another instance of this today.
Comment 6•9 years ago
|
||
and again: buildbot-master69.bb.releng.use1.mozilla.com ntp time CRITICAL 05-06-2015 07:47:25 16d 16h 59m 29s 3/3 CHECK_NRPE: Socket timeout after 15 seconds. buildbot-master108.bb.releng.scl3.mozilla.com ntp peer CRITICAL 05-06-2015 07:51:29 0d 0h 0m 27s 1/10 NTP CRITICAL: Server not synchronized, Offset unknown buildbot-master110.bb.releng.scl3.mozilla.com ntp peer CRITICAL 05-06-2015 07:50:35 0d 0h 13m 21s 5/10 NTP CRITICAL: Server not synchronized, Offset unknown buildbot-master111.bb.releng.scl3.mozilla.com ntp peer WARNING 05-06-2015 07:50:24 0d 0h 13m 32s 9/10 NTP WARNING: Server has the LI_ALARM bit set, Offset -0.454638 secs buildbot-master112.bb.releng.scl3.mozilla.com ntp peer CRITICAL 05-06-2015 07:46:05 0d 0h 7m 51s 10/10 NTP CRITICAL: Server not synchronized, Offset unknown buildbot-master69.bb.releng.use1.mozilla.com ntp peer CRITICAL 05-06-2015 07:41:27 16d 17h 0m 3s 10/10 CHECK_NRPE: Socket timeout after 15 seconds. buildbot-master81.bb.releng.scl3.mozilla.com ntp peer CRITICAL 05-06-2015 07:48:54 0d 0h 21m 24s 10/10 NTP CRITICAL: Server not synchronized, Offset unknown log-aggregator1.srv.releng.scl3.mozilla.com ntp peer WARNING 05-06-2015 07:50:04 0d 0h 1m 52s 7/10 NTP WARNING: Server has the LI_ALARM bit set, Offset 0.335071 secs signing4.srv.releng.scl3.mozilla.com ntp peer WARNING 05-06-2015 07:50:11 0d 0h 8m 13s 10/10 NTP WARNING: Server has the LI_ALARM bit set, Offset -0.432752 secs
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Updated•9 years ago
|
Component: RelOps → Infrastructure: Other
QA Contact: arich → jdow
Comment 7•9 years ago
|
||
ns1.private.releng.scl3.mozilla.com ntpq -p localhost4.localdomain: timed out, nothing received ***Request timed out ns2.private.releng.scl3.mozilla.com ntpq -p localhost4.localdomain: timed out, nothing received ***Request timed out
Comment 8•9 years ago
|
||
pir pointed out that if you don't query localhost, you see the peers. This seems a bit broken, but maybe we don't run ntpd on localhost, and my check above was a red herring? # ntpq -c pee ns1.private.releng.scl3.mozilla.com remote refid st t when poll reach delay offset jitter ============================================================================== +108.61.194.85.v 201.198.247.252 2 u 261 1024 377 21.450 433.417 394.711 +173.44.32.10 216.218.254.202 2 u 94 1024 377 101.880 433.274 394.299 xfairy.mattnordh 129.6.15.29 2 u 452 1024 377 56.279 435.321 394.200 *time-a.timefreq .ACTS. 1 u 974 1024 377 51.045 -2.733 160.450 ns1.private.rel .STEP. 16 u - 1024 0 0.000 0.000 0.000 ns2.private.rel 166.70.136.41 2 u 29 128 15 0.134 426.600 2.086 LOCAL(0) .LOCL. 10 l 8d 64 0 0.000 0.000 0.000
Comment 9•9 years ago
|
||
ntpd being queryable on localhost would certainly aid in debugging, it has bitten me in the past.
Comment 10•9 years ago
|
||
[root@ns2.private.releng.scl3 pradcliffe]# fgrep ntp /var/log/messages May 6 12:48:45 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.171964 s May 6 12:57:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.171135 s May 6 12:57:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0615 05 clock_sync May 6 12:57:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer May 6 13:16:04 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.815485 s May 6 13:18:07 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.813646 s May 6 13:18:07 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0614 04 freq_mode May 6 13:18:08 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer May 6 13:33:36 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c612 02 freq_set kernel 21.310 PPM May 6 13:33:36 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c615 05 clock_sync May 6 13:51:51 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.131647 s May 6 14:05:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.130700 s May 6 14:05:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0615 05 clock_sync May 6 14:05:16 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer May 6 14:27:16 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.193576 s May 6 14:31:40 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.191180 s May 6 14:31:40 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0614 04 freq_mode May 6 14:31:41 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer May 6 14:49:09 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c612 02 freq_set kernel 14.777 PPM May 6 14:49:09 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c615 05 clock_sync
Comment 11•9 years ago
|
||
[root@ns1.private.releng.scl3 pradcliffe]# fgrep ntp /var/log/messages May 6 15:04:05 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0613 03 spike_detect +0.431007 s
Comment 12•9 years ago
|
||
Hm, is it true that we're not monitoring ntp on the ntp servers themselves? That seems like an oversight. I would expect that they would be monitored the same way all the rest of the infra ntp servers are monitored.
Assignee: jwatkins → infra
Comment 13•9 years ago
|
||
(In reply to Amy Rich [:arich] [:arr] from comment #12) > Hm, is it true that we're not monitoring ntp on the ntp servers themselves? > That seems like an oversight. I would expect that they would be monitored > the same way all the rest of the infra ntp servers are monitored. It doesn't appear as if ntp is being monitored. It's being monitored by releng's nagios fork, which has different service groups and different service checks.
Comment 14•9 years ago
|
||
there's not really a fork, per se, it still uses all the same check definitions, etc. Can we get the check added?
Updated•9 years ago
|
Comment 15•8 years ago
|
||
(In reply to Amy Rich [:arr] [:arich] from comment #14) > there's not really a fork, per se, it still uses all the same check > definitions, etc. Can we get the check added? You're correct. The root cause though turned out to be slightly different, which was fixed in 1237705. Details are over there. Example check results after the change: [root@ns1.private.releng.scl3 ~]# /usr/lib64/nagios/plugins/custom/check_ntp_time_multi NTP OK: Offset 0.002087712288 secs|offset=0.002088s;60.000000;120.000000; (0.us.pool.ntp.org)
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
Updated•8 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 16•8 years ago
|
||
Reports from releng of ntp alerting flaps: 11:47 AM <nagios-releng> Mon 08:47:48 PST [4002] mac-v2-signing3.srv.releng.scl3.mozilla.com:ntp peer is WARNING: NTP WARNING: Server has the LI_ALARM bit set, Offset -0.510396 secs (http://m.mozilla.org/ntp+peer) 11:56 AM <nagios-releng> Mon 08:56:39 PST [4003] mac-v2-signing6.srv.releng.scl3.mozilla.com:ntp peer is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown (http://m.mozilla.org/ntp+peer) 12:00 PM <nagios-releng> Mon 09:00:48 PST [4004] partner-repack1.srv.releng.scl3.mozilla.com:ntp peer is WARNING: NTP WARNING: Server has the LI_ALARM bit set, Offset 0.031698 secs (http://m.mozilla.org/ntp+peer) 12:02 PM <nagios-releng> Mon 09:02:48 PST [4005] mac-v2-signing3.srv.releng.scl3.mozilla.com:ntp peer is OK: NTP OK: Offset 0.036437 secs (http://m.mozilla.org/ntp+peer) 12:11 PM <nagios-releng> Mon 09:11:38 PST [4006] mac-v2-signing6.srv.releng.scl3.mozilla.com:ntp peer is OK: NTP OK: Offset 0.022487 secs (http://m.mozilla.org/ntp+peer) No noticeable fallout as yet. Other information provided: $ntp_servers = [ "ns1.private.releng.scl3.mozilla.com", "ns2.private.releng.scl3.mozilla.com", "ns1.private.scl3.mozilla.com", "ns2.private.scl3.mozilla.com" ] -bash-4.1$ ntpq -c pee remote refid st t when poll reach delay offset jitter ============================================================================== *ns1.private.rel 64.71.128.26 3 u 822 1024 7 0.954 -145.17 0.237 xns2.private.rel 206.108.0.132 2 u 886 1024 3 0.922 0.773 0.973
Comment 17•8 years ago
|
||
From ashish: http://sourceforge.net/p/nagiosplug/mailman/message/24582375/
Comment 18•8 years ago
|
||
[root@ns1.private.releng.scl3 ~]# ntpq -c peers remote refid st t when poll reach delay offset jitter ============================================================================== +216.152.240.220 204.9.54.119 2 u 685 1024 77 14.462 55.086 39.741 *100tx-f1-0.c720 216.218.254.202 2 u 848 1024 17 3.737 42.228 29.385 +services.quadra 64.71.128.26 3 u 657 1024 77 12.030 43.424 29.217 +blue.1e400.net 108.31.199.162 2 u 709 1024 77 79.013 45.946 29.268 ns1.private.rel .STEP. 16 u - 1024 0 0.000 0.000 0.000 xns2.private.rel 206.108.0.132 2 u 147 1024 37 1.338 -17.361 128.488 [root@ns2.private.releng.scl3 ~]# ntpq -c peers remote refid st t when poll reach delay offset jitter ============================================================================== *ntp2.torix.ca .PPS. 1 u 381 1024 77 65.121 -100.99 77.279 +zombie.frizzen. 129.250.35.251 3 u 193 1024 377 12.132 -101.89 77.723 +209.118.204.201 128.9.176.30 2 u 389 1024 77 3.859 73.857 147.219 +time.devry.net 10.152.2.2 2 u 221 1024 377 54.978 77.627 149.611 +ns1.private.rel 64.71.128.26 3 u 803 1024 16 6.411 -139.74 53.961 ns2.private.rel .STEP. 16 u - 1024 0 0.000 0.000 0.000 [root@ns2.private.releng.scl3 ~]#
Comment 19•8 years ago
|
||
Pointed out that this seems to be a new issue, so new bug. Handing this in bug 1242619
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•