Closed Bug 1159440 Opened 9 years ago Closed 8 years ago

NTP Offset flapping

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Unassigned)

References

Details

Today we had NTP flap a bunch of times on various buildbot masters, mobile imaging and related...

[14:31:39]	dividehex	Callek|Buildduty: not sure why but it looks like ns1 and ns2 lost their peers for a moment

[jwatkins@log-aggregator1.srv.releng.use1.mozilla.com ~]$ ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ns1.private.rel LOCAL(0)        11 u  786 1024  376   77.955  -492.52 441.405
*ns2.private.rel LOCAL(0)        11 u  245 1024  377   77.777  -190.23 179.332

ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ns1.private.rel 132.163.4.101    2 u  121 1024    7   80.750   34.690  27.478
 ns2.private.rel 199.102.46.77    2 u  125 1024    7   77.889  191.867   0.272

...

For comparison around the same time:

    [ashish@nagios1.private.releng.scl3 ~]$ /usr/lib64/nagios/plugins/check_ntp_time -H ns1.private.releng.scl3.mozilla.com -w 2 -c 6 -v
    sending request to peer 0
    response from peer 0: offset 0.1639742851
    sending request to peer 0
    response from peer 0: offset 0.1639977694
    sending request to peer 0
    response from peer 0: offset 0.1640081406
    sending request to peer 0
    response from peer 0: offset 0.1639726162
    discarding peer 0: flags=3
    overall average offset: 0
    NTP CRITICAL: Offset unknown|
     
     
    [ashish@nagios1.private.scl3 ~]$ /usr/lib64/nagios/plugins/check_ntp_time -H admin1a.private.scl3.mozilla.com -w .5 -c 1 -v
    sending request to peer 0
    response from peer 0: offset -5.483627319e-06
    sending request to peer 0
    response from peer 0: offset -2.74181366e-06
    sending request to peer 0
    response from peer 0: offset 4.053115845e-06
    sending request to peer 0
    response from peer 0: offset 2.372264862e-05
    overall average offset: -5.483627319e-06
    NTP OK: Offset -5.483627319e-06 secs|offset=-0.000005s;0.500000;1.000000;
first alerts showed up around 11:12am PT

[11:12:07] <nagios-releng> Tue 11:08:46 PDT [4550] buildbot-master110.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:12:07] <nagios-releng> Tue 11:08:46 PDT [4551] buildbot-master107.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:12:07] <nagios-releng> Tue 11:08:46 PDT [4552] buildbot-master100.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:12:07] <nagios-releng> Tue 11:08:46 PDT [4553] mobile-imaging-001.p1.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:12:18] <nagios-releng> Tue 11:08:57 PDT [4554] buildbot-master111.bb.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:12:27] <nagios-releng> Tue 11:09:06 PDT [4555] mobile-imaging-003.p3.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:12:27] <nagios-releng> Tue 11:09:06 PDT [4556] mobile-imaging-006.p6.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
ntp_time began recovering at 11:23

[11:23:33] <nagios-releng> Tue 11:20:13 PDT [4714] buildbot-master53.bb.releng.usw2.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:23:33] <nagios-releng> Tue 11:20:13 PDT [4715] signing6.srv.releng.scl3.mozilla.com:ntp time is ^C4CRITICAL^C^C^C: NTP CRITICAL: Offset unknown (http://m.mozilla.org/ntp+time)
[11:23:53] <nagios-releng> Tue 11:20:33 PDT [4716] buildbot-master114.bb.releng.use1.mozilla.com:ntp time is ^C3OK^C^C^C: NTP OK: Offset 0.000626206398 secs (http://m.mozilla.org/ntp+time)
[11:23:53] <nagios-releng> Tue 11:20:33 PDT [4717] mac-v2-signing4.srv.releng.scl3.mozilla.com:ntp time is ^C3OK^C^C^C: NTP OK: Offset 0.1207700968 secs (http://m.mozilla.org/ntp+time)
NTP logs from ns1 and ns2.  Times are UTC

Apr 28 17:27:13 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0613 03 spike_detect +0.349434 s
Apr 28 17:44:32 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 061c 0c clock_step +0.366460 s
Apr 28 17:44:32 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0615 05 clock_sync
Apr 28 17:44:33 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c618 08 no_sys_peer
Apr 28 17:47:49 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0613 03 spike_detect +0.136448 s
Apr 28 18:04:29 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 061c 0c clock_step +0.139315 s
Apr 28 18:04:29 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0614 04 freq_mode
Apr 28 18:04:30 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c618 08 no_sys_peer
Apr 28 18:20:10 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c612 02 freq_set kernel 33.922 PPM
Apr 28 18:20:10 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 c615 05 clock_sync

Apr 28 18:07:37 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.190897 s
Apr 28 18:10:47 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.192798 s
Apr 28 18:10:47 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0615 05 clock_sync
Apr 28 18:10:48 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer
I'm just not sure why ns1/ns2 lost there sync but there is not much else to check here.  It seemed to recover quickly anyway.  We really need to fix the peering on the servers and probably get rid of the LOCAL ref clock especially since these are running on vm instances where the virt hw clock can drift all over the place.  If you lose all external disciplined clock sources, it's best not to use any clock source at all and just know hosts are not syncing.
Blocks: 1126870
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
We had another instance of this today.
and again:

buildbot-master69.bb.releng.use1.mozilla.com
ntp time
  CRITICAL  05-06-2015 07:47:25   16d 16h 59m 29s   3/3   CHECK_NRPE: Socket timeout after 15 seconds.
buildbot-master108.bb.releng.scl3.mozilla.com
ntp peer
  CRITICAL  05-06-2015 07:51:29   0d 0h 0m 27s  1/10  NTP CRITICAL: Server not synchronized, Offset unknown
buildbot-master110.bb.releng.scl3.mozilla.com
ntp peer
  CRITICAL  05-06-2015 07:50:35   0d 0h 13m 21s   5/10  NTP CRITICAL: Server not synchronized, Offset unknown
buildbot-master111.bb.releng.scl3.mozilla.com
ntp peer
  WARNING   05-06-2015 07:50:24   0d 0h 13m 32s   9/10  NTP WARNING: Server has the LI_ALARM bit set, Offset -0.454638 secs
buildbot-master112.bb.releng.scl3.mozilla.com
ntp peer
  CRITICAL  05-06-2015 07:46:05   0d 0h 7m 51s  10/10   NTP CRITICAL: Server not synchronized, Offset unknown
buildbot-master69.bb.releng.use1.mozilla.com
ntp peer
  CRITICAL  05-06-2015 07:41:27   16d 17h 0m 3s   10/10   CHECK_NRPE: Socket timeout after 15 seconds.
buildbot-master81.bb.releng.scl3.mozilla.com
ntp peer
  CRITICAL  05-06-2015 07:48:54   0d 0h 21m 24s   10/10   NTP CRITICAL: Server not synchronized, Offset unknown
log-aggregator1.srv.releng.scl3.mozilla.com
ntp peer
  WARNING   05-06-2015 07:50:04   0d 0h 1m 52s  7/10  NTP WARNING: Server has the LI_ALARM bit set, Offset 0.335071 secs
signing4.srv.releng.scl3.mozilla.com
ntp peer
  WARNING   05-06-2015 07:50:11   0d 0h 8m 13s  10/10   NTP WARNING: Server has the LI_ALARM bit set, Offset -0.432752 secs
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Component: RelOps → Infrastructure: Other
QA Contact: arich → jdow
ns1.private.releng.scl3.mozilla.com
ntpq -p
localhost4.localdomain: timed out, nothing received
***Request timed out



ns2.private.releng.scl3.mozilla.com
ntpq -p
localhost4.localdomain: timed out, nothing received
***Request timed out
pir pointed out that if you don't query localhost, you see the peers. This seems a bit broken, but maybe we don't run ntpd on localhost, and my check above was a red herring?

# ntpq -c pee ns1.private.releng.scl3.mozilla.com
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+108.61.194.85.v 201.198.247.252  2 u  261 1024  377   21.450  433.417 394.711
+173.44.32.10    216.218.254.202  2 u   94 1024  377  101.880  433.274 394.299
xfairy.mattnordh 129.6.15.29      2 u  452 1024  377   56.279  435.321 394.200
*time-a.timefreq .ACTS.           1 u  974 1024  377   51.045   -2.733 160.450
 ns1.private.rel .STEP.          16 u    - 1024    0    0.000    0.000   0.000
 ns2.private.rel 166.70.136.41    2 u   29  128   15    0.134  426.600   2.086
 LOCAL(0)        .LOCL.          10 l   8d   64    0    0.000    0.000   0.000
ntpd being queryable on localhost would certainly aid in debugging, it has bitten me in the past.
[root@ns2.private.releng.scl3 pradcliffe]# fgrep ntp /var/log/messages
May  6 12:48:45 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.171964 s
May  6 12:57:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.171135 s
May  6 12:57:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0615 05 clock_sync
May  6 12:57:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer
May  6 13:16:04 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.815485 s
May  6 13:18:07 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.813646 s
May  6 13:18:07 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0614 04 freq_mode
May  6 13:18:08 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer
May  6 13:33:36 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c612 02 freq_set kernel 21.310 PPM
May  6 13:33:36 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c615 05 clock_sync
May  6 13:51:51 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.131647 s
May  6 14:05:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.130700 s
May  6 14:05:12 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0615 05 clock_sync
May  6 14:05:16 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer
May  6 14:27:16 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0613 03 spike_detect +0.193576 s
May  6 14:31:40 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 061c 0c clock_step +0.191180 s
May  6 14:31:40 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 0614 04 freq_mode
May  6 14:31:41 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c618 08 no_sys_peer
May  6 14:49:09 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c612 02 freq_set kernel 14.777 PPM
May  6 14:49:09 ns2.private.releng.scl3.mozilla.com ntpd[15104]: 0.0.0.0 c615 05 clock_sync
[root@ns1.private.releng.scl3 pradcliffe]# fgrep ntp /var/log/messages
May  6 15:04:05 ns1.private.releng.scl3.mozilla.com ntpd[12810]: 0.0.0.0 0613 03 spike_detect +0.431007 s
Hm, is it true that we're not monitoring ntp on the ntp servers themselves? That seems like an oversight. I would expect that they would be monitored the same way all the rest of the infra ntp servers are monitored.
Assignee: jwatkins → infra
(In reply to Amy Rich [:arich] [:arr] from comment #12)
> Hm, is it true that we're not monitoring ntp on the ntp servers themselves?
> That seems like an oversight. I would expect that they would be monitored
> the same way all the rest of the infra ntp servers are monitored.

It doesn't appear as if ntp is being monitored. It's being monitored by releng's nagios fork, which has different service groups and different service checks.
there's not really a fork, per se, it still uses all the same check definitions, etc. Can we get the check added?
No longer blocks: 1126870
Depends on: 1126870
(In reply to Amy Rich [:arr] [:arich] from comment #14)
> there's not really a fork, per se, it still uses all the same check
> definitions, etc. Can we get the check added?

You're correct. The root cause though turned out to be slightly different, which was fixed in 1237705. Details are over there. 

Example check results after the change:

[root@ns1.private.releng.scl3 ~]# /usr/lib64/nagios/plugins/custom/check_ntp_time_multi
NTP OK: Offset 0.002087712288 secs|offset=0.002088s;60.000000;120.000000; (0.us.pool.ntp.org)
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reports from releng of ntp alerting flaps:

11:47 AM <nagios-releng> Mon 08:47:48 PST [4002] mac-v2-signing3.srv.releng.scl3.mozilla.com:ntp peer is WARNING: NTP WARNING: Server has the LI_ALARM bit set, Offset -0.510396 secs (http://m.mozilla.org/ntp+peer)
11:56 AM <nagios-releng> Mon 08:56:39 PST [4003] mac-v2-signing6.srv.releng.scl3.mozilla.com:ntp peer is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown (http://m.mozilla.org/ntp+peer)
12:00 PM <nagios-releng> Mon 09:00:48 PST [4004] partner-repack1.srv.releng.scl3.mozilla.com:ntp peer is WARNING: NTP WARNING: Server has the LI_ALARM bit set, Offset 0.031698 secs (http://m.mozilla.org/ntp+peer)
12:02 PM <nagios-releng> Mon 09:02:48 PST [4005] mac-v2-signing3.srv.releng.scl3.mozilla.com:ntp peer is OK: NTP OK: Offset 0.036437 secs (http://m.mozilla.org/ntp+peer)
12:11 PM <nagios-releng> Mon 09:11:38 PST [4006] mac-v2-signing6.srv.releng.scl3.mozilla.com:ntp peer is OK: NTP OK: Offset 0.022487 secs (http://m.mozilla.org/ntp+peer)

No noticeable fallout as yet.

Other information provided:

$ntp_servers = [ "ns1.private.releng.scl3.mozilla.com",
                     "ns2.private.releng.scl3.mozilla.com",
                     "ns1.private.scl3.mozilla.com",
                     "ns2.private.scl3.mozilla.com" ]

-bash-4.1$ ntpq -c pee
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ns1.private.rel 64.71.128.26     3 u  822 1024    7    0.954  -145.17   0.237
xns2.private.rel 206.108.0.132    2 u  886 1024    3    0.922    0.773   0.973
[root@ns1.private.releng.scl3 ~]# ntpq -c peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+216.152.240.220 204.9.54.119     2 u  685 1024   77   14.462   55.086  39.741
*100tx-f1-0.c720 216.218.254.202  2 u  848 1024   17    3.737   42.228  29.385
+services.quadra 64.71.128.26     3 u  657 1024   77   12.030   43.424  29.217
+blue.1e400.net  108.31.199.162   2 u  709 1024   77   79.013   45.946  29.268
 ns1.private.rel .STEP.          16 u    - 1024    0    0.000    0.000   0.000
xns2.private.rel 206.108.0.132    2 u  147 1024   37    1.338  -17.361 128.488
[root@ns2.private.releng.scl3 ~]# ntpq -c peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp2.torix.ca   .PPS.            1 u  381 1024   77   65.121  -100.99  77.279
+zombie.frizzen. 129.250.35.251   3 u  193 1024  377   12.132  -101.89  77.723
+209.118.204.201 128.9.176.30     2 u  389 1024   77    3.859   73.857 147.219
+time.devry.net  10.152.2.2       2 u  221 1024  377   54.978   77.627 149.611
+ns1.private.rel 64.71.128.26     3 u  803 1024   16    6.411  -139.74  53.961
 ns2.private.rel .STEP.          16 u    - 1024    0    0.000    0.000   0.000
[root@ns2.private.releng.scl3 ~]#
Pointed out that this seems to be a new issue, so new bug. Handing this in bug 1242619
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.