Closed Bug 772792 Opened 12 years ago Closed 12 years ago

NTP servers are lagging by 25s

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Unassigned)

Details

Around 0350, servers in SCL3 and PHX1 alerted on an NTP offset of 25s or "Offset Unknown".
[ashish@ns1.private.scl3 ~]$ sudo ntpq
ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 time3.chpc.utah 198.60.22.240    2 u   63   64    7   27.397   -0.466   0.625
 ns1.your-site.c 108.59.14.130    3 u    2   64   17   70.100    5.634   0.226
 server1.epic-fa 64.6.144.6       3 u   41   64    7   61.054    1.478   0.028
 tick.tadatv.com 10.0.22.51       2 u   36   64   17    4.131  -306.64   0.506
 ns1.private.scl .STEP.          16 u    -   64    0    0.000    0.000   0.000
 ns2.private.scl .INIT.          16 u   31   64    0    0.000    0.000   0.000
*LOCAL(0)        .LOCL.          10 l   27   64   17    0.000    0.000   0.000
ntpq> quit

[ashish@ns2.private.scl3 ~]$ sudo ntpq
ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 pool-test.ntp.o 216.218.254.202  2 u    -   64    7    1.804    0.706   0.048
 javanese.kjsl.c 69.36.224.15     2 u   62   64    3   73.729    2.167   0.138
 clock.team-cymr 172.16.32.4      2 u   61   64    3   58.243    0.407   0.187
 razer.justynshu 132.163.4.103    2 u   62   64    3   55.319   -7.560   0.137
 ns1.private.scl .STEP.          16 u   49   64    0    0.000    0.000   0.000
 ns2.private.scl .INIT.          16 u    -   64    0    0.000    0.000   0.000
 LOCAL(0)        .LOCL.          10 l   59   64    3    0.000    0.000   0.000
Assignee: server-ops → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: phong → jdow
Recoveries all around and the nameservers are set to the right time now.
Severity: major → normal
ip-admin02 in PHX1 was way off (ntpq reports offsets in milliseconds, so roughly 26s):

[root@ip-admin02 ~]# /usr/sbin/ntpq -n -c peers localhost
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+64.191.49.17    131.107.13.100   2 u    2   64  377   72.564  25960.9   0.731
+4.53.160.75     220.183.68.66    2 u    1   64  377   61.722  25956.3   0.566
-2605:1b00:0:1:: 134.87.112.4     2 u   61   64  177   60.162  25949.4   1.327
*96.44.142.5     132.163.4.101    2 u   63   64  177   38.955  25961.0   0.568
 10.8.75.9       .INIT.          16 u    -   64    0    0.000    0.000   0.000
 127.127.1.0     .LOCL.          10 l   57   64  177    0.000    0.000   0.001
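
For anyone checking other hosts, the output above can be filtered for peers with a large offset. This is only a rough sketch: the function name and the 1000 ms default threshold are made up for illustration, and it assumes the standard ten-column peers layout with the offset (in milliseconds) in the ninth field.

```shell
# Print peers whose absolute offset exceeds a threshold given in milliseconds.
# flag_large_offsets and its default threshold are hypothetical, not existing tooling.
flag_large_offsets() {
    awk -v limit="${1:-1000}" '
        $1 == "remote" || /^=+/ { next }   # skip the header and separator lines
        NF >= 10 {
            off = $9 + 0                   # offset column, milliseconds
            if (off < 0) off = -off
            if (off > limit)
                printf "%s offset %.1f s\n", $1, $9 / 1000
        }'
}

# Usage on a live host:
#   ntpq -n -c peers localhost | flag_large_offsets 1000
```

The tally character (+, -, *) is left attached to the peer name, so the selection status stays visible in the result.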

Its peers were showing it with a huge offset and jitter:

[root@ns2.private.phx1 pradcliffe]# ntpq -c peers localhost
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+4.53.160.74     220.183.68.66    2 u   77  256  377   61.724   -0.851   0.317
+64.73.32.135    198.30.92.2      2 u    8  256  377   71.163   -2.616   0.554
-109.lanets.ca   209.51.161.238   2 u    2  256  377   77.323    2.374   0.687
*ntp3.junkemailf 216.218.254.202  2 u   89  256  377   19.911   -4.733   0.452
 ip-admin02.phx. LOCAL(0)        11 u   59  256  371    0.185  -25956. 25956.9
 LOCAL(0)        .LOCL.          10 l   47   64  377    0.000    0.000   0.000


Took down ntp on ip-admin02, set the time correctly, restarted ntp.
Bounced ntpd on ns1/2.
The dependent servers recovered, gradually.
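
The steps above could look roughly like the following. This is an assumption about the commands, not a record of what was actually typed: the comment does not say how the time was set, so ntpdate is a guess, and resync_host and the APPLY switch are hypothetical names. By default it only prints the commands.

```shell
# Stop ntpd, step the clock from a reference server, start ntpd again.
# Prints the commands unless APPLY=1 is set (hypothetical switch).
resync_host() {
    ref_server=$1                       # NTP server to step the clock from
    if [ "${APPLY:-0}" = 1 ]; then
        prefix=                         # run the commands for real (needs root)
    else
        prefix=echo                     # dry run: only print what would be run
    fi
    $prefix service ntpd stop
    $prefix ntpdate -u "$ref_server"
    $prefix service ntpd start
}

# Usage:
#   resync_host ns1.private.phx1            # dry run
#   APPLY=1 resync_host ns1.private.phx1    # actually do it
```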
Group: infra
Think I got the root cause, thanks to tmary:

* The leapsecond.sh script was still enabled in puppet
* leapsecond.sh has a bug - it sets the date to date +'%Y-%m-%d %H:%M:%C', where %C is the *Century*, and always outputs "20"
* leapsecond.sh checks for /tmp/leapsecond_2012_06_30 on the host that it runs on. Quite likely, /tmp/leapsecond_2012_06_30 got removed by tmpcleaner (or equivalent).
* On the next puppet run, since /tmp/leapsecond_2012_06_30 was missing, leapsecond.sh ran again and set the clock to the current minute with the seconds forced to 20, which could be ahead of or behind the actual time. This explains why most servers were off by the same ~26s.
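
The format bug is easy to reproduce without touching the clock. The snippet below only prints timestamps. That %S was the intended specifier is an assumption on my part, since the script itself is not quoted here.

```shell
# %C expands to the century, so the last field is "20" for any year 2000-2099.
buggy=$(date +'%Y-%m-%d %H:%M:%C')
# %S expands to the seconds of the current minute; presumably what was meant.
likely_intended=$(date +'%Y-%m-%d %H:%M:%S')

echo "with %C: $buggy"
echo "with %S: $likely_intended"

# If the clock were set from the %C string right now, it would jump by this
# many seconds (negative means backwards).
sec=${likely_intended##*:}
sec=${sec#0}                  # drop a leading zero so 08/09 are not read as octal
jump=$((20 - sec))
echo "clock would jump by ${jump}s"
```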

Fix:
* I've removed the leapsecond.sh in the base init.pp so it shouldn't run anymore.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This was made worse by a lot of hosts in phx only having one ntp server configured, 10.8.75.9, whereas many others have the far more sensible pair of ns1/ns2.private.phx1.

The remaining hosts should be brought in line and configured with that pair of ntp servers.
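
To find the hosts that still have a single upstream, something like this could be run against each host's ntp.conf. It is a sketch only: count_ntp_servers is a hypothetical helper, and it assumes plain "server <address>" lines, ignoring the 127.127.x.x local-clock entries.

```shell
# Count upstream "server" lines in an ntp.conf-style file,
# ignoring comments and the 127.127.x.x local reference clock.
count_ntp_servers() {
    awk '$1 == "server" && $2 !~ /^127\.127\./ { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   n=$(count_ntp_servers /etc/ntp.conf)
#   [ "$n" -ge 2 ] || echo "only $n upstream ntp server(s) configured"
```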
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations