Closed
Bug 772792
Opened 12 years ago
Closed 12 years ago
NTP servers are lagging by 25s
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ashish, Unassigned)
Details
Around 0350, servers in SCL3 and PHX1 alerted about an NTP offset of 25s or "Offset Unknown".
Reporter
Comment 1•12 years ago
[ashish@ns1.private.scl3 ~]$ sudo ntpq
ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 time3.chpc.utah 198.60.22.240    2 u   63   64    7   27.397   -0.466   0.625
 ns1.your-site.c 108.59.14.130    3 u    2   64   17   70.100    5.634   0.226
 server1.epic-fa 64.6.144.6       3 u   41   64    7   61.054    1.478   0.028
 tick.tadatv.com 10.0.22.51       2 u   36   64   17    4.131  -306.64   0.506
 ns1.private.scl .STEP.          16 u    -   64    0    0.000    0.000   0.000
 ns2.private.scl .INIT.          16 u   31   64    0    0.000    0.000   0.000
*LOCAL(0)        .LOCL.          10 l   27   64   17    0.000    0.000   0.000
ntpq> quit

[ashish@ns2.private.scl3 ~]$ sudo ntpq
ntpq> peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 pool-test.ntp.o 216.218.254.202  2 u    -   64    7    1.804    0.706   0.048
 javanese.kjsl.c 69.36.224.15     2 u   62   64    3   73.729    2.167   0.138
 clock.team-cymr 172.16.32.4      2 u   61   64    3   58.243    0.407   0.187
 razer.justynshu 132.163.4.103    2 u   62   64    3   55.319   -7.560   0.137
 ns1.private.scl .STEP.          16 u   49   64    0    0.000    0.000   0.000
 ns2.private.scl .INIT.          16 u    -   64    0    0.000    0.000   0.000
 LOCAL(0)        .LOCL.          10 l   59   64    3    0.000    0.000   0.000
Reporter
Updated•12 years ago
Assignee: server-ops → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: phong → jdow
Reporter
Comment 2•12 years ago
Recoveries all around and the nameservers are set to the right time now.
Severity: major → normal
Comment 3•12 years ago
phx ip-admin2 was way off:

[root@ip-admin02 ~]# /usr/sbin/ntpq -n -c peers localhost
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+64.191.49.17    131.107.13.100   2 u    2   64  377   72.564  25960.9   0.731
+4.53.160.75     220.183.68.66    2 u    1   64  377   61.722  25956.3   0.566
-2605:1b00:0:1:: 134.87.112.4     2 u   61   64  177   60.162  25949.4   1.327
*96.44.142.5     132.163.4.101    2 u   63   64  177   38.955  25961.0   0.568
 10.8.75.9       .INIT.          16 u    -   64    0    0.000    0.000   0.000
 127.127.1.0     .LOCL.          10 l   57   64  177    0.000    0.000   0.001

Peers were showing it with a huge jitter:

[root@ns2.private.phx1 pradcliffe]# ntpq -c peers localhost
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+4.53.160.74     220.183.68.66    2 u   77  256  377   61.724   -0.851   0.317
+64.73.32.135    198.30.92.2      2 u    8  256  377   71.163   -2.616   0.554
-109.lanets.ca   209.51.161.238   2 u    2  256  377   77.323    2.374   0.687
*ntp3.junkemailf 216.218.254.202  2 u   89  256  377   19.911   -4.733   0.452
 ip-admin02.phx. LOCAL(0)        11 u   59  256  371    0.185  -25956. 25956.9
 LOCAL(0)        .LOCL.          10 l   47   64  377    0.000    0.000   0.000

Took down ntp on ip-admin02, set the time correctly, restarted ntp. Bounced ntpd on ns1/2. The dependent servers recovered, gradually.
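For anyone wanting to spot this kind of skew quickly, here is a rough sketch of a filter over `ntpq -n -c peers` output. It assumes the usual ten-column layout shown above, where the ninth field is the offset in milliseconds. The function name `flag_large_offsets` and the 1000 ms threshold are made up for illustration, and the sample rows are copied from the ip-admin02 output.

```shell
#!/bin/sh
# Sketch only: assumes the standard ntpq peers layout, with the offset
# (in milliseconds) in the ninth whitespace-separated field.

# Print "peer offset" for every peer whose absolute offset exceeds $1 ms.
# Reads ntpq peers output on stdin and skips the two header lines.
flag_large_offsets() {
    awk -v limit="$1" 'NR > 2 {
        off = $9 + 0
        if (off < 0) off = -off
        if (off > limit) print $1, $9
    }'
}

# Sample input taken from the ip-admin02 output in this bug.
flag_large_offsets 1000 <<'EOF'
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+64.191.49.17    131.107.13.100   2 u    2   64  377   72.564  25960.9   0.731
*96.44.142.5     132.163.4.101    2 u   63   64  177   38.955  25961.0   0.568
 127.127.1.0     .LOCL.          10 l   57   64  177    0.000    0.000   0.001
EOF
```

On a live host the same function could be fed from `ntpq -n -c peers localhost`. The tally character (`+`, `*`, `-`) stays attached to the peer name in the output.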
Updated•12 years ago
Group: infra
Reporter
Comment 4•12 years ago
Think I got the root cause, thanks to tmary:
* The leapsecond.sh script was still enabled in puppet.
* leapsecond.sh has a bug: it sets the date to date +'%Y-%m-%d %H:%M:%C', where %C is the *Century* and always outputs "20".
* leapsecond.sh checks for /tmp/leapsecond_2012_06_30 on the host that it runs on. Quite likely, /tmp/leapsecond_2012_06_30 got removed by tmpcleaner (or equivalent).
* On the next puppet run, since /tmp/leapsecond_2012_06_30 was missing, the leapsecond.sh script was run, which set the date to the current min:20s, which might have been ahead of or behind the current time.

This explains why most servers were off by the same ~26s.

Fix:
* I've removed the leapsecond.sh in the base init.pp so it shouldn't run anymore.
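To make the `%C` problem concrete, here is a small sketch that formats a fixed instant both ways and computes the resulting clock error. It assumes GNU date (for `-d`). The run time `03:50:46` is a hypothetical value chosen for illustration, and `%S` is only my guess at what the script meant to use. The function names are invented here and are not from leapsecond.sh.

```shell
#!/bin/sh
# Sketch only: GNU date is assumed for the -d option.

# Format an instant the way the buggy script did: %C is the century ("20").
buggy_stamp() {
    date -u -d "$1 UTC" +'%Y-%m-%d %H:%M:%C'
}

# Presumably intended format (an assumption): %S is the seconds field.
intended_stamp() {
    date -u -d "$1 UTC" +'%Y-%m-%d %H:%M:%S'
}

# Signed error in seconds if the clock were set to the buggy stamp.
clock_error() {
    buggy=$(date -u -d "$(buggy_stamp "$1") UTC" +%s)
    real=$(date -u -d "$1 UTC" +%s)
    echo $((buggy - real))
}

now='2012-07-10 03:50:46'   # hypothetical time of the puppet run
buggy_stamp "$now"          # prints 2012-07-10 03:50:20
intended_stamp "$now"       # prints 2012-07-10 03:50:46
clock_error "$now"          # prints -26
```

The error is simply 20 minus the seconds value at the moment the script ran, so it can land anywhere from -39s to +20s depending on when in the minute the run happened.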
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 5•12 years ago
This was made worse by a lot of hosts in phx having only one ntp server configured, 10.8.75.9, whereas a lot of others have the far more sensible pair of ns1/ns2.private.phx1. The rest should be brought in line with that pair of ntp servers.
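A possible way to find the single-server hosts is to count the `server` lines in each host's ntp.conf. The sketch below assumes the config uses plain `server <address>` directives and treats `127.127.x.x` entries as the local clock rather than a real upstream. The function name `count_ntp_servers` and the sample file contents are illustrative only, not taken from any actual host.

```shell
#!/bin/sh
# Sketch only: counts upstream "server" directives in an ntp.conf-style file,
# ignoring local reference clocks in the 127.127.x.x pseudo-address range.
count_ntp_servers() {
    grep -E '^[[:space:]]*server[[:space:]]+' "$1" | grep -vc '127\.127\.'
}

# Illustrative config resembling the single-server hosts described above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
server 10.8.75.9
server 127.127.1.0
fudge 127.127.1.0 stratum 10
EOF

n=$(count_ntp_servers "$sample")
if [ "$n" -lt 2 ]; then
    echo "only $n upstream server(s) configured"
fi
rm -f "$sample"
```

Run against the real /etc/ntp.conf on each host, anything reporting fewer than two would be a candidate for moving to the ns1/ns2 pair.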
Updated•11 years ago
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations