Bug 1227243
Opened 9 years ago
Closed 9 years ago
lots of ntp issues today
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: arich, Assigned: bhourigan)
We've seen a number of ntp alerts of various types today.
First occurrence seems to have been:
Mon 06:47:30 PST [4005] mac-v2-signing3.srv.releng.scl3.mozilla.com:ntp peer is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown (http://m.mozilla.org/ntp+peer)
My first thought was that this had something to do with the evac of phx1, so we removed the phx1 servers from the config and replaced them with other scl3 servers (change went out a bit after 9:00 PST):
server ns1.private.releng.scl3.mozilla.com iburst
server ns2.private.releng.scl3.mozilla.com iburst
server ns1.private.scl3.mozilla.com iburst
server ns2.private.scl3.mozilla.com iburst
We still saw alerts after that, though. They come in two flavors:
Mon 10:05:35 PST [4054] buildbot-master112.bb.releng.scl3.mozilla.com:ntp peer is WARNING: NTP WARNING: Server has the LI_ALARM bit set, Offset 0.245135 secs (http://m.mozilla.org/ntp+peer)
Mon 10:10:16 PST [4059] buildbot-master81.bb.releng.scl3.mozilla.com:ntp peer is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown (http://m.mozilla.org/ntp+peer)
They recover but then alert again later.
The first error generally indicates that one of the servers fell out of sync, and it's taking a bit of time to sync back up. The second is more worrying.
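To see which hosts flap and with which of the two flavors, the alert lines can be tallied. The following is a minimal sketch that assumes every alert follows the same shape as the three lines quoted above; `ALERT_RE`, `parse_alert`, `classify`, and `tally` are hypothetical names for illustration, not part of any monitoring tool.

```python
import re
from collections import Counter
from typing import Optional

# Assumed shape, inferred only from the alert lines quoted in this bug:
# "<day> <HH:MM:SS> <tz> [<id>] <host>:<check> is <STATE>: <message> (<url>)"
ALERT_RE = re.compile(
    r"^(?P<day>\w{3}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<tz>\w+) \[(?P<id>\d+)\] "
    r"(?P<host>[^:\s]+):(?P<check>.+?) is (?P<state>OK|WARNING|CRITICAL|UNKNOWN): "
    r"(?P<message>.*?)(?: \((?P<url>https?://[^)\s]+)\))?$"
)


def parse_alert(line: str) -> Optional[dict]:
    """Split one alert line into named fields, or return None if it does not match."""
    m = ALERT_RE.match(line.strip())
    return m.groupdict() if m else None


def classify(alert: dict) -> str:
    """Map an alert message onto the two flavors discussed above."""
    msg = alert["message"]
    if "LI_ALARM" in msg:
        return "li_alarm"
    if "not synchronized" in msg:
        return "not_synchronized"
    return "other"


def tally(lines) -> Counter:
    """Count alerts per (host, flavor), skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        alert = parse_alert(line)
        if alert is not None:
            counts[(alert["host"], classify(alert))] += 1
    return counts
```

Feeding a day's worth of alert lines to `tally` gives a quick view of whether the "not synchronized" flavor is concentrated on a few hosts or spread across the fleet.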
Comment 1 (Assignee) • 9 years ago
(In reply to Amy Rich [:arr] [:arich] from comment #0)
> We've seen a number of ntp alerts of various types today.
Are you still seeing alerts or was this a one time occurrence?
> server ns1.private.releng.scl3.mozilla.com iburst
> server ns2.private.releng.scl3.mozilla.com iburst
> server ns1.private.scl3.mozilla.com iburst
> server ns2.private.scl3.mozilla.com iburst
I poked around these systems, and while they currently appear to be in sync, it looks like ns[12].private.releng.scl3 lost their upstream clock sources on Nov 17 15:25-15:43 and again on Nov 23 14:22-15:33 UTC.
Sample log snippets:
messages-20151129:Nov 23 14:46:59 ns2.private.releng.scl3.mozilla.com ntpd[15754]: 0.0.0.0 0613 03 spike_detect +0.276876 s
messages-20151129:Nov 23 15:21:33 ns2.private.releng.scl3.mozilla.com ntpd[15754]: 0.0.0.0 061c 0c clock_step +0.321176 s
messages-20151129:Nov 23 15:21:33 ns2.private.releng.scl3.mozilla.com ntpd[15754]: 0.0.0.0 0615 05 clock_sync
messages-20151129:Nov 23 15:21:34 ns2.private.releng.scl3.mozilla.com ntpd[15754]: 0.0.0.0 c618 08 no_sys_peer
However, I wasn't able to find any evidence of this on ns[12].private.scl3, so I think you should have had at least one stable clock source available. If something was happening there, it wasn't logged. More on that below.
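For combing through larger log files than the four sample lines, a small parser can pull out the event name and offset from each ntpd entry. This is a minimal sketch that assumes every line has the same layout as the excerpt above (optional `file:` prefix from grep, syslog timestamp, host, `ntpd[pid]`, peer address, four-hex-digit status word, two-hex-digit event code, event name, optional offset in seconds). `LINE_RE`, `NtpEvent`, and `parse_ntpd_line` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the sample lines in this comment.
LINE_RE = re.compile(
    r"^(?:(?P<file>[^:\s]+):)?"                                   # optional grep filename prefix
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"  # e.g. "Nov 23 15:21:33"
    r"(?P<host>\S+)\s+ntpd\[(?P<pid>\d+)\]:\s+"
    r"(?P<peer>\S+)\s+(?P<status>[0-9a-fA-F]{4})\s+(?P<code>[0-9a-fA-F]{2})\s+"
    r"(?P<event>\w+)"
    r"(?:\s+(?P<offset>[+-]?\d+(?:\.\d+)?)\s+s)?\s*$"             # offset is absent on some events
)


class NtpEvent(NamedTuple):
    stamp: str
    host: str
    status: int                 # status word as an integer, e.g. 0x061c
    event: str                  # e.g. "clock_step"
    offset_s: Optional[float]   # None when the line carries no offset


def parse_ntpd_line(line: str) -> Optional[NtpEvent]:
    """Parse one ntpd syslog line; return None if it does not match the assumed layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    offset = m.group("offset")
    return NtpEvent(
        stamp=m.group("stamp"),
        host=m.group("host"),
        status=int(m.group("status"), 16),
        event=m.group("event"),
        offset_s=float(offset) if offset is not None else None,
    )
```

Filtering the parsed events for `no_sys_peer` and `clock_step` would then give the windows in which a server had no usable upstream source.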
>
> We still saw alerts after that, though. They come in two flavors:
>
> Mon 10:05:35 PST [4054] buildbot-master112.bb.releng.scl3.mozilla.com:ntp
> peer is WARNING: NTP WARNING: Server has the LI_ALARM bit set, Offset
> 0.245135 secs (http://m.mozilla.org/ntp+peer)
I'm unfamiliar with the LI_ALARM bit, but from digging through the check_ntp_peer source[1], it looks like it gets set when the server's clock is no longer synchronized. The ntp_peer check that produced this message can't call out a specific upstream server; the alert fires when the server can't synchronize with any external source.
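For reference, the NTP leap indicator is a two-bit field, and the value 3 is the alarm condition (clock not synchronized). The four-digit hex words in the ntpd log lines quoted earlier appear to be system status words that carry the same field in their top two bits. The sketch below decodes such a word, assuming the layout of 2 bits leap, 6 bits clock source, 4 bits event count, 4 bits event code; `decode_system_status` and `is_alarm` are hypothetical helper names, and this is not how check_ntp_peer itself is implemented.

```python
# Leap indicator values as defined for NTP.
LEAP_NAMES = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "alarm (clock not synchronized)",
}


def decode_system_status(word: int) -> dict:
    """Split a 16-bit status word into its fields (layout is an assumption, see above)."""
    return {
        "leap": (word >> 14) & 0x3,        # top 2 bits: leap indicator
        "source": (word >> 8) & 0x3F,      # next 6 bits: clock source
        "event_count": (word >> 4) & 0xF,  # next 4 bits: events since last read
        "event_code": word & 0xF,          # low 4 bits: most recent event
    }


def is_alarm(word: int) -> bool:
    """True when the leap indicator in the status word is 3 (unsynchronized)."""
    return decode_system_status(word)["leap"] == 3


if __name__ == "__main__":
    # Status words taken from the log excerpt in this comment.
    for text in ("0613", "061c", "0615", "c618"):
        fields = decode_system_status(int(text, 16))
        print(text, fields, LEAP_NAMES[fields["leap"]])
```

Under that assumed layout, the `c618` word logged with `no_sys_peer` has the leap field set to 3, while the earlier words have it at 0, which is consistent with the server dropping into the alarm state at 15:21:34.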
> The first error generally indicates that one of the servers fell out of
> sync, and it's taking a bit of time to sync back up. The second is more
> worrying.
Unfortunately, our ntp logging configuration is non-existent, so only critical errors are recorded, and only briefly at that. I filed bug 1229016 to increase log verbosity on all of our ntp servers so we can better troubleshoot future problems.
The log excerpt above is all I have to go on, and as far as I can tell everything is now synchronizing as expected.
[1] http://ftp.ics.uci.edu/pub/centos0/ics-custom-build/BUILD/nagios-plugins-1.4.13/plugins/check_ntp_peer.c
Updated (Assignee) • 9 years ago
Assignee: infra → bhourigan
Comment 2 (Assignee) • 9 years ago
Closing due to lack of response - please reopen if issues persist.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
Comment 3 (Reporter) • 9 years ago
Sorry, I thought I was needinfoed and responded to this. Things stabilized after that day, so I'm assuming it was some phx1 fallout.
Comment 4 (Assignee) • 9 years ago
(In reply to Amy Rich [:arr] [:arich] from comment #3)
> Sorry, I thought I was needinfoed and responded to this. Things stabilized
> after that day, so I'm assuming it was some phx1 fallout.
no problem, thanks!