Closed Bug 907158 Opened 11 years ago Closed 11 years ago

Some buildbot masters complaining with "ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds"

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

Details

I don't know what is going on. I thought ntp was controlled through puppet (bug 646563).

08:56 nagios-releng: Tue 05:56:49 PDT [4163] buildbot-master57.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:56 nagios-releng: Tue 05:56:49 PDT [4164] buildbot-master63.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/ntp+time)
08:56 nagios-releng: Tue 05:56:49 PDT [4165] buildbot-master61.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:56 nagios-releng: Tue 05:56:49 PDT [4166] buildbot-master67.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:57 nagios-releng: Tue 05:57:39 PDT [4167] buildbot-master51.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:57 nagios-releng: Tue 05:57:49 PDT [4168] buildbot-master55.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:57 nagios-releng: Tue 05:57:49 PDT [4169] buildbot-master62.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:58 nagios-releng: Tue 05:58:49 PDT [4170] buildbot-master71.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:58 nagios-releng: Tue 05:58:49 PDT [4171] buildbot-master70.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:58 nagios-releng: Tue 05:58:59 PDT [4172] buildbot-master76.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:58 nagios-releng: Tue 05:58:59 PDT [4173] buildbot-master77.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:59 nagios-releng: Tue 05:59:48 PDT [4174] buildbot-master96.srv.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 3255502  Threads: 277  Questions: 1532300028  Slow queries: 15205  Opens: 160385  Flush tables: 2  Open tables: 685  Queries per second avg: 470.680 (http://m.allizom.org/MySQL+Connectivity)
08:59 nagios-releng: Tue 05:59:58 PDT [4175] buildbot-master90.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:59 nagios-releng: Tue 05:59:59 PDT [4176] buildbot-master92.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:59 nagios-releng: Tue 05:59:59 PDT [4177] buildbot-master94.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
08:59 nagios-releng: Tue 05:59:59 PDT [4178] buildbot-master96.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset -0.08414900303 secs (http://m.allizom.org/ntp+time)
09:00 nagios-releng: Tue 06:00:08 PDT [4179] buildbot-master98.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
09:01 nagios-releng: Tue 06:01:28 PDT [4180] buildbot-master57.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset -0.02337396145 secs (http://m.allizom.org/ntp+time)
09:01 nagios-releng: Tue 06:01:38 PDT [4181] buildbot-master63.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset -0.02407073975 secs (http://m.allizom.org/ntp+time)
09:01 nagios-releng: Tue 06:01:38 PDT [4182] buildbot-master61.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset 0.009407997131 secs (http://m.allizom.org/ntp+time)
09:01 nagios-releng: Tue 06:01:38 PDT [4183] buildbot-master67.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset -0.0132008791 secs (http://m.allizom.org/ntp+time)
09:02 nagios-releng: Tue 06:02:38 PDT [4184] buildbot-master55.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset -0.0923666954 secs (http://m.allizom.org/ntp+time)
09:02 nagios-releng: Tue 06:02:38 PDT [4185] buildbot-master51.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/ntp+time)
09:02 nagios-releng: Tue 06:02:48 PDT [4186] buildbot-master62.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
09:03 nagios-releng: Tue 06:03:38 PDT [4187] buildbot-master61.srv.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 3255734  Threads: 301  Questions: 1532426368  Slow queries: 15232  Opens: 160385  Flush tables: 2  Open tables: 685  Queries per second avg: 470.685 (http://m.allizom.org/MySQL+Connectivity)
09:03 nagios-releng: Tue 06:03:48 PDT [4188] buildbot-master70.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.allizom.org/ntp+time)
09:03 nagios-releng: Tue 06:03:48 PDT [4189] buildbot-master77.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset -0.07612383366 secs (http://m.allizom.org/ntp+time)
09:03 nagios-releng: Tue 06:03:48 PDT [4190] buildbot-master71.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
09:03 nagios-releng: Tue 06:03:58 PDT [4191] buildbot-master76.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
09:04 nagios-releng: Tue 06:04:58 PDT [4192] buildbot-master90.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
09:04 nagios-releng: Tue 06:04:58 PDT [4193] buildbot-master92.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
09:04 nagios-releng: Tue 06:04:58 PDT [4194] buildbot-master94.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
09:05 nagios-releng: Tue 06:05:08 PDT [4195] bld-lion-r5-037.try.releng.scl3.mozilla.com:buildbot is WARNING: PROCS WARNING: 0 processes with args buildbot.tac (http://m.allizom.org/buildbot)
That's all use1, and the errors all relate to socket timeouts from NRPE, meaning that NRPE couldn't contact the buildmasters (or in the case of the second alert, the check_ntp script couldn't contact the time server).

So this looks like a VPC failure.  From what you've pasted, it recovered quickly.  My understanding is that VPC failures are part of the expected failures in AWS.
We've gotten a few hundred alerts from puppet hosts suggesting the same thing:

Tue Aug 20 05:45:13 -0700 2013 /Stage[main]/Tweaks::Fonts/File[/home/cltbld/.fonts.conf] (err): Could not evaluate: getaddrinfo: Name or service not known Could not retrieve file metadata for puppet:///modules/tweaks/fonts.conf: getaddrinfo: Name or service not known
(DNS failure)

That the emails from these hosts are getting through - and that *I* can get through - suggests that the link failures are intermittent.
Thanks to gcox, I do see a VPC adjacency change at 06:01:01 US/Pacific, that recovered 06:01:34.  I'm no expert in interpreting these messages, so "recovered" may not be correct.
I will go and help the Pulse Queue's.

10:02 nagios-releng: Tue 07:02:28 PDT [4302] buildbot-master52.srv.releng.use1.mozilla.com:Pulse Queue is CRITICAL: 240 new items:oldest item is 4541s old (http://m.allizom.org/Pulse+Queue)
I was not sure how to determine that the pulse_publisher.py was not working properly.

I tried following the info in here:
https://wiki.mozilla.org/ReleaseEngineering/Queue_directories#processors

What I did notice is that the number of new items was growing continually after the blip.

[cltbld@buildbot-master52.srv.releng.use1.mozilla.com pulse]$ ls -lrt new | wc -l
343

I tried using "status" but it returned nothing:
/etc/init.d/pulse_publisher status

Once I restarted it, everything started to work again:
/etc/init.d/pulse_publisher restart

[cltbld@buildbot-master52.srv.releng.use1.mozilla.com pulse]$ ls -lrt new | wc -l
307
I see some new messages plus ntp time issues at the end:
10:49 nagios-releng: Tue 07:49:38 PDT [4339] buildbot-master55.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:49 nagios-releng: Tue 07:49:38 PDT [4340] buildbot-master59.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:49 nagios-releng: Tue 07:49:39 PDT [4341] buildbot-master62.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:49 nagios-releng: Tue 07:49:48 PDT [4342] buildbot-master69.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:49 nagios-releng: Tue 07:49:48 PDT [4343] buildbot-master67.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:49 nagios-releng: Tue 07:49:49 PDT [4344] buildbot-master70.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:50 nagios-releng: Tue 07:50:59 PDT [4345] buildbot-master94.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
10:50 nagios-releng: Tue 07:50:59 PDT [4346] buildbot-master96.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
10:51 nagios-releng: Tue 07:51:28 PDT [4347] mw32-ix-slave08.build.mtv1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
10:53 nagios-releng: Tue 07:53:38 PDT [4348] buildbot-master52.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:53 nagios-releng: Tue 07:53:38 PDT [4349] buildbot-master57.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:55 nagios-releng: Tue 07:55:38 PDT [4350] buildbot-master51.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
10:55 nagios-releng: Tue 07:55:48 PDT [4351] buildbot-master94.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset 0.01974380016 secs (http://m.allizom.org/ntp+time)
10:55 nagios-releng: Tue 07:55:58 PDT [4352] buildbot-master96.srv.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset 0.06784319878 secs (http://m.allizom.org/ntp+time)
10:58 armenzg_brb is now known as armenzg_buildduty
11:01 nagios-releng: Tue 08:01:58 PDT [4353] buildbot-master96.srv.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.allizom.org/MySQL+Connectivity)
11:01 nagios-releng: Tue 08:01:59 PDT [4354] buildbot-master92.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
11:01 nagios-releng: Tue 08:01:59 PDT [4355] buildbot-master90.srv.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.allizom.org/ntp+time)
I can correlate all of :dustin's suspicions and saw a VPN timeout (and failover) at 09:01, but nothing seen on our side beyond that.  BGP timed out, failed over the routes on our side, and continued on.  The remainder of the VPNs remained up and stable so it doesn't look like anything on the Mozilla side.
We temporally disabled the use1 masters on slavealloc.
We also decreased use1's priority.

11:10 catlee: ok, we should decrease use1's priority then
11:11 eeejay has left IRC (Quit: Coyote finally caught me)
11:11 armenzg_buildduty: what do you mean by that?
11:12 catlee: there's a config file that sets priorities for each region, so it starts instances in higher priority regions first
11:12 catlee: /home/buildduty/aws/cloud-tools/aws/watch_pending.cfg on cruncher
11:12 catlee: if we decrease use1, we'll start more instances in usw2 first
catlee, dustin: do you think that tomorrow morning we could re-enable the masters?

I will have a look at #buildduty and hope to determine that our masters are back to normal.
Should I be moving slaves from try-aws-us-east-1 to try-aws-us-west-2?

I'm getting "no allocation available"
I re-enabled all masters.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.