Closed Bug 727194 Opened 12 years ago Closed 12 years ago

Build DNS server failures

Categories

(mozilla.org Graveyard :: Server Operations, task, P2)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

Several jobs are unable to clone the tools repo.
This started impacting development in the last hour.
Can you please investigate what is going on?

See bug 727171 and https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=be845a0d6234

[1]
abort: error: nodename nor servname provided, or not known
https://tbpl.mozilla.org/php/getParsedLog.php?id=9330727&tree=Mozilla-Inbound

[2]
abort: error: 
program finished with exit code 255
This may also be affecting the buildapi (used by developers to manage jobs).

> Cron <root@buildapi01> sudo -u buildapi /usr/local/bin/update_hg_wc.sh /home/buildapi/src && /etc/init.d/buildapi restart
========= Started clone build tools failed (results: 2, elapsed: 1 mins, 43 secs) ==========
hg clone http://hg.mozilla.org/build/tools tools
 in dir /builds/slave/m-in-osx64/. (timeout 1320 secs)
 watching logfiles {}
 argv: ['hg', 'clone', 'http://hg.mozilla.org/build/tools', 'tools']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-5UYYzc/Render
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-pMQ6PU/:0
  HOME=/Users/cltbld
  LOGNAME=cltbld
  PATH=/tools/buildbot/bin:/tools/python/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/slave/m-in-osx64
  PYTHONPATH=/tools/buildbot/lib/python2.6/site-packages:/tools/twisted/lib/python2.6/site-packages/:/tools/twisted-core/lib/python2.6/site-packages:/tools/zope-interface/lib/python2.6/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-li2CVQ/Listeners
  TMPDIR=/var/folders/7I/7I253dv+HLesSBUBGCX08E+++TM/-Tmp-/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.6
  __CF_USER_TEXT_ENCODING=0x1F6:0:0
 using PTY: False
abort: error: nodename nor servname provided, or not known
program finished with exit code 255
elapsedTime=103.547167
======== Finished clone build tools failed (results: 2, elapsed: 1 mins, 43 secs) ========
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → cshields
That was moz2-darwin10-slave45, which is in mtv1.

That's a DNS resolution failure.  On that host:

moz2-darwin10-slave45:~ cltbld$ host hg.mozilla.org
hg.mozilla.org has address 10.2.74.153

and in fact, running that in a loop shows no problems.
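
For anyone who wants to repeat the loop check from a slave without `host`, here is a minimal Python sketch. The function name `probe_dns` and its parameters are made up for illustration; the only real API used is `socket.getaddrinfo`, which raises `socket.gaierror` for the same kind of lookup failure seen in the build log. The `resolver` argument exists only so the loop can be exercised without touching the network.

```python
import socket
import time


def probe_dns(hostname, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve hostname repeatedly; return one (ok, detail) tuple per attempt."""
    results = []
    for i in range(attempts):
        try:
            infos = resolver(hostname, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the resolved IP address.
            addresses = sorted({info[4][0] for info in infos})
            results.append((True, addresses))
        except socket.gaierror as exc:
            # Same error family as "nodename nor servname provided, or not known".
            results.append((False, str(exc)))
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

Calling `probe_dns("hg.mozilla.org", attempts=60)` on the affected host would give roughly a minute of samples; any `(False, ...)` entry marks a failed lookup.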

  nameserver[0] : 10.250.48.19 -- the build DNS VIP in mtv1
Feb 14 10:50:51 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Transition to MASTER STATE
Feb 14 10:50:52 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Entering MASTER STATE
Feb 14 10:50:52 ns1b Keepalived_vrrp: VRRP_Instance(vip48) setting protocol VIPs.
Feb 14 10:50:52 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Sending gratuitous ARPs on eth0 for 10.250.48.19
Feb 14 10:50:53 ns1b ntpd[1355]: Listening on interface #10 eth0, 10.250.48.19#123 Enabled
Feb 14 10:50:57 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Sending gratuitous ARPs on eth0 for 10.250.48.19
Feb 14 10:51:05 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Received higher prio advert
Feb 14 10:51:05 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Entering BACKUP STATE
Feb 14 10:51:05 ns1b Keepalived_vrrp: VRRP_Instance(vip48) removing protocol VIPs.
Feb 14 10:51:06 ns1b ntpd[1355]: Deleting interface #10 eth0, 10.250.48.19#123, interface stats: received=0, sent=0, dropped=0, active_time=13 secs

which corresponds closely to the failed build:

OS X 10.6.2 mozilla-inbound build on 2012-02-14 10:52:53 PST for push 85f3cf72938a

ns1a only has:

Feb 14 10:47:48 ns1a dhclient[1074]: DHCPREQUEST on eth0 to 10.250.0.21 port 67 (xid=0x57338270)
Feb 14 10:47:48 ns1a dhclient[1074]: DHCPACK from 10.250.0.21 (xid=0x57338270)
Feb 14 10:47:48 ns1a dhclient[1074]: bound to 10.250.48.17 -- renewal in 4253 seconds.
Feb 14 10:50:01 ns1a kernel: audit: error converting sid to string

I'm downgrading since this doesn't appear to be ongoing -- looks like it was about 15 seconds long?  And over to SRE/server ops to see what the cause was.

(releng, please copy interested folks here - the above is considered infra info and not public)
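
A rough sketch of how the length of the blip can be read out of the keepalived lines above, in case this needs to be checked again. The names `STATE_RE` and `master_windows` are illustrative, the regex only covers the "Entering MASTER/BACKUP STATE" lines quoted in this bug, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Two of the ns1b lines quoted above.
LOG = """\
Feb 14 10:50:52 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Entering MASTER STATE
Feb 14 10:51:05 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Entering BACKUP STATE
"""

STATE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ Keepalived_vrrp: "
    r"VRRP_Instance\((?P<inst>[^)]+)\) Entering (?P<state>MASTER|BACKUP) STATE"
)


def master_windows(log_text, year=2012):
    """Return (instance, start, end) for each MASTER -> BACKUP interval."""
    windows = []
    started = {}
    for line in log_text.splitlines():
        m = STATE_RE.match(line)
        if not m:
            continue
        # Syslog omits the year, so it is supplied by the caller (assumed 2012).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if m["state"] == "MASTER":
            started[m["inst"]] = ts
        elif m["inst"] in started:
            windows.append((m["inst"], started.pop(m["inst"]), ts))
    return windows
```

For the two lines in `LOG`, the single window runs from 10:50:52 to 10:51:05, i.e. 13 seconds, which agrees with the `active_time=13 secs` reported by ntpd on ns1b.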
Group: infra
Severity: critical → major
Summary: Hitting hg issues → Build DNS server failures in mtv1
Slaves that this has happened to (I'm limiting to one per platform):
    s: moz2-darwin10-slave45
    s: talos-r3-leopard-008
    s: talos-r3-fed-011
    s: talos-r4-snow-067
and maybe tegras:
"socket.gaierror: [Errno 8] nodename nor servname provided, or not known"
https://tbpl.mozilla.org/php/getParsedLog.php?id=9331036&tree=Mozilla-Inbound
Severity: major → critical
Component: Server Operations → Server Operations: RelEng
Ah, okay, so 15 seconds is how long it took the vm to fail over to the secondary host.  Was there some sort of bad interaction with keepalived here that kept DNS queries from failing over to the secondary node during that blip?
Severity: critical → normal
Going over armen's list, this isn't ns1a in mtv1, then, because those systems are not all in mtv1.  Not sure *what* this was, then.
It wasn't just mtv. Updated the summary. Priority is correct.

If I see it again I will let you know.
Summary: Build DNS server failures in mtv1 → Build DNS server failures
comment 4 is likely a red herring, since it was not limited to mtv1, and this was a 15s blip due to the ns1a vm moving.
There might be more in other trees, but these are the two time slots where this happened (there are another 10-15 talos failures in the 11:32-33 timeframe):
started 11:31, finished 11:33, took 2mins    s: talos-r3-leopard-008 
started 11:30, finished 11:32, took 3mins    s: talos-r4-lion-026
started 11:31, finished 11:32, took 2mins    s: talos-r3-fed-011
started 11:31, finished 11:33, took 2mins    s: talos-r3-fed-053
started 11:31, finished 11:33, took 2mins    s: talos-r3-fed-074

started 10:52, finished 10:54, took 2mins    s: moz2-darwin10-slave45


Which nagios alert should have gone off? (this way I know if I see it again)
Component: Server Operations: RelEng → Server Operations
Group: infra
For the record, resolvers for scl1 stuff (talos-*) are

[root@talos-r3-fed-011 ~]# cat /etc/resolv.conf 
; generated by /sbin/dhclient-script
search build.scl1.mozilla.com
nameserver 10.12.48.19

  which is a VIP on admin1a/b.infra.scl1

and resolvers for mtv1 hosts are (comment 3)

nameserver[0] : 10.250.48.19 -- the build DNS VIP in mtv1

  which is a VIP on ns1a/b.build.mtv1

I don't see anything in the logs from ~11:30 on admin1a/b.
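
To compare which resolver each affected slave actually points at, the `nameserver` entries can be pulled out of `resolv.conf` with something like the sketch below. The helper name `nameservers` is hypothetical, and it only handles the simple layout shown above (whole-line comments starting with `;` or `#`).

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and whole-line comments.
        if not line or line[0] in ";#":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers
```

Fed the file from talos-r3-fed-011 above, this returns a one-element list holding the scl1 VIP, 10.12.48.19.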
(In reply to Dustin J. Mitchell [:dustin] from comment #11)
> 
> I don't see anything in the logs from ~11:30 on admin1a/b.
Those were the ones that actually made me worry and file the bug. Is there anywhere else that we might be able to see this? Any nagios alerts around that time?

Here are the two changesets that I saw failing for that time:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=be845a0d6234
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=rev&rev=6564da6bf49e
It's interesting that the two DCs had failures at wildly different times.

All of this happens within a single broadcast domain.  The only commonality I can imagine between the systems is puppet.  Puppet ran at 11:06 and 11:05 on admin1a/b, respectively, and at 10:39 and 10:49 on ns1a/b, so that timing doesn't really work out.  (The timing is close on ns1b, but that's the backup host, and 3 minutes is still a long time for a failure to propagate.)

The failing talos systems aren't all in the same rack, or even in the same row, so a localized switch failure is an unlikely culprit.

My hunch is still comment 4, coupled with an unrelated and unknown failure in scl1 40 minutes later, but not everyone agrees with me.
If it's two unrelated failures, then it's possible that the 15s blip in mtv1 was responsible for a small handful of failures there (a 15s outage should not trigger build failures, but that's another issue).
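
On the point that a 15s outage should not fail a build: one way a client could ride out a blip of that size is to retry the lookup with backoff before giving up. This is only a sketch under assumptions, not what buildbot or hg do today; `resolve_with_retry` and its parameters are hypothetical, and the only real calls are `socket.getaddrinfo` and `time.sleep`.

```python
import socket
import time


def resolve_with_retry(hostname, max_wait=60.0, first_delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Retry name resolution with exponential backoff until max_wait is spent.

    max_wait bounds the total time spent sleeping between attempts; time
    spent inside the resolver itself is not counted.
    """
    delay = first_delay
    waited = 0.0
    while True:
        try:
            return resolver(hostname, None)
        except socket.gaierror:
            if waited + delay > max_wait:
                raise  # the outage outlasted the retry budget
            sleep(delay)
            waited += delay
            delay *= 2
```

With the defaults, the sleeps are 1, 2, 4, 8 and 16 seconds (31 seconds in total) before the error is re-raised, which is comfortably longer than the 15 second failover seen in mtv1.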

armen, did we see any failures in sjc1 at all?
(In reply to Amy Rich [:arich] [:arr] from comment #14)
> armen, did we see any failures in sjc1 at all?

I will have to investigate harder and see if I can catch something.
Grabbing until I can get you that info. More info tomorrow.
Assignee: server-ops → armenzg
FTR, around 8:20 we had some Win64 DNS issues.
I still have to get the promised data (*sigh*).

I saw this for a win64 build today:

retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #1
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #2
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #3
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #4
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #5
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
Process stdio:
change(s) NOT sent, something went wrong:

[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.DNSLookupError'>: DNS lookup failed: address 'buildbot-master10.build.mozilla.org' not found: [Errno 11004] getaddrinfo failed.

]


Process stdio:
change(s) NOT sent, something went wrong:

[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.DNSLookupError'>: DNS lookup failed: address 'buildbot-master10.build.mozilla.org' not found: [Errno 11004] getaddrinfo failed.

]
Well, those are in scl1, and temporally correlated with network blips in scl1 (bug 730675).

The DNS server for the win64 systems is dc01 -- a host on the winbuild network and not one of the infra DNS servers that the other slaves talk to.

Philor pointed out a failure on a talos system in scl1:
  Connecting to stage.mozilla.org|10.2.74.116|:80... failed: Connection refused.
(this following a successful DNS resolution to that IP) sometime after 10:48:33 today

and also at least one DNS failure on a tegra in mtv1.

Overall, it's hard for me to see how everything in this bug could be related.  We're seeing "the same" failure on multiple, completely independent DNS servers -- dc01, mtv1 build DNS, and scl1 build DNS.  And the connection-refused above is obviously completely different.

Rather than assume the NTP nagios failures in bug 720675 are the result of a link failure of some sort, maybe they're symptomatic of some other issue.
Priority: -- → P2
I have held on to this bug for way too long without being able to help.

I don't think we're seeing these anymore or at least we don't have easy ways to see them happening.

We can close this bug if no one has any objections.
Assignee: armenzg → server-ops
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard