Closed
Bug 727194
Opened 12 years ago
Closed 12 years ago
Build DNS server failures
Categories
(mozilla.org Graveyard :: Server Operations, task, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Unassigned)
References
Details
Several jobs are unable to clone the tools repo. This started impacting development in the last hour.
Can you please investigate what is going on?
See bug 727171 and https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=be845a0d6234
[1] abort: error: nodename nor servname provided, or not known https://tbpl.mozilla.org/php/getParsedLog.php?id=9330727&tree=Mozilla-Inbound
[2] abort: error: program finished with exit code 255
Reporter
Comment 1•12 years ago
This may also be affecting buildapi (used by developers to manage jobs).
> Cron <root@buildapi01> sudo -u buildapi /usr/local/bin/update_hg_wc.sh /home/buildapi/src && /etc/init.d/buildapi restart
Comment 2•12 years ago
========= Started clone build tools failed (results: 2, elapsed: 1 mins, 43 secs) ==========
hg clone http://hg.mozilla.org/build/tools tools
 in dir /builds/slave/m-in-osx64/. (timeout 1320 secs)
 watching logfiles {}
 argv: ['hg', 'clone', 'http://hg.mozilla.org/build/tools', 'tools']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-5UYYzc/Render
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-pMQ6PU/:0
  HOME=/Users/cltbld
  LOGNAME=cltbld
  PATH=/tools/buildbot/bin:/tools/python/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/slave/m-in-osx64
  PYTHONPATH=/tools/buildbot/lib/python2.6/site-packages:/tools/twisted/lib/python2.6/site-packages/:/tools/twisted-core/lib/python2.6/site-packages:/tools/zope-interface/lib/python2.6/site-packages
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-li2CVQ/Listeners
  TMPDIR=/var/folders/7I/7I253dv+HLesSBUBGCX08E+++TM/-Tmp-/
  USER=cltbld
  VERSIONER_PYTHON_PREFER_32_BIT=no
  VERSIONER_PYTHON_VERSION=2.6
  __CF_USER_TEXT_ENCODING=0x1F6:0:0
 using PTY: False
abort: error: nodename nor servname provided, or not known
program finished with exit code 255
elapsedTime=103.547167
======== Finished clone build tools failed (results: 2, elapsed: 1 mins, 43 secs) ========
Updated•12 years ago
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → cshields
Comment 3•12 years ago
That was moz2-darwin10-slave45, which is in mtv1. That's a DNS resolution failure. On that host:
moz2-darwin10-slave45:~ cltbld$ host hg.mozilla.org
hg.mozilla.org has address 10.2.74.153
and in fact, running that in a loop shows no problems.
nameserver[0] : 10.250.48.19 -- the build DNS VIP in mtv1
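The loop check described above could be scripted roughly as follows. This is a minimal sketch, not the command that was actually run on the slave: `probe_resolution` and its parameters are hypothetical names, and the only real library call is Python's `socket.getaddrinfo`. The resolver is injectable so the loop can be exercised without touching the network.

```python
import socket
import time


def probe_resolution(hostname, attempts=10, delay=0.0, resolver=socket.getaddrinfo):
    """Resolve `hostname` repeatedly; return a list of (attempt, error text) failures.

    `resolver` defaults to socket.getaddrinfo and is a parameter only so that
    a fake resolver can be substituted (an assumption of this sketch).
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            resolver(hostname, None)
        except socket.gaierror as exc:
            # gaierror is what Python raises for "nodename nor servname provided".
            failures.append((attempt, str(exc)))
        if delay:
            time.sleep(delay)
    return failures
```

Calling `probe_resolution("hg.mozilla.org", attempts=60, delay=1.0)` on an affected slave would return an empty list while resolution is healthy, and one entry per failed lookup otherwise.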
Comment 4•12 years ago
Feb 14 10:50:51 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Transition to MASTER STATE
Feb 14 10:50:52 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Entering MASTER STATE
Feb 14 10:50:52 ns1b Keepalived_vrrp: VRRP_Instance(vip48) setting protocol VIPs.
Feb 14 10:50:52 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Sending gratuitous ARPs on eth0 for 10.250.48.19
Feb 14 10:50:53 ns1b ntpd[1355]: Listening on interface #10 eth0, 10.250.48.19#123 Enabled
Feb 14 10:50:57 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Sending gratuitous ARPs on eth0 for 10.250.48.19
Feb 14 10:51:05 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Received higher prio advert
Feb 14 10:51:05 ns1b Keepalived_vrrp: VRRP_Instance(vip48) Entering BACKUP STATE
Feb 14 10:51:05 ns1b Keepalived_vrrp: VRRP_Instance(vip48) removing protocol VIPs.
Feb 14 10:51:06 ns1b ntpd[1355]: Deleting interface #10 eth0, 10.250.48.19#123, interface stats: received=0, sent=0, dropped=0, active_time=13 secs
which corresponds closely to the failed build:
OS X 10.6.2 mozilla-inbound build on 2012-02-14 10:52:53 PST for push 85f3cf72938a
ns1a only has:
Feb 14 10:47:48 ns1a dhclient[1074]: DHCPREQUEST on eth0 to 10.250.0.21 port 67 (xid=0x57338270)
Feb 14 10:47:48 ns1a dhclient[1074]: DHCPACK from 10.250.0.21 (xid=0x57338270)
Feb 14 10:47:48 ns1a dhclient[1074]: bound to 10.250.48.17 -- renewal in 4253 seconds.
Feb 14 10:50:01 ns1a kernel: audit: error converting sid to string
I'm downgrading since this doesn't appear to be ongoing -- looks like it was about 15 seconds long? And over to SRE/server ops to see what the cause was.
(releng, please copy interested folks here - the above is considered infra info and not public)
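To put a number on how long ns1b held the VIP, the keepalived lines above can be parsed and the MASTER-to-BACKUP interval computed. This is a sketch under assumptions: `vip_hold_seconds` is a hypothetical helper, the year is supplied by hand because syslog timestamps omit it, and `%b` parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches syslog lines of the form shown in the comment above.
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"Keepalived_vrrp: VRRP_Instance\((?P<vip>[^)]+)\) (?P<event>.+)$"
)


def vip_hold_seconds(lines, year=2012):
    """Seconds between 'Entering MASTER STATE' and the next 'Entering BACKUP STATE'.

    Returns None if the log does not contain both transitions.
    """
    became_master = None
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # ntpd, dhclient and other non-keepalived lines
        stamp = datetime.strptime(
            "%d %s" % (year, match.group("stamp")), "%Y %b %d %H:%M:%S"
        )
        event = match.group("event")
        if event == "Entering MASTER STATE":
            became_master = stamp
        elif event == "Entering BACKUP STATE" and became_master is not None:
            return (stamp - became_master).total_seconds()
    return None
```

Run against the ns1b lines quoted above, this measures the interval from 10:50:52 to 10:51:05, which is consistent with the `active_time=13 secs` that ntpd reports for the VIP interface.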
Updated•12 years ago
Group: infra
Severity: critical → major
Summary: Hitting hg issues → Build DNS server failures in mtv1
Reporter
Comment 5•12 years ago
Slaves that this has happened to (I'm limiting to one per platform):
s: moz2-darwin10-slave45
s: talos-r3-leopard-008
s: talos-r3-fed-011
s: talos-r4-snow-067
and maybe tegras:
"socket.gaierror: [Errno 8] nodename nor servname provided, or not known"
https://tbpl.mozilla.org/php/getParsedLog.php?id=9331036&tree=Mozilla-Inbound
Severity: major → critical
Component: Server Operations → Server Operations: RelEng
Comment 6•12 years ago
Ah, okay, so 15 seconds is how long it took the vm to fail over to the secondary host. Was there some sort of bad interaction with keepalived here that didn't have DNS queries failing over to the secondary node during that blip?
Severity: critical → normal
Comment 7•12 years ago
Going over armen's list, this isn't ns1a in mtv1, then, because those systems are not all in mtv1. Not sure *what* this was, then.
Reporter
Comment 8•12 years ago
It wasn't just mtv. Updated the summary. Priority is correct. If I see it again I will let you know.
Summary: Build DNS server failures in mtv1 → Build DNS server failures
Comment 9•12 years ago
comment 4 is likely a red herring, since it was not limited to mtv1, and this was a 15s blip due to the ns1a vm moving.
Reporter
Comment 10•12 years ago
There might be more in other trees, but these are the two time slots where this happened (there are another 10-15 talos failures in the 11:32-33 timeframe):
started 11:31, finished 11:33, took 2mins s: talos-r3-leopard-008
started 11:30, finished 11:32, took 3mins s: talos-r4-lion-026
started 11:31, finished 11:32, took 2mins s: talos-r3-fed-011
started 11:31, finished 11:33, took 2mins s: talos-r3-fed-053
started 11:31, finished 11:33, took 2mins s: talos-r3-fed-074
started 10:52, finished 10:54, took 2mins s: moz2-darwin10-slave45
Which nagios alert should have gone off? (this way I know if I see it again)
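The "two time slots" observation can be made mechanical by grouping the failed jobs by start time. A minimal sketch, assuming the start times and slave names listed above; `cluster_by_time` and the 5-minute gap threshold are illustrative choices, not anything used by releng tooling.

```python
def cluster_by_time(events, max_gap_minutes=5):
    """Group (hh:mm, slave) pairs into clusters of nearby start times.

    A new cluster starts whenever the gap to the previous event exceeds
    max_gap_minutes. All times are assumed to fall on the same day.
    """
    def minutes(hhmm):
        hours, mins = hhmm.split(":")
        return int(hours) * 60 + int(mins)

    clusters = []
    for start, slave in sorted(events, key=lambda e: minutes(e[0])):
        if clusters and minutes(start) - minutes(clusters[-1][-1][0]) <= max_gap_minutes:
            clusters[-1].append((start, slave))
        else:
            clusters.append([(start, slave)])
    return clusters


# Start times and slaves as reported in the comment above.
FAILURES = [
    ("11:31", "talos-r3-leopard-008"),
    ("11:30", "talos-r4-lion-026"),
    ("11:31", "talos-r3-fed-011"),
    ("11:31", "talos-r3-fed-053"),
    ("11:31", "talos-r3-fed-074"),
    ("10:52", "moz2-darwin10-slave45"),
]
```

With this data, `cluster_by_time(FAILURES)` yields one cluster containing only the mtv1 build slave at 10:52 and a second cluster with the five talos slaves around 11:30, which matches the split between the mtv1 and scl1 incidents discussed in this bug.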
Updated•12 years ago
Component: Server Operations: RelEng → Server Operations
Updated•12 years ago
Group: infra
Comment 11•12 years ago
For the record, resolvers for scl1 stuff (talos-*) are
[root@talos-r3-fed-011 ~]# cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search build.scl1.mozilla.com
nameserver 10.12.48.19
which is a VIP on admin1a/b.infra.scl1
and resolvers for mtv1 hosts are (comment 3)
nameserver[0] : 10.250.48.19 -- the build DNS VIP in mtv1
which is a VIP on ns1a/b.build.mtv1
I don't see anything in the logs from ~11:30 on admin1a/b.
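When checking many slaves, it can help to extract the resolver from each `resolv.conf` and map it to the VIP it belongs to. The sketch below is an assumption-laden helper, not existing tooling: `parse_nameservers`, `describe_resolvers`, and the `KNOWN_VIPS` table are hypothetical, and the table contains only the two VIPs named in this bug.

```python
# The two build DNS VIPs mentioned in this bug; anything else is reported as unknown.
KNOWN_VIPS = {
    "10.12.48.19": "scl1 build DNS VIP (admin1a/b.infra.scl1)",
    "10.250.48.19": "mtv1 build DNS VIP (ns1a/b.build.mtv1)",
}


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses from resolv.conf-style text, in file order."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in ";#":
            continue  # blank line or comment
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def describe_resolvers(resolv_conf_text):
    """Pair each nameserver with a description from KNOWN_VIPS."""
    return [(ip, KNOWN_VIPS.get(ip, "unknown")) for ip in parse_nameservers(resolv_conf_text)]
```

Feeding it the talos-r3-fed-011 file quoted above returns the single scl1 VIP, which makes it easy to confirm that the scl1 and mtv1 slaves do not share a resolver.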
Reporter
Comment 12•12 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #11)
> I don't see anything in the logs from ~11:30 on admin1a/b.
Those were the ones that actually made me worry and file the bug. Is there anywhere else we might be able to see this? Any nagios alerts around that time?
Here are the two changesets that I saw failing for that time:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=be845a0d6234
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=rev&rev=6564da6bf49e
Comment 13•12 years ago
It's interesting that the two DCs had failures at wildly different times. All of this happens within a single broadcast domain. The only commonality I can imagine between the systems is puppet. Puppet ran at 11:06 and 11:05 on admin1a/b, respectively, and at 10:39 and 10:49 on ns1a/b, so that timing doesn't really work out (the timing is close on ns1b, but that's the backup host, and 3 minutes is still a long time for a failure to propagate). The failing talos systems aren't all in the same rack, or even in the same row, so a localized switch failure is an unlikely culprit. My hunch is still comment 4, coupled with an unrelated and unknown failure in scl1 40 minutes later, but not everyone agrees with me.
Comment 14•12 years ago
If it's two unrelated failures, then it's possible that the 15s blip in mtv1 was responsible for a small handful of failures there (a 15s outage should not trigger build failures, but that's another issue). armen, did we see any failures in sjc1 at all?
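On the point that a 15s outage should not fail a build: one way a client could ride out a short resolver blip is to retry lookups with backoff until a time budget is spent. This is only a sketch of that idea, not how buildbot or hg behave; `resolve_with_retry` and its parameters are hypothetical, and the clock, sleep and resolver are injectable purely so the logic can be exercised without waiting.

```python
import socket
import time


def resolve_with_retry(hostname, total_budget=60.0, first_delay=1.0,
                       resolver=socket.gethostbyname, sleep=time.sleep,
                       clock=time.monotonic):
    """Resolve `hostname`, retrying on gaierror until `total_budget` seconds pass.

    The delay doubles after each failure (1s, 2s, 4s, ...) and is capped so
    the function never sleeps past the deadline. The last error is re-raised
    once the budget is exhausted.
    """
    deadline = clock() + total_budget
    delay = first_delay
    while True:
        try:
            return resolver(hostname)
        except socket.gaierror:
            remaining = deadline - clock()
            if remaining <= 0:
                raise
            sleep(min(delay, remaining))
            delay *= 2
```

With the default 60-second budget, a lookup that fails for the first 15 seconds would succeed on a later attempt instead of aborting the job. The budget is the design knob here: it needs to be longer than the failover time being tolerated.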
Reporter
Comment 15•12 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #14)
> armen, did we see any failures in sjc1 at all?
I will have to investigate harder and see if I can catch something. Grabbing this bug until I can get you that info. More info tomorrow.
Assignee: server-ops → armenzg
Reporter
Comment 16•12 years ago
FTR, around 8:20 we had some Win64 DNS issues. I still have to get such promised data (*sigh*).
I saw this for a win64 build today:
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #1
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #2
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #3
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #4
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
retry: Failed, sleeping 5 seconds before retrying
retry: Calling <function run_with_timeout at 0x0274D070> with args: (['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip'], 1800, 'change sent successfully', None, False, True), kwargs: {}, attempt #5
Executing: ['buildbot.bat', 'sendchange', '--master', 'buildbot-master10.build.mozilla.org:9301', '--username', 'sendchange', '--branch', 'mozilla-central-win64-talos', '--revision', '53e10e2b327b', '--comments', 'Bug 730601 - Don_t use GetListenerManager(false) to check existence of ELM, but HasListenerManager, r=jst', '--property', 'buildid:20120227063207', '--property', 'pgo_build:False', '--property', 'builduid:24a2900d924542c8be37cdf1c8d4c56f', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win64/1330353127/firefox-13.0a1.en-US.win64-x86_64.zip']
Process stdio: change(s) NOT sent, something went wrong: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.DNSLookupError'>: DNS lookup failed: address 'buildbot-master10.build.mozilla.org' not found: [Errno 11004] getaddrinfo failed. ]
Process stdio: change(s) NOT sent, something went wrong: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.DNSLookupError'>: DNS lookup failed: address 'buildbot-master10.build.mozilla.org' not found: [Errno 11004] getaddrinfo failed. ]
Comment 17•12 years ago
Well, those are in scl1, and temporally correlated with network blips in scl1 (bug 730675). The DNS server for the win64 systems is dc01 -- a host on the winbuild network and not one of the infra DNS servers that the other slaves talk to.
Philor pointed out a failure on a talos system in scl1:
Connecting to stage.mozilla.org|10.2.74.116|:80... failed: Connection refused.
(this following a successful DNS resolution to that IP) sometime after 10:48:33 today, and also at least one DNS failure on a tegra in mtv1.
Overall, it's hard for me to see how everything in this bug could be related. We're seeing "the same" failure on multiple, completely independent DNS servers -- dc01, mtv1 build DNS, and scl1 build DNS. And the connection-refused above is obviously completely different.
Rather than assume the NTP nagios failures in bug 730675 are the result of a link failure of some sort, maybe they're symptomatic of some other issue.
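Since this bug mixes name-resolution errors with at least one connection-refused error, a small classifier over log lines can keep the two apart when triaging. A minimal sketch, assuming only the error strings quoted in this bug; `classify` and the `PATTERNS` table are hypothetical and would need extending for other message formats.

```python
import re

# Ordered (label, pattern) pairs built from the error strings quoted in this bug.
PATTERNS = [
    ("dns", re.compile(
        r"nodename nor servname provided"   # OS X / Linux resolver wording
        r"|getaddrinfo failed"              # Windows [Errno 11004]
        r"|DNSLookupError"                  # twisted wrapper
        r"|socket\.gaierror"
    )),
    ("connection_refused", re.compile(r"Connection refused")),
]


def classify(log_line):
    """Return 'dns', 'connection_refused', or 'other' for a single log line."""
    for label, pattern in PATTERNS:
        if pattern.search(log_line):
            return label
    return "other"
```

Applied to the lines in this bug, the hg clone and buildbot sendchange failures classify as `dns`, while the wget failure against stage.mozilla.org classifies as `connection_refused`, supporting the point that it is a different problem.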
Reporter
Updated•12 years ago
Priority: -- → P2
Reporter
Comment 18•12 years ago
I have held on to this bug for far too long without being able to help. I don't think we're seeing these failures anymore, or at least we don't have an easy way to see them happening. We can close this bug if no one has any objections.
Assignee: armenzg → server-ops
Updated•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard