Closed Bug 907981 Opened 12 years ago Closed 12 years ago

tree closure due to scl1 DNS bustage

Categories: Infrastructure & Operations :: DNS and Domain Registration (task)
Platform: x86_64 Windows 7
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: Callek; Assigned: bhourigan

Whiteboard: [reit-ops] [closed-trees]

Per puppet reports there have been DNS issues in scl1 for the last ~1 hour 15 minutes. This is busting all panda tests in a weird way (one that doesn't show up in the logs as DNS), but we see the errors on other linux hosts in scl1 as well.

[19:34:31] Callek  Puppet Report for foopy50.p2.releng.scl1.mozilla.com
[19:34:47] Callek  Puppet Report for bld-centos6-hp-016.build.scl1.mozilla.com
[19:35:06] Callek  Puppet Report for bld-linux64-ix-036.build.scl1.mozilla.com

are three example hosts. shyam is on it, and philor closed the trees at 2013-08-21 16:17:08.
Haven't been able to reproduce these issues at all. Nameservers seem to be operational and working.
Reopened the trees at 17:35, since it's been a while since the last failure and I'm running out of running jobs that could fail.
Severity: blocker → normal
(In reply to Shyam Mani [:fox2mike] from comment #1)
> Haven't been able to reproduce these issues at all. Nameservers seem to be
> operational and working.

:fox2mike -- were you able to confirm the errors reported by puppet?

:arr -- I know :dustin is travelling - are there any logs we need to save off so we can get an RFO on his return? (Or anyone else who can provide that?)
Flags: needinfo?(shyam)
Flags: needinfo?(arich)
Whiteboard: [reit-ops] [closed-trees]
Any pertinent logs that we don't already have email records of (in the case of puppet) are going to be on the DNS server end. This bug should go to the infra team for further analysis.
Assignee: shyam → infra
Component: Server Operations → Infrastructure: DNS
Flags: needinfo?(arich)
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → jdow
The puppet errors look like this:

Wed Aug 21 16:21:19 -0700 2013 /Stage[main]/Users::People/Users::Person[jbraddock]/File[/home/jbraddock] (err): Failed to generate additional resources using 'eval_generate: getaddrinfo: Name or service not known
Wed Aug 21 16:21:50 -0700 2013 /Stage[main]/Users::People/Users::Person[ahill2]/File[/home/ahill2] (err): Failed to generate additional resources using 'eval_generate: getaddrinfo: Name or service not known
Wed Aug 21 16:22:27 -0700 2013 /Stage[main]/Users::People/Users::Person[aignacio]/File[/home/aignacio] (err): Failed to generate additional resources using 'eval_generate: getaddrinfo: Name or service not known
Wed Aug 21 16:22:37 -0700 2013 /Stage[main]/Users::People/Users::Person[aignacio]/File[/home/aignacio] (err): Could not evaluate: getaddrinfo: Name or service not known Could not retrieve file metadata for puppet:///modules/users/people/aignacio: getaddrinfo: Name or service not known

Which is to say, puppet managed to look up its server name, get a catalog, etc., but then some of the followup HTTP connections failed in getaddrinfo. We often see these errors from AWS hosts, too. I've assumed that was just packet loss over the VPC, but I have no data to back up that assumption.

Callek, can you paste some logs from the failed panda runs? (And please paste the logs, not a tbpl link, for posterity.)
Flags: needinfo?(bugspam.Callek)
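(If this recurs, it would help to catch the flakiness in the act with exactly the call that's failing. A rough sketch one could run on an affected host; the hostname, port, and probe interval below are placeholders, not anything puppet-specific:)

#!/usr/bin/env python
"""Repeatedly exercise getaddrinfo(), the same libc lookup puppet's HTTP
client is failing in, and log any intermittent failures.

The hostname below is an example; substitute whatever name the agent
actually resolves for its fileserver/master.
"""
import socket
import time

HOST = "puppet"          # example name only
INTERVAL = 5             # seconds between probes
DURATION = 15 * 60       # probe for 15 minutes

deadline = time.time() + DURATION
failures = 0
while time.time() < deadline:
    try:
        socket.getaddrinfo(HOST, 443, 0, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        failures += 1
        print("%s getaddrinfo(%s) failed: %s"
              % (time.strftime("%H:%M:%S"), HOST, exc))
    time.sleep(INTERVAL)

print("failures in the last %d minutes: %d" % (DURATION // 60, failures))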
from https://tbpl.mozilla.org/php/getParsedLog.php?id=26839051&tree=Mozilla-Central#error0

15:43:05 INFO - #####
15:43:05 INFO - ##### Running request-device step.
15:43:05 INFO - #####
15:43:05 INFO - Running pre-action listener: _resource_record_pre_action
15:43:05 INFO - Running main action method: request_device
15:43:15 INFO - Running post-action listener: _resource_record_post_action
15:43:15 FATAL - Uncaught exception: Traceback (most recent call last):
15:43:15 FATAL - File "/builds/panda-0035/test/scripts/mozharness/base/script.py", line 1066, in run
15:43:15 FATAL - self.run_action(action)
15:43:15 FATAL - File "/builds/panda-0035/test/scripts/mozharness/base/script.py", line 1008, in run_action
15:43:15 FATAL - self._possibly_run_method(method_name, error_if_missing=True)
15:43:15 FATAL - File "/builds/panda-0035/test/scripts/mozharness/base/script.py", line 949, in _possibly_run_method
15:43:15 FATAL - return getattr(self, method_name)()
15:43:15 FATAL - File "scripts/scripts/android_panda.py", line 159, in request_device
15:43:15 FATAL - self.retrieve_android_device(b2gbase="")
15:43:15 FATAL - File "/builds/panda-0035/test/scripts/mozharness/mozilla/testing/mozpool.py", line 82, in retrieve_android_device
15:43:15 FATAL - mph = self.query_mozpool_handler(self.mozpool_device)
15:43:15 FATAL - File "/builds/panda-0035/test/scripts/mozharness/mozilla/testing/mozpool.py", line 42, in query_mozpool_handler
15:43:15 FATAL - self.mozpool_api_url = self.determine_mozpool_host(device) if device else mozpool_api_url
15:43:15 FATAL - File "/builds/panda-0035/test/scripts/mozharness/mozilla/testing/mozpool.py", line 35, in determine_mozpool_host
15:43:15 FATAL - raise self.MozpoolException("This panda board does not have an associated BMM.")
15:43:15 FATAL - AttributeError: 'PandaTest' object has no attribute 'MozpoolException'
15:43:15 FATAL - Running post_fatal callback...
15:43:15 FATAL - Exiting -1

Which, as said above, is not an obvious DNS error. But when you look at the code that raises the exception:

http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/testing/mozpool.py#27

and realize that the panda's actual fqdn does exist and resolves fine when checked from the foopy it lives on (also in the .p10 vlan), it's clear it was a failure to get the fqdn from DNS that caused the error.
Flags: needinfo?(bugspam.Callek)
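(Side note on why that traceback hides the real problem: the error path that fires when no BMM can be derived for the device raises an exception class through `self`, which the PandaTest object doesn't actually expose as an attribute, so the intended "no associated BMM" message gets masked by an AttributeError. A minimal, self-contained sketch of that pattern; the class and method names here are illustrative stand-ins, not the real mozharness code:)

class MozpoolException(Exception):
    """Raised when a device cannot be mapped to its mozpool/BMM server."""


class MozpoolMixin(object):
    def determine_mozpool_host(self, device):
        # Stand-in for the DNS-dependent lookup that failed during the
        # scl1 outage: no BMM host could be derived for the device name.
        bmm_fqdn = None
        if not bmm_fqdn:
            # Buggy pattern: MozpoolException is a module-level class, not
            # an attribute of the object, so this line itself blows up with
            # AttributeError and masks the intended error message.
            raise self.MozpoolException(
                "This panda board does not have an associated BMM.")
            # Fixed pattern would reference the class directly:
            # raise MozpoolException(
            #     "This panda board does not have an associated BMM.")
        return bmm_fqdn


class PandaTest(MozpoolMixin):
    pass


if __name__ == "__main__":
    try:
        PandaTest().determine_mozpool_host("panda-0035")
    except AttributeError as exc:
        # Prints: 'PandaTest' object has no attribute 'MozpoolException'
        print("masked error:", exc)

Referencing the module-level exception class directly would at least surface the real message the next time DNS flakes out.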
It seems it was just fileserver operations that were failing in the puppet logs. We've seen these types of issues in infra puppet when a misconfiguration caused the $server variable to get lost or reverted to its default of "puppet".

I don't have any debug logs on the nameservers in SCL1 that go back far enough to be of use here.

Could any of the servers involved (puppet clients, puppet masters, puppet file servers) have a dependency on ns1a.dmz.scl3? That host was down due to a seamicro outage during this time, but we can't find anything in scl1 that should have been affected by it.
Flags: needinfo?(shyam)
I didn't look through all of the errors, but from the AWS failures we'll often also see puppet unable to get the catalog due to a failure in getaddrinfo. It just happens that the majority of HTTP requests from an agent are for the fileserver. 'puppet' is a legitimate name for all of these hosts, with master certs set up to allow either 'puppet' or the master's fqdn, so that shouldn't be the issue.

[root@foopy50.p2.releng.scl1.mozilla.com ~]# cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search p2.releng.scl1.mozilla.com
nameserver 10.12.75.11

So, not talking directly to ns1a.dmz.scl3. The mystery deepens :)
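(Next time this happens it would also be worth probing that exact resolver directly, rather than whatever getaddrinfo picks up via nsswitch. A minimal sketch using dnspython, assuming it's available on the host; the IP comes from the resolv.conf above, and the hostnames and probe count are only examples:)

#!/usr/bin/env python
"""Probe one specific resolver repeatedly to catch intermittent failures.

Assumes dnspython >= 2.0 is installed; the names below are examples taken
from this bug, not a definitive test list.
"""
import time

import dns.exception
import dns.resolver

NAMESERVER = "10.12.75.11"   # scl1 DNS VIP from the foopy's resolv.conf
NAMES = [
    "foopy50.p2.releng.scl1.mozilla.com",
    "bld-centos6-hp-016.build.scl1.mozilla.com",
]

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = [NAMESERVER]
resolver.timeout = 2
resolver.lifetime = 2

failures = 0
for _ in range(100):                 # ~100 probes, a couple of seconds apart
    for name in NAMES:
        try:
            resolver.resolve(name, "A")
        except dns.exception.DNSException as exc:
            failures += 1
            print("%s FAIL %s: %r" % (time.strftime("%H:%M:%S"), name, exc))
    time.sleep(2)

print("total failures: %d" % failures)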
admin1b.infra.scl1.mozilla.com (which was master for the DNS vip in scl1) became very laggy and unresponsive today. I'm not sure if this caused further issues, but I asked AJ to fail over the vip to admin1a, and then admin1b suddenly became responsive again. AJ probably has more details about his debugging.
Flags: needinfo?(afernandez)
More info about the debugging is in bug 908739
Flags: needinfo?(afernandez)
We have moved DNS services to two dedicated hosts; I think this has been fixed for now.
Assignee: infra → bhourigan
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED