tree closure due to scl1 DNS bustage


Status

RESOLVED FIXED

People

(Reporter: Callek, Assigned: bhourigan)


Details

(Whiteboard: [reit-ops] [closed-trees])

(Reporter)

Description

5 years ago
Per puppet reports there have been DNS issues in scl1 for the last ~1 hour 15 minutes.

This is busting all panda tests in a weird way (one that doesn't show up in the logs as DNS), but we see the errors on other Linux hosts in scl1 as well.

[19:34:31]	Callek	Puppet Report for foopy50.p2.releng.scl1.mozilla.com
[19:34:47]	Callek	Puppet Report for bld-centos6-hp-016.build.scl1.mozilla.com
[19:35:06]	Callek	Puppet Report for bld-linux64-ix-036.build.scl1.mozilla.com

Those are three example hosts.

shyam is on it, and philor closed trees at 2013-08-21 16:17:08
Haven't been able to reproduce these issues at all. Nameservers seem to be operational and working.
Reopened at 17:35, since it's been a while since the last failure, and I'm running out of running jobs that could fail.
Severity: blocker → normal
(In reply to Shyam Mani [:fox2mike] from comment #1)
> Haven't been able to reproduce these issues at all. Nameservers seem to be
> operational and working.

:fox2mike -- were you able to confirm the errors reported by puppet?

:arr -- I know :dustin is travelling - are there any logs we need to save off so we can get an RFO on his return? (or anyone else who can provide that?)
Flags: needinfo?(shyam)
Flags: needinfo?(arich)
Whiteboard: [reit-ops] [closed-trees]
Any pertinent logs that we don't already have email records of (in the case of puppet) are going to be on the DNS server end.  This bug should go to the infra team for further analysis.
Assignee: shyam → infra
Component: Server Operations → Infrastructure: DNS
Flags: needinfo?(arich)
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → jdow
The puppet errors look like this:

Wed Aug 21 16:21:19 -0700 2013 /Stage[main]/Users::People/Users::Person[jbraddock]/File[/home/jbraddock] (err): Failed to generate additional resources using 'eval_generate: getaddrinfo: Name or service not known
Wed Aug 21 16:21:50 -0700 2013 /Stage[main]/Users::People/Users::Person[ahill2]/File[/home/ahill2] (err): Failed to generate additional resources using 'eval_generate: getaddrinfo: Name or service not known
Wed Aug 21 16:22:27 -0700 2013 /Stage[main]/Users::People/Users::Person[aignacio]/File[/home/aignacio] (err): Failed to generate additional resources using 'eval_generate: getaddrinfo: Name or service not known
Wed Aug 21 16:22:37 -0700 2013 /Stage[main]/Users::People/Users::Person[aignacio]/File[/home/aignacio] (err): Could not evaluate: getaddrinfo: Name or service not known Could not retrieve file metadata for puppet:///modules/users/people/aignacio: getaddrinfo: Name or service not known

which is to say, puppet managed to look up its server name, get a catalog, etc., but then some of the follow-up HTTP connections failed in getaddrinfo.  We often see these errors from AWS hosts, too.  I've assumed that was just packet loss over the VPC, but I have no data to back up that assumption.
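If someone wants to test that assumption the next time this flares up, a probe along the lines of the sketch below would resolve the master's name the same way the agent does (via getaddrinfo) and count how often it fails. The hostname, port, and timing values are placeholders, not values from this bug:

# Hypothetical probe, not from this bug: repeatedly resolve the name the agent
# uses and report intermittent getaddrinfo failures.
import socket
import time

HOSTNAME = "puppet"   # placeholder for whatever the agent's $server is
ATTEMPTS = 120        # sample for ~10 minutes
INTERVAL = 5          # seconds between lookups

failures = 0
for i in range(ATTEMPTS):
    try:
        socket.getaddrinfo(HOSTNAME, 443, 0, socket.SOCK_STREAM)
    except socket.gaierror as e:
        failures += 1
        print("%s lookup %d failed: %s" % (time.strftime("%H:%M:%S"), i, e))
    time.sleep(INTERVAL)

print("%d/%d lookups failed" % (failures, ATTEMPTS))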

Callek, can you paste some logs from the failed panda runs? (and please paste the logs, not a tbpl link, for posterity)
Flags: needinfo?(bugspam.Callek)
(Reporter)

Comment 6

5 years ago
from https://tbpl.mozilla.org/php/getParsedLog.php?id=26839051&tree=Mozilla-Central#error0

15:43:05     INFO - #####
15:43:05     INFO - ##### Running request-device step.
15:43:05     INFO - #####
15:43:05     INFO - Running pre-action listener: _resource_record_pre_action
15:43:05     INFO - Running main action method: request_device
15:43:15     INFO - Running post-action listener: _resource_record_post_action
15:43:15    FATAL - Uncaught exception: Traceback (most recent call last):
15:43:15    FATAL -   File "/builds/panda-0035/test/scripts/mozharness/base/script.py", line 1066, in run
15:43:15    FATAL -     self.run_action(action)
15:43:15    FATAL -   File "/builds/panda-0035/test/scripts/mozharness/base/script.py", line 1008, in run_action
15:43:15    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
15:43:15    FATAL -   File "/builds/panda-0035/test/scripts/mozharness/base/script.py", line 949, in _possibly_run_method
15:43:15    FATAL -     return getattr(self, method_name)()
15:43:15    FATAL -   File "scripts/scripts/android_panda.py", line 159, in request_device
15:43:15    FATAL -     self.retrieve_android_device(b2gbase="")
15:43:15    FATAL -   File "/builds/panda-0035/test/scripts/mozharness/mozilla/testing/mozpool.py", line 82, in retrieve_android_device
15:43:15    FATAL -     mph = self.query_mozpool_handler(self.mozpool_device)
15:43:15    FATAL -   File "/builds/panda-0035/test/scripts/mozharness/mozilla/testing/mozpool.py", line 42, in query_mozpool_handler
15:43:15    FATAL -     self.mozpool_api_url = self.determine_mozpool_host(device) if device else mozpool_api_url
15:43:15    FATAL -   File "/builds/panda-0035/test/scripts/mozharness/mozilla/testing/mozpool.py", line 35, in determine_mozpool_host
15:43:15    FATAL -     raise self.MozpoolException("This panda board does not have an associated BMM.")
15:43:15    FATAL - AttributeError: 'PandaTest' object has no attribute 'MozpoolException'
15:43:15    FATAL - Running post_fatal callback...
15:43:15    FATAL - Exiting -1

As said above, that is not an obvious DNS error. But when you look at the code that raises the exception (http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/testing/mozpool.py#27) and realize that the fqdn in question does exist, and the same check works when run from the foopy the panda is attached to (also on the .p10 vlan), it's clear that a failure to get the fqdn from DNS is what caused the error.
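For illustration, here is a minimal sketch of that failure pattern -- this is not the mozharness code itself, just an assumed reconstruction of how a lookup failure ends up reported as AttributeError when the error path reaches a raise of self.MozpoolException that the object never defines:

# Sketch only; class name, lookup, and error path are assumptions.
import socket

class PandaTestSketch(object):
    # Note: no MozpoolException attribute is defined anywhere on this class.
    def determine_mozpool_host(self, device):
        try:
            # Stand-in for deriving the mozpool/BMM host from the board's fqdn;
            # this lookup fails when DNS is broken.
            socket.getaddrinfo(device, 80)
        except socket.gaierror:
            # Evaluating self.MozpoolException itself raises AttributeError, so
            # (under Python 2) only the AttributeError appears in the log and the
            # original getaddrinfo failure is never shown.
            raise self.MozpoolException("This panda board does not have an associated BMM.")
        return "http://%s" % device

# PandaTestSketch().determine_mozpool_host("panda-0035.nonexistent.example")
# -> AttributeError: 'PandaTestSketch' object has no attribute 'MozpoolException'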
Flags: needinfo?(bugspam.Callek)

Comment 7

5 years ago
It seems it is just fileserver operations that were failing in the puppet logs. We've seen these types of issues in infra puppet when a misconfiguration caused the $server variable to get lost or revert to the default of "puppet".

I don't have any debug logs on the nameservers in SCL1 that go back far enough to be of use here. Do any of the servers involved (puppet clients, puppet masters, puppet file servers) have a dependency on ns1a.dmz.scl3? That host was down due to a seamicro outage during this time, but we can't find anything in scl1 that should have been affected by it.
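A quick way to rule out the $server theory on a suspect host would be something like the sketch below. It assumes the stock /etc/puppet/puppet.conf location and INI-style layout, which is an assumption and not something confirmed in this bug:

# Sketch: print the server setting the agent will actually use.
try:
    from configparser import ConfigParser                       # Python 3
except ImportError:
    from ConfigParser import SafeConfigParser as ConfigParser   # Python 2

cfg = ConfigParser()
cfg.read("/etc/puppet/puppet.conf")

for section in ("agent", "main"):
    if cfg.has_section(section) and cfg.has_option(section, "server"):
        print("[%s] server = %s" % (section, cfg.get(section, "server")))
        break
else:
    print("no explicit server setting; the agent falls back to 'puppet'")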

Updated

5 years ago
Flags: needinfo?(shyam)
I didn't look through all of the errors, but from the AWS failures we'll often also see puppet unable to get the catalog due to a failure in getaddrinfo.  It just happens that the majority of HTTP requests from an agent are for fileserver.  'puppet' is a legitimate name for all of these hosts, with master certs set up to allow either 'puppet' or the master's fqdn, so that shouldn't be the issue.

[root@foopy50.p2.releng.scl1.mozilla.com ~]# cat /etc/resolv.conf 
; generated by /sbin/dhclient-script
search p2.releng.scl1.mozilla.com
nameserver 10.12.75.11

so, not talking directly to ns1a.dmz.scl3.

The mystery deepens :)
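For the record, a direct spot-check of that resolver could look like the sketch below. It assumes dnspython is available on the host, and the names in it are placeholders for whatever actually failed to resolve:

# Sketch: query the resolver from foopy50's resolv.conf directly and report
# any failures, to see whether 10.12.75.11 itself drops lookups.
import dns.resolver

NAMES = ["puppet"]              # placeholders; substitute the names that failed
RESOLVERS = ["10.12.75.11"]     # from /etc/resolv.conf above

for ns in RESOLVERS:
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [ns]
    resolver.lifetime = 3.0
    for name in NAMES:
        try:
            answer = resolver.query(name, "A")   # .resolve() on dnspython >= 2.0
            print("%s @ %s -> %s" % (name, ns, [rr.address for rr in answer]))
        except Exception as e:
            print("%s @ %s FAILED: %r" % (name, ns, e))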
admin1b.infra.scl1.mozilla.com (which was master for the DNS vip in scl1) became very laggy and unresponsive today.  I'm not sure if this caused further issues, but I asked AJ to fail over the vip to admin1a, and then admin1b suddenly became responsive again.  AJ probably has more details about his debugging.
Flags: needinfo?(afernandez)
Depends on: 908739
More info about the debugging is in bug 908739
Flags: needinfo?(afernandez)
(Assignee)

Comment 11

5 years ago
We have moved DNS services to two dedicated hosts; I think this is fixed for now.
Assignee: infra → bhourigan
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED