Closed Bug 762342 Opened 12 years ago Closed 12 years ago

DNS Issues on production

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 762346

People

(Reporter: st3fan, Unassigned)

Details

This should never happen. If the DNS servers that we use are unreliable then we might want to maintain an /etc/hosts file.

[root@pancake-web4 supervisor]# curl -i http://pancake-elasticsearch1:9200/pancake
curl: (6) Couldn't resolve host 'pancake-elasticsearch1'
Can you paste /etc/resolv.conf from that host here ?
[root@pancake-web4 ~]# cat /etc/resolv.conf 
search	labs.phx1.mozilla.com
nameserver	10.8.110.5
Seems to be working now, but why is DHCP only returning a single nameserver? 
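As a rough way to spot hosts in the same situation, here is a minimal Python sketch that counts the `nameserver` entries in resolv.conf-style text. The function names are made up for illustration, and the sample text simply mirrors the output pasted above; this is not an existing tool on these hosts.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines ('#' or ';' in resolv.conf).
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def has_redundant_nameservers(resolv_conf_text, minimum=2):
    """True if at least `minimum` distinct nameservers are configured."""
    return len(set(parse_nameservers(resolv_conf_text))) >= minimum


# Sample text mirroring the /etc/resolv.conf pasted in this bug.
sample = "search\tlabs.phx1.mozilla.com\nnameserver\t10.8.110.5\n"
print(parse_nameservers(sample))
print(has_redundant_nameservers(sample))
```

On a real host the text would come from reading /etc/resolv.conf. With the single entry shown above, the check reports no redundancy.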

Punting over to server ops in case they can shed some light.

I'll venture a guess at some DNS outage in PHX1.
Assignee: gozer → server-ops
Component: General → Server Operations
Product: Pancake → mozilla.org
QA Contact: general → phong
Target Milestone: M3 → ---
Version: unspecified → other
(In reply to Stefan Arentz [:st3fan] from comment #0)
> This should never happen. If the DNS servers that we use are unreliable then
> we might want to maintain an /etc/hosts file.
> 
> [root@pancake-web4 supervisor]# curl -i
> http://pancake-elasticsearch1:9200/pancake
> curl: (6) Couldn't resolve host 'pancake-elasticsearch1'

Before you go off on reliability (and we don't know what happened here yet), PLEASE use FQDNs in your configs. If it's super critical, use IPs. Our DBs use IPs instead of hostnames because:

1) It cuts down on resolution; DBs don't move every day.
2) It doesn't fail if there's a blip in DNS.
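To illustrate the trade-off in the two points above, here is a hedged Python sketch of a lookup that retries through a short DNS blip and falls back to a pinned IP if one is configured. `resolve_with_fallback` and its parameters are hypothetical names, not anything Pancake or the DB configs actually use; the `resolver` argument exists only so the behaviour can be exercised without a network.

```python
import socket
import time


def resolve_with_fallback(hostname, fallback_ip=None, attempts=3, delay=0.5,
                          resolver=socket.gethostbyname):
    """Resolve `hostname`, retrying on failure, then fall back to a pinned IP.

    `fallback_ip` is an assumption of this sketch: a known-good address kept
    in config for services that rarely move.
    """
    for attempt in range(attempts):
        try:
            return resolver(hostname)
        except socket.gaierror:
            # Resolution failed; wait briefly before the next attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    if fallback_ip is not None:
        return fallback_ip
    raise socket.gaierror(
        f"could not resolve {hostname!r} after {attempts} attempts"
    )
```

A caller would pass the fully qualified name of the service and, for critical services only, a pinned address. The cost of pinning is the one implied above: the config has to be updated by hand if the service ever moves.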

Punting over to the infra team, CC'ing rtucker to check about DHCP.
Assignee: server-ops → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: phong → jdow
Also, do you have a timeline? So we can narrow down the search? 

Gozer, do you *know* of a DNS outage?
Here are some timestamps:

fxhome-lattice-server.stderr.log:[W 120514 09:35:42 elasticsearch:270] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log:[E 120606 19:01:05 elasticsearch:272] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log:[E 120606 19:05:16 elasticsearch:272] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log:[E 120606 19:01:48 elasticsearch:272] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log:[W 120514 13:34:53 elasticsearch:270] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log:[E 120606 19:25:09 elasticsearch:272] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log:[E 120606 19:25:19 elasticsearch:272] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log:[E 120606 19:26:24 elasticsearch:272] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log-20120408:[W 120404 16:20:56 elasticsearch:268] ElasticSearch Request Error 599
fxhome-lattice-server.stderr.log-20120408:[W 120404 16:20:59 elasticsearch:268] ElasticSearch Request Error 599
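To help narrow down the search, here is a small Python sketch that parses these log lines and groups the errors into time windows. It assumes the six-digit field is YYMMDD (as in Tornado-style log prefixes), and the 30-minute gap threshold is an arbitrary choice; the function names are illustrative only.

```python
import re
from datetime import datetime

# Matches prefixes such as "[E 120606 19:01:05 elasticsearch:272]".
LOG_RE = re.compile(r"\[(?P<level>[A-Z]) (?P<stamp>\d{6} \d{2}:\d{2}:\d{2}) ")


def parse_error_times(lines):
    """Return the sorted timestamps found in the given log lines."""
    times = []
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            # Assumption: the date field is YYMMDD.
            times.append(datetime.strptime(match.group("stamp"), "%y%m%d %H:%M:%S"))
    return sorted(times)


def group_into_windows(times, max_gap_minutes=30):
    """Group sorted timestamps into (start, end) windows.

    A new window starts whenever the gap to the previous timestamp
    exceeds `max_gap_minutes`.
    """
    windows = []
    for t in times:
        if windows and (t - windows[-1][1]).total_seconds() <= max_gap_minutes * 60:
            windows[-1][1] = t
        else:
            windows.append([t, t])
    return [(start, end) for start, end in windows]
```

Feeding it the grep output above would yield one window per cluster of errors, which can then be compared against DNS server logs for the same periods.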
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Curious how FQDNs will help. If DNS is unreachable then those will also fail, no?
True, but it's the right way™ to go. It helps keep things sane: looking at pancake-elasticsearch, I have no idea which datacenter that's in. I only know it's phx1 because 10.8 is phx1. We're also starting to have cross-DC ES instances (over in IT, not Labs), and in those cases not using FQDNs can cause issues.
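For configs that currently hold short names, a helper along these lines could write out the qualified form explicitly. This is a simplified Python sketch: `qualify` is a hypothetical name, and it does not reproduce the resolver's full search-list behaviour; it only appends a domain to names that contain no dot.

```python
def qualify(hostname, search_domain):
    """Return `hostname` unchanged if it already contains a dot,
    otherwise append `search_domain` to it."""
    if "." in hostname:
        return hostname
    return f"{hostname}.{search_domain.strip('.')}"


# The search domain is the one from the /etc/resolv.conf pasted earlier.
print(qualify("pancake-elasticsearch1", "labs.phx1.mozilla.com"))
```

The result names the datacenter in the config itself, rather than relying on whatever `search` line the host happens to receive.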
Sorry, this should not have been marked as fixed. We still need to improve the DNS config.

We will configure full names.
But I would also like to make this more resilient by configuring at least 2 nameservers. Is there a pair that we can use?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → DUPLICATE
Bug 762346 has the correct nameservers you can use.
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations