Bug 971983 (Closed) Opened 10 years ago, Closed 10 years ago

Cannot connect to verifier.login.persona.org from oneanddone on prod PaaS

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: osmose, Assigned: cturra)

References

Details

Attachments

(1 file)

I'm unable to connect to verifier.login.persona.org from the oneanddone app on the production PaaS. This causes our django-browserid-based login to fail.

Even when I SSH in and run

curl -X POST 'https://verifier.login.persona.org/verify'

I get the following:

curl: (6) Couldn't resolve host 'verifier.login.persona.org'

Similar errors come up with other ways of diagnosing connection issues or using other tools to connect.
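
For reference, a couple of checks like this can separate a DNS failure from a general connectivity failure (a sketch; the pinned IP below is just a documentation placeholder, not a real verifier address):

# ask the system resolver directly
getent hosts verifier.login.persona.org

# bypass DNS entirely by pinning the hostname to an IP
curl --resolve verifier.login.persona.org:443:203.0.113.10 \
     -X POST 'https://verifier.login.persona.org/verify'

If the getent lookup fails while the pinned curl connects, the problem is the resolver rather than the network.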

Mythmon's solution of using a proxy from bug 889509 doesn't fix this for me, and neither does recreating the app from scratch. Any ideas?
So a ghost came back from the dead to help me out, and we found out that resolv.conf on oneanddone is pointing to 127.0.0.1 as the DNS server. I checked on nucleus, which is also running on the prod PaaS, and it has different IPs for its DNS servers. I used dig with those IPs on oneanddone and was able to resolve google.com successfully.

I'm also able to connect to external IPs and ping stuff, so it seems like just the DNS is borked.
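
Roughly, the check looked like this (using the nameserver IPs copied from nucleus; treat the addresses as illustrative):

# fails against the resolv.conf default (127.0.0.1)
dig google.com

# succeeds when pointed at the nameservers nucleus uses
dig @10.22.75.40 google.com +short
dig @10.22.75.41 google.com +short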


So it looks like resolv.conf isn't getting set up properly for oneanddone. Halp!
this is odd, i agree. in fact, we saw some similar things last week also. i have reached out to activestate to discuss this.

in the meantime, i believe this was "sorted out" previously by delete/push[ing] - possibly more than once. clearly that's not a solution, but it might be a workaround for now. give it a go.
The instance running openwebdevice.org is currently in the same situation w.r.t. DNS resolution: /etc/resolv.conf has 127.0.0.1 as the nameserver, but no dnsmasq or other local DNS resolver is running, so no DNS resolution is happening locally on that instance either. 'dig verifier.login.persona.org', 'dig github.com', or any other lookup results in 'connection timed out; no servers could be reached'.
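
A quick way to confirm nothing is actually answering on loopback (a sketch; netstat -lnup works too where ss isn't available):

# is any local resolver bound to port 53?
sudo ss -lnup | grep ':53' || echo "no local resolver listening"

# does the resolver resolv.conf points at actually respond?
dig @127.0.0.1 github.com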

Unlike oneanddone, however, the openwebdevice.org site only depends on local DNS resolution during startup, so once the pip install and github clones are complete, the site continues to run without issue and there is no need to push the service again.

:Osmose, I believe a better temporary workaround would be to replace your /etc/resolv.conf with the version I'm attaching from the working nucleus instance, which you can do now that we have sudo privileges on stackato containers.
Attached file /etc/resolv.conf
To further clarify: /etc/resolv.conf is normally generated automatically, and has a comment at the top warning to that effect, so there's no way for me to predict how long your manually modified version will last. It should last at least long enough for you, and probably a few other people, to log in to the admin and begin populating data until a more permanent solution is in place. Also, here are step-by-step instructions for what I described in comment #3:

1. download the attached /etc/resolv.conf file to your local machine into your oneanddone working directory, then open a terminal and change to that directory
2. stackato target https://api.paas.mozilla.org
3. stackato group oneanddone
4. connect to VPN
5. stackato scp resolv.conf :.
6. stackato ssh
7. Verify that you are logged in to oneanddone.mozilla.org before executing the next command
8. sudo cp resolv.conf /etc/
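
Once that's in place, a quick sanity check from the same ssh session (just a suggested verification, not part of the original steps):

# should now resolve instead of timing out
dig verifier.login.persona.org +short

# and the verifier should be reachable over HTTPS
curl -X POST 'https://verifier.login.persona.org/verify'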

:Osmose, I realize I probably didn't need to explain at that level of detail for you, but I'm working on documenting things like this in bugs instead of IRC for others who might run into similar issues.
To even further clarify: I manually fixed /etc/resolv.conf on the openwebdevice.org container and local DNS resolution is now working, which is why I believe it should work for oneanddone, even if only temporarily.
thnx for all this detailed information, Josh. i have been running down the same path as you guys!

it looks like 2 out of the 3 dea nodes in the prod cluster have an incorrect resolv.conf file, which is why the app (lxc) containers have been inheriting this. i have yet to figure out why that is though (btw - i have learned to hate resolvconf.d through this). i don't have an update on a solution yet, but i have a couple of ideas to play with this evening. hopefully i will have better news tomorrow.
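
for reference, a check along these lines will show which nodes are handing down the bad file (hostnames assumed to follow the prod cluster naming):

# compare the resolv.conf each dea node passes to its containers
for i in {1..3}; do
  echo "--> stackato-dea$i"
  ssh -A stackato-dea$i.paas.scl3.mozilla.com "cat /etc/resolv.conf"
done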
Assignee: server-ops-webops → cturra
Status: NEW → ASSIGNED
cturra: FYI if you want to avoid messing with resolvconf.d, you could modify /etc/network/interfaces instead:

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
  address 192.168.3.3
  netmask 255.255.255.0
  gateway 192.168.3.1
  dns-search mozilla.com scl3.mozilla.com mozilla.org mozilla.net
  dns-nameservers 10.22.75.40 10.22.75.41
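
For the dns-* lines to take effect, the interface needs to be bounced so resolvconf regenerates /etc/resolv.conf (a sketch, assuming the stock ifupdown/resolvconf tooling on the node):

# re-run the interface configuration
sudo ifdown eth0 && sudo ifup eth0

# or just rebuild /etc/resolv.conf from the registered sources
sudo resolvconf -u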
just providing an update so everyone watching this bug knows we're not sitting on it. the name servers are correctly provided during the (static) dhcp lease, so that's not an issue here. what i have found is that we rely on dnsmasq, which is why it's common to find 127.0.0.1 name servers around our infra. in this case, however, the app (lxc) containers inherit this setting from the dea nodes but don't have dnsmasq running in the same way, which is causing the name lookup issues we're seeing.

i am actively working with our infrastructure team to come up with a correct solution and will report back when that's in place.
(In reply to Chris Turra [:cturra] from comment #2)
> this is odd, i agree. in fact, we saw some similar things last week also. i
> have reached out to activestate to discuss this.
> 
> in the meantime, i believe this was "sorted out" previously by
> delete/push[ing] - possibly more than once. clearly that's not a solution,
> but it might be a workaround for now. give it a go.

Just an update that re-pushing eventually made the DNS issue go away. Yay!
cturra: FYI, in other contexts I have my lxc containers configured to use the dnsmasq running on the host OS, with a fallback to the upstream DNS resolvers. Adapting that approach to the previous example, the generated /etc/resolv.conf would look something like:

nameserver 192.168.3.1
nameserver 10.22.75.40
nameserver 10.22.75.41
search mozilla.com scl3.mozilla.com mozilla.org mozilla.net
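
The host-side dnsmasq configuration that pairs with that would look something like the following (a sketch, reusing the bridge and upstream addresses from the example above):

# /etc/dnsmasq.conf on the host OS
listen-address=192.168.3.1    # answer queries arriving on the lxc bridge
bind-interfaces               # only bind the addresses listed above
no-resolv                     # don't read the host's own /etc/resolv.conf
server=10.22.75.40            # forward everything to the upstream resolvers
server=10.22.75.41

The containers then list the bridge address first, with the upstream resolvers as a fallback, exactly as in the resolv.conf sketch above.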
over in bug 972757 we've come up with a solution to this. i have pushed the changes out to 2 of the 3 nodes in the production stackato cluster already and will do the third after this evening's environment patch (i left dea3 out of the mix because it's running most of the applications at the moment). in the meantime, however, all new deployments *should* work just fine.

i will update this bug after the patches this evening.
this should all be sorted now. 

$ for i in {1..3}; do echo "--> Connected to dea$i"; ssh -A stackato-dea$i.paas.scl3.mozilla.com "host verifier.login.persona.org"; done
--> Connected to dea1
verifier.login.persona.org is an alias for login.persona.org.
login.persona.org is an alias for persona-org-0210-1801955454.us-east-1.elb.amazonaws.com.
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.236.195.220
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.84.127.61
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 107.21.18.235
--> Connected to dea2
verifier.login.persona.org is an alias for login.persona.org.
login.persona.org is an alias for persona-org-0210-1801955454.us-east-1.elb.amazonaws.com.
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.84.127.61
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 107.21.18.235
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.236.195.220
--> Connected to dea3
verifier.login.persona.org is an alias for login.persona.org.
login.persona.org is an alias for persona-org-0210-1801955454.us-east-1.elb.amazonaws.com.
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 107.21.18.235
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.236.195.220
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.84.127.61
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard