Bug 971983 (Closed) Opened 10 years ago, Closed 10 years ago

Cannot connect to verifier.login.persona.org from oneanddone on prod PaaS

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: osmose, Assigned: cturra)

References

Details

Attachments

(1 file)

I'm unable to connect to verifier.login.persona.org from the oneanddone app on the production PaaS. This causes our django-browserid-based login to fail.

Even when I SSH in and run

curl -X POST 'https://verifier.login.persona.org/verify'

I get the following:

curl: (6) Couldn't resolve host 'verifier.login.persona.org'

Similar errors come up with other ways of diagnosing connection issues or using other tools to connect.
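
For reference, a couple of checks like this can separate a DNS failure from a general connectivity failure (a sketch; the pinned IP below is just a documentation placeholder, not a real verifier address):

# ask the system resolver directly
getent hosts verifier.login.persona.org

# bypass DNS entirely by pinning the hostname to an IP
curl --resolve verifier.login.persona.org:443:203.0.113.10 \
     -X POST 'https://verifier.login.persona.org/verify'

If the getent lookup fails while the pinned curl connects, the problem is the resolver rather than the network.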

Mythmon's solution of using a proxy from bug 889509 doesn't fix this for me, and neither does recreating the app from scratch. Any ideas?
So a ghost came back from the dead to help me out, and we found out that resolv.conf on oneanddone is pointing to 127.0.0.1 as the DNS server. I checked on nucleus, which is also running on the prod PaaS, and it has different IPs for its DNS servers. I used dig with those IPs on oneanddone and was able to resolve google.com successfully.

I'm also able to connect to external IPs and ping stuff, so it seems like just the DNS is borked.
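
Roughly, the check looked like this (using the nameserver IPs copied from nucleus; treat the addresses as illustrative):

# fails against the resolv.conf default (127.0.0.1)
dig google.com

# succeeds when pointed at the nameservers nucleus uses
dig @10.22.75.40 google.com +short
dig @10.22.75.41 google.com +short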


So it looks like resolv.conf isn't getting set up properly for oneanddone. Halp!
this is odd, i agree. in fact, we saw some similar things last week also. i have reached out to activestate to discuss this.

in the meantime, i believe this was "sorted out" previously by delete/push[ing] - possibly more than once. clearly that's not a solution, but it might be a workaround for now. give it a go.
The instance running openwebdevice.org is currently in the same situation w.r.t. DNS resolution: /etc/resolv.conf has 127.0.0.1 as the nameserver, but no dnsmasq or other local DNS resolver is running, so no DNS resolution is happening locally on that instance either. 'dig verifier.login.persona.org', 'dig github.com', or any other lookup results in 'connection timed out; no servers could be reached'.
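
A quick way to confirm nothing is actually answering on loopback (a sketch; netstat -lnup works too where ss isn't available):

# is any local resolver bound to port 53?
sudo ss -lnup | grep ':53' || echo "no local resolver listening"

# does the resolver resolv.conf points at actually respond?
dig @127.0.0.1 github.com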

Unlike oneanddone, however, the openwebdevice.org site only depends on local DNS resolution during startup, so once the pip install and github clones are complete, the site continues to run without issue and there is no need to push the service again.

:Osmose, I believe a better temporary workaround would be to replace your /etc/resolv.conf with the version I'm attaching from the working nucleus instance, which you can do now that we have sudo privileges on stackato containers.
Attached file /etc/resolv.conf
To further clarify: /etc/resolv.conf is normally generated automatically, and has a comment at the top warning to that effect, so there's no way for me to predict how long your manually modified version will last. It should last at least long enough for you, and probably a few other people, to log in to the admin and begin populating data until a more permanent solution is in place. Also, here are step-by-step instructions for what I described in comment #3:

1. download the attached /etc/resolv.conf file to your local machine into your oneanddone working directory, then open a terminal and change to that directory
2. stackato target https://api.paas.mozilla.org
3. stackato group oneanddone
4. connect to VPN
5. stackato scp resolv.conf :.
6. stackato ssh
7. Verify that you are logged in to oneanddone.mozilla.org before executing the next command
8. sudo cp resolv.conf /etc/
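
Once that's in place, a quick sanity check from the same ssh session (just a suggested verification, not part of the original steps):

# should now resolve instead of timing out
dig verifier.login.persona.org +short

# and the verifier should be reachable over HTTPS
curl -X POST 'https://verifier.login.persona.org/verify'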

:Osmose, I realize I probably didn't need to explain at that level of detail for you, but I'm working on documenting things like this in bugs instead of IRC for others who might run into similar issues.
To even further clarify: I manually fixed /etc/resolv.conf on the openwebdevice.org container and local DNS resolution is now working, which is why I believe it should work for oneanddone, even if only temporarily.
thnx for all this detailed information, Josh. i have been running down the same path as you guys!

it looks like 2 out of the 3 dea nodes in the prod cluster have an incorrect resolv.conf file, which is why the app (lxc) containers have been inheriting this. i have yet to figure out why that is though (btw - i have learned to hate resolvconf.d through this). i don't have an update on a solution yet, but i have a couple of ideas to play with this evening. hopefully i will have better news tomorrow.
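
for reference, a check along these lines will show which nodes are handing down the bad file (hostnames assumed to follow the prod cluster naming):

# compare the resolv.conf each dea node passes to its containers
for i in {1..3}; do
  echo "--> stackato-dea$i"
  ssh -A stackato-dea$i.paas.scl3.mozilla.com "cat /etc/resolv.conf"
done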
Assignee: server-ops-webops → cturra
Status: NEW → ASSIGNED
cturra: FYI if you want to avoid messing with resolvconf.d, you could modify /etc/network/interfaces instead:

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
  address 192.168.3.3
  netmask 255.255.255.0
  gateway 192.168.3.1
  dns-search mozilla.com scl3.mozilla.com mozilla.org mozilla.net
  dns-nameservers 10.22.75.40 10.22.75.41
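
For the dns-* lines to take effect, the interface needs to be bounced so resolvconf regenerates /etc/resolv.conf (a sketch, assuming the stock ifupdown/resolvconf tooling on the node):

# re-run the interface configuration
sudo ifdown eth0 && sudo ifup eth0

# or just rebuild /etc/resolv.conf from the registered sources
sudo resolvconf -u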
just providing an update so everyone watching this bug knows we're not sitting on it. the name servers are correctly provided during the (static) dhcp lease, so that's not an issue here. what i have found is that we rely on dnsmasq, which is why it's common to find 127.0.0.1 name servers around our infra. in this case, however, the app (lxc) containers inherit this setting from the dea nodes but don't have dnsmasq running in the same way, which is causing the name lookup issues we're seeing.

i am actively working with our infrastructure team to come up with a correct solution and will report back when that's in place.
(In reply to Chris Turra [:cturra] from comment #2)
> this is odd, i agree. in fact, we saw some similar things last week also. i
> have reached out to activestate to discuss this.
> 
> in the meantime, i believe this was "sorted out" previously by
> delete/push[ing] - possibly more than once. clearly that's not a solution,
> but it might be a workaround for now. give it a go.

Just an update that re-pushing eventually made the DNS issue go away. Yay!
cturra: FYI, in other contexts I have my lxc containers configured to use the dnsmasq running on the host OS, with a fallback to the upstream DNS resolvers. Adapting that approach to the previous example, the generated /etc/resolv.conf would look something like:

nameserver 192.168.3.1
nameserver 10.22.75.40
nameserver 10.22.75.41
search mozilla.com scl3.mozilla.com mozilla.org mozilla.net
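
The host-side dnsmasq configuration that pairs with that would look something like the following (a sketch, reusing the bridge and upstream addresses from the example above):

# /etc/dnsmasq.conf on the host OS
listen-address=192.168.3.1    # answer queries arriving on the lxc bridge
bind-interfaces               # only bind the addresses listed above
no-resolv                     # don't read the host's own /etc/resolv.conf
server=10.22.75.40            # forward everything to the upstream resolvers
server=10.22.75.41

The containers then list the bridge address first, with the upstream resolvers as a fallback, exactly as in the resolv.conf sketch above.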
over in bug 972757 we've come up with a solution to this. i have pushed the changes out to 2 of the 3 nodes in the production stackato cluster already and will do the third after this evening's environment patch (i left dea3 out of the mix because it's running most of the applications at the moment). in the meantime, however, all new deployments *should* work just fine.

i will update this bug after the patches this evening.
this should all be sorted now. 

$ for i in {1..3}; do echo "--> Connected to dea$i"; ssh -A stackato-dea$i.paas.scl3.mozilla.com "host verifier.login.persona.org"; done
--> Connected to dea1
verifier.login.persona.org is an alias for login.persona.org.
login.persona.org is an alias for persona-org-0210-1801955454.us-east-1.elb.amazonaws.com.
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.236.195.220
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.84.127.61
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 107.21.18.235
--> Connected to dea2
verifier.login.persona.org is an alias for login.persona.org.
login.persona.org is an alias for persona-org-0210-1801955454.us-east-1.elb.amazonaws.com.
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.84.127.61
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 107.21.18.235
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.236.195.220
--> Connected to dea3
verifier.login.persona.org is an alias for login.persona.org.
login.persona.org is an alias for persona-org-0210-1801955454.us-east-1.elb.amazonaws.com.
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 107.21.18.235
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.236.195.220
persona-org-0210-1801955454.us-east-1.elb.amazonaws.com has address 54.84.127.61
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard