Closed
Bug 780312
Opened 13 years ago
Closed 13 years ago
Token Server not using GETHOSTBYNAME(3) to do DNS Resolution
Categories
(Cloud Services Graveyard :: Server: Token, defect)
Cloud Services Graveyard
Server: Token
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: bobm, Assigned: tarek)
Details
(Whiteboard: [qa+])
For every connection to STOKEN from TOKEN three DNS lookups occur. First it makes two AAAA records, for IPv6 compatability (and because there are two DNS servers in /etc/resolv.conf) and then an A record lookup. The application should either use GETHOSTBYNAME(3) from glibc (to take advantage of the DNS caching provided by NSCD), use DNS caching local to the application, or try DNS queries again upon failure to resolve.
When DNS resolution fails the application returns a 503 error, as in the following:
2012-08-03T22:19:22.027607 bad json body from sreg (http://stoken.sreg.scl2.stage.svc.mozilla.com/1.0/aitc-1.0): <urlopen error [Errno 3] name does not exist>
2012-08-03T22:19:22.044636 fec58e7380d04cb99d691c6c09de1c87
2012-08-03T22:19:22.044864 BackendError
Bad answer from proxy
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/mozsvc/tweens.py", line 36, in catch_backend_errors_tween
return handler(request)
File "/usr/lib/python2.6/site-packages/cornice/__init__.py", line 48, in cornice_tween
response = handler(request)
File "/usr/lib/python2.6/site-packages/pyramid/tweens.py", line 20, in excview_tween
response = handler(request)
File "/usr/lib/python2.6/site-packages/pyramid/router.py", line 164, in handle_request
response = view_callable(context, request)
File "/usr/lib/python2.6/site-packages/pyramid/config/views.py", line 307, in attr_view
return view(context, request)
File "/usr/lib/python2.6/site-packages/pyramid/config/views.py", line 280, in predicate_wrapper
return view(context, request)
File "/usr/lib/python2.6/site-packages/pyramid/config/views.py", line 333, in rendered_view
result = view(context, request)
File "/usr/lib/python2.6/site-packages/cornice/service.py", line 30, in call_service
return func(request)
File "/usr/lib/python2.6/site-packages/tokenserver/views.py", line 182, in return_token
uid, node = backend.allocate_node(email, service)
File "/usr/lib/python2.6/site-packages/tokenserver/assignment/sqlnode.py", line 86, in allocate_node
status, body = self._proxy('POST', url, data)
File "/usr/lib/python2.6/site-packages/tokenserver/assignment/sqlnode.py", line 79, in _proxy
raise BackendError('Bad answer from proxy')
BackendError: BackendError
Bad answer from proxy
user: -
| Reporter | ||
Comment 1•13 years ago
|
||
Output from a TCPDUMP on token server showing constant DNS queries:
15:22:00.015203 IP token1.reg.scl2.stage.svc.mozilla.com.38571 > ns1.scl2.svc.mozilla.com.domain: 14615+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com. (56)
15:22:00.018020 IP token1.reg.scl2.stage.svc.mozilla.com.37417 > ns2.scl2.svc.mozilla.com.domain: 1165+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com.mozilla.org. (68)
15:22:00.027819 IP token1.reg.scl2.stage.svc.mozilla.com.20913 > ns2.scl2.svc.mozilla.com.domain: 32299+ A? stoken.sreg.scl2.stage.svc.mozilla.com. (56)
15:22:00.027936 IP token1.reg.scl2.stage.svc.mozilla.com.17172 > ns1.scl2.svc.mozilla.com.domain: 34706+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com. (56)
15:22:00.031290 IP token1.reg.scl2.stage.svc.mozilla.com.20913 > ns2.scl2.svc.mozilla.com.domain: 29482+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com.mozilla.org. (68)
15:22:00.089260 IP token1.reg.scl2.stage.svc.mozilla.com.46551 > ns2.scl2.svc.mozilla.com.domain: 51125+ A? stoken.sreg.scl2.stage.svc.mozilla.com. (56)
15:22:00.089385 IP token1.reg.scl2.stage.svc.mozilla.com.61327 > ns1.scl2.svc.mozilla.com.domain: 11916+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com. (56)
15:22:00.092052 IP token1.reg.scl2.stage.svc.mozilla.com.46551 > ns2.scl2.svc.mozilla.com.domain: 43914+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com.mozilla.org. (68)
15:22:00.191025 IP token1.reg.scl2.stage.svc.mozilla.com.10448 > ns2.scl2.svc.mozilla.com.domain: 502+ A? stoken.sreg.scl2.stage.svc.mozilla.com. (56)
| Reporter | ||
Comment 2•13 years ago
|
||
NSCD debug logs showing caching working for GETHOSTBYNAME(3) queries:
root@token1.reg.scl2.stage.svc log]# perl -e 'gethostbyname("ibm.com");'
[root@token1.reg.scl2.stage.svc log]# grep ibm.com /var/log/nscd.log
Fri 03 Aug 2012 03:54:39 PM PDT - 2075: GETHOSTBYNAME (ibm.com)
Fri 03 Aug 2012 03:54:39 PM PDT - 2075: Haven't found "ibm.com" in hosts cache!
The GETHOSTBYNAME(3) query added an address to the NSCD cache:
[bobm@token1.reg.scl2.stage.svc ~]$ sudo strings /var/db/nscd/hosts |grep ibm.com
ibm.com
ibm.com
The application queries are not causing STOKEN servers to be added to NSCD cache:
[root@token1.reg.scl2.stage.svc log]# grep stoken /var/log/nscd.log
[root@token1.reg.scl2.stage.svc log]#
| Reporter | ||
Comment 3•13 years ago
|
||
The Operations Services Development team has located this useful information regarding the library making the DNS queries: http://stackoverflow.com/questions/2236498/tell-urllib2-to-use-custom-dns
Updated•13 years ago
|
Whiteboard: [qa+]
| Assignee | ||
Updated•13 years ago
|
Assignee: nobody → tarek
| Assignee | ||
Comment 4•13 years ago
|
||
Looks like Python has exactly the same behavior:
$ python -c "import socket;print socket.gethostbyname('afpy.org')"
88.191.125.34
[tarek@token1.reg.scl2.stage.svc ~]$ sudo strings /var/db/nscd/hosts |grep afpy
afpy.org
}"afpy.org
So maybe somehow the code is not calling this API, but I have a doubt
Also, I don't see a /var/log/nscd.log file on token1...
| Assignee | ||
Comment 5•13 years ago
|
||
After discussions with Rob in IRC and further digging, getaddrinfo() is the API used by the socket module.
It seems that this API doesn't cache dns lookups when the domain has AAA *and* the call is made in a particular order. https://bugzilla.redhat.com/show_bug.cgi?id=576801
The solution is to use gethostbyname() to translate any call to the socket creation.
I am note sure what's the best solution yet, because it has to be compatible with Gevent. Will work on it.
Notice that the solution should be used in all our projects.
| Assignee | ||
Comment 6•13 years ago
|
||
s/I am not sure what's the best solution yet/I am note sure what's the best spot to apply the solution yet/
| Assignee | ||
Comment 7•13 years ago
|
||
It appears that gevent has its own implementation, based on libevent's evdns_resolve_ipv4 which is completely different from glibc's resolver.
See http://monkey.org/~provos/libevent/doxygen/evdns_8h.html#68e22e497f71fc97e7b85512319867ee
I am not going to digg further and make the assumption that using the original stdlib blocking calls should be fine in our apps and will use the nscd cache.
| Assignee | ||
Comment 8•13 years ago
|
||
notice that gevent replaces both gethostbyname and getaddrinfo when the stdlib is patched
| Assignee | ||
Comment 9•13 years ago
|
||
I have one pull request pending in mozsvc at https://github.com/mozilla-services/mozservices/pull/21
Rob, do you mind reviewing it ? Once it's in I can use it in the token server, with an extra memory cache so it's called at most once an hour
Comment 10•13 years ago
|
||
Code reviewed and pull request merged.
Comment 11•13 years ago
|
||
So, at this point, I am going to assume we need a new deploy ticket similar to the previous one:
Bug 774918 - Deploy tokenserver 0.10-2 to dev/stage token servers
Status: NEW → ASSIGNED
QA Contact: jbonacci
Comment 12•13 years ago
|
||
And we have that deploy ticket:
Bug 784083 - token: Deploy tokenserver 1.1-1 to dev/stage token servers
And it has been Verified for Dev and Stage.
:bobm do you consider this ticket as Resolved/Fixed?
| Reporter | ||
Comment 13•13 years ago
|
||
Not yet. The surge in DNS requests was not observed during token server testing today, but token server was not working.
| Reporter | ||
Comment 14•13 years ago
|
||
This has been resolved. However, there may be some section of the code doing unnecessary lookups on db.user. But, on a very minor level.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 15•13 years ago
|
||
OK. Given Comment 14, I am going to mark this Verified.
We can always open a ticket for that specific case.
Status: RESOLVED → VERIFIED
Updated•3 years ago
|
Product: Cloud Services → Cloud Services Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•