Closed Bug 780312 Opened 13 years ago Closed 13 years ago

Token Server not using GETHOSTBYNAME(3) to do DNS Resolution

Categories

(Cloud Services Graveyard :: Server: Token, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: bobm, Assigned: tarek)

Details

(Whiteboard: [qa+])

For every connection to STOKEN from TOKEN three DNS lookups occur. First it makes two AAAA records, for IPv6 compatability (and because there are two DNS servers in /etc/resolv.conf) and then an A record lookup. The application should either use GETHOSTBYNAME(3) from glibc (to take advantage of the DNS caching provided by NSCD), use DNS caching local to the application, or try DNS queries again upon failure to resolve. When DNS resolution fails the application returns a 503 error, as in the following: 2012-08-03T22:19:22.027607 bad json body from sreg (http://stoken.sreg.scl2.stage.svc.mozilla.com/1.0/aitc-1.0): <urlopen error [Errno 3] name does not exist> 2012-08-03T22:19:22.044636 fec58e7380d04cb99d691c6c09de1c87 2012-08-03T22:19:22.044864 BackendError Bad answer from proxy Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/mozsvc/tweens.py", line 36, in catch_backend_errors_tween return handler(request) File "/usr/lib/python2.6/site-packages/cornice/__init__.py", line 48, in cornice_tween response = handler(request) File "/usr/lib/python2.6/site-packages/pyramid/tweens.py", line 20, in excview_tween response = handler(request) File "/usr/lib/python2.6/site-packages/pyramid/router.py", line 164, in handle_request response = view_callable(context, request) File "/usr/lib/python2.6/site-packages/pyramid/config/views.py", line 307, in attr_view return view(context, request) File "/usr/lib/python2.6/site-packages/pyramid/config/views.py", line 280, in predicate_wrapper return view(context, request) File "/usr/lib/python2.6/site-packages/pyramid/config/views.py", line 333, in rendered_view result = view(context, request) File "/usr/lib/python2.6/site-packages/cornice/service.py", line 30, in call_service return func(request) File "/usr/lib/python2.6/site-packages/tokenserver/views.py", line 182, in return_token uid, node = backend.allocate_node(email, service) File "/usr/lib/python2.6/site-packages/tokenserver/assignment/sqlnode.py", line 86, in allocate_node status, body = self._proxy('POST', url, data) File "/usr/lib/python2.6/site-packages/tokenserver/assignment/sqlnode.py", line 79, in _proxy raise BackendError('Bad answer from proxy') BackendError: BackendError Bad answer from proxy user: -
Output from a TCPDUMP on token server showing constant DNS queries: 15:22:00.015203 IP token1.reg.scl2.stage.svc.mozilla.com.38571 > ns1.scl2.svc.mozilla.com.domain: 14615+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com. (56) 15:22:00.018020 IP token1.reg.scl2.stage.svc.mozilla.com.37417 > ns2.scl2.svc.mozilla.com.domain: 1165+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com.mozilla.org. (68) 15:22:00.027819 IP token1.reg.scl2.stage.svc.mozilla.com.20913 > ns2.scl2.svc.mozilla.com.domain: 32299+ A? stoken.sreg.scl2.stage.svc.mozilla.com. (56) 15:22:00.027936 IP token1.reg.scl2.stage.svc.mozilla.com.17172 > ns1.scl2.svc.mozilla.com.domain: 34706+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com. (56) 15:22:00.031290 IP token1.reg.scl2.stage.svc.mozilla.com.20913 > ns2.scl2.svc.mozilla.com.domain: 29482+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com.mozilla.org. (68) 15:22:00.089260 IP token1.reg.scl2.stage.svc.mozilla.com.46551 > ns2.scl2.svc.mozilla.com.domain: 51125+ A? stoken.sreg.scl2.stage.svc.mozilla.com. (56) 15:22:00.089385 IP token1.reg.scl2.stage.svc.mozilla.com.61327 > ns1.scl2.svc.mozilla.com.domain: 11916+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com. (56) 15:22:00.092052 IP token1.reg.scl2.stage.svc.mozilla.com.46551 > ns2.scl2.svc.mozilla.com.domain: 43914+ AAAA? stoken.sreg.scl2.stage.svc.mozilla.com.mozilla.org. (68) 15:22:00.191025 IP token1.reg.scl2.stage.svc.mozilla.com.10448 > ns2.scl2.svc.mozilla.com.domain: 502+ A? stoken.sreg.scl2.stage.svc.mozilla.com. (56)
NSCD debug logs showing caching working for GETHOSTBYNAME(3) queries: root@token1.reg.scl2.stage.svc log]# perl -e 'gethostbyname("ibm.com");' [root@token1.reg.scl2.stage.svc log]# grep ibm.com /var/log/nscd.log Fri 03 Aug 2012 03:54:39 PM PDT - 2075: GETHOSTBYNAME (ibm.com) Fri 03 Aug 2012 03:54:39 PM PDT - 2075: Haven't found "ibm.com" in hosts cache! The GETHOSTBYNAME(3) query added an address to the NSCD cache: [bobm@token1.reg.scl2.stage.svc ~]$ sudo strings /var/db/nscd/hosts |grep ibm.com ibm.com ibm.com The application queries are not causing STOKEN servers to be added to NSCD cache: [root@token1.reg.scl2.stage.svc log]# grep stoken /var/log/nscd.log [root@token1.reg.scl2.stage.svc log]#
The Operations Services Development team has located this useful information regarding the library making the DNS queries: http://stackoverflow.com/questions/2236498/tell-urllib2-to-use-custom-dns
Whiteboard: [qa+]
Assignee: nobody → tarek
Looks like Python has exactly the same behavior: $ python -c "import socket;print socket.gethostbyname('afpy.org')" 88.191.125.34 [tarek@token1.reg.scl2.stage.svc ~]$ sudo strings /var/db/nscd/hosts |grep afpy afpy.org }"afpy.org So maybe somehow the code is not calling this API, but I have a doubt Also, I don't see a /var/log/nscd.log file on token1...
After discussions with Rob in IRC and further digging, getaddrinfo() is the API used by the socket module. It seems that this API doesn't cache dns lookups when the domain has AAA *and* the call is made in a particular order. https://bugzilla.redhat.com/show_bug.cgi?id=576801 The solution is to use gethostbyname() to translate any call to the socket creation. I am note sure what's the best solution yet, because it has to be compatible with Gevent. Will work on it. Notice that the solution should be used in all our projects.
s/I am not sure what's the best solution yet/I am note sure what's the best spot to apply the solution yet/
It appears that gevent has its own implementation, based on libevent's evdns_resolve_ipv4 which is completely different from glibc's resolver. See http://monkey.org/~provos/libevent/doxygen/evdns_8h.html#68e22e497f71fc97e7b85512319867ee I am not going to digg further and make the assumption that using the original stdlib blocking calls should be fine in our apps and will use the nscd cache.
notice that gevent replaces both gethostbyname and getaddrinfo when the stdlib is patched
I have one pull request pending in mozsvc at https://github.com/mozilla-services/mozservices/pull/21 Rob, do you mind reviewing it ? Once it's in I can use it in the token server, with an extra memory cache so it's called at most once an hour
Code reviewed and pull request merged.
So, at this point, I am going to assume we need a new deploy ticket similar to the previous one: Bug 774918 - Deploy tokenserver 0.10-2 to dev/stage token servers
Status: NEW → ASSIGNED
QA Contact: jbonacci
And we have that deploy ticket: Bug 784083 - token: Deploy tokenserver 1.1-1 to dev/stage token servers And it has been Verified for Dev and Stage. :bobm do you consider this ticket as Resolved/Fixed?
Not yet. The surge in DNS requests was not observed during token server testing today, but token server was not working.
This has been resolved. However, there may be some section of the code doing unnecessary lookups on db.user. But, on a very minor level.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
OK. Given Comment 14, I am going to mark this Verified. We can always open a ticket for that specific case.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.