investigate what is causing frequent talos failures on 10.10

RESOLVED FIXED

Status

Release Engineering
Platform Support
3 years ago
2 years ago

People

(Reporter: kmoir, Assigned: kmoir)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

3 years ago
Current thinking is that DNS resolution for graphs.mozilla.org is failing intermittently on the Yosemite slaves.

From #releng:

philor	make talos hit graphserver at the start, just downloading a 404, to get it cached before DNS breaks?
philor	or is it particularly about graphserver, because it has a tiny ttl?
jmaher	philor: not sure I understand- this is a 10.10 issue- 10.8 and 10.6 never had issues with finding dns for graphs.mozilla.org
jmaher	philor: we can work around it in talos if needed, but there is probably something specific to 10.10 causing the problem
philor	jmaher: yeah, google will tell you an enormous amount, nearly none of it useful, about people's problems with DNS on 10.10
philor	and how they've copied the library that 10.8 used to do DNS and grafted it into their 10.10 install
jmaher	philor: reason 174 to not use OSX
philor	which I haven't quite suggested yet
jmaher	I believe we need to adjust our 10.10 image for this unless there is a trick with python to force the DNS resolution or retry it to make it work
coop|buildduty	/etc/hosts managed by puppet?
jmaher	coop|buildduty: oh, good idea
catlee	so we used to have this problem on osx
catlee	had to do with security contexts or somesuch
catlee	esp. when buildbot was started via ssh
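
On the "trick with python" idea from the log: a minimal sketch of retrying the lookup with backoff before talking to graphs.mozilla.org. The function name `resolve_with_retry` and its defaults are hypothetical and not taken from talos; the `resolver` and `sleep` parameters exist only so the behaviour can be exercised without a real network.

```python
import socket
import time


def resolve_with_retry(host, attempts=5, delay=1.0, backoff=2.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying on DNS failure; return a sorted list of addresses.

    attempts, delay, and backoff are illustrative defaults, not tuned values.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            results = resolver(host, None)
            # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            return sorted({entry[4][0] for entry in results})
        except socket.gaierror as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay)
                delay *= backoff
    raise last_error
```

A caller would use `resolve_with_retry("graphs.mozilla.org")` once at startup. This only papers over the symptom: if each failed lookup blocks for a long time inside the OS resolver, retries add to the job's runtime rather than fixing the cause.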
(Assignee)

Comment 1

3 years ago
The logs have errors like these:
Mar 17 12:16:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver  Error 9 on socket - this might be a closed socket
Mar 17 12:26:06 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:27:11 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:28:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:29:42 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:31:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver  Re-Binding to random udp port 61696
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver  Error 9 on socket - this might be a closed socket
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 15
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned -65568
Mar 17 13:30:41 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:31:46 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:32:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:34:16 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:35:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response
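
To compare slaves, it may help to tally these timeouts per process. A rough sketch, assuming the syslog line layout shown in the excerpt above; `count_dns_timeouts` is a hypothetical helper, not an existing releng tool.

```python
import re
from collections import Counter

# Matches syslog lines of the form quoted above:
# "<Mon> <day> <hh:mm:ss> <host> <proc>[<pid>]: dnssd_clientstub ... timed out ..."
TIMEOUT_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]:\s+"
    r"dnssd_clientstub .*timed out"
)


def count_dns_timeouts(lines):
    """Return a Counter of (process, pid) -> number of dnssd_clientstub timeouts."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[(match.group("proc"), int(match.group("pid")))] += 1
    return counts
```

Fed the excerpt above, this would report five timeouts each for python[6443] and python[7328] and one for sharingd[234]; the discoveryd lines are ignored because they are not client-side timeouts.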
(Assignee)

Comment 2

3 years ago
netstat shows open /var/run/mDNSResponder streams and mDNS UDP sockets:
[root@t-yosemite-r5-0037.test.releng.scl3.mozilla.com log]# netstat | grep -i dns
udp6       0      0  *.mdns                 *.*                               
udp4       0      0  *.mdns                 *.*                               
udp6       0      0  *.mdns                 *.*                               
udp4       0      0  *.mdns                 *.*                               
5e25c65006795c6f stream      0      0                0 5e25c6500c1fa88f                0                0 /var/run/mDNSResponder
5e25c65003a659b7 stream      0      0                0 5e25c65003a65d9f                0                0 /var/run/mDNSResponder
etc
(In reply to Kim Moir [:kmoir] from comment #1)
> Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com
> sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned
> -65568

One amusing aspect of my current theory, that it is Bonjour and we need to disable it like many of the poorly-phrased "wifi on Yosemite is busted" blog posts say, is that it could explain why when we think nothing has changed we suddenly see huge swings in the percentage of jobs failing: make the Bonjour situation on the network more complicated by, say, bringing in some laptops for diagnostics or reimaging, and all the 10.10 slaves start trying to discover whether they have shared iTunes libraries, and shared photos, and printers attached to them, and on and on. Then "reimage 15 more 10.8 machines as 10.10" and all the others will have to rabidly try to discover everything about *them*.
(Assignee)

Comment 4

3 years ago
Loaned t-yosemite-r5-0009 to myself to test the config suggested in http://blogs.smartertools.com/2014/10/31/a-fix-for-yosemite-wi-fi-issues/ on dev-master.
(Assignee)

Comment 5

3 years ago
This repo seems to lend credence to the Bonjour theory (they are a large Mac colo):
https://github.com/MacMiniVault/Mac-Scripts/blob/master/disablebonjour/disablebonjour-README.md

I'm going to test this fix on my loaner
(Assignee)

Comment 6

3 years ago
Created attachment 8579411 [details] [diff] [review]
bug1144206.patch
(Assignee)

Comment 7

3 years ago
Created attachment 8579445 [details] [diff] [review]
bug1144206.patch
Attachment #8579411 - Attachment is obsolete: true
(Assignee)

Comment 8

3 years ago
Test runs in staging are green so far
Comment on attachment 8579445 [details] [diff] [review]
bug1144206.patch

Review of attachment 8579445 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm

please clean up trailing whitespace before landing!
Attachment #8579445 - Flags: review+
(Assignee)

Comment 10

3 years ago
Comment on attachment 8579445 [details] [diff] [review]
bug1144206.patch

and merged to production
whitespace fixed
Attachment #8579445 - Flags: checked-in+
(Assignee)

Comment 11

3 years ago
So I can see this is deployed to machines that have recently rebooted, but not to ones that haven't (3 hrs of uptime in some cases). Puppet runs less often now that we have runner reducing the number of reboots.
I lost patience with it and rebooted most of them (accursed slave health won't let you reboot everything that hasn't reported for 0 minutes, so I had to let a few slide, at least in my first round) at 22:19, since I had an infra tree closure going anyway.
Rebooted all of them by 23:02, except that t-yosemite-r5-0034 and t-yosemite-r5-0004 refused to reboot but continued taking jobs, so I disabled them on the off chance that'll stop them, and t-yosemite-r5-0005 refused to reboot but at least had the courtesy to not keep taking jobs. So as of jobs starting at 23:02 PDT, we don't expect anything other than perhaps those three to show up in bug 1134790.
(Assignee)

Comment 14

3 years ago
Thanks philor!

I re-enabled t-yosemite-r5-0034 and t-yosemite-r5-0004 since I can see that they ran puppet and have the no-multicast option enabled now.
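
For checking other slaves the same way, a sketch of a script that looks for the option in a launchd plist. Both the flag spelling (`--no-multicast`) and the plist path are assumptions here and are not stated in this bug; confirm them against the landed patch before relying on this.

```python
import plistlib

# Assumption: the option is passed as "--no-multicast" in the ProgramArguments
# of the discoveryd launchd plist. Neither the flag spelling nor the path is
# taken from this bug.
ASSUMED_PLIST = "/System/Library/LaunchDaemons/com.apple.discoveryd.plist"
ASSUMED_FLAG = "--no-multicast"


def has_program_argument(plist_bytes, flag=ASSUMED_FLAG):
    """Return True if flag appears in the plist's ProgramArguments array."""
    data = plistlib.loads(plist_bytes)
    return flag in data.get("ProgramArguments", [])


def check_host(path=ASSUMED_PLIST, flag=ASSUMED_FLAG):
    """Read a launchd plist from disk and report whether flag is set."""
    with open(path, "rb") as handle:
        return has_program_argument(handle.read(), flag)
```

Running `check_host()` on a slave would return True once puppet has applied the change, under the assumptions above. A True result only shows the file was edited; the daemon still has to be restarted (or the machine rebooted) for it to take effect.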
(Assignee)

Comment 15

3 years ago
10.10 Talos tests look green this morning so I'm going to close this bug.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Blocks: 1201230
(Assignee)

Updated

2 years ago
Assignee: nobody → kmoir