Closed Bug 1144206 Opened 9 years ago Closed 9 years ago

investigate what is causing frequent talos failures on 10.10

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: kmoir)

Attachments

(1 file, 1 obsolete file)

Current thinking is that DNS resolution for graphs.mozilla.org is failing intermittently on the Yosemite slaves.

From #releng:

philor	make talos hit graphserver at the start, just downloading a 404, to get it cached before DNS breaks?
	philor	or is it particularly about graphserver, because it has a tiny ttl?
	jmaher	philor: not sure I understand- this is a 10.10 issue- 10.8 and 10.6 never had issues with finding dns for graphs.mozilla.org
	jmaher	philor: we can work around it in talos if needed, but there is probably something specific to 10.10 causing the problem
	philor	jmaher: yeah, google will tell you an enormous amount, nearly none of it useful, about people's problems with DNS on 10.10
	philor	and how they've copied the library that 10.8 used to do DNS and grafted it into their 10.10 install
	jmaher	philor: reason 174 to not use OSX
	philor	which I haven't quite suggested yet
	jmaher	I believe we need to adjust our 10.10 image for this unless there is a trick with python to force the DNS resolution or retry it to make it work
	coop|buildduty	/etc/hosts managed by puppet?
	jmaher	coop|buildduty: oh, good idea
	catlee	so we used to have this problem on osx
	catlee	had to do with security contexts or somesuch
	catlee	esp. when buildbot was started via ssh
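
For reference, the "trick with python to ... retry it" idea from the chat above could look roughly like the sketch below. It is not what talos does and not the fix that landed for this bug; `resolve_with_retry` and its parameters are hypothetical names, and the attempt count and delay are arbitrary. It retries `socket.getaddrinfo` when it raises `socket.gaierror`.

```python
import socket
import time


def resolve_with_retry(host, attempts=5, delay=2.0, resolver=socket.getaddrinfo):
    """Resolve host, retrying when the lookup raises socket.gaierror.

    Hypothetical helper: the resolver argument exists only so the retry
    logic can be exercised without a real network lookup.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # Port 80 is an assumption; getaddrinfo needs a port or service.
            return resolver(host, 80)
        except socket.gaierror as exc:
            last_error = exc
            # Sleep between attempts, but not after the final failure.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Calling `resolve_with_retry("graphs.mozilla.org")` before the first upload would also cover philor's suggestion of touching graphserver early, though a retry like this only hides the symptom if the resolver daemon itself is wedged.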
The logs have errors like these:
Mar 17 12:16:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver  Error 9 on socket - this might be a closed socket
Mar 17 12:26:06 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:27:11 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:28:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:29:42 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:31:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver  Re-Binding to random udp port 61696
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver  Error 9 on socket - this might be a closed socket
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 15
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned -65568
Mar 17 13:30:41 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:31:46 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:32:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:34:16 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:35:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response
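
To compare slaves, the timeouts in an excerpt like the one above can be tallied per process. This is a rough sketch that assumes the syslog line layout shown here (timestamp, host, `process[pid]:`, message); `count_timeouts` and `TIMEOUT_RE` are hypothetical names.

```python
import re
from collections import Counter

# Matches lines such as:
#   Mar 17 12:26:06 <host> python[6443]: dnssd_clientstub ... timed out ...
TIMEOUT_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(?P<host>\S+)\s"
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s.*dnssd_clientstub.*timed out"
)


def count_timeouts(lines):
    """Count dnssd_clientstub timeouts, keyed by (process name, pid)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[(match.group("proc"), int(match.group("pid")))] += 1
    return counts
```

Run over the lines quoted in this comment, it gives five timeouts each for python[6443] and python[7328] and one for sharingd[234]. The discoveryd lines are not counted, because they report socket errors rather than client-side timeouts.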
netstat shows /var/run/mDNSResponder streams and mdns UDP listeners:
[root@t-yosemite-r5-0037.test.releng.scl3.mozilla.com log]# netstat | grep -i dns
udp6       0      0  *.mdns                 *.*                               
udp4       0      0  *.mdns                 *.*                               
udp6       0      0  *.mdns                 *.*                               
udp4       0      0  *.mdns                 *.*                               
5e25c65006795c6f stream      0      0                0 5e25c6500c1fa88f                0                0 /var/run/mDNSResponder
5e25c65003a659b7 stream      0      0                0 5e25c65003a65d9f                0                0 /var/run/mDNSResponder
etc
(In reply to Kim Moir [:kmoir] from comment #1)
> Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com
> sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned
> -65568

One amusing aspect of my current theory, that it is Bonjour and we need to disable it like many of the poorly-phrased "wifi on Yosemite is busted" blog posts say, is that it could explain why when we think nothing has changed we suddenly see huge swings in the percentage of jobs failing: make the Bonjour situation on the network more complicated by, say, bringing in some laptops for diagnostics or reimaging, and all the 10.10 slaves start trying to discover whether they have shared iTunes libraries, and shared photos, and printers attached to them, and on and on. Then "reimage 15 more 10.8 machines as 10.10" and all the others will have to rabidly try to discover everything about *them*.
loaned t-yosemite-r5-0009 to myself to test config suggested in http://blogs.smartertools.com/2014/10/31/a-fix-for-yosemite-wi-fi-issues/ on dev-master
This repo seems to lend credence to the issue (they are a large mac colo)
https://github.com/MacMiniVault/Mac-Scripts/blob/master/disablebonjour/disablebonjour-README.md

I'm going to test this fix on my loaner
Attached patch bug1144206.patch (obsolete)
Attached patch bug1144206.patch
Attachment #8579411 - Attachment is obsolete: true
Test runs in staging are green so far
Comment on attachment 8579445
bug1144206.patch

Review of attachment 8579445:
-----------------------------------------------------------------

lgtm

please clean up trailing whitespace before landing!
Attachment #8579445 - Flags: review+
Comment on attachment 8579445
bug1144206.patch

Merged to production, with the trailing whitespace fixed.
Attachment #8579445 - Flags: checked-in+
I can see this is deployed to machines that have rebooted recently, but not to the ones that have not (some have 3 hours of uptime). Puppet runs less often now that runner has reduced the number of reboots.
I lost patience with it and rebooted most of them (accursed slave health won't let you reboot everything that hasn't reported for 0 minutes, so I had to let a few slide, at least in my first round) at 22:19, since I had an infra tree closure going anyway.
Rebooted all of them by 23:02, except that t-yosemite-r5-0034 and t-yosemite-r5-0004 refused to reboot but continued taking jobs, so I disabled them on the off chance that'll stop them, and t-yosemite-r5-0005 refused to reboot but at least had the courtesy to not keep taking jobs. So as of jobs starting at 23:02 PDT, we don't expect anything other than perhaps those three to show up in bug 1134790.
Thanks philor!

I re-enabled t-yosemite-r5-0034 and t-yosemite-r5-0004, since I can see that they ran puppet and now have the no-multicast option enabled.
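
A check like the one described above can be scripted. This sketch assumes the option appears as a `--no-multicast` entry in the `ProgramArguments` array of the discoveryd launchd plist; the exact flag spelling and the plist path in the usage note are assumptions, since this bug only refers to a "no-multicast option". `has_launchd_flag` is a hypothetical name.

```python
import plistlib


def has_launchd_flag(plist_bytes, flag="--no-multicast"):
    """Return True if flag is listed in the plist's ProgramArguments.

    The default flag value is an assumption, not taken from the patch.
    """
    data = plistlib.loads(plist_bytes)
    return flag in data.get("ProgramArguments", [])
```

On a slave, the input would be the contents of the daemon's plist read in binary mode, for example from `/System/Library/LaunchDaemons/com.apple.discoveryd.plist` (assumed path). A changed plist only takes effect once the daemon is reloaded, so a machine can pass this check while still running the old configuration until it restarts the service or reboots.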
10.10 Talos tests look green this morning so I'm going to close this bug.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1201230
Assignee: nobody → kmoir
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard