Closed Bug 1144206 Opened 10 years ago Closed 10 years ago

investigate what is causing frequent talos failures on 10.10

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kmoir, Assigned: kmoir)

Attachments

(1 file, 1 obsolete file)

Current thinking is that DNS resolution is failing on the yosemite slaves for graphs.mozilla.org on an intermittent basis

from #releng

<philor> make talos hit graphserver at the start, just downloading a 404, to get it cached before DNS breaks?
<philor> or is it particularly about graphserver, because it has a tiny ttl?
<jmaher> philor: not sure I understand- this is a 10.10 issue- 10.8 and 10.6 never had issues with finding dns for graphs.mozilla.org
<jmaher> philor: we can work around it in talos if needed, but there is probably something specific to 10.10 causing the problem
<philor> jmaher: yeah, google will tell you an enormous amount, nearly none of it useful, about people's problems with DNS on 10.10
<philor> and how they've copied the library that 10.8 used to do DNS and grafted it into their 10.10 install
<jmaher> philor: reason 174 to not use OSX
<philor> which I haven't quite suggested yet
<jmaher> I believe we need to adjust our 10.10 image for this unless there is a trick with python to force the DNS resolution or retry it to make it work
<coop|buildduty> /etc/hosts managed by puppet?
<jmaher> coop|buildduty: oh, good idea
<catlee> so we used to have this problem on osx
<catlee> had to do with security contexts or somesuch
<catlee> esp. when buildbot was started via ssh
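The "retry it" idea from the chat could look roughly like the sketch below. This is only an illustration, not what talos does: `resolve_with_retry`, its parameters, and the injectable `resolver` argument are hypothetical names, and the only real API used is `socket.getaddrinfo`.

```python
import socket
import time


def resolve_with_retry(host, attempts=5, delay=2.0, resolver=socket.getaddrinfo):
    """Return the sorted addresses for host, retrying on resolver errors.

    Hypothetical helper: the name, defaults and `resolver` hook are
    assumptions made for illustration only.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            infos = resolver(host, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string for IPv4 and IPv6 alike.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as error:
            last_error = error
            # Sleep between attempts, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

A caller would do something like `resolve_with_retry("graphs.mozilla.org")` before the first upload. Retrying only hides the symptom: if the system resolver daemon is wedged, each attempt can still block for a long time, so fixing the image is the better route.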
The logs have errors like these

Mar 17 12:16:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver Error 9 on socket - this might be a closed socket
Mar 17 12:26:06 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:27:11 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:28:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:29:42 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:31:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver Re-Binding to random udp port 61696
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver Error 9 on socket - this might be a closed socket
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 15
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned -65568
Mar 17 13:30:41 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:31:46 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:32:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:34:16 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:35:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response
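To see which processes are hitting the resolver timeout and how often, log lines in the format above could be tallied with something like the following. It is a rough sketch: `count_dns_timeouts` and the regular expression are hypothetical and only tuned to the lines quoted in this bug.

```python
import re
from collections import Counter

# Matches "<process>[<pid>]: dnssd_clientstub ... timed out" in a syslog line.
# The pattern is an assumption based on the log excerpt quoted above.
TIMEOUT_RE = re.compile(
    r"\s(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: dnssd_clientstub .*timed out"
)


def count_dns_timeouts(lines):
    """Count dnssd_clientstub timeouts per (process name, pid)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("proc"), int(match.group("pid")))] += 1
    return counts
```

Feeding it an open log file, for example `count_dns_timeouts(open(path))` with whatever `path` the slave logs to, gives a per-process count that makes it easy to compare a slave before and after a config change.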
There are /var/run/mDNSResponder streams and

[root@t-yosemite-r5-0037.test.releng.scl3.mozilla.com log]# netstat | grep -i dns
udp6 0 0 *.mdns *.*
udp4 0 0 *.mdns *.*
udp6 0 0 *.mdns *.*
udp4 0 0 *.mdns *.*
5e25c65006795c6f stream 0 0 0 5e25c6500c1fa88f 0 0 /var/run/mDNSResponder
5e25c65003a659b7 stream 0 0 0 5e25c65003a65d9f 0 0 /var/run/mDNSResponder
etc
(In reply to Kim Moir [:kmoir] from comment #1)
> Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com
> sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned
> -65568

One amusing aspect of my current theory, that it is Bonjour and we need to disable it like many of the poorly-phrased "wifi on Yosemite is busted" blog posts say, is that it could explain why when we think nothing has changed we suddenly see huge swings in the percentage of jobs failing: make the Bonjour situation on the network more complicated by, say, bringing in some laptops for diagnostics or reimaging, and all the 10.10 slaves start trying to discover whether they have shared iTunes libraries, and shared photos, and printers attached to them, and on and on. Then "reimage 15 more 10.8 machines as 10.10" and all the others will have to rabidly try to discover everything about *them*.
Loaned t-yosemite-r5-0009 to myself to test the config suggested in http://blogs.smartertools.com/2014/10/31/a-fix-for-yosemite-wi-fi-issues/ on dev-master.
This repo seems to lend credence to the issue (they are a large Mac colo): https://github.com/MacMiniVault/Mac-Scripts/blob/master/disablebonjour/disablebonjour-README.md

I'm going to test this fix on my loaner.
Attached patch bug1144206.patch (obsolete) — Splinter Review
Attached patch bug1144206.patch — Splinter Review
Attachment #8579411 - Attachment is obsolete: true
Test runs in staging are green so far
Comment on attachment 8579445 bug1144206.patch

Review of attachment 8579445:
-----------------------------------------------------------------

lgtm

please clean up trailing whitespace before landing!
Attachment #8579445 - Flags: review+
Comment on attachment 8579445 bug1144206.patch

and merged to production
whitespace fixed
Attachment #8579445 - Flags: checked-in+
So I can see this is deployed to machines that have recently rebooted, but not to ones that have not (3 hours of uptime in some cases). Puppet runs less often now that we have runner reducing the number of reboots.
I lost patience with it and rebooted most of them (accursed slave health won't let you reboot everything that hasn't reported for 0 minutes, so I had to let a few slide, at least in my first round) at 22:19, since I had an infra tree closure going anyway.
Rebooted all of them by 23:02, except that t-yosemite-r5-0034 and t-yosemite-r5-0004 refused to reboot but continued taking jobs, so I disabled them on the off chance that'll stop them, and t-yosemite-r5-0005 refused to reboot but at least had the courtesy to not keep taking jobs. So as of jobs starting at 23:02 PDT, we don't expect anything other than perhaps those three to show up in bug 1134790.
Thanks philor! I re-enabled t-yosemite-r5-0034 and t-yosemite-r5-0004 since I can see that they ran puppet and have the no-multicast option enabled now.
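A check like "does this slave have the no-multicast option" could be scripted along these lines. This is a sketch under assumptions: the bug does not spell out where the option lives, so the idea that it appears as a `--no-multicast` entry in the `ProgramArguments` of a launchd plist (as in the disablebonjour script linked earlier) is an assumption, and `has_program_argument` is a hypothetical helper.

```python
import plistlib


def has_program_argument(plist_bytes, flag):
    """Return True if a launchd job's ProgramArguments include flag.

    Hypothetical helper; it only inspects plist data handed to it and
    makes no claim about which plist the puppet patch actually edits.
    """
    job = plistlib.loads(plist_bytes)
    return flag in job.get("ProgramArguments", [])
```

On a slave this would be called with the bytes of the relevant launchd plist, e.g. `has_program_argument(open(plist_path, "rb").read(), "--no-multicast")`, where both `plist_path` and the flag spelling are placeholders to be confirmed against the landed patch.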
10.10 Talos tests look green this morning so I'm going to close this bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 1201230
Assignee: nobody → kmoir
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard