Closed
Bug 1144206
Opened 10 years ago
Closed 10 years ago
investigate what is causing frequent talos failures on 10.10
Categories: Infrastructure & Operations Graveyard :: CIDuty, task
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: kmoir, Assigned: kmoir
Attachments (1 file, 1 obsolete file)
1.23 KB, patch | catlee: review+ | kmoir: checked-in+
Current thinking is that DNS resolution for graphs.mozilla.org is failing intermittently on the Yosemite slaves.
From #releng:
<philor> make talos hit graphserver at the start, just downloading a 404, to get it cached before DNS breaks?
<philor> or is it particularly about graphserver, because it has a tiny ttl?
<jmaher> philor: not sure I understand- this is a 10.10 issue- 10.8 and 10.6 never had issues with finding dns for graphs.mozilla.org
<jmaher> philor: we can work around it in talos if needed, but there is probably something specific to 10.10 causing the problem
<philor> jmaher: yeah, google will tell you an enormous amount, nearly none of it useful, about people's problems with DNS on 10.10
<philor> and how they've copied the library that 10.8 used to do DNS and grafted it into their 10.10 install
<jmaher> philor: reason 174 to not use OSX
<philor> which I haven't quite suggested yet
<jmaher> I believe we need to adjust our 10.10 image for this unless there is a trick with python to force the DNS resolution or retry it to make it work
<coop|buildduty> /etc/hosts managed by puppet?
<jmaher> coop|buildduty: oh, good idea
<catlee> so we used to have this problem on osx
<catlee> had to do with security contexts or somesuch
<catlee> esp. when buildbot was started via ssh
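The "trick with python to ... retry it" that jmaher mentions could look roughly like the sketch below. This is not what talos did and not the fix that landed for this bug; the function name resolve_with_retry and its parameters are hypothetical, and it only relies on the standard library's socket.getaddrinfo.

```python
import socket
import time


def resolve_with_retry(host, attempts=5, delay=1.0, backoff=2.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying on socket.gaierror with exponential backoff.

    Returns a sorted list of unique addresses. Re-raises the last
    resolution error if every attempt fails. The resolver and sleep
    arguments exist only so the function can be exercised without a network.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            infos = resolver(host, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string for both IPv4 and IPv6.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)
                delay *= backoff
    raise last_error
```

A caller would invoke something like resolve_with_retry("graphs.mozilla.org") before posting results. A retry only papers over the resolver problem on the slave, which is why the discussion moves on to fixing the 10.10 image itself.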
Assignee • Comment 1 • 10 years ago
The logs have errors like these
Mar 17 12:16:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver Error 9 on socket - this might be a closed socket
Mar 17 12:26:06 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:27:11 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:28:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:29:42 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:31:22 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[6443]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver Re-Binding to random udp port 61696
Mar 17 12:35:43 t-yosemite-r5-0037.test.releng.scl3.mozilla.com discoveryd[48]: Basic DNSResolver Error 9 on socket - this might be a closed socket
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 15
Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned -65568
Mar 17 13:30:41 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:31:46 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:32:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:34:16 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response: Socket 13
Mar 17 13:35:56 t-yosemite-r5-0037.test.releng.scl3.mozilla.com python[7328]: dnssd_clientstub set_waitlimit:_daemon timed out (60 secs) without any response
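To see which processes are hitting the 60-second dnssd_clientstub timeout, lines like the ones above can be tallied per process. The following is a minimal sketch that assumes the syslog format shown in this comment; the names TIMEOUT_RE and count_dns_timeouts are made up for illustration.

```python
import re
from collections import Counter

# Matches "<process>[<pid>]: dnssd_clientstub ... timed out" in a syslog line.
TIMEOUT_RE = re.compile(
    r"\s(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: dnssd_clientstub .*timed out"
)


def count_dns_timeouts(lines):
    """Count dnssd_clientstub timeouts per (process name, pid)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("proc"), int(match.group("pid")))] += 1
    return counts
```

Feeding it an open log file (for example, iterating over the slave's system log) gives a quick view of whether the timeouts cluster around the python processes running the tests.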
Assignee • Comment 2 • 10 years ago
netstat shows mDNS UDP sockets and /var/run/mDNSResponder streams:
[root@t-yosemite-r5-0037.test.releng.scl3.mozilla.com log]# netstat | grep -i dns
udp6 0 0 *.mdns *.*
udp4 0 0 *.mdns *.*
udp6 0 0 *.mdns *.*
udp4 0 0 *.mdns *.*
5e25c65006795c6f stream 0 0 0 5e25c6500c1fa88f 0 0 /var/run/mDNSResponder
5e25c65003a659b7 stream 0 0 0 5e25c65003a65d9f 0 0 /var/run/mDNSResponder
etc
Comment 3 • 10 years ago
(In reply to Kim Moir [:kmoir] from comment #1)
> Mar 17 12:37:01 t-yosemite-r5-0037.test.releng.scl3.mozilla.com
> sharingd[234]: 12:37:01.652 : SDBonjourBrowser::DNSServiceBrowse returned
> -65568
One amusing aspect of my current theory (that it is Bonjour, and we need to disable it like many of the poorly-phrased "wifi on Yosemite is busted" blog posts say) is that it could explain why we suddenly see huge swings in the percentage of jobs failing when we think nothing has changed. Make the Bonjour situation on the network more complicated by, say, bringing in some laptops for diagnostics or reimaging, and all the 10.10 slaves start trying to discover whether they have shared iTunes libraries, and shared photos, and printers attached to them, and on and on. Then "reimage 15 more 10.8 machines as 10.10" and all the others will have to rabidly try to discover everything about *them*.
Assignee • Comment 4 • 10 years ago
Loaned t-yosemite-r5-0009 to myself to test the config suggested in http://blogs.smartertools.com/2014/10/31/a-fix-for-yosemite-wi-fi-issues/ on dev-master.
Assignee • Comment 5 • 10 years ago
This repo seems to lend credence to the Bonjour theory (they are a large Mac colo):
https://github.com/MacMiniVault/Mac-Scripts/blob/master/disablebonjour/disablebonjour-README.md
I'm going to test this fix on my loaner.
Assignee • Comment 6 • 10 years ago
Assignee • Comment 7 • 10 years ago
Attachment #8579411 - Attachment is obsolete: true
Assignee • Comment 8 • 10 years ago
Test runs in staging are green so far
Comment 9 • 10 years ago
Comment on attachment 8579445
bug1144206.patch
Review of attachment 8579445:
-----------------------------------------------------------------
lgtm
please clean up trailing whitespace before landing!
Attachment #8579445 - Flags: review+
Assignee • Comment 10 • 10 years ago
Comment on attachment 8579445
bug1144206.patch
Merged to production, whitespace fixed.
Attachment #8579445 - Flags: checked-in+
Assignee • Comment 11 • 10 years ago
I can see this is deployed to machines that have recently rebooted, but not to those that have not (3 hours of uptime in some cases). Puppet runs less often now that we have runner reducing the number of reboots.
Comment 12 • 10 years ago
I lost patience with it and rebooted most of them (accursed slave health won't let you reboot everything that hasn't reported for 0 minutes, so I had to let a few slide, at least in my first round) at 22:19, since I had an infra tree closure going anyway.
Comment 13 • 10 years ago
Rebooted all of them by 23:02, except that t-yosemite-r5-0034 and t-yosemite-r5-0004 refused to reboot but continued taking jobs, so I disabled them on the off chance that'll stop them, and t-yosemite-r5-0005 refused to reboot but at least had the courtesy to not keep taking jobs. So as of jobs starting at 23:02 PDT, we don't expect anything other than perhaps those three to show up in bug 1134790.
Assignee • Comment 14 • 10 years ago
Thanks philor!
I re-enabled t-yosemite-r5-0034 and t-yosemite-r5-0004 since I can see that they ran puppet and have the no-multicast option enabled now.
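Checking whether a slave has picked up the no-multicast option could be scripted along these lines. This is a sketch under assumptions: the bug does not show the patch contents, so the flag spelling "--no-multicast", the idea that it lives in the launchd job's ProgramArguments, and the plist path in the comment are all guesses based on the disablebonjour README linked in comment 5. The helper name has_no_multicast is hypothetical.

```python
import plistlib


def has_no_multicast(plist_bytes, flag="--no-multicast"):
    """Return True if a launchd job's ProgramArguments include the flag.

    plist_bytes is the raw content of a launchd plist (XML or binary).
    The default flag spelling is an assumption, not taken from the patch.
    """
    job = plistlib.loads(plist_bytes)
    return flag in job.get("ProgramArguments", [])


# Example use on a slave (path is an assumption for 10.10's discoveryd):
#
#   with open("/System/Library/LaunchDaemons/com.apple.discoveryd.plist", "rb") as f:
#       print(has_no_multicast(f.read()))
```

Note that the plist only shows what launchd will use on the next start of the daemon, which matches the observation in comment 11 that machines needed a reboot before the change took effect.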
Assignee • Comment 15 • 10 years ago
10.10 Talos tests look green this morning so I'm going to close this bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee • Updated • 9 years ago
Assignee: nobody → kmoir
Updated • 7 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated • 5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard