Closed Bug 980889 Opened 8 years ago Closed 8 years ago
servo linux slaves couldn't successfully run puppet
From #releng on IRC (14 Mar 2014, times in CET pm):

Ms2ger  1:18:10 Looks like all Servo's Linux slaves are dead? bhearsum|afk?
        1:18:11 http://servo-buildbot.pub.build.mozilla.org/buildslaves
pmoore  1:20:09 Ms2ger: bhearsum|afk should be around in about an hour or so
Ms2ger  1:21:09 You don't happen to be able to fix it? :)
pmoore  1:22:12 Ms2ger: haha, well i'm just taking a look, but at the moment i know nothing about the servo slaves - but i'm taking a look now
        1:23:27 mgerva, simone: do you have experience with the servo build slaves?
Ms2ger  1:23:46 I'm afraid bhearsum is pretty much our single point of failure in releng at this point :/
simone  1:24:20 pmoore, Ms2ger: I also don't have previous experience with servo build slaves :-(
pmoore  1:24:21 Ms2ger: let me see if i can find some wiki pages
mgerva  1:24:38 checks
Ms2ger  1:25:21 Thanks folks :)
        1:25:46 It's not a huge rush, our traffic is pretty limited :)
mgerva  1:27:01 mmh looks like slaves are not connected
Ms2ger  1:27:11 runs off to class
        1:27:25 Jesse [jruderman@moz-9754CB0.hsd1.ca.comcast.net] entered the room.
        1:27:26 Ms2ger left the room (quit: Quit: bbl).
        1:29:21 Jesse left the room (quit: Ping timeout).
        1:30:37 jrmuizel [jrmuizel@C492F63A.8F86291A.971E19F6.IP] entered the room.
        1:30:58 whimboo is now known as whimboo|afk
pmoore  1:35:48 mgerva: my ssh key is not authorized on servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com (for either the build account or my own personal account) - are you having any better luck?
mgerva  1:35:50 buildbot is not running on servo-linux64-ec2-001
pmoore  1:36:14 mgerva: ah cool, nice work :)
mgerva  1:36:24 pmoore: you need aws-releng key to ssh on it
pmoore  1:36:51 mgerva: that would explain it! :)
        1:37:37 mgerva: are you able to see why the slave stopped?
mgerva  1:37:59 it has just rebooted
        1:38:40 https://pastebin.mozilla.org/4503495
        1:38:52 it was shut down from the master
        1:39:51 jrmuizel left the room (quit: Client exited).
pmoore  1:46:23 mgerva: cool, i see it is now available to the master again: http://servo-buildbot.pub.build.mozilla.org/buildslaves
mgerva  1:47:07 it gets shutdown by the master as soon it gets online
        1:47:24 cyborgshadow left the room (quit: Ping timeout).
pmoore  1:47:46 mgerva: but i see it is "idle" now, not "not connected" - so i think the first slave is ok now, isn't it?
        1:47:50 cyborgshadow [quassel@CD1B8F1C.61598B32.2C421B25.IP] entered the room.
pmoore  1:49:21 i think this should be enough to tie us over until bhearsum|afk arrives - we have at least a linux build slave available now, right?
mgerva  1:50:36 there'a build in progress
        1:50:44 there's
        1:51:12 http://servo-buildbot.pub.build.mozilla.org/builders
pmoore  1:52:39 i'm confused, this says there are none: http://servo-buildbot.pub.build.mozilla.org/builders/linux
mgerva  1:55:35 one of them is cached :)
        1:56:07 mgerva is now known as mgerva|lunch
pmoore  1:56:18 mgerva|lunch: and now the slave is offline again!
        1:56:48 i'll raise a bug
They're receiving a SIGTERM right after connecting.

2014-03-07 04:37:38-0800 [-] Connecting to buildbot-master-servo-01.srv.servo.releng.use1.mozilla.com:9001
2014-03-07 04:37:38-0800 [-] Watching /builds/slave/shutdown.stamp's mtime to initiate shutdown
2014-03-07 04:37:38-0800 [Broker,client] message from master: attached
2014-03-07 04:37:38-0800 [Broker,client] I have a leftover directory 'test' that is not being used by the buildmaster: you can delete it now
2014-03-07 04:37:38-0800 [Broker,client] SlaveBuilder.remote_print(linux): message from master: attached
2014-03-07 04:37:38-0800 [Broker,client] Connected to buildbot-master-servo-01.srv.servo.releng.use1.mozilla.com:9001; slave is ready
2014-03-07 04:37:39-0800 [-] Received SIGTERM, shutting down.
2014-03-07 04:37:39-0800 [Broker,client] lost remote
2014-03-07 04:37:39-0800 [Broker,client] Lost connection to buildbot-master-servo-01.srv.servo.releng.use1.mozilla.com:9001
2014-03-07 04:37:39-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x11b4cb0>
2014-03-07 04:37:39-0800 [-] Main loop terminated.
2014-03-07 04:37:39-0800 [-] Server Shut Down.
The master saw the slave attach and tried to start a build, but the slave was gone by then.

2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] slave 'servo-linux64-ec2-001' attaching from IPv4Address(TCP, '10.134.80.21', 41935)
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] Starting buildslave keepalive timer for 'servo-linux64-ec2-001'
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] Got slaveinfo from 'servo-linux64-ec2-001'
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] bot attached
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] Buildslave servo-linux64-ec2-001 attached to linux
2014-03-07 04:37:39-0800 [-] starting build <Build linux> using slave <SlaveBuilder builder='linux' slave='servo-linux64-ec2-001'>
2014-03-07 04:37:39-0800 [-] acquireLocks(slave <BuildSlave 'servo-linux64-ec2-001'>, locks )
2014-03-07 04:37:39-0800 [-] starting build <Build linux>.. pinging the slave <SlaveBuilder builder='linux' slave='servo-linux64-ec2-001'>
2014-03-07 04:37:39-0800 [-] sending ping
...
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] ping finished: failure
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] slave ping failed; re-queueing the request
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] releaseLocks(<BuildSlave 'servo-linux64-ec2-001'>):
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] BuildSlave.detached(servo-linux64-ec2-001)
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] releaseLocks(<BuildSlave 'servo-linux64-ec2-001'>):
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] Buildslave servo-linux64-ec2-001 detached from linux
The slave rebooted on me.

Broadcast message from firstname.lastname@example.org (/dev/console) at 5:46 ...

The system is going down for reboot NOW!
Connection to servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com closed by remote host.
Connection to servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com closed.
Puppet runs are failing with:

Mar 7 05:48:36 servo-linux64-ec2-001 puppet-agent: Could not request certificate: Error 400 on SERVER: this master is not a CA

There was some sort of Puppet upgrade in bug 946872 yesterday, could be related.
Servo's puppet master doesn't think it's a CA either, not sure if this is expected or not:

Mar 7 05:50:37 servo-puppet1 puppet-master: this master is not a CA
I'm pretty sure the Puppet upgrade borked something... the final successful run upgraded Puppet:

Mar 6 14:01:21 servo-linux64-ec2-001 puppet-agent: (/Stage[main]/Packages::Puppet/Package[puppet]/ensure) ensure changed '3.2.2-1.el6' to '3.4.2-1.el6'
OK, so /etc/hosts was different than what it should've been, and I think the facter upgrade from bug 946872 tickled a problem. I've fixed that on -001, and will fix it on the rest too. The messages from the buildbot slave's twistd.log are a red herring - that's a stale connection from the last time it successfully connected. The master didn't realize it was gone until it tried to start a build. (That's pretty normal, nothing to worry about.)
Oh, and the correct /etc/hosts looks like:

127.0.0.1 servo-linux64-ec2-002.build.servo.releng.use1.mozilla.com localhost
::1 localhost6.localdomain6 localhost6
10.134.82.20 repos
10.134.82.20 puppet

The problematic ones are like:

127.0.0.1 localhost.localdomain localhost servo-linux64-ec2-002.build.servo.releng.use1.mozilla.com
::1 localhost6.localdomain6 localhost6
10.134.82.20 repos
10.134.82.20 puppet

09:05 < bhearsum> i wonder if it's the servo-linux64-ec2-001.localdomain part that's buggering them
09:05 < bhearsum> looks like there's a signed cert for the fqdn version
09:06 < bhearsum> /etc/sysconfig/network has the right hostname...
09:06 < bhearsum> but facter doesn't
09:07 < rail> /etc/hosts must have clues
09:07 < bhearsum> indeed
09:07 < bhearsum> 127.0.0.1 bld-linux64-spot-466.build.releng.usw2.mozilla.com localhost
09:07 < bhearsum> vs.
09:07 < bhearsum> 127.0.0.1 localhost.localdomain localhost servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com
09:07 < bhearsum> /etc/hosts hasn't been touched since this machine was created
09:07 < rail> the fist entry should be fqdn
09:08 < rail> facter uses it
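To illustrate the mechanism rail and bhearsum are pointing at: with glibc-style resolution, the first name on the matching /etc/hosts line is treated as the host's canonical name, and facter derives the fqdn from that. The sketch below is a simplified model of that lookup, not facter's actual code (facter asks the system resolver); the function and constants are hypothetical and only demonstrate why name ordering on the loopback line matters.

```python
# Simplified model of an /etc/hosts lookup: first matching line wins,
# and the first name field on that line is the canonical name.
# NOT facter's real implementation - an illustration of the ordering bug.

def canonical_name(hosts_content, hostname):
    """Return the canonical name for `hostname` per a simplified
    /etc/hosts lookup."""
    for line in hosts_content.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments/blanks
        if not line:
            continue
        names = line.split()[1:]  # fields after the IP address
        # a line matches if any name (or its short form) equals hostname
        if any(n == hostname or n.split('.')[0] == hostname for n in names):
            return names[0]  # canonical name = first name on the line
    return hostname

# The working layout: fqdn first on the loopback line.
GOOD = "127.0.0.1 servo-linux64-ec2-002.build.servo.releng.use1.mozilla.com localhost\n"
# The broken layout from this bug: localhost.localdomain first, so the
# host "thinks" its fqdn is localhost.localdomain and the signed puppet
# certificate for the real fqdn no longer matches.
BAD = "127.0.0.1 localhost.localdomain localhost servo-linux64-ec2-002.build.servo.releng.use1.mozilla.com\n"
```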
All fixed, very sorry for the lengthy outage.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Summary: Servo linux slaves keep going offline → servo linux slaves couldn't successfully run puppet
Hey Ben,

do you know if /etc/hosts are handled by puppet on the servo slaves now? Or should we create a bug to put them under puppet control?

Pete
(In reply to Pete Moore [:pete][:pmoore] from comment #10)
> Hey Ben,
> do you know if /etc/hosts are handled by puppet on the servo slaves now? Or
> should we create a bug to put them under puppet control?
> Pete

I don't think Puppet handles /etc/hosts for any machines. Check with Rail or Dustin if you want to be sure, but you're probably right - we need a bug for this.
Hey Dustin,

Is there a reason we shouldn't manage /etc/hosts with our puppet config on the servo build slaves (e.g. in case we interfere with IT's puppet)? If not, I'll create a bug so we can do it.

Thanks,
Pete
If we manage /etc/hosts, it should only be to prevent use of that file for anything but localhost. Note that network::aws adds 'puppet' and 'repos', both pointing to the current puppetmaster IP, but that's probably a bug.
Are we in agreement that /etc/hosts should be managed by puppet, and its content should be the following two lines:

127.0.0.1 <fully qualified host name> localhost
::1 localhost6.localdomain6 localhost6

And that these lines should then be gone:

10.134.82.20 repos
10.134.82.20 puppet

I ask because:

1) Dustin said "If we manage /etc/hosts" but I think it is necessary, since not managing it in puppet requires manual configuration, which I think we want to avoid. Do we agree?
2) Removing the repos and puppet hostnames from /etc/hosts could have a negative impact - but I cannot see why the servo build slaves would need a local name for the puppet server - so maybe they can go (of course we need to test, too).
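If we do put /etc/hosts under puppet control, a minimal sketch of the resource could look like the following. This is an assumption, not the real releng puppet tree: the class name is hypothetical, and it relies on the standard `$::fqdn` fact being correct (which, per this bug, it only is once /etc/hosts is sane or the hostname is set elsewhere).

```puppet
# Hypothetical sketch only - class name and layout are assumptions,
# not taken from the actual releng puppet modules.
class hosts_file {
    file {
        '/etc/hosts':
            owner   => 'root',
            group   => 'root',
            mode    => '0644',
            # fqdn first on the loopback line, per comment 8
            content => "127.0.0.1 ${::fqdn} localhost\n::1 localhost6.localdomain6 localhost6\n";
    }
}
```

Note the chicken-and-egg caveat: if facter is already reporting a bogus fqdn because of a broken /etc/hosts, `$::fqdn` would propagate the bad value, so a template pinned to the node name (or a one-time manual fix, as done in comment 7) may be needed first.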
We only need to manage it manually when it gets set to something other than the OS default, I think. But that's still a good reason to manage it with puppet. I would prefer

----
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
----

at least where that's functional. If Ubuntu's resolver *requires* the fqdn in /etc/hosts, then managing that with puppet is a good idea (but renaming a host will still not work).

See bug 938629 for some notes about puppet/repos.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #15)
> I would prefer
>
> ----
> 127.0.0.1 localhost.localdomain localhost
> ::1 localhost6.localdomain6 localhost6
> ----
>
> at least where that's functional.

I think in the current state, this config would not be functional for us (see Ben's comment 8 above). So I think our choices are:

1) go with the adapted /etc/hosts to include the fqdn
2) fix the underlying problem described in this bug, in a different way (i.e. not modifying /etc/hosts)
For reference, a random CentOS box has

----
[email@example.com ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
----

It's certainly possible that Ubuntu builds its resolver with different config, but at least in principle this can work!
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard