Closed Bug 980889 Opened 8 years ago Closed 8 years ago

servo linux slaves couldn't successfully run puppet


(Infrastructure & Operations Graveyard :: CIDuty, task)






(Reporter: pmoore, Unassigned)


From #releng on IRC (14 Mar 2014, times in CET pm):

1:18:10 Looks like all Servo's Linux slaves are dead? bhearsum|afk?
1:20:09 Ms2ger: bhearsum|afk should be around in about an hour or so
1:21:09 You don't happen to be able to fix it? :)
1:22:12 Ms2ger: haha, well i'm just taking a look, but at the moment i know nothing about the servo slaves - but i'm taking a look now
1:23:27 mgerva, simone: do you have experience with the servo build slaves?
1:23:46 I'm afraid bhearsum is pretty much our single point of failure in releng at this point :/
1:24:20 pmoore, Ms2ger: I also don't have previous experience with servo build slaves :-(
1:24:21 Ms2ger: let me see if i can find some wiki pages
1:24:38 checks
1:25:21 Thanks folks :)
1:25:46 It's not a huge rush, our traffic is pretty limited :)
1:27:01 mmh looks like slaves are not connected
1:27:11 runs off to class
1:35:48 mgerva: my ssh key is not authorized on (for either the build account or my own personal account) - are you having any better luck?
1:35:50 buildbot is not running on servo-linux64-ec2-001
1:36:14 mgerva: ah cool, nice work :)
1:36:24 pmoore: you need aws-releng key to ssh on it
1:36:51 mgerva: that would explain it! :)
1:37:37 mgerva: are you able to see why the slave stopped?
1:37:59 it has just rebooted
1:38:52 it was shut down from the master
1:46:23 mgerva: cool, i see it is now available to the master again:
1:47:07 it gets shut down by the master as soon as it gets online
1:47:46 mgerva: but i see it is "idle" now, not "not connected" - so i think the first slave is ok now, isn't it?
1:49:21 i think this should be enough to tide us over until bhearsum|afk arrives - we have at least a linux build slave available now, right?
1:50:36 there's a build in progress
1:52:39 i'm confused, this says there are none:
1:55:35 one of them is cached :)
1:56:07 mgerva is now known as mgerva|lunch
1:56:18 mgerva|lunch: and now the slave is offline again!
1:56:48 i'll raise a bug
They're receiving a SIGTERM right after connecting.

2014-03-07 04:37:38-0800 [-] Connecting to
2014-03-07 04:37:38-0800 [-] Watching /builds/slave/shutdown.stamp's mtime to initiate shutdown
2014-03-07 04:37:38-0800 [Broker,client] message from master: attached
2014-03-07 04:37:38-0800 [Broker,client] I have a leftover directory 'test' that is not being used by the buildmaster: you can delete it now
2014-03-07 04:37:38-0800 [Broker,client] SlaveBuilder.remote_print(linux): message from master: attached
2014-03-07 04:37:38-0800 [Broker,client] Connected to; slave is ready
2014-03-07 04:37:39-0800 [-] Received SIGTERM, shutting down.
2014-03-07 04:37:39-0800 [Broker,client] lost remote
2014-03-07 04:37:39-0800 [Broker,client] Lost connection to
2014-03-07 04:37:39-0800 [Broker,client] Stopping factory < instance at 0x11b4cb0>
2014-03-07 04:37:39-0800 [-] Main loop terminated.
2014-03-07 04:37:39-0800 [-] Server Shut Down.
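For context, the "Watching /builds/slave/shutdown.stamp's mtime" line in this log is buildbot's graceful-shutdown mechanism: the slave polls a stamp file and begins shutting down when the file's mtime changes. A minimal Python sketch of that idea (this is illustrative, not buildbot's actual implementation; the poll interval and timeout are made up for the demo):

```python
import os
import time

def watch_stamp(path, poll=0.05, timeout=2.0):
    """Return True as soon as `path`'s mtime changes (or the file
    first appears), i.e. a graceful shutdown was requested.
    Returns False if nothing changes before `timeout` (demo only)."""
    try:
        last = os.stat(path).st_mtime
    except FileNotFoundError:
        last = None
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            current = os.stat(path).st_mtime
        except FileNotFoundError:
            current = last  # still absent: nothing to react to
        if current != last:
            return True  # stamp touched: initiate shutdown
        time.sleep(poll)
    return False
```

So a slave that logs "Received SIGTERM" immediately after attaching, as above, was killed externally; the stamp-file path would have logged a "shutdown requested" style message instead.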
The master saw the slave attach and then tried to start a build, but the slave was gone by then.

2014-03-07 04:37:38-0800 [Broker,1554,] slave 'servo-linux64-ec2-001' attaching from IPv4Address(TCP, '', 41935)
2014-03-07 04:37:38-0800 [Broker,1554,] Starting buildslave keepalive timer for 'servo-linux64-ec2-001'
2014-03-07 04:37:38-0800 [Broker,1554,] Got slaveinfo from 'servo-linux64-ec2-001'
2014-03-07 04:37:38-0800 [Broker,1554,] bot attached
2014-03-07 04:37:38-0800 [Broker,1554,] Buildslave servo-linux64-ec2-001 attached to linux 
2014-03-07 04:37:39-0800 [-] starting build <Build linux> using slave <SlaveBuilder builder='linux' slave='servo-linux64-ec2-001'>
2014-03-07 04:37:39-0800 [-] acquireLocks(slave <BuildSlave 'servo-linux64-ec2-001'>, locks [])
2014-03-07 04:37:39-0800 [-] starting build <Build linux>.. pinging the slave <SlaveBuilder builder='linux' slave='servo-linux64-ec2-001'>
2014-03-07 04:37:39-0800 [-] sending ping
2014-03-07 04:53:10-0800 [Broker,1554,] ping finished: failure
2014-03-07 04:53:10-0800 [Broker,1554,] slave ping failed; re-queueing the request
2014-03-07 04:53:10-0800 [Broker,1554,] releaseLocks(<BuildSlave 'servo-linux64-ec2-001'>): []
2014-03-07 04:53:10-0800 [Broker,1554,] BuildSlave.detached(servo-linux64-ec2-001)
2014-03-07 04:53:10-0800 [Broker,1554,] releaseLocks(<BuildSlave 'servo-linux64-ec2-001'>): []
2014-03-07 04:53:10-0800 [Broker,1554,] Buildslave servo-linux64-ec2-001 detached from linux
The slave rebooted on me.

Broadcast message from
	(/dev/console) at 5:46 ...

The system is going down for reboot NOW!
Connection to closed by remote host.
Connection to closed.
Puppet runs are failing with:
Mar  7 05:48:36 servo-linux64-ec2-001 puppet-agent[1151]: Could not request certificate: Error 400 on SERVER: this master is not a CA

There was some sort of Puppet upgrade in bug 946872 yesterday, could be related.
Servo's puppet master doesn't think it's a CA either, not sure if this is expected or not:
Mar  7 05:50:37 servo-puppet1 puppet-master[5083]: this master is not a CA
I'm pretty sure the Puppet upgrade borked something... the final successful run was the one that upgraded Puppet:
Mar  6 14:01:21 servo-linux64-ec2-001 puppet-agent[1110]: (/Stage[main]/Packages::Puppet/Package[puppet]/ensure) ensure changed '3.2.2-1.el6' to '3.4.2-1.el6'
OK, so /etc/hosts was different than what it should've been, and I think the facter upgrade from bug 946872 tickled a problem. I've fixed that on -001, and will fix it on the rest too.

The messages from the buildbot slave's twistd.log are a red herring - that's a stale connection from the last time it successfully connected. The master didn't realize it was gone until it tried to start a build. (That's pretty normal, nothing to worry about.)
Oh, and the correct /etc/hosts looks like:

localhost
::1 localhost6.localdomain6 localhost6
repos puppet

The problematic ones are like:

localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
repos puppet

09:05 < bhearsum> i wonder if it's the servo-linux64-ec2-001.localdomain part that's buggering them
09:05 < bhearsum> looks like there's a signed cert for the fqdn version
09:06 < bhearsum> /etc/sysconfig/network has the right hostname...
09:06 < bhearsum> but facter doesn't
09:07 < rail> /etc/hosts must have clues
09:07 < bhearsum> indeed
09:07 < bhearsum> localhost
09:07 < bhearsum> vs.
09:07 < bhearsum> localhost.localdomain localhost
09:07 < bhearsum> /etc/hosts hasn't been touched since this machine was created
09:07 < rail> the first entry should be fqdn
09:08 < rail> facter uses it
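Rail's point can be sketched with a toy parser: the resolver treats the first name on the matching /etc/hosts line as the canonical name, and facter derives its fqdn fact from that, so if the fqdn isn't first (or is missing), facter reports the wrong hostname and the puppet cert lookup fails. The function and host/domain names below are illustrative only, not our actual config:

```python
def canonical_name(hosts_text, name):
    """Return the first hostname on the first /etc/hosts line that
    lists `name` -- a rough model of how the resolver (and hence
    facter's fqdn) picks the canonical name for a host."""
    for raw in hosts_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            return fields[1]  # first name after the address wins
    return None

# Hypothetical entries mirroring the broken vs. fixed files:
broken = "127.0.0.1 host1.localdomain localhost host1.example.com\n"
fixed = "127.0.0.1 host1.example.com host1 localhost\n"
```

With the broken ordering the canonical name comes back as `host1.localdomain`, which matches no signed cert; with the fqdn first, the lookup returns `host1.example.com` as expected.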
All fixed, very sorry for the lengthy outage.
Closed: 8 years ago
Resolution: --- → FIXED
Summary: Servo linux slaves keep going offline → servo linux slaves couldn't successfully run puppet
Hey Ben,
do you know if /etc/hosts are handled by puppet on the servo slaves now? Or should we create a bug to put them under puppet control?
Flags: needinfo?(bhearsum)
(In reply to Pete Moore [:pete][:pmoore] from comment #10)
> Hey Ben,
> do you know if /etc/hosts are handled by puppet on the servo slaves now? Or
> should we create a bug to put them under puppet control?
> Pete

I don't think Puppet handles /etc/hosts for any machines. Check with Rail or Dustin if you want to be sure, but you're probably right - we need a bug for this.
Flags: needinfo?(bhearsum)
Hey Dustin,

Is there a reason we shouldn't manage /etc/hosts with our puppet config on the servo build slaves (e.g. in case we interfere with IT's puppet)? If not, I'll create a bug so we can do it.

Flags: needinfo?(dustin)
If we manage /etc/hosts, it should only be to prevent use of that file for anything but localhost.  Note that network::aws adds 'puppet' and 'repos', both pointing to the current puppetmaster IP, but that's probably a bug.
Flags: needinfo?(dustin)
Are we in agreement that /etc/hosts should be managed by puppet, and its content should be the following two lines:

<fully qualified host name> localhost
::1 localhost6.localdomain6 localhost6

And that these lines should then be gone:

repos puppet

I ask because:

1) Dustin said "If we manage /etc/hosts" but I think it is necessary, since not managing it in puppet requires manual configuration, which I think we want to avoid. Do we agree?

2) Removing repos and puppet hostnames from /etc/hosts could have a negative impact - but I cannot see why the servo build slaves would need a local name for the puppet server - so maybe they can go (of course we need to test, too).
We only need to manage it manually when it gets set to something other than the OS default, I think.  But that's still a good reason to manage it with puppet.  I would prefer

---- localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

at least where that's functional.  If Ubuntu's resolver *requires* the fqdn in /etc/hosts, then managing that with puppet is a good idea (but renaming a host will still not work).

See bug 938629 for some notes about puppet/repos.
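If we do put /etc/hosts under puppet control, the resource could look roughly like this. This is a hypothetical sketch only (no such resource exists in our repo yet); it assumes the stock $::fqdn fact and Dustin's preferred content, with the fqdn included for hosts whose resolver needs it:

```puppet
# Hypothetical sketch -- placement and content are illustrative.
file { '/etc/hosts':
  ensure  => file,
  owner   => 'root',
  group   => 'root',
  mode    => '0644',
  content => "127.0.0.1 ${::fqdn} localhost\n::1 localhost6.localdomain6 localhost6\n",
}
```

Note this deliberately omits the 'repos' and 'puppet' entries that network::aws adds today, per the comments above.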
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #15)
> I would prefer
> ----
> localhost.localdomain localhost
> ::1 localhost6.localdomain6 localhost6
> ----
> at least where that's functional.

I think in the current state, this config would not be functional for us (see Ben's comment 8 above).

So I think our choices are:
1) go with the adapted /etc/hosts to include the fqdn
2) fix the underlying problem described in this bug, in a different way (i.e. not modifying /etc/hosts)
Ah - my bad - I see the trailing fqdn now in comment 8. So maybe your version also works, Dustin, without an fqdn at all (as opposed to it being the last entry of the line, as in comment 8)...
For reference, a random CentOS box has

[ ~]# cat /etc/hosts
localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

It's certainly possible that Ubuntu builds its resolver with different config, but at least in principle this can work!
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard