Closed Bug 980889 (opened 8 years ago, closed 8 years ago)

servo linux slaves couldn't successfully run puppet

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

Details

From #releng on IRC (14 Mar 2014, times in CET pm):

Ms2ger
1:18:10 Looks like all Servo's Linux slaves are dead? bhearsum|afk?
1:18:11 http://servo-buildbot.pub.build.mozilla.org/buildslaves
pmoore
1:20:09 Ms2ger: bhearsum|afk should be around in about an hour or so
Ms2ger
1:21:09 You don't happen to be able to fix it? :)
pmoore
1:22:12 Ms2ger: haha, well i'm just taking a look, but at the moment i know nothing about the servo slaves - but i'm taking a look now
1:23:27 mgerva, simone: do you have experience with the servo build slaves?
Ms2ger
1:23:46 I'm afraid bhearsum is pretty much our single point of failure in releng at this point :/
simone
1:24:20 pmoore, Ms2ger: I also don't have previous experience with servo build slaves :-(
pmoore
1:24:21 Ms2ger: let me see if i can find some wiki pages
mgerva
1:24:38 checks
Ms2ger
1:25:21 Thanks folks :)
1:25:46 It's not a huge rush, our traffic is pretty limited :)
mgerva
1:27:01 mmh looks like slaves are not connected
Ms2ger
1:27:11 runs off to class
1:27:25 Jesse [jruderman@moz-9754CB0.hsd1.ca.comcast.net] entered the room.
1:27:26 Ms2ger left the room (quit: Quit: bbl).
1:29:21 Jesse left the room (quit: Ping timeout).
1:30:37 jrmuizel [jrmuizel@C492F63A.8F86291A.971E19F6.IP] entered the room.
1:30:58 whimboo is now known as whimboo|afk
pmoore
1:35:48 mgerva: my ssh key is not authorized on servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com (for either the build account or my own personal account) - are you having any better luck?
mgerva
1:35:50 buildbot is not running on servo-linux64-ec2-001
pmoore
1:36:14 mgerva: ah cool, nice work :)
mgerva
1:36:24 pmoore: you need aws-releng key to ssh on it
pmoore
1:36:51 mgerva: that would explain it! :)
1:37:37 mgerva: are you able to see why the slave stopped?
mgerva
1:37:59 it has just rebooted
1:38:40 https://pastebin.mozilla.org/4503495
1:38:52 it was shut down from the master
1:39:51 jrmuizel left the room (quit: Client exited).
pmoore
1:46:23 mgerva: cool, i see it is now available to the master again: http://servo-buildbot.pub.build.mozilla.org/buildslaves
mgerva
1:47:07 it gets shut down by the master as soon as it gets online
1:47:24 cyborgshadow left the room (quit: Ping timeout).
pmoore
1:47:46 mgerva: but i see it is "idle" now, not "not connected" - so i think the first slave is ok now, isn't it?
1:47:50 cyborgshadow [quassel@CD1B8F1C.61598B32.2C421B25.IP] entered the room.
pmoore
1:49:21 i think this should be enough to tie us over until bhearsum|afk arrives - we have at least a linux build slave available now, right?
mgerva
1:50:36 there'a build in progress
1:50:44 there's
1:51:12 http://servo-buildbot.pub.build.mozilla.org/builders
pmoore
1:52:39 i'm confused, this says there are none: http://servo-buildbot.pub.build.mozilla.org/builders/linux
mgerva
1:55:35 one of them is cached :)
1:56:07 mgerva is now known as mgerva|lunch
pmoore
1:56:18 mgerva|lunch: and now the slave is offline again!
1:56:48 i'll raise a bug
They're receiving a SIGTERM right after connecting.

2014-03-07 04:37:38-0800 [-] Connecting to buildbot-master-servo-01.srv.servo.releng.use1.mozilla.com:9001
2014-03-07 04:37:38-0800 [-] Watching /builds/slave/shutdown.stamp's mtime to initiate shutdown
2014-03-07 04:37:38-0800 [Broker,client] message from master: attached
2014-03-07 04:37:38-0800 [Broker,client] I have a leftover directory 'test' that is not being used by the buildmaster: you can delete it now
2014-03-07 04:37:38-0800 [Broker,client] SlaveBuilder.remote_print(linux): message from master: attached
2014-03-07 04:37:38-0800 [Broker,client] Connected to buildbot-master-servo-01.srv.servo.releng.use1.mozilla.com:9001; slave is ready
2014-03-07 04:37:39-0800 [-] Received SIGTERM, shutting down.
2014-03-07 04:37:39-0800 [Broker,client] lost remote
2014-03-07 04:37:39-0800 [Broker,client] Lost connection to buildbot-master-servo-01.srv.servo.releng.use1.mozilla.com:9001
2014-03-07 04:37:39-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x11b4cb0>
2014-03-07 04:37:39-0800 [-] Main loop terminated.
2014-03-07 04:37:39-0800 [-] Server Shut Down.
The master saw the slave attach and tried to start a build, but the slave was gone by then.

2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] slave 'servo-linux64-ec2-001' attaching from IPv4Address(TCP, '10.134.80.21', 41935)
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] Starting buildslave keepalive timer for 'servo-linux64-ec2-001'
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] Got slaveinfo from 'servo-linux64-ec2-001'
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] bot attached
2014-03-07 04:37:38-0800 [Broker,1554,10.134.80.21] Buildslave servo-linux64-ec2-001 attached to linux 
2014-03-07 04:37:39-0800 [-] starting build <Build linux> using slave <SlaveBuilder builder='linux' slave='servo-linux64-ec2-001'>
2014-03-07 04:37:39-0800 [-] acquireLocks(slave <BuildSlave 'servo-linux64-ec2-001'>, locks [])
2014-03-07 04:37:39-0800 [-] starting build <Build linux>.. pinging the slave <SlaveBuilder builder='linux' slave='servo-linux64-ec2-001'>
2014-03-07 04:37:39-0800 [-] sending ping
...
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] ping finished: failure
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] slave ping failed; re-queueing the request
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] releaseLocks(<BuildSlave 'servo-linux64-ec2-001'>): []
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] BuildSlave.detached(servo-linux64-ec2-001)
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] releaseLocks(<BuildSlave 'servo-linux64-ec2-001'>): []
2014-03-07 04:53:10-0800 [Broker,1554,10.134.80.21] Buildslave servo-linux64-ec2-001 detached from linux
The slave rebooted on me.

Broadcast message from root@servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com
	(/dev/console) at 5:46 ...

The system is going down for reboot NOW!
Connection to servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com closed by remote host.
Connection to servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com closed.
Puppet runs are failing with:
Mar  7 05:48:36 servo-linux64-ec2-001 puppet-agent[1151]: Could not request certificate: Error 400 on SERVER: this master is not a CA

There was some sort of Puppet upgrade in bug 946872 yesterday, which could be related.
Servo's puppet master doesn't think it's a CA either; I'm not sure whether this is expected:
Mar  7 05:50:37 servo-puppet1 puppet-master[5083]: this master is not a CA
I'm pretty sure the Puppet upgrade borked something...the final successful run upgraded Puppet:
Mar  6 14:01:21 servo-linux64-ec2-001 puppet-agent[1110]: (/Stage[main]/Packages::Puppet/Package[puppet]/ensure) ensure changed '3.2.2-1.el6' to '3.4.2-1.el6'
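
For anyone retracing this later, a quick way to confirm what that upgrade left installed and to reproduce the failing run by hand (a rough sketch only; it assumes root shell access on the slave):

----
# Check which puppet/facter packages the upgrade left behind
rpm -q puppet facter

# Trigger an agent run immediately instead of waiting for the schedule;
# --noop avoids applying changes, but the certificate request (and the
# "this master is not a CA" error) still shows up in the output
puppet agent --test --noop
----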
OK, so /etc/hosts was different than what it should've been, and I think the facter upgrade from bug 946872 tickled a problem. I've fixed that on -001, and will fix it on the rest too.

The messages from the buildbot slave's twistd.log are a red herring - that's a stale connection from the last time it successfully connected. The master didn't realize it was gone until it tried to start a build. (That's pretty normal, nothing to worry about.)
Oh, and the correct /etc/hosts looks like:
127.0.0.1 servo-linux64-ec2-002.build.servo.releng.use1.mozilla.com localhost
::1 localhost6.localdomain6 localhost6
10.134.82.20 repos
10.134.82.20 puppet

The problematic ones are like:
127.0.0.1 localhost.localdomain localhost servo-linux64-ec2-002.build.servo.releng.use1.mozilla.com
::1 localhost6.localdomain6 localhost6
10.134.82.20 repos
10.134.82.20 puppet
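The ordering matters because the resolver takes the first name on the 127.0.0.1 line as the canonical hostname, and facter derives its domain/fqdn facts from that, so on the problematic hosts the agent requests a certificate under a name the master never signed (the .localdomain name bhearsum spots below). A rough sketch of how to see the difference on a slave:

----
# With the problematic layout, the canonical name for 127.0.0.1 is
# localhost.localdomain, so the domain gets picked up as "localdomain"
hostname -f     # expected: localhost.localdomain
facter fqdn     # expected: servo-linux64-ec2-002.localdomain

# With the FQDN first on the line (the corrected layout above), both
# should report the full build.servo.releng.use1.mozilla.com name,
# which matches the certificate the puppet master has signed
hostname -f
facter fqdn
----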


09:05 < bhearsum> i wonder if it's the servo-linux64-ec2-001.localdomain part that's buggering them
09:05 < bhearsum> looks like there's a signed cert for the fqdn version
09:06 < bhearsum> /etc/sysconfig/network has the right hostname...
09:06 < bhearsum> but facter doesn't
09:07 < rail> /etc/hosts must have clues
09:07 < bhearsum> indeed
09:07 < bhearsum> 127.0.0.1 bld-linux64-spot-466.build.releng.usw2.mozilla.com localhost
09:07 < bhearsum> vs.
09:07 < bhearsum> 127.0.0.1 localhost.localdomain localhost servo-linux64-ec2-001.build.servo.releng.use1.mozilla.com
09:07 < bhearsum> /etc/hosts hasn't been touched since this machine was created
09:07 < rail> the first entry should be fqdn
09:08 < rail> facter uses it
All fixed, very sorry for the lengthy outage.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Summary: Servo linux slaves keep going offline → servo linux slaves couldn't successfully run puppet
Hey Ben,
do you know if /etc/hosts are handled by puppet on the servo slaves now? Or should we create a bug to put them under puppet control?
Pete
Flags: needinfo?(bhearsum)
(In reply to Pete Moore [:pete][:pmoore] from comment #10)
> Hey Ben,
> do you know if /etc/hosts are handled by puppet on the servo slaves now? Or
> should we create a bug to put them under puppet control?
> Pete

I don't think Puppet handles /etc/hosts for any machines. Check with Rail or Dustin if you want to be sure, but you're probably right - we need a bug for this.
Flags: needinfo?(bhearsum)
Hey Dustin,

Is there a reason we shouldn't manage /etc/hosts with our puppet config on the servo build slaves (e.g. in case we interfere with IT's puppet)? If not, I'll create a bug so we can do it.

Thanks,
Pete
Flags: needinfo?(dustin)
If we manage /etc/hosts, it should only be to prevent use of that file for anything but localhost.  Note that network::aws adds 'puppet' and 'repos', both pointing to the current puppetmaster IP, but that's probably a bug.
Flags: needinfo?(dustin)
Are we in agreement that /etc/hosts should be managed by puppet, and its content should be the following two lines:

127.0.0.1 <fully qualified host name> localhost
::1 localhost6.localdomain6 localhost6

And that these lines should then be gone:

10.134.82.20 repos
10.134.82.20 puppet

I ask because:

1) Dustin said "If we manage /etc/hosts" but I think it is necessary, since not managing it in puppet requires manual configuration, which I think we want to avoid. Do we agree?

2) Removing repos and puppet hostnames from /etc/hosts could have a negative impact - but I cannot see why the servo build slaves would need a local name for the puppet server - so maybe they can go (of course we need to test, too).
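
If we do end up managing it with puppet, here is a minimal sketch of what the resource could look like. This is hypothetical: in the real config it would be a file resource plus template in the servo puppet tree rather than an ad-hoc apply, and interpolating $::fqdn only gives the right answer once the host already resolves its own name correctly.

----
# Hypothetical sketch only - not the actual servo puppet config.
# Writes the two lines proposed above and drops the repos/puppet entries.
puppet apply -e 'file { "/etc/hosts":
  ensure  => file,
  owner   => "root",
  group   => "root",
  mode    => "0644",
  content => "127.0.0.1 ${::fqdn} localhost\n::1 localhost6.localdomain6 localhost6\n",
}'
----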
We only need to manage it manually when it gets set to something other than the OS default, I think.  But that's still a good reason to manage it with puppet.  I would prefer

----
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
----

at least where that's functional.  If Ubuntu's resolver *requires* the fqdn in /etc/hosts, then managing that with puppet is a good idea (but renaming a host will still not work).

See bug 938629 for some notes about puppet/repos.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #15)
> I would prefer
> 
> ----
> 127.0.0.1 localhost.localdomain localhost
> ::1 localhost6.localdomain6 localhost6
> ----
> 
> at least where that's functional.

I think in the current state, this config would not be functional for us (see Ben's comment 8 above).

So I think our choices are:
1) go with the adapted /etc/hosts to include the fqdn
2) fix the underlying problem described in this bug, in a different way (i.e. not modifying /etc/hosts)
Ah - my bad - I see the trailing fqdn now in comment 8 - so maybe your version also works, Dustin, with no fqdn at all (as opposed to the fqdn being the last entry on the line, as in comment 8)...
For reference, a random CentOS box has

----
[root@mobile-imaging-001.p1.releng.scl1.mozilla.com ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
----

It's certainly possible that Ubuntu builds its resolver with different config, but at least in principle this can work!
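
One way to sanity-check that on our slaves before dropping the fqdn from /etc/hosts (a sketch; it assumes DNS/resolv.conf can supply the domain once the hosts file no longer does):

----
# If these still report the full build.servo.releng.use1.mozilla.com name
# with the fqdn removed from /etc/hosts, the resolver is getting the
# domain from DNS/resolv.conf and the minimal file above should be safe
hostname -d
hostname -f
facter fqdn
----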
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard