Closed Bug 1139300 Opened 9 years ago Closed 9 years ago

some/all of sea-hp-linux64-* are not connecting to the master

Categories

(SeaMonkey :: Release Engineering, defect)

Hardware: x86_64
OS: Linux
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ewong, Assigned: ewong)

References

Details

Attachments

(3 files)

Something weird is going on with sea-hp-linux64-{2,6-13}.  They aren't
staying connected to the buildbot master.

2015-03-03 12:07:06-0800 [Broker,client]   watching logfiles {}
2015-03-03 12:07:06-0800 [Broker,client]   argv: ['python', 'tools/buildfarm/maintenance/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']
2015-03-03 12:07:06-0800 [Broker,client]  environment: {'LANG': 'en_US.UTF-8', 'CCACHE_HASHDIR': '', 'TERM': 'linux', 'SHELL': '/bin/bash', 'SHLVL': '1', 'HOSTNAME': 'sea-hp-linux64-6.community.scl3.mozilla.com', 'G_BROKEN_FILENAMES': '1', 'TMOUT': '86400', 'HISTSIZE': '1000', 'HISTCONTROL': 'ignoredups', 'PWD': '/builds/slave/test', 'LOGNAME': 'seabld', 'USER': 'seabld', 'MAIL': '/var/spool/mail/seabld', 'PATH': '/usr/local/bin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/seabld/bin', 'LESSOPEN': '|/usr/bin/lesspipe.sh %s', 'HOME': '/home/seabld', '_': '/tools/buildbot/bin/python'}
2015-03-03 12:07:06-0800 [Broker,client]   using PTY: False
2015-03-03 12:07:09-0800 [-] Received SIGTERM, shutting down.
2015-03-03 12:07:09-0800 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x188f128>
2015-03-03 12:07:09-0800 [-] command interrupted, attempting to kill
2015-03-03 12:07:09-0800 [-] trying to kill process group 10504
2015-03-03 12:07:09-0800 [-]  signal 9 sent successfully
2015-03-03 12:07:09-0800 [Broker,client] lost remote
[... the same "lost remote" line repeats ~60 more times ...]
2015-03-03 12:07:09-0800 [Broker,client] lost remote step
2015-03-03 12:07:09-0800 [Broker,client] Lost connection to cb-seamonkey-linuxmaster-01.mozilla.org:9010

Looking at sea-puppet, I only see the following error:
Mar  4 00:37:33 sea-puppet.community.scl3.mozilla.com puppet-master[36154]: Compiled catalog for sea-hp-linux64-2.community.scl3.mozilla.com in environment production in 5.76 seconds
Mar  4 00:40:06 sea-puppet.community.scl3.mozilla.com puppet-master[36154]: Variable access via 'master_name' is deprecated. Use '@master_name' instead. template[/etc/puppet/production/modules/buildmaster/templates/postrun-seamonkey.cfg.erb]:5
Mar  4 00:40:06 sea-puppet.community.scl3.mozilla.com puppet-master[36154]:    (at /usr/lib/ruby/site_ruby/1.8/puppet/parser/templatewrapper.rb:76:in `method_missing')
Mar  4 00:40:08 sea-puppet.community.scl3.mozilla.com puppet-master[36154]: Failed to parse template ssh/extra_authorized_key.erb:
Mar  4 00:40:08 sea-puppet.community.scl3.mozilla.com puppet-master[36154]:   Filepath: /etc/puppet/production/modules/ssh/templates/extra_authorized_key.erb
Mar  4 00:40:08 sea-puppet.community.scl3.mozilla.com puppet-master[36154]:   Line: 6
Mar  4 00:40:08 sea-puppet.community.scl3.mozilla.com puppet-master[36154]:   Detail: no key for release-runner
Mar  4 00:40:08 sea-puppet.community.scl3.mozilla.com puppet-master[36154]:  at /etc/puppet/production/modules/ssh/manifests/extra_authorized_key.pp:15 on node sea-master1.community.scl3.mozilla.com
[... the same "Failed to parse template ssh/extra_authorized_key.erb" block repeats twice more ...]

But these don't appear to be the problem (AFAIK).

Also, when I ssh into these machines, I get disconnected.

  STR:
    1) ssh -l seabld sea-hp-linux64-#
    2) tail -f /builds/slave/twistd.log
    3) Wait a moment; I get disconnected. (It shouldn't disconnect
after only ~10 seconds of idle time... should it?)

Looking at /var/log/messages on sea-puppet again, I see:

Mar  4 00:58:18 sea-puppet.community.scl3.mozilla.com puppet-master[36154]: Compiled catalog for sea-hp-linux64-13.community.scl3.mozilla.com in environment production in 5.00 seconds

Looking at the slaves' messages.log after disconnection, I only see

Mar  4 00:57:32 sea-hp-linux64-9 sshd[3939]: Accepted publickey for seabld from 63.245.223.8 port 55182 ssh2
Mar  4 00:57:32 sea-hp-linux64-9 sshd[3939]: User child is on pid 3942
Mar  4 01:00:24 sea-hp-linux64-9 sshd[4060]: Connection from 63.245.223.8 port 55187

So between 00:57:32 and 01:00:24, I got disconnected and reconnected.

I don't have a clue as to what's going on right now.  Still investigating.
Earlier I had added taskcluster
Summary: sea-hp-linux64-{2,6-13} are not connecting to the master → sea-hp-linux64-{2,3,10-13} are not connecting to the master
(In reply to Edmund Wong (:ewong) from comment #1)
> Earlier I had added taskcluster

pressed Enter by mistake.

I added taskcluster_clientId and taskcluster_accessToken to the
hiera secrets file.  

(again, this is irrelevant to the problem)
Summary: sea-hp-linux64-{2,3,10-13} are not connecting to the master → some of sea-hp-linux64-* are not connecting to the master
Even if a slave comes up and does a build, once it reboots it's disconnected
from the master.
Attached file runner.log
This is the runner.log from sea-hp-linux64-13.
Looking through the attached log, something is a little weird (ignoring the
clobbering issue for the moment).  

runner.py has a bunch of tasks in /opt/runner/tasks.d.

runner.py executes each in order, starting with 0-checkout_tools, then
task_hook (which runs influxdb_hook.py), and then clobberer.py (via
3-clobber).
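The task-directory layout above suggests a run-parts-style loop: execute everything in the directory in lexical order, and stop at the first failure. The sketch below is an illustrative reconstruction, not runner's actual code; the halt-on-failure behavior is an assumption based on the log.

```python
import os
import subprocess

def run_tasks(tasks_dir):
    """Run each executable task in lexical order (0-checkout_tools,
    ..., 3-clobber, ...), stopping at the first failure.
    Hypothetical sketch of runner's task loop."""
    results = {}
    for name in sorted(os.listdir(tasks_dir)):
        path = os.path.join(tasks_dir, name)
        if not os.access(path, os.X_OK):
            continue  # skip non-executable entries
        proc = subprocess.run([path], capture_output=True)
        results[name] = proc.returncode
        if proc.returncode != 0:
            break  # a failed task halts the run
    return results
```

The real runner also handles retries and hooks; this only shows why the numeric prefixes control execution order.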

Now here's where it gets weird.  It runs cleanslate.py, which supposedly
'cleans' the system of root logins (which makes debugging this via root hard,
since it disconnects all root logins).  However, aside from killing the root
'su' session, it also kills the parent sshd process.

i.e.:
2015-03-04 05:38:40,690 - DEBUG - Adding pid:2710 cmd:'sshd: seabld@pts/0' to kill set.
2015-03-04 05:38:40,690 - DEBUG - Adding pid:2711 cmd:'-bash' to kill set.
2015-03-04 05:38:40,690 - DEBUG - Adding pid:2734 cmd:'su' to kill set.
2015-03-04 05:38:40,690 - DEBUG - Killing process 2734 with SIGNAL 15
2015-03-04 05:38:40,790 - DEBUG - (failed to kill 2734 via sig 15) 
2015-03-04 05:38:40,790 - DEBUG - Killing process 2710 with SIGNAL 15
2015-03-04 05:38:40,891 - DEBUG - Killing process 2711 with SIGNAL 15
2015-03-04 05:38:40,920 - DEBUG - (failed to kill 2711 via sig 15) 
2015-03-04 05:38:40,920 - DEBUG - Killing process 2734 with SIGNAL 9
2015-03-04 05:38:41,020 - DEBUG - Killing process 2711 with SIGNAL 9
2015-03-04 05:38:41,020 - DEBUG - (failed to kill 2711 via sig 9) 
Traceback (most recent call last):
  File "/opt/cleanslate/bin/cleanslate.py", line 181, in <module>
    clean_process_set(args.user, args.filename, args.snapshot, args.dryrun)
  File "/opt/cleanslate/bin/cleanslate.py", line 147, in clean_process_set
    raise Exception('Failed to kill: %s' % fail_set)
Exception: Failed to kill: set([2711])
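The escalation visible in that log (SIGTERM first, a probe, then SIGKILL, then an exception for anything still alive) can be sketched roughly as below. This is a hypothetical reconstruction, not the real cleanslate.py; the grace period and probe technique are assumptions.

```python
import os
import signal
import time

def kill_process_set(pids, grace=0.1):
    """Escalate like cleanslate.py appears to: SIGTERM, then SIGKILL,
    and raise for anything that survives both (hypothetical sketch)."""
    remaining = set(pids)
    for sig in (signal.SIGTERM, signal.SIGKILL):
        for pid in list(remaining):
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                remaining.discard(pid)  # already gone
        time.sleep(grace)  # give processes a moment to exit
        for pid in list(remaining):
            try:
                os.kill(pid, 0)  # signal 0 just probes for existence
            except ProcessLookupError:
                remaining.discard(pid)
    if remaining:
        raise Exception('Failed to kill: %s' % remaining)
```

A pid like 2711 that survives even SIGKILL is usually stuck in uninterruptible sleep or left as an unreaped zombie, which would explain the "failed to kill ... via sig 9" lines.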

After this, it somehow jogs 'something' and the build process starts
(i.e. the slave gets a job).  But once the job completes and the system
reboots, the whole cycle restarts.  (Yes, a bit handwavy, but I haven't
found out what's really going on; this is just what I've seen so far.)
Summary: some of sea-hp-linux64-* are not connecting to the master → some/all of sea-hp-linux64-* are not connecting to the master
Severity: normal → blocker
So 2711 in your log is an ssh session with bash.  I'm curious what the twistd.log output looks like, and what else is (or may be) failing in runner.
I'm looking at both the master's twistd.log and the slave's twistd.log (on -2) and
came across the following tidbit, which wasn't in any of the logs before 26th Feb.

-2's log:
2015-03-03 03:33:43-0800 [Broker,client]   using PTY: False
2015-03-03 03:33:44-0800 [-] command finished with signal None, exit code 2, elapsedTime: 0.191815
2015-03-03 03:33:44-0800 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x2d8cd88>
2015-03-03 03:33:45-0800 [Broker,client] slave shutting down on command from master
2015-03-03 03:33:45-0800 [Broker,client] NOTE: master is using deprecated slavebuilder.shutdown method
2015-03-03 03:33:45-0800 [Broker,client] lost remote

master's log:  (around the same time)
2015-03-03 03:33:44-0800 [Broker,451600,63.245.223.113] <RemoteShellCommand '['mock_mozilla', '-v', '-r', 'mozilla-centos6-i386', '--cwd', '/builds/slave/c-cen-t-lnx-l10n-ntly/build/objdir/suite/locales', '--unpriv', '--shell', '/usr/bin/env DOWNLOAD_BASE_URL="http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly" SYMBOL_SERVER_HOST="symbolpush.mozilla.org" EN_US_BINARY_URL="http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-central-trunk" CCACHE_DIR="/builds/ccache" POST_SYMBOL_UPLOAD_CMD="/usr/local/bin/post-symbol-upload.py" MOZ_AUTOMATION="1" SYMBOL_SERVER_SSH_KEY="/home/seabld/.ssh/seabld_dsa" MOZ_MAKE_COMPLETE_MAR="1" LD_LIBRARY_PATH="/tools/gcc-4.5/lib" PATH="/tools/buildbot/bin:/usr/local/bin:/usr/lib/ccache:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/tools/git/bin:/tools/python27/bin:/tools/python27-mercurial/bin:/home/cltbld/bin" MOZ_UPDATE_CHANNEL="nightly" L10NBASEDIR="../../l10n" CVS_RSH="ssh" CCACHE_COMPRESS="1" SYMBOL_SERVER_PATH="/mnt/netapp/breakpad/symbols_sea/" MOZ_OBJDIR="objdir" MOZ_CRASHREPORTER_NO_REPORT="1" SYMBOL_SERVER_USER="seabld" DISPLAY=":2" TINDERBOX_OUTPUT="1" CCACHE_UMASK="002" make unpack']'> rc=2
2015-03-03 03:33:44-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0x1c6a20e0>
2015-03-03 03:33:44-0800 [-] releaseLocks(<buildbotcustom.steps.mock.MockCommand instance at 0xc07b5a8>): []
2015-03-03 03:33:44-0800 [-]  step 'make_unpack' complete: failure
2015-03-03 03:33:44-0800 [-]  <Build Seamonkey comm-central-trunk linux l10n nightly>: build finished
2015-03-03 03:33:44-0800 [-] sending mail (59767 bytes) to ['tinderbox-daemon@tinderbox.mozilla.org']
2015-03-03 03:33:44-0800 [-] Starting factory <twisted.mail.smtp.ESMTPSenderFactory instance at 0x5a6e7e8>
2015-03-03 03:33:44-0800 [-] sending mail (59773 bytes) to ['tinderbox-daemon@tinderbox.mozilla.org']
2015-03-03 03:33:44-0800 [-] Starting factory <twisted.mail.smtp.ESMTPSenderFactory instance at 0x5a6eab8>
2015-03-03 03:33:45-0800 [-] Shutting down slave: sea-hp-linux64-2
2015-03-03 03:33:45-0800 [-] releaseLocks(<BuildSlave 'sea-hp-linux64-2'>): []
2015-03-03 03:33:45-0800 [Broker,451600,63.245.223.113] Lost connection to sea-hp-linux64-2
2015-03-03 03:33:45-0800 [Broker,451600,63.245.223.113] BuildSlave.detached(sea-hp-linux64-2)

Now I'm looking for changes that might have landed around 26th Feb.
Also, while digging around runner, I noticed we get a bunch of the following
errors:

/opt/runner/task_hook.py failed: list index out of range
/opt/runner/task_hook.py failed: list index out of range
/opt/runner/task_hook.py failed: list index out of range
/opt/runner/task_hook.py failed: list index out of range

I'm guessing http://hg.mozilla.org/build/puppet/rev/0d428cb7fc4a or
anything related to influxdb_hook.py doesn't apply to SeaMonkey's infra.
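For what it's worth, a "list index out of range" from a hook usually points at unguarded positional indexing — e.g. the hook assuming an argument or config field that SeaMonkey's runner setup doesn't provide. A hypothetical illustration of that failure mode (not the actual task_hook.py code):

```python
def get_hook_arg(argv):
    """Hypothetical: a hook that assumes a positional argument is
    present raises IndexError ('list index out of range') when runner
    invokes it without one.  Guarding the access degrades gracefully."""
    try:
        return argv[1]
    except IndexError:
        return None  # no argument supplied; skip the hook
```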
Some more digging:

Buildbot master is running 0.8.2
Buildslaves are running 0.8.4-pre-moz5

Thus the INFO in the log: 
  NOTE: master is using deprecated slavebuilder.shutdown method

This cset http://hg.mozilla.org/build/puppet/rev/4fa2d1cb198d from bug 1103123,
might be causing some havoc with our linux slaves.

Callek, could this actually be the smoking gun?
Flags: needinfo?(bugspam.Callek)
(In reply to Edmund Wong (:ewong) from comment #9)
> Some more digging:
> 
> Buildbot master is running 0.8.2
> Buildslaves are running 0.8.4-pre-moz5
> 
> Thus the INFO in the log: 
>   NOTE: master is using deprecated slavebuilder.shutdown method
> 
> This cset http://hg.mozilla.org/build/puppet/rev/4fa2d1cb198d from bug
> 1103123,
> might be causing some havoc with our linux slaves.
> 

Actually, that whole bug might be a potential cause for this bug.
OS: Windows 8.1 → Linux
Blocks: SM2.33
I applied the patch to sea-puppet locally, and committed
http://hg.mozilla.org/users/Callek_gmail.com/tools/rev/3940dd3916fe.

Since there are no legacy arguments for clobber.sh (http://hg.mozilla.org/build/puppet/file/9af78063613b/modules/runner/files/clobber.sh#l12),
clobberer.py enters the 'new' mode, which expects the output of
--url (i.e. clobberer_url) to return a JSON struct.  However, our
clobberer_url doesn't do that, so it chokes with a JSON error.

Now, I thought I could patch clobberer.py to use the two
parameters --url (clobberer_url) and --dir (basically /builds/slave),
and then clobber the stuff in /builds/slave by taking a list of the
current dirs in /builds/slave, comparing it with the last_clobber (if it
exists), and clobbering as needed.  However, is it possible for a slave
to take two different jobs at the same time?  (I'm thinking not.)
(I'd also need to whitelist the buildbot.tac file, just in case.)

But then there'll be some duplication with our current buildbot clobber
step. (not that this is a big deal)
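The legacy-style clobber described above — list the dirs under /builds/slave, compare each against its last-clobber stamp, wipe as needed, and whitelist files the slave needs to restart — might look something like this. The stamp filename, whitelist contents, and age threshold are assumptions for illustration, not the real clobberer.py:

```python
import os
import shutil
import time

# Names that must survive a clobber so the slave can still start
# (buildbot.tac and twistd.* per the comment above; all assumed here).
WHITELIST = {'buildbot.tac', 'twistd.log', 'last-clobber'}

def clobber_dir(base_dir, max_age_seconds):
    """Wipe any builder dir under base_dir whose last-clobber stamp
    is missing or older than max_age_seconds (hypothetical sketch)."""
    now = time.time()
    for name in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, name)
        if name in WHITELIST or not os.path.isdir(path):
            continue
        stamp = os.path.join(path, 'last-clobber')
        try:
            with open(stamp) as f:
                last = float(f.read().strip())
        except (IOError, ValueError):
            last = 0  # no readable stamp: treat as never clobbered
        if now - last > max_age_seconds:
            shutil.rmtree(path)
```

The concurrency worry in the comment matters here: if two jobs could share a slave, one job's clobber could delete the other's working dir, which is why the question of simultaneous jobs is worth settling first.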
Comment on attachment 8573766 [details] [diff] [review]
add runner_clobberer_url to seamonkey-config.pp

Review of attachment 8573766 [details] [diff] [review]:
-----------------------------------------------------------------

Shouldn't you ask Callek for these reviews?
Attachment #8573766 - Flags: review?(dustin) → review+
This patch is to ensure that the runner process can do the clobber, as well as each build being able to do its own clobbering.
Attachment #8573929 - Flags: review?(bugspam.Callek)
Comment on attachment 8573929 [details] [diff] [review]
clobberer.py changes patch (v1)

Pushed to the repo: 
http://hg.mozilla.org/users/Callek_gmail.com/tools/rev/4906379d0ad8
Ok, with both patches 'committed', the clobbering is working.

But one side effect is that it takes a while for a slave to reappear,
due to wholesale /builds/slave clobbering (I have added exceptions
so that it doesn't kill buildbot.tac and twistd.*).  So I'm not
100% sure the clobberer.py modifications are that great, but at the point
runner is running (no pun intended), it doesn't know about any jobs or
branches.

Right now I'm guessing all the linux slaves are currently blowing away
/builds/slave (minus exceptions), so it's gonna take some time. (And this is
only linux.  A cursory glance at osx64 looks ok, though osx64 is backed up
with jobs.)
Comment on attachment 8573766 [details] [diff] [review]
add runner_clobberer_url to seamonkey-config.pp

Pushed to puppet:
 https://hg.mozilla.org/build/puppet/rev/a7963d22adae
Colour me very stunned!

They're BUILDING AGAIN!

Two things to note:

1)  /opt/runner/task_hook.py failed: list index out of range
  this needs fixing. 

2) Probably very important to note: logging on to do anything on the
linux slaves will require 'quickly' killing the runner process, because
it will kill all processes that weren't there during boot (as noted with
cleanslate.py), i.e. ssh sessions and su sessions.

Anyway, will keep monitoring the slaves and see how they react to booting.
Note: I removed the manual patch of sea-puppet's manifest/seamonkey-config.pp.

I just pushed the seamonkey-config.pp to puppet's default branch.  Waiting
for someone to merge the default to production.  Then they should be back up.
(In reply to Edmund Wong (:ewong) from comment #19)
> Note: I removed the manual patch of sea-puppet's
> manifest/seamonkey-config.pp.
> 
> I just pushed the seamonkey-config.pp to puppet's default branch.  Waiting
> for someone to merge the default to production.  Then they should be back up.

Merge has happened
Flags: needinfo?(bugspam.Callek)
Comment on attachment 8573929 [details] [diff] [review]
clobberer.py changes patch (v1)

Review of attachment 8573929 [details] [diff] [review]:
-----------------------------------------------------------------

stamp, assuming that this matches what :mrrrgn did in production. :-)
Attachment #8573929 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8573929 [details] [diff] [review]
clobberer.py changes patch (v1)

Actually, no.  It's 'like' what he did, but I added some extra
code to adapt clobberer.py to our current buildbotcustom
expectations.

In particular, the parsed_args.skip.extend(...) part; I also refactored
out code into formulate_legacy_data(), since I needed it both to clobber
via runner's 3-clobber and via the buildbotcustom clobber step.
Assignee: nobody → ewong
Status: NEW → ASSIGNED
They're back up, so I'm closing this bug.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED