Some slaves do not reboot - twistd.pid still exists

RESOLVED FIXED

Status

RESOLVED FIXED
8 years ago
5 years ago

People

(Reporter: armenzg, Assigned: dustin)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [slaveduty])

Attachments

(1 attachment)

(Reporter)

Description

8 years ago
The job will finish properly but the reboot step did not happen.

In fact, if you look at the job you will notice that the reboot step took more than 30 seconds instead of 2-3 seconds. I have not yet made up my mind if the job should be marked with a different colour or not.

The symptom is that the machine finishes the job properly but the machine does not reboot.


2011-04-07 19:38:53-0700 [-] Received SIGTERM, shutting down.
2011-04-07 19:38:53-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x10183bc20>
2011-04-07 19:38:53-0700 [-] command interrupted, attempting to kill
2011-04-07 19:38:53-0700 [-] trying to kill process group 348
2011-04-07 19:38:53-0700 [-] failed to kill process group (ignored): [Errno 1] Operation not permitted
2011-04-07 19:38:53-0700 [-] trying process.signalProcess('KILL')
2011-04-07 19:38:53-0700 [-]  signal KILL sent successfully
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote
...
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] Received SIGTERM, shutting down.
2011-04-07 19:38:53-0700 [Broker,client] lost remote
...
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote step
2011-04-07 19:38:53-0700 [Broker,client] Lost connection to buildbot-master2.build.scl1.mozilla.com:9012
2011-04-07 19:38:53-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1016b7440>
2011-04-07 19:38:53-0700 [-] Main loop terminated.
2011-04-07 19:38:53-0700 [-] Server Shut Down.
(Reporter)

Comment 1

8 years ago
I have rebooted the following:
talos-r3-snow-027    2011-04-07 19:38 bm02:9012
talos-r3-snow-017    2011-04-07 21:09 bm05:9011
talos-r3-leopard-022 2011-04-07 16:15 bm02:9012
talos-r3-leopard-014 2011-04-07 14:52 bm01:9012
talos-r3-leopard-006

dustin is having a look at talos-r3-snow-014
failed to kill process group (ignored): [Errno 1]
Operation not permitted

this is normal.  What concerns me is 

2011-04-07 19:38:53-0700 [-] Received SIGTERM, shutting down.

since the slave shouldn't normally be getting signals.  I think what's happening here is that the reboot is causing the buildslave process to shut down, which is then stopping the command.

I think all of this is relatively normal, but it doesn't point to a good reason for the slave not to shut down.  Running 'sudo reboot' by hand later succeeds immediately.
Armen, if you'd like to turn this bug into an actionable item then please reopen but otherwise this is just a discussion in a bug with no clear path for action and I'm closing it.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → INVALID
(Reporter)

Comment 4

8 years ago
There is one more thing I need to check.
Assignee: nobody → armenzg
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
I think this is "most" leopard and snow slaves at this point.  It looks like it's some interaction between the OS shutdown process and the buildslave process:
 1. run count_and_reboot.py
 2. trigger os reboot
 3. buildslave gets SIGTERM
 4. buildslave kills count_and_reboot.py
 5. buildslave disconnects from master
 6. os reboot doesn't complete

I'm guessing that if we could catch this quickly enough, we'd see an onscreen prompt of some sort, that disappears after a timeout.
Priority: -- → P2
I've now seen the following twice:

2011-04-14 06:58:58-0700 [Broker,client]  startCommand:shell [id 111505]
2011-04-14 06:58:58-0700 [Broker,client] RunProcess._startCommand
2011-04-14 06:58:58-0700 [Broker,client]  python count_and_reboot.py -f ../talos_count.txt -n 1 -z
2011-04-14 06:58:58-0700 [Broker,client]   in dir /Users/cltbld/talos-slave/test/../talos-data (timeout 1200 secs)
2011-04-14 06:58:58-0700 [Broker,client]   watching logfiles {}
2011-04-14 06:58:58-0700 [Broker,client]   argv: ['python', 'count_and_reboot.py', '-f', '../talos_count.txt', '-n', '1', '-z']
2011-04-14 06:58:58-0700 [Broker,client]  environment: {'XPCOM_DEBUG_BREAK': 'warn', 'Apple_PubSub_Socket_Render': '/tmp/launch-RYSGlh/Render', 'CVS_RSH': 'ssh', 'SSH_AUTH_SOCK': '/tmp/launch-5rNZKk/Listeners', 'MOZ_CRASHREPORTER_NO_RE
PORT': '1', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'MOZ_NO_REMOTE': '1', 'PWD': '/Users/cltbld/talos-slave/talos-data', 'SHELL': '/bin/bash', 'LOGNAME': 'cltbld', 'USER': 'cltbld', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
:/usr/X11/bin', 'HOME': '/Users/cltbld', 'NO_EM_RESTART': '1', 'PAGER': '/bin/cat', 'DISPLAY': '/tmp/launch-r2Pz8f/:0', 'TMPDIR': '/var/folders/Xr/Xr--yJnSEY0U11ET5NZuMU+++TM/-Tmp-/'}
2011-04-14 06:58:58-0700 [Broker,client]   using PTY: False
2011-04-14 06:59:00-0700 [-] Received SIGTERM, shutting down.
2011-04-14 06:59:00-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x722030>
2011-04-14 06:59:00-0700 [-] command interrupted, attempting to kill
2011-04-14 06:59:00-0700 [-] trying to kill process group 518
2011-04-14 06:59:00-0700 [-]  signal 9 sent successfully
2011-04-14 06:59:00-0700 [Broker,client] lost remote
2011-04-14 06:59:00-0700 [Broker,client] lost remote
...
2011-04-14 06:59:00-0700 [Broker,client] lost remote
2011-04-14 06:59:00-0700 [Broker,client] lost remote step
2011-04-14 06:59:00-0700 [Broker,client] Lost connection to buildbot-master5.build.scl1.mozilla.com:9011
2011-04-14 06:59:00-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x65ce68>
2011-04-14 06:59:00-0700 [-] Main loop terminated.
2011-04-14 06:59:00-0700 [-] Server Shut Down.
2011-04-14 07:01:00-0700 [-] Log opened.
2011-04-14 07:01:00-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.5.1) starting up.
2011-04-14 07:01:00-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.

In this case, it looks like the slave rebooted but just didn't come up all the way.  I don't know what to make of it beyond that point.
The above was on talos-r3-leopard-026.  I see the same on talos-r3-leopard-025, and last night I saw it on talos-r3-leopard-043.
(Reporter)

Comment 8

8 years ago
I was only assigned to this bug to check what nthomas checked on https://bugzilla.mozilla.org/show_bug.cgi?id=652744#c5.

I have no special expertise on how buildbot handles SIGTERM or any good theory on what to investigate. I am unassigning myself because of this.
Assignee: armenzg → nobody
Priority: P2 → --

Updated

8 years ago
Whiteboard: [slave

Updated

8 years ago
Whiteboard: [slave → [slaveduty]
I just saw this on linux-ix-slave12.  It *is* actually rebooting:

2011-04-27 14:42:23-0700 [Broker,client] lost remote
2011-04-27 14:42:23-0700 [Broker,client] lost remote
2011-04-27 14:42:23-0700 [Broker,client] lost remote step
2011-04-27 14:42:23-0700 [Broker,client] Lost connection to buildbot-master5.build.mozilla.org:9010
2011-04-27 14:42:23-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x92d136c>
2011-04-27 14:42:23-0700 [-] Main loop terminated.
2011-04-27 14:42:23-0700 [-] Server Shut Down.
[cltbld@linux-ix-slave12 slave]$ date
Wed Apr 27 14:56:58 PDT 2011
[cltbld@linux-ix-slave12 slave]$ uptime
 14:57:00 up 13 min,  1 user,  load average: 0.00, 0.07, 0.14
[cltbld@linux-ix-slave12 slave]$ 

so it looks like the slave startup process is failing for some reason.

I don't see a log entry in the slave allocator from ~14:45, which is when puppet finished running.

I re-ran /etc/init.d/buildbot start as root, and it started right up.  I'll see if I can replicate this through repetition tomorrow.
Assignee: nobody → dustin
OK, I did manage to replicate this on linux-ix-slave05 by rebooting repeatedly until twistd failed to start up.  Now, to add some debugging to /etc/init.d/buildbot!
Not entirely unexpectedly:

---
Another twistd server is running, PID 3231

This could either be a previously started instance of your application or a
different application entirely. To start a new one, either run it in some other
directory, or use the --pidfile and --logfile parameters to avoid clashes.

Fri Apr 29 15:42:13 PDT 2011 starting buildbot -> 1
---

I suspect that the right thing to do is to configure runslave.py to automatically delete the pidfile on startup.  Thoughts?
Summary: Some leopard and snow testing machines do not reboot → Some slaves do not reboot
Created attachment 529250 [details] [diff] [review]
m648665-puppet-manifests-r1.patch

I tested this manually on linux-ix-slave05 and it successfully deleted the pidfile and started the slave.
Attachment #529250 - Flags: review?(armenzg)
(Reporter)

Comment 13

8 years ago
Comment on attachment 529250 [details] [diff] [review]
m648665-puppet-manifests-r1.patch

It seems straight forward. I am trying to picture if there could be any fallouts but I can't think of any.

The assumption is that while we reboot buildbot does not manage to clean up the twistd.pid. From your findings it seems than then it fails to start buildbot.

How did we use to start/stop buildbot differently with runslave.py? Do you think we used to hit this before as well?

How did Windows behave with this change?
Attachment #529250 - Flags: review?(armenzg) → review+
Summary: Some slaves do not reboot → Some slaves do not reboot - twistd.pid still exists
(In reply to comment #13)
> How did we use to start/stop buildbot differently with runslave.py? Do you
> think we used to hit this before as well?

I suspect that we did.  There was an open bug to remove twistd.pid with puppet.

> How did Windows behave with this change?

It doesn't affect windows, because twistd.pid does not exist there.
(Reporter)

Comment 15

8 years ago
Thanks for your replies!

Thanks for fixing this :)
(Reporter)

Updated

8 years ago
Blocks: 649734
This was deployed along with bug 629694 (with the wrong bug in the commit message): http://hg.mozilla.org/build/puppet-manifests/rev/b458579e4665
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.