648665 - Some slaves do not reboot - twistd.pid still exists

Reporter

Description

•

13 years ago

The job will finish properly but the reboot step did not happen.

In fact, if you look at the job you will notice that the reboot step took more than 30 seconds instead of 2-3 seconds. I have not yet made up my mind if the job should be marked with a different colour or not.

The symptom is that the machine finishes the job properly but the machine does not reboot.


2011-04-07 19:38:53-0700 [-] Received SIGTERM, shutting down.
2011-04-07 19:38:53-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x10183bc20>
2011-04-07 19:38:53-0700 [-] command interrupted, attempting to kill
2011-04-07 19:38:53-0700 [-] trying to kill process group 348
2011-04-07 19:38:53-0700 [-] failed to kill process group (ignored): [Errno 1] Operation not permitted
2011-04-07 19:38:53-0700 [-] trying process.signalProcess('KILL')
2011-04-07 19:38:53-0700 [-]  signal KILL sent successfully
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote
...
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] Received SIGTERM, shutting down.
2011-04-07 19:38:53-0700 [Broker,client] lost remote
...
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote
2011-04-07 19:38:53-0700 [Broker,client] lost remote step
2011-04-07 19:38:53-0700 [Broker,client] Lost connection to buildbot-master2.build.scl1.mozilla.com:9012
2011-04-07 19:38:53-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x1016b7440>
2011-04-07 19:38:53-0700 [-] Main loop terminated.
2011-04-07 19:38:53-0700 [-] Server Shut Down.

Armen [:armenzg]

Reporter

Comment 1

•

13 years ago

I have rebooted the following:
talos-r3-snow-027    2011-04-07 19:38 bm02:9012
talos-r3-snow-017    2011-04-07 21:09 bm05:9011
talos-r3-leopard-022 2011-04-07 16:15 bm02:9012
talos-r3-leopard-014 2011-04-07 14:52 bm01:9012
talos-r3-leopard-006

dustin is having a look at talos-r3-snow-014

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 2

•

13 years ago

failed to kill process group (ignored): [Errno 1]
Operation not permitted

this is normal.  What concerns me is 

2011-04-07 19:38:53-0700 [-] Received SIGTERM, shutting down.

since the slave shouldn't normally be getting signals.  I think what's happening here is that the reboot is causing the buildslave process to shut down, which is then stopping the command.

I think all of this is relatively normal, but it doesn't point to a good reason for the slave not to shut down.  Running 'sudo reboot' by hand later succeeds immediately.

Lukas Blakk [:lsblakk] use ?needinfo

Comment 3

•

13 years ago

Armen, if you'd like to turn this bug into an actionable item then please reopen but otherwise this is just a discussion in a bug with no clear path for action and I'm closing it.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → INVALID

Armen [:armenzg]

Reporter

Comment 4

•

13 years ago

There is one more thing I need to check.

Assignee: nobody → armenzg

Status: RESOLVED → REOPENED

Resolution: INVALID → ---

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 5

•

13 years ago

I think this is "most" leopard and snow slaves at this point.  It looks like it's some interaction between the OS shutdown process and the buildslave process:
 1. run count_and_reboot.py
 2. trigger os reboot
 3. buildslave gets SIGTERM
 4. buildslave kills count_and_reboot.py
 5. buildslave disconnects from master
 6. os reboot doesn't complete

I'm guessing that if we could catch this quickly enough, we'd see an onscreen prompt of some sort, that disappears after a timeout.

Priority: -- → P2

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 6

•

13 years ago

I've now seen the following twice:

2011-04-14 06:58:58-0700 [Broker,client]  startCommand:shell [id 111505]
2011-04-14 06:58:58-0700 [Broker,client] RunProcess._startCommand
2011-04-14 06:58:58-0700 [Broker,client]  python count_and_reboot.py -f ../talos_count.txt -n 1 -z
2011-04-14 06:58:58-0700 [Broker,client]   in dir /Users/cltbld/talos-slave/test/../talos-data (timeout 1200 secs)
2011-04-14 06:58:58-0700 [Broker,client]   watching logfiles {}
2011-04-14 06:58:58-0700 [Broker,client]   argv: ['python', 'count_and_reboot.py', '-f', '../talos_count.txt', '-n', '1', '-z']
2011-04-14 06:58:58-0700 [Broker,client]  environment: {'XPCOM_DEBUG_BREAK': 'warn', 'Apple_PubSub_Socket_Render': '/tmp/launch-RYSGlh/Render', 'CVS_RSH': 'ssh', 'SSH_AUTH_SOCK': '/tmp/launch-5rNZKk/Listeners', 'MOZ_CRASHREPORTER_NO_RE
PORT': '1', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'MOZ_NO_REMOTE': '1', 'PWD': '/Users/cltbld/talos-slave/talos-data', 'SHELL': '/bin/bash', 'LOGNAME': 'cltbld', 'USER': 'cltbld', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
:/usr/X11/bin', 'HOME': '/Users/cltbld', 'NO_EM_RESTART': '1', 'PAGER': '/bin/cat', 'DISPLAY': '/tmp/launch-r2Pz8f/:0', 'TMPDIR': '/var/folders/Xr/Xr--yJnSEY0U11ET5NZuMU+++TM/-Tmp-/'}
2011-04-14 06:58:58-0700 [Broker,client]   using PTY: False
2011-04-14 06:59:00-0700 [-] Received SIGTERM, shutting down.
2011-04-14 06:59:00-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x722030>
2011-04-14 06:59:00-0700 [-] command interrupted, attempting to kill
2011-04-14 06:59:00-0700 [-] trying to kill process group 518
2011-04-14 06:59:00-0700 [-]  signal 9 sent successfully
2011-04-14 06:59:00-0700 [Broker,client] lost remote
2011-04-14 06:59:00-0700 [Broker,client] lost remote
...
2011-04-14 06:59:00-0700 [Broker,client] lost remote
2011-04-14 06:59:00-0700 [Broker,client] lost remote step
2011-04-14 06:59:00-0700 [Broker,client] Lost connection to buildbot-master5.build.scl1.mozilla.com:9011
2011-04-14 06:59:00-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x65ce68>
2011-04-14 06:59:00-0700 [-] Main loop terminated.
2011-04-14 06:59:00-0700 [-] Server Shut Down.
2011-04-14 07:01:00-0700 [-] Log opened.
2011-04-14 07:01:00-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.5.1) starting up.
2011-04-14 07:01:00-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.

In this case, it looks like the slave rebooted but just didn't come up all the way.  I don't know what to make of it beyond that point.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 7

•

13 years ago

The above was on talos-r3-leopard-026.  I see the same on talos-r3-leopard-025, and last night I saw it on talos-r3-leopard-043.

Armen [:armenzg]

Reporter

Comment 8

•

13 years ago

I was only assigned to this bug to check what nthomas checked on https://bugzilla.mozilla.org/show_bug.cgi?id=652744#c5.

I have no special expertise on how buildbot handles SIGTERM or any good theory on what to investigate. I am unassigning myself because of this.

Assignee: armenzg → nobody

Priority: P2 → --

Mike Taylor [:bear]

Updated

•

13 years ago

Whiteboard: [slave

Mike Taylor [:bear]

Updated

•

13 years ago

Whiteboard: [slave → [slaveduty]

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 9

•

13 years ago

I just saw this on linux-ix-slave12.  It *is* actually rebooting:

2011-04-27 14:42:23-0700 [Broker,client] lost remote
2011-04-27 14:42:23-0700 [Broker,client] lost remote
2011-04-27 14:42:23-0700 [Broker,client] lost remote step
2011-04-27 14:42:23-0700 [Broker,client] Lost connection to buildbot-master5.build.mozilla.org:9010
2011-04-27 14:42:23-0700 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x92d136c>
2011-04-27 14:42:23-0700 [-] Main loop terminated.
2011-04-27 14:42:23-0700 [-] Server Shut Down.
[cltbld@linux-ix-slave12 slave]$ date
Wed Apr 27 14:56:58 PDT 2011
[cltbld@linux-ix-slave12 slave]$ uptime
 14:57:00 up 13 min,  1 user,  load average: 0.00, 0.07, 0.14
[cltbld@linux-ix-slave12 slave]$ 

so it looks like the slave startup process is failing for some reason.

I don't see a log entry in the slave allocator from ~14:45, which is when puppet finished running.

I re-ran /etc/init.d/buildbot start as root, and it started right up.  I'll see if I can replicate this through repetition tomorrow.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

13 years ago

Assignee: nobody → dustin

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 10

•

13 years ago

OK, I did manage to replicate this on linux-ix-slave05 by rebooting repeatedly until twistd failed to start up.  Now, to add some debugging to /etc/init.d/buildbot!

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 11

•

13 years ago

Not entirely unexpectedly:

---
Another twistd server is running, PID 3231

This could either be a previously started instance of your application or a
different application entirely. To start a new one, either run it in some other
directory, or use the --pidfile and --logfile parameters to avoid clashes.

Fri Apr 29 15:42:13 PDT 2011 starting buildbot -> 1
---

I suspect that the right thing to do is to configure runslave.py to automatically delete the pidfile on startup.  Thoughts?

Summary: Some leopard and snow testing machines do not reboot → Some slaves do not reboot

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 12

•

13 years ago

Attached patch m648665-puppet-manifests-r1.patch — Details — Splinter Review

I tested this manually on linux-ix-slave05 and it successfully deleted the pidfile and started the slave.

Attachment #529250 - Flags: review?(armenzg)

Armen [:armenzg]

Reporter

Comment 13

•

13 years ago

Comment on attachment 529250 [details] [diff] [review]
m648665-puppet-manifests-r1.patch

It seems straight forward. I am trying to picture if there could be any fallouts but I can't think of any.

The assumption is that while we reboot buildbot does not manage to clean up the twistd.pid. From your findings it seems than then it fails to start buildbot.

How did we use to start/stop buildbot differently with runslave.py? Do you think we used to hit this before as well?

How did Windows behave with this change?

Attachment #529250 - Flags: review?(armenzg) → review+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

13 years ago

Summary: Some slaves do not reboot → Some slaves do not reboot - twistd.pid still exists

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 14

•

13 years ago

(In reply to comment #13)
> How did we use to start/stop buildbot differently with runslave.py? Do you
> think we used to hit this before as well?

I suspect that we did.  There was an open bug to remove twistd.pid with puppet.

> How did Windows behave with this change?

It doesn't affect windows, because twistd.pid does not exist there.

Armen [:armenzg]

Reporter

Comment 15

•

13 years ago

Thanks for your replies!

Thanks for fixing this :)

Armen [:armenzg]

Reporter

Updated

•

13 years ago

Blocks: 649734

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 16

•

13 years ago

This was deployed along with bug 629694 (with the wrong bug in the commit message): http://hg.mozilla.org/build/puppet-manifests/rev/b458579e4665

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Release Engineering