Closed Bug 522078 Opened 16 years ago Closed 16 years ago

Windows 2003 build slaves and talos XP sometimes get stuck at the login dialog

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Attachments

(3 files, 1 obsolete file)

We've seen this a few times before, and over the weekend nearly all of the slaves got hung in this manner. The fix is to connect and hit ctrl-alt-delete, and then things continue on their merry way. This seems to be related to load on the OPSI server, as outlined in https://bugzilla.mozilla.org/show_bug.cgi?id=521722#c7. I've asked the OPSI folks for some help here: https://forum.opsi.org/viewtopic.php?f=8&t=994. I'll be looking into this in the meantime, though.
One possible solution here may be to simply disable the need to hit ctrl-alt-del. I'll look into how to do this, and give it a try. This would be helpful for the Talos XP machines, too.
We've been rebooting talos machines after each test run without hitting login prompt before. Is this a new change-in-behaviour since opsi rollout?
(In reply to comment #2)
> We've been rebooting talos machines after each test run without hitting login
> prompt before.
>
> Is this a new change-in-behaviour since opsi rollout?
No, OPSI didn't change the ctrl-alt-del behaviour. We also didn't hit these problems pre-OPSI, though.
Just finished looking through the logs on a couple of slaves. There are clear gaps when the slave was missing, but nothing indicating why it failed to log in.
Since I can't reproduce the problem I don't know if this fixes it, but it _seems_ like it should. We can use this on the XP talos machines and win2k3 build machines. I don't think it's applicable to the Talos Vista problems.
Attachment #406084 - Flags: review?(ccooper)
Attachment #406084 - Flags: review?(ccooper) → review+
Comment on attachment 406084 [details] [diff] [review] opsi package to disable CAD at the login screen changeset: 23:0434d3f09a56
Attachment #406084 - Flags: checked-in+
Comment on attachment 406084 [details] [diff] [review] opsi package to disable CAD at the login screen I've set this package to install on all of our win32 build machines. I haven't done any testing on Talos XP yet, so I'll hold off on that for now.
The package is installing fine on all of the machines, it's hard to know if this has helped, though. I've got an OPSI server and ref vm locally and I'm going to try to reproduce the failure with them in the meantime.
Summary: Windows 2003 build slaves sometimes get stuck at the login dialog → Windows 2003 build slaves and talos XP sometimes get stuck at the login dialog
I tested out the disablecad package on a staging XP slave today and it deployed without issue. I gave it a few reboots and everything seems to be fine. I'll have it deploy on the production slaves and see how that goes.
Haven't seen any more of these since we rolled out the disablecad package. Let's let them run over the weekend, though.
Still haven't seen a re-occurrence of this. I'm declaring it FIXED.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Had a bunch of build and other slaves hit this over the weekend. Definitely seems related to load, as there was a spike early this morning.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Happened again today. Tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=527229. OPSI server was chewing all of the RAM until the opsiconfd service was restarted.
I added the following entry to root's crontab. Restarting the service will hopefully let us avoid this issue. @weekly /etc/init.d/opsiconfd restart
Status: REOPENED → ASSIGNED
See Also: → 517862
I've seen a few talos xp slaves stuck at the screensaver again recently. Haven't seen a highly loaded OPSI server to go along with it, though.
(In reply to comment #14)
> I added the following entry to root's crontab. Restarting the service will
> hopefully let us avoid this issue.
> @weekly /etc/init.d/opsiconfd restart
... which apparently means midnight on Sunday.
production-opsi# date
Mon Nov 30 04:35:11 CET 2009
--> We're after the cron trigger time. (Plus the clock is set to Central European time, and is 15 minutes slow).
# ps aux | grep -E '(USER|opsi)'
USER       PID %CPU %MEM     VSZ     RSS TTY STAT START   TIME COMMAND
993      28144  1.1 91.3 2171692 1896572 ?   Sl   Nov16 217:13 /usr/bin/python /usr/sbin/opsiconfd -D
# free -m
                    total  used  free  shared  buffers  cached
Mem:                 2028  1952    75       0        6      26
-/+ buffers/cache:         1919   108
Swap:                1019   593   426
--> opsiconfd isn't getting restarted by cron.
# /etc/init.d/opsiconfd restart
Stopping opsi config service... (done).
Starting opsi config service.... (done).
# ps aux | grep -E '(USER|opsi)'
USER       PID %CPU %MEM     VSZ     RSS TTY STAT START   TIME COMMAND
993      28144  1.1 91.5 2141000 1900860 ?   Rl   Nov16 217:15 /usr/bin/python /usr/sbin/opsiconfd -D
993       6700  2.6  0.5   15348   10812 ?   S    04:32   0:00 /usr/bin/python /usr/sbin/opsiconfd -D
Ended up stopping opsiconfd, killing process 28144, and starting it again.
Thanks for noticing that Nick, I'll try and fix it somehow...
I ended up hitting this issue as part of bug 429418 and trying a few other things to fix it. Here's what I found: OPSI has a configuration parameter called 'SecsUntilConnectionTimeout', which is set to 180 by default. When I changed this to 10, the OPSI dialog went away quicker and the automatic login succeeded. My theory is that the screen saver starting is what's breaking the login, not OPSI. By timing out sooner we prevent that from happening. This parameter lives in the registry, here: HKLM\Software\opsi.org\pcptch. It should be a simple OPSI package to roll this out.
This is an OPSI package that will lower the timeout of the OPSI connect dialog from 3 minutes to 30 seconds, which will prevent the screensaver from coming on and screwing up the automatic login. Tested in staging and the installation and uninstallation worked fine.
Attachment #415431 - Flags: review?(ccooper)
Attachment #415431 - Flags: review?(ccooper) → review+
Comment on attachment 415431 [details] [diff] [review] opsi package to lower the opsi timeout changeset: 25:02f055f2f155
Attachment #415431 - Flags: checked-in+
The timeout lowering package is set to install on all of our build and XP talos slaves.
The rollout went fine, I think this is fixed for real this time....
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
(In reply to comment #17)
> Thanks for noticing that Nick, I'll try and fix it somehow...
How'd you go with fixing up the cron job ?
(In reply to comment #23)
> (In reply to comment #17)
> > Thanks for noticing that Nick, I'll try and fix it somehow...
>
> How'd you go with fixing up the cron job ?
I forgot to =\
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I had to restart opsiconfd today because windows slaves were getting stuck at the screensaver (a simple vnc connection & mouse waggle fixes that up). Like comment 16 I had to kill the process because it didn't respond to a stop. Perhaps we should switch it from @weekly to every 2-3 days ? Also, "/etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd" seems to be missing a trailing "start".
(In reply to comment #25)
> I had to restart opsiconfd today because windows slaves were getting stuck at
> the screensaver (a simple vnc connection & mouse waggle fixes that up). Like
> comment 16 I had to kill the process because it didn't respond to a stop.
> Perhaps we should switch it from @weekly to every 2-3 days ? Also,
> "/etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd" seems to be
> missing a trailing "start".
I changed the cronjobs to be:
0 2 * * 0,2,4,6 /etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd start
Which should have us restarting the opsiconfd service at 2am every Sunday, Tuesday, Thursday, and Saturday. I'll check in on it tomorrow to confirm that.
(In reply to comment #25)
> I had to restart opsiconfd today because windows slaves were getting stuck at
> the screensaver (a simple vnc connection & mouse waggle fixes that up). Like
> comment 16 I had to kill the process because it didn't respond to a stop.
Oh, and I've had that problem too where the service doesn't respond to 'stop'. It gets that way when it hits the memory leak. Under "normal" circumstances it responds to signals sent from the init script.
production-opsi managed to restart the opsi process this morning, but staging-opsi didn't. I think it's because staging-opsi had an opsiconfd that was alive for a couple of weeks. Going to leave this open until Monday to see if it works on both machines then.
(In reply to comment #29)
> production-opsi managed to restart the opsi process this morning, but
> staging-opsi didn't. I think it's because staging-opsi had an opsiconfd that
> was alive for a couple of weeks. Going to leave this open until Monday to see
> if it works on both machines then.
Didn't work either. production-opsi had a week old process. Going to try doing it nightly - if that doesn't work, I'll try a different approach.
Restarting daily is working fine for production-opsi, but not for staging-opsi. The staging machine is a lot slower, so I'm going to try bumping the 'sleep' between stop and start.
Actually, we had a couple of production build slaves get stuck on the screensaver yesterday so I manually killed opsiconfd and restarted it. Perhaps we could change the cronjob to kill the process if it hasn't exited after the 15 second sleep ?
(In reply to comment #32)
> Actually, we had a couple of production build slaves get stuck on the
> screensaver yesterday so I manually killed opsiconfd and restarted it. Perhaps
> we could change the cronjob to kill the process if it hasn't exited after the
> 15 second sleep ?
Yeah....I'll write a script that actually does this properly.
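The idea proposed above (stop, wait 15 seconds, kill the daemon if it is still there, then start) could be sketched roughly as follows. This is not the attached script, which is not reproduced in this bug. The init script path and the 15-second wait come from the comments above; the function names, the DAEMON_PATTERN variable and the use of pgrep/pkill are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: the attached restart-opsiconfd.sh is not reproduced in this bug.
# INIT_SCRIPT and the 15-second wait come from the comments above; the function
# names and the pgrep/pkill pattern are assumptions.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/opsiconfd}"
STOP_WAIT="${STOP_WAIT:-15}"
DAEMON_PATTERN="${DAEMON_PATTERN:-/usr/sbin/opsiconfd}"

# Exit 0 if some process has the daemon path in its command line.
is_running() {
    pgrep -f "$DAEMON_PATTERN" >/dev/null
}

# Last resort for a daemon that ignores the init script's "stop".
force_kill() {
    pkill -KILL -f "$DAEMON_PATTERN"
}

restart_opsiconfd() {
    "$INIT_SCRIPT" stop
    sleep "$STOP_WAIT"
    if is_running; then
        echo "opsiconfd still running after ${STOP_WAIT}s, sending SIGKILL" >&2
        force_kill
        sleep 1   # short pause so the killed process is gone before starting
    fi
    "$INIT_SCRIPT" start
    is_running    # exit status reports whether a daemon process exists now
}
```

Calling restart_opsiconfd from a wrapper script would perform one restart cycle. The pattern is the daemon path rather than the bare word opsiconfd, so that a script named restart-opsiconfd.sh does not match itself.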
Attached file opsi service restarter cronjob (obsolete) —
Going to try this out on staging-opsi for a couple of days.
This script ended up killing opsiconfd overnight, and not restarting it.
Strangely, the script works perfectly fine when run at the command line. I'll have to investigate more, later.
(In reply to comment #36)
> Strangely, the script works perfectly fine when run at the command line. I'll
> have to investigate more, later.
Looks like the opsi init script doesn't work properly when run through cron with just 'bash restart-opsiconfd.sh'. Using 'bash -l -c ....' works though. I'm going to run it like this over the weekend on staging-opsi.
(In reply to comment #37)
> (In reply to comment #36)
> > Strangely, the script works perfectly fine when run at the command line. I'll
> > have to investigate more, later.
>
> Looks like the opsi init script doesn't work properly when run through cron
> with just 'bash restart-opsiconfd.sh'. Using 'bash -l -c ....' works though.
> I'm going to run it like this over the weekend on staging-opsi.
Looks like this worked over the weekend.
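The bug does not say why the init script misbehaves under plain cron. One common cause of this kind of difference is that cron starts jobs with a very small environment, while a login shell (bash -l) reads the profile files first. The sketch below is a way to reproduce such a difference by hand. The helper names and the exact variable values are assumptions; real cron defaults vary between systems.

```shell
#!/bin/bash
# Sketch for comparing a cron-like environment with a login shell.
# The PATH, SHELL and LOGNAME values are illustrative assumptions.

# run_like_cron CMD: run CMD via /bin/sh with a minimal environment.
run_like_cron() {
    env -i HOME="${HOME:-/root}" LOGNAME="${LOGNAME:-root}" \
        SHELL=/bin/sh PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}

# run_like_cron_login CMD: same starting environment, but through a bash
# login shell, which reads /etc/profile and the user's profile first.
run_like_cron_login() {
    env -i HOME="${HOME:-/root}" LOGNAME="${LOGNAME:-root}" \
        SHELL=/bin/sh PATH=/usr/bin:/bin \
        bash -l -c "$1"
}
```

For example, comparing the output of run_like_cron 'env | sort' with run_like_cron_login 'env | sort' shows which variables the login shell adds on a given machine.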
Attachment #422548 - Flags: review?(nrthomas)
Comment on attachment 422548 [details] opsi service restarter cronjob
>is_running() {
>    PID=`cat /var/run/opsiconfd/opsiconfd.pid 2>/dev/null`
>    if [ $PID ]; then
>        return 1
>    else
>        return 0
>    fi
>}
Does the pid file get emptied on SIGTERM and SIGKILL ? Perhaps we should be testing for an opsiconfd process. Otherwise it looks fine for shell. ;-)
(In reply to comment #39)
> (From update of attachment 422548 [details])
> >is_running() {
> >    PID=`cat /var/run/opsiconfd/opsiconfd.pid 2>/dev/null`
> >    if [ $PID ]; then
> >        return 1
> >    else
> >        return 0
> >    fi
> >}
>
> Does the pid file get emptied on SIGTERM and SIGKILL ? Perhaps we should be
> testing for an opsiconfd process.
Good catch...the PID file is cleaned up after SIGTERM, but not after SIGKILL. I'll fix is_running.
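Because the PID file survives a SIGKILL, its mere presence says nothing about whether the daemon is alive. One way to handle a possibly stale PID file is to ask the kernel whether the recorded PID still exists. The sketch below illustrates that approach; it is not the fix that was actually written for this bug, and the function name is hypothetical. The PID file path is the one quoted in the review above.

```shell
#!/bin/bash
# Sketch: treat the PID file as a hint and verify the process behind it.
PIDFILE="${PIDFILE:-/var/run/opsiconfd/opsiconfd.pid}"

# Exit 0 only if PIDFILE holds a positive number naming an existing process.
pidfile_is_live() {
    local pid
    pid=$(cat "$PIDFILE" 2>/dev/null) || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    [ "$pid" -gt 0 ] || return 1
    # Signal 0 sends nothing; it only checks that the PID can be signalled.
    # Run as root (as the cron job here is) so permission errors do not
    # masquerade as "not running".
    kill -0 "$pid" 2>/dev/null
}
```

A check like this still cannot tell whether the PID has been reused by an unrelated process, which is one reason to test for the opsiconfd process itself instead.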
This seems to work. I've got it setup to run on staging.
Attachment #422548 - Attachment is obsolete: true
Attachment #423987 - Flags: review?(nrthomas)
Attachment #422548 - Flags: review?(nrthomas)
Comment on attachment 423987 [details] [diff] [review] opsi service restarter, v2 r+. Is the cronmail set to let us know if it fails ?
Attachment #423987 - Flags: review?(nrthomas) → review+
We got a nagios check for the opsiconfd process set up a couple of days ago, and it's reporting the process missing on staging-opsi (consequently a couple of windows machines were stuck at the screensaver). Restarted it manually for now.
Comment on attachment 423987 [details] [diff] [review] opsi service restarter, v2
>diff --git a/restart-opsiconfd.sh b/restart-opsiconfd.sh
>+is_running() {
>+    # Returns 0 when opsi is running, 1 when it is not
>+    ps auxwww | grep opsiconfd | grep -qv grep
This should be
    ps auxwww | grep -vE '(grep|restart)' | grep -q opsiconfd
otherwise we get the uninteresting exit status from the 'grep -v grep', and the script name fools us into thinking opsiconfd is still running. Updated the script on staging-opsi.
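The difference between the two pipelines is easier to see on a fixed process listing. In the sketch below both checks read the listing from stdin instead of calling ps, so they can be compared on the same input. The function names and the sample listing are made up for illustration; the two pipelines are the ones quoted in the comment above.

```shell
#!/bin/bash
# Both functions read a process listing (such as `ps auxwww` output) on stdin.

# Check from the v2 patch: the exit status comes from the last grep, which
# succeeds for any opsiconfd line that does not contain "grep".
is_running_v2() {
    grep opsiconfd | grep -qv grep
}

# Corrected check: drop grep and restart-script lines first, then look
# for opsiconfd in whatever is left.
is_running_fixed() {
    grep -vE '(grep|restart)' | grep -q opsiconfd
}

# Illustrative listing: the daemon is down, only the restart script and
# the grep itself mention opsiconfd.
listing='root 6650 0.0 0.1 2400 1200 ? S 02:00 0:00 bash /home/cltbld/opsi-package-sources/restart-opsiconfd.sh
root 6660 0.0 0.0 1800 600 ? S 02:00 0:00 grep opsiconfd'

if printf '%s\n' "$listing" | is_running_v2; then
    echo "v2: running"
else
    echo "v2: not running"
fi
if printf '%s\n' "$listing" | is_running_fixed; then
    echo "fixed: running"
else
    echo "fixed: not running"
fi
```

On this listing the v2 check reports "running" because the restart script's own line matches, while the corrected check reports "not running".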
Nick - thanks for that fix. I *think* things are working now, it restarted overnight in staging just fine. I'll let it run over the weekend though...
Comment on attachment 423987 [details] [diff] [review] opsi service restarter, v2 changeset: 36:96b99da5c40f
Attachment #423987 - Flags: checked-in+
Alright, this latest version worked over the weekend so I'm going to put it on production and land it. It's still just a workaround, but as long as it continues to restart the service every night we shouldn't see any more issues with slaves getting hung up.
I've updated the staging crontab as follows:
MAILTO=release@mozilla.com
# m h dom mon dow command
*/10 * * * * rsync -av /N/*ref* /var/lib/opsi/config/clients/ &>> /root/config-rsync.log
*/5 * * * * cd /home/cltbld/opsi-package-sources && /usr/bin/python look-for-new-slaves.py -f staging-slaves &>> /root/new-slaves.log
0 2 * * * bash -l -c /home/cltbld/opsi-package-sources/restart-opsiconfd.sh &>> /root/restart-opsi.log || echo "Failed to restart OPSI."
And the production one like so:
MAILTO=release@mozilla.com
# m h dom mon dow command
*/5 * * * * cd /home/cltbld/opsi-package-sources && /usr/bin/python look-for-new-slaves.py -f production-slaves &>>/root/new-slaves.log
0 2 * * * bash -l -c /home/cltbld/opsi-package-sources/restart-opsiconfd.sh &>> /root/restart-opsi.log || echo "Failed to restart OPSI."
Worked fine on staging and production overnight. While this bug doesn't fix the root cause of the problem, I'm going to declare this FIXED since we have a workaround now.
Status: REOPENED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Blocks: 652391
Product: mozilla.org → Release Engineering