Closed
Bug 522078
Opened 16 years ago
Closed 16 years ago
Windows 2003 build slaves and talos XP sometimes get stuck at the login dialog
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
References
Details
Attachments
(3 files, 1 obsolete file)
2.23 KB, patch | coop: review+ | bhearsum: checked-in+
2.76 KB, patch | coop: review+ | bhearsum: checked-in+
1.94 KB, patch | nthomas: review+ | bhearsum: checked-in+
We've seen this a few times before, and over the weekend nearly all of the slaves got hung in this manner. The fix is to connect and hit ctrl-alt-delete, and then things continue on their merry way.
This seems to be related to load on the OPSI server, as outlined in https://bugzilla.mozilla.org/show_bug.cgi?id=521722#c7
I've asked the OPSI folks for some help here: https://forum.opsi.org/viewtopic.php?f=8&t=994. I'll be looking into this in the meantime, though.
| Assignee | ||
Comment 1•16 years ago
|
||
One possible solution here may be to simply disable the need to hit ctrl-alt-del. I'll look into how to do this, and give it a try. This would be helpful for the Talos XP machines, too.
Comment 2•16 years ago
|
||
We've been rebooting talos machines after each test run without hitting login prompt before.
Is this a new change-in-behaviour since opsi rollout?
| Assignee | ||
Comment 3•16 years ago
|
||
(In reply to comment #2)
> We've been rebooting talos machines after each test run without hitting login
> prompt before.
>
> Is this a new change-in-behaviour since opsi rollout?
No, OPSI didn't change the ctrl-alt-del behaviour. We also didn't hit these problems pre-OPSI, though.
| Assignee | ||
Comment 4•16 years ago
|
||
Just finished looking through the logs on a couple of slaves. There are clear gaps when the slave was missing, but nothing indicating why it failed to log in.
| Assignee | ||
Comment 5•16 years ago
|
||
Since I can't reproduce the problem I don't know if this fixes it, but it _seems_ like it should.
We can use this on the XP talos machines and win2k3 build machines. I don't think it's applicable to the Talos Vista problems.
Attachment #406084 -
Flags: review?(ccooper)
Updated•16 years ago
|
Attachment #406084 -
Flags: review?(ccooper) → review+
| Assignee | ||
Comment 6•16 years ago
|
||
Comment on attachment 406084 [details] [diff] [review]
opsi package to disable CAD at the login screen
changeset: 23:0434d3f09a56
Attachment #406084 -
Flags: checked-in+
| Assignee | ||
Comment 7•16 years ago
|
||
Comment on attachment 406084 [details] [diff] [review]
opsi package to disable CAD at the login screen
I've set this package to install on all of our win32 build machines. I haven't done any testing on Talos XP yet, so I'll hold off on that for now.
| Assignee | ||
Comment 8•16 years ago
|
||
The package is installing fine on all of the machines, though it's hard to know whether it has helped. I've got an OPSI server and ref vm locally and I'm going to try to reproduce the failure with them in the meantime.
| Assignee | ||
Updated•16 years ago
|
Summary: Windows 2003 build slaves sometimes get stuck at the login dialog → Windows 2003 build slaves and talos XP sometimes get stuck at the login dialog
| Assignee | ||
Comment 9•16 years ago
|
||
I tested out the disablecad package on a staging XP slave today and it deployed without issue. I gave it a few reboots and everything seems to be fine. I'll have it deploy on the production slaves and see how that goes.
| Assignee | ||
Comment 10•16 years ago
|
||
Haven't seen any more of these since we rolled out the disablecad package. Let's let them run over the weekend, though.
| Assignee | ||
Comment 11•16 years ago
|
||
Still haven't seen a re-occurrence of this. I'm declaring it FIXED.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
| Assignee | ||
Comment 12•16 years ago
|
||
Had a bunch of build and other slaves hit this over the weekend. Definitely seems related to load, as there was a spike early this morning.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Assignee | ||
Comment 13•16 years ago
|
||
Happened again today. Tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=527229. OPSI server was chewing all of the RAM until the opsiconfd service was restarted.
| Assignee | ||
Comment 14•16 years ago
|
||
I added the following entry to root's crontab. Restarting the service will hopefully let us avoid this issue.
@weekly /etc/init.d/opsiconfd restart
Status: REOPENED → ASSIGNED
| Assignee | ||
Comment 15•16 years ago
|
||
I've seen a few talos xp slaves stuck at the screensaver again recently. Haven't seen a highly loaded OPSI server to go along with it, though.
Comment 16•16 years ago
|
||
(In reply to comment #14)
> I added the following entry to root's crontab. Restarting the service will
> hopefully let us avoid this issue.
> @weekly /etc/init.d/opsiconfd restart
... which apparently means midnight on Sunday.
production-opsi# date
Mon Nov 30 04:35:11 CET 2009
--> We're after the cron trigger time. (Plus the clock is set to Central European time, and is 15 minutes slow).
# ps aux | grep -E '(USER|opsi)'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
993 28144 1.1 91.3 2171692 1896572 ? Sl Nov16 217:13 /usr/bin/python /usr/sbin/opsiconfd -D
# free -m
total used free shared buffers cached
Mem: 2028 1952 75 0 6 26
-/+ buffers/cache: 1919 108
Swap: 1019 593 426
--> opsiconfd isn't getting restarted by cron.
# /etc/init.d/opsiconfd restart
Stopping opsi config service... (done).
Starting opsi config service.... (done).
# ps aux | grep -E '(USER|opsi)'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
993 28144 1.1 91.5 2141000 1900860 ? Rl Nov16 217:15 /usr/bin/python /usr/sbin/opsiconfd -D
993 6700 2.6 0.5 15348 10812 ? S 04:32 0:00 /usr/bin/python /usr/sbin/opsiconfd -D
Ended up stopping opsiconfd, killing process 28144, and starting it again.
| Assignee | ||
Comment 17•16 years ago
|
||
Thanks for noticing that Nick, I'll try and fix it somehow...
| Assignee | ||
Comment 18•16 years ago
|
||
I ended up hitting this issue as part of bug 429418 and trying a few other things to fix it. Here's what I found:
OPSI has a configuration parameter called 'SecsUntilConnectionTimeout', which is set to 180 by default. When I changed this to 10, the OPSI dialog went away quicker and the automatic login succeeded. My theory is that the screen saver starting is what's breaking the login, not OPSI. By timing out sooner we prevent that from happening.
This parameter lives in the registry, here:
HKLM\Software\opsi.org\pcptch
It should be a simple OPSI package to roll this out.
| Assignee | ||
Comment 19•16 years ago
|
||
This is an OPSI package that will lower the timeout of the OPSI connect dialog from 3 minutes to 30 seconds, which will prevent the screensaver from coming on and screwing up the automatic login.
Tested in staging and the installation and uninstallation worked fine.
Attachment #415431 -
Flags: review?(ccooper)
Updated•16 years ago
|
Attachment #415431 -
Flags: review?(ccooper) → review+
| Assignee | ||
Comment 20•16 years ago
|
||
Comment on attachment 415431 [details] [diff] [review]
opsi package to lower the opsi timeout
changeset: 25:02f055f2f155
Attachment #415431 -
Flags: checked-in+
| Assignee | ||
Comment 21•16 years ago
|
||
The timeout lowering package is set to install on all of our build and XP talos slaves.
| Assignee | ||
Comment 22•16 years ago
|
||
The rollout went fine, I think this is fixed for real this time....
Status: ASSIGNED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Comment 23•16 years ago
|
||
(In reply to comment #17)
> Thanks for noticing that Nick, I'll try and fix it somehow...
How'd you go with fixing up the cron job ?
| Assignee | ||
Comment 24•16 years ago
|
||
(In reply to comment #23)
> (In reply to comment #17)
> > Thanks for noticing that Nick, I'll try and fix it somehow...
>
> How'd you go with fixing up the cron job ?
I forgot to =\
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 25•16 years ago
|
||
I had to restart opsiconfd today because windows slaves were getting stuck at the screensaver (a simple vnc connection & mouse waggle fixes that up). Like comment 16 I had to kill the process because it didn't respond to a stop. Perhaps we should switch it from @weekly to every 2-3 days ? Also, "/etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd" seems to be missing a trailing "start".
| Assignee | ||
Comment 27•16 years ago
|
||
(In reply to comment #25)
> I had to restart opsiconfd today because windows slaves were getting stuck at
> the screensaver (a simple vnc connection & mouse waggle fixes that up). Like
> comment 16 I had to kill the process because it didn't respond to a stop.
> Perhaps we should switch it from @weekly to every 2-3 days ? Also,
> "/etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd" seems to be
> missing a trailing "start".
I changed the cronjobs to be:
0 2 * * 0,2,4,6 /etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd start
Which should restart the opsiconfd service at 2am every Sunday, Tuesday, Thursday, and Saturday. I'll check in on it tomorrow to confirm that.
| Assignee | ||
Comment 28•16 years ago
|
||
(In reply to comment #25)
> I had to restart opsiconfd today because windows slaves were getting stuck at
> the screensaver (a simple vnc connection & mouse waggle fixes that up). Like
> comment 16 I had to kill the process because it didn't respond to a stop.
Oh, and I've had that problem too where the service doesn't respond to 'stop'. It gets that way when it hits the memory leak. Under "normal" circumstances it responds to signals sent from the init script.
| Assignee | ||
Comment 29•16 years ago
|
||
production-opsi managed to restart the opsi process this morning, but staging-opsi didn't. I think it's because staging-opsi had an opsiconfd that was alive for a couple of weeks. Going to leave this open until Monday to see if it works on both machines then.
| Assignee | ||
Comment 30•16 years ago
|
||
(In reply to comment #29)
> production-opsi managed to restart the opsi process this morning, but
> staging-opsi didn't. I think it's because staging-opsi had an opsiconfd that
> was alive for a couple of weeks. Going to leave this open until Monday to see
> if it works on both machines then.
Didn't work either. production-opsi had a week old process. Going to try doing it nightly - if that doesn't work, I'll try a different approach.
| Assignee | ||
Comment 31•16 years ago
|
||
Restarting daily is working fine for production-opsi, but not for staging-opsi. The staging machine is a lot slower, so I'm going to try bumping the 'sleep' between stop and start.
Comment 32•16 years ago
|
||
Actually, we had a couple of production build slaves get stuck on the screensaver yesterday so I manually killed opsiconfd and restarted it. Perhaps we could change the cronjob to kill the process if it hasn't exited after the 15 second sleep ?
| Assignee | ||
Comment 33•16 years ago
|
||
(In reply to comment #32)
> Actually, we had a couple of production build slaves get stuck on the
> screensaver yesterday so I manually killed opsiconfd and restarted it. Perhaps
> we could change the cronjob to kill the process if it hasn't exited after the
> 15 second sleep ?
Yeah....I'll write a script that actually does this properly.
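A minimal sketch of the "stop, wait, then kill if it hasn't exited" idea from comment 32 follows. It is not the attached restart-opsiconfd.sh; the helper name `stop_or_kill` and its arguments are hypothetical, and it works on a PID rather than on a process name.

```shell
# Hypothetical helper: ask a process to exit, then force it if it lingers.
# $1 = PID, $2 = seconds to wait before escalating (defaults to 15).
stop_or_kill() {
    pid=$1
    remaining=${2:-15}
    # If the signal cannot be sent (process already gone, or not ours), stop here.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$remaining" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # exited on its own
        sleep 1
        remaining=$((remaining - 1))
    done
    # Still alive after the grace period: SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

In a restart script this could be followed by `/etc/init.d/opsiconfd start`, with the PID taken from the daemon's PID file. Killing by PID avoids having to match on the process name at all.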
| Assignee | ||
Comment 34•16 years ago
|
||
Going to try this out on staging-opsi for a couple of days.
| Assignee | ||
Comment 35•16 years ago
|
||
This script ended up killing opsiconfd overnight, and not restarting it.
| Assignee | ||
Comment 36•16 years ago
|
||
Strangely, the script works perfectly fine when run at the command line. I'll have to investigate more, later.
| Assignee | ||
Comment 37•16 years ago
|
||
(In reply to comment #36)
> Strangely, the script works perfectly fine when run at the command line. I'll
> have to investigate more, later.
Looks like the opsi init script doesn't work properly when run through cron with just 'bash restart-opsiconfd.sh'. Using 'bash -l -c ....' works though. I'm going to run it like this over the weekend on staging-opsi.
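Cron runs jobs with a much smaller environment than an interactive login, which is a common reason for "works at the command line, fails in cron". The bug does not say which setting the opsi init script was missing, so the sketch below only shows a way to reproduce that kind of difference by hand. The helper name `run_like_cron` and the PATH value are assumptions, not something taken from the OPSI servers.

```shell
# Hypothetical helper: run a command with a stripped-down environment, roughly
# like the one cron provides, so cron-only failures can be reproduced by hand.
# The PATH below is an assumed value; check your cron implementation's default.
run_like_cron() {
    env -i HOME="${HOME:-/}" SHELL=/bin/sh PATH=/usr/bin:/bin "$@"
}
```

For example, `run_like_cron bash restart-opsiconfd.sh` could be compared with `run_like_cron bash -l -c ./restart-opsiconfd.sh`. The `-l` flag makes bash act as a login shell and read the profile files, which restores whatever those files export.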
| Assignee | ||
Comment 38•16 years ago
|
||
(In reply to comment #37)
> (In reply to comment #36)
> > Strangely, the script works perfectly fine when run at the command line. I'll
> > have to investigate more, later.
>
> Looks like the opsi init script doesn't work properly when run through cron
> with just 'bash restart-opsiconfd.sh'. Using 'bash -l -c ....' works though.
> I'm going to run it like this over the weekend on staging-opsi.
Looks like this worked over the weekend.
| Assignee | ||
Updated•16 years ago
|
Attachment #422548 -
Flags: review?(nrthomas)
Comment 39•16 years ago
|
||
Comment on attachment 422548 [details]
opsi service restarter cronjob
>is_running() {
> PID=`cat /var/run/opsiconfd/opsiconfd.pid 2>/dev/null`
> if [ $PID ]; then
> return 1
> else
> return 0
> fi
>}
Does the pid file get emptied on SIGTERM and SIGKILL ? Perhaps we should be testing for an opsiconfd process.
Otherwise it looks fine for shell. ;-)
| Assignee | ||
Comment 40•16 years ago
|
||
(In reply to comment #39)
> (From update of attachment 422548 [details])
> >is_running() {
> > PID=`cat /var/run/opsiconfd/opsiconfd.pid 2>/dev/null`
> > if [ $PID ]; then
> > return 1
> > else
> > return 0
> > fi
> >}
>
> Does the pid file get emptied on SIGTERM and SIGKILL ? Perhaps we should be
> testing for an opsiconfd process.
Good catch...the PID file is cleaned up after SIGTERM, but not after SIGKILL. I'll fix is_running.
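Since the PID file survives a SIGKILL, one option is to trust it only when the PID it names is still alive. The sketch below is an illustration of that idea and not the fix that landed; the function name `pidfile_is_live` is hypothetical, and it follows the usual shell convention of returning 0 for "running".

```shell
# Hypothetical variant of is_running: trust the PID file only if the PID it
# names still belongs to a live process. Returns 0 when running, 1 otherwise.
pidfile_is_live() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    # Signal 0 delivers nothing; it only checks that the process can be signalled.
    kill -0 "$pid" 2>/dev/null
}
```

Two caveats: `kill -0` also fails for a live process owned by another user unless the caller is root, and a stale PID can in principle be reused by an unrelated process, so this check is weaker than looking for an actual opsiconfd process.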
| Assignee | ||
Comment 41•16 years ago
|
||
This seems to work. I've got it set up to run on staging.
Attachment #422548 -
Attachment is obsolete: true
Attachment #423987 -
Flags: review?(nrthomas)
Attachment #422548 -
Flags: review?(nrthomas)
Comment 42•16 years ago
|
||
Comment on attachment 423987 [details] [diff] [review]
opsi service restarter, v2
r+. Is the cronmail set to let us know if it fails ?
Attachment #423987 -
Flags: review?(nrthomas) → review+
Comment 43•16 years ago
|
||
We got a nagios check for the opsiconfd process set up a couple of days ago, and it's reporting it missing on staging-opsi (consequently a couple of windows machines were stuck at the screensaver). Restarted it manually for now.
Comment 44•16 years ago
|
||
Comment on attachment 423987 [details] [diff] [review]
opsi service restarter, v2
>diff --git a/restart-opsiconfd.sh b/restart-opsiconfd.sh
>+is_running() {
>+ # Returns 0 when opsi is running, 1 when it is not
>+ ps auxwww | grep opsiconfd | grep -qv grep
This should be
ps auxwww | grep -vE '(grep|restart)' | grep -q opsiconfd
otherwise we get the uninteresting exit status from the 'grep -v grep', and the script name fools us into thinking opsiconfd is still running. Updated the script on staging-opsi.
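A pipeline returns the exit status of its last command (unless `pipefail` is set), so the final `grep -q opsiconfd` is what decides the result here. The sketch below wraps the corrected pipeline in a function that reads a process listing on stdin, so it can be tried on canned input; the function name `listing_has_opsiconfd` is hypothetical.

```shell
# Hypothetical wrapper around the corrected pipeline. Reads a process listing
# on stdin and succeeds (0) only if some line mentions opsiconfd and is
# neither a grep command nor the restart script itself.
listing_has_opsiconfd() {
    grep -vE '(grep|restart)' | grep -q opsiconfd
}
```

It would be used as `ps auxwww | listing_has_opsiconfd`. A listing that contains only `bash restart-opsiconfd.sh` and `grep opsiconfd` is reported as not running, while a line such as `/usr/bin/python /usr/sbin/opsiconfd -D` is reported as running.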
| Assignee | ||
Comment 45•16 years ago
|
||
Nick - thanks for that fix. I *think* things are working now, it restarted overnight in staging just fine. I'll let it run over the weekend though...
| Assignee | ||
Comment 46•16 years ago
|
||
Comment on attachment 423987 [details] [diff] [review]
opsi service restarter, v2
changeset: 36:96b99da5c40f
Attachment #423987 -
Flags: checked-in+
| Assignee | ||
Comment 47•16 years ago
|
||
Alright, this latest version worked over the weekend so I'm going to put it on production and land it. It's still just a workaround, but as long as it continues to restart the service every night we shouldn't see any more issues with slaves getting hung up. I've updated the staging crontab as follows:
MAILTO=release@mozilla.com
# m h dom mon dow command
*/10 * * * * rsync -av /N/*ref* /var/lib/opsi/config/clients/ &>> /root/config-rsync.log
*/5 * * * * cd /home/cltbld/opsi-package-sources && /usr/bin/python look-for-new-slaves.py -f staging-slaves &>> /root/new-slaves.log
0 2 * * * bash -l -c /home/cltbld/opsi-package-sources/restart-opsiconfd.sh &>> /root/restart-opsi.log || echo "Failed to restart OPSI."
And the production one like so:
MAILTO=release@mozilla.com
# m h dom mon dow command
*/5 * * * * cd /home/cltbld/opsi-package-sources && /usr/bin/python look-for-new-slaves.py -f production-slaves &>>/root/new-slaves.log
0 2 * * * bash -l -c /home/cltbld/opsi-package-sources/restart-opsiconfd.sh &>> /root/restart-opsi.log || echo "Failed to restart OPSI."
| Assignee | ||
Comment 48•16 years ago
|
||
Worked fine on staging and production overnight. While this bug doesn't fix the root cause of the problem, I'm going to declare this FIXED since we have a workaround now.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Updated•12 years ago
|
Product: mozilla.org → Release Engineering