Closed Bug 522078 Opened 16 years ago Closed 16 years ago

Windows 2003 build slaves and talos XP sometimes get stuck at the login dialog

Categories

(Release Engineering :: General, defect)

Product:

Component:

Platform:

x86

macOS

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(3 files, 1 obsolete file)

opsi package to disable CAD at the login screen (patch, 2.23 KB), bhearsum@mozilla.com (:bhearsum), 16 years ago; coop: review+, bhearsum: checked-in+
opsi package to lower the opsi timeout (patch, 2.76 KB), bhearsum@mozilla.com (:bhearsum), 16 years ago; coop: review+, bhearsum: checked-in+
opsi service restarter cronjob (text/plain, 1.66 KB, obsolete), bhearsum@mozilla.com (:bhearsum), 16 years ago
opsi service restarter, v2 (patch, 1.94 KB), bhearsum@mozilla.com (:bhearsum), 16 years ago; nthomas: review+, bhearsum: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Description

•

16 years ago

We've seen this a few times before, and over the weekend nearly all of them hung in this manner. The fix is to connect and hit ctrl-alt-delete, and then things continue on their merry way. This seems to be related to load on the OPSI server, as outlined in https://bugzilla.mozilla.org/show_bug.cgi?id=521722#c7

I've asked the OPSI folks for some help here: https://forum.opsi.org/viewtopic.php?f=8&t=994

I'll be looking into this in the meantime, though.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 1

•

16 years ago

One possible solution here may be to simply disable the need to hit ctrl-alt-del. I'll look into how to do this, and give it a try. This would be helpful for the Talos XP machines, too.
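
For reference, on Windows XP and Server 2003 the Ctrl+Alt+Del requirement at the logon screen is controlled by the DisableCAD value under the Winlogon registry key. The sketch below only prints a .reg fragment that sets it. It is not the contents of the opsi package attached to this bug, and the function name is made up for illustration. On domain-joined machines a group policy may override the value.

```shell
# Hypothetical helper: print a .reg fragment that turns off the
# Ctrl+Alt+Del prompt at the Windows logon screen.
# DisableCAD=1 under the Winlogon key is the standard XP/2003 setting;
# how the attached opsi package applies it is not shown in this bug.
print_disable_cad_reg() {
    cat <<'EOF'
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon]
"DisableCAD"=dword:00000001
EOF
}
```

The output can be redirected to a file (for example `print_disable_cad_reg > disablecad.reg`) and imported on a test machine before rolling it out more widely.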

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 2

•

16 years ago

We've been rebooting talos machines after each test run without hitting login prompt before. Is this a new change-in-behaviour since opsi rollout?

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 3

•

16 years ago

(In reply to comment #2)
> We've been rebooting talos machines after each test run without hitting login
> prompt before.
>
> Is this a new change-in-behaviour since opsi rollout?

No, OPSI didn't change the ctrl-alt-del behaviour. We also didn't hit these problems pre-OPSI, though.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 4

•

16 years ago

Just finished looking through the logs on a couple of slaves. There are clear gaps when the slave was missing, but nothing indicating why it failed to log in.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 5

•

16 years ago

Attached patch: opsi package to disable CAD at the login screen

Since I can't reproduce the problem I don't know if this fixes it, but it _seems_ like it should. We can use this on the XP talos machines and win2k3 build machines. I don't think it's applicable to the Talos Vista problems.

Attachment #406084 - Flags: review?(ccooper)

Chris Cooper [:coop] (he/him)

Updated

•

16 years ago

Attachment #406084 - Flags: review?(ccooper) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 6

•

16 years ago

Comment on attachment 406084
opsi package to disable CAD at the login screen

changeset: 23:0434d3f09a56

Attachment #406084 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 7

•

16 years ago

Comment on attachment 406084
opsi package to disable CAD at the login screen

I've set this package to install on all of our win32 build machines. I haven't done any testing on Talos XP yet, so I'll hold off on that for now.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 8

•

16 years ago

The package is installing fine on all of the machines, but it's hard to know whether this has helped. I've got an OPSI server and ref vm locally and I'm going to try to reproduce the failure with them in the meantime.

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

16 years ago

Summary: Windows 2003 build slaves sometimes get stuck at the login dialog → Windows 2003 build slaves and talos XP sometimes get stuck at the login dialog

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 9

•

16 years ago

I tested out the disablecad package on a staging XP slave today and it deployed without issue. I gave it a few reboots and everything seems to be fine. I'll have it deploy on the production slaves and see how that goes.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 10

•

16 years ago

Haven't seen any more of these since we rolled out the disablecad package. Let's let them run over the weekend, though.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 11

•

16 years ago

Still haven't seen a re-occurrence of this. I'm declaring it FIXED.

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 12

•

16 years ago

Had a bunch of build and other slaves hit this over the weekend. Definitely seems related to load, as there was a spike early this morning.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 13

•

16 years ago

Happened again today. Tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=527229. OPSI server was chewing all of the RAM until the opsiconfd service was restarted.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 14

•

16 years ago

I added the following entry to root's crontab. Restarting the service will hopefully let us avoid this issue.
@weekly /etc/init.d/opsiconfd restart
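
For context, `@weekly` is a cron nickname, not a free-form interval: in Vixie-style cron it stands for 00:00 every Sunday in the server's local time zone. A small sketch that spells out the nicknames listed in crontab(5); the function name is illustrative only.

```shell
# Print the five-field schedule that a cron nickname stands for,
# following the table in crontab(5) for Vixie-style cron.
expand_cron_nickname() {
    case "$1" in
        @yearly|@annually) echo "0 0 1 1 *" ;;
        @monthly)          echo "0 0 1 * *" ;;
        @weekly)           echo "0 0 * * 0" ;;
        @daily|@midnight)  echo "0 0 * * *" ;;
        @hourly)           echo "0 * * * *" ;;
        @reboot)
            echo "@reboot has no five-field equivalent" >&2
            return 1 ;;
        *)
            echo "unknown nickname: $1" >&2
            return 1 ;;
    esac
}
```

Calling `expand_cron_nickname @weekly` prints `0 0 * * 0`, which makes the trigger time easy to compare against `date` on the server.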

Status: REOPENED → ASSIGNED

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

16 years ago

See Also: → 517862

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 15

•

16 years ago

I've seen a few talos xp slaves stuck at the screensaver again recently. Haven't seen a highly loaded OPSI server to go along with it, though.

Nick Thomas [:nthomas] (UTC+12)

Comment 16

•

16 years ago

(In reply to comment #14)
> I added the following entry to root's crontab. Restarting the service will
> hopefully let us avoid this issue.
> @weekly /etc/init.d/opsiconfd restart

... which apparently means midnight on Sunday.

production-opsi# date
Mon Nov 30 04:35:11 CET 2009

--> We're after the cron trigger time. (Plus the clock is set to Central European time, and is 15 minutes slow).

# ps aux | grep -E '(USER|opsi)'
USER       PID %CPU %MEM    VSZ     RSS TTY STAT START   TIME COMMAND
993      28144  1.1 91.3 2171692 1896572 ?   Sl   Nov16 217:13 /usr/bin/python /usr/sbin/opsiconfd -D
# free -m
             total  used  free  shared  buffers  cached
Mem:          2028  1952    75       0        6      26
-/+ buffers/cache:  1919   108
Swap:         1019   593   426

--> opsiconfd isn't getting restarted by cron.

# /etc/init.d/opsiconfd restart
Stopping opsi config service... (done).
Starting opsi config service.... (done).
# ps aux | grep -E '(USER|opsi)'
USER       PID %CPU %MEM    VSZ     RSS TTY STAT START   TIME COMMAND
993      28144  1.1 91.5 2141000 1900860 ?   Rl   Nov16 217:15 /usr/bin/python /usr/sbin/opsiconfd -D
993       6700  2.6  0.5   15348   10812 ?   S    04:32   0:00 /usr/bin/python /usr/sbin/opsiconfd -D

Ended up stopping opsiconfd, killing process 28144, and starting it again.
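
The check done by eye in this comment (an opsiconfd process holding over 90% of memory) could be scripted roughly as below. This is a sketch, not something that was deployed: the function name and the threshold are assumptions, and the function reads a `ps aux` listing on stdin so it can be tried against saved output.

```shell
# Read `ps aux` output on stdin and print the PID of every opsiconfd
# process whose %MEM (column 4) is above the threshold given as $1.
# The [o] in the pattern keeps awk's own command line from matching
# when the input comes from a live `ps aux`.
find_bloated_opsiconfd() {
    awk -v limit="$1" '/[o]psiconfd/ && ($4 + 0) > (limit + 0) { print $2 }'
}
```

Typical use would be `ps aux | find_bloated_opsiconfd 50`. Against the second listing above it reports only the old process, 28144, and not the freshly started one.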

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 17

•

16 years ago

Thanks for noticing that Nick, I'll try and fix it somehow...

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 18

•

16 years ago

I ended up hitting this issue as part of bug 429418 and trying a few other things to fix it. Here's what I found:

OPSI has a configuration parameter called 'SecsUntilConnectionTimeout', which is set to 180 by default. When I changed this to 10, the OPSI dialog went away quicker and the automatic login succeeded. My theory is that the screen saver starting is what's breaking the login, not OPSI. By timing out sooner we prevent that from happening.

This parameter lives in the registry, here:
HKLM\Software\opsi.org\pcptch

It should be a simple OPSI package to roll this out.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 19

•

16 years ago

Attached patch: opsi package to lower the opsi timeout

This is an OPSI package that will lower the timeout of the OPSI connect dialog from 3 minutes to 30 seconds, which will prevent the screensaver from coming on and screwing up the automatic login. Tested in staging and the installation and uninstallation worked fine.

Attachment #415431 - Flags: review?(ccooper)

Chris Cooper [:coop] (he/him)

Updated

•

16 years ago

Attachment #415431 - Flags: review?(ccooper) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 20

•

16 years ago

Comment on attachment 415431
opsi package to lower the opsi timeout

changeset: 25:02f055f2f155

Attachment #415431 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 21

•

16 years ago

The timeout lowering package is set to install on all of our build and XP talos slaves.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 22

•

16 years ago

The rollout went fine, I think this is fixed for real this time....

Status: ASSIGNED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED

Nick Thomas [:nthomas] (UTC+12)

Comment 23

•

16 years ago

(In reply to comment #17)
> Thanks for noticing that Nick, I'll try and fix it somehow...

How'd you go with fixing up the cron job ?

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 24

•

16 years ago

(In reply to comment #23)
> (In reply to comment #17)
> > Thanks for noticing that Nick, I'll try and fix it somehow...
>
> How'd you go with fixing up the cron job ?

I forgot to =\

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Nick Thomas [:nthomas] (UTC+12)

Comment 25

•

16 years ago

I had to restart opsiconfd today because windows slaves were getting stuck at the screensaver (a simple vnc connection & mouse waggle fixes that up). Like comment 16 I had to kill the process because it didn't respond to a stop. Perhaps we should switch it from @weekly to every 2-3 days ? Also, "/etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd" seems to be missing a trailing "start".

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 27

•

16 years ago

(In reply to comment #25)
> I had to restart opsiconfd today because windows slaves were getting stuck at
> the screensaver (a simple vnc connection & mouse waggle fixes that up). Like
> comment 16 I had to kill the process because it didn't respond to a stop.
> Perhaps we should switch it from @weekly to every 2-3 days ? Also,
> "/etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd" seems to be
> missing a trailing "start".

I changed the cronjobs to be:
0 2 * * 0,2,4,6 /etc/init.d/opsiconfd stop && sleep 15 && /etc/init.d/opsiconfd start

Which should have us restarting the opsiconfd service at 2am every Sunday, Tuesday, Thursday, and Saturday. I'll check in on it tomorrow to confirm that.
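
The fifth crontab field is the day of week, with 0 (or 7) meaning Sunday. A small sketch for double-checking a list such as `0,2,4,6`; the function name is illustrative, and ranges or steps are not handled.

```shell
# Translate a comma-separated crontab day-of-week list into day names.
# 0 and 7 both mean Sunday. Anything other than digits and commas
# (ranges like 1-5, steps like */2, or *) is rejected in this sketch.
cron_dow_names() {
    case "$1" in
        ''|*[!0-9,]*)
            echo "unsupported day-of-week field: $1" >&2
            return 1 ;;
    esac
    names=""
    for d in $(echo "$1" | tr ',' ' '); do
        case "$d" in
            0|7) n=Sun ;;
            1) n=Mon ;;
            2) n=Tue ;;
            3) n=Wed ;;
            4) n=Thu ;;
            5) n=Fri ;;
            6) n=Sat ;;
            *)
                echo "unsupported value: $d" >&2
                return 1 ;;
        esac
        names="${names:+$names }$n"
    done
    echo "$names"
}
```

`cron_dow_names 0,2,4,6` prints `Sun Tue Thu Sat`, matching the schedule described in this comment.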

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 28

•

16 years ago

(In reply to comment #25)
> I had to restart opsiconfd today because windows slaves were getting stuck at
> the screensaver (a simple vnc connection & mouse waggle fixes that up). Like
> comment 16 I had to kill the process because it didn't respond to a stop.

Oh, and I've had that problem too where the service doesn't respond to 'stop'. It gets that way when it hits the memory leak. Under "normal" circumstances it responds to signals sent from the init script.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 29

•

16 years ago

production-opsi managed to restart the opsi process this morning, but staging-opsi didn't. I think it's because staging-opsi had an opsiconfd that was alive for a couple of weeks. Going to leave this open until Monday to see if it works on both machines then.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 30

•

16 years ago

(In reply to comment #29)
> production-opsi managed to restart the opsi process this morning, but
> staging-opsi didn't. I think it's because staging-opsi had an opsiconfd that
> was alive for a couple of weeks. Going to leave this open until Monday to see
> if it works on both machines then.

Didn't work either. production-opsi had a week old process. Going to try doing it nightly - if that doesn't work, I'll try a different approach.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 31

•

16 years ago

Restarting daily is working fine for production-opsi, but not for staging-opsi. The staging server is a lot slower, so I'm going to try bumping the 'sleep' between stop and start.

Nick Thomas [:nthomas] (UTC+12)

Comment 32

•

16 years ago

Actually, we had a couple of production build slaves get stuck on the screensaver yesterday so I manually killed opsiconfd and restarted it. Perhaps we could change the cronjob to kill the process if it hasn't exited after the 15 second sleep ?

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 33

•

16 years ago

(In reply to comment #32)
> Actually, we had a couple of production build slaves get stuck on the
> screensaver yesterday so I manually killed opsiconfd and restarted it. Perhaps
> we could change the cronjob to kill the process if it hasn't exited after the
> 15 second sleep ?

Yeah....I'll write a script that actually does this properly.
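
A minimal sketch of the stop, wait, kill, start sequence suggested in comment 32. This is not the script later attached to this bug (its full text is not in the thread). The init script path is the one used in the crontab entries above; `WAIT_SECS`, `force_kill`, and the `opsiconfd -D` process pattern are assumptions for illustration.

```shell
# Sketch of a stop / wait / kill / start sequence for opsiconfd.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/opsiconfd}"
WAIT_SECS="${WAIT_SECS:-15}"   # grace period after "stop", in seconds

is_running() {
    # The [o] keeps this grep from matching its own command line.
    ps auxwww | grep -q '[o]psiconfd -D'
}

force_kill() {
    # pkill -f matches the pattern against the full command line.
    pkill -9 -f 'opsiconfd -D'
}

restart_opsiconfd() {
    "$INIT_SCRIPT" stop
    waited=0
    # Poll once a second instead of sleeping for a fixed period.
    while is_running && [ "$waited" -lt "$WAIT_SECS" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if is_running; then
        echo "opsiconfd still running after ${WAIT_SECS}s, killing it" >&2
        force_kill
    fi
    "$INIT_SCRIPT" start
}
```

Calling `restart_opsiconfd` as root runs the whole sequence. Polling means a healthy daemon restarts quickly, while a slow or hung one still gets the full grace period before SIGKILL is sent.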

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 34

•

16 years ago

Attached file: opsi service restarter cronjob (obsolete)

Going to try this out on staging-opsi for a couple of days.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 35

•

16 years ago

This script ended up killing opsiconfd overnight, and not restarting it.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 36

•

16 years ago

Strangely, the script works perfectly fine when run at the command line. I'll have to investigate more, later.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 37

•

16 years ago

(In reply to comment #36)
> Strangely, the script works perfectly fine when run at the command line. I'll
> have to investigate more, later.

Looks like the opsi init script doesn't work properly when run through cron with just 'bash restart-opsiconfd.sh'. Using 'bash -l -c ....' works though. I'm going to run it like this over the weekend on staging-opsi.
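
The thread does not establish why a login shell was needed, but a common reason for "works at the command line, fails under cron" is cron's much smaller environment, which `bash -l` partly restores by reading the login profile files. A sketch for reproducing that kind of environment by hand; the function name is made up, and the PATH and SHELL values are assumptions modelled on Vixie cron's defaults.

```shell
# Run a command string with roughly the environment cron provides:
# nearly empty, /bin/sh as the shell, and a short PATH.
# Exact defaults differ between cron implementations.
run_like_cron() {
    env -i HOME="${HOME:-/}" LOGNAME="${LOGNAME:-$(id -un)}" \
        PATH=/usr/bin:/bin SHELL=/bin/sh \
        /bin/sh -c "$1"
}
```

Comparing `run_like_cron 'bash restart-opsiconfd.sh'` with `run_like_cron 'bash -l -c ./restart-opsiconfd.sh'` on a staging machine would show whether the missing environment is what trips up the init script.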

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 38

•

16 years ago

(In reply to comment #37)
> (In reply to comment #36)
> > Strangely, the script works perfectly fine when run at the command line. I'll
> > have to investigate more, later.
>
> Looks like the opsi init script doesn't work properly when run through cron
> with just 'bash restart-opsiconfd.sh'. Using 'bash -l -c ....' works though.
> I'm going to run it like this over the weekend on staging-opsi.

Looks like this worked over the weekend.

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

16 years ago

Attachment #422548 - Flags: review?(nrthomas)

Nick Thomas [:nthomas] (UTC+12)

Comment 39

•

16 years ago

Comment on attachment 422548
opsi service restarter cronjob

>is_running() {
>    PID=`cat /var/run/opsiconfd/opsiconfd.pid 2>/dev/null`
>    if [ $PID ]; then
>        return 1
>    else
>        return 0
>    fi
>}

Does the pid file get emptied on SIGTERM and SIGKILL ? Perhaps we should be testing for an opsiconfd process.

Otherwise it looks fine for shell. ;-)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 40

•

16 years ago

(In reply to comment #39)
> (From update of attachment 422548)
> >is_running() {
> >    PID=`cat /var/run/opsiconfd/opsiconfd.pid 2>/dev/null`
> >    if [ $PID ]; then
> >        return 1
> >    else
> >        return 0
> >    fi
> >}
>
> Does the pid file get emptied on SIGTERM and SIGKILL ? Perhaps we should be
> testing for an opsiconfd process.

Good catch...the PID file is cleaned up after SIGTERM, but not after SIGKILL. I'll fix is_running.
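
Because the PID file survives a SIGKILL, its mere presence says nothing about whether the daemon is alive. One way to guard against a stale file, sketched here with an illustrative function name, is to check that the recorded PID still refers to a live process. This is not the approach taken in v2 of the script, which greps the process list instead.

```shell
# Succeed only if the PID file names a process that still exists.
# kill -0 sends no signal; it only checks that the PID can be signalled,
# so run this as root or as the user that owns the daemon.
# Caveat: a recycled PID belonging to another program also passes.
pid_file_is_live() {
    pid=$(cat "$1" 2>/dev/null) || return 1
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a number
    esac
    kill -0 "$pid" 2>/dev/null
}
```

For example, `pid_file_is_live /var/run/opsiconfd/opsiconfd.pid || echo "stale or missing PID file"`.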

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 41

•

16 years ago

Attached patch: opsi service restarter, v2

This seems to work. I've got it set up to run on staging.

Attachment #422548 - Attachment is obsolete: true

Attachment #423987 - Flags: review?(nrthomas)

Attachment #422548 - Flags: review?(nrthomas)

Nick Thomas [:nthomas] (UTC+12)

Comment 42

•

16 years ago

Comment on attachment 423987
opsi service restarter, v2

r+. Is the cronmail set to let us know if it fails ?

Attachment #423987 - Flags: review?(nrthomas) → review+

Nick Thomas [:nthomas] (UTC+12)

Comment 43

•

16 years ago

We got a nagios test for the opsiconfd process set up a couple of days ago, and it's reporting the process missing on staging-opsi (consequently a couple of windows machines were stuck at the screensaver). Restarted it manually for now.

Nick Thomas [:nthomas] (UTC+12)

Comment 44

•

16 years ago

Comment on attachment 423987
opsi service restarter, v2

>diff --git a/restart-opsiconfd.sh b/restart-opsiconfd.sh
>+is_running() {
>+    # Returns 0 when opsi is running, 1 when it is not
>+    ps auxwww | grep opsiconfd | grep -qv grep

This should be
    ps auxwww | grep -vE '(grep|restart)' | grep -q opsiconfd
otherwise we get the uninteresting exit status from the 'grep -v grep', and the script name fools us into thinking opsiconfd is still running.

Updated the script on staging-opsi.
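
The difference between the two pipelines is easiest to see when both are run over the same process listing. In the sketch below each variant reads a `ps auxwww` listing on stdin instead of calling ps itself, so the function names and the stdin interface are illustrative, not part of the attached script.

```shell
# As attached in v2: the pipeline's exit status is that of
# `grep -qv grep`, which is 0 whenever any line mentioning opsiconfd
# lacks the word "grep", including the restart-opsiconfd.sh line.
is_running_v2() {
    grep opsiconfd | grep -qv grep
}

# As corrected in comment 44: first drop grep and the restart script,
# then ask whether any opsiconfd line is left.
is_running_fixed() {
    grep -vE '(grep|restart)' | grep -q opsiconfd
}
```

When the only matching line is `bash -l -c /home/cltbld/opsi-package-sources/restart-opsiconfd.sh`, `is_running_v2` returns 0 (a false positive) while `is_running_fixed` returns 1. With a real `/usr/sbin/opsiconfd -D` line present, both return 0.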

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 45

•

16 years ago

Nick - thanks for that fix. I *think* things are working now, it restarted overnight in staging just fine. I'll let it run over the weekend though...

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 46

•

16 years ago

Comment on attachment 423987
opsi service restarter, v2

changeset: 36:96b99da5c40f

Attachment #423987 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 47

•

16 years ago

Alright, this latest version worked over the weekend so I'm going to put it on production and land it. It's still just a workaround, but as long as it continues to restart the service every night we shouldn't see any more issues with slaves getting hung up.

I've updated the staging crontab as follows:
MAILTO=release@mozilla.com
# m h dom mon dow command
*/10 * * * * rsync -av /N/*ref* /var/lib/opsi/config/clients/ &>> /root/config-rsync.log
*/5 * * * * cd /home/cltbld/opsi-package-sources && /usr/bin/python look-for-new-slaves.py -f staging-slaves &>> /root/new-slaves.log
0 2 * * * bash -l -c /home/cltbld/opsi-package-sources/restart-opsiconfd.sh &>> /root/restart-opsi.log || echo "Failed to restart OPSI."

And the production one like so:
MAILTO=release@mozilla.com
# m h dom mon dow command
*/5 * * * * cd /home/cltbld/opsi-package-sources && /usr/bin/python look-for-new-slaves.py -f production-slaves &>>/root/new-slaves.log
0 2 * * * bash -l -c /home/cltbld/opsi-package-sources/restart-opsiconfd.sh &>> /root/restart-opsi.log || echo "Failed to restart OPSI."
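
The restart entries rely on cron mailing whatever a job prints to the MAILTO address: normal output goes to a log file, and only the fallback `echo` reaches stdout. The same pattern as a small function, using the POSIX `>> file 2>&1` spelling so it does not depend on which shell cron runs; the function name is illustrative.

```shell
# Run a command with all of its output appended to a log file, and
# print a one-line message on stdout only if the command fails.
# Under cron, that single line is what ends up in the MAILTO mail.
run_logged() {
    log_file="$1"
    shift
    if ! "$@" >> "$log_file" 2>&1; then
        echo "Failed: $*"
        return 1
    fi
}
```

For example, `run_logged /root/restart-opsi.log /home/cltbld/opsi-package-sources/restart-opsiconfd.sh` stays silent on success and prints a single "Failed: ..." line otherwise.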

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 48

•

16 years ago

Worked fine on staging and production overnight. While this bug doesn't fix the root cause of the problem, I'm going to declare this FIXED since we have a workaround now.

Status: REOPENED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED

Justin Wood (:Callek)

Updated

•

15 years ago

Blocks: 652391

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: mozilla.org → Release Engineering
