Closed Bug 550815 Opened 11 years ago Closed 11 years ago

Buildbot doesn't start reliably on recent win32 slaves

Categories

(Release Engineering :: General, defect)

x86
Windows Server 2003
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: rail)

References

()

Details

(Whiteboard: [buildslaves][opsi])

Attachments

(8 files, 3 obsolete files)

7.74 KB, patch
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
66.20 KB, image/png
Details
10.44 KB, patch
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
1.38 KB, patch
catlee
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
1.08 KB, patch
bhearsum
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
462 bytes, patch
bhearsum
: review+
Details | Diff | Splinter Review
933 bytes, patch
bhearsum
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
393 bytes, text/plain
Details
Two related sets of symptoms
* there are several mw32-ix-slaveNN slaves that don't always start buildbot after rebooting, and are just sitting at the desktop with nothing running. I'm using the timestamps in twistd.log and 'net statistics server | head' (as uptime proxy) to determine that. Reopened bug 547799 to get a nagios check for buildbot
* moz2-win32-slave50ish - 60 have intermittent problems determining the installed compilers when the MozillaBuild terminal starts, or 
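(For reference, the staleness check described in the first bullet can be sketched as below. If twistd.log was last written before the machine booted, buildbot never restarted after the reboot. On win32 the boot-time proxy came from 'net statistics server'; here both timestamps are plain parameters so the comparison is portable. The function name is hypothetical.)

```shell
#!/bin/sh
# Sketch of the manual check: compare the last write to twistd.log against
# a boot-time proxy. If the log predates the boot, buildbot never restarted.
is_buildbot_stale() {
    log_mtime=$1   # last write to twistd.log (seconds since epoch)
    boot_time=$2   # boot time proxy (seconds since epoch)
    if [ "$log_mtime" -lt "$boot_time" ]; then
        echo stale
    else
        echo ok
    fi
}

is_buildbot_stale 1000 2000   # log predates boot -> stale
is_buildbot_stale 3000 2000   # log written after boot -> ok
```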

Hopefully we can figure out the changes between VM slave50ish and earlier that might have caused this.
Checked just now and only 5 of 24 win32 ix slaves were still connected to their production masters, over the space of a day or two.
I'll do my best to resolve this after I finish up with the linux ix machine issues in bug 549672
Assignee: nobody → bhearsum
Looks like OPSI is at fault here. It appears we're running a different version of the "preloginloader" package on all the failing machines, as well as the ref platforms. After uninstalling OPSI on mw32-ix-slave01, buildbot had no trouble starting through a few reboots.
And for posterity, preloginloader version 3.4-27 is the problem version.
We have 3.4-27 on win32-slave50 and upwards, so that correlates with the VMs that also have trouble starting buildbot.
No longer blocks: 545136
It appears that the working machines are running preloginloader 3.3-22, based on my diffing. This would make sense, because we ran OPSI 3.3 prior to deploying on Vista, when we had to upgrade to 3.4. The Vista machines *require* preloginloader 3.4-27 to work.

Assuming I'm right about the version, this gets tricky. There's no way I know of to have multiple versions of a package installed on the server, so trying to manage different versions for different machines is very difficult. The best option would be to find a preloginloader that works for all of our machines. There have been quite a few released since the last time we tried to upgrade, so the most recent one (3.4-39) may work for us. Last time we tried to upgrade it was a disaster, and caused Vista hangs.
Blocks: 545136
No longer blocks: 545136
Alice helpfully reminds me that we've mostly phased out Vista at this point, so Vista support is not required in whatever preloginloader package we use.

I've tested 3.4-39 on XP and win2k3. Works fine on XP, crashes at startup on Win2k3. This version is useless for us. I'll look for other versions to try next week.
Possibly related, or at fault here, is that the automatic login happens while OPSI is checking for and installing packages, rather than afterwards. This wasn't the case before, so it's possibly a bug that only exists in 3.4-27.
On busy days for the trees we lose the ix machines at quite a clip. Any luck finding other versions to test?
Sorry, I got sidetracked with other things, again :(.

I didn't find any other versions to try, other than even *more* experimental ones, which don't seem worth the effort.

Catlee suggested starting Buildbot in a loop at startup, rather than just once. I gave this a try by modifying d:\mozilla-build\start-buildbot.bat to do so, and that seemed to continue to work after many reboots. I still need to tweak it a bit to support both the 'slave' and 'moz2_slave' buildbot dirs, and add it to the things deployed in the buildbot-batch-file OPSI package.

I *hope* I can get to this tomorrow or Thursday.
Blocks: 556994
(In reply to comment #11)
> Sorry, I got sidetracked with other things, again :(.
> 
> I didn't find any other versions to try, other than even *more* experimental
> ones, which don't seem worth the effort.
> 
> Catlee suggested starting Buildbot in a loop at startup, rather than just once.
> I gave this a try by modifying d:\mozilla-build\start-buildbot.bat to do so,
> and that seemed to continue to work after many reboots. I still need to tweak
> it a bit to support both the 'slave' and 'moz2_slave' buildbot dirs, and add it
> to the things deployed in the buildbot-batch-file OPSI package.

I did some more testing and my script continued to work well. However, I could not reproduce the original problem in staging, even with a production slave that had just failed. Furthermore, there seem to be two ways to fail here:
1) Removing the sleep from the batch file in the start menu causes Buildbot to fail to start 100% of the time, even in staging.
2) Sometimes in production, slaves will fail to start buildbot correctly despite the sleep. I believe the cause of this is different from 1), and based on not being able to reproduce it in staging, I'm starting to think it's related to load on the master. The scenario could be: slave tries to start -> master doesn't respond for a while -> slave dies. The Buildbot process isn't supposed to die in that case, but something weird could be going on.

I'm going to prepare an OPSI package that rolls out my updated start-up scripts and I'd like to roll it out in production and see if it fixes the issue. If the scenario is anything like what I described, it should.
So, this package replaces the buildbot-batch-file package and drops in the three files we use in starting Buildbot:
buildbot.bat (start menu, deals with tac generation as well)
start-buildbot.bat (copy of start-msvs8.bat from mozillabuild, modified to launch buildbot)
start-buildbot.sh (called from start-buildbot.bat, launches the buildbot slave in a loop for some period of time)

I've tested this on mw32-ix-slave01, and both the installation and launching of Buildbot have worked fine.
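(For reference, the retry idea behind start-buildbot.sh can be sketched as below. This is not the script that landed; 'true' stands in for the real 'buildbot start <dir>' command, and the attempt count and back-off are assumptions. The echo format mirrors the "N: Running ..." / "N: Ran ..." lines seen in the logs later in this bug.)

```shell
#!/bin/sh
# Minimal sketch of starting a command in a retry loop instead of just once.
start_with_retries() {
    cmd=$1
    attempts=$2
    n=1
    while [ "$n" -le "$attempts" ]; do
        echo "$n: Running $cmd"
        if $cmd; then
            echo "$n: Ran $cmd"
            return 0
        fi
        sleep 1          # back off briefly before retrying
        n=$(( n + 1 ))
    done
    return 1
}

start_with_retries true 3   # succeeds on the first attempt
```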
Attachment #437355 - Flags: review?(catlee)
Catlee,

I addressed the comments about the sleeping and simplified the elapsed time measurement.

I also tested a few scenarios where jobs ended quickly. Thankfully, the machine rebooted cleanly, and without Buildbot reconnecting to the master in the middle of the shutdown. So, AFAICT, doing it this way is safe.
Attachment #437355 - Attachment is obsolete: true
Attachment #437617 - Flags: review?(catlee)
Attachment #437355 - Flags: review?(catlee)
Attachment #437617 - Flags: review?(catlee) → review+
Comment on attachment 437617 [details] [diff] [review]
buildbot startup, rev2

changeset:   43:f394588cd62c
Attachment #437617 - Flags: checked-in+
I set this to roll out on all of the build slaves. I tested one by hand, and it's working fine. Leaving open until at least tomorrow, when we can assess whether or not it worked.
(In reply to comment #0)
> Two related sets of symptoms
> * there are several mw32-ix-slaveNN slaves that don't always start buildbot
> after rebooting, and are just sitting at the desktop with nothing running. I'm
> using the timestamps in twistd.log and 'net statistics server | head' (as
> uptime proxy) to determine that. Reopened bug 547799 to get a nagios check for
> buildbot

Overnight, we haven't seen any more of these.

> * moz2-win32-slave50ish - 60 have intermittent problems determining the
> installed compilers when the MozillaBuild terminal starts, or 

I found two machines hitting these issues this morning. Neither had the new OPSI package installed, but considering they hang before start-buildbot.bat completes, I doubt it will help.
mw32-ix-slave18 didn't start buildbot properly today. Nagios reported it down 2 minutes after the machine booted -- so it never even got into the .sh file :-(. It could be that the IX machines are failing with the same, strange SDK error the VMs have hit, but aren't blocking on the dialog.

Back to square one here :(.
(In reply to comment #0)
> Two related sets of symptoms
> * there are several mw32-ix-slaveNN slaves that don't always start buildbot
> after rebooting, and are just sitting at the desktop with nothing running. I'm
> using the timestamps in twistd.log and 'net statistics server | head' (as
> uptime proxy) to determine that. Reopened bug 547799 to get a nagios check for
> buildbot

Based on the fact that there have been very few machines disconnected since the new startup scripts landed, I think that this part is fixed.

> * moz2-win32-slave50ish - 60 have intermittent problems determining the
> installed compilers when the MozillaBuild terminal starts, or 

But as mentioned in my previous comment, there is at least one ix machine which has the same issues as these VMs -- which don't appear to be solved yet.
A number of ix boxes seem to be hitting this.
I've been restarting them by hand, and then getting new nagios errors later -- I'm going to just start acking with this bug.
(In reply to comment #21)
> A number of ix boxes seem to be hitting this.
> I've been restarting them by hand, and then getting new nagios errors later --
> I'm going to just start acking with this bug.

Yes...this bug has been tracking the issues with the mw32-ix machines and slaves 50-59 for weeks now.

When you see machines in this state, please start buildbot by launching the batch file in the start menu until we have a fix for this.
Grasping at straws, I'm doing some large diffs to see if there's any interesting, undocumented changes between an older slave (win32-slave26) and a newer one (win32-slave52). So far, I've diff'ed the entire mozilla-build install directory, and come up with nothing interesting. Full list of differences:
* Lots of pyc files (expected, harmless)
* Parts of the tar installation were not updated on the latest machines (binaries were, support scripts and docs were not)
* .bash_history for Administrator and cltbld

Aside from the items listed above, the two mozilla-build installations were absolutely identical.

I've also diffed dumps of the Visual Studio portion of the registry between them. They were 100% identical.

Next up is a diff of the entire msvs8 installation.
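(For reference, the kind of recursive comparison described above can be done with plain diff. The directories below are stand-ins for the slaves' install trees.)

```shell
#!/bin/sh
# Build two small trees with one identical file and one differing file,
# then report only which files differ.
mkdir -p /tmp/treeA /tmp/treeB
echo same > /tmp/treeA/config.txt
echo same > /tmp/treeB/config.txt
echo old > /tmp/treeA/cache.bin
echo new > /tmp/treeB/cache.bin

# -r: recurse into subdirectories, -q: report file names only
diff -rq /tmp/treeA /tmp/treeB || true   # diff exits 1 when files differ
```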
There appears to be no meaningful difference in the MSVS8 installations, either.  Complete list:
* msvs8/Common7/IDE/ItemTemplatesCache/cache.bin
* msvs8/Common7/IDE/ProjectTemplatesCache/cache.bin
* msvs8/SmartDevices/SDK/SDKTools/cabwiz.exe.inf
* msvs8/VC/vcpackages/WCE.VCPlatform.config
Duplicate of this bug: 558693
(In reply to comment #20)
> (In reply to comment #0)
> > Two related sets of symptoms
> > * there are several mw32-ix-slaveNN slaves that don't always start buildbot
> > after rebooting, and are just sitting at the desktop with nothing running. I'm
> > using the timestamps in twistd.log and 'net statistics server | head' (as
> > uptime proxy) to determine that. Reopened bug 547799 to get a nagios check for
> > buildbot
> 
> Based on the fact that there have been very few machines disconnected since the
> new startup scripts landed, I think that this part is fixed.
> 

I just saw an IX machine that definitely hit this exact issue. It's looking more and more like my startup script changes haven't fixed anything.
This morning I found 21/24 production ix machines disconnected, and 2/10 afflicted VMs.
Adds better logging to all of the files we use in the Buildbot startup, including guess-msvc.bat, which is managed with this package starting with this patch. Hopefully this will clue us in to exactly where we're hitting issues.
Attachment #438758 - Flags: review?(catlee)
Attachment #438758 - Flags: review?(catlee) → review+
Attachment #438758 - Flags: checked-in+
Comment on attachment 438758 [details] [diff] [review]
add better startup logging to windows boot

This is set to roll out on all of the build slaves.
Lots of slaves failed overnight again, but this time they did some logging!
Here's a successful start:
Tue 04/13/2010 22:31:20.79 - Very start of buildbot.bat" 
"Tue 04/13/2010 22:31:21.62 - Sleeping at the end of buildbot.bat" 
"Tue 04/13/2010 22:31:38.32 - About to run start-buildbot.bat" 
"Tue 04/13/2010 22:31:38.43 - About to call guess-msvc.bat" 
"Tue 04/13/2010 22:31:38.45 - Start of guess-msvc.bat" 
"Tue 04/13/2010 22:31:38.46 - About to query MSVC8KEY" 
"Tue 04/13/2010 22:31:38.48 - Queried, VC8DIR is " 
"Tue 04/13/2010 22:31:38.57 - End of guess-msvc.bat" 
"Tue 04/13/2010 22:31:38.57 - Calling vcvars32.bat in VC8DIR" 
"Tue 04/13/2010 22:31:38.64 - About to run start-buildbot.sh" 
"Tue 04/13/2010 22:31:38.65 - start-buildbot.sh finished" 
Tue Apr 13 22:31:41 PDT 2010 - start of start-buildbot.sh
1: Running /d/mozilla-build/python25/scripts/buildbot start /e/builds/moz2_slave
1: Ran /d/mozilla-build/python25/scripts/buildbot start /e/builds/moz2_slave

It's a bit muddled because we're logging to the same file from both cmd and MSYS. The "start-buildbot.sh finished" reference is supposed to be at the end.

Here's a failed one:
Wed 04/14/2010  2:00:06.25 - Very start of buildbot.bat" 
"Wed 04/14/2010  2:00:07.01 - Sleeping at the end of buildbot.bat" 
"Wed 04/14/2010  2:00:38.42 - About to run start-buildbot.bat" 
"Wed 04/14/2010  2:00:38.53 - About to call guess-msvc.bat" 
"Wed 04/14/2010  2:00:38.54 - Start of guess-msvc.bat" 
"Wed 04/14/2010  2:00:38.57 - About to query MSVC8KEY" 
"Wed 04/14/2010  2:00:38.57 - Queried, VC8DIR is " 
"Wed 04/14/2010  2:00:38.67 - End of guess-msvc.bat" 
"Wed 04/14/2010  2:00:38.67 - Calling vcvars32.bat in VC8DIR" 
"Wed 04/14/2010  2:00:38.73 - About to run start-buildbot.sh" 
"Wed 04/14/2010  2:00:38.75 - start-buildbot.sh finished" 

So, it appears as if the 'start' command at the end of start-buildbot.bat (http://hg.mozilla.org/build/opsi-package-sources/file/78a67cfb7bff/buildbot-startup/CLIENT_DATA/start-buildbot.bat#l66) fails, because we never see the prints from start-buildbot.sh.
One quick theory is that rxvt is crashing. It feels familiar, but I can't dig anything like that up at the moment.
(In reply to comment #31)
> One quick theory is that rxvt is crashing. It feels familiar, but I can't dig
> anything like that up at the moment.

Can we run cmd instead of rxvt? Are there any requirements for rxvt?

The following command worked fine on one of the dead slaves:

start /min "Buildbot" "d:\mozilla-build\msys\bin\bash" --login -c /d/mozilla-build/start-buildbot.sh
(In reply to comment #32)
> (In reply to comment #31)
> > One quick theory is that rxvt is crashing. It feels familiar, but I can't dig
> > anything like that up at the moment.
> 
> Can we run cmd instead of rxvt? Are there any requirements for rxvt?

We could, but I'm hard pressed to make a change like that without evidence that it helps, or at least doesn't bust anything. If you have time, could you run a slave in staging which launches with cmd.exe for a while?
I found some registry differences in the OPSI section on ix01, which is working fine at this point, so I tried copying those settings to another ix machine, but that didn't fix the problem there. Then I tried a bunch of other things:
* Reinstalling OPSI from the server -- still had startup troubles
* Uninstalling OPSI entirely -- fixed it
* Installing OPSI from the server again -- still had startup troubles
* Manually installing the OPSI client with files from a VM that has never failed -- still had startup troubles

It's pretty clear that something OPSI is doing is getting in the way, but I'm not sure what.
(In reply to comment #33)
> We could, but I'm hard pressed to make a change like that without evidence that
> it helps, or at least doesn't bust anything. If you have time, could you run a
> slave in staging which launches with cmd.exe for awhile?

win32-slave03 has been running buildbot using cmd.exe for ~13 hours, attached to sm02. So far so good. A bunch of different builds passed, from nightlies and unittests to release builds and repacks. Let's see how it behaves over the next few days.

The only unusual thing is cmd.exe's caption changes:
http://img217.imageshack.us/img217/4288/screenshotsy.png
Another straw to grasp at: we modify the registry at every boot with a few OPSI packages. Somehow this could be screwing with the boot, maybe? I've disabled these startup jobs, which aren't technically required anyway.
This patch uses 'set -x', signal trapping, and redirection of all output to a file to hopefully glean some information about how or why the shell script is dying.

I wanted to use:
exec > >(tee -a $LOG) 2>&1

to log to the console, too, but apparently that particular construction is not implemented in MSYS' bash. Probably doesn't matter, since the shell window closes after sh dies.
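(For reference, the debugging aids the patch adds can be sketched as below. This is not the patch itself; the log path is a stand-in.)

```shell
#!/bin/sh
# Sketch of logging + tracing + signal trapping in a startup shell script.
LOG=/tmp/start-buildbot-debug.log
: > "$LOG"            # start with a fresh log

exec >> "$LOG" 2>&1   # redirect all stdout/stderr to the log file
set -x                # trace each command as it runs (goes to the log too)

# Record which signal, if any, kills the script.
trap 'echo "caught a signal, exiting"' HUP INT TERM

echo "start-buildbot.sh running"
```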
Attachment #439607 - Flags: review?(catlee)
How about trying the following approach:

=== Snippet ===
cd %USERPROFILE%

:start
"%MOZILLABUILD%\msys\bin\bash" --login -c /d/mozilla-build/buildbot-start.sh
"%MOZILLABUILD%\msys\bin\sleep" 30

goto start
=== Snippet ===

In this case we have some kind of self-recovery. 

win32-slave03 has been running this script more than 24 hours without any visible regression so far. We can try this script on one of the problematic boxes and increase the sleep period so that nagios can detect the failure.

Another approach is to convert the .bat file to an .exe (no FOSS converter found, freeware only) and run it as a service. Windows can monitor and restart its services in case of failure.
(In reply to comment #38)
> How about trying the following approach:
> 
> === Snippet ===
> cd %USERPROFILE%
> 
> :start
> "%MOZILLABUILD%\msys\bin\bash" --login -c /d/mozilla-build/buildbot-start.sh
> "%MOZILLABUILD%\msys\bin\sleep" 30
> 
> goto start
> === Snippet ===
> 
> In this case we have some kind of self-recovery. 
> 
> win32-slave03 has been running this script more than 24 hours without any
> visible regression so far. We can try this script on one of the problematic
> boxes and increase the sleep period so that nagios can detect the failure.

This is absolutely worth a try. Can you post a patch for this?

> Another approach is to convert the .bat file to an .exe (no FOSS converter found,
> freeware only) and run it as a service. Windows® can monitor and restart its
> services in case of failure.

The service approach doesn't work because Buildbot doesn't know how to launch processes on the Desktop. We'd need to do Buildbot hacking first.
Comment on attachment 437617 [details] [diff] [review]
buildbot startup, rev2

This patch ended up busting some try slaves that rebooted (for some reason). They were using /e/builds/sendchange-slave. I switched them to use /e/builds/slave rather than adding on to this patch. There might be other try slaves that need this change, too.
(In reply to comment #39)
> This is absolutely worth a try. Can you post a patch for this?

Use an endless loop, without launching rxvt minimized.
Attachment #439815 - Flags: review?(bhearsum)
Attachment #439607 - Flags: review?(catlee) → review+
I've added create-shortcut.vbs, which creates a shortcut in the Startup group that runs minimized. Tested on win32-slave03.b.m.o.

buildbot-startup.ins removes the old menu entry as well.
Attachment #439815 - Attachment is obsolete: true
Attachment #439961 - Flags: feedback?(bhearsum)
Attachment #439815 - Flags: review?(bhearsum)
Attachment #439961 - Attachment is obsolete: true
Attachment #439961 - Flags: feedback?(bhearsum)
Attachment #440215 - Flags: review?(bhearsum)
Attachment #440214 - Flags: review?(bhearsum)
Attachment #440215 - Flags: review?(bhearsum) → review+
Comment on attachment 440214 [details] [diff] [review]
start-buildbot.bat: use endless loop, run without rxvt

Assuming the build currently running on win32-slave03 finishes successfully, let's land these later today.
Attachment #440214 - Flags: review?(bhearsum) → review+
Comment on attachment 439607 [details] [diff] [review]
better logging, signal catching from buildbot sh script

changeset:   45:5f1bc7cfec25
Attachment #439607 - Flags: checked-in+
Comment on attachment 440214 [details] [diff] [review]
start-buildbot.bat: use endless loop, run without rxvt

changeset:   46:a0ded5a249ce
Attachment #440214 - Flags: checked-in+
Attachment #440215 - Flags: checked-in+
Comment on attachment 440215 [details] [diff] [review]
Run start-buildbot.bat minimized

Checking in buildbot-startup/buildbot.bat;
/mofo/opsi-binaries/buildbot-startup/buildbot.bat,v  <--  buildbot.bat
new revision: 1.3; previous revision: 1.2
done
Rail set the buildbot-startup package to deploy on all of the build machines again.
Seems like start-buildbot.sh hasn't deployed:

Error: copy of P:\install\buildbot-startup\start-buildbot.sh to d:\mozilla-build\start-buildbot.sh not possible. File Err. No. 32 (The process cannot access the file because it is being used by another process) Errorcode 32 ("The process cannot access the file because it is being used by another process")

Very strange to see a user startup process running *before* the OPSI client.
Just compare the times when the OPSI preloginloader and buildbot start. This shouldn't happen, imho.

c:\tmp\buildbot-startup.log snippet
-------------------------------------
"Tue 04/20/2010 14:39:56.56 - Sleeping at the end of buildbot.bat" 
"Tue 04/20/2010 14:40:27.73 - About to run start-buildbot.bat" 
"Tue 04/20/2010 14:40:27.84 - About to call guess-msvc.bat" 
"Tue 04/20/2010 14:40:27.84 - Start of guess-msvc.bat" 
"Tue 04/20/2010 14:40:27.85 - About to query MSVC8KEY" 
"Tue 04/20/2010 14:40:27.87 - Queried, VC8DIR is " 
"Tue 04/20/2010 14:40:27.96 - End of guess-msvc.bat" 
"Tue 04/20/2010 14:40:27.96 - Calling vcvars32.bat in VC8DIR" 
"Tue 04/20/2010 14:40:28.03 - About to run start-buildbot.sh" 
"Tue 04/20/2010 14:40:28.06 - start-buildbot.sh finished" 
-------------------------------------

c:\tmp\instlog.txt snippet:
-------------------------------------
    ============ Version 4.8.8.1 WIN32 script "P:\install\buildbot-startup\buildbot-startup.ins"
                 start: 2010-04-20  14:40:28  (on client named as : "mw32-ix-slave23.uib.local")
    [executing: "C:\Program Files\opsi.org\preloginloader\opsi-winst\winst32.exe"]
    system infos:
    mw32-ix-slave23.build.mozilla.org
-------------------------------------
Probably we should install the loginblocker on these machines. I can see some differences between win32-slave03 and ix-slave23:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\GinaDLL, which is set on win32-slave03 and doesn't exist at all on the ix machine.

Seems like we have the same OPSI version but not the same login behavior.

Randomly checked slaves:

Problematic ones:
mw32-ix-slave01: no GinaDLL registry entry
mw32-ix-slave05: no GinaDLL registry entry
mw32-ix-slave23: no GinaDLL registry entry
moz2-win32-slave50: no GinaDLL registry entry
moz2-win32-slave54: no GinaDLL registry entry

Stable ones:
moz2-win32-slave05: has a GinaDLL registry entry
win32-slave03: has a GinaDLL registry entry
moz2-win32-slave40: has a GinaDLL registry entry
moz2-win32-slave49: has a GinaDLL registry entry

If loginblocker fixes this issue I want my beer! :)
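(For reference, the per-machine check described above can be done with the standard Windows reg command; the key path is the one given in the comment.)

```bat
reg query "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon" /v GinaDLL
```

On the stable machines this should print the GinaDLL value; on the problematic ones it should report that the value was not found.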
Snippet from preloginloader.ins:
----------------------------------------
comment "copying loginblocker"
if ($INST_MinorOS$ = "Windows Vista")
        if ($INST_system_bit$ = "64")
                Files_copy_vista_loginblocker_64
                DosInAnIcon_vista64_loginblocker
                ExecWith_vista64_loginblocker "%systemroot%\cmd64.exe" /c
        else
                Files_copy_vista_loginblocker_32
                Files_del_cmd64
        endif
endif
if (($INST_MinorOS$ = "WinXP") or ($INST_MinorOS$ = "Win2k"))
        if ($INST_system_bit$ = "64")
                Files_copy_xp_loginblocker_64
        else
                Files_copy_xp_loginblocker_32
                Files_del_cmd64
        endif
endif
----------------------------------------

No win2k3 check (all the checks fail; see c:\tmp\instlog.txt), so no pgina.dll is going to be installed.
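(Presumably the short-term fix is to extend the XP/2k branch's OS check to also cover 2003, along these lines. The exact $INST_MinorOS$ token for win2k3 is an assumption; this is a sketch, not the patch that landed.)

```
if (($INST_MinorOS$ = "WinXP") or ($INST_MinorOS$ = "Win2k") or ($INST_MinorOS$ = "Windows Server 2003"))
        if ($INST_system_bit$ = "64")
                Files_copy_xp_loginblocker_64
        else
                Files_copy_xp_loginblocker_32
                Files_del_cmd64
        endif
endif
```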
Rail has some fresh ideas here, passing this bug off to him :)
Assignee: bhearsum → rail
Seems like the main problem is the preloginloader OPSI package, which doesn't install its library and registry entries on Windows 2003.

The package itself is a bit complicated, so extracting the loginblocker-related pieces and creating a new package would be a bit risky. I'd prefer to patch the preloginloader package and reinstall it, at least as a short-term fix.
Attachment #440470 - Flags: review?(bhearsum)
Attached file Recreation procedure
Comment on attachment 440470 [details] [diff] [review]
preloginloader opsi package patch

Awesome work here, Rail.
Attachment #440470 - Flags: review?(bhearsum) → review+
Comment on attachment 440470 [details] [diff] [review]
preloginloader opsi package patch

Rail landed this shortly ago. We also landed a follow-up patch to disable the forced reboots after the installation finishes, since jobs could be running while this happens:

RCS file: /mofo/opsi-binaries/preloginloader/CLIENT_DATA/files/opsi/preloginloader.ins,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- files/opsi/preloginloader.ins	23 Apr 2010 19:39:23 -0000	1.2
+++ files/opsi/preloginloader.ins	23 Apr 2010 19:48:35 -0000	1.3
@@ -402,9 +402,11 @@
 		sub_clean_up
 	 
 		; all is done but make a reboot after terminating with the script
-		if ($INST_AllowReboot$ = "true")
-			ExitWindows /Reboot
-		endif
+		; Commented out so we can roll out to machines which aren't
+		; properly blocking the login.
+		;if ($INST_AllowReboot$ = "true")
+		;	ExitWindows /Reboot
+		;endif
 	endif    ; diskspace
 endif      ; correct OS Version
Attachment #440470 - Flags: checked-in+
We rolled out the updated preloginloader to mw32-ix-slave02, 03, and win32-slave50 and 51. If these slaves stay up over the weekend, we'll roll out to the rest of them.

Go Rail!
We marked the new preloginloader for install on the rest of the ix slaves and VMs (ix 04-25, VMs 01-49; 52-59, and the try VMs).
Most of the machines have picked up the new preloginloader. I also rebooted both ref images, and they now have it.
Occasionally, we're seeing the following in an OPSI dialog, which ends up blocking the login until it is clicked through:
Zeitüberschreitung bei Verbindung -- which translates to "Connection timeout". We lowered the connection timeout for OPSI down to 20 seconds in bug 522078 -- we could try bumping it back up to a minute or so.
nagios complained about mw32-ix-slave10 today, on inspection the screen saver was running but it logged on soon after I connected.
(In reply to comment #63)
> nagios complained about mw32-ix-slave10 today, on inspection the screen saver
> was running but it logged on soon after I connected.

Hmmm, Rail and I logged on before you and clicked through the OPSI dialog described in comment #62. We should bump the timeout for OPSI to avoid that dialog altogether, but I don't know why or how they're getting stuck on the screensaver; I thought that was fixed :-(.
(This is probably the same, but more grist for the mill)

* mw32-ix-slave02 - nagios alerts going for 19 hours over weekend
* screensaver showing on first connection with VNC
* two dialogs showing after screensaver gone
* topmost is 'Eventlog Service' complaining:
 The 'opsi Log' is full. If this is the first time you have seen this message,
 take the following steps:
 1. Click Start, click Run, type "eventvwr", and then click OK
 2. Click opsi, click the Action menu, click Clear All Events, and then click No.
 If this dialog reappears, contact your helpdesk or system administrator
* lower dialog is the wInst-Message dialog of comment #62
IIRC, the first dialog (opsi Log is full) is not blocking, but in any case we should clean up old logs somehow.
Yeah, the 'log is full dialog' disappears on its own after the login finishes.
Whiteboard: [buildslaves][opsi]
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 652391
Product: mozilla.org → Release Engineering