win32 buildbot slaves should reboot ready for use

RESOLVED FIXED

Status

P2
normal
RESOLVED FIXED
11 years ago
5 years ago

People

(Reporter: joduinn, Assigned: bhearsum)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(7 attachments, 2 obsolete attachments)

Splitting out from bug#417887, as each o.s. will have different gotchas.

Basically, how to make each buildbot master/slave reboot cleanly, reconnect and handle new jobs?
(Assignee)

Comment 1

11 years ago
Buildbot has a script that will add it as a Windows Service. IIRC there were problems related to not having a real console or not having a desktop to launch things on. This requires further investigation.
Created attachment 314862 [details]
services.png

example services properties dialog on vista. Note the Interact With Desktop switch.
I believe there's also a way to set a service's running policy through the properties dialog. I don't have a local copy of win2k3 to test this though.
Summary: win32 buildbot masters/slaves should reboot ready for use → win32 buildbot slaves should reboot ready for use
(Assignee)

Updated

10 years ago
Blocks: 472517
(Assignee)

Comment 4

10 years ago
We chatted a bunch about this today and decided that part of this will be doing scheduled, periodic reboots of staging machines both to iron out kinks in the rebooting and to look for potential performance gains.
Status: NEW → ASSIGNED
Component: Release Engineering: Future → Release Engineering
Priority: -- → P3
(Assignee)

Updated

10 years ago
Assignee: nobody → bhearsum
(Assignee)

Comment 5

10 years ago
Started working on this today. It's looking promising: Buildbot was able to launch Firefox after being started as a win32 service. Initial problems:
* $PATH is different; I imagine it doesn't inherit the system or user set $PATH. We can probably fix this from Buildbot.
* Noticed MochiTest saying a lot of things like 'INFO Error: Unable to restore focus, expect failures and timeouts.' yet the tests still pass.
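
One possible way to handle the $PATH difference from the Buildbot side is to rebuild the value explicitly before launching any steps. The following is only a sketch: it assumes the machine-wide and per-user PATH strings have already been read from somewhere (for example the registry), and the function name `merge_path` is hypothetical, not part of Buildbot.

```python
def merge_path(system_path, user_path, sep=";"):
    """Combine system and user PATH strings, dropping empty and duplicate entries.

    Hypothetical helper for illustration; not a Buildbot API.
    """
    seen = set()
    merged = []
    for entry in (system_path + sep + user_path).split(sep):
        entry = entry.strip()
        key = entry.lower()  # Windows paths compare case-insensitively
        if entry and key not in seen:
            seen.add(key)
            merged.append(entry)
    return sep.join(merged)
```

The result could then be placed in the environment handed to the slave's build steps, so that a service-started Buildbot sees the same directories as an interactive login.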

I haven't run a full set of unittests yet, but I plan to soon. I'm sure there are going to be more problems down the road, but I'm encouraged by these initial results.

*fingers crossed*
Priority: P3 → P2
(Assignee)

Comment 6

10 years ago
So, it turns out we get *tons* of failures when Buildbot is started as a service. I suspect this is entirely because firefox.exe isn't running in any sort of "real" Desktop. I tried a few things to work around this, including running as the Local System Account with "allow desktop access" checked, but that made no difference.

Before going further into this black hole I'm doing a test to see if running unittests from the "console" (that is to say, the real display) makes a difference in terms of time. This route is much better documented: there's lots of information about how to automatically log on in Windows 2003, start processes, etc.

If it increases test run time by a significant amount I'll try to track down the issues with running as a service.
(Assignee)

Comment 7

10 years ago
As it turns out, running tests in the "console" session of a win32 VM makes almost no difference in timing. Unit tests overall took 2 minutes longer - which is a trivial amount in the grand scheme of things.

Additionally, unittests pass completely when running here (nb. I tripped a legitimate mochitest leak, filed in bug 477066).

Given the above, I'm going to work on clean reboots using the console session.
(Assignee)

Comment 8

10 years ago
Made good progress today. Here's a summary of what I did on moz2-win32-slave21:
* Installed RealVNC
* Turned off Firewall
* Edited its VMX file by hand to add the following:
svga.maxHeight = 1024
svga.maxWidth = 1280
svga.vramSize = 16777216
* Added a couple of batch files (to be attached shortly) to aid in starting buildbot on boot.
* Edited the registry to automatically login cltbld

It now logs in and starts Buildbot automatically on boot, and is currently running unittests. I'm going to let things run overnight, and if it looks good I'll apply these changes to all of the staging slaves next week.
(Assignee)

Comment 9

10 years ago
Had some additional failures overnight. Some of them are related to the fact that there's no audio driver in the console session.

I tried installing http://www.rigexpert.net/gettingstarted/reaudio.htm, which helped, but some tests ended up hanging.

Then, I installed the demo of http://software.muzychenko.net/eng/vac.html and video tests started passing. The demo claims to be "feature limited", which I suspect means some recording features aren't available. More importantly, the demo doesn't seem to be time limited, so I think we're totally within our rights to use it for as long as we want. I'm going to leave moz2-win32-slave21 running builds and tests over the weekend to get some more results out of it.
(Assignee)

Comment 10

10 years ago
Things ran perfectly well on mozilla-1.9.1 unittests over the weekend. The only failure was the mochitest leak mentioned in comment #7. Given that, I think I'm ready to roll this out into the real staging environment. I'm tempted to adjust the mochitest leak threshold to cope with the failures for now...I'm not getting the impression it'll be easy to get energy on bug 477066 right now.
(Assignee)

Comment 11

10 years ago
Created attachment 362741 [details]
script to go in c:\documents and settings\cltbld\start menu\programs\startup
(Assignee)

Comment 12

10 years ago
Created attachment 362742 [details]
accompanying file to go in d:\mozilla-build
(Assignee)

Updated

10 years ago
Attachment #362741 - Attachment is patch: false
(Assignee)

Comment 13

10 years ago
Created attachment 362743 [details]
registry keys needed for autologin

Password removed.
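
For readers without access to the attachment, here is a rough sketch of what an autologin .reg file generally looks like. It is based on the standard Winlogon values (AutoAdminLogon, DefaultUserName, DefaultPassword, DefaultDomainName) and is an assumption about the general shape, not a copy of the attached file. The helper names are hypothetical and the password shown in the usage line is a placeholder.

```python
def _reg_quote(value):
    # .reg string values escape backslashes and double quotes with a backslash.
    return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')


def autologon_reg(username, password, domain=""):
    """Build the text of a .reg file that enables automatic logon.

    Assumption: uses the standard Winlogon values; the real attachment may differ.
    """
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon]",
        '"AutoAdminLogon"="1"',
        '"DefaultUserName"=' + _reg_quote(username),
        '"DefaultPassword"=' + _reg_quote(password),
    ]
    if domain:
        lines.append('"DefaultDomainName"=' + _reg_quote(domain))
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(autologon_reg("cltbld", "REPLACE_ME"))
```

Note that the password is stored in plain text in the registry with this approach, which is why it was removed from the attachment.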
(In reply to comment #10)
> Things ran perfectly well on mozilla-1.9.1 unittests over the weekend. The only
> failure was the mochitest leak mentioned in comment #7. Given that, I think I'm
> ready to roll this out into the real staging environment. I'm tempted to adjust
> the mochitest leak threshold to cope with the failures for now...I'm not
> getting the impression it'll be easy to get energy on bug 477066 right now.

Making these changes in staging is fine. However, bug#477066 needs to be fixed (or the test disabled?) before we can make these changes to the production slaves.
Depends on: 477066
(Assignee)

Comment 15

10 years ago
(In reply to comment #14)
> (In reply to comment #10)
> > Things ran perfectly well on mozilla-1.9.1 unittests over the weekend. The only
> > failure was the mochitest leak mentioned in comment #7. Given that, I think I'm
> > ready to roll this out into the real staging environment. I'm tempted to adjust
> > the mochitest leak threshold to cope with the failures for now...I'm not
> > getting the impression it'll be easy to get energy on bug 477066 right now.
> 
> Making these changes in staging is fine. However, bug#477066 needs to be fixed
> (or the test disabled?) before we can make these changes to the production
> slaves.

I guess that's an option. But we have a --leak-threshold for Mochitest specifically so we can run tests that are known to cause leaks, and not turn the tree orange.
(Assignee)

Comment 16

10 years ago
Here are more detailed instructions on how to deploy:
* Shut down VM, add the following lines to its vmx file:
svga.maxHeight = 1024
svga.maxWidth = 1280
svga.vramSize = 16777216
* Start the VM back up again, login as Administrator
* Download VNC from: http://realvnc.com/products/free/4.1/download.html
* Install with defaults
* When post-install dialog pops up set a password and turn off the java viewer (configure -> 'Serve Java Viewer...')
* Start -> Run -> 'services.msc'
* Disable and turn off Windows Firewall
* Download and install http://software.muzychenko.net/vac409.zip
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362743, edit with proper password, import into registry.
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362741 to ~cltbld/start menu/programs/startup
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362742 to /d/mozilla-build
* Make sure the Buildbot slave is located in /e/builds/moz2_slave (if you have to rename the directory make sure to update buildbot.tac).
* Restart
* Login with VNC and set resolution to 1280x1024

From this point forward you should NOT be logging in as cltbld with RDP.
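
If the .vmx edit in the first step needs to be repeated across many slaves, it could be scripted rather than done by hand. A minimal sketch, assuming the VM is shut down and the file can be read and rewritten as plain text; the function name `apply_vmx_settings` is hypothetical, and the values are written exactly as listed above (some .vmx files quote their values, so adjust if yours does).

```python
# Display settings from the deployment steps above.
SVGA_SETTINGS = {
    "svga.maxHeight": "1024",
    "svga.maxWidth": "1280",
    "svga.vramSize": "16777216",
}


def apply_vmx_settings(vmx_text, settings=SVGA_SETTINGS):
    """Return vmx_text with each setting present exactly once.

    Existing lines for the same key are replaced in place; missing keys are
    appended. Running it twice gives the same result as running it once.
    """
    written = set()
    out = []
    for line in vmx_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key in settings:
            if key not in written:
                out.append("%s = %s" % (key, settings[key]))
                written.add(key)
            # Later duplicates of an already written key are dropped.
        else:
            out.append(line)
    for key, value in settings.items():
        if key not in written:
            out.append("%s = %s" % (key, value))
    return "\n".join(out) + "\n"
```

Taking and returning text keeps the helper easy to test without touching a real VM; the caller decides how to read and write the file.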
(Assignee)

Comment 17

10 years ago
One last thing, cltbld must be given permission to reboot the system:
* Start menu -> Run -> gpedit.msc
* Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment
* Double click 'Shut down the system', add cltbld to the list.
* Reboot for the changes to take effect.
(Assignee)

Comment 18

10 years ago
Removing blocking in favour of setting the leak threshold.
No longer blocks: 472517
(Assignee)

Comment 19

10 years ago
Created attachment 362779 [details] [diff] [review]
allow for mochitest leak threshold in unittests

Pretty simple patch, just allows you to pass a leak threshold on to the mochitest step.
Attachment #362779 - Flags: review?(catlee)
(Assignee)

Comment 20

10 years ago
Created attachment 362781 [details] [diff] [review]
periodic reboots + leak threshold

Pretty simple master side patch. Enable reboots every 5 builds, just like Linux and Mac, and add the 188 byte mochitest leak threshold to win32 builds.
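
The reboot policy itself is simple enough to state in a few lines. This is only an illustration of the "reboot every 5 builds" rule, not the code in the attached patch; `should_reboot` and its parameters are hypothetical names.

```python
def should_reboot(builds_since_boot, reboot_every=5):
    """Return True once the slave has completed reboot_every builds since boot.

    A reboot_every of 0 (or less) disables periodic reboots.
    """
    return reboot_every > 0 and builds_since_boot >= reboot_every
```

For example, with the default of 5 a slave keeps taking jobs after its 4th build and is rebooted after its 5th.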
Attachment #362781 - Flags: review?(catlee)
(Assignee)

Comment 21

10 years ago
Nick pointed out to me yesterday that it would be better not to download RealVNC and the software audio driver from the internet every time we need them. I'm going to import them into the mofo repo for safekeeping and update my instructions.
(Assignee)

Comment 22

10 years ago
Comment on attachment 362781 [details] [diff] [review]
periodic reboots + leak threshold

I need to update the leak thresholds here.
Attachment #362781 - Flags: review?(catlee)
(Assignee)

Comment 23

10 years ago
Created attachment 362904 [details] [diff] [review]
leak threshold for tm, 1.9.1, not for m-c

After examining the logs on staging-master I've noticed that sometimes we leak 188 bytes, and sometimes we leak 200. I guess this means the threshold needs to be 200, which kind of sucks since it means it's possible to miss another leak (albeit a small one). Is there a better way of dealing with this?
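
To make the trade-off concrete, here is a small sketch of how the threshold follows from the observed leaks and what it then hides. The function names are hypothetical; only the `--leak-threshold` option comes from Mochitest, and the sketch assumes a run is flagged when it leaks more than the threshold, which matches a 200 byte leak passing with a threshold of 200.

```python
def pick_leak_threshold(observed_leaks):
    """Smallest threshold that tolerates every leak size seen so far."""
    return max(observed_leaks) if observed_leaks else 0


def is_flagged(leaked_bytes, threshold):
    """Assumption: a run fails only when it leaks more than the threshold."""
    return leaked_bytes > threshold


def mochitest_args(threshold):
    """Extra command-line arguments for the mochitest step."""
    return ["--leak-threshold=%d" % threshold] if threshold > 0 else []
```

With leaks of 188 and 200 bytes observed, the threshold comes out as 200, and any new leak of 200 bytes or less would go unnoticed until the threshold is set back to 0.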
Attachment #362781 - Attachment is obsolete: true
Attachment #362904 - Flags: review?(ted.mielczarek)
(Assignee)

Comment 24

10 years ago
Alright, those two packages are now checked into the mofo repo:
Checking in vac409.zip;
/mofo/ref-platforms/win32/vac409.zip,v  <--  vac409.zip
initial revision: 1.1
done
RCS file: /mofo/ref-platforms/win32/vnc-4_1_3-x86_win32.exe,v
done
Checking in vnc-4_1_3-x86_win32.exe;
/mofo/ref-platforms/win32/vnc-4_1_3-x86_win32.exe,v  <--  vnc-4_1_3-x86_win32.exe
initial revision: 1.1
done

Updated

10 years ago
Attachment #362779 - Flags: review?(catlee) → review+
Comment on attachment 362904 [details] [diff] [review]
leak threshold for tm, 1.9.1, not for m-c

I am saddened, but bhearsum says he is looking into what patch fixed this on m-c.
Attachment #362904 - Flags: review?(ted.mielczarek) → review+
(Assignee)

Updated

10 years ago
Attachment #362904 - Flags: checked‑in+
(Assignee)

Updated

10 years ago
Attachment #362779 - Flags: checked‑in+
(Assignee)

Comment 26

10 years ago
It's looking like moz2-win32-slave03 is able to run mochitests without leaking. Seems like there's some subtle difference between it and the other two. I'm going to try and track down what it is so the leak threshold isn't necessary.
(Assignee)

Comment 27

10 years ago
So I misread before, moz2-win32-slave03 *and* 04 were passing all of the unittests. Only moz2-win32-slave21 was failing. The only appreciable difference I found was a software audio driver I was testing being installed on it. After uninstalling that the tests have started passing. I have no idea if this is coincidence or what; I'm not sure how this driver (which isn't a browser plugin AFAIK) could cause a leak. I don't see any suspicious checkins to 1.9.1, either.

I have some other things to do right now, so I'm just going to let this run in staging for a few days or a week and monitor it. If things stay green we can turn the leak threshold down to 0 and proceed.
(Assignee)

Comment 28

10 years ago
Not a single run of 1.9.1 unittests on moz2-win32-slave21 since Friday. However, as of Friday, 6:30pm EST it was still leaking. I'd like one more run to confirm this before digging deeper...
(Assignee)

Comment 29

10 years ago
moz2-win32-slave21 is still failing. As a last resort, I'm going to try recloning the VM and applying the changes exactly as I did to slave03 and 04. Maybe there's something strange from when I was testing RDP and other various things?
(Assignee)

Comment 30

10 years ago
After recloning moz2-win32-slave21 it seems that the mochitest leak has gone away. I suspect something I did to it early on tripped the failure. I'm going to let it run for a day or two before declaring it gone for realz, though.
(Assignee)

Comment 31

10 years ago
Created attachment 365216 [details] [diff] [review]
backout leak threshold

Disable the leak threshold on 1.9.1/tm, since we haven't seen it in forever.
Attachment #365216 - Flags: review?(ccooper)
Attachment #365216 - Flags: review?(ccooper) → review+
(Assignee)

Comment 32

10 years ago
Comment on attachment 365216 [details] [diff] [review]
backout leak threshold

changeset:   976:27c75f479ff3
Attachment #365216 - Flags: checked‑in+
(Assignee)

Comment 33

10 years ago
I'm planning to roll this out on Monday, March 16th starting in the EDT morning. It's probably going to take half the day or so to fully deploy, but no downtime will be needed.
No longer blocks: 472517
(Assignee)

Updated

10 years ago
Blocks: 472517
(Assignee)

Comment 34

10 years ago
Updated deployment instructions:
* Shut down VM, add the following lines to its vmx file:
svga.maxHeight = 1024
svga.maxWidth = 1280
svga.vramSize = 16777216
* Start the VM back up again, login as Administrator
* Start menu -> Run -> gpedit.msc
* Computer Configuration -> Windows Settings -> Security Settings -> Local Policies -> User Rights Assignment
* Double click 'Shut down the system', add cltbld to the list.
* Reboot for the changes to take effect.
* Download VNC from: http://realvnc.com/products/free/4.1/download.html
* Install with defaults
* When post-install dialog pops up set a password and turn off the java viewer (configure -> 'Serve Java Viewer...')
* Start -> Control Panel -> Windows Firewall
* Add TCP/5900 as an exception.
* Download and install http://software.muzychenko.net/vac409.zip
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362743, edit with proper password, import into registry.
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362741 to ~cltbld/start menu/programs/startup
* Download https://bugzilla.mozilla.org/attachment.cgi?id=362742 to /d/mozilla-build
* Make sure the Buildbot slave is located in /e/builds/moz2_slave (if you have to rename the directory make sure to update buildbot.tac).
* Restart
* Login with VNC and set resolution to 1280x1024

From this point forward you should NOT be logging in as cltbld with RDP.
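
After the restart it can be handy to confirm that the TCP/5900 firewall exception took effect before trying to log in with VNC. A minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name and the hostname in the usage line is just an example.

```python
import socket


def port_is_open(host, port=5900, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print(port_is_open("moz2-win32-slave21"))
```

A successful connection only shows that something is listening on the port; it does not verify the VNC password or the screen resolution.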
(Assignee)

Comment 35

10 years ago
Created attachment 367599 [details]
readable .reg file
Attachment #362743 - Attachment is obsolete: true
(Assignee)

Comment 36

10 years ago
I got the last slave updated today. This is done!
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering