Closed Bug 886640 Opened 11 years ago Closed 8 years ago

Run unittests on Win64 builds in VMs

Categories: Release Engineering :: General
Hardware/OS: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Resolution: RESOLVED DUPLICATE of bug 1223509
People: Reporter: catlee; Assignee: Unassigned
Attachments: 1 obsolete file

We're going to try and use EC2 slaves to run unittests for our Win64 builds. We'll be using Windows Server 2012 instances.
Doesn't look too bad...

I'd like to enable per-push win64 builds and unittests on a twig first (doesn't have to be date, that's just a placeholder).

Bikeshedding topics:
- platform naming ("win64" as the top-level platform with "win64_vm" as the "slave platform")
- slave naming scheme ("tst-w64-ec2-XXX")
- "long" name for the platform, which I've currently got as "win64_vm". For win8 this is "WINNT 6.2", which means nothing to me.
Attachment #767000 - Flags: feedback?(rail)
Attachment #767000 - Flags: feedback?(philringnalda)
Comment on attachment 767000 [details] [diff] [review]
initial pass at win64 testing configs

Review of attachment 767000 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/config.py
@@ +108,5 @@
>      'hg_bin': 'c:\\mozilla-build\\hg\\hg',
>      'reboot_command': ['c:/mozilla-build/python27/python', '-u'] + MOZHARNESS_REBOOT_CMD,
>  }
>  
> +PLATFORMS['win64']['slave_platforms'] = ['win64_vm']

Overall it looks good. My only concern is the slave_platform name, which usually includes the platform version (like winxp-ix).
Attachment #767000 - Flags: feedback?(rail) → feedback+
(In reply to comment #1)
> I'd like to enable per-push win64 builds and unittests on a twig first (doesn't
> have to be date, that's just a placeholder).

Does that mean that we need to set up some kind of a script to keep that twig synced with m-c?
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #3)
> (In reply to comment #1)
> > I'd like to enable per-push win64 builds and unittests on a twig first (doesn't
> > have to be date, that's just a placeholder).
> 
> Does that mean that we need to set up some kind of a script to keep that
> twig synced with m-c?

In the short term, yes.
Comment on attachment 767000 [details] [diff] [review]
initial pass at win64 testing configs

It kills me not to have a bikeshedding opinion, but I don't. My vague memory from when we last ran tests on Server Whichever is that people were constantly asking which OS it actually was, and then blaming its not being a consumer OS both for failures that had nothing to do with running on it and for failures that really were caused by it. I don't think any of WINNT 6.2 or Windows Server 2012 or Win8ishmostly can fix that.
Attachment #767000 - Flags: feedback?(philringnalda) → feedback+
I've reserved the Date branch for this purpose.  Chris, that's what your patch assumes as well, right?
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #6)
> I've reserved the Date branch for this purpose.  Chris, that's what your
> patch assumes as well, right?

Yup. It didn't have to be Date, but great to have it reserved now!
It doesn't really block/depend on, but for reference, bug 882138 is tracking current test failures.
current status:

Using vlad's EC2Config powershell script as a base, I've created scripts and configs for our cloud-tools repo that will create fresh windows AMIs and then test machine instances from that AMI.

I have the test instance connected to a staging buildbot master, and it is able to run buildbot and start jobs...and then crash and burn in various interesting ways.

Still TODO:
- the EC2Config powershell scripts need error checking. occasionally one of the steps will fail, which results in an unusable AMI or instance.
- Make sure we disable Windows defender and the indexing service (as per bug 889475)
- Allow the non-admin cltbld user to reboot the machine (or, make cltbld an admin user?)
- Fix PATH for running tests. I start buildbot from a batch script that sets PATH to include msys's rm, unzip, hg, etc., but some steps don't seem to inherit PATH and so fail to find these commands (a possible fix is sketched after this list)
- Figure out how to resolve http://repos/, which is where mozharness is currently looking to download python packages for its virtual envs.
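For the PATH item above, one possible fix (a sketch only, assuming we want every process to see the msys tools rather than only children of the launching batch script; the directory names are the usual mozilla-build locations and may need adjusting) is to persist PATH at machine scope when the AMI is prepared:

$msysBins = 'c:\mozilla-build\msys\bin;c:\mozilla-build\hg;c:\mozilla-build\python27'
$current  = [Environment]::GetEnvironmentVariable('Path', 'Machine')
if ($current -notlike '*mozilla-build*') {
    # Persist PATH at machine scope so new processes inherit it regardless of
    # how they were started.
    [Environment]::SetEnvironmentVariable('Path', "$msysBins;$current", 'Machine')
}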
FTR, I added CNAMEs using inv-tool:

invtool CNAME create --fqdn repos.srv.releng.use1.mozilla.com --target releng-puppet1.srv.releng.use1.mozilla.com --description "Bug 886640"
invtool CNAME create --fqdn repos.srv.releng.usw2.mozilla.com --target releng-puppet1.srv.releng.usw2.mozilla.com --description "Bug 886640"
Status update:

AMI and instance creation scripts are mostly working.
PATH is fixed
cltbld can reboot the machine
resolving http://repos/ is fixed...mostly

The main issue I'm hitting now is network funkiness. Most of the time the machines reboot and re-connect ok. Sometimes after rebooting they're no longer able to resolve internal addresses, and sometimes they lose their dns search order preferences. Sometimes the slave looks ok (buildbot is running, no errors in logs, etc.), but just isn't taking jobs.
Looks like the machine comes up without any DNS resolvers sometimes...
Cool!  But weird about the network issues.  How is the network configured?  Just DHCP?
Depends on: 892614
Attachment #767000 - Attachment is obsolete: true
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #13)
> Cool!  But weird about the network issues.  How is the network configured? 
> Just DHCP?

Yeah, AWS's DHCP. We have them configured to use our internal DNS servers. Refreshing the DHCP lease on the client fixes it...
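One possible workaround to bake into the startup scripts (a sketch only; the resolver addresses and search suffix below are placeholders, not the real releng values) is to re-pin the DNS configuration on every boot so a bad DHCP renewal can't leave the box without resolvers:

# Placeholder addresses; substitute the internal releng DNS servers.
$resolvers = '10.0.0.2', '10.0.0.3'
Get-NetAdapter | Set-DnsClientServerAddress -ServerAddresses $resolvers
# Placeholder suffix; keeps the search order from being dropped after a lease renewal.
Set-DnsClientGlobalSetting -SuffixSearchList 'srv.releng.use1.mozilla.com'
ipconfig /renew | Out-Null
ipconfig /registerdns | Out-Null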
Today I enabled the builds and tests on the date branch.

You can see a bunch of the test results here: https://tbpl.mozilla.org/?tree=Date&rev=f2d3b5149d3a&jobname=win64_vm

Some green, lots of orange, and a bit of red.

The jetpack failure should be a simple fix, and is on my plate for next week. I haven't looked at any of the other test failures.
Whether or not it'll be the only thing keeping you from getting accelerated graphics, you need a resolution of whatever x 1024 - reftests want their window's innerheight to be over 1000.
Cool, though something is odd. "Can't create a WebGL context" shouldn't be happening.  Can we get about:support info from a running browser there?  (Or, can you make the AMI available to me somehow and I can boot one up?)
Hm, I can't see the Date results any more -- did they fall off?  Can we get them triggered maybe once a day to have up-to-date results?
We'll only get builds when there are pushes. I'll push m-c to date again now.
The builds are busted: bug 895083.

Once I fix that, I'll set up an automated push script which syncs it up to m-c.  I can also do m-i, but since we run builds/tests on all platforms there, doing that would mean doubling the load of the infra for every m-i push. :(
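A minimal sketch of what that push script could look like (the local clone path and the date twig URL are assumptions, not the actual deployment):

# Pull the latest mozilla-central changesets into a local clone of date, then push them to the twig.
hg pull -R C:\builds\date https://hg.mozilla.org/mozilla-central
hg push -R C:\builds\date ssh://hg.mozilla.org/projects/date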
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #17)
> Cool, though something is odd. "Can't create a WebGL context" shouldn't be
> happening.  Can we get about:support info from a running browser there? 
> (Or, can you make the AMI available to me somehow and I can boot one up?)

Vlad, have you had any luck with the graphics context on the AMI I sent?
Flags: needinfo?(vladimir)
I created an instance using the AMI, logged in via RDP, and downloaded a 64-bit windows nightly zip.  I get the graphics environment that I'd expect:

Adapter Description RDPUDD Chained DD
Adapter Drivers	RDPUDD
Device ID	0xfefe
Direct2D Enabled	true
DirectWrite Enabled	true (6.2.9200.16433)
Driver Date	01-01-1970
Driver Version	6.2.9200.16434
GPU Accelerated Windows	1/1 Direct3D 10
Vendor ID	0x1414
WebGL Renderer	Google Inc. -- ANGLE (Microsoft Basic Render Driver Direct3D9Ex vs_3_0 ps_3_0)

Should I be testing/connecting a different way?
Flags: needinfo?(vladimir)
Ah ok, so it looks like if the session is autologon'd, we only get a generic display adapter, not the emulated GPU one.  The solution seems to be to create a "starter" user that will autologon, and then start a RDP session to localhost to the real cltbld user.  Then that session will be created with the right config.

Steps I did:
- Logged in as Administrator
- Ran "mstsc" to open up the RDP client
- "localhost" as the computer, "Allow me to save credentials", changed Display to 1680x1050 (or whatever is appropriate), 32-bit.
- Back on General, I hit "Save as" and saved it as "C:\temp\cltbld.rdp"
- Connect, enter cltbld and the password as username/password and check "Save these settings"
- Close the window to disconnect
- Created a cltbld-starter user
- Set a task to run when cltbld-starter logs in to just run "mstsc C:\temp\cltbld.rdp"
- Use autologon to have cltbld-starter log in

Upon reboot, when you RDP connect to cltbld, running firefox in there should show the "RDPUDD Chained DD" device instead of "Microsoft Basic Display".
Oh, in "Advanced" for the RDP settings, under server auth, I set "Connect and don't warn me" under "If server authentication fails".
Depends on: 901051
Depends on: 901057
Product: mozilla.org → Release Engineering
I've recreated tst-w64-ec2-{001..005} so that cltbld runs inside an RDP session. These should be picking up new jobs on Date shortly.
Turns out we need to set the RDP experience params properly to get certain features.  When the saved RDP session is being created, in the RDP Connection dialog, we need to go to Show Options, Experience tab.. set the dropdown to LAN (10 Mbps or higher), and then make sure all the options are checked (specifically Font smoothing, Desktop composition, and Visual styles I think, but they should all be checked).  Visually, all the text in Firefox should be antialiased instead of jagged.
We're using Remote Desktop Plus (from http://www.donkz.nl/), which doesn't look like it directly supports those options. It does support extra RDP options via the /o switch. Can you save your RDP session to a .rdp, and let me know what those checkboxes correspond to in the saved file?
Will do.  Also, along with this, we need to explicitly turn on ClearType for the cltbld user.  I don't know how to do that programmatically (I'm sure there is a way), but it's right-click-on-desktop Screen Properties -> Make text and other items larger -> Adjust ClearType text (on the left) -> check Turn On ClearType, and then hit next a bunch of times (accepting the defaults).
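If it helps, the basic on/off part of that wizard appears to map to two per-user registry values (a sketch only; the wizard also tunes per-monitor gamma/contrast settings that this skips, so it may not be a complete substitute):

Set-ItemProperty 'HKCU:\Control Panel\Desktop' FontSmoothing '2'       # enable font smoothing
Set-ItemProperty 'HKCU:\Control Panel\Desktop' FontSmoothingType 2     # 2 = ClearType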
Here are the relevant-looking bits from my rdp file:

allow font smoothing:i:1
allow desktop composition:i:1
disable full window drag:i:0
disable menu anims:i:0
disable themes:i:0
disable cursor setting:i:0
bitmapcachepersistenable:i:1
disable wallpaper:i:0

I would also suggest adding:

audiomode:i:1
audiocapturemode:i:0
videoplaybackmode:i:1
connection type:i:6
networkautodetect:i:0
bandwidthautodetect:i:1

and -maybe- changing that bandwidthautodetect to 0, though it shouldn't matter for a LAN connection.  The audio settings tell it to "play audio on the remote server" instead of trying to forward through RDP; Ehsan and I ran into some stability problems with the forwarding that can make the entire machine look like it's hung.  The server not having any audio hardware doesn't seem to affect the audio/video tests.

Note that since I made the ClearType change to enable it for cltbld (and allow it via RDP), I am getting a very green run so far where it wasn't before.
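Rather than mapping each checkbox to a /o switch, another option (a sketch only; it uses the .rdp path from the earlier steps and assumes the file doesn't already contain conflicting keys) is to append the settings directly to the saved session file:

$opts = @(
    'allow font smoothing:i:1'
    'allow desktop composition:i:1'
    'audiomode:i:1'
    'audiocapturemode:i:0'
    'videoplaybackmode:i:1'
    'connection type:i:6'
)
# Append the experience settings to the saved session file.
Add-Content -Path C:\temp\cltbld.rdp -Value $opts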
The cleartype stuff isn't enough, looking more.

It would be very helpful if you guys could fix Win64 Debug builds -- they seem to be timing out somewhere at the end of the process, in pulling out the debug info.  e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=28313722&tree=Date&full=1
Depends on: 669384
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #30)
> The cleartype stuff isn't enough, looking more.
> 
> It would be very helpful if you guys could fix Win64 Debug builds -- they
> seem to be timing out somewhere at the end of the process, in pulling out
> the debug info.  e.g.
> https://tbpl.mozilla.org/php/getParsedLog.php?id=28313722&tree=Date&full=1

That's bug 669384 right?  That seems very difficult to fix, since it's mspdbsrv dying on us.
> Per Makoto in bug 893139 comment 15, this is fixed in VS2012.

Ugh.  I would actually love to move to VS2012 for the 64-bit builds.  How easy is that?  I think the toolchain is already deployed?
Some more notes: with the latest AMI, I keep getting notified that my password is expiring today for the cltbld user.  We should disable password expiration wherever that lives.

Also, the root image is only 30GB in size.  With the AMI installed and a firefox unzip'd + tests, I only have 3GB free on C:\.  This is probably fine, but if we ever accidentally leave some junk around, it could be bad times.  There is a 400GB ephemeral partition mounted at Z:\; I wonder if we can use that as our scratch workspace for downloading builds/temp files/etc?

(I did a quick CrystalDiskMark test on an m1.medium instance type.. for the C: drive, 512k random reads are 100MB/s, writes 50MB/s.  4K is 8MB/s reads, 3.0MB/s writes.  For the Z: drive, 512k is 65MB/s reads, 63MB/s writes.  4K is 1.3MB/s reads, 1.4MB/s writes.  So the Z: drive is slower, but not by a huge amount.)

Also, I ran M1 unit tests on an m1.medium instance and a c1.xlarge instance (which has significantly more memory and compute power).  There wasn't a significant speedup in the unit test run time.
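For the password expiry nag, either of these should work (a sketch; the per-account route is narrower than changing the machine-wide policy):

# Disable expiry for the cltbld account only:
wmic useraccount where "Name='cltbld'" set PasswordExpires=false
# Or relax the machine-wide maximum password age:
net accounts /maxpwage:unlimited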
More notes: the codecs and other bits come from "Desktop Experience".  We should prep the AMI with this -- when logged in as Administrator, go to the server mgr (bottom left on the start menu), then select Manage -> Add Roles and Features.  Click Next a whole bunch of times until you get to Features on the left; scroll down to User Interfaces and Infrastructure, and check "Desktop Experience".  Hit Next, then Install.  I also checked "Automatically restart server if needed"; not sure if that's required, but the server will reboot as part of the installation.

This fixes bug 886640!
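For the AMI prep, those GUI steps can presumably be scripted; a sketch using the Server 2012 cmdlet (the -Restart flag reboots automatically if needed, matching the reboot seen during the manual install):

Import-Module ServerManager
# Installs the Desktop Experience feature (codecs, themes, etc.) and reboots if required.
Install-WindowsFeature Desktop-Experience -Restart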
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #32)
> > Per Makoto in bug 893139 comment 15, this is fixed in VS2012.
> 
> Ugh.  I would actually love to move to VS2012 for the 64-bit builds.  How
> easy is that?  I think the toolchain is already deployed?

Last week, ctalbert raised the question of moving to VS2013 as soon as it's available (because it would fix a PGO memory limit). Should we move to VS2012 now and then to VS2013 when it comes out? Or stick with what we have for now and move once VS2013 is available?

(fyi: I don't know the exact VS2013 release date, but VS2013 release candidates are already available for download.)
Hit an interesting road bump today when trying to deploy some of the above changes.

The base AMI we've been using for these instances has disappeared from Amazon. It's been replaced by a newer Win64 2012 AMI. I don't know what the differences are....but it means that any new instances created from now on will have a different base image than before. If this is a problem, we should probably copy or otherwise lock down Amazon's AMI before building our own off of it.
I *think* that this is intentional on Amazon's part; I think they update things with the latest Windows security patches periodically.  We should probably copy our base AMI once we lock one in (should keep taking the latest until that point, IMO).
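A sketch of locking one in with the AWSPowerShell module (the AMI id and name are placeholders):

# Copy Amazon's current base image into our own account so later rotations of
# their public AMI can't change what new instances are built from.
Copy-EC2Image -SourceRegion us-east-1 -SourceImageId ami-xxxxxxxx -Region us-east-1 -Name 'releng-win2012-base'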
So I deployed some changes to the VMs and re-triggered some jobs. I see a bit more green, but there are still quite a few test failures. I'm guessing the RDP or ClearType settings didn't stick.

https://tbpl.mozilla.org/?tree=Date&rev=073011f5cae4
I adjusted the ClearType enabling script to run on login instead of when the image is created, and then retriggered some jobs. Still getting similar failures :(

I need a better way to verify that the right RDP and ClearType settings are being applied.
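One quick check to run inside the cltbld RDP session after a reboot (a sketch): the ClearType registry values and the session's display adapter together show whether the settings actually took effect.

Get-ItemProperty 'HKCU:\Control Panel\Desktop' | Select-Object FontSmoothing, FontSmoothingType
# Should report the RDP display driver, not "Microsoft Basic Display":
Get-WmiObject Win32_VideoController | Select-Object Name, DriverVersion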
Bah, I missed the "Desktop Experience" stuff from comment #34. I'll try and get to that this week.
I tried using Z:\ to run the tests out of. The machine wasn't able to unpack the firefox zips without timing out. Will investigate.
I enabled "Desktop Experience" on the VMs, and it looks like we're getting even more test failures than before :(
Not currently working on this.
Assignee: catlee → nobody
Assignee: nobody → jhopkins
Depends on: 950206
Depends on: 964933
Depends on: 967030
Depends on: 977214
Depends on: 977223
Depends on: 910255
No longer depends on: 977214
No longer depends on: 977223
Assignee: jhopkins → george.miroshnykov
Assignee: gmiroshnykov → nobody
Depends on: 1024272, 1035562
No longer depends on: win64test, win32test
Summary: Run unittests on Win64 builds → Run unittests on Win64 builds in VMs
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Component: General Automation → General