Closed Bug 492589 Opened 15 years ago Closed 15 years ago

manually run unittest on two old non-SSE2 boxes

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: ted)

References

Details

(Keywords: fixed1.9.1)

Attachments

(4 files)

There were some non-SSE VMs created in bug 462190. We'd like to get unittests running on them regularly, but as a stopgap I'm going to try one-off unittest runs on them.
Flags: blocking1.9.1+
Status: NEW → ASSIGNED
I'd like to get a sanity-check that these VMs do in fact have SSE disabled, but I need help. /proc/cpuinfo is not inspiring confidence, certainly:

[cltbld@moz2-linuxnonsse-slave01 builds]$ grep sse /proc/cpuinfo
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss constant_tsc up
Mrz might know for sure re: comment#1
Yeah, I really don't believe these VMs are non-SSE. test.c:
#include <stdio.h>

int main(int argc, char **argv)
{
/* Query CPUID leaf `func`; the feature flags come back in ECX/EDX. */
#define cpuid(func,ax,bx,cx,dx)\
    __asm__ __volatile__ ("cpuid":\
    "=a" (ax), "=b" (bx), "=c" (cx), "=d" (dx) : "a" (func));

    int a, b, c, d;
    cpuid(0x1, a, b, c, d);

    /* Leaf 1: EDX bit 25 = SSE, EDX bit 26 = SSE2, ECX bit 0 = SSE3. */
    if (d & (1 << 25)) { printf("sse enabled\n"); }
    if (d & (1 << 26)) { printf("sse2 enabled\n"); }
    if (c & (1 << 0))  { printf("sse3 enabled\n"); }

    return 0;
}
[cltbld@moz2-linuxnonsse-slave01 builds]$ gcc -o testsse test.c
[cltbld@moz2-linuxnonsse-slave01 builds]$ ./testsse
sse enabled
sse2 enabled
sse3 enabled
You should be able to check the VM's config - Phong, is that right?
I don't think this will work at all, per VMware (quoted here):
http://www.novosco.com/articles/2008/08/19/vmware-esx-and-enhanced-vmotion-compatibility/

I don't know if we're using Enhanced VMotion Compatibility or not, but if not:
        * SSE features can be used by user-level code (applications).
        * Mask does not work for user-level code (i.e. applications).
        * In user-level code, CPUID is executed directly on hardware and is not intercepted by VMware.
        * Thus, VM cannot reliably hide SSE from an application

Even if we are:
EVC utilizes hardware support to modify the semantics of the CPUID instruction only. It does not disable the feature itself. For example, if an attempt to disable SSE4.1 is made by applying the appropriate masks to a CPU that has these features, this feature bit indicates SSE4.1 is not available to the guest or the application, but the feature and the SSE4.1 instructions themselves (such as PTEST and PMULLD) are still available for use. This implies applications that do not use the CPUID instruction to determine the list of supported features, but use try/catch of undefined instructions (#UD) instead, can still detect the existence of this feature.

This won't let us test what we're trying to test.
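
For illustration, a minimal sketch (not from the bug) of the #UD-based probing the quoted article describes: instead of trusting CPUID, execute an SSE2 instruction and catch the resulting SIGILL. On a VM that only masks CPUID, the instruction still executes, which is why masking alone can't simulate a non-SSE2 machine. Assumes x86 Linux with GCC.

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf probe_env;

static void on_sigill(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);   /* jump back out of the faulting instruction */
}

int main(void)
{
    signal(SIGILL, on_sigill);
    if (sigsetjmp(probe_env, 1) == 0) {
        /* paddq on xmm registers is an SSE2 instruction; on a CPU that
           genuinely lacks SSE2 it raises #UD, delivered by Linux as SIGILL. */
        __asm__ __volatile__ ("paddq %xmm0, %xmm0");
        printf("sse2 instructions execute\n");
    } else {
        printf("sse2 instructions trap (SIGILL)\n");
    }
    return 0;
}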
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → WONTFIX
I'm trolling for community help; there's probably someone out there with older hardware that we can get to do this:
http://forums.mozillazine.org/viewtopic.php?f=23&t=1247655
http://www.nongnu.org/qemu/qemu-tech.html#SEC3

QEMU does not have SSE support. It also loads VMDK images, if I am not mistaken, and it runs on Windows, OS X, and Linux.
Found an old P3 machine, Egg; going to do a unit test run. Maybe I need to find a spare disk that I can format.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Assignee: ted.mielczarek → jford
Status: REOPENED → ASSIGNED
Summary: try a one-off unittest run on non-SSE VMs → try a one-off unittest run on some random old box that reed found
Summary: try a one-off unittest run on some random old box that reed found → try a one-off unittest run on some random old box that reed found (non-SSE)
I have two old HP servers (BTek, Spider) that I have been given the OK to use by Reed. I am just about done installing Ubuntu on one, and I am waiting on an XP license for the other.
(In reply to comment #8)
> found an old P3 machine Egg,  going to do a unit test run.  Maybe I need to
> find a spare disk that i can format

Actually, there were 3 machines: btek, spider and egg. 

egg turned out to be way older, so we cannibalized parts from egg to increase
the RAM and replace the useless video card in btek.

btek now has Ubuntu 9.04 installed, with a cltbld account on it.
However, it still needs network configuration, DNS configs, etc.

spider now has WinXP installed, with a license key and a cltbld account on it.
It also needs network configuration, DNS configs, etc. Both machines will also
need VNC (or RDP) installed for Ted to be able to remotely connect and use them
for running tests.


Per discussion with shaver and damons this morning about other priorities,
these machines are being handed back to IT to finish the OS setup. Once both
are ready, please reassign back, so Ted can try a manual unittest run on them.
(In reply to comment #3)
> Yeah, I really don't believe these VMs are non-SSE. test.c:
> [snip test.c and its output]

Also, jhford ran ted's diagnostic program on btek and only got "sse enabled", as expected: these machines are dual P3 CPUs running at 500MHz and do not have SSE2 or SSE3.
Summary: try a one-off unittest run on some random old box that reed found (non-SSE) → manually run unittest on two old non-SSE2 boxes
Assignee: jford → server-ops
Component: Release Engineering → Server Operations
Flags: blocking1.9.1+
OS: Linux → All
QA Contact: release → mrz
Hardware: x86 → All
> 
> Per discussion with shaver and damons this morning about other priorities,
> these machines are being handed back to IT to finish the o.s. setup. Once both
> are ready, please reassign back, so Ted can try a manual unittest run on them.

I know we talked about this on the phone, but what IT steps are left? Are the boxes up and running?
(In reply to comment #12)
> > 
> > Per discussion with shaver and damons this morning about other priorities,
> > these machines are being handed back to IT to finish the o.s. setup. Once both
> > are ready, please reassign back, so Ted can try a manual unittest run on them.
> 
> i know we talked about this on the phone but what IT steps are left?  Boxes are
> up and running?

The boxes are now reassembled and at reed's desk. They need to be racked somewhere (downstairs in K?), and then also need the following from comment #10:

"...network configuration, DNS configs, etc. These will also need
VNC (or RDP) ..."
Reed gets this because they're sitting next to his desk :)
Assignee: server-ops → reed
spider is racked and cabled... 5/19 on the switch just needs its VLAN changed from 200 to 500, and it'll be ready to go. I just turned RDP on for now. If you need VNC, you're welcome to install it. It'll be accessible at spider.office.mozilla.org within the MV Office VPN once the VLAN has been changed and the networking restarted.

btek, on the other hand, is dead. When it was plugged in, its power supply instantly died and made smelly smoke, as it was set for 115V instead of 230V. We can either try to replace the power supply or just get another box. Thoughts?
I'd go with whatever you think is fastest.
I've got a mochitest run started on spider (WinXP). I downloaded the latest 1.9.1 unittest build that was available, which was this one:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-1.9.1-win32-unittest/1242749541/
btek is broken.  I have balsa here and I can move the good hardware from btek into balsa when there is some spare time.
I have rebuilt balsa's hardware to have 2x 500MHz P3 CPUs that are identical to the ones in spider. I have also installed the SCSI card and drives from btek, but it isn't booting properly and the hard drives are not being picked up by the SCSI BIOS. If the SCSI card cannot be coerced into working, there are some ATA drives left over from egg which can be used, but they would require a reinstall of Linux. I have the rebuilt balsa and the remnants of Egg and Btek by my desk. What do I do with them? Egg is totally broken, but btek could be useful for spares.
The mochitest run on spider finished without crashing. I'll run through the rest of the test suites today.
I ran through all of our test suites (mochitest, mochitest chrome, mochitest browser-chrome, mochitest a11y, reftest, crashtest, xpcshell tests) on spider. There were some test failures (that I didn't look into very deeply, but most look like the same kind of intermittent failures as on tinderbox), but no crashes.
I have rebuilt balsa and it does not work at all. The options I can think of are running a dual boot on spider, which effectively makes automation impossible, or finding new hardware.
Assignee: reed → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Moving to releng.
Flags: blocking1.9.1+
And marking blocking1.9.1+. We need to run this after all JS bugs are in, before the first RC, and before each subsequent RC.
(In reply to comment #21)
> I ran through all of our test suites (mochitest, mochitest chrome, mochitest
> browser-chrome, mochitest a11y, reftest, crashtest, xpcshell tests) on spider.
> There were some test failures (that I didn't look into very deeply, but most
> look like the same kind of intermittent failures as on tinderbox), but no
> crashes.

We need to look at the test failures: only one failure mode (generating SSE2 code on a non-SSE2 machine, and calling it) will result in a SIGILL crash.  We also need to know that the x87/non-SSE2 code that we generate is correct!
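
For context, a minimal sketch (hypothetical, not Mozilla's actual JIT code) of the kind of runtime dispatch being tested here: detect SSE2 once via CPUID and route to either an SSE2 path or an x87 fallback. Taking the SSE2 path on a non-SSE2 CPU is the SIGILL failure mode described above; taking the x87 path but generating wrong x87 code is the silent-miscompilation case.

#include <stdbool.h>
#include <stdio.h>

/* CPUID leaf 1, EDX bit 26 = SSE2 (same check as the test.c above). */
static bool cpu_has_sse2(void)
{
    unsigned int a, b, c, d;
    __asm__ __volatile__("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "a"(1));
    return (d >> 26) & 1;
}

/* Stand-ins for the two code paths; real generated code would use
   cvtsi2sd (SSE2) vs. fild/fstp (x87) here. */
static double convert_sse2(int x) { return (double)x; }
static double convert_x87(int x)  { return (double)x; }

int main(void)
{
    double (*convert)(int) = cpu_has_sse2() ? convert_sse2 : convert_x87;
    printf("sse2=%d, convert(42)=%f\n", cpu_has_sse2(), convert(42));
    return 0;
}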
Ok, I'll collate and attach them to the bug in a bit.
Assignee: nobody → ted.mielczarek
First, the simple:
crashtest, mochitest-a11y, xpcshell: 0 failures
mochitest-chrome: 1 failure
mochitest-browser-chrome: 6 failures
2 of these are because bug 475383 hasn't landed on branch. The others may just be fallout from that failure; I didn't investigate fully.
reftest had a bunch of failures, but then I noticed that the first one was colordepth.html and realized that my RDP connection was using 16-bit color, so those failures are probably all a result of that.
Attached file mochitest failures
mochitest failures: 21
9 of these are known: 2 are from bug 475383 again, 7 are from the geolocation tests (bug 489817). I didn't investigate the rest.
I'll fire off another run today as well (on the same build).
What happened with that run, please?
Do we have updated info here?
Sorry, lost track of this over the weekend. Summary from the second run:

mochitest-plain: somewhat different results, will attach log in a minute
mochitest-chrome: exact same result as previous run
mochitest-browser-chrome: one additional failure:
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_1.js | Timed out

Anything not mentioned still had zero failures.
Ok, there are 23 failures in this log, of which 9 are known (as before, the plugin tests and geolocation tests).
So there are 14 bugs to file, I guess. :-(

If we're seeing consistent fails on mochitest-chrome, doesn't that mean that they're probably not just the usual sometimes-orange randoms?
The mochitest-chrome failure was a known random that didn't get a fix backported to branch, bug 468189. (Although interestingly on this machine it sure seems repeatable!)
I think the browser-chrome failures are all fallout from the plugin test failing. It opens a tab, and then doesn't clean it up if it doesn't finish successfully. We should file a bug on making that test clean up after itself better.
In the mochitest failures, I looked at:
31268 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/ajax/offline/test_fallback.html | Fallback page displayed for top level document
I think this test is broken; it has a 3-second timeout internally:
http://mxr.mozilla.org/mozilla-central/source/dom/tests/mochitest/ajax/offline/test_fallback.html?force=1#71

This machine is *really* slow, so it wouldn't surprise me if we hit that.
(In reply to comment #39)
> I think the browser-chrome failures are all fallout from the plugin test
> failing. It opens a tab, and then doesn't clean it up if it doesn't finish
> successfully. Should file a bug on making that test cleanup after itself
> better.

I re-ran browser-chrome with the plugin test moved out of the way, and got just one failure:
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/places/tests/perf/browser_ui_history_sidebar.js | Timed out

Suspiciously, this is in a "tests/perf" directory, and it looks like the test does a lot of work. The browser-chrome harness has a 30-second timeout, so it seems likely that this test just can't finish in time.
How many times have we looped through these test runs so far?
Just two runs through the full test suite, on the same build (mentioned in comment 17). Happy to do more runs, or on a newer build, whatever floats your boat.
Ted, be ready to run these on notice. I'm guessing we'll want to run this before we ship the RC.
Will do, I was planning on grabbing a build from this morning and giving it another run.
Might be time to run this again?
Yeah, can use the b99 builds when they're out.
I re-ran this on a build from Thursday(?) and got extremely similar results, although I didn't finish the analysis. I think this box is currently MIA due to the office move, so hopefully someone can plug it back in on Monday.
(In reply to comment #48)
> I think this box is currently MIA due to the
> office move, so hopefully someone can plug it back in on monday.

Both nonsse machines are AWOL. They didn't show up in the new server room, or at any of the RelEng desks in the new office. I already went back to the Building K server lab this morning, and they are not there.

I'll go back and search a few other rooms in Building K later today.
One non-SSE machine was in the server room but was powered off. It is now connected using a DHCP address of 10.250.6.227, but I am working on getting a DNS hostname for it in bug 496946.

This machine is a 500MHz P3 with 384MB of RAM. SSH is working, and I will email the username and password to Ted.
(In reply to comment #49)
> (In reply to comment #48)
> Both nonsse machines are AWOL. They didnt show up in new server room, or any of
> RelEng desks in new office. I already went back to Building K server lab this
> morning, and they are not there. 
> 
> I'll go back and search a few other rooms in Building K later today.

John Ford and I went dumpster-diving in the old Buildings K and S. We found the nonsse machine, as well as a few other nonsse and PPC machines, and brought them all back to the new office.

We should have the pre-existing nonsse machine back online sometime today, and will find out how many of the other machines work at all. Very happy with the additional nonsse and PPC machines found; quite a productive afternoon's scavenging!
I've got a Mochitest run started on the Linux machine.
Looks like we're done with all blockers for RC.  Need to run everything again?
(In reply to comment #51)
> (In reply to comment #49)
> > (In reply to comment #48)
[snip] 
> We should have the pre-existing nonsse machine back online today sometime...

Forgot to update this bug earlier. jhford got the nonsse win32 machine up and running again on Tuesday. DNS is still a bit unsettled in the new office, but these IPs work:

linux: 10.250.6.227
win32: 10.250.5.20
(In reply to comment #54)
> DNS is still a bit unsettled in new office, but these IPs
> work:
> 
> linux: 10.250.6.227
> win32: 10.250.5.20

Are there bugs on file to get these assigned static IPs?
There is one for goat, the Linux machine (bug 496946). The Windows one (spider) had a working hostname before, but I guess it was removed when it was moved to the junk pile. I can file a separate bug or expand the Linux one; either works for me.
Ted: can you run a set of unit tests on RC3 using these boxes so we can close this out?
I'm OOTO today, and traveling this weekend, so I can't get to it until Monday. If you want it sooner than that you'll have to find someone else, sorry.
Adding Joel Maher as he will be running the tests this afternoon.
I am seeing a LOT more errors in the runs I did on Linux/Windows this weekend.

For example, the Linux mochitests have 331 failures (I ran twice to verify)! Also, the Linux browser-chrome tests did not finish (verified twice), as they were hung on sessionrestore tests!


# of failures
test             linux     windows
xpcshell         0         0
reftest          3         123
crashtest        0         0
mochitest        331       13
chrome           9         0
browser-chrome   20        10
a11y             0         0
I don't believe I ever ran the unittests on that Linux box, as it didn't exist when I started this testing.

The Windows reftest results may be completely wrong, as you have to be careful to connect using 24-bit color with Remote Desktop. The mochitest/browser-chrome results look to be in line with what I saw, and were all harmless failures (tests relying on the test plugin, which is a known failure on branch packaged tests currently, or tests that are intermittent failures/timeouts on slow hardware).
Let me try the reftests again on Windows. Thanks for the data, Ted.
This is the failure log after re-running browser-chrome tests after removing:
mochitest/browser/browser/base/content/test/browser_pluginnotification.js


cltbld@SPIDER /c/ff35_unittest/mochitest
$ grep UNEXPECTED-FAIL bchrome.log
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/places/tests/browser/browser_410196_paste_into_tags.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/places/tests/perf/browser_ui_history_sidebar.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_1.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_2.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_3.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_4.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_5.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_6.js | Timed out
TEST-UNEXPECTED-FAIL | chrome://mochikit/content/browser/browser/components/preferences/tests/browser_privacypane_7.js | Timed out

cltbld@SPIDER /c/ff35_unittest/mochitest
$
Can we get an assessment here of whether or not we are good to go?
I'm pretty sure we're good, based on that log.
Yeah, those are just timeouts from tests that take too long because this machine is so godawful slow. If we're going to get automated builds on this machine, we should file a bug to track the test issues we'll need to resolve to get green tests on this machine, but I don't see anything that's an actual problem with running the builds here.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
If the work here is finished, could you please mark status1.9.1 accordingly, or at least add the fixed1.9.1 keyword? I'm querying Bugzilla for unfinished 1.9.1 bugs, and this one is still marked as unfinished. Or use any other way of marking that we are done here that I can query on.
Product: mozilla.org → Release Engineering