run unit test on non-sse and ppc machines

RESOLVED FIXED

Status

Release Engineering
General
P2
normal
RESOLVED FIXED
10 years ago
5 years ago

People

(Reporter: aki, Assigned: jhford)

Tracking

Bug Flags:
blocking1.9.2 +
blocking1.9.1 -
wanted1.9.1 +

Firefox Tracking Flags

(blocking2.0 beta2+, status1.9.1 ?)

Details

Attachments

(3 attachments)

(Reporter)

Description

10 years ago
This is for non-Intel chipsets that have issues with the added instruction sets.
sayrer/shaver: 

do you want us to *build* and test on non-SSE VM? 
...or...
do you want us to take an existing build from an SSE VM and *test* on non-SSE VM?
(In reply to comment #1)
> do you want us to take an existing build from an SSE VM and *test* on non-SSE
> VM?

We want to test a standard (SSE-capable) build on a non-SSE machine.  If we can build with SSE on a machine without it, that's fine, but we don't want a non-SSE build, just non-SSE runtime.
(Reporter)

Updated

10 years ago
Summary: Create non-SSE build for testing → setup win32-sse builds to automatically run unittests on both win32-non-sse VMs and win32-sse VMs
(Reporter)

Updated

10 years ago
Priority: -- → P2
Tweaking summary, as this is needed for linux *and* win32.

Per discussions with Damon, running this once a night would be plenty. No need to schedule per-checkin tests, so no need for larger pool of slaves.
OS: Windows Server 2003 → All
Hardware: PC → All
Summary: setup win32-sse builds to automatically run unittests on both win32-non-sse VMs and win32-sse VMs → setup sse builds to automatically run unittests on both non-sse VMs and sse VMs
(In reply to comment #2)
> (In reply to comment #1)
> > do you want us to take an existing build from an SSE VM and *test* on non-SSE
> > VM?
> 
> We want to test a standard (SSE-capable) build on a non-SSE machine.  If we can
> build with SSE on a machine without it, that's fine, but we don't want a
> non-SSE build, just non-SSE runtime.

This means we need to be able to run unittests on pre-existing build, which we cannot yet do. Adding dependency and moving to Future per discussions with Aki and Ted. (Should have updated this bug weeks ago, but holidays and end-of-quarter overran my brain, sorry).
Assignee: aki → nobody
Component: Release Engineering → Release Engineering: Future
Depends on: 421611
sorry, lost dependent bug somehow
Depends on: 421611

Comment 6

9 years ago
(In reply to comment #4)
> > 
> > We want to test a standard (SSE-capable) build on a non-SSE machine.  If we can
> > build with SSE on a machine without it, that's fine, but we don't want a
> > non-SSE build, just non-SSE runtime.
> 
> This means we need to be able to run unittests on pre-existing build, which we
> cannot yet do.

Either build style works.
Flags: blocking1.9.1+
John,  who in RelEng owns this?

Comment 8

9 years ago
as I recall - this is blocked on the patch from Ted, no?
Yeah, and the setup of the slaves (bug 465302), from the deps listed above.
OK.  So, do we want to hold off on fingering someone as the owner?  I just wanna make sure we're all in line to crank through the remaining blockers during RC.  I think finding an owner for these bugs (i.e., including bug 465302) would be ideal.

Comment 11

9 years ago
releng is the owner, with the bug blocked on the unit test bug.  are you looking for an individual owner rather than a group?
Yeah, I'm just looking for someone I can beat with a stick once this bug becomes the last thing blocking 1.9.1.  :)

Comment 13

9 years ago
ted first, then releng :-)
I don't think this will work on these VMs, see bug 492589 comment 5. We'll probably have to get physical machines that either have ancient CPUs or allow disabling of SSE features in BIOS.
1) VMs don't support this, even when they claim they do. Details in bug#492589#c5. We have therefore deleted the two *nonsse VMs created for this, as they are useless.

2) ted is now trying to see if anyone in community has old-enough hardware, which is running a nonSSE cpu.

3) from irc: some debate about using QEMU as emulator, but dismissed because of cases where QEMU did not catch problems that crashed on an end user's nonSSE computer.

4) from irc: seems that the best choice for CPU is an AMD Athlon K7, details here: http://en.wikipedia.org/wiki/Athlon and http://en.wikipedia.org/wiki/SSE2#CPUs_supporting_SSE2. Not sure where we can buy those anymore, some quick websurfing and phone calls were fruitless.

5) I question if this bug should be "blocking1.9.1", but don't know how/who to ask. However, I could possibly see bug#492589 being reopened and marked as "blocking1.9.1", and even that is totally dependent on being able to find the right hardware.
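
For anyone vetting candidate hardware, here is a minimal sketch of checking which instruction-set flags a Linux machine reports. It assumes /proc/cpuinfo-style text with a "flags" line; the sample text and the function names are illustrative and not taken from any machine in this bug. Note that, per item 1, a reported flag list is only what the CPU (or VM) claims, so this cannot replace actually running the tests on real hardware.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_sse(cpuinfo_text, wanted=("sse", "sse2")):
    """Return the wanted instruction-set flags that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]


# Hypothetical excerpt for an old non-SSE CPU (illustrative, not captured from a real machine).
SAMPLE = "model name\t: AMD Athlon(tm) Processor\nflags\t\t: fpu vme de pse tsc msr mmx 3dnow\n"
print(missing_sse(SAMPLE))
```

On a real machine the input would be the contents of /proc/cpuinfo; an empty result means both flags are reported.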

Comment 16

9 years ago
> 
> 5) I question if this bug should be "blocking1.9.1", but dont know how/who to
> ask. 

I marked this bug blocking+ on January 7 2009. I think it's still the right thing, unless we don't support this platform anymore. Shaver or justin probably know where to ask.

Updated

9 years ago
Component: Release Engineering: Future → Release Engineering
(In reply to comment #15)
> 5) I question if this bug should be "blocking1.9.1", but dont know how/who to
> ask. However, I could possibly see bug#492589 being reopened and marked as
> "blocking1.9.1", and even that is totally dependent on being able to find the
> right hardware.

Instead of reopening bug#492589 (test manually on VMs), ted filed bug#492589 (test manually on hardware) and marked that "blocking1.9.1". We need to know this can work manually before we try automating anything, so setting as dependent bug.
Depends on: 492589
Based on the success of bug 492589, we can take this one off the blocker list.  Discussed this with Sayre and he agrees.  We'll need Ted to run the tests manually before each RC and final.  If someone disagrees, please re-nom.
Flags: blocking1.9.1+ → blocking1.9.1-
Re-nom: we don't know that 492589 was a success (we don't even have a list of the tests, let alone another run to see if the frequency of the randoms is the same!) based on the data there, and we need coverage for m-c and 1.9.1 after 3.5.0 is released.

If we can run it once, isn't it straightforward to set it up in cron or a slave script and have it run all the time, reporting to the TM tree?
Flags: blocking1.9.1- → blocking1.9.1?
Also, note that bug 492589 is now a blocker.  But, we probably do need multiple runs to see frequency of randoms (I like that phrase).
Can we get a blocking decision here, one way or another, please?

The original reporter is apparently satisfied (see comment 18), but that's based on an assumption that bug 492589 (manually running the test) was a success. That might not be a big deal, as that bug does block, meaning that we will at least have an answer about unittests on SSE builds.

I can't comment on what Shaver's asking for in comment 19, but from my product driving side comes a request to ensure that we test builds before cutting for RC in a way that lets us know if we're going to break on SSE. If that's already covered by bug 492589, then I don't think this bug blocks our release of Firefox 3.5, though it should obviously be up there on the to-do list for releng.
I would mark blocking 3.5.1 if I could -- we need automation here ASAP, but if we're willing to burn some ted-cycles on the manual runs (probably need several if there are "randoms" in play, and may want a valgrind run of the suite as well) then I'm OK with that.
Flags: wanted1.9.1.x?
Flags: wanted1.9.1+
Flags: blocking1.9.2+
Flags: blocking1.9.1?
Flags: blocking1.9.1-
Found during triage, assigning to joduinn for investigation.
Assignee: nobody → joduinn
Depends on: 501526
Where are we here? Do we need this for 1.9.1 still or just 1.9.2? Has any work been done in the last 7 weeks?
status1.9.1: --- → ?
Flags: wanted1.9.1.x?
I confirmed with rsayrer that this is still needed in advance of 1.9.2 release.

Nothing to do here until blocking bugs are fixed, so putting this bug back in the pool.
Assignee: joduinn → nobody
John Ford has been working on this.
Assignee: nobody → jford
Depends on: 521437
Depends on: 522129
After doing some tests, I can't currently run the PPC test because I don't have any builds that are universal and have symbols.  I am going to look at getting windows and linux going tomorrow.
Depends on: 457753
Summary: setup sse builds to automatically run unittests on both non-sse VMs and sse VMs → run unit test on non-sse and ppc machines
What's wrong with these builds?
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx/

Aside from the fact that there are debug builds mixed in there, which should get fixed by some other bug whose number escapes me, those are the builds from "OS X 10.5.2 mozilla-central build", and they are universal builds.
Hmm, I seem to have gotten a debug build every other time I tried.  I have now tried with http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx/1255468611/firefox-3.7a1pre.en-US.mac.dmg and it seems to be running fine.
Created attachment 406304 [details]
time message

Unsurprisingly, these machines are very slow.  I am getting an unresponsive script warning.  Is there a way to disable this prompt?
Depends on: 522379
Depends on: 522380
Depends on: 522382
Depends on: 522383
Aki showed me dom.max_script_run_time.  I am going to set this to 0 and see if this changes anything.
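
A rough sketch of how that pref could be applied to a test profile, assuming the usual user.js mechanism. The helper name and the temporary directory are placeholders, not part of the actual slave setup.

```python
import os
import tempfile


def set_user_pref(profile_dir, name, value):
    """Append a user_pref line to user.js in the given profile directory."""
    if isinstance(value, bool):
        rendered = "true" if value else "false"
    elif isinstance(value, int):
        rendered = str(value)
    else:
        rendered = '"%s"' % value
    line = 'user_pref("%s", %s);\n' % (name, rendered)
    with open(os.path.join(profile_dir, "user.js"), "a") as handle:
        handle.write(line)
    return line


profile = tempfile.mkdtemp()  # stand-in for a real test profile directory
print(set_user_pref(profile, "dom.max_script_run_time", 0), end="")
```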
Blocks: 488847
Depends on: 523293
Depends on: 527996
Shaver: you marked this as blocking, are you saying we can't release without running the tests? Lots of dependencies here that are unresolved :(

Comment 33

9 years ago
a bunch of ppc and non-sse machines were just brought up in the mv server room and I just bought 2 more non-sse machines.  guess I am just pointing out there is progress, but if this is actually blocking, seems it needs to be a drop-everything for rel-eng, no?
I should probably do a quick status update on this.  We just landed a large part of the automation/master side work to trigger test runs and I have verified that it is triggering test runs.  I am currently blocked on getting the slaves up and running.  I have our linux and leopard slaves working.  I am having a little more difficulty with tiger and windows.  As I understand it, tiger is low priority and Windows is critical as it is most of our user base.

I had a setback today in that both of our Windows XP slaves completely gave out.  I have installed Windows on one of the new machines and will be configuring it first thing tomorrow.

Currently the test runs are triggered at the completion of nightly builds.  Because mozilla-1.9.2 builds [1] do not upload required artifacts (tests and symbols), I can only run mozilla-central [2] tests in a fully automated fashion.  The fix would be to start uploading the required symbols and test packages for mozilla-1.9.2 nightlies.  I am assuming this is possible because we are uploading these files for the 1.9.2 tinderbox-builds.

I would also like to know how long this testing is going to be required.  If it is only going to be until 3.6 is released, it might be acceptable to trigger the jobs manually.  If this is something we need going forward for all releases, I can look into the more permanent fixes.  In that case, I would also like to know which branches need to be covered.

Of all the bugs that are blocking this, only setting up windows slaves (bug 522379) is actually blocking running these tests against mozilla-central.  I will file the bug for getting the mozilla-1.9.2 artifacts uploaded shortly and will add it as blocking this bug.

[1] example: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-1.9.2/
[2] example: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla-central/
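
To illustrate the artifact gap described above, here is a small sketch that checks a directory listing for the tests and symbols packages. The file suffixes and the sample listings are assumptions made for illustration; check them against what the nightly directories actually contain.

```python
REQUIRED_SUFFIXES = (".tests.tar.bz2", ".crashreporter-symbols.zip")  # assumed naming


def missing_artifacts(filenames, required=REQUIRED_SUFFIXES):
    """Return the required suffixes that no file in the listing ends with."""
    return [
        suffix
        for suffix in required
        if not any(name.endswith(suffix) for name in filenames)
    ]


# Hypothetical directory listings, for illustration only.
central = [
    "firefox-3.7a1pre.en-US.mac.dmg",
    "firefox-3.7a1pre.en-US.mac.tests.tar.bz2",
    "firefox-3.7a1pre.en-US.mac.crashreporter-symbols.zip",
]
branch = ["firefox-3.6b5pre.en-US.mac.dmg"]
print(missing_artifacts(central))
print(missing_artifacts(branch))
```

A branch whose listing returns a non-empty result would still need manual triggering.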
This is tested to work.  The Windows and Linux slaves are running P3-1.2GHz Tualatin core CPUs.  The Leopard ones are running on either a G4-1.42GHz Mini or a Dual-1.0GHz PowerMac G4.  Until the above issue with mozilla-1.9.2 producing packaged unittests is addressed, I cannot run Tiger tests, as mozilla-central (the only branch I am running on currently) does not run on Tiger.  Adding a new arch/os to this testing isn't a lot of overhead, as long as it can run the standard unit test commands and we already produce compatible binaries.
No longer depends on: 491890
These machines have fallen down a very long time ago and are going to need some serious love to get them back up.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → INCOMPLETE

Comment 37

8 years ago
what does this mean?
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

Comment 38

8 years ago
ping
blocking2.0: --- → beta1+
(In reply to comment #36)
> These machines have fallen down a very long time ago and are going to need some
> serious love to get them back up.

(In reply to comment #37)
> what does this mean?

Sorry, let me clarify where things stand right now:

* Linux on p3 is running properly and has been since November 23, 2009.  The tests have been orange/red since we started them running, this is because of tests timing out or not running correctly, according to Ted.

* WindowsXP on p3 was running tests with orange/red tests, like linux, until November 27, 2009, but has not run since then. I don't know what the status of this machine is currently, but it is not responding to pings. It's unclear if this is a hardware problem or a WinXP license problem.

* leopard slave on PowerPC G4 was working correctly with orange/red tests, until January 22, but has not reported anything since then. I've looked around on the machine, and found it stuck on a job since January 22. I've killed that, and rebooted the machine this morning. Up until January 22, this slave was working properly. I will see what happens overnight.

* tiger slave is not working and I am not sure that it ever has worked.  

All of these were running for nightly builds on mozilla-central. They take 6-12 hours to run one cycle on one nightly build, so we've never had enough horsepower to run this on other branches, or more frequently than just once per nightly.
(In reply to comment #39)
> * Linux on p3 is running properly and has been since November 23, 2009.  The
> tests have been orange/red since we started them running, this is because of
> tests timing out or not running correctly, according to Ted.

Linux tests were broken for a short time due to fallout from bug 549427.  I imported http://hg.mozilla.org/build/buildbotcustom/rev/a20f711dc417 and did a restart.

> * leopard slave on PowerPC G4 was working correctly with orange/red tests,
> until January 22, but has not reported anything since then. I've looked around
> on the machine, and found it stuck on a job since January 22. I've killed that,
> and rebooted the machine this morning. Up until the 22 of january, this slave
> was working properly. I will see what happens overnight. 

This machine came back from reboot but is not currently able to connect to the geriatric master.  I am seeing messages like "<twisted.internet.tcp.Connector instance at 0x77f3c8> will retry in 36 seconds" in the slave log and nothing on the master side.  I was able to nc the master on the correct slave port and did get the pb prompt.  I am not sure what is going wrong.
(In reply to comment #40)
> This machine came back from reboot but is not currently able to connect to the
> geriatric master.  I am seeing messages like "<twisted.internet.tcp.Connector
> instance at 0x77f3c8> will retry in 36 seconds" in the slave log and nothing on
> the master side.  I was able to nc the master on the correct slave port and did
> get the pb prompt.  I am not sure what is going wrong.

Just realised that this was the slave not able to find the master as the old domain name must have disappeared.  

s/geriatric-master.mv.mozilla.com/geriatric-master.build.mozilla.org/

has fixed this and the slave is back in the pool.
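
The substitution above, written as a small sketch. The hostnames are the ones quoted in this comment; the sample config line and its variable name are hypothetical, since the slave's actual config file isn't shown here.

```python
OLD_HOST = "geriatric-master.mv.mozilla.com"
NEW_HOST = "geriatric-master.build.mozilla.org"


def repoint_master(config_text, old=OLD_HOST, new=NEW_HOST):
    """Return config_text with every occurrence of the old master hostname replaced."""
    return config_text.replace(old, new)


# Hypothetical line from a slave's config file; the variable name is illustrative.
sample = "buildmaster_host = 'geriatric-master.mv.mozilla.com'\n"
print(repoint_master(sample), end="")
```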
Depends on: 563831
(In reply to comment #39)
> * WindowsXP on p3 was running tests with orange/red tests, like linux, until
> November 27, 2009, but has not run since then. I don't know what the status of
> this machine is currently, but it is not responding to pings. Its unclear if
> this is a hardware problem or a WinXP license problem. 

reinstallation of windows on this machine is being tracked in bug 563831.  Installation and configuration of automation tools will be tracked in this bug.
joduinn found two more Mac PPC systems.  Tracking the work to add these to geriatric master in bug 549559
Depends on: 549559
Created attachment 445439 [details] [diff] [review]
buildbot-configs patch

this patch brings long overdue improvements to the geriatric master.

-Understand variants on the geriatric master instead of build master
  -adding a new variant requires no change to production master
-Split tests into their own builders
  -one builder per test on each platform variant
Attachment #445439 - Flags: review?(aki)
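
A minimal sketch of the "one builder per test on each platform variant" idea. The platform and variant names, the suite list, and the naming scheme are all hypothetical; the real configuration lives in the attached patch and may look quite different.

```python
from itertools import product

# Hypothetical names; the real configuration in the attached patch may differ.
VARIANTS = {
    "linux": ["p3"],
    "win32": ["p3"],
    "macosx": ["g4", "g5"],
}
SUITES = ["xpcshell", "crashtest", "reftest", "mochitest-plain"]


def builder_names(variants=VARIANTS, suites=SUITES):
    """Return one builder name per (platform, variant, suite) combination."""
    names = []
    for platform in sorted(variants):
        for variant, suite in product(variants[platform], suites):
            names.append("%s-%s-%s" % (platform, variant, suite))
    return names


names = builder_names()
print(len(names))
print(names[0])
```

With the variants kept in one table like this, adding a new variant is a one-line change on the geriatric master side.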
Created attachment 445440 [details] [diff] [review]
buildbotcustom patch

required buildbotcustom changes
Attachment #445440 - Flags: review?(aki)
(Reporter)

Updated

8 years ago
Attachment #445439 - Flags: review?(aki) → review+
(Reporter)

Updated

8 years ago
Attachment #445440 - Flags: review?(aki) → review+
We now have OSX 10.5 Coverage on Leopard.

Work on the Windows XP slave setup is being tracked in bug 566955
Depends on: 566955
exceptions.ValueError: incomplete format

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274349723.1274351109.10254.gz#err0

revision=WithProperties("%(got_revision)"),
should be 
revision=WithProperties("%(got_revision)s"),
Did you run this patch overnight? I am not sure if you would have been able to
hit it on staging or not (depends on the sendchanges).
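
For reference, the same error can be reproduced with plain Python %-formatting, which is a reasonable stand-in for what happens to the format string here (that WithProperties interpolates this way is an assumption based on the matching error message). The revision value below is made up.

```python
props = {"got_revision": "abc123def456"}  # illustrative value only

try:
    "%(got_revision)" % props  # no conversion type after the closing parenthesis
except ValueError as error:
    print("broken:", error)

print("fixed:", "%(got_revision)s" % props)
```

The trailing "s" is the conversion type; without it the format specifier is incomplete.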
So are we done here?
blocking2.0: beta1+ → beta2+
(In reply to comment #52)
> So are we done here?
 
I believe so.  Between June 17 and today the buildbot master was down.  It was brought down to upgrade the RAM in the Mountain View ESX hosts and was never started up again afterwards.  I have filed bug 574415 to add all the old machines into our nagios alerts to avoid this problem in the future.

I have started tests against the latest nightlies and will report back with the results.
I ran the tests with the latest nightly builds.  The status is as follows:

-Linux
  -All tests failed because of an incompatibility with SELinux
-Leopard G5
  -xpcshell orange 786/3
  -crashtest green 1611/0/10
  -reftest green 4449/0/219
  -mochitest-plain green 183334/0/1474
-Leopard G4
  -xpcshell, crashtest, reftest same as G5
  -mochitest-plain still running with at least one test failure
-Win32
  -xpcshell timed out
  -crashtest green 1611/0/10
  -reftest orange 4434/20/214
  -mochitest-plain orange 206966/144/1469

I have disabled SELinux on the P3 computer and launched another round of tests.  It's looking like the tests are actually running with SELinux off.  I will report back in a couple hours with the status of Leopard-G4's mochitest results and the Linux results.


For the curious, the results of this testing go to http://tinderbox.mozilla.org/showbuilds.cgi?tree=GeriatricMachines
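
A quick sketch of how the count strings in the status list above could be turned into a colour. It assumes the fields are passed/failed, with an optional third count (todo or known failures), which matches the numbers reported here but is not confirmed by the harnesses themselves.

```python
def parse_counts(summary):
    """Split 'passed/failed[/other]' into integers; the third field is optional."""
    parts = [int(piece) for piece in summary.split("/")]
    passed, failed = parts[0], parts[1]
    other = parts[2] if len(parts) > 2 else 0
    return passed, failed, other


def colour(summary):
    """Green when the failure count is zero, orange otherwise."""
    _, failed, _ = parse_counts(summary)
    return "green" if failed == 0 else "orange"


for suite, summary in [("xpcshell", "786/3"), ("crashtest", "1611/0/10"), ("reftest", "4434/20/214")]:
    print(suite, colour(summary), summary)
```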
(In reply to comment #54)
> I will
> report back in a couple hours with the status of Leopard-G4's mochitest results
> and the Linux results.

The linux results are:
-xpcshell orange 786/4
-crashtest green 1612/0/9
-reftests 4418/6/244
-mochitest-plain timed out

The G4 and Linux mochitest-plain timed out after running '/tests/layout/style/test/test_value_cloning.html' by not generating any output for 5400 seconds. These tests all run slowly on these slow machines, so the oranges/timeouts are "normal", and fixing these would likely require reworking the test suites. For previous releases, we've avoided that by always needing human inspection (usually by ted iirc), and this seems to be still true here. 

It feels like the infrastructure setup work is done here, and if people want to rework tests to pass green on slower machines, that work should be tracked as separate bugs with the specific test suite owners. Does that seem reasonable?
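
For context on the "no output for 5400 seconds" timeouts, here is a rough sketch of an idle-output watchdog. This is not the buildbot implementation, just an illustration of the technique; the function name and the demo command are made up.

```python
import queue
import subprocess
import sys
import threading


def run_with_output_timeout(command, idle_seconds):
    """Run command; kill it if it prints nothing for idle_seconds.

    Returns (timed_out, lines_seen).
    """
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line, then a None sentinel at end of stream.
        for line in process.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    seen = []
    while True:
        try:
            line = lines.get(timeout=idle_seconds)
        except queue.Empty:
            # Nothing arrived within the idle window: treat the job as hung.
            process.kill()
            process.wait()
            return True, seen
        if line is None:
            process.wait()
            return False, seen
        seen.append(line.rstrip("\n"))


quiet = [sys.executable, "-c", "import time; print('start', flush=True); time.sleep(30)"]
print(run_with_output_timeout(quiet, idle_seconds=1))
```

The timer resets on every line of output, so a slow test that keeps printing survives while a silent one is killed. That is why a single long-running test file on a slow machine can trip the limit even when nothing is actually broken.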
(In reply to comment #55)
> (In reply to comment #54)
> > I will
> > report back in a couple hours with the status of Leopard-G4's mochitest results
> > and the Linux results.
> 
> The linux results are:
> -xpcshell orange 786/4
> -crashtest green 1612/0/9
> -reftests 4418/6/244
> -mochitest-plain timed out
> 
> The G4 and Linux mochitest-plain timed out after running
> '/tests/layout/style/test/test_value_cloning.html' by not generating any output
> for 5400 seconds. These tests all run slowly on these slow machines, so the
> oranges/timeouts are "normal", and fixing these would likely require reworking
> the test suites. For previous releases, we've avoided that by always needing
> human inspection (usually by ted iirc), and this seems to be still true here. 
> 
> It feels like the infrastructure setup work is done here, and if people want to
> rework tests to pass green on slower machines, that work should be tracked as
> separate bugs with the specific test suite owners. Does that seem reasonable?

Just checked again today, all are still running with similar results.  It looks like the infrastructure is set up. :-)
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
since this was blocking 1.9.2, can we set status.1.9.2 to at least final-fixed? otherwise it still appears in queries as not being fixed.
Am I able to set those flags?
Product: mozilla.org → Release Engineering