run unittests on PGO-enabled builds

Status

RESOLVED FIXED
Type: defect
Priority: P1
Severity: normal
Opened: 12 years ago
Last modified: 6 years ago

People

(Reporter: ted, Assigned: mikeal)

Tracking

Version: other
Platform: x86
OS: Windows Server 2003
Bug Flags: blocking1.9+

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 8 obsolete attachments)

1.49 KB, patch
2.32 KB, patch (rcampbell: review+)
5.57 KB, patch
Given bug 420069 and other as-yet-undiscovered dependencies of bug 419893, we should be running tests on the PGO builds. I fear for my life if I were to suggest enabling PGO on the unit test box, given the cycle time hit, but we need to do something here. Presumably a reftest run would show some failures with the PGO build if we're failing to lay out <ul> properly.
Unit tests need the whole build tree, pretty much, but it looks like we can run a slow version of the reftests (data: URI comparison) with just a copy of the layout/ source pull, such that you stick reftest-cmdline.js in $COMPONENTS, reftest.jar in chrome, and then run with --reftest as per

http://developer.mozilla.org/en/docs/Running_Mozilla_Tests

?
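For concreteness, a standalone run like the one described above might look roughly like this; it is only a sketch, and the application directory, harness file locations, and manifest path are assumptions (the wiki page above is the authoritative reference):

# copy the reftest harness into a packaged build (paths are assumptions)
cp reftest-cmdline.js $APPDIR/components/
cp reftest.jar $APPDIR/chrome/
# run the layout reftests from a pulled copy of layout/
$APPDIR/firefox --reftest /path/to/mozilla/layout/reftests/reftest.list > reftest.log 2>&1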
The least-code approach to this would just be to bring up another set of unit test machines building the code with PGO and --enable-tests, and running the tests like the other boxes do. I've run all the test suites successfully on my own build like that, it just takes a long time to build.
(FWIW, aside from those two reftests, we actually pass all the other tests in my PGO build, so that's pretty nice)
I think this has to block the next beta; we have invested too much in our testing to not be applying it to the bits we're going to ship to 150M people in a couple of months.
Flags: blocking1.9+
Priority: -- → P1
Priority: P1 → --

Agreed - P1
Priority: -- → P1
reassigning so this shows up in triage.
Component: Testing → Build & Release
Product: Core → mozilla.org
QA Contact: testing → build
Version: Trunk → other
Rob: How best to proceed here?

Because PGO builds take so long, should we set up a second set of unittest machines to run in parallel with the existing non-PGO unittest machines? Or should we tweak the existing unittest machines, which would slow down turnaround time on all unittest results?
Summary: run tests on PGO-enabled builds → run unittests on PGO-enabled builds
I believe we should setup new machines to do this. Our existing unittest boxes are taking too long as it is.
(In reply to comment #8)
> I believe we should setup new machines to do this. Our existing unittest boxes
> are taking too long as it is.
Sounds fair to me. 

Rob, could you file bugs on what exactly would be needed? 1 linux VM, 1 win32 VM and 1 mac xserve? What ref images should be used when building these?
And, can we get an owner for this bug?
taking. How soon do we need these? By end of week?
Assignee: nobody → rcampbell
Depends on: 423642
They block beta 5, so ASAP.
Assignee: rcampbell → mrogers
coop is gonna take this.

This is something I _will_ learn, but learning on a beta5 blocker isn't the best time.
Assignee: mrogers → ccooper
Getting this configured now.
Status: NEW → ASSIGNED
Back to mrogers, robcee is going to shepherd him through the process.
Assignee: ccooper → mrogers
Status: ASSIGNED → NEW
Attachment #310535 - Flags: review?(rcampbell)
Comment on attachment 310535 [details] [diff] [review]
Adding pgo unittest to master.cfg

I'd take the authList.append out at line 17 and just add the entry to auth.py directly.

Also, as ted mentioned in irc, we'll need to bump the timeout value for the compile step to around 3 hours.

Need to add: mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $(MOZ_OBJDIR)/_profile/pgo/profileserver.py' to the mozconfig file as well, which should be renamed to mozconfig-win2k3-pgo.
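For reference, a minimal sketch of what the renamed mozconfig-win2k3-pgo could contain after that change; the PROFILE_GEN_SCRIPT line and --enable-tests come from this bug, while the base include and objdir path are assumptions:

# mozconfig-win2k3-pgo (sketch; base include and objdir are assumed)
. $topsrcdir/browser/config/mozconfig
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdir
mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $(MOZ_OBJDIR)/_profile/pgo/profileserver.py'
ac_add_options --enable-tests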
Attachment #310535 - Flags: review?(rcampbell) → review+
Attachment #310535 - Attachment is obsolete: true
Attachment #310538 - Flags: review?(rcampbell)
Comment on attachment 310538 [details] [diff] [review]
Adding pgo unittest to master.cfg

Checking in master.cfg;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest-stage/master.cfg,v  <--  master.cfg
new revision: 1.5; previous revision: 1.4

Checking in mozconfig-win2k3;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest-stage/mozconfig-win2k3,v  <--  mozconfig-win2k3
initial revision: 1.1
Attachment #310538 - Attachment filename: win2k3pgo_02.patch → [checked in] win2k3pgo_02.patch
bm-win2k3-pgo01 can't connect to unittest staging master qm-unittest02.

Filed IT bug.
Depends on: 423951
The box is hooked up to the staging master 

http://qm-unittest02.mozilla.org:2005/

It's currently compiling; once it goes green I'll resolve this bug.

One thing to note: we're pulling from CVS via anonymous checkout from cvs-mirror.mozilla.org. We were having some really strange issues with the ssh key after the move to the new network, hit a dead end debugging it, and decided to just pull anonymously rather than stay blocked.

Any updates here Mikeal?
Comment on attachment 310538 [details] [diff] [review]
Adding pgo unittest to master.cfg

This patch is stale; we've made a half-dozen changes live on the qm-unittest02 staging master to get this running.
Attachment #310538 - Attachment is obsolete: true
Attachment #310538 - Flags: review?(rcampbell)
Update:

I'm working with robcee to track a reftest failure.
Patch to add the new win2k3 pgo slave to the production master.

This patch is a little early since this box hasn't gone green on staging. Once it has, I'll remove this comment.
Attachment #311455 - Flags: review?(rcampbell)
We're seeing a failure in the unittests for PGO builds;

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1206388465.1206403127.30304.gz&fulltext=1#err0

test == _tests/xpcshell-simple/test_dm/unit/test_bug_401430.js

This hasn't passed since Friday, which is the first run we had that made it as far as $ make -k check.
In addition, the test above is passing on a non-PGO build on tinderbox.
We're also seeing a series of reftest failures for PGO builds.

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1206388465.1206403127.30304.gz&fulltext=1#err5

Most of them seem to be in the png suite. A few are in the jpeg suite.

The last few failures are not in the image suites; those are:
REFTEST UNEXPECTED FAIL: file:///d:/slave/trunk_2k3/mozilla/layout/reftests/box-properties/CSS21-t100303-simple.xhtml
	FAIL - Timed out - chrome://mochikit/content/browser/toolkit/mozapps/downloads/tests/browser/browser_bug_406857.js

These tests are not failing on the other non-PGO unittest boxes.
There is one failure in the Browser Chrome tests;

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTest/1206388465.1206403127.30304.gz&fulltext=1#err127

FAIL - Timed out - chrome://mochikit/content/browser/toolkit/mozapps/downloads/tests/browser/browser_bug_406857.js

This test is not failing on the other non-PGO unittest boxes.

We've confirmed that the compilers match between the current production non-PGO unittest box and the new PGO unittest box. So that isn't the problem.

Both have MSVS 6.1.6000.16384
Once this is applied to the source directory, you can run (from the object directory):
make -C toolkit/components/downloads/test
make SOLO_FILE="test_bug_401430.js" -C toolkit/components/downloads/test check-one

to run the single test. Output should help figure out why it's failing.
robcee enlightened us to the fact that the display has to be set to 24-bit or else a bunch of reftests fail.

After we made this change the reftests went green.

Next I'll be applying the recent patch to try and fix the unittest failure.
(In reply to comment #31)
> Created an attachment (id=311494) [details]
> debugging patch for test_bug_401430.js
> 
> Once this is applied to the source directory, you can run (from the object
> directory):
> make -C toolkit/components/downloads/test
> make SOLO_FILE="test_bug_401430.js" -C toolkit/components/downloads/test
> check-one
> 
> to run the single test. Output should help figure out why it's failing.
> 

Gavin: looks like we don't need your debugging help with reftests after all. It was our setup problem after all. Sorry for the noise. 
(In reply to comment #33)
> Gavin: looks like we don't need your debugging help with reftests after all. It
> was our setup problem after all. Sorry for the noise. 

The debugging patch is for the unit test failure, not the reftests.
Can we add a reftest that just checks .pixelDepth, so that it's a little more obvious the next time someone bumps into this?  They're easy to write! :)
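A minimal sketch of what such a reftest could look like, written here as shell commands (assuming a unix-ish shell) that create the test page, the reference, and the manifest entry; the file names and manifest location are assumptions:

# the test page renders extra text when the display is below 24-bit, so it won't match the blank reference
cat > layout/reftests/pixel-depth.html <<'EOF'
<body><script>if (screen.pixelDepth < 24) document.write("FAIL: pixelDepth is " + screen.pixelDepth);</script></body>
EOF
cat > layout/reftests/pixel-depth-ref.html <<'EOF'
<body></body>
EOF
echo "== pixel-depth.html pixel-depth-ref.html" >> layout/reftests/reftest.list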
When I was applying the patch I mangled the file and had to start over, so I rm'd the file and cvs up'd that directory; when I did that I picked up this change:

http://bonsai.mozilla.org/cvslog.cgi?file=/mozilla/toolkit/components/downloads/test/unit/test_bug_420230.js&rev=&mark=

Now all the unittests are passing. I think the wrong test may have been reporting failure, because test_bug_420230 wasn't even getting run according to the log, but I think it would have been next in line to be run.
Ok,

The last test failure we have is in the browser chrome tests;

FAIL - Ignore warning button should be present (but hidden) for malware - chrome://mochikit/content/browser/browser/components/safebrowsing/content/test/browser_bug400731.js

I'm going to comment in the bug that pertains to that test and see if we can get some help debugging.
Upon further investigation we noticed the two chrome test runs on the same build are showing different failures.

I just queued up a few dozen runs of the tests on the current build, I'll check back through the logs in the morning and track further.
Comment on attachment 311455 [details] [diff] [review]
Adding pgo to production master

please expand the mozbuild_pgo environment. You might also need to update your local copy of mozbuild.py to include the recent path additions for the Vista SDK.
Attachment #311455 - Flags: review?(rcampbell) → review-
Alright, the last 24 runs all show this test failing in the browser chrome tests;

FAIL - Timed out - chrome://mochikit/content/browser/toolkit/mozapps/downloads/tests/browser/browser_bug_406857.js

Most runs only had this test failing, although there were a few runs with nearly a dozen other failures, but those don't seem to be 100% reproducible.

Regardless, browser_bug_406857.js is always failing.

It appears that the ref image that was used to set up qm-win2k3-pgo01 was configured to use a virtual disk rather than a physical one.

This meant that when we asked IT to copy the VM, so that we could have one in production and one in staging, it was turned off, and when it came back up everything we had done to it since it was provisioned was gone.

:luser completed a manual run of the tests earlier today and couldn't reproduce the Chrome test failure we were seeing earlier, which was the last failure we were only seeing on the PGO build (at that time some reftest failures were happening on all the unittest machines).

It's my recommendation that we go with the manual run for the beta5 release and remove this bug's blocker status. We'll still set up the unittest automation machines to run these automatically, and will do so before shipping beta5. However, this should no longer block the code handover from dev to build.
qm-win2k3-pgo01 is back up and all the tests are passing.

We have one more issue to work through in the clobber script before we'll be copying this image and sticking one on production, but we're almost there.
Ok, all green on staging.

We're now just waiting for the VM copy from IT. I'm nearly done with a few new patches for staging and production master.cfg.

:robcee, when should we schedule some down time to get this on the production master?
Adding IT bug to depends
Depends on: 424925
Attachment #311455 - Attachment is obsolete: true
Attachment #312163 - Flags: review?(rcampbell)
Comment on attachment 312163 [details] [diff] [review]
Adding pgo to production master

you'll need to change the name of the builddir to trunk_2k3_pgo or similar to avoid conflict with existing win2k3 machine.

Also, I know it's tedious but could you expand the environment variables in MozillaEnvironments['mozbuild_pgo']. I asked for that in another patch submission and still think it's the way to go for the reasons stated then.
Attachment #312163 - Flags: review?(rcampbell) → review-
Attachment #312163 - Attachment is obsolete: true
Attachment #312167 - Flags: review?(rcampbell)
Comment on attachment 312167 [details] [diff] [review]
Adding pgo to production master

awesome++. WOULD TAKE PATCH AGAIN.
Attachment #312167 - Flags: review?(rcampbell) → review+
Comment on attachment 312167 [details] [diff] [review]
Adding pgo to production master

Checking in auth.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/auth.py,v  <--  auth.py
new revision: 1.3; previous revision: 1.2
done
Checking in master.cfg;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/master.cfg,v  <--  master.cfg
new revision: 1.22; previous revision: 1.21
done
Checking in mozbuild.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/mozbuild.py,v  <--  mozbuild.py
new revision: 1.17; previous revision: 1.16
done
Checking in mozconfig-win2k3-pgo;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/mozconfig-win2k3-pgo,v  <--  mozconfig-win2k3-pgo
initial revision: 1.1
done
Attachment #312167 - Attachment filename: pgo_production02.patch → [checked-in] pgo_production02.patch
Attachment #313130 - Flags: review?(rcampbell)
After these final changes make it in to mozbuild we'll be moving qm-win2k3-pgo01 to the production master.

:robcee is planning downtime for Thursday.
Attachment #313130 - Attachment is obsolete: true
Attachment #313204 - Flags: review?(rcampbell)
Attachment #313130 - Flags: review?(rcampbell)
Attachment #313204 - Flags: review?(rcampbell) → review+
Comment on attachment 313204 [details] [diff] [review]
Additional SDK paths added to mozbuild 2

Checking in mozbuild.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/mozbuild.py,v  <--  mozbuild.py
new revision: 1.18; previous revision: 1.17
done
Attachment #313204 - Attachment filename: pgo_mozbuild_sdkfix.patch → [checked in] pgo_mozbuild_sdkfix.patch
OS: Windows XP → Windows Server 2003
Whiteboard: [adding to production, apr 3, 2008 @ 7pm PDT]
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [adding to production, apr 3, 2008 @ 7pm PDT]
Looks like I closed this too soon, as it was removed from the tree this weekend because of burning.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is blocked by the current burnage.
Depends on: 426997
Duplicate of this bug: 426997
I closed out bug 426997, which was logged when the PGO box started burning, as a dupe of this bug.

Bug 426997 was logged after the PGO box was put on production and started burning. The burning was caused by an intermittent failure in a mochitest that was locking up runtests.py, so consecutive runs couldn't rm the test directory. We moved to runtests.pl and increased the screen resolution, and this seems to have fixed both of those issues on staging.

But, after running the tests consecutively on both PGO boxes I'm seeing a new intermittent failure in make check.

_tests/xpcshell-simple/test_places/bookmarks/test_393498.js: FAIL

This is now the final issue I know of for the PGO boxes. It was logged as bug 427142, https://bugzilla.mozilla.org/show_bug.cgi?id=427142 , while the box was still on production.

I have a patch for production that takes the runtests.pl fixes we already have running on staging that I'll be attaching to this bug momentarily.

I'll also be turning back on the compile steps for both PGO boxes so that we're not running yesterday's build anymore.
Depends on: 427142
Attachment #314638 - Flags: review?(rcampbell)
Open question about prioritization: should we be blocking this on bug 427142?

It's an intermittent issue, it's a real test failure and not an environmental problem. In the bug comments it states that it _may_ be related to bug 381240 , which has been prioritized as very low and is likely to not be fixed in the near future.

I'm waiting on clarification about why/how the issue is related to bug 381240.

Can we put the PGO box into production with a known intermittent failure?

If this is related to bug 381240 I think we need to re-prioritize that bug.

(In reply to comment #59)
> Open question about prioritization, should we be blocking this for bug 427142 .
> 
> It's an intermittent issue, it's a real test failure and not an environmental
> problem. In the bug comments it states that it _may_ be related to bug 381240 ,
> which has been prioritized as very low and is likely to not be fixed in the
> near future.
> 
> I'm waiting on clarification about why/how the issue is related to bug 381240.
> 
> Can we put the PGO box in to production with a known intermittent failure?
> 

No, because it will get ignored in practice since folks will tire of trying to separate a known issue from a real failure.

Is it a specific test? If so, can we disable just that test?
We've run into these problems before on our testing VMs. The timing on VMs is just too random for timer-based tests to be dependable. We could disable the test, but then we'd lose it across the platform and that would mean decreased test coverage. The other option is increasing the timeouts on these tests, but because the timeout value is a constant in mochitest, that could mean ballooning the test time to ∂t x 50k. On a VM that's already taking upwards of 3 hours to complete, that could be pretty painful. Also, those timing changes are global across all platforms since the value is hard-coded in the test harness itself.
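(For scale, taking that ∂t x 50k estimate at face value: if each of roughly 50,000 checks waited out one extra second of timeout in the worst case, that alone would be on the order of 50,000 seconds, or about 14 hours of extra wall-clock time; the numbers are purely illustrative.)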

In the past we've usually moved VMs having these types of problems to physical hardware. Given the availability, we should be able to swap this out onto a win2k3 box pretty quickly as we have a very recent clone that coop's setup.
Doesn't our VM server underpinning provide a way to set strong resource guarantees for CPU and I/O bandwidth and so forth (both min and max)?  If we're still using ESX it should.

Or is the issue not CPU variability but unreliability in the timing functions themselves?  That seems like a solvable problem as well, though perhaps not in the same way.
There is work going on in bug 427142 to fix this test.
(In reply to comment #62)
> Doesn't our VM server underpinning provide a way to set strong resource
> guarantees for CPU and I/O bandwidth and so forth (both min and max)?  If we're
> still using ESX it should.

Yeah, we're still on ESX. I haven't looked closely at the options there but you're probably right. IT should be able to verify that.

> Or is the issue not CPU variability but unreliability in the timing functions
> themselves?  That seems like a solvable problem as well, though perhaps not in
> the same way.

I'm not sure. I think the timing functions are pretty good, tbh, but are failing due to resource starvation.

If the effort in bug 427142 clears this up, then I'm fine with leaving this on a VM to get it out the door.
No failure has been observed in at least 10 hours on either PGO box.

That doesn't mean the issue is gone; just an observation.
We are, optimistically, thinking this will be fixed by EOD. We are scheduling production downtime for tomorrow at 10am PST to put the PGO unittest box back on.
Ted tracked the issue to some missing build config env variables.

I just edited the build config on the staging master and restarted the slaves.

Logged bug 428431 per Ted's request to track migrating those extra env variables to the other build configs.
Attachment #314638 - Attachment is obsolete: true
Attachment #315125 - Flags: review?(rcampbell)
Attachment #314638 - Flags: review?(rcampbell)
Compilation is green again after the CFLAGS change.

I'm now seeing a make check failure:

../../../../_tests/xpcshell-simple/test_dm/unit/test_bug_401430.js: FAIL

http://qm-unittest02.mozilla.org:2005/WINNT%205.2%20qm-win2k3-pgo01%20dep%20unit%20test/builds/52/step-check/0
For the latest 3 runs qm-win2k3-pgo01 has been green.

The staging box had a different issue with its clobber script, which is now fixed. We should have a better idea this afternoon whether any of the intermittent failures are still happening on these boxes for PGO-enabled builds.
qm-win2k3-pgo01 is very green; robcee is scheduling downtime for tomorrow to add it back to production.

The staging box shows a new intermittent issue, but that shouldn't block the other box from going live. If we close this bug out tomorrow when the box goes onto production, I'll log a new one to track that issue; until then I'd like to keep everything in this bug.
Attachment #315125 - Flags: review?(rcampbell) → review+
updated patch to reflect recent win2k3 additions.
Attachment #312167 - Attachment is obsolete: true
Attachment #315125 - Attachment is obsolete: true
Comment on attachment 316227 [details] [diff] [review]
[checked-in] pgo prod patch

Checking in master.cfg;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/master.cfg,v  <--  master.cfg
new revision: 1.25; previous revision: 1.24
done
Checking in mozconfig-win2k3-pgo;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/mozconfig-win2k3-pgo,v  <--  mozconfig-win2k3-pgo
new revision: 1.3; previous revision: 1.2
done
Attachment #316227 - Attachment description: pgo prod patch → [checked-in] pgo prod patch
qm-win2k3-pgo01 is on the production master.

It's not showing up on the Firefox tinderbox page yet, though.
Who do I talk to about getting this back on the Firefox tinderbox waterfall?
Reporting and green!

Closing, finally!
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering