Closed Bug 426997 Opened 16 years ago Closed 16 years ago

bm-win2k3-pgo01 is burning

Categories

(Release Engineering :: General, defect, P1)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 420073

People

(Reporter: ted, Assigned: mikeal)

Attachments

(2 files)

Looks like win2k3-pgo got hung up on something. Filed here instead of tinderbox maintenance because this is a new box, so it's possible that it's still a setup issue:

Build Error Log

 Skipping 26 Lines...

PsKill v1.12 - Terminates processes on local or remote systems
Copyright (C) 1999-2005  Mark Russinovich
Sysinternals - www.sysinternals.com

cvs checkout: Updating tinderbox-configs
cvs checkout: Updating buildbot-configs
No clobber required
# tools/buildbot-configs/testing/unittest/mozconfig-win2k3-pgo

mk_add_options MOZ_CO_PROJECT=browser
ac_add_options --enable-places
ac_add_options --disable-installer
ac_add_options --enable-application=browser
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdir
ac_add_options --enable-tests
ac_add_options --enable-debugger-info-modules
#ac_add_options --enable-mochitest
# ac_add_options --enable-extensions=default,jssh
# ac_add_options --disable-javaxpcom
# ac_add_options --enable-debug
# ac_add_options --disable-optimize
ac_add_options --disable-composer
ac_add_options --disable-mailnews
mk_add_options MOZ_MAKE_FLAGS="-j3"
# ac_add_options --enable-optimize="-O2 -g"
ac_add_options --enable-logrefcnt

# mozilla/testing/tools needed for buildbot profile (re)creation
mk_add_options MOZ_CO_MODULE="mozilla/testing/tools"
mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 23: unexpected EOF while looking for matching `''
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 24: syntax error: unexpected end of file
Adding client.mk options from /d/slave/trunk_2k3_pgo/mozilla/.mozconfig:
    MOZ_CO_PROJECT=browser
    MOZ_OBJDIR=$(TOPSRCDIR)/objdir
    MOZ_MAKE_FLAGS=-j3
    MOZ_CO_MODULE=mozilla/testing/tools
checkout start: Fri Apr 4 03:27:31 PDT 2008
cvs -d :ext:unittest@cvs.mozilla.org:/cvsroot -q -z 3  co    mozilla/client.mk mozilla/browser/config/mozconfig mozilla/browser/config/version.txt mozilla/build/unix/uniq.pl mozilla/calendar/sunbird/config/version.txt mozilla/mail/config/version.txt mozilla/suite/config/version.txt
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 23: unexpected EOF while looking for matching `''
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 24: syntax error: unexpected end of file
make[1]: Entering directory `/d/slave/trunk_2k3_pgo'
cvs -d :ext:unittest@cvs.mozilla.org:/cvsroot -q -z 3 co -P -r NSPR_4_7_1_BETA2 mozilla/nsprpub
cvs -d :ext:unittest@cvs.mozilla.org:/cvsroot -q -z 3 co -P -r NSS_3_12_BETA3 mozilla/dbm mozilla/security/nss mozilla/security/coreconf mozilla/security/dbm

cvs -d :ext:unittest@cvs.mozilla.org:/cvsroot -q -z 3 co -P -A -l mozilla/ mozilla/db mozilla/js mozilla/js/jsd mozilla/js/src
? mozilla/objdir
cvs -d :ext:unittest@cvs.mozilla.org:/cvsroot -q -z 3 co -P -A mozilla/README mozilla/accessible mozilla/browser mozilla/build mozilla/caps mozilla/chrome mozilla/config mozilla/content mozilla/db/mdb mozilla/db/mork mozilla/db/morkreader mozilla/db/sqlite3 mozilla/docshell mozilla/dom mozilla/editor mozilla/embedding mozilla/extensions mozilla/gfx mozilla/intl mozilla/ipc/ipcd mozilla/jpeg mozilla/js/jsd/idl mozilla/js/src/fdlibm mozilla/js/src/liveconnect mozilla/js/src/xpconnect mozilla/layout mozilla/memory/jemalloc mozilla/modules/lcms mozilla/modules/libbz2 mozilla/modules/libimg mozilla/modules/libjar mozilla/modules/libmar mozilla/modules/libpr0n mozilla/modules/libpref mozilla/modules/libreg mozilla/modules/libutil mozilla/modules/oji mozilla/modules/plugin mozilla/modules/staticmod mozilla/modules/zlib mozilla/netwerk mozilla/other-licenses/7zstub/firefox mozilla/other-licenses/atk-1.0 mozilla/other-licenses/branding/firefox mozilla/other-licenses/ia2 mozilla/parser mozilla/plugin/oji mozilla/probes mozilla/profile mozilla/rdf mozilla/security/manager mozilla/storage mozilla/sun-java mozilla/testing/crashtest mozilla/testing/mochitest mozilla/testing/tools mozilla/toolkit mozilla/tools/elf-dynstr-gc mozilla/tools/test-harness mozilla/uriloader mozilla/view mozilla/webshell mozilla/widget mozilla/xpcom mozilla/xpfe mozilla/xpinstall
checkout finish: Fri Apr 4 03:30:56 PDT 2008
make[1]: Leaving directory `/d/slave/trunk_2k3_pgo'
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 23: unexpected EOF while looking for matching `''
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 24: syntax error: unexpected end of file
make -f /d/slave/trunk_2k3_pgo/mozilla/client.mk build MOZ_PROFILE_GENERATE=1
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 23: unexpected EOF while looking for matching `''
mozilla/build/autoconf/mozconfig2client-mk: /d/slave/trunk_2k3_pgo/mozilla/.mozconfig: line 24: syntax error: unexpected end of file
make[1]: Entering directory `/d/slave/trunk_2k3_pgo/mozilla'
Adding client.mk options from /d/slave/trunk_2k3_pgo/mozilla/.mozconfig:
    MOZ_CO_PROJECT=browser
    MOZ_OBJDIR=$(TOPSRCDIR)/objdir
    MOZ_MAKE_FLAGS=-j3
    MOZ_CO_MODULE=mozilla/testing/tools
make -j3 -C /d/slave/trunk_2k3_pgo/mozilla/objdir
make[2]: Entering directory `/d/slave/trunk_2k3_pgo/mozilla/objdir'
rm -f -rf ./dist/sdk
rm -f -rf ./dist/include
rm -f -rf ./dist/private
rm -f -rf ./dist/public
rm -f -rf _tests
make[2]: Leaving directory `/d/slave/trunk_2k3_pgo/mozilla/objdir'
make[1]: Leaving directory `/d/slave/trunk_2k3_pgo/mozilla'
rm: cannot remove directory `_tests/testing/mochitest': Permission denied
rm: cannot remove directory `_tests/testing': Directory not empty
rm: cannot remove directory `_tests': Directory not empty
make[2]: *** [default] Error 1
make[1]: *** [build] Error 2
make: *** [profiledbuild] Error 2

No More Errors
clobbering...
Assignee: nobody → rcampbell
OS: Windows XP → Windows Server 2003
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Went green for one cycle then red again:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1207318951.1207322570.27303.gz

Note:
make[7]: Leaving directory `/d/slave/trunk_2k3_pgo/mozilla/objdir/xpcom/tools/registry'
      0 [main] make 472 open_stackdumpfile: Dumping stack trace to make.exe.stackdump

Looks like it still has some issues.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This was working fine yesterday afternoon before we moved it from staging to production. 

Reassigning to Mikeal, as Robcee is traveling.
Priority: -- → P1
Assignee: rcampbell → mrogers
Status: REOPENED → NEW
Probably separate from the burnination, but the mozconfig2client-mk error is due to this truncated line:

 mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $
in http://mxr.mozilla.org/seamonkey/source/tools/buildbot-configs/testing/unittest/mozconfig-win2k3-pgo

Compare the complete line:

 mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $(MOZ_OBJDIR)/_profile/pgo/profileserver.py' 
in http://mxr.mozilla.org/seamonkey/source/tools/tinderbox-configs/firefox/win32/mozconfig
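
For illustration, an unclosed quote like the one in the truncated line could be caught before a config is checked in. This is a minimal sketch in Python, assuming a simplified model of shell quoting (single quotes, double quotes, backslash escapes, and `#` comments only). `find_unterminated_quote` is a hypothetical helper, not part of the Mozilla build tools or of mozconfig2client-mk.

```python
def find_unterminated_quote(text):
    """Return the 1-based line where an unclosed quote opens, or None.

    Simplified shell model (an assumption, not a full shell parser):
    - '...' is literal up to the next single quote
    - "..." allows backslash escapes
    - outside quotes, a backslash escapes the next character and a '#'
      at the start of a word begins a comment that runs to end of line
    """
    quote = None        # quote character currently open, if any
    quote_line = None   # line on which that quote was opened
    in_comment = False
    line = 1
    i = 0
    while i < len(text):
        ch = text[i]
        prev = text[i - 1] if i > 0 else "\n"
        if ch == "\n":
            line += 1
            in_comment = False
        elif in_comment:
            pass
        elif quote is None:
            if ch == "#" and prev in " \t\n":
                in_comment = True
            elif ch == "\\":
                # Skip the escaped character, still counting newlines.
                if i + 1 < len(text) and text[i + 1] == "\n":
                    line += 1
                i += 1
            elif ch in "'\"":
                quote, quote_line = ch, line
        elif ch == quote:
            quote = None
        elif ch == "\\" and quote == '"':
            if i + 1 < len(text) and text[i + 1] == "\n":
                line += 1
            i += 1
        i += 1
    return quote_line if quote is not None else None


# The last two lines of the broken mozconfig, with no trailing newline.
broken = (
    'mk_add_options MOZ_CO_MODULE="mozilla/testing/tools"\n'
    "mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $"
)
fixed = broken + "(MOZ_OBJDIR)/_profile/pgo/profileserver.py'\n"
```

Running the check on `broken` points at the line where the quote opens, while `fixed` passes. A check like this only covers quoting; it would not catch other mozconfig mistakes.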
I brought up the staging box to see if we get the same issues.

I'm also attaching a patch to fix the mozconfig-win2k3-pgo issue Nick noticed.
Attachment #313631 - Flags: review? → review?(nrthomas)
I'd be inclined to call that patch a typo fix, myself, but then again I've been around the project for awhile.  :-)  Up to you what to do here for now.
I don't have commit rights to CVS yet, so _somebody_ has to review it and check it in.

I can make the update to the production master. And this change won't require a buildbot reconfig because that file is pulled down anew for each build.
Comment on attachment 313631 [details] [diff] [review]
[checked in] fixing mozbuild-win2k3-pgo

>? pgo_mozbuild.patch
>Index: unittest/mozconfig-win2k3-pgo
>===================================================================
>RCS file: /cvsroot/mozilla/tools/buildbot-configs/testing/unittest/mozconfig-win2k3-pgo,v
>retrieving revision 1.1
>diff -u -8 -p -r1.1 mozconfig-win2k3-pgo
>--- unittest/mozconfig-win2k3-pgo	28 Mar 2008 18:47:48 -0000	1.1
>+++ unittest/mozconfig-win2k3-pgo	4 Apr 2008 16:45:30 -0000
>@@ -15,9 +15,9 @@ ac_add_options --enable-debugger-info-mo
> ac_add_options --disable-composer
> ac_add_options --disable-mailnews
> mk_add_options MOZ_MAKE_FLAGS="-j3"
> # ac_add_options --enable-optimize="-O2 -g"
> ac_add_options --enable-logrefcnt
> 
> # mozilla/testing/tools needed for buildbot profile (re)creation
> mk_add_options MOZ_CO_MODULE="mozilla/testing/tools"
>-mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $
>\ No newline at end of file

Probably don't want to check this in ^ :)

>+mk_add_options PROFILE_GEN_SCRIPT='$(PYTHON) $(MOZ_OBJDIR)/_profile/pgo/profileserver.py'
Comment on attachment 313631 [details] [diff] [review]
[checked in] fixing mozbuild-win2k3-pgo

Checking in unittest/mozconfig-win2k3-pgo;
/cvsroot/mozilla/tools/buildbot-configs/testing/unittest/mozconfig-win2k3-pgo,v  <--  mozconfig-win2k3-pgo
new revision: 1.2; previous revision: 1.1
done
Attachment #313631 - Flags: review?(nrthomas) → review+
Attachment #313631 - Attachment description: fixing mozbuild-win2k3-pgo → [checked in] fixing mozbuild-win2k3-pgo
Since this change didn't require a buildbot reconfig, it didn't require any downtime. I updated the production master with this change after it was checked in.

The existing cycle compiled green, but comparing its total time with the staging PGO build time, it seems as though not everything is getting compiled with the old mozconfig. The next build will use the new mozconfig and fix this issue.
While that existing cycle (09.07am - 10.54am) finished green, we still have problems:

2008/04/04 09.07 green
2008/04/04 11.00 orange
2008/04/04 13.23 green
2008/04/04 15.36 orange
2008/04/04 17.43 orange
2008/04/04 20.06 orange
2008/04/04 22.27 red
...and remains burning red continuously even now. 

Random poking at the logs shows the following error:

rm -f -rf ./dist/public
rm -f -rf _tests
rm: cannot unlink `_tests/testing/mochitest/httpd.js': Permission denied
rm: cannot unlink `_tests/testing/mochitest/server.js': Permission denied
rm: cannot remove directory `_tests/testing/mochitest': Permission denied
rm: cannot remove directory `_tests/testing': Directory not empty
make[2]: Leaving directory `/d/slave/trunk_2k3_pgo/mozilla/objdir'
make[1]: Leaving directory `/d/slave/trunk_2k3_pgo/mozilla'
rm: cannot remove directory `_tests': Directory not empty
make[2]: *** [default] Error 1
make[1]: *** [build] Error 2
make: *** [profiledbuild] Error 2


It was running green in staging before we switched it to production - what changed?
I've removed bm-win2k3-pgo01 from the main Tinderbox page. With the freeze coming (and people presumably trying to work on the weekend), this new box shouldn't just sit burning on the main Tinderbox page. Please reenable when it's working, though!
This issue isn't specific to this box or to pgo, it's an issue we have on all the win2k3 unittest machines. The three that coop just set up all have the same problem.

Adding coop to CC
I have the semantics for a fix down, and I'm working on writing up some buildbot code to deal with this.

Essentially, we need to kill any other python processes that aren't the main buildbot process as part of our first few cleanup steps.

I do think that this intermittent mochitest failure on win2k3 is real and not a problem with the environment on these boxes, but our buildbot code should be robust enough to handle dangling processes from previously failed tests. We can track the mochitest failure much more easily once the boxes don't completely fall down when encountering it.
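
As a rough illustration of the cleanup idea described above (not the actual buildbot code), the selection step might look like this in Python. The sketch assumes the process list comes from the Windows command `tasklist /FO CSV /NH` and that the PID of the main buildbot process is already known; how to obtain that PID is left open. The function names `parse_tasklist_csv`, `select_rogue_pids`, and `taskkill_command` are hypothetical.

```python
import csv


def parse_tasklist_csv(output):
    """Parse `tasklist /FO CSV /NH` output into (image_name, pid) pairs."""
    rows = csv.reader(output.splitlines())
    return [
        (row[0], int(row[1]))
        for row in rows
        if len(row) >= 2 and row[1].isdigit()
    ]


def select_rogue_pids(processes, keep_pids, image_name="python.exe"):
    """Return PIDs of `image_name` processes that are not in `keep_pids`.

    `keep_pids` is assumed to contain the main buildbot process, so that
    the cleanup step never kills the slave itself.
    """
    keep = set(keep_pids)
    return [
        pid
        for name, pid in processes
        if name.lower() == image_name and pid not in keep
    ]


def taskkill_command(pid):
    """Build (but do not run) a forced kill command for one PID."""
    return ["taskkill", "/F", "/PID", str(pid)]


# Sample input in the shape tasklist prints; the values are made up.
sample = (
    '"python.exe","1234","Console","0","10,000 K"\r\n'
    '"python.exe","5678","Console","0","8,000 K"\r\n'
    '"make.exe","472","Console","0","2,000 K"\r\n'
)
processes = parse_tasklist_csv(sample)
rogue = select_rogue_pids(processes, keep_pids=[1234])
commands = [taskkill_command(pid) for pid in rogue]
```

Keeping the selection separate from the kill makes the risky part easy to check on its own: the list of PIDs can be logged and inspected before any command is run.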
Blocks: 420073
Ok, I have two ShellCommand subclasses that fix the rogue python process issue.

I would categorize the risk of this fix as "high" for sure. There is a good amount of failover logic in the code, but if something is out of whack it could kill the buildbot slave. I'm running it on my test master overnight; if it all looks good I'll merge the code into the staging master and let it live there for a while.

If we could run both PGO boxes on staging and get this code in the other win2k3 unittest boxes and see how they run over a 24 hour period I'd say the patch is good to go, but the risk is high enough that I don't want to push it into production too hastily.
(In reply to comment #14)
> This issue isn't specific to this box or to pgo, it's an issue we have on all
> the win2k3 unittest machines. The three that coop just set up all have the same
> problem.

Any details on what the issue is? For example, is bug#427605 the same problem? Are you seeing memory access violations on this PGO machine?
It's possible that this is the issue that was causing them to fall over, there isn't really enough in that bug for me to tell.

Both the PGO machines were red this morning, and only one of them was showing this memory access violation error.

Regardless, as I said in an earlier comment, the patch I'm currently working on is to keep the box from going red on consecutive runs after issues like this one. It addresses a slightly larger problem of killing rogue python processes from previous runs.

Once the boxes aren't going red after these test failures I'll dig deeper into the intermittent test failures. If coop thinks this is the reason the mochitests are failing and then locking up a Python process, then I'm inclined to agree with him, and his fix for that issue will clear up the last of the problems on these PGO boxes. If not, then we'll have an easier time tracking the issue once the boxes can run continuously without going red.
Waldo suspects that other bustage will be fixed by his patch in bug 418009.
I'm going to recommend switching to runtests.pl instead of the pythonic version until we can figure this out in staging.
Are any of the other unittest boxes using runtests.pl?
In order to increase transparency on this I'm going to attach the new ShellCommands I wrote so that people can comment on them before they are in the context of a patch to production.
On the main tinderbox, they all are.

On MozillaTest, most are using py.  qm-stage-centos5-01 is mean and green; qm-stage-osx-01 and qm-xserve02 dep were green for awhile but turned orange sometime, I don't know when, and it's reporting failures on a set of mochitests that don't really indicate anything -- I'd be surprised if a kick didn't fix.  qm-stage-win2k3-01 was green until bug 418009 hit and bricked it until someone can give it a kick; it's using py for the non-mochitest browser test run.  qm-win2k3-03 was orange on a specific browser test for no obvious reason, one that passed on the other box, and is now in need of a kick for the same reason.  qm-win2k3-02 is red on something completely non-mochitest-related, some buildbot failure it looks like -- no idea what it is.
Yeah, windows is the problem area for these things. I'd rather try these steps on staging before putting them on production. Not that they look bad, I'd just prefer not using production as a test environment.

Mikeal: please convert the step on the pgo unittest box to runtests.pl.
Both staging and production slaves are reporting to the unittest staging master until they are green.

Both boxes have had their resolution set to 1280x1024, as the previous resolution seems to cause intermittent failures in some of these tests.

Both boxes are now using runtests.pl in place of runtests.py.
I commented out the clobber and build steps on both boxes so that we can see more consecutive test cycles to determine if there are any more intermittent issues.
The issue that was causing this to burn is now fixed by using runtests.pl.

There is now a new set of issues keeping us from putting the PGO box back on production. I'm marking this bug as a dupe and referring everyone back to the original bug 420073, https://bugzilla.mozilla.org/show_bug.cgi?id=420073, to track further issues with the PGO unittest box.

The other bug is older, has more history, and seems to be on people's radar more.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → DUPLICATE
Product: mozilla.org → Release Engineering