Closed Bug 378440 Opened 18 years ago Closed 17 years ago

cerberus build machine isn't well (very slow)

Categories

(Release Engineering :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: harald.langhammer, Assigned: nthomas)

References

()

Details

Attachments

(1 file)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.4pre) Gecko/20070415 Firefox/2.0.0.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.8.1.4pre) Gecko/20070415 Firefox/2.0.0.4pre
For about a week now the "cerberus" (cerberus-vm?) build machine has vanished from tinderbox
http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla1.8-l10n-de
As a result, localized win32 fx2.0 branch nightlies haven't been built since April 15; linux and mac builds are fine, as seen at the URL given above.
Reproducible: Always
Steps to Reproduce:
1. Open Tinderbox http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla1.8-l10n-de
or
2. Open target FTP folder ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-mozilla1.8-l10n/
Actual Results: Cerberus not present on tinderbox / No win32 builds since April 15 at target FTP folder
Expected Results: Cerberus building on tinderbox / New nightlies every day
I didn't find anything obvious on the console - box was sitting with a cygwin shell opened in:
    cltbld@cerberus-vm /builds/repacks/firefox-1.5.0.11-rc1-google
Killed off a couple makes and re-started tinderbox.
Status: UNCONFIRMED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
Assignee: server-ops → mrz
OS: Windows XP → All
Hardware: PC → All
Same happened again
Status: RESOLVED → UNCONFIRMED
Resolution: FIXED → ---
There was a "corrupted stack" error in the cygwin terminal (which is a new one to me). I opened a new terminal and restarted tinderbox.
Status: UNCONFIRMED → RESOLVED
Closed: 18 years ago → 17 years ago
Resolution: --- → FIXED
Same happened again
Status: RESOLVED → UNCONFIRMED
Resolution: FIXED → ---
This box is currently having problems with the file system on its hard disk. I meant to log/reopen a bug on that yesterday, apologies.
Assignee: mrz → nrthomas
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P1
Sorry for the bugspam; these are now P2 in the New View of the World (tm).
Priority: P1 → P2
Fixed three file system errors using chkdsk. Tinderbox restarted, it's doing Mozilla1.8 branch now. Will leave this open for a day or so, to remind me to keep an eye on it.
Component: Server Operations: Tinderbox Maintenance → Build & Release
Got as far as the Korean locale, then
    update-packaging/common.sh: fork: Resource temporarily unavailable
Tinderbox restarted.
I updated VMware tools from build 32039 to 38803 (which matches boxes like fx-win32-tbox), which bumped the driver "VMware SCSI Controller" from v1.2.0.2 (1999-11-14) to v1.2.0.4 (2005-08-17). It seems snappier, but let's see how it goes.
This time the Firefox/Mozilla1.8/l10n run finished, taking 8h 15m. The previous successful nightly run was on June 29th, and took 8h 35m. The machine had been rebooted for comment #9.
Then the Thunderbird/Mozilla1.8/l10n run started, getting as far as the complete update for cs before bombing out:
    processing xpistub.dll
    /cygdrive/c/builds/tinderbox/Tb-Mozilla1.8-l10n/WINNT_5.2_Clobber/mozilla/tools/update-packaging/common.sh: fork: Resource temporarily unavailable
    /cygdrive/c/builds/tinderbox/Tb-Mozilla1.8-l10n/WINNT_5.2_Clobber/mozilla/tools/update-packaging/common.sh: line 81: [: =: unary operator expected
    ignoring remove instruction for directory: components/mork.dll
    /cygdrive/c/builds/tinderbox/Tb-Mozilla1.8-l10n/WINNT_5.2_Clobber/mozilla/tools/update-packaging/common.sh: fork: Resource temporarily unavailable
The call is to make_full_update.sh, which uses functions in common.sh:
    http://mxr.mozilla.org/mozilla1.8/source/tools/update-packaging/make_full_update.sh
    http://mxr.mozilla.org/mozilla1.8/source/tools/update-packaging/common.sh
xpistub.dll is the last of the files to go into the mar, so we get to
    70 # Append remove instructions for any dead files.
    71 append_remove_instructions "$targetdir" >> $manifest
There are a bunch of shell calls and pipes between lines 71 and 88 of common.sh, which could cause the error.
Fragmentation report:
...
NTFS, 4KB clusters, 68 GB total space, 31 % free space
...
Volume fragmentation
    Total fragmentation 39 %
    File fragmentation 79 %
    Free space fragmentation 0 %
File fragmentation
    Total files 1,072,595
    Average file size 72 KB
    Total fragmented files 135,420
    Total excess fragments 339,687
    Average fragments per file 1.31
...
Pagefile not fragmented
...
Folder fragmentation
    Total folders 184,451
    Fragmented folders 9,090
    Excess folder fragments 31,066
Running defragger ....
If this doesn't help, then we can look at the SCSI driver (some win32 VMs use a different one), or will need a fix from bug 386074.
Still defragging, at 17% on some arbitrarily non-linear scale.
Defrag done, couldn't manage to fix ~ 2500 files but still a big improvement. Tinderbox restarted.
Got as far as Korean before the fork error occurred. I'll turn off the Mozilla1.8 locales to try to get some trunk coverage. Next step is to try the SCSI driver change, using a clone of cerberus-vm.
Blocks: 386074
Not to jinx it, but cerberus-vm has been solid for the last three days. The shorter trunk runs might be a factor. I'm hoping to clone this VM today, so we can try the other SCSI driver.
mrz set up cerberus-vm-clone, which is what it says on the tin. It's now doing the Firefox & Thunderbird locales for Mozilla1.8 branch, using the SCSI driver "LSI Logic PCI-X Ultra320", v5.2.3790.1830 from Microsoft. cerberus-vm continues to do the trunk locales.
The procedure was:
• Power off the VM you want to change controllers on
• Connect to the Service Console and edit the vmx file for the VM
• Add the following lines to the vmx file
    o scsi1.present = "true"
    o scsi1.virtualDev = "lsilogic"
• Power on the VM and it will discover the new SCSI card
• Power off the VM and edit the SCSI Controller settings, change the type to LSI Logic
• Power VM back on, answer Yes for the adapter change message
• Once it boots successfully shut the VM down again (it will have two LSI controllers at this point)
• Edit the vmx file and remove the lines you added above
• Power on the VM again and you will be all set
I'll check how it's doing tomorrow, and if possible compare build times.
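As a rough illustration of the two vmx edits in the procedure above (adding the temporary scsi1 lines, then removing them once the controller type has been switched), here is a minimal Python sketch. The function names and the sample vmx content are hypothetical; only the two scsi1 lines come from the procedure, and the sketch works on text in memory rather than on a real vmx file.

```python
# The two temporary lines named in the procedure.
TEMP_LINES = ['scsi1.present = "true"', 'scsi1.virtualDev = "lsilogic"']


def add_temp_controller(vmx_text):
    """Return vmx_text with the temporary scsi1 lines appended (if missing)."""
    lines = vmx_text.splitlines()
    for entry in TEMP_LINES:
        if entry not in lines:
            lines.append(entry)
    return "\n".join(lines) + "\n"


def remove_temp_controller(vmx_text):
    """Return vmx_text with the temporary scsi1 lines removed again."""
    lines = [line for line in vmx_text.splitlines()
             if line.strip() not in TEMP_LINES]
    return "\n".join(lines) + "\n"


# Hypothetical vmx content, for illustration only.
original = 'displayName = "cerberus-vm-clone"\nscsi0.present = "true"\n'
edited = add_temp_controller(original)
restored = remove_temp_controller(edited)
```

Keeping the add and remove steps symmetric makes it easy to confirm that the file ends up exactly as it started, apart from whatever the VM client itself changed.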
The clone only got as far as da before having the "fork: resource not available problem" in update-packaging/common.sh. Shell verbosity turned on, running again.
(In reply to comment #17)
> The clone only got as far as da before having the "fork: resource not available
> problem" in update-packaging/common.sh. Shell verbosity turned on, running
> again.
I also jiggered the last-built file so that the Fx l10n run was an hourly. The resulting run finished ok, although it took 4hrs 20 min. I can't see any record of update preparation in either the full log or those for individual locales. I backed these logs up in
    /cygdrive/c/builds/tinderbox/Fx-Mozilla1.8-l10n/WINNT_5_2_Clobber/... old-logs/20070712-hourly-build
Also, http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8-l10n/1184232217.7885.gz&fulltext=1
For the Thunderbird nightly run, I manually added "set -x" to the top of mozilla/tools/update-packaging/{common.sh,make_full_update.sh} when the mozilla/ checkout finished. The full run completed, which I wasn't expecting, taking 5h 24m. The logs were backed up to
    /cygdrive/c/builds/tinderbox/Tb-Mozilla1.8-l10n/WINNT_5_2_Clobber/... old-logs/20070712-nightly-build
Also, http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla1.8-l10n/1184247705.21996.gz&fulltext=1
There was update output this time, including some fork errors (warnings ?!?). Eg
    + (( i=103+1 ))
    + (( 104<219 ))
    /cygdrive/c/builds/tinderbox/Tb-Mozilla1.8-l10n/WINNT_5.2_Clobber/mozilla/tools/update-packaging/common.sh: fork: Resource temporarily unavailable
    + f=
    ++ echo components/msgdb.xpt
    + '[' -n '' ']'
    + (( i=104+1 ))
    + (( 105<219 ))
    ++ echo components/msgimap.xpt
    ++ tr '|' ' '
    ++ sed 's/^ *\(.*\) *$/\1/'
    ++ tr -d '\r'
    + f=components/msgimap.xpt
    + '[' -n components/msgimap.xpt ']'
    ++ echo components/msgimap.xpt
    ++ grep -c '\/$'
    + '[' 0 = 0 ']'
    + echo 'remove "components/msgimap.xpt"'
This is one broken and one working trip around the for loop at http://mxr.mozilla.org/mozilla1.8/source/tools/update-packaging/common.sh#75
I think the failure to fork must be for the $() at line 77, which points the finger at Cygwin rather than disk problems.
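To make the consequence of the trace above concrete, here is a minimal Python sketch of what the loop appears to do per entry, reconstructed only from the "set -x" output (it is not the actual common.sh code). The function name and the fork_ok hook are hypothetical; the hook simulates the command substitution returning nothing when fork fails.

```python
def remove_instructions(entries, fork_ok=lambda entry: True):
    """Build 'remove' manifest lines the way the trace suggests the loop does."""
    manifest = []
    for entry in entries:
        if fork_ok(entry):
            # Mirrors: tr '|' ' '  |  sed (strip leading spaces)  |  tr -d '\r'
            f = entry.replace("|", " ").lstrip(" ").replace("\r", "")
        else:
            # Fork failed, so the substitution produced an empty string.
            f = ""
        if not f:
            # '[' -n '' ']' is false: the entry is skipped without any error.
            continue
        if not f.endswith("/"):
            # grep -c '\/$' printed 0, so this is a file, not a directory.
            manifest.append(f'remove "{f}"')
    return manifest


entries = ["components/msgdb.xpt", "components/msgimap.xpt\r"]
healthy = remove_instructions(entries)
broken = remove_instructions(
    entries, fork_ok=lambda e: e != "components/msgdb.xpt")
```

In the simulated broken run the msgdb.xpt line is simply missing from the result, which matches the trace: the loop carries on, and the manifest ends up one remove instruction short.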
In total there are 7 fork errors in the run: 3 at line 77, 2 at line 81, one at line 38 (make_add_instruction), and one more I didn't identify.
##<blink>###################################################################
#                                                                          #
# The net result is that the manifest for locale mars is not trustworthy,  #
# unless we've verified there are no fork errors in the log.               #
#                                                                          #
############################################################################
Random guess - it's an SMP or HyperThreading issue on ESX3, but I have nothing to back that up.
Then we're back to building Firefox/Mozilla1.8. I've removed last-built to force a nightly, and will add the "set -x" again. It will be interesting to see if this run also completes. Based on cerberus-vm not managing it the ~ 5 times I tried, that seems unlikely but maybe the -x slows things down enough to make it not die completely. The symptom of dying is bash consuming all of a CPU and the build making no progress.
Finally, to summarize a little
* cerberus-vm is set to do Trunk builds only, and has done so without dying since July 6th. It's unknown why this is, although there are less locales that build on the trunk at the moment.
* cerberus-vm-clone is set to do Mozilla1.8 builds only, and is using the LSI SCSI driver. It's still very slow and has this forking problem.
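Checking a log for fork errors before trusting a manifest could be automated along these lines. This is a minimal sketch, assuming the error text always has the form seen in this bug ("<script path>: fork: Resource temporarily unavailable"); the function names and the sample log are hypothetical.

```python
import re
from collections import Counter

# Matches the script path that precedes the fork error message.
FORK_ERROR = re.compile(
    r"(?P<script>\S+): fork: Resource temporarily unavailable")


def fork_errors(log_text):
    """Count fork failures in log_text, keyed by script file name."""
    counts = Counter()
    for match in FORK_ERROR.finditer(log_text):
        script_name = match.group("script").rsplit("/", 1)[-1]
        counts[script_name] += 1
    return counts


def manifest_trustworthy(log_text):
    """True only when the log contains no fork errors at all."""
    return not fork_errors(log_text)


# Hypothetical excerpt modelled on the log lines quoted in this bug.
SAMPLE_LOG = """processing xpistub.dll
/cygdrive/c/builds/tools/update-packaging/common.sh: fork: Resource temporarily unavailable
ignoring remove instruction for directory: components/mork.dll
/cygdrive/c/builds/tools/update-packaging/common.sh: fork: Resource temporarily unavailable
"""

print(fork_errors(SAMPLE_LOG))
print(manifest_trustworthy(SAMPLE_LOG))
```

Matching on the message text rather than on line numbers keeps the check usable even when "set -x" is not enabled.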
Assignee: nrthomas → preed
> * cerberus-vm is set to do Trunk builds only, and has done so without dying
> since July 6th. It's unknown why this is, although there are less locales that
> build on the trunk at the moment.
There are quite a few differences between trunk and mozilla1.8 in tools/update-packaging, though nothing in the way of shell calls.
(In reply to comment #18)
> Finally, to summarize a little
> * cerberus-vm is set to do Trunk builds only, and has done so without dying
> since July 6th. It's unknown why this is, although there are less locales that
> build on the trunk at the moment.
>
> * cerberus-vm-clone is set to do Mozilla1.8 builds only, and is using the LSI
> SCSI driver. It's still very slow and has this forking problem.
Given this, I'm going to use cerberus-vm for the 2.0.0.5 release.
So, some notes on this:
-- We got cerberus-vm feeling... ok again by reducing its CPUs from 2 back to 1.
-- cerberus-vm is the original VM; we were gonna dump the clone. It has the original SCSI driver settings, too...
Status: NEW → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → DUPLICATE
cerberus-vm has been getting slower and slower. The most recent nightly runs for Firefox & Thunderbird took 10 hours and 6 hrs 10 mins respectively (on the 25th/26th Sep); since then it's been too slow to complete the cvs checkout within the 1 hour timeout. Reopening to try to sort this out (390340 only helps trunk).
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
I'm relocating the VM files from the netapp-d-002 partition to netapp-c-001 to see if that helps.
Assignee: preed → nrthomas
Status: REOPENED → NEW
Nightly runs took 8hr 12 min and 5hr 35min for Fx and Tb respectively, so that's better but still too slow. mrz, do you have any advice from when you looked at patrocles a while back ?
Summary: cerberus build machine (fx 2.0 branch localized nightlies) dropped off tinderbox → cerberus build machine isn't well (very slow)
Some things to try:
-- I note some weird VM settings; for instance, the vmdk doesn't have independent mode off (from what I can tell), and it doesn't have "persistent" selected. (These may come up when the VM is powered off, but for now, they look like they're unchecked). Also, I'd try removing/disconnecting the USB device and the serial devices. We don't really need them. Are we sure we have the current version of the tools installed?
-- Of very interesting note, this looks like it has a BusLogic controller. Have we tried the LSI Logic controller with no other changes (still 1 VCPU, etc.)? I know we tried the LSI Logic controller before, but as I remember, we also changed a bunch of other settings at the same time, correct?
I think we should probably clone cerberus-vm via VI, and then only modify it via the client, not by tweaking vmx files, and see if we can get performance to be a bit more acceptable.
We can also try moving it to fiberchannel disk to eliminate the "netapp-is-slow" issue (which is because the sata disks are just overloaded). Note, that will just give you a best-case disk access time as the fc shelf is just on loan. Just another point of reference if it's helpful.
(In reply to comment #27)
> We can also try moving it to fiberchannel disk to eliminate the
> "netapp-is-slow" issue (which is because the sata disks are just overloaded).
> Note, that will just give you a best-case disk access time as the fc shelf is
> just on loan. Just another point of reference if it's helpful.
We could give that a try, but I don't think that's the problem; I noticed that during a nightly build run, the first few locales run in a reasonable amount of time (10ish minutes), and that time gets progressively worse with each locale. In my mind, that points to something wrong with the machine configuration or software, not a hardware problem.
* When the VM is shut down, the virtual disk has Independent off, and Persistent/Nonpersistent is undefined (greyed out with no sign of radio selection). Paul, did you mean that Independent should be enabled ?
* On the console at boot, there is a message "At least one service or driver failed during system startup. Use Event Viewer to examine the log for details".
* The System log has "The cpqasm2 service failed to start due to the following error: The service cannot be started, either because it is disabled or because it has no enabled devices associated with it."
* This then prevents the "HP ProLiant System Management Interface Driver" from loading, and in turn the "HP ProLiant Remote Management Service" & "HP ProLiant System Shutdown Service"
* In the Service Manager, there are no disabled HP services. It's pretty tempting to uninstall this stuff.
* There is also an error about the "VMware Converter Service" failing to start. Ditto on the uninstall comment.
* Installed VMWare Tools is build-38803, I updated it July 4th (comment #9). Did we update ESX since then ? For comparison, I have v7.3.2 build-51348 in a XP VM running in Fusion. Can test updating the tools after the cloning.
* This was the CPU usage trend while deleting the old source tree for a nightly: http://people.mozilla.com/~nthomas/misc/cerberus-vm.png Times are PDT+8 - not sure what that tells us, the command was rm.
Starting a clone now, so we can try the suggestions.
> "At least one service or driver failed during system startup. Use Event
> Viewer
> to examine the log for details".
> * The System log has
> "The cpqasm2 service failed to start due to the following error:
That's a relic of when cerberus was on a physical HP box. You could remove them I guess but their failure to start shouldn't affect anything.
> * There is also an error about the "VMware Converter Service" failing to start.
> Ditto on the uninstall comment.
>
Probably only necessary during the original p2v/VMware Converter step.
> * Installed VMWare Tools is build-38803, I updated it July 4th (comment #9).
> Did we update ESX since then ? For comparison, I have v7.3.2 build-51348 in a
> XP VM running in Fusion. Can test updating the tools after the cloning.
No, the ESX hosts have all been the same since we migrated from 2.x to 3.0.1 and I upgraded all the tools packages during that round.
cerberus-vm-clone is up, with
* Disk - Independent: off, Persistent. I don't think this was a real change, although the VI said the device was modified simply by toggling Independent on and off.
* The two serial ports and USB controller were removed
I'll start it up and let it cycle for a bit, cerberus-vm is off.
(In reply to comment #31)
> cerberus-vm-clone is up, with
>
> * Disk - Independent: off, Persistent. I don't think this was a real change,
> although the VI said the device was modified simply by toggling Independent on
> and off.
> * The two serial ports and USB controller were removed
>
> I'll start it up and let it cycle for a bit, cerberus-vm is off.
Fx nightly run took 8hr 21min, so no improvement there. Restarting box with Independent: on, Persistent.
(In reply to comment #32)
> Restarting box with Independent: on, Persistent.
Got two Firefox nightly runs, 8hr 21min & 8hr 49min; plus a Thunderbird nightly run of 6hr. So this change was no help either.
Going to try switching from the BusLogic SCSI driver to the LSI one, and also clone cerberus-vm-clone onto the fiberchannel netapp.
(In reply to comment #33)
> Going to try switching from the BusLogic SCSI driver to the LSI one, and also
> clone cerberus-vm-clone onto the fiberchannel netapp.
This brought the build time down to about 5 hours for a Firefox nightly, woo! I accidentally killed tinderbox while it was copying files right at the end, so it didn't go green. This included a 30 minute CVS checkout, which was about 50 minutes previously.
I've now removed all the HP crud left over from when this had a real disk array, plus the VMware converter, VNC, and freed up a few GB of old builds. It's building another nightly now.
Just out of curiosity, what kind of build time is acceptable or are we aiming for?
(In reply to comment #35)
> Just out of curiosity, what kind of build time is acceptable or are we aiming
> for?
I'm working on generating a trend graph for this box, right back to when it was on real hardware, which would help answer this question. As we've moved towards the release automation this box is more a backup, and 4 hours or less would probably be OK. Those localisers working to get 2.0.0.x locales ready would probably appreciate something faster than that. How many are in that situation Axel ?
(In reply to comment #35)
> Just out of curiosity, what kind of build time is acceptable or are we aiming
> for?
staging-pacifica-vm and production-pacifica-vm consistently take about 2.5 hours to do a branch l10n clobber build (same exact setup as cerberus-vm).
They are both VMs (clones of pacifica-vm).
(In reply to comment #37)
> (In reply to comment #35)
> > Just out of curiosity, what kind of build time is acceptable or are we aiming
> > for?
>
> staging-pacifica-vm and production-pacifica-vm consistently take about 2.5
> hours to do a branch l10n clobber build (same exact setup as cerberus-vm).
>
> They are both VMs (clones of pacifica-vm).
>
Sorry I should qualify that "same exact setup" statement - "same exact Tinderbox config and version". Other things on the machine are quite likely different (versions of compiler and other tools).
cf: What are the build times for each locale after this switch? I'm actually curious about trend, mostly; I noticed when build times were long that each locale got progressively slower to build (with af taking 10 minutes and zh-TW taking an hour+). This doesn't seem right to me, and I'm wondering if we're still seeing that with the new setup.
(In reply to comment #34)
> I've now removed all the HP crud left over from when this had a real disk
> array, plus the VMware converter, VNC, and freed up a few GB of old builds.
> It's building another nightly now.
Firefox hourly/nightly: 2:38 / 4:56 (57 locales)
T'bird hourly/nightly: 2:12 / 3:27 (42 locales)
So no further improvement there.
(In reply to comment #39)
> cf: What are the build times for each locale after this switch?
>
> I'm actually curious about trend, mostly; I noticed when build times were long
> that each locale got progressively slower to build (with af taking 10 minutes
> and zh-TW taking an hour+).
>
> This doesn't seem right to me, and I'm wondering if we're still seeing that
> with the new setup.
We are still seeing this, turns out it's the way tinderbox is written for l10n - all the locales have the same start time:
http://mxr.mozilla.org/seamonkey/source/tools/tinderbox/post-mozilla-rel.pl#842
So the boxes on the waterfall get longer the further down the list of locales you get (this kinda makes sense if you want to compare to a CVS timestamp, but the time used is actually later than that). It's not so obvious on other boxes because they cycle much faster.
Here's a break down of the Firefox nightly run:
Run starts        - 1230
    [deletion of old source tree and other cleanup]
checkout start    - 1249 ( 19 min)
checkout end      - 1314 ( 25 min)
    [configure, build tools]
l10n starts       - 1321 (  7 min)
last locale ends  - 1704 (223 min, avg of 3min 54sec each)
    [local copies and push to stage]
Run ends          - 1726 ( 22 min, 750MB of data to handle)
Here's the promised trend of build time for cerberus, with comparison to 2 other windows boxes (patrocles & pacifica), going back to pre-virtualisation days. The times were extracted from tinderbox's JSON output with a perl script, and there is a bunch more data than shown if anyone wants it.
Of them all pacifica has not changed very much at all, patrocles was unwell (how was that fixed again ?), and cerberus has been pretty spectacularly broken in recent times. While the SCSI driver change has helped a lot it's clear that more can be done, so I'm going to run scandisk, turn off the Indexing Service, clean out some more old files and defrag the disk.
References for the labels:
"cerberus virtualisation?" - from change of name on tinderbox
"Down for a week" - comment #0 and 1
"Move to netapp" - https://intranet.mozilla.org/Build:Vmware:VIMigrationNotes#Round_3
"Use 1 CPU instead of 2" - comment #22
"LUN fix??" - can't locate any firm date for this from email, helpwanted
"Use LSI SCSI driver on a clone" - comment #34
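The per-phase durations in the breakdown above can be re-derived from the milestone times. This is a minimal sketch, assuming all timestamps are HHMM on the same day; the function and variable names are hypothetical.

```python
from datetime import datetime

# Milestones copied from the Firefox nightly breakdown above.
MILESTONES = [
    ("run starts", "1230"),
    ("checkout start", "1249"),
    ("checkout end", "1314"),
    ("l10n starts", "1321"),
    ("last locale ends", "1704"),
    ("run ends", "1726"),
]


def phase_minutes(milestones):
    """Minutes elapsed between each pair of consecutive milestones."""
    times = [datetime.strptime(stamp, "%H%M") for _, stamp in milestones]
    return [int((later - earlier).total_seconds() // 60)
            for earlier, later in zip(times, times[1:])]


phases = phase_minutes(MILESTONES)
total_minutes = sum(phases)

# Average time per locale over the l10n phase, with 57 Firefox locales.
l10n_minutes = phases[3]
avg_min, avg_sec = divmod(int(l10n_minutes * 60 / 57), 60)

print(phases, total_minutes, (avg_min, avg_sec))
```

The phases sum to 296 minutes, which agrees with the 4:56 nightly time reported for this run.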
QA Contact: justin → build
There was a small improvement (~10min) after three runs with the builtin defragger (this compacted free space as well as defragging the files). And the build times are pretty constant over the last fortnight - it's about 4hr45min for Fx nightly, and 3hr for Thunderbird.
mrz, any objections to me moving cerberus-vm onto the fcal partition for a few test runs ?
Is this going to close the tree? If it will I want to try this online copy tool instead.
(In reply to comment #43)
> Is this going to close the tree? If it will I want to try this online copy tool
> instead.
No, it's a l10n box on the 1.8 branch - a couple of hours downtime is no problem. If you want to try VMotion then that's cool too.
I meant cerberus-vm-clone in comment #42.
This was done yesterday, btw.
The resulting build times were
Firefox hourly/nightly: 1:20 / 3:05 (57 locales)
T'bird hourly/nightly: 1:00 / 1:52 (42 locales)
on an otherwise unloaded fibre channel netapp, and unloaded VM host (same hardware). Compared to the loaded netapp and vmhost (comment #40), it's about 50% less time for hourlies, 46% less for the Tb nightly, and 37% less for the Firefox nightly. So that would be a nice win to have permanently, if we can.
I've switched back to the loaded VM now that the test is complete. Anyone else have any ideas ? Otherwise I think we're out of steam here and will have to live with it.
(In reply to comment #46)
> The resulting build times were
>
> Firefox hourly/nightly: 1:20 / 3:05 (57 locales)
> T'bird hourly/nightly: 1:00 / 1:52 (42 locales)
>
> on an otherwise unloaded fibre channel netapp, and unloaded VM host (same
> hardware). Compared to the loaded netapp and vmhost (comment #40), it's about
> 50% less time for hourlies, and 46% less for the Tb nightly, and 37% less for
> the Firefox nightly. So that would be a nice win to have permanently, if we
> can.
You should go back to the FCAL image and vmotion it back to a loaded ESX host to see what it looks like on an un-busy ESX box. In my test, the sweet spot was at 5 running VMs - more than 5 and build times increased.
Ping me if you need my help in doing this.
(In reply to comment #47)
> You should go back to the FCAL image and vmotion it back to a loaded ESX host
> to see what it looks like on an un-busy ESX box. In my test, the sweet spot
> was at 5 running VMs - more than 5 and build times increased.
Worth a go, although this particular VM host (bm-vmware08) is relatively unloaded. The resident VMs are associated with our build automation and only compile intermittently.
It's just started cycling in this setup (unloaded FCAL, "loaded" host).
(In reply to comment #48)
> It's just started cycling in this setup (unloaded FCAL, "loaded" host).
Build times are a little slower:
Firefox hourly/nightly: 1:30 / 3:30
T'bird hourly/nightly: 1:10 / 2:10
If we are getting new hardware which will have similar performance to the borrowed netapp then I don't think it's worth putting any more work into this. This box does nightly builds on a maintenance branch, and we now do releases on another machine. For locales that are still coming on for 2.0.0.x, it cycles frequently enough IMHO. I suspect most localisers are going to be concentrating on Firefox 3 now.
If there are no objections, I'll resolve this FIXED.
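For reference, the percentage savings quoted above can be reproduced from the two sets of build times. This is a minimal sketch; the helper names are hypothetical, and the "before" figures are the loaded netapp times reported earlier in this bug (2:38 / 4:56 for Firefox, 2:12 / 3:27 for Thunderbird).

```python
def to_minutes(h_mm):
    """Convert an 'H:MM' duration string to whole minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)


def percent_less(before, after):
    """Percentage reduction in build time going from before to after."""
    b, a = to_minutes(before), to_minutes(after)
    return 100 * (b - a) / b


# (loaded netapp and host, unloaded fibre channel netapp and host)
RUNS = {
    "Firefox hourly": ("2:38", "1:20"),
    "Firefox nightly": ("4:56", "3:05"),
    "T'bird hourly": ("2:12", "1:00"),
    "T'bird nightly": ("3:27", "1:52"),
}

for name, (before, after) in RUNS.items():
    print(f"{name}: {percent_less(before, after):.1f}% less")
```

The hourlies come out a little either side of 50% (roughly 49% and 55%), so "about 50%" is a fair summary of the two.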
Priority: P2 → P3
Fine with me.
Status: NEW → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED
This VM was moved back to the regular Netapp today, host unchanged.
Product: mozilla.org → Release Engineering