Closed Bug 384035 Opened 13 years ago Closed 12 years ago

Upgrade qa and build machines to reflect latest linux runtime requirements

Categories

(Release Engineering :: General, defect, P2)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rcampbell, Assigned: preed)

Attachments

(1 file)

Vlad attempted to land an upgrade to the in-repository version of Cairo last night and witnessed some reftest errors due to platform incompatibility. We need to upgrade the machines to reflect the new platform requirements proposed at this URL:

http://wiki.mozilla.org/Linux/Runtime_Requirements
Preed and Ben have raised good points about this hitting a number of extra machines, such as the perftest machines. While it'll be fairly easy to upgrade qm-rhel02,3, the others might be a little trickier. There might be staging issues if we don't upgrade these at the same time.
Some notes extracted from email and previous discussions:

We tried to replace argo-vm (an undocumented install currently producing Firefox trunk nightlies) with fx-linux-tbox (the current Linux refplatform), but the builds produced by fx-linux-tbox were 5-10% slower than those from argo-vm when tested under identical conditions with identical configs.

I don't know how to diagnose this, so I was going to propose that we ignore it and switch to fx-linux-tbox anyway, because it's a known config and with --enable-libxul it's within a few % of argo-vm.

Now, as for switching to CentOS5 as a build refplatform: builds produced by CentOS5 will not run on CentOS4. This means that we would have to upgrade the performance test boxes to a newer runtime (I believe that the new perf farm is already running some modern version of Ubuntu, right?). If we decide to switch to CentOS5, we should switch the perf box first, and get some historical data. Then we can switch the build machine.

This doesn't really affect the unit-test boxes specifically, because they test their own builds; so you could theoretically upgrade them independently. But that means that our unit tests are testing an entirely different environment than our production builds, which isn't especially helpful.
(In reply to comment #2)
> Now, as for switching to CentOS5 as a build refplatform: builds produced by
> CentOS5 will not run on CentOS4. This means that we would have to upgrade the
> performance test boxes to a newer runtime (I believe that the new perf farm is
> already running some modern version of Ubuntu, right?). If we decide to switch
> to CentOS5, we should switch the perf box first, and get some historical data.
> Then we can switch the build machine.

Yup. The linux perf boxes (qm-plinux01-05) are running Ubuntu Feisty, iirc.

> This doesn't really affect the unit-test boxes specifically, because they test
> their own builds; so you could theoretically upgrade them independently. But
> that means that our unit tests are testing an entirely different environment
> than our production builds, which isn't especially helpful.

Right. One nice option, since we've got a new machine on dedicated hardware, is that we can install CentOS5 on it, bring up a test box, and run it alongside qm-rhel02 for a couple of days to compare results. This might give us some useful data before we upgrade the build machines.
I would love to stop using the old perf machine setup and use the new qm-plinux boxes instead, but I'm concerned about when they're going to be ready. Would it be faster to commission a new tinderbox-based perftest machine?
Flags: blocking1.9+
Target Milestone: --- → mozilla1.9alpha6
from offline discussions: 

1) we've already moved from argo-vm (an undocumented install currently producing Firefox trunk nightlies) to fx-linux-tbox (the current Linux refplatform, same as Linux refplatform rel4).

2) we need to create a new refplatform rel5. Assigning to preed, based on yesterday's build meeting.

3) we need to roll this new refplatform rel5 out to build and QA machines to close this bug.
Assignee: nobody → build
Component: Build Config → Build & Release
Flags: blocking1.9+
Product: Core → mozilla.org
QA Contact: build-config → preed
Target Milestone: mozilla1.9alpha6 → ---
Version: Trunk → other
Assignee: build → preed
I noticed that fx-linux-tbox, from the current generation of the ref vm, didn't have TBOX_CLIENT_CVS_DIR set, so it wasn't updating its tinderbox code. This line needs to be added to ~cltbld/.bash_profile:
  export TBOX_CLIENT_CVS_DIR="/builds/tinderbox/mozilla/tools"

Bug 384626 might change that a little.
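
For anyone applying the .bash_profile change by hand on more than one machine, here is a minimal sketch that adds the line only if it is not already there, so running it twice is harmless. The helper name ensure_profile_line is made up for illustration and is not part of the tinderbox tools; the export line itself is the one quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of tinderbox): append a line to a profile
# file only if that exact line is not already present.
ensure_profile_line() {
    profile_file=$1
    wanted_line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$wanted_line" "$profile_file" 2>/dev/null; then
        printf '%s\n' "$wanted_line" >> "$profile_file"
    fi
}

# Usage, run as the cltbld user:
# ensure_profile_line "$HOME/.bash_profile" \
#     'export TBOX_CLIENT_CVS_DIR="/builds/tinderbox/mozilla/tools"'
```

The variable only takes effect in new login shells, so the tinderbox client would presumably need to be restarted from a fresh login afterwards.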
I'm actually working on this nowish, so resetting priority.
Status: NEW → ASSIGNED
Priority: -- → P1
Attachment #269265 - Flags: review?(ccooper)
Attachment #269265 - Flags: review?(ccooper) → review+
Alright, ref-vm is created. It does *NOT* (as of now) include the buildbot dependencies (I will add those on Monday).

It's reporting into the MozillaExperimental tinderbox page (http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaExperimental), and builds are being uploaded in the experimental directory under linux-newref.

We'll test this VM out for a few days, and then look at when it makes sense to switch it over to be the nightly Linux build machine for trunk (probably next week).
It would probably be good for the reference build VM (if this is what we're going to be using for tinderboxes) to have some debuginfo packages, which help with generating stack traces and performance data when needed. The ones I have installed on my machine are:

cairo-debuginfo
expat-debuginfo
fontconfig-debuginfo
freetype-debuginfo
glib2-debuginfo
glibc-debuginfo
gnome-vfs2-debuginfo
gtk2-debuginfo
hal-debuginfo
libX11-debuginfo
libXft-debuginfo
libgnome-debuginfo
pango-debuginfo
scim-bridge-debuginfo
gcc-debuginfo is also useful sometimes (it has symbols for libstdc++)
Based on some trace-malloc stacks I took recently, the following also show up in some stacks:

dbus-debuginfo libselinux-debuginfo gtk2-engines-debuginfo ORBit2-debuginfo libXcursor-debuginfo libXext-debuginfo libXfixes-debuginfo libXi-debuginfo libXinerama-debuginfo libXrender-debuginfo atk-debuginfo libbonobo-debuginfo dbus-glib-debuginfo GConf2-debuginfo popt-debuginfo gcc-debuginfo
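
A rough sketch of how one might check which of these debuginfo packages are still missing from the VM. It relies on `rpm -q` exiting non-zero for a package that is not installed; the function name and the PKG_QUERY override are made up for illustration (the override only exists so the check can be exercised without rpm).

```shell
#!/bin/sh
# Print each package named in the arguments that the query command reports
# as not installed. PKG_QUERY is a hypothetical override; by default the
# check uses "rpm -q", which exits non-zero for a missing package.
list_missing_packages() {
    for pkg in "$@"; do
        if ! ${PKG_QUERY:-rpm -q} "$pkg" >/dev/null 2>&1; then
            printf '%s\n' "$pkg"
        fi
    done
}

# Usage on the ref VM, with a few of the packages listed above:
# list_missing_packages cairo-debuginfo glibc-debuginfo gtk2-debuginfo pango-debuginfo
```

Whatever it prints is the set still to be installed; no output means everything named is already present.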
Sorry for the bugspam; these are now P2 in the New View of the World (tm).
Priority: P1 → P2
This is holding up trunk development, which was the original reason for filing this bug. If there's some way that we can build a later version of Cairo on the existing refplatform, then that's OK, but I understood this was blocking some important patches.

Vlad, Dbaron?
(In reply to comment #14)
> This is holding up trunk development, which was the original reason for filing
> this bug. If there's some way that we can build a later version of Cairo on the
> existing refplatform, then that's OK, but I understood this was blocking some
> important patches.

robcee:

I'm planning on deploying this for nightlies on Thursday, 5 July. There was a question about whether we need to coordinate the deployment together, and I think the answer technically is no, but a) I could be wrong, and b) it doesn't actually solve the problem until reftest is running on this version.

So, will you have time on Thursday to deploy the new image?
Hey preed,

I will be around all day Thursday and can help set this up under your expert tutelage. I'm not sure whether we'll want to replace the existing machine (qm-rhel02, a scary thought as it's still running the master), bring up a new VM with the reference image, or install it on the Mac mini we had set aside for this. We'll have to discuss.

Have a good 4th!
(In reply to comment #16)

> new VM with the reference image or install it on the mac mini we had set aside
> for this. We'll have to discuss.

The quickest/easiest way to do this is run it in a VM. The unit test/ref test doesn't have performance requirements, does it?

I was going to make the switch tonight (just now, actually), but I ran into a couple of problems getting all the extra packages people wanted. I also don't want to make the switch until you're around (and for others reading the bug, to be clear, robcee did bug me about it today, but I was distracted, looking at a couple of other fires).

robcee: let's do this on Friday during the day; you gonna be around?
Update: bug 387128 requests cloning the new VM for the reftest.

I'll be switching the nightly tinderbox over this afternoon; we'll see what happens.
Depends on: 387128
No longer depends on: 387128
Depends on: 387181
Update:

We attempted to switch over to the new refplatform for nightly builds on Friday; that went fine. However, when the performance testing machine (not unit test or ref test) tried to run the build, it failed to find a (pango) shared library.

So, I reverted us back to the older ref platform, since keeping us on the new refplatform would have meant no performance data.

Reed and rhelmer pulled some heroics on Friday to get them up and I just pointed the new test machine to the new ref-test builds; they're both reporting to MozillaExperimental:

http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaExperimental
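
For reference, a missing shared library like the pango one above is the kind of thing `ldd` reports directly, so it can be checked on a test machine before a switch. A minimal sketch, assuming the usual `ldd` output where unresolved entries read "<name> => not found"; the function name and the binary path in the usage line are placeholders, not anything from the actual perf setup.

```shell
#!/bin/sh
# Read `ldd` output on stdin and print the names of libraries that the
# dynamic linker could not resolve (lines of the form "<name> => not found").
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage on the machine that will run the build (path is a placeholder):
# ldd ./firefox-bin | unresolved_libs
```

An empty result only says the direct link-time dependencies resolve; anything loaded at runtime would not show up here.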
Depends on: 387676
New machine's up and running, and reporting to:

http://tinderbox.mozilla.org/showbuilds.cgi?tree=Firefox

Just waiting to check in the configuration files.
Are there bugs filed on the failing tests on qm-centos5-01? I didn't look at the unit/chrome tests, but I did check out the reftests yesterday, and 3 of the 4 failures were due to font kerning. The other one looked like it could have been rounding; it looked like a 1-pixel difference in the size of a rect.
I blogged about them and roc saw it, does that count? ;)

I don't believe individual bugs have been filed yet. We should do that.
So are we ready to make this switch again during Thursday's (12 July) nightly outage?

Or do we need to wait for something else?

I'm going to re-assign this bug to robcee, since I'm ready to go, to let him comment.

There are builds and test results reporting into MozillaExperimental now (although they're currently down due to a VMware migration; they should be back up shortly):

http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaExperimental
Assignee: preed → rcampbell
Status: ASSIGNED → NEW
I believe we're ready to roll with this. I'll file individual bugs on the failures on qm-centos5-01 tomorrow if they haven't been filed already by then and send out pleas and bribes to try to get people looking at the errors.
Assignee: rcampbell → preed
Depends on: 388054
fxnewref-linux-tbox is reporting into the Firefox page, cycle times look good, test results look good, I've posted to m.d.planning, m.d.a.firefox, and m.d.platform about its existence.

Bug 387167 tracks renaming the machine to fx-linux-tbox; bug 388054 tracks an issue with spikes in the Tp/Ts graphs, which rhelmer is dealing with/has addressed.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering