Separate builds from the same source don't perform the same on Talos

RESOLVED FIXED

Status

P5
major
RESOLVED FIXED
8 years ago
6 years ago

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [unittest][talos])

Attachments

(8 attachments, 2 obsolete attachments)

I noticed this in bug 414641 comment 30.

See Compare Talos results at:

http://goo.gl/PxwYJ

The link compares ~7 Talos runs on try to ~6 m-c runs.  (Not every test was run the same number of times, unfortunately, but hover over a test score to see the number of runs.)

Notice that in the vast majority of cases, the variance on the try run is far below the variance on the m-c run.  My understanding is that the try and m-c setups were supposed to be identical, so I find this strange.

Perhaps we're seeing this because I kicked off all my try runs at night, while the m-c runs were all during the day.  But if time of day affects our benchmarks, we need to understand and try to fix this, because it makes accurate testing very difficult.
Priority: -- → P5
Whiteboard: [unittest][try][talos]
OS: Mac OS X → Linux
To clarify why this is important: The difference in variance indicates that something is different between the m-c and try testing environments.  It therefore seems plausible that this difference affects the tests' mean results in addition to their variance.  So until we can understand this effect, it's hard to believe that one can correctly compare Talos numbers between try and m-c.

Comment 2

8 years ago
Bumping severity.  This makes it impossible for us to know the effect of any patch before landing it on mozilla-central, which makes the try server useless for performance measurements.  :(
Severity: normal → major
The set of machines that run talos, and their environment, don't vary between any of the branches - it's all one pool. They get fed builds and run tests, and don't care what the source was. Now that's not to say that they all return exactly the same result for the same build, but there's nothing branch-specific there.

AFAICT, the mozilla-central runs in your comparison are from a range of checkins
 http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2f41abc5a89d&tochange=9f8e09d29a0d
and are compared a single try revision. It would be great to hear what you've done to rule out any perf changes in that range. 

It would also be hear about the determinism of gcc's PGO. Most likely there were several different objdirs producing those m-c builds, starting from the clobber glandium set at 8:50am that day, so there is certainly scope for all those builds not being completely equal.

I think a better comparison would be redo talos runs on a single m-c build, and if you think there might be a variation during the day try at different times during the day.
> It would be great to hear what you've
> done to rule out any perf changes in that range.

I looked at the changeset log and observed that none of those csets were likely to affect perf on so many different benchmarks.

Note also that for many of the tests on which I observed high variance, my changes' mean performance was close to m-c's mean (e.g. a11y linux-64).  To explain the variance as being caused by changes on m-c, you'd have to have a performance regression / improvement followed by a fix / backout.  I doubt this is hiding in the changelog.

> It would also be [good to] hear about the determinism of gcc's PGO.

I'll let Mike speak to this; I don't know.  When I run more tests, I'll do all platforms.

> I think a better comparison would be redo talos runs on a single m-c build,

Judging from what you've said, it seems very unlikely that this would reproduce these results.  If the pool of machines is the same, then the pool of machines is the same.  But that doesn't mean there isn't an issue here, because we still see a difference between the one-for-every-push m-c Talos results to results from try.

> and if you think there might be a variation during the day try at different
> times during the day.

To be clear, the suspicion that time of day matters is a really suspicion that when there's more load at the system, somehow Talos machines are affected.

Can you think of how that might be?  Do they run a daemon which is chattier when there's more load?

It could be a deep problem -- for instance, individual machines or the whole DC is hotter when there's more load, so some machines run their CPUs slower to compensate.  Or it could be vibrations affecting hard disks [1].  Have you observed any problems like this?  Do you have monitoring which would allow us to rule these things out?

In any case, I'll run more tests.

[1] http://blogs.sun.com/brendan/entry/unusual_disk_latency
(In reply to comment #4)
> > It would also be [good to] hear about the determinism of gcc's PGO.
> 
> I'll let Mike speak to this; I don't know.  When I run more tests, I'll do all
> platforms.

I don't have a definitive answer on that, but PGO is going to be disabled, so we'd be able to see if that changes afterwards.
-O3 and PGO are back on, hopefully for good, AIUI.

I've arbitrarily chosen 0221eb8660f4 to be the base of perf comparisons.  So I just have to click one button on the self-serve build page, I've chosen to focus on Dromaeo results.

To test the hypothesis that time of day and/or load on machines affects variance, I pushed bb03789cbaec (a noop change) to try.  I'll kick off more builds in the next few days.

Stay tuned.
I've kicked two sets of Talos runs for rev bb03789cbaec, one set when there was no backlog on Try, and another when there was a large backlog.  I've attached the raw graphserver data, which is also available at [1] (I wanted to save it, since I'm not sure how long graphserver keeps Talos results from Try).

I haven't looked at this data at all yet, but I will soon.

[1] http://graphs.mozilla.org/api/test/runs/revisions?revision=bb03789cbaec
I just analyzed this data.  The two runs are indistinguishable as far as I can tell.

I'm going to look through the data from comment 0 again.  Maybe one of the changesets was slower than the others, or maybe all the tests ran slower at a certain time.
You can kind of see the variance on graphserver.

http://goo.gl/7DMfv

(I wish there was a way to turn off the lines connecting the dots in this graph.  For our purposes here, try to ignore the lines.)

Notice that the general spread here is about +/- 10 points, while my tryserver build saw +/- 3 points on the same test.  (This is Dromaeo DOM on Linux-32.)

The lines make it appear that there's a general upward trend, so you might think that trend explains the spread.  But I don't think it does.  First, it's not clear to me whether the lines trend upward because of points shown in the plot or points lying outside the edges of the plot.

But more convincingly, let's examine two specific data points.  Inside the 28 Apr 2011 3:00p - 9:00p box, pick out the highest and lowest points.  These correspond to 

 * changeset a78968ac5bbb - test result 213.97
 * changeset 4a6f98909da5 - test result 202.75

Notice that the higher result is the earlier test, so this doesn't fit with the general upward trend we observed earlier.  It also seems unlikely that this was an actual 5% performance regression, since it immediately bounced back, and *on the same machine as 4a6f98909da5* -- this is changeset 88d3c5bde0ba.

Examining the log (hg log --rev a78968ac5bbb:88d3c5bde0ba), I don't see how we could have a 5% regression followed by immediate fix back to previous levels.
If I'm right and the high variance we're observing is due to testers rather than builders, then the variance should disappear if we kick a new set of runs for the builds in question.  On the other hand, if I kick a bunch of runs for the existing builds of a78968ac5bbb and 4a6f98909da5 and still observe the 5% difference, then that would give me confidence that the perf difference has something to do with the bits.

I'll try not to take down the self-serve server this time...
Interesting!  The extra Talos runs I kicked for the existing bits on a78968ac5bbb and 4a6f98909da5 all came back with results very close to the original results.

I can think of two possible explanations for this.  Either there really *is* a 5% perf difference between the source at these two revisions, or there's some serious nondeterminism in our build.

Both of these seem unlikely to me.  (It's possible that there is a real perf difference between these two csets, but that doesn't explain the ~5% variance I see in general on this test.)

I'm going to kick a few new builds of these changesets and see how they perform.  If the new builds perform the same as the m-c builds, then I'll try to reproduce this result with another set of changesets.  If the new builds perform differently from the m-c builds, then we'll want to investigate why.
The test results for both csets hovered around 205.

So to review, here's the situation:

 * The bits on m-c for 4a6f98909da5 consistently give Dromaeo DOM results of around 214 on Linux-32.
 * The same code pushed to try results in a build which benchmarks at 205, about 5% slower.

If this is correct, it's a pretty surprising result!

Why might this be happening?  Here are some guesses:

 * Nondeterminism in Linux PGO.  We could rule this out by seeing the same effect on Mac.
 * Differences in the software stacks on the builders.  (Presumably these should all be the same!)

Since the difference appears to be in the bits, we should be able to disassemble and diff the builds.  I imagine this won't be useful in PGO land, since we'd expect some nondeterminism (just not enough to cause a 5% perf swing), but on the Mac, I'd expect close to bit-for-bit sameness.  In any case it would be good to verify that when the test results are different the bits are also different.

I'm going to try to recreate these results, since I'm pretty skeptical I've interpreted this correctly.  I've requested birch.
(Assignee)

Updated

8 years ago
Depends on: 657609
(Assignee)

Updated

8 years ago
Depends on: 658571
Releng set me up with ssh access to a Talos box, and I've been running tests there.  Amazingly, the two builds from comment 14 perform just as they did on try: The m-c build a78968ac5bbb yields ~214, while the try build yields ~204 on Dromaeo DOM.  I ran 8 tests of each, and while there's some variance, the two sets of results don't even come close: [214.05 - 217.01] versus [202.41 - 208.58].

I think we can tentatively conclude that, somehow, the bits really are different.

I'm going to confirm that the bits from 4a6f98909da5 act as we expect.  This build comes from m-c, so if it acts as we now expect, we'll be able to rule out Try as having anything to do with this.
The build from 4a6f98909da5 acted just as expected.  In seven runs, it varied from 203.71 to 207.52.

This strongly suggests that what we're seeing doesn't have to do with a difference between try and mc.

There are still many questions to answer.  Among them:

 * On what platforms do we see this effect?  (All tests up until now were on Linux-32.)
 * How common is it?
 * What causes it?

My guess is that this might be due to PGO, as nthomas suggested in comment 3.  (Observing a similar effect on mac would suggest this isn't the case.)

I'd further guess that perhaps these differences arise due to the hardware differences of builders -- apparently we have at least two classes of machines doing builds.  This should be easy enough to test.
(Assignee)

Updated

8 years ago
Summary: Talos on tryserver has lower variance than Talos on m-c → Separate builds from the same source don't perform the same on Talos
Whiteboard: [unittest][try][talos] → [unittest][talos]
Our linux builders are:
 - iX machines (hostname matches /.*-ix-.*/): physical hardware in scl1
 - VMs: on a VMWare cluster in sjc1
As I mentioned in IRC, we have a decent mix of build and try in each hardware type.
Are you able to reproduce this with two different try builds (as opposed to try vs m-c)? If so, can you try comparing try builds with "mk_add_options PGO=" in mozconfig-extra to disable PGO ?
Here's a chart from graphserver which suggests to me that this might be related to PGO.

http://graphs.mozilla.org/#tests=[[73,1,415],[73,1,417]]&sel=1295849726,1306160530

PGO is presumably the large perf increase around 4/26.  After that, the variance appears to increase.  (I realize that we turned PGO on/off a few times, but none of the post-4/26 builds dropped down to pre-4/26 levels, so I presume they all have PGO enabled.)

(In reply to comment #19)
> Are you able to reproduce this with two different try builds (as opposed to
> try vs m-c)? If so, can you try comparing try builds with "mk_add_options
> PGO=" in mozconfig-extra to disable PGO ?

Right now, I don't have a reliable way to reproduce this effect.  It doesn't seem to be uniformly random, since I've kicked a number of try builds which have all had the same performance.
Similar graph for Windows 7:

http://graphs.mozilla.org/#tests=[[73,1,355],[73,1,357]]&sel=1300873828,1306122160

I wouldn't read too much into this, because I'm cherry-picking lines which have large variance.

However, there are a few 8-point swings and one 10-point swing in this graph -- this is more than the natural variance I saw on Linux.  More importantly, the swings seem to correspond to matching swings in Win XP:

http://graphs.mozilla.org/#tests=[[73,1,355],[73,1,379]]&sel=1300873828,1306122160

http://graphs.mozilla.org/#tests=[[73,1,357],[73,1,376]]&sel=1300873828,1306122160

Unfortunately neither of those graphs lines up perfectly, because graphserver only lets me plot on a per-machine basis.  I'll go make my own graphs now.
Attached is a plot of Windows XP and 7 Dromaeo DOM results over a 2 month period (March 20 - May 20).

Notice the two sections of bimodality.  Neither of these is a regression in all builds -- in the first section, most builds are fast, but some are slow, and in the second section, most builds are slow, but some are still fast!

In a graph I'll post in a moment, we can see that the Win7 and XP numbers track each other closely.  Since Talos machines for WinXP and 7 use the same bits, this suggests that the noise is coming from the builds themselves.
Assignee: nobody → justin.lebar+bug
Status: NEW → ASSIGNED
This plots the correlation between WinXP and Win7 Dromaeo DOM.  There's a strong correlation between WinXP and Win7 scores within both modes (that is, both clusters sit along a line with positive slope -- in this case, slope close to 1).

This suggests that the bimodality from the previous attachment is in the bits, rather than due to random variation in the testers.
Comment on attachment 534498 [details]
Win XP / 7 Dromaeo DOM correlation first 300 revs

I should have mentioned: This is correlation for just the first 300 revisions.  But the plot looks as you'd expect when you expand to all the revisions from the previous plot -- the faster mode becomes more dense, and still hews closely to the existing line of best fit.
Attachment #534498 - Attachment description: Win XP / 7 Dromaeo DOM 200a746 to d3d0241 correlation → Win XP / 7 Dromaeo DOM correlation first 300 revs
Posted image Dromaeo DOM Mac
This plot isn't as interesting as the Windows one.  Mac-32 seems to have pretty low variance, while Mac-64's variance is larger.
Posted image Linux Dromaeo DOM
I think this plot pretty conclusively shows that PGO is to blame for the extra variance on Linux.

It looks like there was initially a problem with PGO builds occasionally turning out much worse than they should, but that got fixed around r1000.  But even after that fix, the variance is still much higher than it was before.
Although we've ruled out the testers as the source of this variance, it's possible that some builders yield faster builds than others.

Apparently the only way to go from buildid to builder name is to parse the build logs.  So here I go...
The bimodality in Windows may be a non-effect -- I'm not sure I pulled only m-c revisions, so we could be seeing a branch's test results.
I collected this data by scraping the tinderbox FTP server.
It doesn't appear that builder type makes a large difference in the build's speed.  I'm not sure if the stdev differences are statistically significant, but the variance spread on Linux-* and Win XP seems prety large.

All of this data is for m-c builds from the past month.  I excluded nightlies, since they're building a cset that was already built on m-c.  I also excluded non-PGO Linux builds, by throwing away csets which scored too low to be PGO.  (To be really rigorous, I should have only considered builds which had test runs on all platforms and excluded on all platforms the csets for which Linux builds had PGO turned off.  But I don't think it really matters here.)

Machine labels:

  linux-ix-slave: New bare hardware
   mv-moz2-linux: Old bare hardware
      moz2-linux: VM

          w32-ix: New bare hardware
         mw32-ix: Old bare hardware
           win32: VM

Linux-32:
>       moz2-linux-slave: n 105, min 201.7, 1q 208.0, median 210.9, mean 211.4, 2q 215.2, max 220.5, stdev 4.4
>         linux-ix-slave: n  79, min 200.3, 1q 208.3, median 213.6, mean 212.8, 2q 216.9, max 221.3, stdev 4.7
> mv-moz2-linux-ix-slave: n  88, min 200.1, 1q 207.1, median 212.6, mean 212.0, 2q 217.2, max 219.7, stdev 5.1

Linux-64:
>       moz2-linux-slave: n 105, min 242.3, 1q 255.2, median 258.4, mean 257.6, 3q 261.3, max 268.2, stdev 4.9
>         linux-ix-slave: n  79, min 243.2, 1q 256.1, median 258.8, mean 258.3, 3q 262.2, max 269.2, stdev 5.5
> mv-moz2-linux-ix-slave: n  88, min 243.9, 1q 255.4, median 258.4, mean 258.2, 3q 262.0, max 269.3, stdev 4.9

Win 7:
>           w32-ix-slave: n 110, min 245.3, 1q 251.6, median 253.5, mean 253.5, 3q 255.8, max 259.7, stdev 2.9
>          mw32-ix-slave: n 189, min 242.3, 1q 252.3, median 254.1, mean 254.1, 3q 256.3, max 261.4, stdev 2.9
>            win32-slave: n  31, min 246.0, 1q 250.8, median 253.1, mean 252.9, 3q 255.1, max 259.3, stdev 3.1

Win XP:
>           w32-ix-slave: n 110, min 246.2, 1q 254.7, median 256.4, mean 256.0, 3q 257.8, max 262.4, stdev 2.9
>          mw32-ix-slave: n 189, min 248.0, 1q 255.5, median 257.1, mean 256.9, 3q 258.8, max 262.8, stdev 2.6
>            win32-slave: n  31, min 252.4, 1q 254.5, median 256.1, mean 256.3, 3q 258.3, max 261.1, stdev 2.4

Mac-32:
>    moz2-darwin10-slave: n 342, min 227.6, 1q 232.8, median 234.2, mean 234.0, 3q 235.2, max 239.1, stdev 1.8

Mac-64:
>    moz2-darwin10-slave: n 342, min 244.8, 1q 256.0, median 258.5, mean 258.1, 3q 260.4, max 266.5, stdev 3.8
One data point: the iX machines (/.*-ix-.*/) have really bad disk performance because they're rockin' consumer-grade SATA drives.  See bug 655304 for the work in progress to fix that, noting that none of the servers are back from iX yet.

I'd be surprised if that had a significant effect on the PGO - AIUI, PGO is using profiling, not wall-clock performance measurements, to decide where to optimize for speed and where to optimize for space.  That said, Firefox is full of timers, and timing differences due to disk slowness could easily cause changes in codepath execution counts.

BTW, to my knowledge, there's no great difference between "Old" and "New" bare hardware, other than which PO they were purchased on.
I can think of two next steps to address PGO:

a) Take a pair of builds with a large perf difference, disassemble, and try to figure out why one is slower.  The hope would be that most of the code would be the same, but we'd be able to see where the compiler chose to compile for space rather than speed along some critical path.

b) Test whether making the profile script do more (and thereby exercise more codepaths) reduces variance.

There's also a curious effect on Mac: Why does Mac-64 have twice the stdev of Mac-32?  My understanding is that we run 32-bit builds on 10.5 and 64-bit builds on 10.6.  Maybe we could try to approach this by running a 32-bit build on 10.6.  I don't know what would be involved with that.

(It's probably also worth checking that this variance is in the testers and not in the builders, as we did for Linux.)
Is there any way to add debugging output for PGO?  Perhaps we can keep the profile data around with the build?
(In reply to comment #34)
> One data point: the iX machines (/.*-ix-.*/) have really bad disk
> performance because they're rockin' consumer-grade SATA drives.  See bug
> 655304 for the work in progress to fix that, noting that none of the servers
> are back from iX yet.
> 
> I'd be surprised if that had a significant effect on the PGO - AIUI, PGO is
> using profiling, not wall-clock performance measurements, to decide where to
> optimize for speed and where to optimize for space.  That said, Firefox is
> full of timers, and timing differences due to disk slowness could easily
> cause changes in codepath execution counts.

OTOH, the profile runs rights after the build, so it shouldn't hit the disk too hard because everything should already be in RAM. Except if there isn't enough ot it.

(In reply to comment #36)
> Is there any way to add debugging output for PGO?  Perhaps we can keep the
> profile data around with the build?

You can keep the .gcda files from the objdir if you want.
These are the results of aggregating Linux-32 test results by builder.  I've omitted builders with fewer than 5 builds.

Sorry to make you sqint at all these numbers.  There does appear to be something going on here -- I've starred some builders with suspiciously low medians.  (Some of these also have low means.)  It's hard for me to say whether this is significant, though.

>          linux-ix-slave28: n  21, min 205.3, median 214.3, mean 213.6, max 220.5, stdev 4.6
>  mv-moz2-linux-ix-slave11: n  16, min 206.3, median 217.0, mean 214.6, max 219.1, stdev 4.4
>          linux-ix-slave38: n  14, min 204.1, median 213.4, mean 213.1, max 219.2, stdev 4.2
>        moz2-linux-slave42: n  12, min 207.1, median 211.2, mean 212.5, max 220.5, stdev 4.4
>  mv-moz2-linux-ix-slave02: n  11, min 205.2, median 215.5, mean 213.9, max 219.3, stdev 4.5
>*       moz2-linux-slave07: n  11, min 204.1, median 209.2, mean 210.9, max 219.8, stdev 4.7
>  mv-moz2-linux-ix-slave08: n  11, min 204.9, median 216.2, mean 213.7, max 219.7, stdev 5.0
>        moz2-linux-slave30: n  10, min 208.5, median 214.9, mean 214.6, max 218.5, stdev 3.3
>  mv-moz2-linux-ix-slave19: n   9, min 205.4, median 214.2, mean 212.9, max 217.6, stdev 4.1
>* mv-moz2-linux-ix-slave14: n   8, min 204.6, median 207.9, mean 209.8, max 218.3, stdev 5.1
>* mv-moz2-linux-ix-slave18: n   8, min 203.4, median 205.8, mean 207.4, max 217.2, stdev 4.5
>*         linux-ix-slave12: n   8, min 207.5, median 208.9, mean 211.4, max 218.7, stdev 4.3
>*       moz2-linux-slave08: n   8, min 201.7, median 208.7, mean 209.2, max 215.3, stdev 4.3
>          linux-ix-slave41: n   7, min 203.9, median 213.5, mean 211.9, max 219.8, stdev 5.6
>        moz2-linux-slave38: n   7, min 204.2, median 211.0, mean 212.4, max 219.6, stdev 5.2
>          linux-ix-slave30: n   7, min 205.0, median 211.5, mean 213.5, max 221.3, stdev 5.2
>        moz2-linux-slave25: n   7, min 208.4, median 216.5, mean 214.7, max 218.0, stdev 3.8
>  mv-moz2-linux-ix-slave17: n   7, min 203.5, median 213.6, mean 213.2, max 219.1, stdev 5.3
>          linux-ix-slave20: n   6, min 206.3, median 216.2, mean 214.4, max 218.7, stdev 4.2
>        moz2-linux-slave34: n   5, min 206.5, median 214.7, mean 212.5, max 218.6, stdev 4.8
>  mv-moz2-linux-ix-slave10: n   5, min 206.7, median 212.2, mean 211.5, max 214.4, stdev 2.6
>          linux-ix-slave14: n   5, min 200.3, median 214.4, mean 211.3, max 217.2, stdev 6.4
>        moz2-linux-slave05: n   5, min 205.9, median 210.2, mean 210.2, max 217.0, stdev 3.8
>        moz2-linux-slave29: n   5, min 208.2, median 213.7, mean 212.3, max 216.8, stdev 3.4
If the results from comment 38 showed differences in min/max along with differences in median/mean, I'd be much more likely to believe that we're seeing a difference in the builder.  But as it is, the data at best suggests that something about the machine makes it *less likely* to produce a fast build, but leaves it perfectly capable of producing a fast build.  That seems weird.
I'd like to see some analysis of significance here - those means aren't that far out of the ordinary.
(In reply to comment #39)
> If the results from comment 38 showed differences in min/max along with
> differences in median/mean, I'd be much more likely to believe that we're
> seeing a difference in the builder.  But as it is, the data at best suggests
> that something about the machine makes it *less likely* to produce a fast
> build, but leaves it perfectly capable of producing a fast build.  That
> seems weird.

Drives in the existing ix machines are not well isolated - part of the reason they are being replaced. Could be these few machines are located in the racks such that this problem is more common.

Any number of things can cause the vm builds to be faster/slower, depending on what else is going on with the VH host. Since it's not more widespread, we should check to make sure that these two VMs have the same settings and reservations as the others. Dustin: how hard to check that?
We're not looking at speed of the builds though - we're looking at speed of the tests on builds created on these machines.

It's not hard to look at VM settings - it's just clicking around and taking screenshots (y'know, the windows equivalent of "diff").  Which vm's did you have in mind?
(In reply to comment #41)
> Any number of things can cause the vm builds to be faster/slower, depending on
> what else is going on with the VH host. Since it's not more widespread, we
> should check to make sure that these two VMs have the same settings and
> reservations as the others. Dustin: how hard to check that?
Whoa, wait.  We have VMs that do PGO builds (which require the builder to run a test and generate a profiler)?  That's going to add a lot of non-determinism to the profile, isn't it?
(In reply to comment #43)
> Whoa, wait.  We have VMs that do PGO builds (which require the builder to
> run a test and generate a profiler)?  That's going to add a lot of
> non-determinism to the profile, isn't it?

Not unless the physical machine is oversubscribed, right?

Anyway, the results from comment 38 don't suggest that variance of VM build perf is higher than physical machine builds' variance.
(In reply to comment #40)
> I'd like to see some analysis of significance here - those means aren't that
> far out of the ordinary.

This is tricky, for the following reason: Suppose there was no difference at all between any of these machines.  We'd still expect to see differences in the statistics, due to noise.

So suppose we choose a few machines which look different than the others.  We now want to run a statistical test to tell us whether those machines are actually different than the others.  Well...we already picked out those machines as looking different, so the test is more likely to say that the machines are different.

Compensating for this effect without over-compensating is hard, especially since I'm not prepared to characterize the distribution from which the build speeds are drawn (it doesn't appear normal).

Anyway, I'll run Welch's t-test -- maybe it can give us some confidence that the difference is just noise.
(In reply to comment #42)
> We're not looking at speed of the builds though - we're looking at speed of
> the tests on builds created on these machines.

Sure, but the differences are being traced back to a handful of machines that are, in theory, identical to the rest of the pool.
 
> It's not hard to look at VM settings - it's just clicking around and taking
> screenshots (y'know, the windows equivalent of "diff").  Which vm's did you
> have in mind?

The two VMs jlebar has noted as anomalous, namely moz2-linux-slave07 and moz2-linux-slave08.

I'm happy to fire up the VI Client and check the settings if you'd prefer.
Let's back up a bit here:

PGO is an *inherently* nondeterministic process.  On dedicated hardware or on a VM, the profiling numbers will be different on every run.  For simpler software, this would not be the case, but because Firefox has lots of timers and whatnot, it connects CPU time to wall time, and things are nondeterministic.

If that kind of nondeterminism can significantly affect the runtime of the resulting builds, then I think we need to fix that, rather than try to find build hardware with absolutely predictable performance characteristics.  Maybe that means changing what gets run to build the profile?  Maybe that means accumulating data over multiple runs into a single profile?

We still haven't seen a *significant* regression of performance against machine ID.  My science background makes me blind to claims of "dunno if it's significant, but the means are different".  At this point I still consider the null hypothesis (that test performance is unrelated to build machine) very viable.

It sounds like the best path forward is
 * statistical analysis of the build-machine-causes-differences-in-means
and once that's done
 * get the profiles for a "slow" and "fast" build
 * make *new* builds using those profiles on a single build machine
 * compare performance of the new builds
assuming that there are still differences,
 * hand analysis of the profiles.  Likely we'll find that in a "slow" build, some large chunk of code - js, maybe! - never executed, or executed once and some other chunk executed twice.
>  * get the profiles for a "slow" and "fast" build

Even if the per-machine variance turns out to be real, we'll probably still need to do this, since we won't have explained all of the variance increase.

Can we start on this now?  Presumably we're going to need to kick many PGO builds, let them run to completion, save the profiles, run Talos on the builds, and pick a particularly fast and slow profile.

I can do this if I can get access to a Linux builder.
Here's the statistical test I said I'd run:

slow_builds are the Linux-32 results from the builders with stars in comment 38.  fast_builds are the results from the remaining builders.  Non-PGO results (less than 190) are not counted.

Welch's two-sample t-test gives p=0.005 that slow_builds and fast_builds are drawn from distributions with the same mean.  But like I said in comment 45, I think this is kind of bogus.  Suggestions are welcome as to how to improve the test, but I'm not sure it's going to get us anywhere.

> t.test(slow_builds, fast_builds)

	Welch Two Sample t-test

data:  slow_builds and fast_builds 
t = -2.9201, df = 47.322, p-value = 0.005344
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -6.623432 -1.220548 
sample estimates:
mean of x mean of y 
 208.2267  212.1487
(In reply to comment 49)
> Suggestions are welcome as to how to improve the test, but I'm not sure it's 
> going to get us anywhere.

What I mean is: We probably can improve the test, but it's either going to tell us that the difference is or isn't statistically significant.

If the difference isn't significant, well, maybe we're not running the right test.  Our data isn't normal, etc.  If the difference is significant, well, we biased our sample (comment 38).  And anyway, it doesn't explain all the variance we've seen -- even the worst machine is capable of making fast builds, and even the best machine is capable of making slow builds.

I think the result of the test is: If someone wants to poke around with those machines, that might not be a bad idea.  I'm not sure what else we might learn by throwing more statistics at it.
(Assignee)

Updated

8 years ago
Depends on: 659403
(In reply to comment #38)
> These are the results of aggregating Linux-32 test results by builder.  I've
> omitted builders with fewer than 5 builds.

I have doubts about the validity of this analysis, because aggregating over many code revisions assumes that any code-based performance fluctuations are distributed evenly across all slaves. In fact some slaves might have done 'more than their fair share' of builds of a slower/faster code. We try to allocate jobs to slaves that mostly recently did that compile, but there are multiple buildbot masters and they might not poll in the right sequence to achieve that, and slaves can be reallocated between masters.
That's a fair point, although there doesn't appear to have been much movement in this test after PGO was enabled.  The graphs I generated earlier are pretty messed up (they include non-mc builds), but the shiny new graphserver paints the picture well:

http://graphs-new.mozilla.org/graph.html#tests=[[73,1,14]]&sel=1303799230231.2942,1306272641996&displayrange=365&datatype=running
Glandium filed bug 659942 because he discovered that we leave the profile files around between builds and gcc will aggregate profile data on each run.

I think this is the best lead we have on this bug, although the statistics (in bug 659942) suggest that it may not be The Fix.
For my reference, the machines I have on loan from releng are

  talos-r3-fed-002
  moz2-linux-slave17

and the commands I've been using in lieu of xvfb-run are:

  $ Xvfb :42 -screen 0 800x600x16 &
  $ DISPLAY=:42 make -f client.mk
I just finished collecting the Talos results from my first two builds.

The builds were in separate objdirs, so the profile data was not shared.

Build 1: n=5,  mean 197.5, range [196.23, 199.91]
Build 2: n=13, mean 202.3, range [200.17, 203.35]

(After I finished running build 2, I went back and ran build 1 just to make sure that nothing changed, and nothing had changed.)

Something is different about these two builds, to the tune of 4.8 points (2.3%).  Build 1 seems to be particularly slow, as I don't see any results below 200 at [1].

The point of this exercise was to get access to the profile files, which we now have.  I'll post them to the bug in a moment.

[1] http://graphs-new.mozilla.org/graph.html#tests=[[73,1,13],[73,1,14]]&sel=1304594032929.0112,1306436585738&displayrange=365&datatype=running
Posted file PGO .gcda files - Build 1 (obsolete) —
Posted file PGO .gcda files - Build 2 (obsolete) —
Glandium and I just figured out that these .gcda files are useless to humans without the corresponding .gcno files.  To get those, I need to recompile...
(Assignee)

Updated

8 years ago
Attachment #535441 - Attachment is obsolete: true
(Assignee)

Updated

8 years ago
Attachment #535442 - Attachment is obsolete: true
I now have three builds with both .gcda and .gcno files.

One microbenchmark in Dromaeo DOM, element.property, has particularly different performance on my three builds.  The benchmark does:

  var a = document.getElementsByTagName("a")[0];
  time this operation in a loop:
    a.setAttribute("id", "foo");

When I run this test alone, I get the following results:

  build-4 223.73 227.14 222.38 224.90 226.44 208.45
  build-5 262.59 266.52 265.42 267.77 258.94 267.88
  build-6 270.02 269.93 273.00 276.76 274.27 276.10

Clearly there's a big difference between build-4 and build-6 on this test.

I can think of two ways to proceed from here.  First, we could spin a build in which we load the PGO manifest many times rather than once.  It's possible that whatever caused build-4 to become unlucky would go away if we have more profiling data.

Second, we could try and investigate exactly why build-6 is so much faster than build-4 by examining the output of the profiling runs (i.e., the input to the second pass of a PGO build).  We could figure out where to look by loading our microbenchmark in an instrumented build and examining the profiling output.

I think I'm going to try both of these approaches.  The first one has the potential for solving this problem, while the second one has the potential for teaching us something about designing a better profiling manifest.
Ack, I attributed those numbers to the wrong test.  They're actually for the microbenchmark element.property, which does:

  var ret;
  var elem = document.getElementById('some-element');
  time this operation in a loop:
    ret = elem.id;
I profiled this test and found that the hot functions are, perhaps unsurprisingly:

  9.3% InlineGetProps (in js/src/StubCalls.cpp)
  6.8% js_NewExternalString
  6.4% js_NativeGet
  6.1% nsIDOMHtmlElement_GetID

I disassembled InlineGetProps and found considerable differences between build-4 and build-6.  (The assembly is too complex for me to infer what the differences indicate about the compiler's assumptions.)  But what's curious is that the recorded branch probabilities are nearly identical for the two builds.

The function is called 31443 times in build-4 and 32328 times in build-6, but these extra 1000 calls don't appear to affect the overall branch probabilities.

Why does GCC generate different code for this function given profiles which appear nearly identical?  I can think of a few possibilities:

* GCC is sensitive to tiny changes (much less than 1%) in branch probabilities.  I hope and doubt this is the case, because it means that the generated code is effectively random.

* GCC nondeterministically generates different code given the same profile.  Again, I really hope this isn't the case, and I don't expect that it is.

* gcov isn't showing me all the relevant data; there's something other than branch probabilities influencing the generated code.  I suppose this is possible, although I can't think of what kind of difference this might be such that it would also leave branch probabilities alone.

The third option strikes me as the most likely, so I'm going to investigate that more.
> I disassembled InlineGetProps and found considerable differences between 
> build-4 and build-6.

I should say that most of the functions in this file were compiled exactly the same; InlineGetProps is unusual in that it looks different in the two builds.
I took the object files containing the four hot functions from comment 61 (in order: StubCalls, jsapi, jsobj, and dom_quickstubs) and substituted these into my slow build's objdir.

I wanted to know: Will the build run faster if I use all four of the fast object files?  Do some of the object files contribute more to the increase in speed than others?

I ran six tests.  I benchmarked the slow build, with no modified object files, the slow build with just one of the four fast object files inserted, and the slow build with all four fast object files inserted.

I found that yes, with all four fast object files substituted in, the slow build approaches the speed of the fast build.  Also yes, some object files contributed more to the faster build than others.  The results were:

                       units: run/s (bigger is better)
vanilla slow build:    203 229 229 224 232 230 (mean 229)
fast StubCalls.o only: 221 225 223 228 231 230 (mean 226)
fast jsapi.o only:     239 248 244 250 239 251 (mean 245)
fast jsobj.o only:     235 238 239 239 234 241 (mean 238)
fast dom_quickstubs.o: 234 248 238 246 242 251 (mean 243)
four fast obj files:   266 265 271 269 268 269 (mean 268)

(I left out the 203 for the vanilla build when calculating the mean, since it's an outlier.)

So the conclusion is that the fast versions of jsapi.o, jsobj.o, and dom_quickstubs.o all contribute to the faster build, while StubCalls doesn't.  Perhaps the differences in that object file (comment 61) are only cosmetic.  That would be a promising result -- maybe when we look at the other object files, we'll see meaningful differences in its profile data.

So now that I know which object files make a difference, I'm going to investigate one of them more closely.
I may also try to reproduce these results on my local Linux box, now that I have one.  Turnaround should be a lot faster.
Well, this is unfortunate.

JS_PUBLIC_API(JSString *)
JS_NewExternalString(JSContext *cx, const jschar *chars, size_t length, intN type)
{
    CHECK_REQUEST(cx);
    return JSExternalString::new_(cx, chars, length, type, NULL);
}

JSExternalString::new is inlined in here (it's defined in jsstrinlines.h).

But as far as I can tell, I can't expand this inlined function (say to asm) when I look at this function using gcov.  So the gcov output is useless.

I'd guess this is what was happening in comment 61, too.  There a few inlined functions in InlineGetProps; perhaps the difference in branches taken is inside one of those.
...actually, it looks like gcov *does* trace into these inlined functions.  That's good news.
It might be interesting to try to do the MOZ_PROFILE_GENERATE phase at -O0.
From comment 61, one possibility is:

> * GCC nondeterministically generates different code given the same profile.
> Again, I really hope this isn't the case, and I don't expect that it is.

We can rule this out by recompiling a given object file a few times using the same profiling data and observing that the object file is the same each time.  I just did this, and indeed the object file didn't change.
Even though JS_NewExternalString is the only hot function in jsapi.o, it appears that build 4 and build 6's copies of JS_NewExternalString are identical.

So either:

 * There's another hot function in jsapi.o, perhaps one which doesn't show up in my (Mac based) profiler, or

 * Substituting in build-6's jsapi.o doesn't actually improve the performance of build-4.

I wonder where I went wrong...
(Assignee)

Updated

8 years ago
Depends on: 661890
When I tried to make PGO builds locally with Ubuntu's GCC 4.5.2 on x64, I hit ICEs.  Glandium suggested I try Debian, and I was able to get successful builds in a VM.

But this makes me wonder whether what we're seeing is due to bugs in GCC which are particular to the version on the build machines.  (My machine's GCC 4.5.2 crashes, but maybe the build machines' GCC 4.5.2 uses uninitialized memory.)

To test this, I spun four PGO builds within my Debian VM and ran Dromaeo.  It looks like all four builds have approximately the same performance, but the variance within a build is very large, so I don't want to read anything into this yet.  I've requested a Linux-64 talos machine for further testing.
Interestingly, it appears that the variance on Linux-32 was reduced significantly around May 31.  The fastest builds are still around 218, but the lowest results are around 206, instead of 200.

From the graph, it appears that the change happened somewhere between d2586300a6a7 and ebf7f6d8e177.

http://graphs-new.mozilla.org/graph.html#tests=[[73,1,14]]&sel=1303236253416.5217,1307563766460&displayrange=90&datatype=running
Also, TM tends to score lower than MC on Dromaeo DOM since we turned on PGO.  That's awfully strange.

http://graphs-new.mozilla.org/graph.html#tests=[[73,1,14],[73,4,14]]&sel=1303882152578.901,1307568053569&displayrange=90&datatype=running
Possibly related to disabling PGO for JS recently ?
> Possibly related to disabling PGO for JS recently ?

Could be!  Do you have a bug number on that?
Happened in bug 605033, rev 7d2a3d61a377 in m-c.
Ah, that patch only applies to Windows.

> +ifeq ($(OS_ARCH),WINNT)
> +NO_PROFILE_GUIDED_OPTIMIZE = 1
> +endif
(In reply to comment #72)
> Also, TM tends to score lower than MC on Dromaeo DOM since we turned on PGO.
> That's awfully strange.

Here's a better graph of this:

http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,4,14]]&sel=none&displayrange=90&datatype=running
Try behaves similarly to TM [1].

Cedar is an interesting case [2].  It appears to behave similarly to MC, not TM/Try.  But if you decrease your browser's width enough, you'll see that, until the last grid block (starting around June 1, when we observe the increase in minimum score across all trees), there's bimodality in the cedar scores.  They're either high or low, but few are in the middle.

It kind of looks like the MC scores may also be weakly bimodal, but it's harder to see because there are so many of them.  If we imagine that most of the variance is due to one branch being flipped one way or another, it's not so surprising to see bimodality.  I can analyze it more rigorously if we think that would reveal anything, but I'm not sure what there is to learn.

I have lots of new questions, but few new answers.  To recap the two most important new questions:

* Why does TM/Try perform worse than MC/Cedar?  (In particular, all four branches have the same min score, but MC/Cedar have a higher mean score.)

As far as I can tell, the only difference between TM and MC is the priority they're given by the build infrastructure [now 3, but until recently was 4].  They have the same mozconfigs [5, 6] modulo orderings.  But Cedar (in [4]) has the same priority as TM and a higher priority than Try.  So if it's priority that affects things, then why doesn't Cedar look like TM/Try?

Perhaps the pushes to Cedar usually take place at a different time than most pushes to TM (bz and volkmar both get up before MV).

* Why did the minimum score increase around June 1?  If we can figure this out, it might give us insight into the rest of the variance.


[1] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,23,14]]&sel=1300307570760.6572,1307582050410&displayrange=90&datatype=running

[2] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,26,14]]&sel=1303944335005.3289,1307581574830&displayrange=90&datatype=running

[3] http://hg.mozilla.org/build/buildbot-configs/file/7c00ac95c87b/mozilla/master_common.py#l20

[4] http://hg.mozilla.org/build/buildbot-configs/file/af01d1e2040a/mozilla/master_common.py#l20

[5] http://hg.mozilla.org/build/buildbot-configs/file/0ba5a6c4330a/mozilla2/linux/tracemonkey/nightly/mozconfig

[6] http://hg.mozilla.org/build/buildbot-configs/file/0ba5a6c4330a/mozilla2/linux/mozilla-central/nightly/mozconfig
Well, here's something.

I pushed a Linux-32 build to Try which runs the profiling step 10 times.  (On Linux, the data is cumulative.)  It scored 220 on Dromaeo DOM, which is 8 points faster than the next closest run from Try in the past 90 days (minus one outlier at 224), and as fast as the fastest builds off MC.

This lends credence to the idea that MC builds are benefiting from cumulative profiling data left around between depends builds (bug 659942).

I don't have a good explanation for why, in bug 659942 comment 1, I didn't find a statistical difference between a builder's first PGO build and its later builds.  Presumably later builds would have some chance of being depends builds, so there'd be an increased chance of seeing an increase in speed in later builds.  Perhaps I messed up the analysis, or perhaps the effect is too small to see through the noise.
Second build which ran the profiling step 10 times came back from try with a 218.23 on Dromaeo DOM.  Again, that's higher than all but one of the results from Try in the past 90 days.

I'll file a bug on running the profiling step more times.  I'm not sure how far that will bring down the variance here, but it appears that it will help.
(Assignee)

Updated

8 years ago
Depends on: 663251
MC is now clobbering profiling data on each build [1].  I expect that this will reduce Linux scores to the low end of the old range.

I also checked in changes to the PGO script so we can run the instrumented build multiple times [2].  We can land a buildbot-config change from bug 663251 and turn that on whenever.  I'd like to wait a few days, so we can observe whether things change as expected and so we can get an estimate of the variance in the new results.

[1] http://hg.mozilla.org/mozilla-central/rev/2d1746337846
[2] http://hg.mozilla.org/mozilla-central/rev/83b4a771ea45
> MC is now clobbering profiling data on each build [1].  I expect that this will 
> reduce Linux scores to the low end of the old range.

The change appears to have had no impact on Talos scores.  Assuming the patch to clobber the data works (it does when I test on my machine), this indicates that the variance isn't due to accumulated profiling data on MC, as I'd previously thought.

We could start running the instrumented builds multiple times anyway.  It might resolve the issue.  But I'd rather understand why MC builds are faster than TM builds first.  I'm afraid if we don't, we'll end up painting over this issue, whatever its cause, and it'll come back to bite us later.

You can see what I'm talking about with TM vs MC at [1].

[1] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,4,14],[73,1,14]]&sel=1306629271386.4368,1308244525714&displayrange=90&datatype=running
I took another look at the scores today.  I think TM pulled this change in the June 20 merge (a285146675dc).  After that merge, the TM scores increased a bit.  (It's clear in the graph if you resize the window so it's pretty narrow.)

We can't attribute this change to clobbering until tm merges back to m-c and we see whether m-c scores increase (indicating that the source of the speedup was from tm) or stay the same (indicating that the source of the speedup was from mc).

I'll keep an eye out over the next few days.

http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,4,14]]&sel=1307333365814.5754,1309179466534&displayrange=90&datatype=running
Confirmed that bug 659942 landed in TM in the June 20 merge.
Just had a look at mozilla-inbound: I expected its scores to resemble TM, but in fact M-I's scores look like M-C's.  I guess that's good.

Not enough data yet to say whether Monday's TM -> MC merge resulted in an uptick on MC corresponding to the uptick on TM, but it looks like MI didn't move, so I'd guess MC won't move either.

It's possible that the uptick on TM was caused by something unrelated to this bug, but it's also possible that it was caused by clobbering the profiling data.
Catlee landed bug 663251 on Thu, June 30 [1]; we now run the instrumented binary 10 times during Linux PGO builds.  I'd hoped this landing would resolve this bug once and for all, but unfortunately that doesn't appear to have been the case.

There appears to have been a slight uptick on Try, but there isn't much data, and it's noisy anyway, since people don't always update their trees.

Critically, scores on M-C and M-I have the same variance as before (about a 15-point spread).

I think the question now is: Why did this change not make a big difference, while when I pushed to try, I got results significantly higher than any others on try before or after [2]?

[1] http://hg.mozilla.org/build/buildbot-configs/rev/58247fce99b4

[2] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,23,14]]&sel=none&displayrange=90&datatype=running

The results from my pushes are the cluster of 4 results with score ~ 220 on June 9 and 10.  No push came close to this score until 83a4240b3886 on Jun 29.  In the meantime, scores on all trees have been slowly trending upward.
It looks like this may be partially or wholly fixed, although I have no idea why.  Starting around July 9, scores from MC, TM, and MI all tightened up into a 9-point range.  (The previous range was ~20 points.)  9 points is close to what we had before we turned on PGO.

All three trees' scores look similar, which is fantastic.

Catlee or Dustin, were there changes to the infrastructure over the weekend?

http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,4,14],[73,1,14]]
(Assignee)

Updated

8 years ago
Depends on: 672563
Depends on: 676252
I think glandium fixed this in bug 676252.  It's kind of hard to see in the graphs, because that bug landed at almost the same time as type inference, so all the benchmarks show a change.

Dromaeo jslib used to be obviously bimodal, but now isn't.  And the variance of dromaeo DOM is now about what it was before we turned on PGO.  The Dromaeo DOM results on m-i now look like the results from m-c and try.

\o/
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.