Closed Bug 653961 Opened 13 years ago Closed 13 years ago

Separate builds from the same source don't perform the same on Talos

Categories

(Release Engineering :: General, defect, P5)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

Details

(Whiteboard: [unittest][talos])

Attachments

(8 files, 2 obsolete files)

I noticed this in bug 414641 comment 30.

See Compare Talos results at:

http://goo.gl/PxwYJ

The link compares ~7 Talos runs on try to ~6 m-c runs.  (Not every test was run the same number of times, unfortunately, but hover over a test score to see the number of runs.)

Notice that in the vast majority of cases, the variance on the try run is far below the variance on the m-c run.  My understanding is that the try and m-c setups were supposed to be identical, so I find this strange.

Perhaps we're seeing this because I kicked off all my try runs at night, while the m-c runs were all during the day.  But if time of day affects our benchmarks, we need to understand and try to fix this, because it makes accurate testing very difficult.
Priority: -- → P5
Whiteboard: [unittest][try][talos]
OS: Mac OS X → Linux
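(As a rough illustration of the kind of comparison involved: given the per-run scores for one test from try and from m-c, the spread difference can be quantified with something like the sketch below. The score lists are made up; only the shape of the comparison matters here.)

# Hedged sketch: compare the spread of Talos replicates for one test on try
# vs. m-c. The scores below are placeholders, not real graphserver data.
from statistics import mean, stdev
from scipy.stats import levene   # a robust test for equal variances

try_runs = [213.9, 214.3, 214.1, 213.7, 214.5, 214.0, 214.2]   # hypothetical
mc_runs  = [205.1, 213.8, 208.4, 214.6, 202.9, 211.7]          # hypothetical

print(f"try: mean {mean(try_runs):.1f}, stdev {stdev(try_runs):.2f}")
print(f"m-c: mean {mean(mc_runs):.1f}, stdev {stdev(mc_runs):.2f}")

# A small p-value here says the spreads really are different, rather than a
# small-sample artifact.
stat, p = levene(try_runs, mc_runs)
print(f"Levene W = {stat:.2f}, p = {p:.3f}")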
To clarify why this is important: The difference in variance indicates that something is different between the m-c and try testing environments.  It therefore seems plausible that this difference affects the tests' mean results in addition to their variance.  So until we can understand this effect, it's hard to believe that one can correctly compare Talos numbers between try and m-c.
Bumping severity.  This makes it impossible for us to know the effect of any patch before landing it on mozilla-central, which makes the try server useless for performance measurements.  :(
Severity: normal → major
The set of machines that run talos, and their environment, don't vary between any of the branches - it's all one pool. They get fed builds and run tests, and don't care what the source was. Now that's not to say that they all return exactly the same result for the same build, but there's nothing branch-specific there.

AFAICT, the mozilla-central runs in your comparison are from a range of checkins
 http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=2f41abc5a89d&tochange=9f8e09d29a0d
and are compared to a single try revision. It would be great to hear what you've done to rule out any perf changes in that range.

It would also be good to hear about the determinism of gcc's PGO. Most likely there were several different objdirs producing those m-c builds, starting from the clobber glandium set at 8:50am that day, so there is certainly scope for all those builds not being completely equal.

I think a better comparison would be to redo Talos runs on a single m-c build, and, if you think there might be variation during the day, to try at different times of day.
> It would be great to hear what you've
> done to rule out any perf changes in that range.

I looked at the changeset log and observed that none of those csets were likely to affect perf on so many different benchmarks.

Note also that for many of the tests on which I observed high variance, my changes' mean performance was close to m-c's mean (e.g. a11y linux-64).  To explain the variance as being caused by changes on m-c, you'd have to have a performance regression / improvement followed by a fix / backout.  I doubt this is hiding in the changelog.

> It would also be [good to] hear about the determinism of gcc's PGO.

I'll let Mike speak to this; I don't know.  When I run more tests, I'll do all platforms.

> I think a better comparison would be to redo Talos runs on a single m-c build,

Judging from what you've said, it seems very unlikely that this would reproduce these results.  If the pool of machines is the same, then the pool of machines is the same.  But that doesn't mean there isn't an issue here, because we still see a difference between the one-for-every-push m-c Talos results and the results from try.

> and, if you think there might be variation during the day, to try at different
> times of day.

To be clear, the suspicion that time of day matters is really a suspicion that when there's more load on the system, the Talos machines are somehow affected.

Can you think of how that might be?  Do they run a daemon which is chattier when there's more load?

It could be a deep problem -- for instance, individual machines or the whole DC is hotter when there's more load, so some machines run their CPUs slower to compensate.  Or it could be vibrations affecting hard disks [1].  Have you observed any problems like this?  Do you have monitoring which would allow us to rule these things out?

In any case, I'll run more tests.

[1] http://blogs.sun.com/brendan/entry/unusual_disk_latency
(In reply to comment #4)
> > It would also be [good to] hear about the determinism of gcc's PGO.
> 
> I'll let Mike speak to this; I don't know.  When I run more tests, I'll do all
> platforms.

I don't have a definitive answer on that, but PGO is going to be disabled, so we'd be able to see if that changes afterwards.
-O3 and PGO are back on, hopefully for good, AIUI.

I've arbitrarily chosen 0221eb8660f4 to be the base of perf comparisons, so I just have to click one button on the self-serve build page.  I've chosen to focus on Dromaeo results.

To test the hypothesis that time of day and/or load on machines affects variance, I pushed bb03789cbaec (a noop change) to try.  I'll kick off more builds in the next few days.

Stay tuned.
I've kicked two sets of Talos runs for rev bb03789cbaec, one set when there was no backlog on Try, and another when there was a large backlog.  I've attached the raw graphserver data, which is also available at [1] (I wanted to save it, since I'm not sure how long graphserver keeps Talos results from Try).

I haven't looked at this data at all yet, but I will soon.

[1] http://graphs.mozilla.org/api/test/runs/revisions?revision=bb03789cbaec
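(For reference, archiving that graphserver output locally is straightforward; the sketch below only assumes the URL quoted above and saves the response verbatim, since the JSON schema isn't documented here.)

# Hedged sketch: archive the raw graphserver results for a try revision so
# they survive graphserver's retention window. The response is stored
# verbatim rather than parsed.
import json
import urllib.request

REVISION = "bb03789cbaec"
URL = f"http://graphs.mozilla.org/api/test/runs/revisions?revision={REVISION}"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

with open(f"talos-{REVISION}.json", "w") as out:
    json.dump(data, out, indent=2)

print(f"saved results for revision {REVISION}")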
I just analyzed this data.  The two runs are indistinguishable as far as I can tell.

I'm going to look through the data from comment 0 again.  Maybe one of the changesets was slower than the others, or maybe all the tests ran slower at a certain time.
You can kind of see the variance on graphserver.

http://goo.gl/7DMfv

(I wish there was a way to turn off the lines connecting the dots in this graph.  For our purposes here, try to ignore the lines.)

Notice that the general spread here is about +/- 10 points, while my tryserver build saw +/- 3 points on the same test.  (This is Dromaeo DOM on Linux-32.)

The lines make it appear that there's a general upward trend, so you might think that trend explains the spread.  But I don't think it does.  First, it's not clear to me whether the lines trend upward because of points shown in the plot or points lying outside the edges of the plot.

But more convincingly, let's examine two specific data points.  Inside the 28 Apr 2011 3:00p - 9:00p box, pick out the highest and lowest points.  These correspond to 

 * changeset a78968ac5bbb - test result 213.97
 * changeset 4a6f98909da5 - test result 202.75

Notice that the higher result is the earlier test, so this doesn't fit with the general upward trend we observed earlier.  It also seems unlikely that this was an actual 5% performance regression, since it immediately bounced back, and *on the same machine as 4a6f98909da5* -- this is changeset 88d3c5bde0ba.

Examining the log (hg log --rev a78968ac5bbb:88d3c5bde0ba), I don't see how we could have a 5% regression followed by immediate fix back to previous levels.
If I'm right and the high variance we're observing is due to testers rather than builders, then the variance should disappear if we kick a new set of runs for the builds in question.  On the other hand, if I kick a bunch of runs for the existing builds of a78968ac5bbb and 4a6f98909da5 and still observe the 5% difference, then that would give me confidence that the perf difference has something to do with the bits.

I'll try not to take down the self-serve server this time...
Interesting!  The extra Talos runs I kicked for the existing bits on a78968ac5bbb and 4a6f98909da5 all came back with results very close to the original results.

I can think of two possible explanations for this.  Either there really *is* a 5% perf difference between the source at these two revisions, or there's some serious nondeterminism in our build.

Both of these seem unlikely to me.  (It's possible that there is a real perf difference between these two csets, but that doesn't explain the ~5% variance I see in general on this test.)

I'm going to kick a few new builds of these changesets and see how they perform.  If the new builds perform the same as the m-c builds, then I'll try to reproduce this result with another set of changesets.  If the new builds perform differently from the m-c builds, then we'll want to investigate why.
The test results for both csets hovered around 205.

So to review, here's the situation:

 * The bits on m-c for 4a6f98909da5 consistently give Dromaeo DOM results of around 214 on Linux-32.
 * The same code pushed to try results in a build which benchmarks at 205, about 5% slower.

If this is correct, it's a pretty surprising result!

Why might this be happening?  Here are some guesses:

 * Nondeterminism in Linux PGO.  We could rule this out by seeing the same effect on Mac.
 * Differences in the software stacks on the builders.  (Presumably these should all be the same!)

Since the difference appears to be in the bits, we should be able to disassemble and diff the builds.  I imagine this won't be useful in PGO land, since we'd expect some nondeterminism (just not enough to cause a 5% perf swing), but on the Mac, I'd expect close to bit-for-bit sameness.  In any case it would be good to verify that when the test results are different the bits are also different.

I'm going to try to recreate these results, since I'm pretty skeptical I've interpreted this correctly.  I've requested birch.
Depends on: 657609
Depends on: 658571
Releng set me up with ssh access to a Talos box, and I've been running tests there.  Amazingly, the two builds from comment 14 perform just as they did on try: The m-c build a78968ac5bbb yields ~214, while the try build yields ~204 on Dromaeo DOM.  I ran 8 tests of each, and while there's some variance, the two sets of results don't even come close: [214.05 - 217.01] versus [202.41 - 208.58].

I think we can tentatively conclude that, somehow, the bits really are different.

I'm going to confirm that the bits from 4a6f98909da5 act as we expect.  This build comes from m-c, so if it acts as we now expect, we'll be able to rule out Try as having anything to do with this.
The build from 4a6f98909da5 acted just as expected.  In seven runs, it varied from 203.71 to 207.52.

This strongly suggests that what we're seeing doesn't have to do with a difference between try and mc.

There are still many questions to answer.  Among them:

 * On what platforms do we see this effect?  (All tests up until now were on Linux-32.)
 * How common is it?
 * What causes it?

My guess is that this might be due to PGO, as nthomas suggested in comment 3.  (Observing a similar effect on mac would suggest this isn't the case.)

I'd further guess that perhaps these differences arise due to the hardware differences of builders -- apparently we have at least two classes of machines doing builds.  This should be easy enough to test.
Summary: Talos on tryserver has lower variance than Talos on m-c → Separate builds from the same source don't perform the same on Talos
Whiteboard: [unittest][try][talos] → [unittest][talos]
Our linux builders are:
 - iX machines (hostname matches /.*-ix-.*/): physical hardware in scl1
 - VMs: on a VMWare cluster in sjc1
As I mentioned in IRC, we have a decent mix of build and try in each hardware type.
Are you able to reproduce this with two different try builds (as opposed to try vs m-c)? If so, can you try comparing try builds with "mk_add_options PGO=" in mozconfig-extra to disable PGO ?
Here's a chart from graphserver which suggests to me that this might be related to PGO.

http://graphs.mozilla.org/#tests=[[73,1,415],[73,1,417]]&sel=1295849726,1306160530

PGO is presumably the large perf increase around 4/26.  After that, the variance appears to increase.  (I realize that we turned PGO on/off a few times, but none of the post-4/26 builds dropped down to pre-4/26 levels, so I presume they all have PGO enabled.)

(In reply to comment #19)
> Are you able to reproduce this with two different try builds (as opposed to
> try vs m-c)? If so, can you try comparing try builds with "mk_add_options
> PGO=" in mozconfig-extra to disable PGO ?

Right now, I don't have a reliable way to reproduce this effect.  It doesn't seem to be uniformly random, since I've kicked a number of try builds which have all had the same performance.
Similar graph for Windows 7:

http://graphs.mozilla.org/#tests=[[73,1,355],[73,1,357]]&sel=1300873828,1306122160

I wouldn't read too much into this, because I'm cherry-picking lines which have large variance.

However, there are a few 8-point swings and one 10-point swing in this graph -- this is more than the natural variance I saw on Linux.  More importantly, the swings seem to correspond to matching swings in Win XP:

http://graphs.mozilla.org/#tests=[[73,1,355],[73,1,379]]&sel=1300873828,1306122160

http://graphs.mozilla.org/#tests=[[73,1,357],[73,1,376]]&sel=1300873828,1306122160

Unfortunately neither of those graphs lines up perfectly, because graphserver only lets me plot on a per-machine basis.  I'll go make my own graphs now.
Attached is a plot of Windows XP and 7 Dromaeo DOM results over a 2 month period (March 20 - May 20).

Notice the two sections of bimodality.  Neither of these is a regression in all builds -- in the first section, most builds are fast, but some are slow, and in the second section, most builds are slow, but some are still fast!

In a graph I'll post in a moment, we can see that the Win7 and XP numbers track each other closely.  Since Talos machines for WinXP and 7 use the same bits, this suggests that the noise is coming from the builds themselves.
Assignee: nobody → justin.lebar+bug
Status: NEW → ASSIGNED
This plots the correlation between WinXP and Win7 Dromaeo DOM.  There's a strong correlation between WinXP and Win7 scores within both modes (that is, both clusters sit along a line with positive slope -- in this case, slope close to 1).

This suggests that the bimodality from the previous attachment is in the bits, rather than due to random variation in the testers.
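(The correlation itself is cheap to compute once the Win7 and WinXP scores are paired up by changeset; a minimal sketch with made-up numbers follows.)

# Hedged sketch: correlate Win7 and WinXP Dromaeo DOM scores for the same
# builds. The paired scores are placeholders; the real data comes from
# graphserver, keyed by changeset.
import numpy as np

# (win7_score, winxp_score) per changeset -- hypothetical values
pairs = [(253.5, 256.4), (245.3, 246.2), (255.8, 257.8),
         (251.6, 254.7), (259.7, 262.4), (242.3, 248.0)]

win7 = np.array([p[0] for p in pairs])
winxp = np.array([p[1] for p in pairs])

r = np.corrcoef(win7, winxp)[0, 1]
slope, intercept = np.polyfit(win7, winxp, 1)   # line of best fit
print(f"r = {r:.3f}, best fit: winxp ~ {slope:.2f} * win7 + {intercept:.1f}")

A high r with a slope close to 1 is what points at the builds themselves rather than at tester noise.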
Comment on attachment 534498 [details]
Win XP / 7 Dromaeo DOM correlation first 300 revs

I should have mentioned: This is correlation for just the first 300 revisions.  But the plot looks as you'd expect when you expand to all the revisions from the previous plot -- the faster mode becomes more dense, and still hews closely to the existing line of best fit.
Attachment #534498 - Attachment description: Win XP / 7 Dromaeo DOM 200a746 to d3d0241 correlation → Win XP / 7 Dromaeo DOM correlation first 300 revs
Attached image Dromaeo DOM Mac
This plot isn't as interesting as the Windows one.  Mac-32 seems to have pretty low variance, while Mac-64's variance is larger.
Attached image Linux Dromaeo DOM
I think this plot pretty conclusively shows that PGO is to blame for the extra variance on Linux.

It looks like there was initially a problem with PGO builds occasionally turning out much worse than they should, but that got fixed around r1000.  But even after that fix, the variance is still much higher than it was before.
Although we've ruled out the testers as the source of this variance, it's possible that some builders yield faster builds than others.

Apparently the only way to go from buildid to builder name is to parse the build logs.  So here I go...
The bimodality in Windows may be a non-effect -- I'm not sure I pulled only m-c revisions, so we could be seeing a branch's test results.
I collected this data by scraping the tinderbox FTP server.
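(Roughly, the scraping step looks like the sketch below, run over logs already downloaded from the FTP server. The directory layout and the "slave:" marker are assumptions about what the logs contain, not the real format.)

# Hedged sketch: map each downloaded build log to the slave (builder) that
# produced it. The "slave: <name>" marker and the one-log-per-file layout are
# assumptions; adjust the regex to whatever the real logs use.
import pathlib
import re

LOG_DIR = pathlib.Path("logs")                 # assumed: one .txt log per build
SLAVE_RE = re.compile(r"^\s*slave:\s*(\S+)", re.IGNORECASE | re.MULTILINE)

builder_for_log = {}
for log in sorted(LOG_DIR.glob("*.txt")):
    match = SLAVE_RE.search(log.read_text(errors="replace"))
    builder_for_log[log.name] = match.group(1) if match else "unknown"

for name, slave in builder_for_log.items():
    print(f"{name}: {slave}")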
It doesn't appear that builder type makes a large difference in the builds' speed.  I'm not sure whether the stdev differences are statistically significant, but the variance spread on Linux-* and Win XP seems pretty large.

All of this data is for m-c builds from the past month.  I excluded nightlies, since they're building a cset that was already built on m-c.  I also excluded non-PGO Linux builds by throwing away csets which scored too low to be PGO.  (To be really rigorous, I should have only considered builds which had test runs on all platforms and excluded, on all platforms, the csets for which Linux builds had PGO turned off.  But I don't think it really matters here.)  A rough sketch of the aggregation itself appears after the tables below.

Machine labels:

  linux-ix-slave: New bare hardware
   mv-moz2-linux: Old bare hardware
      moz2-linux: VM

          w32-ix: New bare hardware
         mw32-ix: Old bare hardware
           win32: VM

Linux-32:
>       moz2-linux-slave: n 105, min 201.7, 1q 208.0, median 210.9, mean 211.4, 3q 215.2, max 220.5, stdev 4.4
>         linux-ix-slave: n  79, min 200.3, 1q 208.3, median 213.6, mean 212.8, 3q 216.9, max 221.3, stdev 4.7
> mv-moz2-linux-ix-slave: n  88, min 200.1, 1q 207.1, median 212.6, mean 212.0, 3q 217.2, max 219.7, stdev 5.1

Linux-64:
>       moz2-linux-slave: n 105, min 242.3, 1q 255.2, median 258.4, mean 257.6, 3q 261.3, max 268.2, stdev 4.9
>         linux-ix-slave: n  79, min 243.2, 1q 256.1, median 258.8, mean 258.3, 3q 262.2, max 269.2, stdev 5.5
> mv-moz2-linux-ix-slave: n  88, min 243.9, 1q 255.4, median 258.4, mean 258.2, 3q 262.0, max 269.3, stdev 4.9

Win 7:
>           w32-ix-slave: n 110, min 245.3, 1q 251.6, median 253.5, mean 253.5, 3q 255.8, max 259.7, stdev 2.9
>          mw32-ix-slave: n 189, min 242.3, 1q 252.3, median 254.1, mean 254.1, 3q 256.3, max 261.4, stdev 2.9
>            win32-slave: n  31, min 246.0, 1q 250.8, median 253.1, mean 252.9, 3q 255.1, max 259.3, stdev 3.1

Win XP:
>           w32-ix-slave: n 110, min 246.2, 1q 254.7, median 256.4, mean 256.0, 3q 257.8, max 262.4, stdev 2.9
>          mw32-ix-slave: n 189, min 248.0, 1q 255.5, median 257.1, mean 256.9, 3q 258.8, max 262.8, stdev 2.6
>            win32-slave: n  31, min 252.4, 1q 254.5, median 256.1, mean 256.3, 3q 258.3, max 261.1, stdev 2.4

Mac-32:
>    moz2-darwin10-slave: n 342, min 227.6, 1q 232.8, median 234.2, mean 234.0, 3q 235.2, max 239.1, stdev 1.8

Mac-64:
>    moz2-darwin10-slave: n 342, min 244.8, 1q 256.0, median 258.5, mean 258.1, 3q 260.4, max 266.5, stdev 3.8
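(Here is the rough shape of the aggregation mentioned above; results is assumed to be a list of (slave name, score) pairs already filtered to m-c PGO builds for a single platform, and the class prefixes mirror the machine labels listed before the tables.)

# Hedged sketch of the per-slave-class aggregation behind the tables above.
from statistics import mean, median, quantiles, stdev

CLASSES = ("moz2-linux-slave", "linux-ix-slave", "mv-moz2-linux-ix-slave")

def summarize(scores):
    q1, _, q3 = quantiles(scores, n=4)   # 25th / 50th / 75th percentiles
    return (f"n {len(scores):3d}, min {min(scores):.1f}, 1q {q1:.1f}, "
            f"median {median(scores):.1f}, mean {mean(scores):.1f}, "
            f"3q {q3:.1f}, max {max(scores):.1f}, stdev {stdev(scores):.1f}")

def report(results):
    for cls in CLASSES:
        scores = [score for slave, score in results if slave.startswith(cls)]
        if scores:
            print(f"{cls:>24}: {summarize(scores)}")

# report(linux32_results)   # hypothetical input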
One data point: the iX machines (/.*-ix-.*/) have really bad disk performance because they're rockin' consumer-grade SATA drives.  See bug 655304 for the work in progress to fix that, noting that none of the servers are back from iX yet.

I'd be surprised if that had a significant effect on the PGO - AIUI, PGO is using profiling, not wall-clock performance measurements, to decide where to optimize for speed and where to optimize for space.  That said, Firefox is full of timers, and timing differences due to disk slowness could easily cause changes in codepath execution counts.

BTW, to my knowledge, there's no great difference between "Old" and "New" bare hardware, other than which PO they were purchased on.
I can think of two next steps to address PGO:

a) Take a pair of builds with a large perf difference, disassemble, and try to figure out why one is slower.  The hope would be that most of the code would be the same, but we'd be able to see where the compiler chose to compile for space rather than speed along some critical path.

b) Test whether making the profile script do more (and thereby exercise more codepaths) reduces variance.

There's also a curious effect on Mac: Why does Mac-64 have twice the stdev of Mac-32?  My understanding is that we run 32-bit builds on 10.5 and 64-bit builds on 10.6.  Maybe we could try to approach this by running a 32-bit build on 10.6.  I don't know what would be involved with that.

(It's probably also worth checking that this variance is in the testers and not in the builders, as we did for Linux.)
Is there any way to add debugging output for PGO?  Perhaps we can keep the profile data around with the build?
(In reply to comment #34)
> One data point: the iX machines (/.*-ix-.*/) have really bad disk
> performance because they're rockin' consumer-grade SATA drives.  See bug
> 655304 for the work in progress to fix that, noting that none of the servers
> are back from iX yet.
> 
> I'd be surprised if that had a significant effect on the PGO - AIUI, PGO is
> using profiling, not wall-clock performance measurements, to decide where to
> optimize for speed and where to optimize for space.  That said, Firefox is
> full of timers, and timing differences due to disk slowness could easily
> cause changes in codepath execution counts.

OTOH, the profile runs right after the build, so it shouldn't hit the disk too hard because everything should already be in RAM.  Except if there isn't enough of it.

(In reply to comment #36)
> Is there any way to add debugging output for PGO?  Perhaps we can keep the
> profile data around with the build?

You can keep the .gcda files from the objdir if you want.
These are the results of aggregating Linux-32 test results by builder.  I've omitted builders with fewer than 5 builds.

Sorry to make you squint at all these numbers.  There does appear to be something going on here -- I've starred some builders with suspiciously low medians.  (Some of these also have low means.)  It's hard for me to say whether this is significant, though.

>          linux-ix-slave28: n  21, min 205.3, median 214.3, mean 213.6, max 220.5, stdev 4.6
>  mv-moz2-linux-ix-slave11: n  16, min 206.3, median 217.0, mean 214.6, max 219.1, stdev 4.4
>          linux-ix-slave38: n  14, min 204.1, median 213.4, mean 213.1, max 219.2, stdev 4.2
>        moz2-linux-slave42: n  12, min 207.1, median 211.2, mean 212.5, max 220.5, stdev 4.4
>  mv-moz2-linux-ix-slave02: n  11, min 205.2, median 215.5, mean 213.9, max 219.3, stdev 4.5
>*       moz2-linux-slave07: n  11, min 204.1, median 209.2, mean 210.9, max 219.8, stdev 4.7
>  mv-moz2-linux-ix-slave08: n  11, min 204.9, median 216.2, mean 213.7, max 219.7, stdev 5.0
>        moz2-linux-slave30: n  10, min 208.5, median 214.9, mean 214.6, max 218.5, stdev 3.3
>  mv-moz2-linux-ix-slave19: n   9, min 205.4, median 214.2, mean 212.9, max 217.6, stdev 4.1
>* mv-moz2-linux-ix-slave14: n   8, min 204.6, median 207.9, mean 209.8, max 218.3, stdev 5.1
>* mv-moz2-linux-ix-slave18: n   8, min 203.4, median 205.8, mean 207.4, max 217.2, stdev 4.5
>*         linux-ix-slave12: n   8, min 207.5, median 208.9, mean 211.4, max 218.7, stdev 4.3
>*       moz2-linux-slave08: n   8, min 201.7, median 208.7, mean 209.2, max 215.3, stdev 4.3
>          linux-ix-slave41: n   7, min 203.9, median 213.5, mean 211.9, max 219.8, stdev 5.6
>        moz2-linux-slave38: n   7, min 204.2, median 211.0, mean 212.4, max 219.6, stdev 5.2
>          linux-ix-slave30: n   7, min 205.0, median 211.5, mean 213.5, max 221.3, stdev 5.2
>        moz2-linux-slave25: n   7, min 208.4, median 216.5, mean 214.7, max 218.0, stdev 3.8
>  mv-moz2-linux-ix-slave17: n   7, min 203.5, median 213.6, mean 213.2, max 219.1, stdev 5.3
>          linux-ix-slave20: n   6, min 206.3, median 216.2, mean 214.4, max 218.7, stdev 4.2
>        moz2-linux-slave34: n   5, min 206.5, median 214.7, mean 212.5, max 218.6, stdev 4.8
>  mv-moz2-linux-ix-slave10: n   5, min 206.7, median 212.2, mean 211.5, max 214.4, stdev 2.6
>          linux-ix-slave14: n   5, min 200.3, median 214.4, mean 211.3, max 217.2, stdev 6.4
>        moz2-linux-slave05: n   5, min 205.9, median 210.2, mean 210.2, max 217.0, stdev 3.8
>        moz2-linux-slave29: n   5, min 208.2, median 213.7, mean 212.3, max 216.8, stdev 3.4
If the results from comment 38 showed differences in min/max along with differences in median/mean, I'd be much more likely to believe that we're seeing a difference in the builder.  But as it is, the data at best suggests that something about the machine makes it *less likely* to produce a fast build, but leaves it perfectly capable of producing a fast build.  That seems weird.
I'd like to see some analysis of significance here - those means aren't that far out of the ordinary.
(In reply to comment #39)
> If the results from comment 38 showed differences in min/max along with
> differences in median/mean, I'd be much more likely to believe that we're
> seeing a difference in the builder.  But as it is, the data at best suggests
> that something about the machine makes it *less likely* to produce a fast
> build, but leaves it perfectly capable of producing a fast build.  That
> seems weird.

Drives in the existing ix machines are not well isolated - part of the reason they are being replaced. Could be these few machines are located in the racks such that this problem is more common.

Any number of things can cause the VM builds to be faster/slower, depending on what else is going on with the VM host. Since it's not more widespread, we should check to make sure that these two VMs have the same settings and reservations as the others. Dustin: how hard to check that?
We're not looking at speed of the builds though - we're looking at speed of the tests on builds created on these machines.

It's not hard to look at VM settings - it's just clicking around and taking screenshots (y'know, the windows equivalent of "diff").  Which vm's did you have in mind?
(In reply to comment #41)
> Any number of things can cause the vm builds to be faster/slower, depending on
> what else is going on with the VH host. Since it's not more widespread, we
> should check to make sure that these two VMs have the same settings and
> reservations as the others. Dustin: how hard to check that?
Whoa, wait.  We have VMs that do PGO builds (which require the builder to run a test and generate a profile)?  That's going to add a lot of non-determinism to the profile, isn't it?
(In reply to comment #43)
> Whoa, wait.  We have VMs that do PGO builds (which require the builder to
> run a test and generate a profile)?  That's going to add a lot of
> non-determinism to the profile, isn't it?

Not unless the physical machine is oversubscribed, right?

Anyway, the results from comment 38 don't suggest that variance of VM build perf is higher than physical machine builds' variance.
(In reply to comment #40)
> I'd like to see some analysis of significance here - those means aren't that
> far out of the ordinary.

This is tricky, for the following reason: Suppose there was no difference at all between any of these machines.  We'd still expect to see differences in the statistics, due to noise.

So suppose we choose a few machines which look different than the others.  We now want to run a statistical test to tell us whether those machines are actually different than the others.  Well...we already picked out those machines as looking different, so the test is more likely to say that the machines are different.

Compensating for this effect without over-compensating is hard, especially since I'm not prepared to characterize the distribution from which the build speeds are drawn (it doesn't appear normal).

Anyway, I'll run Welch's t-test -- maybe it can give us some confidence that the difference is just noise.
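(To make the selection-bias caveat concrete, one way to improve on a plain Welch test is to simulate the null hypothesis including the "pick the slowest-looking machines" step. A rough sketch, assuming the data is available as a dict mapping builder name to its list of scores:)

# Hedged sketch: Welch's t-test plus a selection-aware permutation test.
# per_machine maps builder name -> list of Dromaeo scores (assumed input).
import random
from statistics import mean, median
from scipy.stats import ttest_ind

def split_by_selection(per_machine, n_slow):
    # Pick the n_slow machines with the lowest median scores, mirroring the
    # by-eye selection of "suspicious" builders.
    ranked = sorted(per_machine, key=lambda m: median(per_machine[m]))
    slow = [s for m in ranked[:n_slow] for s in per_machine[m]]
    fast = [s for m in ranked[n_slow:] for s in per_machine[m]]
    return slow, fast

def selection_aware_p(per_machine, n_slow=5, trials=10000):
    slow, fast = split_by_selection(per_machine, n_slow)
    observed_gap = mean(fast) - mean(slow)
    print("Welch:", ttest_ind(slow, fast, equal_var=False))

    # Null hypothesis: machine identity is irrelevant. Shuffle all scores
    # across machines, redo the "pick the slowest machines" step, and see how
    # often a gap at least this large shows up by chance.
    sizes = [len(v) for v in per_machine.values()]
    pool = [s for v in per_machine.values() for s in v]
    hits = 0
    for _ in range(trials):
        random.shuffle(pool)
        fake, i = {}, 0
        for k, size in enumerate(sizes):
            fake[k] = pool[i:i + size]
            i += size
        s, f = split_by_selection(fake, n_slow)
        if mean(f) - mean(s) >= observed_gap:
            hits += 1
    return (hits + 1) / (trials + 1)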
(In reply to comment #42)
> We're not looking at speed of the builds though - we're looking at speed of
> the tests on builds created on these machines.

Sure, but the differences are being traced back to a handful of machines that are, in theory, identical to the rest of the pool.
 
> It's not hard to look at VM settings - it's just clicking around and taking
> screenshots (y'know, the windows equivalent of "diff").  Which vm's did you
> have in mind?

The two VMs jlebar has noted as anomalous, namely moz2-linux-slave07 and moz2-linux-slave08.

I'm happy to fire up the VI Client and check the settings if you'd prefer.
Let's back up a bit here:

PGO is an *inherently* nondeterministic process.  On dedicated hardware or on a VM, the profiling numbers will be different on every run.  For simpler software, this would not be the case, but because Firefox has lots of timers and whatnot, it connects CPU time to wall time, and things are nondeterministic.

If that kind of nondeterminism can significantly affect the runtime of the resulting builds, then I think we need to fix that, rather than try to find build hardware with absolutely predictable performance characteristics.  Maybe that means changing what gets run to build the profile?  Maybe that means accumulating data over multiple runs into a single profile?

We still haven't seen a *significant* regression of performance against machine ID.  My science background makes me blind to claims of "dunno if it's significant, but the means are different".  At this point I still consider the null hypothesis (that test performance is unrelated to build machine) very viable.

It sounds like the best path forward is
 * statistical analysis of the build-machine-causes-differences-in-means
and once that's done
 * get the profiles for a "slow" and "fast" build
 * make *new* builds using those profiles on a single build machine
 * compare performance of the new builds
assuming that there are still differences,
 * hand analysis of the profiles.  Likely we'll find that in a "slow" build, some large chunk of code - js, maybe! - never executed, or executed once and some other chunk executed twice.
>  * get the profiles for a "slow" and "fast" build

Even if the per-machine variance turns out to be real, we'll probably still need to do this, since we won't have explained all of the variance increase.

Can we start on this now?  Presumably we're going to need to kick many PGO builds, let them run to completion, save the profiles, run Talos on the builds, and pick a particularly fast and slow profile.

I can do this if I can get access to a Linux builder.
Here's the statistical test I said I'd run:

slow_builds are the Linux-32 results from the builders with stars in comment 38.  fast_builds are the results from the remaining builders.  Non-PGO results (less than 190) are not counted.

Welch's two-sample t-test gives p=0.005 that slow_builds and fast_builds are drawn from distributions with the same mean.  But like I said in comment 45, I think this is kind of bogus.  Suggestions are welcome as to how to improve the test, but I'm not sure it's going to get us anywhere.

> t.test(slow_builds, fast_builds)

	Welch Two Sample t-test

data:  slow_builds and fast_builds 
t = -2.9201, df = 47.322, p-value = 0.005344
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -6.623432 -1.220548 
sample estimates:
mean of x mean of y 
 208.2267  212.1487
(In reply to comment 49)
> Suggestions are welcome as to how to improve the test, but I'm not sure it's 
> going to get us anywhere.

What I mean is: We probably can improve the test, but it's either going to tell us that the difference is or isn't statistically significant.

If the difference isn't significant, well, maybe we're not running the right test.  Our data isn't normal, etc.  If the difference is significant, well, we biased our sample (comment 38).  And anyway, it doesn't explain all the variance we've seen -- even the worst machine is capable of making fast builds, and even the best machine is capable of making slow builds.

I think the result of the test is: If someone wants to poke around with those machines, that might not be a bad idea.  I'm not sure what else we might learn by throwing more statistics at it.
Depends on: 659403
(In reply to comment #38)
> These are the results of aggregating Linux-32 test results by builder.  I've
> omitted builders with fewer than 5 builds.

I have doubts about the validity of this analysis, because aggregating over many code revisions assumes that any code-based performance fluctuations are distributed evenly across all slaves. In fact some slaves might have done 'more than their fair share' of builds of slower/faster code. We try to allocate jobs to slaves that most recently did that compile, but there are multiple buildbot masters and they might not poll in the right sequence to achieve that, and slaves can be reallocated between masters.
That's a fair point, although there doesn't appear to have been much movement in this test after PGO was enabled.  The graphs I generated earlier are pretty messed up (they include non-mc builds), but the shiny new graphserver paints the picture well:

http://graphs-new.mozilla.org/graph.html#tests=[[73,1,14]]&sel=1303799230231.2942,1306272641996&displayrange=365&datatype=running
Glandium filed bug 659942 because he discovered that we leave the profile files around between builds and gcc will aggregate profile data on each run.

I think this is the best lead we have on this bug, although the statistics (in bug 659942) suggest that it may not be The Fix.
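(The clobber itself is trivial; something along the lines of the sketch below, with the objdir path as a placeholder, run before the instrumented build.)

# Hedged sketch: remove leftover gcc profile data so a PGO build starts from a
# clean profile instead of accumulating counts from previous builds.
import pathlib

OBJDIR = pathlib.Path("obj-firefox")   # hypothetical objdir

removed = 0
for gcda in OBJDIR.rglob("*.gcda"):
    gcda.unlink()
    removed += 1
print(f"removed {removed} stale .gcda files from {OBJDIR}")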
For my reference, the machines I have on loan from releng are

  talos-r3-fed-002
  moz2-linux-slave17

and the commands I've been using in lieu of xvfb-run are:

  $ Xvfb :42 -screen 0 800x600x16 &
  $ DISPLAY=:42 make -f client.mk
I just finished collecting the Talos results from my first two builds.

The builds were in separate objdirs, so the profile data was not shared.

Build 1: n=5,  mean 197.5, range [196.23, 199.91]
Build 2: n=13, mean 202.3, range [200.17, 203.35]

(After I finished running build 2, I went back and ran build 1 just to make sure that nothing changed, and nothing had changed.)

Something is different about these two builds, to the tune of 4.8 points (2.3%).  Build 1 seems to be particularly slow, as I don't see any results below 200 at [1].

The point of this exercise was to get access to the profile files, which we now have.  I'll post them to the bug in a moment.

[1] http://graphs-new.mozilla.org/graph.html#tests=[[73,1,13],[73,1,14]]&sel=1304594032929.0112,1306436585738&displayrange=365&datatype=running
Attached file PGO .gcda files - Build 1 (obsolete) —
Attached file PGO .gcda files - Build 2 (obsolete) —
Glandium and I just figured out that these .gcda files are useless to humans without the corresponding .gcno files.  To get those, I need to recompile...
Attachment #535441 - Attachment is obsolete: true
Attachment #535442 - Attachment is obsolete: true
I now have three builds with both .gcda and .gcno files.

One microbenchmark in Dromaeo DOM, element.property, has particularly different performance on my three builds.  The benchmark does:

  var a = document.getElementsByTagName("a")[0];
  time this operation in a loop:
    a.setAttribute("id", "foo");

When I run this test alone, I get the following results:

  build-4 223.73 227.14 222.38 224.90 226.44 208.45
  build-5 262.59 266.52 265.42 267.77 258.94 267.88
  build-6 270.02 269.93 273.00 276.76 274.27 276.10

Clearly there's a big difference between build-4 and build-6 on this test.

I can think of two ways to proceed from here.  First, we could spin a build in which we load the PGO manifest many times rather than once.  It's possible that whatever caused build-4 to become unlucky would go away if we have more profiling data.

Second, we could try and investigate exactly why build-6 is so much faster than build-4 by examining the output of the profiling runs (i.e., the input to the second pass of a PGO build).  We could figure out where to look by loading our microbenchmark in an instrumented build and examining the profiling output.

I think I'm going to try both of these approaches.  The first one has the potential for solving this problem, while the second one has the potential for teaching us something about designing a better profiling manifest.
Ack, I attributed those numbers to the wrong test.  They're actually for the microbenchmark element.property, which does:

  var ret;
  var elem = document.getElementById('some-element');
  time this operation in a loop:
    ret = elem.id;
I profiled this test and found that the hot functions are, perhaps unsurprisingly:

  9.3% InlineGetProps (in js/src/StubCalls.cpp)
  6.8% js_NewExternalString
  6.4% js_NativeGet
  6.1% nsIDOMHtmlElement_GetID

I disassembled InlineGetProps and found considerable differences between build-4 and build-6.  (The assembly is too complex for me to infer what the differences indicate about the compiler's assumptions.)  But what's curious is that the recorded branch probabilities are nearly identical for the two builds.

The function is called 31443 times in build-4 and 32328 times in build-6, but these extra 1000 calls don't appear to affect the overall branch probabilities.

Why does GCC generate different code for this function given profiles which appear nearly identical?  I can think of a few possibilities:

* GCC is sensitive to tiny changes (much less than 1%) in branch probabilities.  I hope and doubt this is the case, because it means that the generated code is effectively random.

* GCC nondeterministically generates different code given the same profile.  Again, I really hope this isn't the case, and I don't expect that it is.

* gcov isn't showing me all the relevant data; there's something other than branch probabilities influencing the generated code.  I suppose this is possible, although I can't think of what kind of difference this might be such that it would also leave branch probabilities alone.

The third option strikes me as the most likely, so I'm going to investigate that more.
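(For the record, the kind of comparison described above amounts to roughly the following. The build directory names are placeholders and the exact gcov invocation depends on the objdir layout, so treat this as a sketch rather than a recipe.)

# Hedged sketch: diff the branch-probability summaries that gcov emits for one
# source file under two different profile sets. gcov's -b/-c flags print
# branch probabilities and taken counts.
import difflib
import pathlib
import subprocess

def branch_summary(objdir, source):
    # Run gcov next to the .gcno/.gcda pair; it writes <source>.gcov into cwd.
    subprocess.run(["gcov", "-b", "-c", source], cwd=objdir, check=True)
    gcov_file = pathlib.Path(objdir, pathlib.Path(source).name + ".gcov")
    return [line for line in gcov_file.read_text().splitlines()
            if line.lstrip().startswith("branch")]

slow = branch_summary("build-4/js/src", "StubCalls.cpp")   # hypothetical paths
fast = branch_summary("build-6/js/src", "StubCalls.cpp")
for line in difflib.unified_diff(slow, fast, "build-4", "build-6", lineterm=""):
    print(line)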
> I disassembled InlineGetProps and found considerable differences between 
> build-4 and build-6.

I should say that most of the functions in this file were compiled exactly the same; InlineGetProps is unusual in that it looks different in the two builds.
I took the object files containing the four hot functions from comment 61 (in order: StubCalls, jsapi, jsobj, and dom_quickstubs) and substituted these into my slow build's objdir.

I wanted to know: Will the build run faster if I use all four of the fast object files?  Do some of the object files contribute more to the increase in speed than others?

I ran six tests.  I benchmarked the slow build, with no modified object files, the slow build with just one of the four fast object files inserted, and the slow build with all four fast object files inserted.

I found that yes, with all four fast object files substituted in, the slow build approaches the speed of the fast build.  Also yes, some object files contributed more to the faster build than others.  The results were:

                       units: run/s (bigger is better)
vanilla slow build:    203 229 229 224 232 230 (mean 229)
fast StubCalls.o only: 221 225 223 228 231 230 (mean 226)
fast jsapi.o only:     239 248 244 250 239 251 (mean 245)
fast jsobj.o only:     235 238 239 239 234 241 (mean 238)
fast dom_quickstubs.o: 234 248 238 246 242 251 (mean 243)
four fast obj files:   266 265 271 269 268 269 (mean 268)

(I left out the 203 for the vanilla build when calculating the mean, since it's an outlier.)

So the conclusion is that the fast versions of jsapi.o, jsobj.o, and dom_quickstubs.o all contribute to the faster build, while StubCalls doesn't.  Perhaps the differences in that object file (comment 61) are only cosmetic.  That would be a promising result -- maybe when we look at the other object files, we'll see meaningful differences in their profile data.

So now that I know which object files make a difference, I'm going to investigate one of them more closely.
I may also try to reproduce these results on my local Linux box, now that I have one.  Turnaround should be a lot faster.
Well, this is unfortunate.

JS_PUBLIC_API(JSString *)
JS_NewExternalString(JSContext *cx, const jschar *chars, size_t length, intN type)
{
    CHECK_REQUEST(cx);
    return JSExternalString::new_(cx, chars, length, type, NULL);
}

JSExternalString::new_ is inlined in here (it's defined in jsstrinlines.h).

But as far as I can tell, I can't expand this inlined function (say to asm) when I look at this function using gcov.  So the gcov output is useless.

I'd guess this is what was happening in comment 61, too.  There a few inlined functions in InlineGetProps; perhaps the difference in branches taken is inside one of those.
...actually, it looks like gcov *does* trace into these inlined functions.  That's good news.
It might be interesting to try to do the MOZ_PROFILE_GENERATE phase at -O0.
From comment 61, one possibility is:

> * GCC nondeterministically generates different code given the same profile.
> Again, I really hope this isn't the case, and I don't expect that it is.

We can rule this out by recompiling a given object file a few times using the same profiling data and observing that the object file is the same each time.  I just did this, and indeed the object file didn't change.
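(The determinism check amounts to something like the sketch below: rebuild the same object a few times against the same .gcda files and compare hashes. The make target and paths are assumptions about the build setup.)

# Hedged sketch: verify gcc is deterministic for a fixed profile by rebuilding
# one object file several times and hashing the results.
import hashlib
import pathlib
import subprocess

OBJ = pathlib.Path("obj-firefox/js/src/jsapi.o")            # hypothetical target
REBUILD = ["make", "-C", "obj-firefox/js/src", "jsapi.o"]   # hypothetical command

digests = set()
for i in range(3):
    OBJ.unlink(missing_ok=True)                 # force a recompile
    subprocess.run(REBUILD, check=True)
    digest = hashlib.sha1(OBJ.read_bytes()).hexdigest()
    digests.add(digest)
    print(f"rebuild {i + 1}: {digest}")

print("deterministic" if len(digests) == 1 else "object file changed between rebuilds")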
Even though JS_NewExternalString is the only hot function in jsapi.o, it appears that build 4 and build 6's copies of JS_NewExternalString are identical.

So either:

 * There's another hot function in jsapi.o, perhaps one which doesn't show up in my (Mac based) profiler, or

 * Substituting in build-6's jsapi.o doesn't actually improve the performance of build-4.

I wonder where I went wrong...
Depends on: 661890
When I tried to make PGO builds locally with Ubuntu's GCC 4.5.2 on x64, I hit ICEs.  Glandium suggested I try Debian, and I was able to get successful builds in a VM.

But this makes me wonder whether what we're seeing is due to bugs in GCC which are particular to the version on the build machines.  (My machine's GCC 4.5.2 crashes, but maybe the build machines' GCC 4.5.2 uses uninitialized memory.)

To test this, I spun four PGO builds within my Debian VM and ran Dromaeo.  It looks like all four builds have approximately the same performance, but the variance within a build is very large, so I don't want to read anything into this yet.  I've requested a Linux-64 talos machine for further testing.
Interestingly, it appears that the variance on Linux-32 was reduced significantly around May 31.  The fastest builds are still around 218, but the lowest results are around 206, instead of 200.

From the graph, it appears that the change happened somewhere between d2586300a6a7 and ebf7f6d8e177.

http://graphs-new.mozilla.org/graph.html#tests=[[73,1,14]]&sel=1303236253416.5217,1307563766460&displayrange=90&datatype=running
Also, TM tends to score lower than MC on Dromaeo DOM since we turned on PGO.  That's awfully strange.

http://graphs-new.mozilla.org/graph.html#tests=[[73,1,14],[73,4,14]]&sel=1303882152578.901,1307568053569&displayrange=90&datatype=running
Possibly related to disabling PGO for JS recently ?
> Possibly related to disabling PGO for JS recently ?

Could be!  Do you have a bug number on that?
Happened in bug 605033, rev 7d2a3d61a377 in m-c.
Ah, that patch only applies to Windows.

> +ifeq ($(OS_ARCH),WINNT)
> +NO_PROFILE_GUIDED_OPTIMIZE = 1
> +endif
(In reply to comment #72)
> Also, TM tends to score lower than MC on Dromaeo DOM since we turned on PGO.
> That's awfully strange.

Here's a better graph of this:

http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,4,14]]&sel=none&displayrange=90&datatype=running
Try behaves similarly to TM [1].

Cedar is an interesting case [2].  It appears to behave similarly to MC, not TM/Try.  But if you decrease your browser's width enough, you'll see that, until the last grid block (starting around June 1, when we observe the increase in minimum score across all trees), there's bimodality in the cedar scores.  They're either high or low, but few are in the middle.

It kind of looks like the MC scores may also be weakly bimodal, but it's harder to see because there are so many of them.  If we imagine that most of the variance is due to one branch being flipped one way or another, it's not so surprising to see bimodality.  I can analyze it more rigorously if we think that would reveal anything, but I'm not sure what there is to learn.

I have lots of new questions, but few new answers.  To recap the two most important new questions:

* Why does TM/Try perform worse than MC/Cedar?  (In particular, all four branches have the same min score, but MC/Cedar have a higher mean score.)

As far as I can tell, the only difference between TM and MC is the priority they're given by the build infrastructure [now 3, but until recently was 4].  They have the same mozconfigs [5, 6] modulo orderings.  But Cedar (in [4]) has the same priority as TM and a higher priority than Try.  So if it's priority that affects things, then why doesn't Cedar look like TM/Try?

Perhaps the pushes to Cedar usually take place at a different time than most pushes to TM (bz and volkmar both get up before MV).

* Why did the minimum score increase around June 1?  If we can figure this out, it might give us insight into the rest of the variance.


[1] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,23,14]]&sel=1300307570760.6572,1307582050410&displayrange=90&datatype=running

[2] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,26,14]]&sel=1303944335005.3289,1307581574830&displayrange=90&datatype=running

[3] http://hg.mozilla.org/build/buildbot-configs/file/7c00ac95c87b/mozilla/master_common.py#l20

[4] http://hg.mozilla.org/build/buildbot-configs/file/af01d1e2040a/mozilla/master_common.py#l20

[5] http://hg.mozilla.org/build/buildbot-configs/file/0ba5a6c4330a/mozilla2/linux/tracemonkey/nightly/mozconfig

[6] http://hg.mozilla.org/build/buildbot-configs/file/0ba5a6c4330a/mozilla2/linux/mozilla-central/nightly/mozconfig
Well, here's something.

I pushed a Linux-32 build to Try which runs the profiling step 10 times.  (On Linux, the data is cumulative.)  It scored 220 on Dromaeo DOM, which is 8 points faster than the next closest run from Try in the past 90 days (minus one outlier at 224), and as fast as the fastest builds off MC.

This lends credence to the idea that MC builds are benefiting from cumulative profiling data left around between depends builds (bug 659942).

I don't have a good explanation for why, in bug 659942 comment 1, I didn't find a statistical difference between a builder's first PGO build and its later builds.  Presumably later builds would have some chance of being depends builds, so there'd be an increased chance of seeing an increase in speed in later builds.  Perhaps I messed up the analysis, or perhaps the effect is too small to see through the noise.
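(The try push boils down to looping the profiling step, since on Linux gcc adds counts to existing .gcda files rather than overwriting them. The invocation below is a stand-in for whatever the real profile step is.)

# Hedged sketch: run the instrumented build's profiling step several times so
# the .gcda counters accumulate across runs.
import subprocess

PROFILE_CMD = ["python", "build/pgo/profileserver.py"]   # assumed invocation
RUNS = 10

for i in range(RUNS):
    print(f"profiling pass {i + 1}/{RUNS}")
    subprocess.run(PROFILE_CMD, check=True)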
Second build which ran the profiling step 10 times came back from try with a 218.23 on Dromaeo DOM.  Again, that's higher than all but one of the results from Try in the past 90 days.

I'll file a bug on running the profiling step more times.  I'm not sure how far that will bring down the variance here, but it appears that it will help.
Depends on: 663251
MC is now clobbering profiling data on each build [1].  I expect that this will reduce Linux scores to the low end of the old range.

I also checked in changes to the PGO script so we can run the instrumented build multiple times [2].  We can land a buildbot-config change from bug 663251 and turn that on whenever.  I'd like to wait a few days, so we can observe whether things change as expected and so we can get an estimate of the variance in the new results.

[1] http://hg.mozilla.org/mozilla-central/rev/2d1746337846
[2] http://hg.mozilla.org/mozilla-central/rev/83b4a771ea45
> MC is now clobbering profiling data on each build [1].  I expect that this will 
> reduce Linux scores to the low end of the old range.

The change appears to have had no impact on Talos scores.  Assuming the patch to clobber the data works (it does when I test on my machine), this indicates that the variance isn't due to accumulated profiling data on MC, as I'd previously thought.

We could start running the instrumented builds multiple times anyway.  It might resolve the issue.  But I'd rather understand why MC builds are faster than TM builds first.  I'm afraid if we don't, we'll end up painting over this issue, whatever its cause, and it'll come back to bite us later.

You can see what I'm talking about with TM vs MC at [1].

[1] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,4,14],[73,1,14]]&sel=1306629271386.4368,1308244525714&displayrange=90&datatype=running
I took another look at the scores today.  I think TM pulled this change in the June 20 merge (a285146675dc).  After that merge, the TM scores increased a bit.  (It's clear in the graph if you resize the window so it's pretty narrow.)

We can't attribute this change to clobbering until tm merges back to m-c and we see whether m-c scores increase (indicating that the source of the speedup was from tm) or stay the same (indicating that the source of the speedup was from mc).

I'll keep an eye out over the next few days.

http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,1,14],[73,4,14]]&sel=1307333365814.5754,1309179466534&displayrange=90&datatype=running
Confirmed that bug 659942 landed in TM in the June 20 merge.
Just had a look at mozilla-inbound: I expected its scores to resemble TM, but in fact M-I's scores look like M-C's.  I guess that's good.

Not enough data yet to say whether Monday's TM -> MC merge resulted in an uptick on MC corresponding to the uptick on TM, but it looks like MI didn't move, so I'd guess MC won't move either.

It's possible that the uptick on TM was caused by something unrelated to this bug, but it's also possible that it was caused by clobbering the profiling data.
Catlee landed bug 663251 on Thu, June 30 [1]; we now run the instrumented binary 10 times during Linux PGO builds.  I'd hoped this landing would resolve this bug once and for all, but unfortunately that doesn't appear to have been the case.

There appears to have been a slight uptick on Try, but there isn't much data, and it's noisy anyway, since people don't always update their trees.

Critically, scores on M-C and M-I have the same variance as before (about a 15-point spread).

I think the question now is: Why did this change not make a big difference, while when I pushed to try, I got results significantly higher than any others on try before or after [2]?

[1] http://hg.mozilla.org/build/buildbot-configs/rev/58247fce99b4

[2] http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,23,14]]&sel=none&displayrange=90&datatype=running

The results from my pushes are the cluster of 4 results with score ~ 220 on June 9 and 10.  No push came close to this score until 83a4240b3886 on Jun 29.  In the meantime, scores on all trees have been slowly trending upward.
It looks like this may be partially or wholly fixed, although I have no idea why.  Starting around July 9, scores from MC, TM, and MI all tightened up into a 9-point range.  (The previous range was ~20 points.)  9 points is close to what we had before we turned on PGO.

All three trees' scores look similar, which is fantastic.

Catlee or Dustin, were there changes to the infrastructure over the weekend?

http://people.mozilla.org/~jlebar/graphs/graph.html#tests=[[73,4,14],[73,1,14]]
Depends on: 672563
Depends on: 676252
I think glandium fixed this in bug 676252.  It's kind of hard to see in the graphs, because that bug landed at almost the same time as type inference, so all the benchmarks show a change.

Dromaeo jslib used to be obviously bimodal, but now isn't.  And the variance of dromaeo DOM is now about what it was before we turned on PGO.  The Dromaeo DOM results on m-i now look like the results from m-c and try.

\o/
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering