Closed Bug 663251 Opened 14 years ago Closed 13 years ago

Run the profiling step many times on Linux

Categories

(Firefox Build System :: General, defect)

All
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

Details

Attachments

(2 files, 1 obsolete file)

Right now, our PGO script launches FF, visits some sites, and then closes the browser.

It's become apparent in bug 653961 that repeating this process many times increases Dromaeo DOM scores.

The short version is that we see the following ranges for Linux-32 Dromaeo DOM scores on MC, TM, and try:

MC:  206-222
TM:  204-216
Try: 204-214

I don't know how to account for the fact that the min of TM and Try is 2 points lower than on MC.  But my theory is that the difference in maximum scores is due to the fact that MC does depends builds and so can reuse existing profiling data when the changes are small.  (We keep profiling data around between builds [bug 659941] and it's cumulative, at least on Linux.)  In contrast, Try always clobbers.

TM is tricky; it does depends builds like MC, but my hypothesis is that because it has lower volume of pushes than MC, it's unlikely that two consecutive TM builds will end up on the same builder.  Thus, my guess is that most of its builds effectively clobber and throw away the profiling data.  (GCC only uses profiling data from a .gcda file which corresponds to a .o file with the same number of basic blocks as the .o file which created the .gcda file.)

Surely our releases are clobbers, so it seems likely that our releases' performance looks more like TM/Try's than MC's.

The obvious solution is to run the profiling step multiple times.  I did this on Try and got to builds whose scores (220 and 218) exceed all other builds on Try by a considerable margin.  Although I can't know what will happen when we do this on MC, I suspect that running the profiling step many times will improve scores there.

We'll want to land bug 659942 at the same time or before we land this change, because there's no use in keeping around old profiling data.

I haven't done any tests to see what the effect of doing this is on Windows.
Here's the patch I ran on try:

http://hg.mozilla.org/try/rev/242d2ee36919
I think the simplest way to do this would be to add a --number-of-runs parameter to profileserver.py, and add that arg to the PGO script in the mozconfigs
http://mxr.mozilla.org/mozilla-central/source/build/pgo/profileserver.py

(I don't want to force everyone to run 10 times if they don't want.)
I agree with the mechanical changes suggested in comment 2.

As long as this doesn't regress any of our other perf metrics, lets do it!
Sounds good to me!
(In reply to comment #2)
> I think the simplest way to do this would be to add a --number-of-runs
> parameter to profileserver.py, and add that arg to the PGO script in the
> mozconfigs

The python script currently exits with an error when you pass it an arg it doesn't understand.  So if we were to do this, you wouldn't be able to build old versions of FF with the new mozconfig.

I recall we worked around this once before with an hg hook, although I forget what that issue was.

In any case, for this one, how about we add the argument but make it optional and default it to 10.  Then if somebody is doing their own build, they can set it to 1 or whatever, but we don't have to modify our mozconfigs.
I don't want it to default to 10 because we also use this script for valgrind tests, and we certainly don't want to run those 10 times. Alternately, you can just make it take an integer value on the cmdline and use that, like "profileserver.py 10". That should be harmless to old versions.
Attached patch Patch v1Splinter Review
Attachment #538530 - Attachment description: Buildbot-config changes → Buildbot-config changes (Linux 32/64 only)
It's unclear whether we want to take these Windows changes, but I thought I'd spin the patch while I'm here.
I pushed to try, so we'll see what happens with the Windows builds in a bit.

http://tbpl.mozilla.org/?tree=Try&rev=3cf4b8988482
http://tbpl.mozilla.org/?tree=Try&rev=5abe5e61e75f
It doesn't seem to help on Windows, but it sure does on Linux.
Assignee: nobody → justin.lebar+bug
Comment on attachment 538525 [details] [diff] [review]
Patch v1

Review of attachment 538525 [details] [diff] [review]:
-----------------------------------------------------------------

Pre-emptive review, this looks fine.

::: build/pgo/profileserver.py
@@ +71,5 @@
>  
> +  if not os.getenv('OBJDIR'):
> +      parser.error('Please specify the OBJDIR environment variable.')
> +
> +  if len(args) == 0:

Python nit, you can just say "if not args:" here.

@@ +75,5 @@
> +  if len(args) == 0:
> +      num_runs = 1
> +  else:
> +      try:
> +          num_runs = int(args[0])

Do you want to validate that num_runs > 0? (int("-1") doesn't raise)
Attachment #538525 - Flags: review+
Who can review the buildbot changes?
Summary: Run the profiling step many times on Linux (perhaps also on Windows) → Run the profiling step many times on Linux
Attachment #538530 - Flags: review?(catlee)
Attachment #538535 - Attachment is obsolete: true
Attachment #538530 - Flags: review?(catlee) → review+
The changes to the python code were pushed to build-system [1] and m-c [2]

Catlee, can we push the buildbot changes in a few days?  I want to wait first to see that the Talos scores on m-c do what I expect with just the changes from bug 659942.

[1] http://hg.mozilla.org/projects/build-system/rev/83b4a771ea45
[2] http://hg.mozilla.org/mozilla-central/rev/27f51c5d5f9b
Attachment #538525 - Flags: checkin+
(In reply to comment #14)
> The changes to the python code were pushed to build-system [1] and m-c [2]
> 
> Catlee, can we push the buildbot changes in a few days?  I want to wait
> first to see that the Talos scores on m-c do what I expect with just the
> changes from bug 659942.
> 
> [1] http://hg.mozilla.org/projects/build-system/rev/83b4a771ea45
> [2] http://hg.mozilla.org/mozilla-central/rev/27f51c5d5f9b

Yeah, just let me know when.
I'm not convinced it's worth spending a lot more time to figure out why the builds perform differently on different branches if this bug will fix it.

Catlee, can you please push the build-config changes when it's convenient for you?
Attachment #538530 - Flags: checkin+
Is there a risk of overtraining here? If this is a good idea, why doesn't the compiler just add more weight to its decisions by default?
I think what's happening -- and I definitely don't know for sure -- is that some branch is not being taken in some runs, so therefore GCC incorrectly thinks some path is cold and mis-optimizes.  So my theory for why running the profiling step many times helps is that it's unlikely that this branch will not be taken 10 times in a row.

I just looked at the graphs, and I don't see any change lately.  Catlee, you marked the patch as checkin+ -- does that also mean it's been deployed?
It was deployed in http://hg.mozilla.org/build/buildbot-configs/rev/58247fce99b4.

Maybe it needs a clobber to take effect?
> Maybe it needs a clobber to take effect?

It shouldn't, but even if it did, haven't all the machines clobbered since Thursday?

Hm...
It does not appear to be working properly.

From [1]:

 Starting profiling run 2 of 10
 args: ['/builds/slave/m-cen-lnx/build/obj-firefox/dist/firefox/firefox-bin', '-no-remote', '-profile', '/builds/slave/m-cen-lnx/build/obj-firefox/_profile/pgo/pgoprofile/', 'http://localhost:8888/index.html']
 INFO | automation.py | Application pid: 17868
 INFO | automation.py | Application ran for: 0:01:40.541590
 INFO | automation.py | Reading PID log: /tmp/tmp69ivvUpidlog
 Starting profiling run 3 of 10

Whereas for profiling run 1 of 10, it lists a series of pages loaded.

[1] http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1309878091.1309882684.1560.gz&fulltext=1
When I run it locally, it similarly doesn't output the list of URLs it visited, although it does appear to visit them.
Oh, I think it's reading from cache.
Depends on: 669408
This appears to be working (modulo bug 669408), even if it's not having the expected effect on build speeds.  Closing the bug.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 672563
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: