Closed Bug 710840 Opened 13 years ago Closed 11 years ago

Track peak virtual memory usage of link.exe process during libxul PGO link on graph server

Categories

(Release Engineering :: General, defect, P1)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: armenzg)

References

Details

(Keywords: sheriffing-P1, Whiteboard: [graphserver][pgo] 2012-01-21 --> linker max vsize:3757MB)

Attachments

(8 files, 3 obsolete files)

bug 710712 is going to add the ability to measure the peak virtual memory usage of the linker during the final PGO link phase on Windows. We should track this number on the graph server so we can monitor the situation.
catlee asked if we could make the build go orange if we went over a threshold for this value. I'm totally in favor of this, but we should see where we're currently at before deciding on what the threshold needs to be. I just pushed bug 710712 to inbound, so we should get some numbers soon.
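(For context on how a number like this can be captured: the actual measurement is the subject of bug 710712, but the general idea is a watcher that polls the linker process's virtual size while it runs. A minimal sketch, assuming the third-party psutil module is available; note that a poll-based watcher can miss very short spikes:)

import subprocess
import time

import psutil  # third-party, assumed available

def peak_vsize(cmd, interval=0.5):
    """Run cmd and return the largest virtual size (in bytes) seen while polling."""
    child = subprocess.Popen(cmd)
    watched = psutil.Process(child.pid)
    peak = 0
    while child.poll() is None:
        try:
            peak = max(peak, watched.memory_info().vms)
        except psutil.NoSuchProcess:
            break  # the process exited between poll() and the query
        time.sleep(interval)
    return peak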
OS: Windows 7 → Windows Server 2003
Priority: -- → P3
Whiteboard: [graphserver][pgo]
For reference (WINNT 5.2 x86 tinderbox pgo):
2011-12-16: 2887.55 MB / 3027816448 bytes (from bug 710712 comment 16)
2011-01-04: 2886.54 MB / 3026759680 bytes (inbound rev 5025534b9d88)
And that should of course read 2012-01-04 (first of many instances of doing that this month I'm sure :-))
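(The MB figures throughout this bug are just the raw byte counts divided by 2^20; e.g. in Python:)

>>> 3027816448 / 1048576.0
2887.55078125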
In lieu of having the graph server set up for this yet, I'll continue posting numbers periodically (at least whilst the memory of the ever so fun sheriffing weekend of bug 709193 is fresh in the mind):

2012-01-10: 2886.26 MB / 3026460672 bytes (inbound rev 01d69766026d)
2012-01-31: 2902.54 MB / 3043528704 bytes (inbound rev 5a8ff4828791)

(a 16 MB increase in the last 3 weeks)
2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)

(Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch perhaps?)
(In reply to Ed Morley [:edmorley] from comment #6)
> 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> 
> (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch perhaps?)

Right.
3021553664 bytes (inbound rev 0831ce6ba72f on the retrigger after the first one died with "compiler is out of heap space in pass 2")
I'll start doing some bisecting to see if we can work out where this fairly hefty increase has come from...
Oh and we've just had another on inbound:
ac1504ff8740
https://tbpl.mozilla.org/php/getParsedLog.php?id=11354573&tree=Mozilla-Inbound

This is looking bad :-(
Severity: normal → critical
Priority: P3 → P1
Blocks: 750661
Hopefully bug 750717 should take away some of the pain of not having this at least for now.
Severity: critical → major
So here is my armchair quarterback summary of what I think needs to happen here:
1. Write buildbot step to send this data from the buildslave to the graphserver system.
2. Modify the graphserver to accept this value from the builders (jmaher and/or rhelmer - can you file the database modification bug needed for this?)
3. Ensure the networking flows are in place between the builders and the graphserver systems (I can file this but I need the vlan numbers for all the releng builders - I'm sort of assuming they are separate from the test slaves - if they are on the same vlan as the slaves then maybe we don't need this).
4. Update the script for the dev.treemanagement auto-emailer to send this new data. Catlee, can you take this? You're the only person I know of that knows where the code for that mailer is and how to re-deploy a new version of it.

Please if I have any details wrong, do add a comment and correct me.
(In reply to Clint Talbert ( :ctalbert ) from comment #13)
> So here is my armchair quarterback summary of what I think needs to happen
> here:
> 1. Write buildbot step to send this data from the buildslave to the
> graphserver system.

we have similar code in place to submit leak info / codesighs already

> 2. Modify the graphserver to accept this value from the builders (jmaher
> and/or rhelmer - can you file the database modification bug needed for this?)

> 3. Ensure the networking flows are in place between the builders and the
> graphserver systems (I can file this but I need the vlan numbers for all the
> releng builders - I'm sort of assuming they are separate from the test
> slaves - if they are on the same vlan as the slaves then maybe we don't need
> this).

no need - we already submit info to graph server from the build machines

> 4. Update the script for the dev.treemanagement auto-emailer to send this
> new data. Catlee, can you take this? You're the only person I know of that
> knows where the code for that mailer is and how to re-deploy a new version
> of it.

it will automatically get picked up
it looks like we just need to solve 2 things:
1. Write buildbot step to send this data from the buildslave to the graphserver system.
2. Modify the graphserver to accept this value from the builders (jmaher and/or rhelmer - can you file the database modification bug needed for this?)

I can work on the graph server database mods.  What is the name we want to use for this test?  'libxul_link'?
(In reply to Joel Maher (:jmaher) from comment #15)
> I can work on the graph server database mods.  What is the name we want to
> use for this test?  'libxul_link'?

Thanks joel, that works for me.  Do you also have to add all the machine names for the builders or are the machine names in the graphserver db populated at run-time?
For whatever reason, build metrics like this use a generic platform name as the machine name, e.g. http://hg.mozilla.org/graphs/file/2018284ed6e7/sql/data.sql#l1171

so you can use those names (with "_leak_test" == debug), and just add the new test to the database.
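(Concretely, "adding the new test" is a single row in the graph server's tests table; the exact statement, quoted again below once it was run, was:)

insert into tests values (NULL,"libxul_link","LibXUL Memory during link",0,1,NULL);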
Attachment #622059 - Flags: review?(rhelmer) → review+
Depends on: 753767
(In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> 
> (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)

2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
So, we're at 3.4 gigs.  Any chance we can get those graphs before The Next Big Surprise?  :-)
Could this be fixed by tbpl turning orange after a certain threshold?
Could a tool be written to grab the information and create a graph?
Well TBPL already turns red after a certain threshold.  The point is to see what is causing the increases, not when we pass some arbitrary threshold.
The easiest way, I think, is to submit this data to the graph server and graph it there.  The point here is to see the problem approaching before it actually hits.
Oh!
It seems we only do TinderboxPrint!

Anyone know what is required to post the data in the graphs server?

I can see that this was added:
insert into tests values (NULL,"libxul_link","LibXUL Memory during link",0,1,NULL);
and IT ran it on the DB.
Armen, look for usage of GraphServerPost in buildbotcustom/process/factory.py.
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: catlee
(In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > 
> > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> 
> 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)

2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)

500MB more in 3 months!

We need to fix this sooner rather than later (and ideally import all the old nightly figures from the logs, so we can more easily see what has bumped it up so much).

At current rate of increase we have only a couple of months before we hit this again.
Keywords: sheriffing-P1
Sorry, s/2012-01-07/2013-01-07/
(In reply to comment #27)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> > (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > > 
> > > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> > 
> > 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
> 
> 2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)
> 
> 500MB more in 3 months!
> 
> We need to fix this sooner rather than later (and ideally import all the old
> nightly figures from the logs, so we can more easily see what has bumped it up
> so much).
> 
> At current rate of increase we have only a couple of months before we hit this
> again.

So what parts of the WebRTC code ended up going into libxul?
Flags: needinfo?(rjesup)
media/webrtc went in (/signaling had to be in, and it references many things in /trunk).  We could in the future (especially as the code gets more locked down) probably move trunk to gkmedia, and deal with adding a lot of symbols to symbols.def.in (or finding a better way to deal with that!)  The decision IIRC was to allow those two to come in, but leave the rest in gkmedia.

Signaling landed in m-c at the FF 18 uplift, around Oct 6th.  Off the top of my head, I can't remember if media/webrtc/trunk was in gkmedia before that (I think it was), so both moved to libxul around then.
Flags: needinfo?(rjesup)
OK, I filed bug 827985 to move that code out of libxul.  Thanks for the clarification.

Ed, this needs to be treated with utmost priority.  Who should work on the graphing thing?
FYI, per above: on 9/28 we were at 3.2G, on 11/2 we were at 3.4G (after signaling landed), and now we're at 3.7G.  So the 500MB since 11/2 is NOT webrtc.  And I think (thinking back) that webrtc/trunk was in xul before 10/6; if so we took a (guess) 50-100MB hit for signaling, and likely webrtc has contributed little since then.

Can you run a PGO --disable-webrtc build and report the number?  I can't build on Windows currently - thanks Microsoft!!  --disable-webrtc will be a significant over-estimate of what you'll get back (as signaling won't get compiled).
(In reply to comment #32)
> FYI, per above: on 9/28 we were at 3.2G, on 11/2 we were at 3.4G (after signaling
> landed), now we're at 3.7G.  So the 500MB since 11/2 is NOT webrtc.  And I
> think (thinking back) that webrtc/trunk was in xul before 10/6; if so we took a
> (guess) 50-100MB hit for signaling, and likely webrtc has contributed little
> since then.

It doesn't matter how much we're going to win from this; we need to move all of the code that we can outside of libxul, and the WebRTC stuff is just part of it.

> Can you run a PGO --disable-webrtc build and report the number?  I can't build
> on Windows currently - thanks Microsoft!!  --disable-webrtc will be a
> significant over-estimate of what you'll get back (as signaling won't get
> compiled).

I only have VS2012, so the numbers that I get will not be representative (I actually don't know how to run the linker vmem usage measurement script locally.)  That being said, you can push to try to get the numbers, but like I said it doesn't matter much, we need *all* of the wins that we can get.
Let me try to help with this.
Assignee: nobody → armenzg
I know that the following file contains the value that we need to post:
obj-firefox\toolkit\library\linker-vsize

We could add code in here:
http://hg.mozilla.org/mozilla-central/file/0faa1d47ea80/build/link.py#l19
and post to the graph server.

Or we can add a new post compilation step on buildbot to post it.

I will look at the releng side, since we already have some GraphServer logic.
Right, the file contains the info, and it's also output to stdout in the build step in a line starting with "TinderboxPrint: linker max vsize:". You should use whichever is easier.
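(A rough sketch of the reading side, using the file path from the previous comment; the byte-count assumption, MB rounding, and output format here are illustrative rather than the exact buildbot code:)

import os

def read_linker_vsize(objdir):
    # The build writes the linker's peak vsize to this file (see above);
    # assumed here to hold a raw byte count.
    path = os.path.join(objdir, 'toolkit', 'library', 'linker-vsize')
    with open(path) as f:
        return int(f.read().strip())

vsize = read_linker_vsize('obj-firefox')
print('TinderboxPrint: linker max vsize: %dMB' % (vsize / 1048576))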
Attachment #702314 - Flags: review?(coop)
Attachment #702314 - Flags: review?(coop) → review+
Attachment #702314 - Flags: checked-in+
Attached patch post vsize (obsolete) — Splinter Review
Attachment #703578 - Flags: review?(bhearsum)
I have my first data point on staging:
http://graphs.allizom.org/graph.html#tests=[[205,6,8]]

At what point is this going to blow up? What is the upper limit?

With regards to producing a graph I would suggest to book a project branch and request releng to add PGO jobs to it. At that point I would suggest creating a list of changesets that we want data points for and trigger PGO builds.

Once bhearsum reviews this and we land it we will start having the data points.

(In reply to Ed Morley (Away 18th-20th Jan) [:edmorley UTC+0] from comment #27)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> > (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> > 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
> 2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)
> 500MB more in 3 months!
2012-01-18: 3756.02 MB / 3938476032 bytes (m-c rev b52c02f77cf5)
Whiteboard: [graphserver][pgo] → [graphserver][pgo] 2012-01-18 --> linker max vsize:3756.02MB
The blowup point is somewhere near 4GB (the total amount of virtual memory available to a 32-bit process running on Windows x64), but we don't know exactly where. Essentially, once the linker tries to allocate more virtual memory and runs out, it will blow up; that last failing allocation could be fairly large, so it's hard to tell.
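(To put numbers on that: against a 4 GiB ceiling, the most recent reading above leaves only about 340 MB of headroom:)

(4 * 1024**3 - 3938476032) / 1048576.0  # ~= 339.98 MB left before link.exe falls over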
(In reply to comment #39)
> With regards to producing a graph I would suggest to book a project branch and
> request releng to add PGO jobs to it. At that point I would suggest creating a
> list of changesets that we want data points for and trigger PGO builds.

Who would own updating the branch though?  Why can't we just get the graphs for inbound and central?
(In reply to :Ehsan Akhgari from comment #41)
> (In reply to comment #39)
> > With regards to producing a graph I would suggest to book a project branch and
> > request releng to add PGO jobs to it. At that point I would suggest creating a
> > list of changesets that we want data points for and trigger PGO builds.
> 
> Who would own updating the branch though?  Why can't we just get the graphs
> for inbound and central?

I thought you guys mentioned that you wanted to get some history to see what things increased the memory usage. To build up history we would need to set up a project branch and trigger old changesets.

As soon as we land the patch we will get coverage on all branches that have pgo enabled from there on.
(In reply to comment #42)
> (In reply to :Ehsan Akhgari from comment #41)
> > (In reply to comment #39)
> > > With regards to producing a graph I would suggest to book a project branch and
> > > request releng to add PGO jobs to it. At that point I would suggest creating a
> > > list of changesets that we want data points for and trigger PGO builds.
> > 
> > Who would own updating the branch though?  Why can't we just get the graphs
> > for inbound and central?
> 
> I thought you guys mentioned that you wanted to get some history to see what
> things increased the memory usage. To build up history we would need to setup a
> project branch and trigger old changesets.

I don't see why.  We do have the data in the old logs, right?  We should just be able to write a script to parse them out or something.  Am I missing something?
(In reply to :Ehsan Akhgari from comment #43)
> (In reply to comment #42)
> > (In reply to :Ehsan Akhgari from comment #41)
> > > (In reply to comment #39)
> > > > With regards to producing a graph I would suggest to book a project branch and
> > > > request releng to add PGO jobs to it. At that point I would suggest creating a
> > > > list of changesets that we want data points for and trigger PGO builds.
> > > 
> > > Who would own updating the branch though?  Why can't we just get the graphs
> > > for inbound and central?
> > 
> > I thought you guys mentioned that you wanted to get some history to see what
> > things increased the memory usage. To build up history we would need to setup a
> > project branch and trigger old changesets.
> 
> I don't see why.  We do have the data in the old logs, right?  We should
> just be able to write a script to parse them out or something.  Am I missing
> something?

Good point! That would save lots of time.
Comment on attachment 703578 [details] [diff] [review]
post vsize

Review of attachment 703578 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +1344,5 @@
>                      data=WithProperties('TinderboxPrint: num_ctors: %(num_ctors:-unknown)s'),
>                      ))
>  
> +    def addPostBuildSteps(self):
> +        if self.profiledBuild and self.platform in ('win32',) and self.baseName:

Please add an explicit flag to MercurialBuildFactory/config.py for this rather than guessing based on 3 different things.

@@ +1356,5 @@
> +                    return {'testresults': []}
> +
> +            self.addStep(SetProperty(
> +                name='get_linker_vsize',
> +                command=['cat', '%s\\toolkit\\library\\linker-vsize' % self.mozillaObjdir],

Why the '\\'? All of the other steps use / without issue.
Attachment #703578 - Flags: review?(bhearsum) → review-
closed trees: bug 832992 :(
(In reply to Armen Zambrano G. [:armenzg] from comment #46)
> closed trees: bug 832992 :(

well, in the end that may be just a disk space issue, though we are slowly getting near the limit (now at 3939495936 bytes).
Whiteboard: [graphserver][pgo] 2012-01-18 --> linker max vsize:3756.02MB → [graphserver][pgo] 2012-01-18 --> linker max vsize:3757MB
Whiteboard: [graphserver][pgo] 2012-01-18 --> linker max vsize:3757MB → [graphserver][pgo] 2012-01-21 --> linker max vsize:3757MB
I'm gonna call this a blocker this time.
Severity: major → blocker
Could this be added to the releng Q1 goals, please? We keep hitting the problem, and while there isn't a clear solution to it, this is the only way we have to track its evolution.
This bug is in progress, adding it to the goals list won't make it happen any faster. Armen had a working implementation, it just needs small tweaks before it can be landed.
Blocks: 832992
Attached patch post vsize (obsolete) — Splinter Review
dump_masters shows that this gets added for every PGO and WINNT nightly build
Attachment #703578 - Attachment is obsolete: true
Attachment #704904 - Flags: review?(bhearsum)
Attachment #704905 - Flags: review?(bhearsum)
For historic values see:
https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/

Nathan, don't suppose you could attach the raw values, so we can backfill the gap on graphs.m.o?
Sure, Ed, no problem.  Here's the file I used; the format is:

<build-date> <hg-revision> <linker-vsize>

The data doesn't perfectly capture the hg revision for every log file, but the number of points that it missed was small enough that I wasn't going to worry about it.
...and for reference, here's the script I used to generate the previous file.  The script expects the names of the log files to start with:

YYYY-MM-DD-HH-MM-SS

for the timestamp portion, but that's probably not hard to change.  Simply invoke:

extract-info <list-of-log-files>
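(For readers without the attachment, a rough reconstruction of such a script; the revision regex in particular is a guess at the log format, not taken from the real script:)

import os
import re
import sys

# The vsize line format is described earlier in this bug; depending on the
# build, the captured number may be MB or bytes.
VSIZE_RE = re.compile(r'TinderboxPrint: linker max vsize:\s*(\d+)')
# Hypothetical: assume the changeset shows up somewhere as an hg ".../rev/" URL.
REV_RE = re.compile(r'/rev/([0-9a-f]{12})')

for path in sys.argv[1:]:
    date = os.path.basename(path)[:10]  # log names start with YYYY-MM-DD-HH-MM-SS
    rev = vsize = None
    with open(path) as f:
        for line in f:
            m = REV_RE.search(line)
            if m and rev is None:
                rev = m.group(1)
            m = VSIZE_RE.search(line)
            if m:
                vsize = m.group(1)
    if vsize:
        print('%s %s %s' % (date, rev or 'unknown', vsize))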
Comment on attachment 704904 [details] [diff] [review]
post vsize

Through IRC.
Attachment #704904 - Flags: review?(bhearsum) → review-
Attachment #704905 - Flags: review?(bhearsum) → review-
Comment on attachment 704904 [details] [diff] [review]
post vsize

Sorry, Armen and I talked on IRC about this awhile ago but I forgot to update the bug:
13:11 < bhearsum> armenzg: i meant that we should have a flag for 'post_linker_size' or something, not 'do_post_build_steps'
13:11 < bhearsum> i want this line gone:
13:11 < bhearsum>  if self.profiledBuild and self.platform in ('win32',) and self.baseName:
13:11 < bhearsum> because it guesses about what should happen
13:11 < bhearsum> that can be replaced with if self.postLinkerSize
13:12 < armenzg> k
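(So the guard ends up shaped roughly like this; postLinkerSize is the flag name from the IRC log above, the step itself follows the snippet in the earlier review, and this is only a sketch of the change, not the landed patch:)

def addPostBuildSteps(self):
    # Explicit config flag rather than inferring from profiledBuild /
    # platform / baseName, per the review comments above.
    if self.postLinkerSize:
        self.addStep(SetProperty(
            name='get_linker_vsize',
            property='linker_vsize',
            # forward slashes work in these steps, per the earlier review note
            command=['cat', '%s/toolkit/library/linker-vsize' % self.mozillaObjdir],
        ))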
Attachment #705069 - Flags: review?(bhearsum)
Attachment #704905 - Attachment is obsolete: true
Attachment #705072 - Flags: review?(bhearsum)
Attachment #705069 - Attachment description: do post build steps config changes → [buildbotcustom] do post vsize
Attachment #704904 - Attachment is obsolete: true
Attachment #705069 - Flags: review?(bhearsum) → review+
Attachment #705072 - Flags: review?(bhearsum) → review+
Attachment #705069 - Flags: checked-in+
Attachment #705072 - Flags: checked-in+
in production
(In reply to comment #60)
> in production

Where can the graphs be found?
Depends on: 833653
Comment on attachment 705072 [details] [diff] [review]
do post build steps config changes

Reverted this for bustage in bug 833653.

default:    http://hg.mozilla.org/build/buildbot-configs/rev/df9a319c5edd
production: http://hg.mozilla.org/build/buildbot-configs/rev/0dcbc3ce69f9
Attachment #705072 - Flags: checked-in+ → checked-in-
I don't know at which point we lost this line from the patch.
Without it, the sourcestamp info goes missing.
Attachment #705425 - Flags: review?(bhearsum)
Attachment #705425 - Flags: review?(bhearsum) → review+
Blocks: 833881
I will land and reconfig in the morning.
(In reply to comment #64)
> I will land and reconfig in the morning.

Thanks Armen!

Can you please also let me know how much historical data we can get out of this on each of central and inbound?  It would be absolutely amazing if we can get per-checkin data for the interesting ranges highlighted in <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
(In reply to :Ehsan Akhgari from comment #65)
> Can you please also let me know how much historical data we can get out of
> this on each of central and inbound?  It would be absolutely amazing if we
> can get per-checkin data for the interesting ranges highlighted in
> <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.

Per-push logs are only kept for 30 days, so try runs will be required.
(In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> (In reply to :Ehsan Akhgari from comment #65)
> > Can you please also let me know how much historical data we can get out of
> > this on each of central and inbound?  It would be absolutely amazing if we
> > can get per-checkin data for the interesting ranges highlighted in
> > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> 
> Per push logs are only kept for 30days, so try runs will be required.

Would a project branch be more interesting for this project?

(In reply to :Ehsan Akhgari from comment #65)
> (In reply to comment #64)
> > I will land and reconfig in the morning.
> 
> Thanks Armen!
> 
> Can you please also let me know how much historical data we can get out of
> this on each of central and inbound?  It would be absolutely amazing if we
> can get per-checkin data for the interesting ranges highlighted in
> <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.

What edmorley says is correct. I think selfserve would be needed for this:
(In reply to comment #66)
> (In reply to :Ehsan Akhgari from comment #65)
> > Can you please also let me know how much historical data we can get out of
> > this on each of central and inbound?  It would be absolutely amazing if we
> > can get per-checkin data for the interesting ranges highlighted in
> > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> 
> Per push logs are only kept for 30days, so try runs will be required.

OK, so I guess we can explore that path when we need to.  Thanks!
(In reply to comment #67)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> > (In reply to :Ehsan Akhgari from comment #65)
> > > Can you please also let me know how much historical data we can get out of
> > > this on each of central and inbound?  It would be absolutely amazing if we
> > > can get per-checkin data for the interesting ranges highlighted in
> > > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> > 
> > Per push logs are only kept for 30days, so try runs will be required.
> 
> Would a project branch be more interesting for this project?

Not sure how that would help?
(In reply to :Ehsan Akhgari from comment #69)
> (In reply to comment #67)
> > (In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> > > (In reply to :Ehsan Akhgari from comment #65)
> > > > Can you please also let me know how much historical data we can get out of
> > > > this on each of central and inbound?  It would be absolutely amazing if we
> > > > can get per-checkin data for the interesting ranges highlighted in
> > > > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> > > 
> > > Per push logs are only kept for 30days, so try runs will be required.
> > 
> > Would a project branch be more interesting for this project?
> 
> Not sure how that would help?

Turnaround is faster.
Other pgo jobs (not related to this data gathering) could be posting numbers to the try graph and would pollute it, though I can't see any PGO that was triggered by a dev today.

On the other hand, customizing a branch to only do Windows PGO without test jobs could add some overhead to setup.

I think either way is fine, assuming that it works for the try server. I will try one build after I reconfig in the morning.
(In reply to comment #70)
> (In reply to :Ehsan Akhgari from comment #69)
> > (In reply to comment #67)
> > > (In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> > > > (In reply to :Ehsan Akhgari from comment #65)
> > > > > Can you please also let me know how much historical data we can get out of
> > > > > this on each of central and inbound?  It would be absolutely amazing if we
> > > > > can get per-checkin data for the interesting ranges highlighted in
> > > > > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> > > > 
> > > > Per push logs are only kept for 30days, so try runs will be required.
> > > 
> > > Would a project branch be more interesting for this project?
> > 
> > Not sure how that would help?
> 
> Turnaround is faster.
> Other pgo jobs (not related to this data gathering) could be posting numbers to
> the try graph and would pollute it, though I can't see any PGO that was
> triggered by a dev today.

Hmm, good point.  But that also takes away the ability to just push new heads and get builds on them, right?  I mean, we would need to push new heads in the right order, right?

> On the other hand, customizing a branch to only do Windows PGO without test
> jobs could add some overhead to setup.

Ouch.

> I think either way is fine, assuming that it works for the try server. I will
> try one build after I reconfig in the morning.

Thanks, that's a good idea regardless.
Attachment #705425 - Flags: checked-in+
Attachment #705072 - Flags: checked-in- → checked-in+
The good news is that this is live.
The original purpose of the bug is fulfilled (as I understand it).

The bad news is that there is no way to trigger PGO builds on try.
Developers change the mozconfig but that does not trigger the PGO/try builders.
Pushing to try as PGO would print the linker size but it won't post to the graph server (as before).

Booking a project branch with PGO would give the ability to push changesets in a chronological order, but care would be needed to prevent coalescing from happening (perhaps this can be configured on our side - I don't know if it is easy).

Is it good enough to have data points on the graphs DB from here on?

IIUC there are means to gather historical data by pushing to try and scraping the linker size.
(In reply to comment #72)
> The good news is that this is live.
> The original purpose of the bug is fulfilled (as I understand it).
> 
> The bad news is that there is no way to trigger PGO builds on try.
> Developers change the mozconfig but that does not trigger the PGO/try builders.
> Pushing to try as PGO would print the linker size but it won't post to the try
> server (as before).
> 
> Booking a project branch with PGO would give the ability to push changesets in
> a chronological order but care would be needed to not prevent coallescing from
> happening (perhaps this can be configured on our side to be prevented - I don't
> know if it is easy).
> 
> Is it good enough to have data points on the graphs DB from here on?

It's good but definitely not enough.

So the first step is to parse through the PGO logs for the past 30 days, and also the nightly logs for as long as we have them stored, and report the values to the graph server associated with the correct date and changeset.  Then, I guess we'll need to fill in the gaps for the individual changesets in the spikes that we've seen in Nathan's analysis.  That would help us experiment with finding the culprit changesets which have added the most to the linker memory usage, and see how we can deal with that.

That all being said, gathering detailed historical data only matters if we decide to keep PGO enabled and try to keep the linker memory usage bounded, which is a call that we have not made yet.  We need more of the dependencies of bug 833881 to be resolved before we can make a meaningful decision on that.  If we do decide to keep PGO enabled, I'll file another bug in the RelEng component to gather more historical data.

Last but not least, thanks everyone for your help here, really appreciated!  :-)
I thought my reconfig this morning would have done the trick, but it seems that when a change is backed out from both branches ("production" and "default"), the typical land-to-default-then-merge-to-production flow misses the change [1]. I've seen this happen a couple of times in the past.

I landed it again (on production) and reconfigured the build masters again:
http://hg.mozilla.org/build/buildbot-configs/rev/92846acd0ba5

I re-triggered a second pgo in here that should be successful:
https://tbpl.mozilla.org/?jobname=WINNT%205.2%20mozilla-central%20pgo-build&rev=680e46fecff0

[1] http://hg.mozilla.org/build/buildbot-configs/graph
(In reply to Armen Zambrano G. [:armenzg] from comment #72)
> Booking a project branch with PGO would give the ability to push changesets
> in a chronological order but care would be needed to not prevent coallescing
> from happening (perhaps this can be configured on our side to be prevented -
> I don't know if it is easy).

For a project branch you could probably just use self-serve to force pgo builds on a revision, so no need to push. That assumes disabling merging is easily done, to speed the process up. Whether the history is easily transferable to the m-c branch in the graph server is another question.
(In reply to comment #75)
> (In reply to Armen Zambrano G. [:armenzg] from comment #72)
> > Booking a project branch with PGO would give the ability to push changesets
> > in a chronological order but care would be needed to not prevent coallescing
> > from happening (perhaps this can be configured on our side to be prevented -
> > I don't know if it is easy).
> 
> For a project branch you could probably just use self-serve force pgo builds on
> a revision, so no need to push. That assumes disabling merging is easily done
> to speed the process up. Whether the history is easily transferable to the m-c
> branch in the graph server is another question.

Hmm, I'm not quite sure what that exactly means...
Depends on: 834596
(In reply to :Ehsan Akhgari from comment #76)
> (In reply to comment #75)
> > (In reply to Armen Zambrano G. [:armenzg] from comment #72)
> > > Booking a project branch with PGO would give the ability to push changesets
> > > in a chronological order but care would be needed to not prevent coallescing
> > > from happening (perhaps this can be configured on our side to be prevented -
> > > I don't know if it is easy).
> > 
> > For a project branch you could probably just use self-serve force pgo builds on
> > a revision, so no need to push. That assumes disabling merging is easily done
> > to speed the process up. Whether the history is easily transferable to the m-c
> > branch in the graph server is another question.
> 
> Hmm, I'm not quite sure what that exactly means...

I think what nthomas is suggesting is to trigger PGO builds on a project branch, which would add data points for that branch on the graph server. We could then ask a DBA to transfer the data points to the mozilla-central records.
One note is that tbpl might not show any jobs, since the changesets are from the past.
We are now getting data points:
http://graphs.mozilla.org/graph.html#tests=[[205,63,8]]&sel=none&displayrange=7&datatype=running

mozilla-central will show data point in the next few hours.
I missed inserting the machine name on the graphs DB. This has now been fixed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(In reply to comment #77)
> (In reply to :Ehsan Akhgari from comment #76)
> > (In reply to comment #75)
> > > (In reply to Armen Zambrano G. [:armenzg] from comment #72)
> > > > Booking a project branch with PGO would give the ability to push changesets
> > > > in a chronological order but care would be needed to not prevent coallescing
> > > > from happening (perhaps this can be configured on our side to be prevented -
> > > > I don't know if it is easy).
> > > 
> > > For a project branch you could probably just use self-serve force pgo builds on
> > > a revision, so no need to push. That assumes disabling merging is easily done
> > > to speed the process up. Whether the history is easily transferable to the m-c
> > > branch in the graph server is another question.
> > 
> > Hmm, I'm not quite sure what that exactly means...
> 
> I think what nthomas is suggesting is to trigger PGO builds on a project
> branch, which would add data points for that branch on the graph server. We
> could then ask a DBA to transfer the data points to the mozilla-central
> records.
> One note is that tbpl might not show any jobs, since the changesets are from
> the past.

OK, like I said we might need this in the near future.  In the case that we do, I'll file another bug and let you guys do what's needed.  Thanks!  :-)
Depends on: 863061
Product: mozilla.org → Release Engineering
Blocks: 1084483
Component: General Automation → General