Bug 710840 - Track peak virtual memory usage of link.exe process during libxul PGO link on graph server
Status: RESOLVED FIXED
Whiteboard: [graphserver][pgo] 2012-01-21 --> lin...
Keywords: sheriffing-P1
Product: Release Engineering
Classification: Other
Component: General Automation
Version: other
Platform: x86 Windows Server 2003
Importance: P1 blocker
Target Milestone: ---
Assigned To: Armen Zambrano [:armenzg] (EDT/UTC-4)
QA Contact: Chris AtLee [:catlee]
Mentors:
Depends on: 710712 753767 833653 834596 863061
Blocks: 750661 832992 833881 1084483
Reported: 2011-12-14 12:23 PST by Ted Mielczarek [:ted.mielczarek]
Modified: 2014-10-17 10:15 PDT

Attachments
April inbound linker max virtual size values (1.68 KB, text/plain)
2012-05-01 06:43 PDT, Ed Morley [:emorley]
no flags
graphserver definition for libxul_link (1.0) (632 bytes, patch)
2012-05-08 11:32 PDT, Joel Maher ( :jmaher)
rhelmer: review+
Add win64 slaves to the graph server (10.86 KB, patch)
2013-01-15 07:36 PST, Armen Zambrano [:armenzg] (EDT/UTC-4)
coop: review+
armenzg: checked-in+
post vsize (2.21 KB, patch)
2013-01-17 13:46 PST, Armen Zambrano [:armenzg] (EDT/UTC-4)
bhearsum: review-
post vsize (3.75 KB, patch)
2013-01-22 08:37 PST, Armen Zambrano [:armenzg] (EDT/UTC-4)
armenzg: review-
do post build steps config changes (450 bytes, patch)
2013-01-22 08:38 PST, Armen Zambrano [:armenzg] (EDT/UTC-4)
armenzg: review-
historic linker vsize values from january 2012 onward (29.34 KB, text/plain)
2013-01-22 11:23 PST, Nathan Froyd [:froydnj]
no flags
script to extract hg revisions + linker vsize from log files (436 bytes, text/plain)
2013-01-22 11:26 PST, Nathan Froyd [:froydnj]
no flags
[buildbotcustom] do post vsize (5.64 KB, patch)
2013-01-22 12:43 PST, Armen Zambrano [:armenzg] (EDT/UTC-4)
bhearsum: review+
armenzg: checked-in+
do post build steps config changes (450 bytes, patch)
2013-01-22 12:44 PST, Armen Zambrano [:armenzg] (EDT/UTC-4)
bhearsum: review+
armenzg: checked-in+
add BuildInfoSteps (980 bytes, patch)
2013-01-23 10:21 PST, Armen Zambrano [:armenzg] (EDT/UTC-4)
bhearsum: review+
armenzg: checked-in+

Description Ted Mielczarek [:ted.mielczarek] 2011-12-14 12:23:06 PST
bug 710712 is going to add the ability to measure the peak virtual memory usage of the linker during the final PGO link phase on Windows. We should track this number on the graph server so we can monitor the situation.
Comment 1 Ted Mielczarek [:ted.mielczarek] 2011-12-15 10:33:03 PST
catlee asked if we could make the build go orange if we went over a threshold for this value. I'm totally in favor of this, but we should see where we're currently at before deciding on what the threshold needs to be. I just pushed bug 710712 to inbound, so we should get some numbers soon.
Comment 2 Ed Morley [:emorley] 2012-01-04 15:23:35 PST
For reference (WINNT 5.2 x86 tinderbox pgo):
2011-12-16: 2887.55 MB / 3027816448 bytes (from bug 710712 comment 16)
2011-01-04: 2886.54 MB / 3026759680 bytes (inbound rev 5025534b9d88)
Comment 3 Ed Morley [:emorley] 2012-01-04 15:24:33 PST
And that should of course read 2012-01-04 (first of many instances of doing that this month I'm sure :-))
Comment 4 Ed Morley [:emorley] 2012-01-10 02:46:41 PST
In lieu of having the graph server set up for this yet, I'll continue posting numbers periodically (at least whilst the memory of the ever so fun sheriffing weekend of bug 709193 is fresh in the mind):

2012-01-10: 2886.26 MB / 3026460672 bytes (inbound rev 01d69766026d)
Comment 5 Ed Morley [:emorley] 2012-01-31 07:23:00 PST
2012-01-31: 2902.54 MB / 3043528704 bytes (inbound rev 5a8ff4828791)

(16 MB increase in the last 3 weeks)
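For anyone cross-checking the figures posted in this bug: the MB values appear to be the byte counts divided by 1024 * 1024. A minimal sketch of that conversion, assuming that definition of MB (the helper name `bytes_to_mb` is made up for illustration):

```python
# Sketch: reproduce the "MB / bytes" pairs quoted in this bug.
# Assumption: "MB" here means bytes / (1024 * 1024).

BYTES_PER_MB = 1024 * 1024


def bytes_to_mb(num_bytes):
    """Convert a byte count to MB (2**20 bytes), rounded to two decimals."""
    return round(num_bytes / BYTES_PER_MB, 2)


# Byte counts copied from comment 4 and comment 5.
jan_10 = 3026460672
jan_31 = 3043528704

print(bytes_to_mb(jan_10))           # 2886.26
print(bytes_to_mb(jan_31))           # 2902.54
print(bytes_to_mb(jan_31 - jan_10))  # 16.28
```

The last line matches the roughly 16 MB increase noted for those three weeks.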
Comment 6 Ed Morley [:emorley] 2012-02-27 14:01:34 PST
2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)

(Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch perhaps?)
Comment 7 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2012-02-27 14:02:26 PST
(In reply to Ed Morley [:edmorley] from comment #6)
> 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> 
> (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch perhaps?)

Right.
Comment 8 Phil Ringnalda (:philor) 2012-05-01 00:15:15 PDT
3021553664 bytes (inbound rev 0831ce6ba72f on the retrigger after the first one died with "compiler is out of heap space in pass 2")
Comment 9 Ed Morley [:emorley] 2012-05-01 03:41:08 PDT
I'll start doing some bisecting to see if we can work out where this fairly hefty increase has come from...
Comment 10 Ed Morley [:emorley] 2012-05-01 04:04:06 PDT
Oh and we've just had another on inbound:
ac1504ff8740
https://tbpl.mozilla.org/php/getParsedLog.php?id=11354573&tree=Mozilla-Inbound

This is looking bad :-(
Comment 11 Ed Morley [:emorley] 2012-05-01 06:43:59 PDT
Created attachment 619911 [details]
April inbound linker max virtual size values
Comment 12 Ed Morley [:emorley] 2012-05-01 08:08:09 PDT
Hopefully bug 750717 should take away some of the pain of not having this at least for now.
Comment 13 cmtalbert 2012-05-03 16:03:00 PDT
So here is my armchair quarterback summary of what I think needs to happen here:
1. Write buildbot step to send this data from the buildslave to the graphserver system.
2. Modify the graphserver to accept this value from the builders (jmaher and/or rhelmer - can you file the database modification bug needed for this?)
3. Ensure the networking flows are in place between the builders and the graphserver systems (I can file this but I need the vlan numbers for all the releng builders - I'm sort of assuming they are separate from the test slaves - if they are on the same vlan as the slaves then maybe we don't need this).
4. Update the script for the dev.treemanagement auto-emailer to send this new data. Catlee, can you take this? You're the only person I know of that knows where the code for that mailer is and how to re-deploy a new version of it.

Please if I have any details wrong, do add a comment and correct me.
Comment 14 Chris AtLee [:catlee] 2012-05-03 17:33:18 PDT
(In reply to Clint Talbert ( :ctalbert ) from comment #13)
> So here is my armchair quarterback summary of what I think needs to happen
> here:
> 1. Write buildbot step to send this data from the buildslave to the
> graphserver system.

we have similar code in place to submit leak info / codesighs already

> 2. Modify the graphserver to accept this value from the builders (jmaher
> and/or rhelmer - can you file the database modification bug needed for this?)

> 3. Ensure the networking flows are in place between the builders and the
> graphserver systems (I can file this but I need the vlan numbers for all the
> releng builders - I'm sort of assuming they are separate from the test
> slaves - if they are on the same vlan as the slaves then maybe we don't need
> this).

no need - we already submit info to graph server from the build machines

> 4. Update the script for the dev.treemanagement auto-emailer to send this
> new data. Catlee, can you take this? You're the only person I know of that
> knows where the code for that mailer is and how to re-deploy a new version
> of it.

it will automatically get picked up
Comment 15 Joel Maher ( :jmaher) 2012-05-04 12:33:12 PDT
it looks like we just need to solve 2 things:
1. Write buildbot step to send this data from the buildslave to the graphserver system.
2. Modify the graphserver to accept this value from the builders (jmaher and/or rhelmer - can you file the database modification bug needed for this?)

I can work on the graph server database mods.  What is the name we want to use for this test?  'libxul_link'?
Comment 16 cmtalbert 2012-05-04 15:01:38 PDT
(In reply to Joel Maher (:jmaher) from comment #15)
> I can work on the graph server database mods.  What is the name we want to
> use for this test?  'libxul_link'?

Thanks joel, that works for me.  Do you also have to add all the machine names for the builders or are the machine names in the graphserver db populated at run-time?
Comment 17 Chris AtLee [:catlee] 2012-05-05 06:08:23 PDT
For whatever reason, build metrics like this use a generic platform name as the machine name, e.g. http://hg.mozilla.org/graphs/file/2018284ed6e7/sql/data.sql#l1171

so you can use those names (with "_leak_test" == debug), and just add the new test to the database.
Comment 18 Joel Maher ( :jmaher) 2012-05-08 11:32:21 PDT
Created attachment 622059 [details] [diff] [review]
graphserver definition for libxul_link (1.0)
Comment 19 Joel Maher ( :jmaher) 2012-05-10 07:03:56 PDT
landed graph server definition: http://hg.mozilla.org/graphs/rev/6bc547cd2202
Comment 20 Ed Morley [:emorley] 2012-09-28 17:59:04 PDT
(In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> 
> (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)

2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
Comment 21 :Ehsan Akhgari 2012-11-02 08:52:35 PDT
So, we're at 3.4 gigs.  Any chance we can get those graphs before The Next Big Surprise?  :-)
Comment 22 Armen Zambrano [:armenzg] (EDT/UTC-4) 2012-11-02 08:57:58 PDT
Could this be fixed by tbpl turning orange after a certain threshold?
Could a tool be written to grab the information and create a graph?
Comment 23 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2012-11-02 09:00:47 PDT
Well TBPL already turns red after a certain threshold.  The point is to see what is causing the increases, not when we pass some arbitrary threshold.
Comment 24 :Ehsan Akhgari 2012-11-02 10:06:24 PDT
The easiest way I think is to submit this data to the graph server and graph it there.  The point here is to know that the problem is approaching before it arrives.
Comment 25 Armen Zambrano [:armenzg] (EDT/UTC-4) 2012-11-02 11:26:05 PDT
Oh!
It seems we only do TinderboxPrint!

Anyone know what is required to post the data in the graphs server?

I can see that this was added:
insert into tests values (NULL,"libxul_link","LibXUL Memory during link",0,1,NULL);
and IT ran it on the DB.
Comment 26 Nick Thomas [:nthomas] 2012-11-04 13:50:50 PST
Armen, look for usage of GraphServerPost in buildbotcustom/process/factory.py.
Comment 27 Ed Morley [:emorley] 2013-01-08 04:53:22 PST
(In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > 
> > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> 
> 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)

2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)

500MB more in 3 months!

We need to fix this sooner rather than later (and ideally import all the old nightly figures from the logs, so we can more easily see what has bumped it up so much).

At current rate of increase we have only a couple of months before we hit this again.
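The "couple of months" estimate can be sanity-checked with a straight-line extrapolation between the two most recent data points. This is only a rough sketch: it assumes growth stays linear and that the linker fails at about 4 GiB of virtual address space, neither of which is established here. The function name is made up for illustration.

```python
from datetime import date

# Data points copied from comment 20 and comment 27 (the second date is
# 2013-01-07, per the correction in comment 28).
FIRST = (date(2012, 9, 28), 3373408256)
LAST = (date(2013, 1, 7), 3881619456)

# Assumption: the failure point is around 4 GiB; the exact value is unknown.
LIMIT_BYTES = 4 * 1024 ** 3


def days_until_limit(first, last, limit=LIMIT_BYTES):
    """Linearly extrapolate the growth between two samples up to the limit."""
    (d0, v0), (d1, v1) = first, last
    bytes_per_day = (v1 - v0) / (d1 - d0).days
    return (limit - v1) / bytes_per_day


print(round(days_until_limit(FIRST, LAST)))  # roughly 82 days
```

Since the linker may fail before the full 4 GiB is reached, the real margin is probably shorter than this figure.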
Comment 28 Ed Morley [:emorley] 2013-01-08 04:53:51 PST
Sorry, s/2012-01-07/2013-01-07/
Comment 29 :Ehsan Akhgari 2013-01-08 12:15:17 PST
(In reply to comment #27)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> > (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > > 
> > > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> > 
> > 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
> 
> 2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)
> 
> 500MB more in 3 months!
> 
> We need to fix this sooner rather than later (and ideally import all the old
> nightly figures from the logs, so we can more easily see what has bumped it up
> so much).
> 
> At current rate of increase we have only a couple of months before we hit this
> again.

So what parts of the WebRTC code ended up going into libxul?
Comment 30 Randell Jesup [:jesup] 2013-01-08 12:51:12 PST
media/webrtc went in (/signaling had to be in, and it references many things in /trunk).  We could in the future (especially as the code gets more locked down) probably move trunk to gkmedia, and deal with adding a lot of symbols to symbols.def.in (or finding a better way to deal with that!)  The decision IIRC was to allow those two to come in, but leave the rest in gkmedia.

Signaling landed in m-c at FF 18th uplift, around Oct 6th.  Off the top of my head, I can't remember if media/webrtc/trunk was in gkmedia before that (I think it was), so both moved to libxul around then.
Comment 31 :Ehsan Akhgari 2013-01-08 12:59:57 PST
OK, I filed bug 827985 to move that code out of libxul.  Thanks for the clarification.

Ed, this needs to be treated with utmost priority.  Who should work on the graphing thing?
Comment 32 Randell Jesup [:jesup] 2013-01-08 13:10:29 PST
FYI, per above on 9/28 we were at 3.2G, 11/2 we were at 3.4G (after signaling landed), now we're at 3.7G.  So the 500MB since 11/2 is NOT webrtc.  And I think (thinking back) that webrtc/trunk was in xul before 10/6; if so we took a (guess) 50-100MB hit for signaling, and likely webrtc has contributed little since then.

Can you run a PGO --disable-webrtc build and report the number?  I can't build on windows currently - thanks Microsoft!!  --disable-webrtc will be a significant overestimate of what you'll get back (as signaling won't get compiled).
Comment 33 :Ehsan Akhgari 2013-01-08 13:36:37 PST
(In reply to comment #32)
> FYI, per above on 9/28 we were at 3.2G, 11/2 we were at 3.4G (after signaling
> landed), now we're at 3.7G.  So the 500MB since 11/2 is NOT webrtc.  And I
> think (thinking back) that webrtc/trunk was in xul before 10/6; if so we took a
> (guess) 50-100MB hit for signaling, and likely webrtc has contributed little
> since then.

It doesn't matter how much we're going to win from this, we need to move all of the code that we can outside of libxul, and the WebRTC stuff is just part of it.

> Can you run a PGO --disable-webrtc build and report the number?  I can't build
> on windows currently - thanks Microsoft!!  --disable-webrtc will be a
> significant overestimate of what you'll get back (as signaling won't get
> compiled).

I only have VS2012, so the numbers that I get will not be representative (I actually don't know how to run the linker vmem usage measurement script locally.)  That being said, you can push to try to get the numbers, but like I said it doesn't matter much, we need *all* of the wins that we can get.
Comment 34 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-08 14:59:21 PST
Let me try to help with this.
Comment 35 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-09 13:34:56 PST
I know that the following file contains the value that we need to post:
obj-firefox\toolkit\library\linker-vsize

We could add code in here:
http://hg.mozilla.org/mozilla-central/file/0faa1d47ea80/build/link.py#l19
and post to the graph server.

Or we can add a new post compilation step on buildbot to post it.

I will look in the releng side since we already have some GraphServer logic.
Comment 36 Ted Mielczarek [:ted.mielczarek] 2013-01-09 17:20:45 PST
Right, the file contains the info, and it's also output to stdout in the build step in a line starting with "TinderboxPrint: linker max vsize:". You should use whichever is easier.
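A minimal sketch of the stdout option, for illustration only. It assumes the number that follows the "TinderboxPrint: linker max vsize:" prefix is the vsize in bytes; the exact layout of that log line is not shown in this bug, and the sample log below is made up.

```python
import re

# Assumption: an integer byte count follows the prefix on the same line.
VSIZE_RE = re.compile(r"^TinderboxPrint: linker max vsize:\s*(\d+)")


def find_linker_vsize(log_lines):
    """Return the first linker max vsize found in the log, or None."""
    for line in log_lines:
        match = VSIZE_RE.match(line)
        if match:
            return int(match.group(1))
    return None


# Made-up log excerpt; the byte count is the one quoted in comment 27.
sample_log = [
    "some unrelated build output",
    "TinderboxPrint: linker max vsize: 3881619456",
]
print(find_linker_vsize(sample_log))  # 3881619456
```

Reading the linker-vsize file directly avoids the regular expression, at the cost of needing access to the objdir.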
Comment 37 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-15 07:36:06 PST
Created attachment 702314 [details] [diff] [review]
Add win64 slaves to the graph server
Comment 38 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-17 13:46:48 PST
Created attachment 703578 [details] [diff] [review]
post vsize
Comment 39 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-18 11:09:13 PST
I have my first data point on staging:
http://graphs.allizom.org/graph.html#tests=[[205,6,8]]

At what point is this going to blow up? What is the upper limit?

With regards to producing a graph, I would suggest booking a project branch and asking releng to add PGO jobs to it. At that point I would suggest creating a list of changesets that we want data points for and triggering PGO builds.

Once bhearsum reviews this and we land it we will start having the data points.

(In reply to Ed Morley (Away 18th-20th Jan) [:edmorley UTC+0] from comment #27)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> > (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> > 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
> 2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)
> 500MB more in 3 months!
2013-01-18: 3756.02 MB / 3938476032 bytes (m-c rev b52c02f77cf5)
Comment 40 Ted Mielczarek [:ted.mielczarek] 2013-01-18 11:13:49 PST
The blowup point is somewhere near 4GB (the total amount of virtual memory available to a 32-bit process running on Windows x64), but we don't know exactly where. Essentially once the linker tries to allocate more virtual memory and it runs out it will blow up, but that last allocation could be fairly large, it's hard to tell.
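Because the exact failure point is unknown, one way to reason about it is to track the remaining headroom and flag builds that come within some safety margin of 4 GiB. The sketch below is illustrative only; the 256 MB margin is a hypothetical value, not one proposed in this bug.

```python
# Assumption: 4 GiB is the ceiling for the 32-bit linker on Windows x64.
ADDRESS_SPACE_BYTES = 4 * 1024 ** 3

# Hypothetical safety margin, chosen only for this example.
WARN_MARGIN_BYTES = 256 * 1024 ** 2


def headroom_bytes(vsize):
    """Bytes left between the observed peak vsize and the 4 GiB ceiling."""
    return ADDRESS_SPACE_BYTES - vsize


def is_close_to_limit(vsize, margin=WARN_MARGIN_BYTES):
    """True when the remaining headroom is smaller than the margin."""
    return headroom_bytes(vsize) < margin


vsize = 3938476032  # value from comment 39
print(headroom_bytes(vsize))     # 356491264
print(is_close_to_limit(vsize))  # False with a 256 MB margin
```

A margin is needed precisely because the final failing allocation could be large, so the last successful measurement may sit well below the ceiling.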
Comment 41 :Ehsan Akhgari 2013-01-18 12:33:11 PST
(In reply to comment #39)
> With regards to producing a graph I would suggest to book a project branch and
> request releng to add PGO jobs to it. At that point I would suggest creating a
> list of changesets that we want data points for and trigger PGO builds.

Who would own updating the branch though?  Why can't we just get the graphs for inbound and central?
Comment 42 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-18 13:03:22 PST
(In reply to :Ehsan Akhgari from comment #41)
> (In reply to comment #39)
> > With regards to producing a graph I would suggest to book a project branch and
> > request releng to add PGO jobs to it. At that point I would suggest creating a
> > list of changesets that we want data points for and trigger PGO builds.
> 
> Who would own updating the branch though?  Why can't we just get the graphs
> for inbound and central?

I thought you guys mentioned that you wanted to get some history to see what things increased the memory usage. To build up history we would need to set up a project branch and trigger old changesets.

As soon as we land the patch we will get coverage on all branches that have pgo enabled from there on.
Comment 43 :Ehsan Akhgari 2013-01-18 13:39:43 PST
(In reply to comment #42)
> (In reply to :Ehsan Akhgari from comment #41)
> > (In reply to comment #39)
> > > With regards to producing a graph I would suggest to book a project branch and
> > > request releng to add PGO jobs to it. At that point I would suggest creating a
> > > list of changesets that we want data points for and trigger PGO builds.
> > 
> > Who would own updating the branch though?  Why can't we just get the graphs
> > for inbound and central?
> 
> I thought you guys mentioned that you wanted to get some history to see what
> things increased the memory usage. To build up history we would need to setup a
> project branch and trigger old changesets.

I don't see why.  We do have the data in the old logs, right?  We should just be able to write a script to parse them out or something.  Am I missing something?
Comment 44 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-18 13:49:16 PST
(In reply to :Ehsan Akhgari from comment #43)
> (In reply to comment #42)
> > (In reply to :Ehsan Akhgari from comment #41)
> > > (In reply to comment #39)
> > > > With regards to producing a graph I would suggest to book a project branch and
> > > > request releng to add PGO jobs to it. At that point I would suggest creating a
> > > > list of changesets that we want data points for and trigger PGO builds.
> > > 
> > > Who would own updating the branch though?  Why can't we just get the graphs
> > > for inbound and central?
> > 
> > I thought you guys mentioned that you wanted to get some history to see what
> > things increased the memory usage. To build up history we would need to setup a
> > project branch and trigger old changesets.
> 
> I don't see why.  We do have the data in the old logs, right?  We should
> just be able to write a script to parse them out or something.  Am I missing
> something?

Good point! That would save lots of time.
Comment 45 Ben Hearsum (:bhearsum) 2013-01-21 06:34:07 PST
Comment on attachment 703578 [details] [diff] [review]
post vsize

Review of attachment 703578 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +1344,5 @@
>                      data=WithProperties('TinderboxPrint: num_ctors: %(num_ctors:-unknown)s'),
>                      ))
>  
> +    def addPostBuildSteps(self):
> +        if self.profiledBuild and self.platform in ('win32',) and self.baseName:

Please add an explicit flag to MercurialBuildFactory/config.py for this rather than guessing based on 3 different things.

@@ +1356,5 @@
> +                    return {'testresults': []}
> +
> +            self.addStep(SetProperty(
> +                name='get_linker_vsize',
> +                command=['cat', '%s\\toolkit\\library\\linker-vsize' % self.mozillaObjdir],

Why the '\\'? All of other steps use / without issue.
Comment 46 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-21 08:32:32 PST
closed trees: bug 832992 :(
Comment 47 Marco Bonardo [::mak] 2013-01-21 08:52:26 PST
(In reply to Armen Zambrano G. [:armenzg] from comment #46)
> closed trees: bug 832992 :(

well, in the end that may be just a disk space issue, though we are slowly getting near the limit (now at 3939495936).
Comment 48 :Ehsan Akhgari 2013-01-22 05:56:26 PST
I'm gonna call this a blocker this time.
Comment 49 Marco Bonardo [::mak] 2013-01-22 06:04:06 PST
May this be added to the releng Q1 goals please, we keep hitting the problem, and while there isn't a clear solution to it, this is the only way we have to track its evolution.
Comment 50 Ben Hearsum (:bhearsum) 2013-01-22 06:06:37 PST
This bug is in progress, adding it to the goals list won't make it happen any faster. Armen had a working implementation, it just needs small tweaks before it can be landed.
Comment 51 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-22 08:37:47 PST
Created attachment 704904 [details] [diff] [review]
post vsize

dump_masters shows that this gets added for every PGO and WINNT nightly build
Comment 52 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-22 08:38:06 PST
Created attachment 704905 [details] [diff] [review]
do post build steps config changes
Comment 53 Ed Morley [:emorley] 2013-01-22 11:00:10 PST
For historic values see:
https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/

Nathan, don't suppose you could attach the raw values, so we can backfill the gap on graphs.m.o?
Comment 54 Nathan Froyd [:froydnj] 2013-01-22 11:23:16 PST
Created attachment 705019 [details]
historic linker vsize values from january 2012 onward

Sure, Ed, no problem.  Here's the file I used; the format is:

<build-date> <hg-revision> <linker-vsize>

The data doesn't perfectly capture the hg revision for every log file, but the number of points that it missed was small enough that I wasn't going to worry about it.
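A minimal sketch of reading a file in that three-column format. It assumes the fields are whitespace-separated and that the vsize is an integer byte count; the sample lines reuse revisions and values quoted earlier in this bug rather than the attachment's actual contents.

```python
from collections import namedtuple

Record = namedtuple("Record", "build_date revision vsize")


def parse_records(lines):
    """Parse '<build-date> <hg-revision> <linker-vsize>' lines into Records."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and lines missing a field
        build_date, revision, vsize = fields
        records.append(Record(build_date, revision, int(vsize)))
    return records


# Illustrative input; the date format in the real attachment may differ.
sample = [
    "2012-02-27 0714ec049da2 2921103360",
    "",
    "2012-09-28 938e09d5a465 3373408256",
]
for record in parse_records(sample):
    print(record.build_date, record.revision, record.vsize)
```

Lines without all three fields are skipped, which matches the note that a few log files did not yield a revision.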
Comment 55 Nathan Froyd [:froydnj] 2013-01-22 11:26:27 PST
Created attachment 705021 [details]
script to extract hg revisions + linker vsize from log files

...and for reference, here's the script I used to generate the previous file.  The script expects the names of the log files to start with:

YYYY-MM-DD-HH-MM-SS

for the timestamp portion, but that's probably not hard to change.  Simply invoke:

extract-info <list-of-log-files>
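The timestamp convention can be handled with `datetime.strptime`. The sketch below is not the attached script; the function name and the sample file name are made up, and it only assumes the name starts with a YYYY-MM-DD-HH-MM-SS prefix.

```python
import os
from datetime import datetime

# Length of the "YYYY-MM-DD-HH-MM-SS" prefix (19 characters).
PREFIX_LEN = len("YYYY-MM-DD-HH-MM-SS")


def log_timestamp(path):
    """Parse the timestamp prefix of a log file name into a datetime."""
    name = os.path.basename(path)
    return datetime.strptime(name[:PREFIX_LEN], "%Y-%m-%d-%H-%M-%S")


# Hypothetical file name, for illustration only.
print(log_timestamp("logs/2013-01-07-03-11-02-mozilla-central.txt"))
```

`strptime` raises `ValueError` for names that do not start with a valid timestamp, so callers would need to catch that for stray files.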
Comment 56 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-22 11:52:06 PST
Comment on attachment 704904 [details] [diff] [review]
post vsize

Through IRC.
Comment 57 Ben Hearsum (:bhearsum) 2013-01-22 11:53:08 PST
Comment on attachment 704904 [details] [diff] [review]
post vsize

Sorry, Armen and I talked on IRC about this a while ago but I forgot to update the bug:
13:11 < bhearsum> armenzg: i meant that we should have a flag for 'post_linker_size' or something, not 'do_post_build_steps'
13:11 < bhearsum> i want this line gone:
13:11 < bhearsum>  if self.profiledBuild and self.platform in ('win32',) and self.baseName:
13:11 < bhearsum> because it guesses about what should happen
13:11 < bhearsum> that can be replaced with if self.postLinkerSize
13:12 < armenzg> k
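The change requested above can be sketched as follows. This is a simplified, hypothetical stand-in class, not the real MercurialBuildFactory from buildbotcustom; it only shows an explicit opt-in flag replacing the inferred condition, and it uses forward slashes in the path as suggested in the review in comment 45.

```python
class BuildFactorySketch:
    """Hypothetical stand-in for a build factory; not buildbotcustom code."""

    def __init__(self, objdir, post_linker_size=False):
        self.objdir = objdir
        # Explicit flag set from config, instead of guessing from
        # profiledBuild, platform and baseName.
        self.post_linker_size = post_linker_size
        self.steps = []

    def add_post_build_steps(self):
        """Queue the command that reads linker-vsize, only when opted in."""
        if not self.post_linker_size:
            return
        self.steps.append(
            ["cat", "%s/toolkit/library/linker-vsize" % self.objdir])


factory = BuildFactorySketch("obj-firefox", post_linker_size=True)
factory.add_post_build_steps()
print(factory.steps)
```

With the flag, the config decides which builders post the value, and the factory no longer needs to know why.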
Comment 58 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-22 12:43:55 PST
Created attachment 705069 [details] [diff] [review]
[buildbotcustom] do post vsize
Comment 59 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-22 12:44:33 PST
Created attachment 705072 [details] [diff] [review]
do post build steps config changes
Comment 60 Kim Moir [:kmoir] 2013-01-22 14:41:03 PST
in production
Comment 61 :Ehsan Akhgari 2013-01-22 14:44:44 PST
(In reply to comment #60)
> in production

Where can the graphs be found?
Comment 62 Nick Thomas [:nthomas] 2013-01-22 18:38:24 PST
Comment on attachment 705072 [details] [diff] [review]
do post build steps config changes

Reverted this for bustage in bug 833653.

default:    http://hg.mozilla.org/build/buildbot-configs/rev/df9a319c5edd
production: http://hg.mozilla.org/build/buildbot-configs/rev/0dcbc3ce69f9
Comment 63 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-23 10:21:46 PST
Created attachment 705425 [details] [diff] [review]
add BuildInfoSteps

I don't know at which point we lost this line from the patch.
This restores the missing sourcestamp info.
Comment 64 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-23 13:08:07 PST
I will land and reconfig in the morning.
Comment 65 :Ehsan Akhgari 2013-01-23 13:50:12 PST
(In reply to comment #64)
> I will land and reconfig in the morning.

Thanks Armen!

Can you please also let me know how much historical data we can get out of this on each of central and inbound?  It would be absolutely amazing if we can get per-checkin data for the interesting ranges highlighted in <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
Comment 66 Ed Morley [:emorley] 2013-01-23 13:53:55 PST
(In reply to :Ehsan Akhgari from comment #65)
> Can you please also let me know how much historical data we can get out of
> this on each of central and inbound?  It would be absolutely amazing if we
> can get per-checkin data for the interesting ranges highlighted in
> <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.

Per push logs are only kept for 30 days, so try runs will be required.
Comment 67 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-23 14:04:57 PST
(In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> (In reply to :Ehsan Akhgari from comment #65)
> > Can you please also let me know how much historical data we can get out of
> > this on each of central and inbound?  It would be absolutely amazing if we
> > can get per-checkin data for the interesting ranges highlighted in
> > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> 
> Per push logs are only kept for 30days, so try runs will be required.

Would a project branch be more interesting for this project?

(In reply to :Ehsan Akhgari from comment #65)
> (In reply to comment #64)
> > I will land and reconfig in the morning.
> 
> Thanks Armen!
> 
> Can you please also let me know how much historical data we can get out of
> this on each of central and inbound?  It would be absolutely amazing if we
> can get per-checkin data for the interesting ranges highlighted in
> <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.

What edmorley says is correct. I think selfserve would be needed for this:
Comment 68 :Ehsan Akhgari 2013-01-23 14:05:24 PST
(In reply to comment #66)
> (In reply to :Ehsan Akhgari from comment #65)
> > Can you please also let me know how much historical data we can get out of
> > this on each of central and inbound?  It would be absolutely amazing if we
> > can get per-checkin data for the interesting ranges highlighted in
> > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> 
> Per push logs are only kept for 30days, so try runs will be required.

OK.  so I guess we can explore that path when we need to.  Thanks!
Comment 69 :Ehsan Akhgari 2013-01-23 14:07:59 PST
(In reply to comment #67)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> > (In reply to :Ehsan Akhgari from comment #65)
> > > Can you please also let me know how much historical data we can get out of
> > > this on each of central and inbound?  It would be absolutely amazing if we
> > > can get per-checkin data for the interesting ranges highlighted in
> > > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> > 
> > Per push logs are only kept for 30days, so try runs will be required.
> 
> Would a project branch be more interesting for this project?

Not sure how that would help?
Comment 70 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-23 14:19:57 PST
(In reply to :Ehsan Akhgari from comment #69)
> (In reply to comment #67)
> > (In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> > > (In reply to :Ehsan Akhgari from comment #65)
> > > > Can you please also let me know how much historical data we can get out of
> > > > this on each of central and inbound?  It would be absolutely amazing if we
> > > > can get per-checkin data for the interesting ranges highlighted in
> > > > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> > > 
> > > Per push logs are only kept for 30days, so try runs will be required.
> > 
> > Would a project branch be more interesting for this project?
> 
> Not sure how that would help?

Turnaround is faster.
Other pgo jobs (not related to this data gathering) could be posting numbers in the try graph and would pollute the graph. Even though I can't see any PGO that was triggered by a dev today.

On the other hand, customizing a branch to only do Windows PGO without test jobs could add some setup overhead.

I think either way is fine.
Assuming that it works for the try server. I will try one build after I reconfig in the morning.
Comment 71 :Ehsan Akhgari 2013-01-23 14:24:32 PST
(In reply to comment #70)
> (In reply to :Ehsan Akhgari from comment #69)
> > (In reply to comment #67)
> > > (In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> > > > (In reply to :Ehsan Akhgari from comment #65)
> > > > > Can you please also let me know how much historical data we can get out of
> > > > > this on each of central and inbound?  It would be absolutely amazing if we
> > > > > can get per-checkin data for the interesting ranges highlighted in
> > > > > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
> > > > 
> > > > Per push logs are only kept for 30days, so try runs will be required.
> > > 
> > > Would a project branch be more interesting for this project?
> > 
> > Not sure how that would help?
> 
> Turn around is faster.
> Other pgo jobs (not related to this data gathering) could be posting numbers in
> the try graph and would polluting the graph. Even though I can't see any PGO
> that was triggered by a dev today.

Hmm, good point.  But that also takes away the ability of just pushing new heads and get builds on them, right?  I mean, we would need to push new heads in the right order, right?

> On the other hand, customizing a branch to only do Windows PGO without test
> jobs could add some overhead to setup.

Ouch.

> I think either way is fine.
> Assuming that it works for the try server. I will try one build after I
> reconfig in the morning.

Thanks, that's a good idea regardless.
Comment 72 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-24 07:46:45 PST
The good news is that this is live.
The original purpose of the bug is fulfilled (as I understand it).

The bad news is that there is no way to trigger PGO builds on try.
Developers change the mozconfig but that does not trigger the PGO/try builders.
Pushing to try as PGO would print the linker size but it won't post to the try server (as before).

Booking a project branch with PGO would give the ability to push changesets in chronological order, but care would be needed to prevent coalescing from happening (perhaps this can be configured on our side - I don't know if it is easy).

Is it good enough to have data points on the graphs DB from here on?

IIUC there are means to gather historical data by pushing to try and scraping the linker size.
Comment 73 :Ehsan Akhgari 2013-01-24 09:30:10 PST
(In reply to comment #72)
> The good news is that this is live.
> The original purpose of the bug is fulfilled (as I understand it).
> 
> The bad news is that there is no way to trigger PGO builds on try.
> Developers change the mozconfig but that does not trigger the PGO/try builders.
> Pushing to try as PGO would print the linker size but it won't post to the try
> server (as before).
> 
> Booking a project branch with PGO would give the ability to push changesets in
> a chronological order but care would be needed to prevent coalescing from
> happening (perhaps this can be configured on our side to be prevented - I don't
> know if it is easy).
> 
> Is it good enough to have data points on the graphs DB from here on?

It's good but definitely not enough.

So the first step is to parse through the PGO logs for the past 30 days, and also nightly logs for as long as we have them stored, and report them to the graph server associated with the correct date and changeset.  Then, I guess we'll need to fill in the gaps for the individual changesets in the spikes that we've seen in Nathan's analysis.  That would help us experiment with the possibility of finding culprit changesets which have added the most to the linker memory usage and see how we can deal with that.
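
As a rough illustration of that first step, here is a minimal Python sketch that pulls a (revision, vsize) pair out of one build log and then ranks the jumps between consecutive builds. The two log-line patterns are assumptions made for illustration, not the confirmed format of the build logs, and the helper names are hypothetical; reporting the values to the graph server is left out.

```python
import re
from typing import List, NamedTuple, Optional, Sequence, Tuple


class VsizeSample(NamedTuple):
    revision: str     # hg changeset the build was made from
    vsize_bytes: int  # peak virtual size of the linker process


# ASSUMPTION: both patterns are hypothetical; adjust them to the real log lines.
REV_RE = re.compile(r"revision:\s*([0-9a-f]{12,40})")
VSIZE_RE = re.compile(r"linker max vsize:\s*(\d+)")


def extract_sample(log_text: str) -> Optional[VsizeSample]:
    """Return the sample found in one build log, or None if either field is missing."""
    rev = REV_RE.search(log_text)
    vsize = VSIZE_RE.search(log_text)
    if rev is None or vsize is None:
        return None
    return VsizeSample(rev.group(1), int(vsize.group(1)))


def largest_increases(
    samples: Sequence[VsizeSample], top: int = 3
) -> List[Tuple[int, str, str]]:
    """Rank consecutive builds by vsize growth.

    `samples` must already be in chronological order. Each result is
    (delta_bytes, previous_revision, revision), largest delta first.
    """
    deltas = [
        (cur.vsize_bytes - prev.vsize_bytes, prev.revision, cur.revision)
        for prev, cur in zip(samples, samples[1:])
    ]
    return sorted(deltas, key=lambda d: d[0], reverse=True)[:top]
```

Each range returned by `largest_increases` would then be a candidate for per-changeset builds to narrow down the culprit.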

That all being said, gathering detailed historical data only matters if we decide to keep PGO enabled and try to keep the linker memory usage bounded, which is a call that we have not made yet.  We need more of the dependencies of bug 833881 to be resolved before we can make a meaningful decision on that.  If we do decide to keep PGO enabled, I'll file another bug in the RelEng component to gather more historical data.

Last but not least, thanks everyone for your help here, really appreciated!  :-)
Comment 74 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-24 13:25:32 PST
I thought my reconfig this morning would have done the trick, but it seems that when a change is backed out from both branches ("production" and "default"), the typical land-to-default-then-merge-to-production flow misses the change [1]. I've seen this happen a couple of times in the past.

I landed it again (on production) and reconfigured the build masters again:
http://hg.mozilla.org/build/buildbot-configs/rev/92846acd0ba5

I re-triggered a second PGO build here that should be successful:
https://tbpl.mozilla.org/?jobname=WINNT%205.2%20mozilla-central%20pgo-build&rev=680e46fecff0

[1] http://hg.mozilla.org/build/buildbot-configs/graph
Comment 75 Nick Thomas [:nthomas] 2013-01-24 17:22:51 PST
(In reply to Armen Zambrano G. [:armenzg] from comment #72)
> Booking a project branch with PGO would give the ability to push changesets
> in a chronological order but care would be needed to prevent coalescing
> from happening (perhaps this can be configured on our side to be prevented -
> I don't know if it is easy).

For a project branch you could probably just use self-serve force pgo builds on a revision, so no need to push. That assumes disabling merging is easily done to speed the process up. Whether the history is easily transferable to the m-c branch in the graph server is another question.
Comment 76 :Ehsan Akhgari 2013-01-24 17:26:45 PST
(In reply to comment #75)
> (In reply to Armen Zambrano G. [:armenzg] from comment #72)
> > Booking a project branch with PGO would give the ability to push changesets
> > in a chronological order but care would be needed to prevent coalescing
> > from happening (perhaps this can be configured on our side to be prevented -
> > I don't know if it is easy).
> 
> For a project branch you could probably just use self-serve force pgo builds on
> a revision, so no need to push. That assumes disabling merging is easily done
> to speed the process up. Whether the history is easily transferable to the m-c
> branch in the graph server is another question.

Hmm, I'm not quite sure what that exactly means...
Comment 77 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-25 06:05:07 PST
(In reply to :Ehsan Akhgari from comment #76)
> (In reply to comment #75)
> > (In reply to Armen Zambrano G. [:armenzg] from comment #72)
> > > Booking a project branch with PGO would give the ability to push changesets
> > > in a chronological order but care would be needed to prevent coalescing
> > > from happening (perhaps this can be configured on our side to be prevented -
> > > I don't know if it is easy).
> > 
> > For a project branch you could probably just use self-serve force pgo builds on
> > a revision, so no need to push. That assumes disabling merging is easily done
> > to speed the process up. Whether the history is easily transferable to the m-c
> > branch in the graph server is another question.
> 
> Hmm, I'm not quite sure what that exactly means...

I think what nthomas is suggesting is to trigger PGO builds on a project branch which would add data points for that branch on the graph server. We could then ask a DBA to transfer the data points to the mozilla-central records.
One note is that tbpl might not show any jobs since the changesets are from the past.
Comment 78 Armen Zambrano [:armenzg] (EDT/UTC-4) 2013-01-25 07:12:29 PST
We are now getting data points:
http://graphs.mozilla.org/graph.html#tests=[[205,63,8]]&sel=none&displayrange=7&datatype=running

mozilla-central will show data points in the next few hours.
I missed inserting the machine name on the graphs DB. This has now been fixed.
Comment 79 :Ehsan Akhgari 2013-01-25 08:46:59 PST
(In reply to comment #77)
> (In reply to :Ehsan Akhgari from comment #76)
> > (In reply to comment #75)
> > > (In reply to Armen Zambrano G. [:armenzg] from comment #72)
> > > > Booking a project branch with PGO would give the ability to push changesets
> > > > in a chronological order but care would be needed to prevent coalescing
> > > > from happening (perhaps this can be configured on our side to be prevented -
> > > > I don't know if it is easy).
> > > 
> > > For a project branch you could probably just use self-serve force pgo builds on
> > > a revision, so no need to push. That assumes disabling merging is easily done
> > > to speed the process up. Whether the history is easily transferable to the m-c
> > > branch in the graph server is another question.
> > 
> > Hmm, I'm not quite sure what that exactly means...
> 
> I think what nthomas is suggesting is to trigger PGO builds on a project branch
> which would add data points for that branch on the graph server. We could then
> ask a DBA to transfer the data points to the mozilla-central records.
> One note is that tbpl might not show any jobs since the changesets are from
> the past.

OK, like I said we might need this in the near future.  In the case that we do, I'll file another bug and let you guys do what's needed.  Thanks!  :-)
