Closed
Bug 710840
Opened 13 years ago
Closed 11 years ago
Track peak virtual memory usage of link.exe process during libxul PGO link on graph server
Categories
(Release Engineering :: General, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ted, Assigned: armenzg)
References
Details
(Keywords: sheriffing-P1, Whiteboard: [graphserver][pgo] 2012-01-21 --> linker max vsize:3757MB)
Attachments
(8 files, 3 obsolete files)
1.68 KB, text/plain
632 bytes, patch (rhelmer: review+)
10.86 KB, patch (coop: review+; armenzg: checked-in+)
29.34 KB, text/plain
436 bytes, text/plain
5.64 KB, patch (bhearsum: review+; armenzg: checked-in+)
450 bytes, patch (bhearsum: review+; armenzg: checked-in+)
980 bytes, patch (bhearsum: review+; armenzg: checked-in+)
Bug 710712 is going to add the ability to measure the peak virtual memory usage of the linker during the final PGO link phase on Windows. We should track this number on the graph server so we can monitor the situation.
Reporter
Comment 1•13 years ago
catlee asked if we could make the build go orange if we went over a threshold for this value. I'm totally in favor of this, but we should see where we're currently at before deciding on what the threshold needs to be. I just pushed bug 710712 to inbound, so we should get some numbers soon.
Updated•13 years ago
OS: Windows 7 → Windows Server 2003
Priority: -- → P3
Whiteboard: [graphserver][pgo]
Comment 2•13 years ago
For reference (WINNT 5.2 x86 tinderbox pgo):
2011-12-16: 2887.55 MB / 3027816448 bytes (from bug 710712 comment 16)
2011-01-04: 2886.54 MB / 3026759680 bytes (inbound rev 5025534b9d88)
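As a sanity check on figures like these, the MB values quoted throughout this thread are just the raw byte counts divided by 1024^2. A tiny illustrative conversion, not part of any build tooling:

```python
def bytes_to_mb(n_bytes):
    # Convert a raw byte count to the mebibyte figure quoted in these
    # comments, rounded to two decimal places.
    return round(n_bytes / 1024.0 ** 2, 2)

print(bytes_to_mb(3027816448))  # the 2887.55 MB reported for 2011-12-16
```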
Comment 3•13 years ago
And that should of course read 2012-01-04 (first of many instances of doing that this month I'm sure :-))
Comment 4•13 years ago
In lieu of having the graph server set up for this yet, I'll continue posting numbers periodically (at least whilst the memory of the ever so fun sheriffing weekend of bug 709193 is fresh in the mind):
2011-01-10: 2886.26 MB / 3026460672 bytes (inbound rev 01d69766026d)
Comment 5•12 years ago
2012-01-31: 2902.54 MB / 3043528704 bytes (inbound rev 5a8ff4828791)
(16 MB increase in the last 3 weeks)
Comment 6•12 years ago
2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
(Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch perhaps?)
Comment 7•12 years ago
(In reply to Ed Morley [:edmorley] from comment #6)
> 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
>
> (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch perhaps?)

Right.
Comment 8•12 years ago
3021553664 bytes (inbound rev 0831ce6ba72f on the retrigger after the first one died with "compiler is out of heap space in pass 2")
Comment 9•12 years ago
I'll start doing some bisecting to see if we can work out where this fairly hefty increase has come from...
Comment 10•12 years ago
Oh and we've just had another on inbound: ac1504ff8740
https://tbpl.mozilla.org/php/getParsedLog.php?id=11354573&tree=Mozilla-Inbound
This is looking bad :-(
Severity: normal → critical
Priority: P3 → P1
Comment 11•12 years ago
Comment 12•12 years ago
Hopefully bug 750717 should take away some of the pain of not having this at least for now.
Severity: critical → major
Comment 13•12 years ago
So here is my armchair quarterback summary of what I think needs to happen here:
1. Write buildbot step to send this data from the buildslave to the graphserver system.
2. Modify the graphserver to accept this value from the builders (jmaher and/or rhelmer - can you file the database modification bug needed for this?)
3. Ensure the networking flows are in place between the builders and the graphserver systems (I can file this but I need the vlan numbers for all the releng builders - I'm sort of assuming they are separate from the test slaves - if they are on the same vlan as the slaves then maybe we don't need this).
4. Update the script for the dev.treemanagement auto-emailer to send this new data. Catlee, can you take this? You're the only person I know of that knows where the code for that mailer is and how to re-deploy a new version of it.

Please if I have any details wrong, do add a comment and correct me.
Comment 14•12 years ago
(In reply to Clint Talbert ( :ctalbert ) from comment #13)
> So here is my armchair quarterback summary of what I think needs to happen
> here:
> 1. Write buildbot step to send this data from the buildslave to the
> graphserver system.

we have similar code in place to submit leak info / codesighs already

> 2. Modify the graphserver to accept this value from the builders (jmaher
> and/or rhelmer - can you file the database modification bug needed for this?)
> 3. Ensure the networking flows are in place between the builders and the
> graphserver systems (I can file this but I need the vlan numbers for all the
> releng builders - I'm sort of assuming they are separate from the test
> slaves - if they are on the same vlan as the slaves then maybe we don't need
> this).

no need - we already submit info to graph server from the build machines

> 4. Update the script for the dev.treemanagement auto-emailer to send this
> new data. Catlee, can you take this? You're the only person I know of that
> knows where the code for that mailer is and how to re-deploy a new version
> of it.

it will automatically get picked up
Comment 15•12 years ago
it looks like we just need to solve 2 things:
1. Write buildbot step to send this data from the buildslave to the graphserver system.
2. Modify the graphserver to accept this value from the builders (jmaher and/or rhelmer - can you file the database modification bug needed for this?)

I can work on the graph server database mods. What is the name we want to use for this test? 'libxul_link'?
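A buildbot step along these lines would read the measured value and hand it to the existing graph-server submission machinery. The sketch below is purely illustrative: 'libxul_link' is the test name proposed in this comment, the 'testresults' key mirrors the shape visible in the patch reviewed later in this bug, and the function name and tuple layout are assumptions.

```python
def linker_vsize_results(vsize_bytes):
    # Hypothetical packaging of the linker peak vsize for submission to
    # the graph server; only the test name and the 'testresults' key come
    # from this bug, the rest is an illustrative assumption.
    return {"testresults": [("libxul_link", "firefox", vsize_bytes)]}

print(linker_vsize_results(3938476032)["testresults"][0][0])  # libxul_link
```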
Comment 16•12 years ago
(In reply to Joel Maher (:jmaher) from comment #15)
> I can work on the graph server database mods. What is the name we want to
> use for this test? 'libxul_link'?

Thanks joel, that works for me. Do you also have to add all the machine names for the builders or are the machine names in the graphserver db populated at run-time?
Comment 17•12 years ago
For whatever reason, build metrics like this use a generic platform name as the machine name, e.g. http://hg.mozilla.org/graphs/file/2018284ed6e7/sql/data.sql#l1171 so you can use those names (with "_leak_test" == debug), and just add the new test to the database.
Comment 18•12 years ago
Attachment #622059 - Flags: review?(rhelmer)
Updated•12 years ago
Attachment #622059 - Flags: review?(rhelmer) → review+
Comment 19•12 years ago
landed graph server definition: http://hg.mozilla.org/graphs/rev/6bc547cd2202
Comment 20•12 years ago
(In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
>
> (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)

2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
Comment 21•12 years ago
So, we're at 3.4 gigs. Any chance we can get those graphs before The Next Big Surprise? :-)
Assignee
Comment 22•12 years ago
Could this be fixed by tbpl turning orange after a certain threshold? Could a tool be written to grab the information and create a graph?
Comment 23•12 years ago
Well TBPL already turns red after a certain threshold. The point is to see what is causing the increases, not when we pass some arbitrary threshold.
Comment 24•12 years ago
The easiest way, I think, is to submit this data to the graph server and graph it there. The point here is to know that the problem is approaching before it hits.
Assignee
Comment 25•12 years ago
Oh! It seems we only do TinderboxPrint!
Anyone know what is required to post the data to the graph server?
I can see that this was added:
insert into tests values (NULL,"libxul_link","LibXUL Memory during link",0,1,NULL);
and IT ran it on the DB.
Comment 26•12 years ago
Armen, look for usage of GraphServerPost in buildbotcustom/process/factory.py.
Assignee
Updated•12 years ago
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: catlee
Comment 27•12 years ago
(In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> >
> > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
>
> 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)

2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)

500MB more in 3 months! We need to fix this sooner rather than later (and ideally import all the old nightly figures from the logs, so we can more easily see what has bumped it up so much). At current rate of increase we have only a couple of months before we hit this again.
Keywords: sheriffing-P1
Comment 28•12 years ago
Sorry, s/2012-01-07/2013-01-07/
Comment 29•12 years ago
(In reply to comment #27)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> > (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > >
> > > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> >
> > 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
>
> 2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)
>
> 500MB more in 3 months!
>
> We need to fix this sooner rather than later (and ideally import all the old
> nightly figures from the logs, so we can more easily see what has bumped it up
> so much).
>
> At current rate of increase we have only a couple of months before we hit this
> again.

So what parts of the WebRTC code ended up going into libxul?
Updated•12 years ago
Flags: needinfo?(rjesup)
Comment 30•12 years ago
media/webrtc went in (/signaling had to be in, and it references many things in /trunk). We could in the future (especially as the code gets more locked down) probably move trunk to gkmedia, and deal with adding a lot of symbols to symbols.def.in (or finding a better way to deal with that!) The decision IIRC was to allow those two to come in, but leave the rest in gkmedia. Signaling landed in m-c at the FF 18 uplift, around Oct 6th. Off the top of my head, I can't remember if media/webrtc/trunk was in gkmedia before that (I think it was), so both moved to libxul around then.
Flags: needinfo?(rjesup)
Comment 31•12 years ago
OK, I filed bug 827985 to move that code out of libxul. Thanks for the clarification. Ed, this needs to be treated with utmost priority. Who should work on the graphing thing?
Comment 32•12 years ago
FYI, per above, on 9/28 we were at 3.2G, on 11/2 we were at 3.4G (after signaling landed), now we're at 3.7G. So the 500MB since 11/2 is NOT webrtc. And I think (thinking back) that webrtc/trunk was in xul before 10/6; if so we took a (guess) 50-100MB hit for signaling, and likely webrtc has contributed little since then.

Can you run a PGO --disable-webrtc build and report the number? I can't build on windows currently - thanks Microsoft!! --disable-webrtc will be a significant over-estimate of what you'll get back (as signaling won't get compiled).
Comment 33•12 years ago
(In reply to comment #32)
> FYI, per above on 9/28 were were 3.2G, 11/2 we were at 3.4G (after signaling
> landed), now we're at 3.7G. So the 500MB since 11/2 is NOT webrtc. And I
> think (thinking back) that webrtc/trunk was in xul before 10/6; if so we took a
> (guess) 50-100MB hit for signaling, and likely webrtc has contributed little
> since then.

It doesn't matter how much we're going to win from this, we need to move all of the code that we can outside of libxul, and the WebRTC stuff is just part of it.

> Can you run a PGO --disable-webrtc build and report the number? I can't build
> on windows currently - thanks Microsoft!! --disable-webrtc will be an
> significant over-estimate of what you'll get back (as signaling won't get
> compiled).

I only have VS2012, so the numbers that I get will not be representative (I actually don't know how to run the linker vmem usage measurement script locally.) That being said, you can push to try to get the numbers, but like I said it doesn't matter much, we need *all* of the wins that we can get.
Assignee
Comment 35•12 years ago
I know that the following file contains the value that we need to post:
obj-firefox\toolkit\library\linker-vsize

We could add code in here:
http://hg.mozilla.org/mozilla-central/file/0faa1d47ea80/build/link.py#l19
and post to the graph server. Or we can add a new post compilation step on buildbot to post it. I will look in the releng side since we already have some GraphServer logic.
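Reading that file from a post-build step could be as simple as the sketch below; it assumes the file holds the peak vsize in bytes as a single plain-text integer, which is an assumption about what build/link.py writes.

```python
import os

def read_linker_vsize(objdir):
    # linker-vsize is assumed to contain a single integer: the peak
    # virtual memory of the libxul link, in bytes.
    path = os.path.join(objdir, "toolkit", "library", "linker-vsize")
    with open(path) as f:
        return int(f.read().strip())
```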
Reporter
Comment 36•12 years ago
Right, the file contains the info, and it's also output to stdout in the build step in a line starting with "TinderboxPrint: linker max vsize:". You should use whichever is easier.
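Scraping the stdout route amounts to matching that line in the build log. A sketch, assuming the number after the colon is a plain byte count (the real line may carry units such as MB):

```python
import re

def parse_linker_vsize(log_text):
    # Find the TinderboxPrint line ted mentions and pull out the number
    # after the colon; returns None when the line is absent.
    m = re.search(r"TinderboxPrint: linker max vsize:\s*(\d+)", log_text)
    return int(m.group(1)) if m else None
```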
Assignee
Comment 37•12 years ago
Attachment #702314 - Flags: review?(coop)
Updated•12 years ago
Attachment #702314 - Flags: review?(coop) → review+
Assignee
Updated•12 years ago
Attachment #702314 - Flags: checked-in+
Assignee
Comment 38•12 years ago
Attachment #703578 - Flags: review?(bhearsum)
Assignee
Comment 39•12 years ago
I have my first data point on staging:
http://graphs.allizom.org/graph.html#tests=[[205,6,8]]

At what point is this going to blow up? What is the upper limit?

With regards to producing a graph I would suggest to book a project branch and request releng to add PGO jobs to it. At that point I would suggest creating a list of changesets that we want data points for and trigger PGO builds.

Once bhearsum reviews this and we land it we will start having the data points.

(In reply to Ed Morley (Away 18th-20th Jan) [:edmorley UTC+0] from comment #27)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #20)
> > (In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> > > 2012-02-27: 2785.78 MB / 2921103360 bytes (inbound rev 0714ec049da2)
> > > (Down 116 MB from 4 weeks ago, due to the MSVC 2010 switch)
> > 2012-09-28: 3,217.13 MB / 3373408256 bytes (inbound rev 938e09d5a465)
> 2012-01-07: 3,701.80 MB / 3881619456 bytes (m-c rev 795632f0e4fe)
> 500MB more in 3 months!

2012-01-18: 3756.02 MB / 3938476032 bytes (m-c rev b52c02f77cf5)
Whiteboard: [graphserver][pgo] → [graphserver][pgo] 2012-01-18 --> linker max vsize:3756.02MB
Reporter
Comment 40•12 years ago
The blowup point is somewhere near 4GB (the total amount of virtual memory available to a 32-bit process running on Windows x64), but we don't know exactly where. Essentially once the linker tries to allocate more virtual memory and it runs out it will blow up, but that last allocation could be fairly large, it's hard to tell.
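Putting ted's estimate in numbers: against a 4 GB address space, the 3938476032-byte reading from comment 39 leaves only a few hundred MB of headroom. Simple arithmetic, not a model of the linker's allocation pattern:

```python
ADDRESS_SPACE = 4 * 1024 ** 3   # 4 GB for a 32-bit process on Windows x64
vsize_bytes = 3938476032        # reading reported in comment 39

headroom_mb = (ADDRESS_SPACE - vsize_bytes) // 1024 ** 2
print(headroom_mb)  # 339 MB left, at most, before the linker runs out
```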
Comment 41•12 years ago
(In reply to comment #39)
> With regards to producing a graph I would suggest to book a project branch and
> request releng to add PGO jobs to it. At that point I would suggest creating a
> list of changesets that we want data points for and trigger PGO builds.

Who would own updating the branch though? Why can't we just get the graphs for inbound and central?
Assignee
Comment 42•12 years ago
(In reply to :Ehsan Akhgari from comment #41)
> (In reply to comment #39)
> > With regards to producing a graph I would suggest to book a project branch and
> > request releng to add PGO jobs to it. At that point I would suggest creating a
> > list of changesets that we want data points for and trigger PGO builds.
>
> Who would own updating the branch though? Why can't we just get the graphs
> for inbound and central?

I thought you guys mentioned that you wanted to get some history to see what things increased the memory usage. To build up history we would need to setup a project branch and trigger old changesets.

As soon as we land the patch we will get coverage on all branches that have pgo enabled from there on.
Comment 43•12 years ago
(In reply to comment #42)
> (In reply to :Ehsan Akhgari from comment #41)
> > Who would own updating the branch though? Why can't we just get the graphs
> > for inbound and central?
>
> I thought you guys mentioned that you wanted to get some history to see what
> things increased the memory usage. To build up history we would need to setup a
> project branch and trigger old changesets.

I don't see why. We do have the data in the old logs, right? We should just be able to write a script to parse them out or something. Am I missing something?
Assignee
Comment 44•12 years ago
(In reply to :Ehsan Akhgari from comment #43)
> (In reply to comment #42)
> > I thought you guys mentioned that you wanted to get some history to see what
> > things increased the memory usage. To build up history we would need to setup a
> > project branch and trigger old changesets.
>
> I don't see why. We do have the data in the old logs, right? We should
> just be able to write a script to parse them out or something. Am I missing
> something?

Good point! That would save lots of time.
Comment 45•11 years ago
Comment on attachment 703578 [details] [diff] [review]
post vsize

Review of attachment 703578 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +1344,5 @@
>             data=WithProperties('TinderboxPrint: num_ctors: %(num_ctors:-unknown)s'),
>         ))
>
> +    def addPostBuildSteps(self):
> +        if self.profiledBuild and self.platform in ('win32',) and self.baseName:

Please add an explicit flag to MercurialBuildFactory/config.py for this rather than guessing based on 3 different things.

@@ +1356,5 @@
> +            return {'testresults': []}
> +
> +        self.addStep(SetProperty(
> +            name='get_linker_vsize',
> +            command=['cat', '%s\\toolkit\\library\\linker-vsize' % self.mozillaObjdir],

Why the '\\'? All of the other steps use / without issue.
Attachment #703578 - Flags: review?(bhearsum) → review-
Assignee
Comment 46•11 years ago
closed trees: bug 832992 :(
Comment 47•11 years ago
(In reply to Armen Zambrano G. [:armenzg] from comment #46)
> closed trees: bug 832992 :(

well, in the end that may be just a disk space issue, though we are slowly getting near the limit (now at 3939495936).
Whiteboard: [graphserver][pgo] 2012-01-18 --> linker max vsize:3756.02MB → [graphserver][pgo] 2012-01-18 --> linker max vsize:3757MB
Updated•11 years ago
Whiteboard: [graphserver][pgo] 2012-01-18 --> linker max vsize:3757MB → [graphserver][pgo] 2012-01-21 --> linker max vsize:3757MB
Comment 49•11 years ago
May this be added to the releng Q1 goals please? We keep hitting the problem, and while there isn't a clear solution to it, this is the only way we have to track its evolution.
Comment 50•11 years ago
This bug is in progress; adding it to the goals list won't make it happen any faster. Armen had a working implementation, and it just needs small tweaks before it can be landed.
Assignee
Comment 51•11 years ago
dump_masters shows that this gets added for every PGO and WINNT nightly build
Attachment #703578 - Attachment is obsolete: true
Attachment #704904 - Flags: review?(bhearsum)
Assignee
Comment 52•11 years ago
Attachment #704905 - Flags: review?(bhearsum)
Comment 53•11 years ago
For historic values see:
https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/

Nathan, don't suppose you could attach the raw values, so we can backfill the gap on graphs.m.o?
Comment 54•11 years ago
Sure, Ed, no problem. Here's the file I used; the format is:
<build-date> <hg-revision> <linker-vsize>
The data doesn't perfectly capture the hg revision for every log file, but the number of points that it missed was small enough that I wasn't going to worry about it.
Comment 55•11 years ago
...and for reference, here's the script I used to generate the previous file. The script expects the names of the log files to start with:
YYYY-MM-DD-HH-MM-SS
for the timestamp portion, but that's probably not hard to change. Simply invoke:
extract-info <list-of-log-files>
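The extract-info script itself is attached rather than inlined, so the following is only a rough stand-in for the behavior described: take the date from the filename's YYYY-MM-DD-HH-MM-SS prefix and scrape the revision and vsize from the log body. Both regular expressions are assumptions about what the build logs contain.

```python
import re

def extract_info(filename, log_text):
    # Emit "<build-date> <hg-revision> <linker-vsize>", mirroring the
    # format described in comment 54; '-' stands in for a missing rev.
    build_date = filename[:10]  # the YYYY-MM-DD portion of the prefix
    rev = re.search(r"\b[0-9a-f]{12}\b", log_text)
    vsize = re.search(r"linker max vsize:\s*(\d+)", log_text)
    if not vsize:
        return None
    return "%s %s %s" % (build_date, rev.group(0) if rev else "-", vsize.group(1))
```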
Assignee
Comment 56•11 years ago
Comment on attachment 704904 [details] [diff] [review]
post vsize

Through IRC.
Attachment #704904 - Flags: review?(bhearsum) → review-
Assignee
Updated•11 years ago
Attachment #704905 - Flags: review?(bhearsum) → review-
Comment 57•11 years ago
Comment on attachment 704904 [details] [diff] [review]
post vsize

Sorry, Armen and I talked on IRC about this awhile ago but I forgot to update the bug:
13:11 < bhearsum> armenzg: i meant that we should have a flag for 'post_linker_size' or something, not 'do_post_build_steps'
13:11 < bhearsum> i want this line gone:
13:11 < bhearsum> if self.profiledBuild and self.platform in ('win32',) and self.baseName:
13:11 < bhearsum> because it guesses about what should happen
13:11 < bhearsum> that can be replaced with if self.postLinkerSize
13:12 < armenzg> k
Assignee
Comment 58•11 years ago
Attachment #705069 - Flags: review?(bhearsum)
Assignee
Comment 59•11 years ago
Attachment #704905 - Attachment is obsolete: true
Attachment #705072 - Flags: review?(bhearsum)
Assignee
Updated•11 years ago
Attachment #705069 - Attachment description: do post build steps config changes → [buildbotcustom] do post vsize
Assignee
Updated•11 years ago
Attachment #704904 - Attachment is obsolete: true
Updated•11 years ago
Attachment #705069 - Flags: review?(bhearsum) → review+
Updated•11 years ago
Attachment #705072 - Flags: review?(bhearsum) → review+
Assignee
Updated•11 years ago
Attachment #705069 - Flags: checked-in+
Assignee
Updated•11 years ago
Attachment #705072 - Flags: checked-in+
Comment 60•11 years ago
in production
Comment 61•11 years ago
(In reply to comment #60)
> in production

Where can the graphs be found?
Comment 62•11 years ago
Comment on attachment 705072 [details] [diff] [review]
do post build steps config changes

Reverted this for bustage in bug 833653.
default: http://hg.mozilla.org/build/buildbot-configs/rev/df9a319c5edd
production: http://hg.mozilla.org/build/buildbot-configs/rev/0dcbc3ce69f9
Attachment #705072 - Flags: checked-in+ → checked-in-
Assignee
Comment 63•11 years ago
I don't know at which point we lost this line from the patch. It supplies the sourcestamp info that has been missing.
Attachment #705425 - Flags: review?(bhearsum)
Updated•11 years ago
Attachment #705425 - Flags: review?(bhearsum) → review+
Assignee
Comment 64•11 years ago
I will land and reconfig in the morning.
Comment 65•11 years ago
(In reply to comment #64)
> I will land and reconfig in the morning.

Thanks Armen!

Can you please also let me know how much historical data we can get out of this on each of central and inbound? It would be absolutely amazing if we can get per-checkin data for the interesting ranges highlighted in <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
Comment 66•11 years ago
(In reply to :Ehsan Akhgari from comment #65)
> Can you please also let me know how much historical data we can get out of
> this on each of central and inbound? It would be absolutely amazing if we
> can get per-checkin data for the interesting ranges highlighted in
> <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.

Per-push logs are only kept for 30 days, so try runs will be required.
Assignee
Comment 67•11 years ago
(In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> (In reply to :Ehsan Akhgari from comment #65)
> > Can you please also let me know how much historical data we can get out of
> > this on each of central and inbound? It would be absolutely amazing if we
> > can get per-checkin data for the interesting ranges highlighted in
> > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
>
> Per push logs are only kept for 30days, so try runs will be required.

Would a project branch be more interesting for this project?

(In reply to :Ehsan Akhgari from comment #65)
> (In reply to comment #64)
> > I will land and reconfig in the morning.
>
> Thanks Armen!
>
> Can you please also let me know how much historical data we can get out of
> this on each of central and inbound?

What edmorley says is correct. I think selfserve would be needed for this:
Comment 68•11 years ago
(In reply to comment #66)
> (In reply to :Ehsan Akhgari from comment #65)
> > Can you please also let me know how much historical data we can get out of
> > this on each of central and inbound? It would be absolutely amazing if we
> > can get per-checkin data for the interesting ranges highlighted in
> > <https://blog.mozilla.org/nfroyd/2013/01/22/analyzing-linker-max-vsize/>.
>
> Per push logs are only kept for 30days, so try runs will be required.

OK, so I guess we can explore that path when we need to. Thanks!
Comment 69•11 years ago
(In reply to comment #67)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #66)
> > Per push logs are only kept for 30days, so try runs will be required.
>
> Would a project branch be more interesting for this project?

Not sure how that would help?
Assignee
Comment 70•11 years ago
(In reply to :Ehsan Akhgari from comment #69)
> (In reply to comment #67)
> > Would a project branch be more interesting for this project?
>
> Not sure how that would help?

Turn around is faster. Other pgo jobs (not related to this data gathering) could be posting numbers in the try graph and would pollute it, even though I can't see any PGO that was triggered by a dev today. On the other hand, customizing a branch to only do Windows PGO without test jobs could add some overhead to setup.

I think either way is fine. Assuming that it works for the try server, I will try one build after I reconfig in the morning.
Comment 71•11 years ago
(In reply to comment #70)
> (In reply to :Ehsan Akhgari from comment #69)
> > (In reply to comment #67)
> > > Would a project branch be more interesting for this project?
> >
> > Not sure how that would help?
>
> Turn around is faster.
> Other pgo jobs (not related to this data gathering) could be posting numbers in
> the try graph and would polluting the graph. Even though I can't see any PGO
> that was triggered by a dev today.

Hmm, good point. But that also takes away the ability of just pushing new heads and get builds on them, right? I mean, we would need to push new heads in the right order, right?

> On the other hand, customizing a branch to only do Windows PGO without test
> jobs could add some overhead to setup.

Ouch.

> I think either way is fine.
> Assuming that it works for the try server. I will try one build after I
> reconfig in the morning.

Thanks, that's a good idea regardless.
Assignee
Updated•11 years ago
Attachment #705425 - Flags: checked-in+
Assignee
Updated•11 years ago
Attachment #705072 - Flags: checked-in- → checked-in+
Assignee
Comment 72•11 years ago
The good news is that this is live. The original purpose of the bug is fulfilled (as I understand it).

The bad news is that there is no way to trigger PGO builds on try. Developers change the mozconfig but that does not trigger the PGO/try builders. Pushing to try as PGO would print the linker size but it won't post to the try server (as before).

Booking a project branch with PGO would give the ability to push changesets in a chronological order, but care would be needed to prevent coalescing from happening (perhaps this can be configured on our side - I don't know if it is easy).

Is it good enough to have data points on the graphs DB from here on? IIUC there are means to gather historical data by pushing to try and scraping the linker size.
Comment 73•11 years ago
(In reply to comment #72)
> The good news is that this is live.
> The original purpose of the bug is fulfilled (as I understand it).
>
> Is it good enough to have data points on the graphs DB from here on?

It's good but definitely not enough. So the first step is to parse through the PGO logs for the past 30 days, and also nightly logs for as long as we have them stored, and report them to the graph server associated with the correct date and changeset. Then, I guess we'll need to fill in the gaps for the individual changesets in the spikes that we've seen in Nathan's analysis. That would help us experiment with the possibility of finding culprit changesets which have added the most to the linker memory usage and see how we can deal with that.

That all being said, gathering detailed historical data only matters if we decide to keep PGO enabled and try to keep the linker memory usage bounded, which is a call that we have not made yet. We need more of the dependencies of bug 833881 to be resolved before we can make a meaningful decision on that. If we do decide to keep PGO enabled, I'll file another bug in the RelEng component to gather more historical data.

Last but not least, thanks everyone for your help here, really appreciated! :-)
Assignee
Comment 74•11 years ago
I thought my reconfig this morning would have done the trick, but it seems that when a change is backed out from both branches ("production" and "default"), the usual land-to-default-and-merge-to-production flow misses the change [1]. I've seen this happen a couple of times in the past.

I landed it again (on production) and reconfigured the build masters again:
http://hg.mozilla.org/build/buildbot-configs/rev/92846acd0ba5

I re-triggered a second pgo in here that should be successful:
https://tbpl.mozilla.org/?jobname=WINNT%205.2%20mozilla-central%20pgo-build&rev=680e46fecff0

[1] http://hg.mozilla.org/build/buildbot-configs/graph
Comment 75•11 years ago
(In reply to Armen Zambrano G. [:armenzg] from comment #72)
> Booking a project branch with PGO would give the ability to push changesets
> in chronological order but care would be needed to prevent coalescing
> from happening (perhaps this can be configured on our side to be prevented
> - I don't know if it is easy).

For a project branch you could probably just use self-serve to force pgo builds on a revision, so no need to push. That assumes disabling merging is easily done to speed the process up. Whether the history is easily transferable to the m-c branch in the graph server is another question.
Comment 76•11 years ago
(In reply to comment #75)
> (In reply to Armen Zambrano G. [:armenzg] from comment #72)
> > Booking a project branch with PGO would give the ability to push
> > changesets in chronological order but care would be needed to prevent
> > coalescing from happening (perhaps this can be configured on our side to
> > be prevented - I don't know if it is easy).
>
> For a project branch you could probably just use self-serve to force pgo
> builds on a revision, so no need to push. That assumes disabling merging is
> easily done to speed the process up. Whether the history is easily
> transferable to the m-c branch in the graph server is another question.

Hmm, I'm not quite sure what that exactly means...
Assignee
Comment 77•11 years ago
(In reply to :Ehsan Akhgari from comment #76)
> (In reply to comment #75)
> > For a project branch you could probably just use self-serve to force pgo
> > builds on a revision, so no need to push. That assumes disabling merging
> > is easily done to speed the process up. Whether the history is easily
> > transferable to the m-c branch in the graph server is another question.
>
> Hmm, I'm not quite sure what that exactly means...

I think what nthomas is suggesting is to trigger PGO builds on a project branch, which would add data points for that branch on the graph server. We could then ask a DBA to transfer the data points to the mozilla-central records.

One note is that tbpl might not show any jobs since the changesets are from the past.
Assignee
Comment 78•11 years ago
We are now getting data points:
http://graphs.mozilla.org/graph.html#tests=[[205,63,8]]&sel=none&displayrange=7&datatype=running

mozilla-central will show data points in the next few hours.

I missed inserting the machine name into the graphs DB. This has now been fixed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 79•11 years ago
(In reply to comment #77)
> I think what nthomas is suggesting is to trigger PGO builds on a project
> branch, which would add data points for that branch on the graph server.
> We could then ask a DBA to transfer the data points to the mozilla-central
> records.
> One note is that tbpl might not show any jobs since the changesets are
> from the past.

OK, like I said we might need this in the near future. In the case that we do, I'll file another bug and let you guys do what's needed. Thanks! :-)
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•6 years ago
Component: General Automation → General