Closed Bug 433710 Opened 17 years ago Closed 16 years ago

moz2/3.next leak test + codesighs need to report results to graphs.m.o

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: catlee)

References

Details

(Whiteboard: Leak numbers still not linkified on tinderbox waterfall)

Attachments

(8 files, 5 obsolete files)

1020 bytes, patch
laura
: review+
Details | Diff | Splinter Review
419 bytes, patch
laura
: review+
Details | Diff | Splinter Review
5.29 KB, patch
bhearsum
: review+
Details | Diff | Splinter Review
1.59 KB, patch
anodelman
: review+
Details | Diff | Splinter Review
3.25 KB, patch
Details | Diff | Splinter Review
1.51 KB, patch
bhearsum
: review+
anodelman
: review-
Details | Diff | Splinter Review
5.57 KB, patch
bhearsum
: review+
Details | Diff | Splinter Review
12.21 KB, patch
bhearsum
: review+
catlee
: checked-in+
Details | Diff | Splinter Review
For Firefox 3, leaktest machines report their results back to build-graphs. There's been discussion about mothballing that machine so I'm leaning towards pushing data to graphs.m.o. but to be honest I'm not picky either way.
+1 for using graphs.m.o, as thats where new development work is happening.
Priority: -- → P3
updating summary.
Summary: mozilla-central leak test machines need to report results to a graph server → moz2/3.next leak test + codesighs need to report results to graphs.m.o
Component: Release Engineering → Release Engineering: Future
QA Contact: build → release
This is extremely bad... we have no way of gauging whether there is a change in these numbers without having graphs. Moving to RE and raising priority.
Severity: normal → major
Component: Release Engineering: Future → Release Engineering
Priority: P3 → P2
P2 is reserved for bugs that are currently being worked on.
Priority: P2 → P3
How do I request that the priority be raised on this then?
Ask?
Ok, I'd like to request that the priority be raised on this bug, as it's currently impossible to watch these numbers over a long period of time to see if a patch caused a problem with those numbers.
Let's just get this pointed at whatever graph server is currently working.
Assignee: nobody → nthomas
This would be graphs.m.o. Triaged to future until other higher priority stuff is completed.
(In reply to comment #10) > This would be graphs.m.o. Triaged to future until other higher priority stuff > is completed. No, it would be build-graphs.mozilla.org. Afaik, graphs.mozilla.org doesn't support the format that tinderbox uses yet (please correct me if I'm wrong).
So, um, I'd like to say that we care about leak regressions, even if the stuff we're currently testing for is only a small subset of what we ought to be. In fact, I'm sort of curious how the Lk number on Windows ended up being as high as it is now. And without this, I don't have a usable way to find out.
Flags: blocking1.9.1?
Agree with dbaron; we need to be able to keep an eye on this information in order to have the confidence to ship 3.1
Flags: blocking1.9.1? → blocking1.9.1+
What needs to be done for this data to be sent to graphs.m.o, instead of build-graphs.m.o?
Both graphs.m.o and build-graphs.m.o use a simple http post method to add data. The difference being in the format/keywords used. If we can push data to build-graphs.m.o we should be able to push to graphs.m.o with some alterations to the link structure of the post command.
reed: can you attach a recent build-graphs.m.o post command for leaktest+codesighs? alice: can you attach a sample graph.m.o post command, so we can figure out how easy/hard this would be to do?
TinderboxPrint:<a title="refcnt Leaks" href="http://build-graphs.mozilla.org/graph/query.cgi?testname=refcnt_leaks&units=bytes&tbox=fxdbug-linux-tbox.build.mozilla.org&autoscale=1&days=7&avg=1&showpoint=2008:08:05:12:43:38,0">RLk:0B</a> send_results_to_server(): tmpurl = http://build-graphs.mozilla.org/graph/collect.cgi?value=0&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=refcnt_leaks&tbox=fxdbug-linux-tbox.build.mozilla.org Results submitted to server: http://build-graphs.mozilla.org/graph/collect.cgi?value=0&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=refcnt_leaks&tbox=fxdbug-linux-tbox.build.mozilla.org send_results_to_server() succeeded. TinderboxPrint:<a title="Leaks: total bytes 'malloc'ed and not 'free'd" href="http://build-graphs.mozilla.org/graph/query.cgi?testname=trace_malloc_leaks&units=bytes&tbox=fxdbug-linux-tbox.build.mozilla.org&autoscale=1&days=7&avg=1&showpoint=2008:08:05:12:44:23,971984">Lk:949KB</a> TinderboxPrint:<a title="Maximum Heap: max (bytes 'malloc'ed - bytes 'free'd) over run" href="http://build-graphs.mozilla.org/graph/query.cgi?testname=trace_malloc_maxheap&units=bytes&tbox=fxdbug-linux-tbox.build.mozilla.org&autoscale=1&days=7&avg=1&showpoint=2008:08:05:12:44:23,20952158">MH:20.0MB</a> TinderboxPrint:<a title="Allocations: number of calls to 'malloc' and friends" href="http://build-graphs.mozilla.org/graph/query.cgi?testname=trace_malloc_allocs&units=count&tbox=fxdbug-linux-tbox.build.mozilla.org&autoscale=1&days=7&avg=1&showpoint=2008:08:05:12:44:23,571227">A:557K</a> send_results_to_server(): tmpurl = http://build-graphs.mozilla.org/graph/collect.cgi?value=971984&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=trace_malloc_leaks&tbox=fxdbug-linux-tbox.build.mozilla.org Results submitted to server: http://build-graphs.mozilla.org/graph/collect.cgi?value=971984&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=trace_malloc_leaks&tbox=fxdbug-linux-tbox.build.mozilla.org send_results_to_server() succeeded. send_results_to_server(): tmpurl = http://build-graphs.mozilla.org/graph/collect.cgi?value=20952158&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=trace_malloc_maxheap&tbox=fxdbug-linux-tbox.build.mozilla.org Results submitted to server: http://build-graphs.mozilla.org/graph/collect.cgi?value=20952158&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=trace_malloc_maxheap&tbox=fxdbug-linux-tbox.build.mozilla.org send_results_to_server() succeeded. send_results_to_server(): tmpurl = http://build-graphs.mozilla.org/graph/collect.cgi?value=571227&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=trace_malloc_allocs&tbox=fxdbug-linux-tbox.build.mozilla.org Results submitted to server: http://build-graphs.mozilla.org/graph/collect.cgi?value=571227&data=MOZ_CO_DATE=2008:08:05:12:38:00 --&testname=trace_malloc_allocs&tbox=fxdbug-linux-tbox.build.mozilla.org send_results_to_server() succeeded.
Over to Alice.
Assignee: nthomas → anodelman
We'll need to push data in with the collect.cgi script, which ended up somewhat mangled when I fixed up the link format. This is just a minor bustage fix - patches for buildbot to come.
Attachment #332767 - Flags: review?(morgamic)
Attachment #332767 - Flags: review?(morgamic) → review?(laura)
Comment on attachment 332767 [details] [diff] [review] [Checked in]repair broken collect.cgi graph server script Looks fine to me.
Attachment #332767 - Flags: review?(laura) → review+
Comment on attachment 332767 [details] [diff] [review] [Checked in]repair broken collect.cgi graph server script changeset: 98:81a60b46f0b7
Attachment #332767 - Attachment description: repair broken collect.cgi graph server script → [Checked in]repair broken collect.cgi graph server script
Found another problem with the collect.cgi script - getting some wonky results when we don't provide a reasonable default to the date variable.
Attachment #333313 - Flags: review?(laura)
This is the new class to be able to send info to the graph server. Assumptions: - build property is set called 'testresults' which is a list of lists, each individual list being of the form (testshortname, testlongname, value) (ie, ('Z', 'codesighs', '1.03') - start time of the build is used to timestamp the results on the graph server - format of the link to the graph server is always http://server/selector/collect.cgi (ie, http://graphs-stage.mozilla.org/server/collect.cgi) This could really use more (or any) error handling. On the plus side, it is up and working on staging so I have pretty high confidence that it will work in production.
Attachment #333319 - Flags: review?(bhearsum)
Comment on attachment 333319 [details] [diff] [review] [checked in] add GraphServerPost custom buildstep >Index: test.py >=================================================================== >RCS file: /cvsroot/mozilla/tools/buildbotcustom/steps/test.py,v >retrieving revision 1.4 >diff -u -8 -p -r1.4 test.py >--- test.py 30 May 2008 14:53:47 -0000 1.4 >+++ test.py 11 Aug 2008 23:45:58 -0000 >@@ -17,25 +17,30 @@ > # Mozilla Corporation. > # Portions created by the Initial Developer are Copyright (C) 2007 > # the Initial Developer. All Rights Reserved. > # > # Contributor(s): > # Ben Hearsum <bhearsum@mozilla.com> > # Rob Campbell <rcampbell@mozilla.com> > # Chris Cooper <coop@mozilla.com> >+# Alice Nodelman <anodelman@mozilla.com> > # ***** END LICENSE BLOCK ***** > > from buildbot.steps.shell import ShellCommand >-from buildbot.status.builder import FAILURE >+from buildbot.status.builder import FAILURE, SUCCESS >+from buildbot.process.buildstep import BuildStep > >+import urllib #for conntacting the graph server It'd be better to use Twisted's HTTP library here. urllib is blocking and can hang the BuildMaster while running. This would hurt a lot if DNS or the graph server went down. There's an example of how to use it further down. >+ >+ self.build.setProperty('testresults', [("RLk", "refcnt_leaks", formatBytes(leaks,3))]) >+ This can just be self.setProperty() >+ for res in self.testresults: >+ testname, testlongname, testval = res >+ params = urllib.urlencode({'branchid': self.buildid, 'value': testval.strip(string.letters), 'testname': testlongname, 'tbox' : self.resultsname, 'type' : "continuous", 'time' : self.timestamp, 'branch' : self.branch}) >+ request = urllib.urlopen(self.graphurl, params) Here's how you could use getPage(): from twisted.web.client import getPage d = getPage(self.graphurl, postdata=params) d.addCallback(self._success) d.addErrback(self._failure) def _success(self, content): # if we hit this, the page was properly. we can grep the log, tinderboxprint, etc, from here lines = content.split('\n') ... ... self.finished(SUCCESS) def _failure(self, junk): # if we hit this, we failed self.finished(FAILURE) Does that make sense? I've gotta r- because of urlopen - with all of the builders there is on the master I suspect this will make it unresponsive a lot.
Attachment #333319 - Flags: review?(bhearsum) → review-
Comment on attachment 333323 [details] [diff] [review] update buildbot mozilla2 and mozilla2-staging to use GraphServerPost I'm sorry, this is bitrotted because of bug 429670 - I'll fix it up for you.
Attachment #333323 - Flags: review?(bhearsum)
Attachment #333323 - Attachment is obsolete: true
Attachment #333751 - Flags: review?
I changed the branch to '1.9.1' because that's what mozilla-central currently is. We should really drop references to "moz2" :(
Attachment #333752 - Flags: review?(anodelman)
Attachment #333751 - Flags: review? → review?(anodelman)
I'm willing to do the rewrite, but I have a couple of questions/notes first: - is it worth worrying about a blocking possibility that we haven't seen? I don't want to add an unreasonable amount of complexity (ie, threads) for what may not ever be a problem - we are sending very small bits of information to the graph server (just time/value pairs) and these shouldn't ever block minus the situation where the graph server is down and unreachable - and at that point we have bigger troubles - the callback mechanism suggested would add information to the log possibly after the other steps have completed - would we end up blocking on this anyway waiting for sending the data to complete before closing the log and sending it off to the waterfall? Or does the waterfall get updated later? And if so, does that get scraped properly? I think that a simpler solution would be to put a small timeout on urllib open (which is how talos runs) to guarantee that you'll quit after a minute. Voila! No threading and you don't end up blocked indefinitely.
(In reply to comment #29) > I'm willing to do the rewrite, but I have a couple of questions/notes first: > > - is it worth worrying about a blocking possibility that we haven't seen? I > don't want to add an unreasonable amount of complexity (ie, threads) for what > may not ever be a problem > How long does a graph server post take? For every set of builds on mozilla-central we do at least 6 posts. We do 2 for each set of builds on actionmonkey and tracemonkey, too. > - we are sending very small bits of information to the graph server (just > time/value pairs) and these shouldn't ever block minus the situation where the > graph server is down and unreachable - and at that point we have bigger > troubles > I totally agree that we have bigger problems at that point, but I don't think we should lose the ability to see our Waterfall because the graph server goes down (however unlikely). > - the callback mechanism suggested would add information to the log possibly > after the other steps have completed - would we end up blocking on this anyway > waiting for sending the data to complete before closing the log and sending it > off to the waterfall? Or does the waterfall get updated later? And if so, > does that get scraped properly? > Actually, the BuildStep doesn't finish until self.finished() is called. So, GraphServerPost.start() will finish execution, but the Build will not move on to the next step until one of the callbacks has called self.finished() > I think that a simpler solution would be to put a small timeout on urllib open > (which is how talos runs) to guarantee that you'll quit after a minute. Voila! > No threading and you don't end up blocked indefinitely. > The difference here is that the Talos code runs on the slaves in a ShellCommand. This is already guaranteed not to block the master in any way. GraphServerPost runs on master. I know it sucks to a rewrite but I really think we need to do this the right way, esp. as our Buildbots get larger.
Comment on attachment 333319 [details] [diff] [review] [checked in] add GraphServerPost custom buildstep after discussion on irc we decided to go with this option for now. we should definitely move to a non-blocking solution in the future though.
Attachment #333319 - Flags: review- → review+
Attachment #333751 - Flags: review?(anodelman) → review+
Attachment #333752 - Flags: review?(anodelman) → review+
Attachment #333313 - Flags: review?(laura) → review+
Comment on attachment 333313 [details] [diff] [review] [Checked in]more repair for collect.cgi graph server script 99:7f1dd343a692
Attachment #333313 - Attachment description: more repair for collect.cgi graph server script → [Checked in]more repair for collect.cgi graph server script
Comment on attachment 333319 [details] [diff] [review] [checked in] add GraphServerPost custom buildstep Checking in misc.py; /cvsroot/mozilla/tools/buildbotcustom/steps/misc.py,v <-- misc.py new revision: 1.7; previous revision: 1.6 done Checking in test.py; /cvsroot/mozilla/tools/buildbotcustom/steps/test.py,v <-- test.py new revision: 1.5; previous revision: 1.4 done
Attachment #333319 - Attachment description: add GraphServerPost custom buildstep → [checked in] add GraphServerPost custom buildstep
Checking in factory.py; /cvsroot/mozilla/tools/buildbotcustom/process/factory.py,v <-- factory.py new revision: 1.10; previous revision: 1.9 done
Attachment #333751 - Attachment is obsolete: true
Comment on attachment 333752 [details] [diff] [review] [checked in] pass the new parameters to the factory changeset: 235:7bf70420a81f
Attachment #333752 - Attachment description: pass the new parameters to the factory → [checked in] pass the new parameters to the factory
I've updated and reconfig'ed the buildmaster. I'll keep an eye on it and resolve this bug once I confirm it's working ok.
Had to land a bustage fix - forgot to pass baseName to MercurialBuildFactory which caused GraphServerPost to throw an exception about resultsname being NoneType. Build is going now.
Looks like things are working okay. I just noticed that we're using 'codesighs' and 'codesighs_embed' and the old graphs are using 'codesize' and 'codesize_embed' -- Alice, can we make that change for consistency?
The codesize vs. codesighs is resolved with rewriting rules on the graph server side.
(In reply to comment #39) > The codesize vs. codesighs is resolved with rewriting rules on the graph server > side. > Can we get those in place?
The codesighs/codesize graph server fix will probably land this week. As far as I can tell, the data is being correctly sent to the graph server now - the links are just not being published to the waterfall.
ok as for the links on the Tinderbox page....I think TinderboxMailNotifier is not working correctly with non-ShellCommands. I'm going to a file a bug about that - there's no reason to hold this one open because of it. once the graph server fix lands we can close this bug out.
Any update on the graph server fix?
graph server fix checked in and live.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
The leak numbers are still not linkified on the tinderbox waterfall...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: Leak numbers still not linkified on tinderbox waterfall
What does it take to get the text linkified? This bug has been open since May, and it still hasn't been fixed correctly. People in #developers today were complaining that they can't find the graphs on graphs.m.o, too. If the leak results were linkified, this wouldn't be a problem.
I think bug 451210 is what Ben is referring to in comment #42. Since I don't know when our priorities will match Reed's, there are now links at the top of the Firefox waterfall, look for "Graph links for Codesighs/Tracemalloc ...". Which throws up an interesting problem. Only the allocations (A) really give sensible graphs, all the others are submitting using KB or MB units instead of bytes.
Depends on: 451210
Attached patch Fix units ?Splinter Review
I don't think there are other consumers of the testresults BuildProperty, but lets ask the original author what he thinks.
Attachment #338433 - Flags: review?(bhearsum)
Comment on attachment 338433 [details] [diff] [review] Fix units ? The Buildbot code looks fine. Alice is a better person to review the format/units - I don't know a whole lot about that stuff
Attachment #338433 - Flags: review?(bhearsum)
Attachment #338433 - Flags: review?(anodelman)
Attachment #338433 - Flags: review+
Comment on attachment 338433 [details] [diff] [review] Fix units ? Down in GraphServerPost is also uses that value to report to the waterfall page, and I'd assume that we want the pretty version (MB instead of Bytes, etc) to end up on the waterfall. We'd want to repeat the formatting in that function for the waterfall report for things to remain readable. Though, since we aren't successfully printing to the waterfall that may be moot. Also, what do we want to do with the current data stored in the graph server? Do we mind if the lines take a sudden jump when they change from being in MB to B?
Attachment #338433 - Flags: review?(anodelman) → review-
I think that we'll just take the hit with the unit change - can try and come up with a sql script after the fact to repair the old data if we think that it will be worthwhile. My going theory is that the log isn't being correctly scraped because it isn't using a standard name - I've changed it to 'stdio'. I don't know if this will actually fix things because I don't have a proper stage set up to try it on, but it won't harm anything.
Attachment #338944 - Flags: review?(bhearsum)
(In reply to comment #51) > My going theory is that the log isn't being correctly scraped because it isn't > using a standard name - I've changed it to 'stdio'. I don't know if this will > actually fix things because I don't have a proper stage set up to try it on, > but it won't harm anything. I would be surprised if this was the case (unless getLogs() has a bug like that). Also, the leak test steps use non-standard logs name and *do* get scraped properly. It's worth a try, though.
Comment on attachment 338944 [details] [diff] [review] fix units/possible fix for not being scraped to waterfall Everything here looks fine. Can we test this patch in staging before landing it?
Attachment #338944 - Flags: review?(bhearsum) → review+
This gets scraped correctly on stage and uses the correct units for the data. The jump in the graph is a little annoying - but as we build up more data it will be less noticeable. Or we can go the extra mile and figure out the query to fix the questionable values.
Attachment #338944 - Attachment is obsolete: true
Attachment #339369 - Flags: review?(bhearsum)
Attachment #339369 - Flags: review?(bhearsum) → review+
Comment on attachment 339369 [details] [diff] [review] [Checked in]fix units/possible fix for not being scraped to waterfall (again) Checking in test.py; /cvsroot/mozilla/tools/buildbotcustom/steps/test.py,v <-- test.py new revision: 1.10; previous revision: 1.9 done
Attachment #339369 - Attachment description: fix units/possible fix for not being scraped to waterfall (again) → [Checked in]fix units/possible fix for not being scraped to waterfall (again)
Patch applied and pushed to production. Still not getting scraping. Should we split that out into a different bug?
Data successfully being pushed, but still not being scraped on tinderbox waterfall.
Assignee: anodelman → nobody
Component: Release Engineering → Release Engineering: Future
Has the main issue in the bug been fixed? If so, can we close this one out and open another bug? Need to know if this is still blocking.
We're reporting results to graph server now. AIUI, Tinderbox is still failing to link to them properly. I don't think this is blocking anything anymore.
Except that a significant part of the value of having the numbers on the graph server comes from people knowing that they're on the graph server (and thus looking at them) -- which doesn't happen without the links. (Seems like it should be pretty trivial to fix, in the sense of taking longer to find the code that does the TinderboxPrint in question than to write the patch.)
Graph links for all of these things are at the top of the Tinderbox page, in the status message. It is inconvenient to not have each build linkifying them, but again, I don't see how it's a blocker. We did do some investigation into this and there's something trickier going on than a missing <a> tag.
dbaron, do you think we should block on this? Seems like people can get the info, even if it's not as efficient.
Flags: blocking1.9.1+ → blocking1.9.1-
So the fix here is most likely adding a call to self.step_status.setText when the graph server post is successful.
Assignee: nobody → catlee
Component: Release Engineering: Future → Release Engineering
Calling setText when we succeed means that the tinderboxmailnotifier won't skip over this step when generating the build log. We also need to use single quotes for the href attribute here, since the URL graph server gives us contains double quotes.
Attachment #409845 - Flags: review?(bhearsum)
Comment on attachment 409845 [details] [diff] [review] Call setText on success, and use single quotes to surround href wfm
Attachment #409845 - Flags: review?(bhearsum) → review+
Same as before, plus adding a tbPrint parameter to CompareBloatLogs, CompareLeakLogs, Codesighs. This gets set to False if the results are going to be posted to graph server, and having them TinderboxPrinted from there.
Attachment #409845 - Attachment is obsolete: true
Attachment #410232 - Flags: review?(bhearsum)
Comment on attachment 410232 [details] [diff] [review] Call setText on success, and use single quotes to surround href, prevent duplicate TinderboxPrints >diff --git a/process/factory.py b/process/factory.py >--- a/process/factory.py >+++ b/process/factory.py >@@ -522,22 +522,27 @@ class MercurialBuildFactory(MozillaBuild > name='upload_bloat_log', > env=self.env, > command=['scp', '-o', 'User=%s' % self.stageUsername, > '-o', 'IdentityFile=~/.ssh/%s' % self.stageSshKey, > '../bloat.log', > '%s:%s/%s' % (self.stageServer, self.stageBasePath, > self.logUploadDir)] > ) >+ if self.graphServer is not None: >+ tbPrint = False >+ else: >+ tbPrint = True Looks like this block can go at the top of this function, or maybe even in __init__()? In any case, no need to have it twice here. r=bhearsum with that change
Attachment #410232 - Attachment is obsolete: true
Attachment #410278 - Flags: review?(bhearsum)
Attachment #410232 - Flags: review?(bhearsum)
Attachment #410278 - Flags: review?(bhearsum) → review+
Comment on attachment 410278 [details] [diff] [review] Call setText on success, and use single quotes to surround href, prevent duplicate TinderboxPrints changeset: 477:98fc27231b18
Attachment #410278 - Flags: checked-in+
Looks like it's working now.
Status: REOPENED → RESOLVED
Closed: 17 years ago16 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: