Closed Bug 476208 Opened 11 years ago Closed 11 years ago

have build machines send data to graph server in new graph server format.

Categories

(Release Engineering :: General, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: anodelman)

References

Details

Attachments

(3 files, 8 obsolete files)

4.07 KB, patch
rdoherty
: review+
anodelman
: checked-in+
Details | Diff | Splinter Review
3.39 KB, patch
bhearsum
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
7.48 KB, patch
bhearsum
: review+
bhearsum
: checked-in+
Details | Diff | Splinter Review
The new graph server requires a new format for sending data, the builders should comply with this format.

We should roll this out in staging to get a decent amount of testing done.  When the new graph server goes into production we can wait a few weeks to ensure that everything is working correctly and then switch to sending data to it.

I'm assuming that we will not require data migration or double sending of data (ie, have the builders sending results to both old/new graph servers for a limited set of time) for this as we don't have much history and the baseline can be generated over a single run.
Assignee: nobody → anodelman
Priority: -- → P2
bhearsum - can you take a quick look over this, as you touched this code just recently? 

The only block that I'm hitting is that I'm going to require the sourcestamp for the build that was testing to send along to the graph server.  Is that information available easily somewhere?  Also, would it be possible to get the 'real' buildid instead of the current solution that involves just taking a timestamp from localtime (so, not related to the build but ends up related to the time the test was run).
You can parse both SourceStamp and BuildID out of application.ini
Can I get a directory where the ini would be located from where this graph sending code is?  ie, is it a build property that I could access?
Missing here is the post_file.py file - it's the same as that used by talos.  We could check in a second copy for each of use or we could pull from a single source to avoid duplication.
Attachment #360201 - Attachment is obsolete: true
Attachment #360843 - Flags: review?(bhearsum)
Attachment #360851 - Flags: review?(rdoherty)
This will require allowing the cpu_speed column to have NULLs - is that acceptable?
Attachment #360851 - Attachment is obsolete: true
Attachment #360854 - Flags: review?(rdoherty)
Attachment #360851 - Flags: review?(rdoherty)
Comment on attachment 360844 [details] [diff] [review]
changes to mozilla stage to pass correct variables to buildbotcustom

>diff -r 9e0527a80b00 mozilla2-staging/master-main.cfg
>--- a/mozilla2-staging/master-main.cfg	Tue Feb 03 14:32:48 2009 -0500
>+++ b/mozilla2-staging/master-main.cfg	Thu Feb 05 18:03:35 2009 -0800
>@@ -200,7 +200,7 @@
>             stageBasePath=STAGE_BASE_PATH,
>             graphServer=GRAPH_SERVER,
>             graphSelector=GRAPH_SELECTOR,
>-            graphBranch=branch['major_version'],
>+            graphBranch=branch['waterfall_name'],

No need to add a new parameter for this, it's already in config.py as branch['tinderbox_tree']
Attachment #360844 - Flags: review?(bhearsum) → review-
Comment on attachment 360843 [details] [diff] [review]
update buildbotcustom to handle new graph server send protocol

This code isn't non-blocking. I don't want to put us in another situation where graph server or network problems hang Buildbot.

You should be able to use getPage just as before. I did some playing and it looks like this will work:

fh = open(filename)
data = fh.read()
fh.close()
content_type, body = encode_multipart_formdata(
    [('key', 'value')],
    [('filename', filename, data)]
)
headers = {'Content-Type': content_type, 'Content-Length': str(len(data))}
d = getPage(self.graphurl, timeout=self.timeout, method='POST',
            headers=headers, postdata=body)

The deferreds should fire exactly the same as before with this code. You can probably get rid of sendAndRemoveFile, and you'll have to move the deletion of the tempfile to postFinished, you can pass it in like this:
d1.addCallback(self.postFinished, filename)
...
def postFinished(self, results, filename)


Does the above make sense?
Attachment #360843 - Flags: review?(bhearsum) → review-
Attachment #360854 - Flags: review?(rdoherty) → review-
Comment on attachment 360854 [details] [diff] [review]
add build tests/machines to staging graph server db (take 2)

Changing the column to null is fine, just change it in sql/schema.sql and make it part of this patch and I'll r+ it.
bhearsum - 
The waterfall_name reflects the name of the waterfall that the itself would be associated with (ie, branch).  For the all the staging boxes the tinderbox_tree is 'MozillaTest', but I require them to be 'Firefox', 'Firefox3.1' and 'TraceMonkey' depending on the type of build.  Maybe calling it waterfall_name isn't quite appropriate.  Would be more comfortable with branchLongName?  

This is due to an effort in the graph server to get away from numbers (1.9, 1.9.0, 1.9.1, etc) in favor of waterfall branch names - it's seen as an easier to find the information.
Change to schema as requested.
Attachment #360854 - Attachment is obsolete: true
Attachment #360983 - Flags: review?(rdoherty)
Comment on attachment 360983 [details] [diff] [review]
[Checked in]add build tests/machines to staging graph server db (now with schema update)

Looks good!
Attachment #360983 - Flags: review?(rdoherty) → review+
(In reply to comment #12)
> bhearsum - 
> The waterfall_name reflects the name of the waterfall that the itself would be
> associated with (ie, branch).  For the all the staging boxes the tinderbox_tree
> is 'MozillaTest', but I require them to be 'Firefox', 'Firefox3.1' and
> 'TraceMonkey' depending on the type of build.  Maybe calling it waterfall_name
> isn't quite appropriate.  Would be more comfortable with branchLongName?  
> 
> This is due to an effort in the graph server to get away from numbers (1.9,
> 1.9.0, 1.9.1, etc) in favor of waterfall branch names - it's seen as an easier
> to find the information.

Can we use repo_path instead, or is that too ugly? repo_path is more accurate, anyways, as evidenced by all staging branches sharing a tinderbox_tree.
Uses non-blocking getPage.
Attachment #360843 - Attachment is obsolete: true
Attachment #361366 - Flags: review?(bhearsum)
Attachment #360844 - Attachment is obsolete: true
Attachment #361368 - Flags: review?(bhearsum)
Comment on attachment 360983 [details] [diff] [review]
[Checked in]add build tests/machines to staging graph server db (now with schema update)

197:eb1cb26499d4
Attachment #360983 - Attachment description: add build tests/machines to staging graph server db (now with schema update) → [Checked in]add build tests/machines to staging graph server db (now with schema update)
Attachment #360983 - Flags: checked‑in+ checked‑in+
bhearsum - looking over the results on MozillaTest, the links that I'm generating aren't being added to the logs correctly.  Used to be the case that we could get things to properly print on stage but not on production.  Now it looks like we can't get graph links post anywhere.
Comment on attachment 361368 [details] [diff] [review]
changes to mozilla stage to pass correct variables to buildbotcustom (take 2)

>diff -r 9e0527a80b00 mozilla2-staging/config.py
>--- a/mozilla2-staging/config.py	Tue Feb 03 14:32:48 2009 -0500
>+++ b/mozilla2-staging/config.py	Mon Feb 09 14:40:13 2009 -0800
>@@ -16,8 +16,8 @@
> AUS2_USER = 'cltbld'
> AUS2_HOST = 'staging-stage.build.mozilla.org'
> DOWNLOAD_BASE_URL = 'ftp://ftp.mozilla.org/pub/mozilla.org/firefox'
>-GRAPH_SERVER = 'graphs-stage.mozilla.org'
>-GRAPH_SELECTOR = 'server'
>+GRAPH_SERVER = 'graphs-stage2.mozilla.org'

I guess this is the new style graph server?
Attachment #361368 - Flags: review?(bhearsum) → review+
Comment on attachment 361366 [details] [diff] [review]
[Backed out]update buildbotcustom to handle new graph server send protocol (take 2)

Much nicer - thanks!
Attachment #361366 - Flags: review?(bhearsum) → review+
(In reply to comment #19)
> bhearsum - looking over the results on MozillaTest, the links that I'm
> generating aren't being added to the logs correctly.  Used to be the case that
> we could get things to properly print on stage but not on production.  Now it
> looks like we can't get graph links post anywhere.

That sucks. If we need to fix it for this ping me later and maybe can track it down...
Blocks: 487329
Are these patches going to fix the current GraphServerPost failures in staging. Eg:
Cannot find input stream
  File "/var/www/html/graphs2/server/pyfomatic/collect.py", line 241, in handleRequest
    raise ImproperFormatException("Cannot find input stream")
That looks like attempting data in the old graph server format to the new graph server - so yes, it should fix that.
Double checked, and the build machines and tests have been correctly added to the production graph server db.  Rolling this out should only be a question of checking in and pushing the code.
Comment on attachment 361366 [details] [diff] [review]
[Backed out]update buildbotcustom to handle new graph server send protocol (take 2)


>+    def constructFile(self, machine, testname, branch, sourcestamp, buildid, date, val):
>+        info_format = "%s,%s,%s,%s,%s,%s\n"
>+        filename = tempfile.mktemp()
>+        tmpf = open(filename, "w")
>+        tmpf.write("START\n")
>+        tmpf.write("AVERAGE\n")
>+        tmpf.write(info_format % (machine, testname, branch, sourcestamp, buildid, date))
>+        tmpf.write("%.2f\n" % (float(val)))
>+        tmpf.write("END")
>+        tmpf.flush()
>+        tmpf.close()
>+        return filename
>+

Can we change this so we're not creating and deleting these temporary files all the time?  Is there a need for this information to be stored in the file?  Below, you're immediately reading the entire file into memory and then deleting the file, so it's probably better to just return a string with all the above data.

>     def start(self):
>         self.changes = self.build.allChanges()
>         self.timestamp = int(self.step_status.build.getTimes()[0])
>-        self.buildid = strftime("%Y%m%d%H%M", localtime(self.timestamp))
>+        self.buildid = self.getProperty('buildid')
>+        self.sourcestamp = self.getProperty('sourcestamp')
>         self.testresults = self.getProperty('testresults')
>         self.stdio = self.addLog('stdio')
>         self.error = False
>@@ -379,9 +399,15 @@
>         deferreds = []
>         for res in self.testresults:
>             testname, testlongname, testval, prettyval = res
>-            params = urlencode({'branchid': self.buildid, 'value': str(testval).strip(string.letters), 'testname': testlongname, 'tbox' : self.resultsname, 'type' : "continuous", 'time' : self.timestamp, 'branch' : self.branch})
>-            d = getPage(self.graphurl, timeout=self.timeout, method='POST',
>-                        postdata=params)
>+            testval = str(testval).strip(string.letters)
>+            filename = self.constructFile(self.resultsname, testlongname, self.branch, self.sourcestamp, self.buildid, self.timestamp, testval)
>+            fh = open(filename)
>+            file_data = fh.read()
>+            fh.close()
>+            os.remove(filename)
>+            content_type, body = post_file.encode_multipart_formdata([("key", "value")], [("filename", filename, file_data)])
>+            headers = {'Content-Type': content_type, 'Content-Length' : str(len(file_data))}
>+            d = getPage(self.graphurl, timeout=self.timeout, method='POST', headers=headers, postdata=body)
>             d.addCallback(self.doTinderboxPrint, testlongname, testname,
>                           prettyval)
>             d.addErrback(lambda x: self.postFailed(testlongname))
Updated patch to reflect currently reality.  This looks like during migration we are going to break the ability to thunderbird and seamonkey to send data to graphs-stage...

How can we get in contact with them to make the transition smooth?
Attachment #361368 - Attachment is obsolete: true
Attachment #383603 - Flags: review?(bhearsum)
Attachment #383603 - Attachment is patch: true
Attachment #383603 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #27)
> Created an attachment (id=383603) [details]
> changes to pass correct variables to buildbotcustom (take 3)
> 
> Updated patch to reflect currently reality.  This looks like during migration
> we are going to break the ability to thunderbird and seamonkey to send data to
> graphs-stage...

Well afaik SeaMonkey doesn't send any stats to graphs-stage.

Thunderbird stats seemed to have stopped around May time. As I was expecting it to be much earlier than that I'm not too fussed about breaking it.

What I would like to do is to work with you so that we can migrate to the new code (graphs-new?) and get the right names for everything (we currently use comm-central and other things on graph-stage; the graphs-new version looks much better).

> How can we get in contact with them to make the transition smooth?

Make sure we're cc'ed and/or ping us on #maildev
(In reply to comment #28)
> Well afaik SeaMonkey doesn't send any stats to graphs-stage.

Right, and we're waiting for the new graphs server to be fully working before enabling that, see bug 492406. I the buildbotcustom factories send the data in the new format, we're safe for that once we can move forward on it.
Attachment #383603 - Flags: review?(bhearsum) → review+
Comment on attachment 383603 [details] [diff] [review]
[Backed out]changes to pass correct variables to buildbotcustom (take 3)

I don't know much about the new graph server, but this is what Talos uses, so it sounds good to me!
Comment on attachment 361366 [details] [diff] [review]
[Backed out]update buildbotcustom to handle new graph server send protocol (take 2)

changeset:   358:393344231abf
Attachment #361366 - Attachment description: update buildbotcustom to handle new graph server send protocol (take 2) → [Checked in]update buildbotcustom to handle new graph server send protocol (take 2)
Attachment #361366 - Flags: checked‑in+ checked‑in+
Comment on attachment 383603 [details] [diff] [review]
[Backed out]changes to pass correct variables to buildbotcustom (take 3)

changeset:   1270:55793cf7f0c9
Attachment #383603 - Attachment description: changes to pass correct variables to buildbotcustom (take 3) → [Checked in]changes to pass correct variables to buildbotcustom (take 3)
Attachment #383603 - Flags: checked‑in+ checked‑in+
Attachment #361366 - Attachment description: [Checked in]update buildbotcustom to handle new graph server send protocol (take 2) → [Backed out]update buildbotcustom to handle new graph server send protocol (take 2)
Attachment #361366 - Flags: checked‑in+ checked‑in+ → checked‑in- checked‑in-
Attachment #383603 - Attachment description: [Checked in]changes to pass correct variables to buildbotcustom (take 3) → [Backed out]changes to pass correct variables to buildbotcustom (take 3)
Attachment #383603 - Flags: checked‑in+ checked‑in+ → checked‑in- checked‑in-
Graph server posting failed after applying patches, backed out and currently sending to graphs-old.
Blocks: 493623
Attachment #383603 - Attachment is obsolete: true
Attachment #389840 - Flags: review?(bhearsum)
Attachment #361366 - Attachment is obsolete: true
Attachment #389841 - Flags: review?(bhearsum)
Patches run green on stage, I've also added better error handling so that we can diagnose graph server failures quickly.
Comment on attachment 389840 [details] [diff] [review]
buildbot-configs updates to support new graph server


>diff -r 0a44baf04cbb mozilla2-staging/master-main.cfg
>--- a/mozilla2-staging/master-main.cfg	Tue Jul 21 21:16:36 2009 +0200
>+++ b/mozilla2-staging/master-main.cfg	Tue Jul 21 18:25:25 2009 -0700
>@@ -298,7 +298,7 @@
>             stageBasePath=STAGE_BASE_PATH,
>             graphServer=GRAPH_SERVER,
>             graphSelector=GRAPH_SELECTOR,
>-            graphBranch=branch['major_version'],
>+            graphBranch=branch['tinderbox_tree'],

Using tinderbox_tree here doesn't work very well in staging, but we can fix that later.
Attachment #389840 - Flags: review?(bhearsum) → review+
Attachment #389840 - Flags: checked‑in+ checked‑in+
Comment on attachment 389840 [details] [diff] [review]
buildbot-configs updates to support new graph server

2e154a2f8c5e
Attachment #389841 - Flags: checked‑in+ checked‑in+
Comment on attachment 389841 [details] [diff] [review]
buildbotcustom updates to support new graph server

2dad9a4ec70e
Comment on attachment 389841 [details] [diff] [review]
buildbotcustom updates to support new graph server

>@@ -549,6 +567,24 @@
>          warnOnFailure=True,
>          haltOnFailure=True
>         )
>+        self.addStep(SetProperty,
>+          command=['python', 'build/config/printconfigsetting.py',
>+          'build/%s/dist/bin/application.ini' % self.objdir,
>+          'App', 'BuildID'],
>+          property='buildid',
>+          workdir='.',
>+          description=['getting', 'buildid'],
>+          descriptionDone=['got', 'buildid']
>+        )
>+        self.addStep(SetProperty,
>+          command=['python', 'build/config/printconfigsetting.py',
>+          'build/%s/dist/bin/application.ini' % self.objdir,
>+          'App', 'SourceStamp'],
>+          property='sourcestamp',
>+          workdir='.',
>+          description=['getting', 'sourcestamp'],
>+          descriptionDone=['got', 'sourcestamp']
>+        )

I don't think you need to do this a second time here.

This patch looks fine otherwise. I'll check it in with this nit fixed during the downtime today.
Attachment #389841 - Flags: review?(bhearsum) → review+
Alice, can you try adding this code in staging, it should get us better logs as to why graph server posts fail.

To reproduce set GRAPH_SELECTOR to 'server/collect.cgi' (instead of '/server/collect.cgi')


diff --git a/steps/test.py b/steps/test.py
--- a/steps/test.py
+++ b/steps/test.py
@@ -402,22 +402,24 @@
               found = True
           if not found:
               self.stdio.addStderr("results not added, response: \n" + contents)
               raise Exception("graph server did not add results successfully")
         except Exception, e:
           self.stdio.addStderr(str(e))
           raise
         
-    def postFailed(self, testlongname):
+    def postFailed(self, testlongname, res=None):
         # This function is called when getPage() fails and simply sets
         # self.error = True so postFinished knows that something failed.
         self.error = True
         self.stdio.addStderr('\nEncountered error when trying to post %s\n' % \
           testlongname)
+        if res:
+            self.stdio.addStderr(str(res))
 
     def constructString(self, machine, testname, branch, sourcestamp, buildid, date, val):
         info_format = "%s,%s,%s,%s,%s,%s\n"
         str = ""
         str += "START\n"
         str += "AVERAGE\n"
         str += info_format % (machine, testname, branch, sourcestamp, buildid, date)
         str += "%.2f\n" % (float(val))
@@ -439,17 +441,17 @@
             testname, testlongname, testval, prettyval = res
             testval = str(testval).strip(string.letters)
             data = self.constructString(self.resultsname, testlongname, self.branch, self.sourcestamp, self.buildid, self.timestamp, testval)
             content_type, body = post_file.encode_multipart_formdata([("key", "value")], [("filename", "data", data)])
             headers = {'Content-Type': content_type, 'Content-Length' : str(len(data))}
             d = getPage(self.graphurl, timeout=self.timeout, method='POST', headers=headers, postdata=body)
             d.addCallback(self.doTinderboxPrint, testlongname, testname,
                           prettyval)
-            d.addErrback(lambda x: self.postFailed(testlongname))
+            d.addErrback(lambda res: self.postFailed(testlongname, res))
             deferreds.append(d)
 
         # Now, once *everything* has finished we need to tell Buildbot
         # that this step is complete.
         dl = DeferredList(deferreds)
         dl.addCallback(self.postFinished)
 
     def postFinished(self, results):
I'm seeing the leak/bloat results on the new graph server, so data is successfully being sent.  Going to call this done.  Any further improvements can be split into new bugs.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 508274
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.