Closed Bug 458939 Opened 17 years ago Closed 16 years ago

buildbotcustom.steps.test.GraphServerPost needs to use Twisted's http library instead of urlopen()

Categories

(Release Engineering :: General, defect, P2)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file, 2 obsolete files)

The moz2 Buildbot hung twice today because it couldn't submit results to the graph server. This caused the tree to burn, forced us to restarted release builds, and required lots of manually intervention on windows slaves because of hung processes. Like I mentioned in the original bug, we really need to be using Twisted's http library here. https://bugzilla.mozilla.org/show_bug.cgi?id=433710#c25 https://bugzilla.mozilla.org/show_bug.cgi?id=458937
Component: Release Engineering → Release Engineering: Future
This bug needs to get resolved soon. Again we had something up with graphs.m.o hang the master.
Severity: major → critical
Component: Release Engineering: Future → Release Engineering
A quick fix here would be to add an appropriate socket timeout value and then catch the exception that occurs on timeout. This is how talos deal with graph server issues - it catches the timeout and then reports an appropriate error. In the case of build machines it could report the error and then continue with normal operation.
Unfortunately, urlopen doesn't have a 'timeout' option, and I don't think we want to use socket.setdefaulttimeout() - that would effect the master as a whole (and probably cause an implosion). If no one else takes this by the new year, I guess I will.
Assignee: nobody → bhearsum
Priority: -- → P3
I'm testing a fix for this today.
Status: NEW → ASSIGNED
Priority: P3 → P2
Attached patch GraphServerPost, nonblocking (obsolete) — Splinter Review
Okay, here's a non-blocking version of this BuildStep. Each getPage Deferred has a callback which does the TinderboxPrint, and an errback to detect failure. The DeferredList fires once all of the getPaged Deferreds have finished, and does the final log printing and determines overall success or failure. I had to add flunkOnFailure=True to make sure this step burns the overall build. I couldn't figure out a way to cause the HTTP posts to hang, so I added '127.0.0.1 graphs-stage.mozilla.org' to test the errbacks and failure code. Unfortunately, this means that I couldn't 100% confirm that the hang is gone, but this code is non-blocking, so I'm pretty confident about it.
Attachment #358220 - Flags: review?(l10n)
Attachment #358220 - Flags: review?(anodelman)
Attachment #358220 - Flags: review?(l10n) → review+
Comment on attachment 358220 [details] [diff] [review] GraphServerPost, nonblocking r=me. Two nits, I'd use if "RETURN:" in line: instead of if line.find("RETURN:") > -1: The latter is too perlish ;-) A bit more substance has: I'd move from self.addCompleteLog() in postFinished() to self.addStdout in doTinderboxPrint(). That gives you incremental logs as the posts succeed. And you don't have to carry around .summary. Nice to see us moving over to POST on the way, did you make sure that the graphserver takes those allright? At least it looks like that used to be GET requests.
Attachment #358220 - Flags: review?(anodelman) → review+
Comment on attachment 358220 [details] [diff] [review] GraphServerPost, nonblocking This looks fine to me as well, has it been running on stage at all?
Thanks for the nits Axel, I'll address them and re-post. Alice, I've run this on staging-master in both the leak test and codesighs scenario without issue.
Okay, I've addressed your nits here Axel. I thought about trying to get rid of postFinished, but I don't think I can without bailing after the first error, which doesn't seem ideal. I think it's cleaner to set overall step status here, anyways.
Attachment #358220 - Attachment is obsolete: true
Attachment #358391 - Flags: review?(l10n)
Comment on attachment 358391 [details] [diff] [review] GraphServerPost, nonblocking, nits addressed and yet another nit, in postFailed, I'd self.stdio.addStderr instead of addStdout. Didn't occur to me that we can change that one on the first round of review. r=me.
Attachment #358391 - Flags: review?(l10n) → review+
Attachment #358391 - Attachment is obsolete: true
Comment on attachment 358408 [details] [diff] [review] use addStderr in postFailed changeset: 180:e43363144d85
Attachment #358408 - Flags: checked‑in+
This changed landed without issue.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: