Closed Bug 810393 Opened 12 years ago Closed 11 years ago

deploy release runner

Categories

(Release Engineering :: Release Automation: Other, defect, P2)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: rail)

References

Details

(Whiteboard: [shipit])

Attachments

(1 file)

We need to deploy the release runner portion of the kickoff system. Our current idea is to run it on buildbot-master36.
Whiteboard: [kickoff]
Priority: -- → P2
I deployed the current dev version on bm36.

ssh cltbld@buildbot-master36
# install supervisord
su -
yum install supervisor
chkconfig supervisord on

# Add the following section to /etc/supervisor.conf:

[program:releaserunner]
command=/home/cltbld/release-runner/build-tools/buildfarm/release/release-runner.sh
exitcodes=0
user=cltbld
log_stderr=true
log_stdout=true
redirect_stderr=true
stdout_logfile=/var/log/supervisor/release-runner.log

# as cltbld
cd ~
mkdir release-runner && cd release-runner
virtualenv-2.6 --no-site-packages $PWD/venv
source venv/bin/activate
pip install simplejson
pip install fabric
pip install buildbot

git clone git://github.com/rail/build-tools.git
cd build-tools
git checkout -b release-runner origin/release-runner-comments

# Set up ~cltbld/.release-runner.ini

service supervisord start
This isn't quite working. Here are the problems I've encountered:
* Killing supervisord doesn't terminate release runner (at least, not while it's doing something like cloning a repository). I end up with the .sh process dead and release-runner.py in a sleep state. It eventually dies, presumably after it finishes retrying its current operation.
* Not getting all of the output from release runner. We get a lot of output when polling release kickoff, but almost nothing when cloning repositories. For example:
==> /var/log/supervisor/release-runner.log <==
Buildbot version: 0.8.7p1
Twisted version: 12.3.0
2012-12-28 11:42:32,666 - DEBUG - Fetching release requests
2012-12-28 11:42:32,712 - INFO - Got a new release request: {'status': 'Pending', 'product': 'fennec', 'name': 'Fennec-20.0-build1', 'dashboardCheck': False, 'buildNumber': 1, 'ready': True, 'l10nChangesets': '{\r\n  "ca": {\r\n    "revision": "3e911ef81869",\r\n    "platforms": ["android"]\r\n  },\r\n  "cs": {\r\n    "revision": "4d8963178613",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "da": {\r\n    "revision": "65d017f4f5fe",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "de": {\r\n    "revision": "d6ff03c97175",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "es-AR": {\r\n    "revision": "03c68802bdc4",\r\n    "platforms": ["android"]\r\n  },\r\n  "es-ES": {\r\n    "revision": "d2185a7e4f7f",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "fi": {\r\n    "revision": "68cb3de2b609",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "fr": {\r\n    "revision": "6513cd7d17ae",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "fy-NL": {\r\n    "revision": "0bb46089e086",\r\n    "platforms": ["android"]\r\n  },\r\n  "ga-IE": {\r\n    "revision": "b611f1be732c",\r\n    "platforms": ["android"]\r\n  },\r\n  "gl": {\r\n    "revision": "6a40cee822a6",\r\n    "platforms": ["android"]\r\n  },\r\n  "it": {\r\n    "revision": "d986d54e6074",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "ja": {\r\n    "revision": "066c7401bfdc",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "ko": {\r\n    "revision": "de1302d7a7f9",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "lt": {\r\n    "revision": "c2087557d498",\r\n    "platforms": ["android"]\r\n  },\r\n  "nb-NO": {\r\n    "revision": "1a967b024168",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "nl": {\r\n    "revision": "3aec3a20b84b",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "pa-IN": {\r\n    
"revision": "4afbb88e0ccf",\r\n    "platforms": ["android"]\r\n  },\r\n  "pl": {\r\n    "revision": "9ba4c01e429f",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "pt-BR": {\r\n    "revision": "f230d9a3c797",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "pt-PT": {\r\n    "revision": "25ed21fde549",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "ru": {\r\n    "revision": "ccfccb0a343d",\r\n    "platforms": ["android", "android-multilocale"]\r\n  },\r\n  "sk": {\r\n    "revision": "e371468ef6c9",\r\n    "platforms": ["android"]\r\n  },\r\n  "sl": {\r\n    "revision": "07442918d3aa",\r\n    "platforms": ["android"]\r\n  },\r\n  "uk": {\r\n    "revision": "e54754531f14",\r\n    "platforms": ["android"]\r\n  },\r\n  "zh-CN": {\r\n    "revision": "65723862ba6d",\r\n    "platforms": ["android"]\r\n  },\r\n  "zh-TW": {\r\n    "revision": "a5dac614368e",\r\n    "platforms": ["android"]\r\n  }\r\n}\r\n', 'version': '20.0', 'branch': 'releases/mozilla-beta', 'submitter': 'rail', 'mozillaRevision': 'default', 'complete': False}
warning: hg.mozilla.org certificate not verified (check web.cacerts config setting)
pulling from https://hg.mozilla.org/users/bhearsum_mozilla.com/buildbot-configs
searching for changes
no changes found
0 files updated, 0 files merged, 0 files removed, 0 files unresolved
abort: no suitable response from remote hg!

The latter makes it very difficult to debug other problems.
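
One possible cause of both symptoms, assuming release-runner.sh starts release-runner.py as an ordinary child process (the script's contents are not shown in this bug): a signal sent to the wrapper's PID stops the shell but not the Python child, and Python buffers its output when stdout is a file rather than a terminal. A minimal sketch of a wrapper helper that avoids both follows. `run_replacing_shell` is a hypothetical name, not something from build-tools.

```shell
#!/bin/sh
# Hypothetical helper, not taken from build-tools.

run_replacing_shell() {
    # Ask a Python child to flush stdout/stderr immediately instead of
    # buffering them when they are written to a log file.
    PYTHONUNBUFFERED=1
    export PYTHONUNBUFFERED
    # exec replaces the current shell process with the command, so the PID
    # that the supervisor signals is the command itself. 2>&1 sends stderr
    # to the same destination as stdout.
    exec "$@" 2>&1
}

# Example use as the last line of a wrapper script:
# run_replacing_shell python release-runner.py
```

Because `exec` never returns on success, anything the wrapper needs to do after the Python process exits (such as sending mail) would not run with this approach, so it only fits if that work can move elsewhere.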
Depends on: 831284
Release runner hit an ISE 500 on the 24th and e-mailed us to say that it will sleep for 259200 seconds before retry. A few problems here:
1) ISE 500 from the server shouldn't necessarily cause a sleep like that. I think you fixed this already by retrying requests to ship it?
2) Nothing in the server log indicating that it would sleep - you'd only know that if you got the e-mail.
3) More than 259200 seconds (3 days) have gone by since then, and it's still not polling again.
Attached patch: print it
(In reply to Ben Hearsum [:bhearsum] from comment #3)
> Release runner hit an ISE 500 on the 24th and e-mailed us to say that it
> will sleep for 259200 seconds before retry. A few problems here:
> 1) ISE 500 from the server shouldn't necessarily cause a sleep like that. I
> think you fixed this already by retrying requests to ship it?

This should be fixed by bug 833062.

> 2) Nothing in the server log indicating that it would sleep - you'd only
> know that if you got the e-mail.

The attached patch prints a line which goes to logs.

> 3) More than 259200 seconds (3 days) have gone by since then, and it's still
> not polling again.

Hmmm, not sure what happens here... I suspect supervisord is not restarting the process when it exits 0. I'll try to reproduce this.
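
For problem 2, the idea of announcing the sleep in the log as well as by e-mail can be sketched as follows. This is an illustration only, not the attached patch: `announce_and_sleep` and the message wording are assumptions. It relies on supervisord capturing the program's stdout in stdout_logfile, as configured earlier in this bug.

```shell
#!/bin/sh
# Hypothetical helper; the name and message format are illustrative.

announce_and_sleep() {
    seconds=$1
    # Written to stdout, which supervisord redirects to the log file, so the
    # pause is visible to anyone tailing the log.
    echo "$(date '+%Y-%m-%d %H:%M:%S') - sleeping for ${seconds} seconds before retry"
    sleep "$seconds"
}

# Example: announce_and_sleep 259200
```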
Attachment #708142 - Flags: review?(bhearsum)
Attachment #708142 - Flags: review?(bhearsum) → review+
I added autorestart=true to the config. It looks like this solves problem 3; at least it worked fine with a simple test script.
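
For reference, the resulting section would look roughly like the output of the sketch below. It reproduces the section from the first comment with autorestart=true added. The position of the new line within the section is an assumption, and `print_releaserunner_section` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper that prints the supervisor section discussed in this
# bug, with autorestart enabled. All other values are copied from the
# section shown in the first comment.

print_releaserunner_section() {
    cat <<'EOF'
[program:releaserunner]
command=/home/cltbld/release-runner/build-tools/buildfarm/release/release-runner.sh
exitcodes=0
autorestart=true
user=cltbld
log_stderr=true
log_stdout=true
redirect_stderr=true
stdout_logfile=/var/log/supervisor/release-runner.log
EOF
}

# Example: print_releaserunner_section
```

In the Supervisor 3.x documentation, autorestart accepts false, unexpected, or true. With unexpected, a program that exits with a code listed in exitcodes is not restarted, which would match the behaviour suspected above; with true it is restarted regardless of exit code. Which Supervisor version runs on bm36 is not stated in this bug.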
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [kickoff] → [shipit]
Product: mozilla.org → Release Engineering