Sometimes our buildbot steps take far longer than they should, due to load on the master. The load criteria can vary wildly; usually the fix is to add more masters, or to split the slave pool in other ways. We should monitor for this. (e.g. if one step that normally takes <1s on a slave takes 30s+ to complete and start the next step, that extra time adds up fast)
I've been using this code to submit master lag times to graphite. Somebody should be able to use this to generate a coarser metric for nagios to consume, e.g. if the 50th percentile rises above 10s, we should get an alert.

#!/usr/bin/env python
import sqlalchemy as sa
import logging

log = logging.getLogger(__name__)


def find_lag_since(db, build_id):
    # Per-build lag: how long the trivial 'get_basedir' step took,
    # which should be near-instant on an unloaded master.
    q = sa.text("""
        SELECT builds.id AS build_id, masters.name AS master,
               steps.starttime, steps.endtime
        FROM masters, builds, steps
        WHERE builds.master_id = masters.id
          AND steps.build_id = builds.id
          AND steps.name = 'get_basedir'
          AND builds.id > :build_id
        """)
    return db.execute(q, build_id=build_id)


def get_last_build_id(db, d):
    # First build on or after the given date; we report lag for
    # everything newer than this.
    q = sa.text("SELECT id FROM builds WHERE starttime >= :d "
                "ORDER BY starttime ASC LIMIT 1")
    return db.execute(q, d=d).fetchone()


def main():
    import config
    from build_times import GraphiteSubmitter, td2s, dt2ts

    logging.basicConfig(format="%(asctime)s - %(message)s",
                        level=logging.DEBUG)
    db = sa.create_engine("mysql://foobar")

    log.debug("getting last_build_id")
    # fetchone() returns a row; we want the scalar id column
    last_build_id = get_last_build_id(db, "2014-05-01").id

    g = GraphiteSubmitter("graphitehost", 2003, config.graphite_api_key)

    log.debug("getting lag")
    for row in find_lag_since(db, last_build_id):
        d = td2s(row.endtime - row.starttime)
        t = dt2ts(row.starttime)
        g.submit("masterlag.%s" % row.master, d, t)


if __name__ == '__main__':
    main()
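A minimal sketch of what the coarser nagios-side check could look like, assuming the lag samples (in seconds) have already been pulled out of graphite or the database; the function names, thresholds other than the 10s one mentioned above, and the warn level are all hypothetical:

```python
import math

# Standard nagios plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def percentile(samples, pct):
    """Return the pct-th percentile of a list of numbers (nearest-rank)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest-rank: ceil(pct/100 * N), converted to a 0-based index
    rank = int(math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def check_master_lag(samples, warn=5.0, crit=10.0):
    """Map the 50th percentile of master lag (seconds) to a nagios status.

    crit=10.0 matches the 10s threshold suggested above; warn=5.0 is a
    made-up intermediate level.
    """
    p50 = percentile(samples, 50)
    if p50 >= crit:
        return CRITICAL, "CRITICAL: 50th percentile lag %.1fs >= %.1fs" % (p50, crit)
    if p50 >= warn:
        return WARNING, "WARNING: 50th percentile lag %.1fs >= %.1fs" % (p50, warn)
    return OK, "OK: 50th percentile lag %.1fs" % p50
```

The check would exit with the returned status code and print the message, which is all nagios needs from a plugin.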
Component: Tools → General
Product: Release Engineering → Release Engineering
Status: NEW → RESOLVED
Resolution: --- → FIXED