Closed Bug 1049657 Opened 10 years ago Closed 7 years ago

monitoring for buildbot master step delay

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Unassigned)

Sometimes our buildbot steps take far longer than they should, due to load on the master.

The causes of the load vary widely; the usual fix is to add more masters or to split the slave pool in other ways.

We should monitor for this.

(e.g. a step that normally takes <1s on a slave could take 30s+ to complete and start the next step; this extra time adds up fast)
I've been using this code to submit master lag times to graphite. Somebody should be able to use this to generate a coarser metric for use by nagios, e.g. if the 50th percentile rises above 10s, we should get an alert.

#!/usr/bin/env python
import logging

import sqlalchemy as sa

log = logging.getLogger(__name__)

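def find_lag_since(db, build_id):
    """Return the 'get_basedir' step rows for all builds newer than build_id.

    Each row carries the build id, the master name and the step's start and
    end times; the step duration is what gets submitted as master lag.
    """
    q = sa.text("""
        SELECT builds.id as build_id, masters.name as master, steps.starttime, steps.endtime FROM masters, builds, steps
        WHERE
            builds.master_id = masters.id AND
            steps.build_id = builds.id AND
            steps.name = 'get_basedir' AND
            builds.id > :build_id
            """)
    # Note: Engine.execute() is legacy API and was removed in SQLAlchemy 2.0.
    return db.execute(q, build_id=build_id)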

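def get_last_build_id(db, d):
    """Return the id of the earliest build that started on or after d."""
    q = sa.text("SELECT id FROM builds WHERE starttime >= :d ORDER BY starttime asc limit 1")
    row = db.execute(q, d=d).fetchone()
    if row is None:
        # Without this check, an empty result fails with an opaque TypeError.
        raise ValueError("no builds found starting on or after %s" % d)
    return row[0]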

def main():
    import config
    from build_times import GraphiteSubmitter, td2s, dt2ts

    logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.DEBUG)
    db = sa.create_engine("mysql://foobar")

    log.debug("getting last_build_id")
    last_build_id = get_last_build_id(db, "2014-05-01")

    g = GraphiteSubmitter("graphitehost", 2003, config.graphite_api_key)
    log.debug("getting lag")
    for row in find_lag_since(db, last_build_id):
        d = td2s(row.endtime - row.starttime)
        t = dt2ts(row.starttime)
        g.submit("masterlag.%s" % row.master, d, t)

if __name__ == '__main__':
    main()
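
The coarser nagios metric suggested above could look roughly like the following. This is only a sketch under stated assumptions: the lag samples (step durations in seconds) are assumed to have been collected already, for example from the rows returned by find_lag_since; the 10s warning threshold is taken from the comment above, while the 30s critical threshold and the names percentile and check_lag are made up for illustration. The exit codes follow the usual nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import math

# Conventional nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def percentile(values, pct):
    """Return the pct-th percentile of values (nearest-rank method)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of samples <= it.
    rank = max(1, int(math.ceil(pct / 100.0 * len(ordered))))
    return ordered[rank - 1]


def check_lag(lags, pct=50, warn=10.0, crit=30.0):
    """Map lag samples in seconds to a nagios (exit_code, message) pair.

    warn=10.0 comes from the bug comment; crit=30.0 is an assumed value.
    """
    if not lags:
        return UNKNOWN, "UNKNOWN - no lag samples"
    value = percentile(lags, pct)
    if value > crit:
        code, label = CRITICAL, "CRITICAL"
    elif value > warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, "%s - p%d master lag is %.1fs" % (label, pct, value)
```

A check script would print the message and exit with the code, e.g. code, message = check_lag(samples) followed by print(message) and sys.exit(code). Alerting on a percentile rather than the maximum keeps a single slow step from paging anyone.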
Component: Tools → General
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED