monitoring for buildbot master step delay

RESOLVED FIXED

Status

Product: Release Engineering
Component: General
Status: RESOLVED FIXED
Reported: 4 years ago
Modified: a year ago

People

(Reporter: Callek, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

4 years ago
Sometimes our buildbot steps take far longer than they should, due to load on the master.

The causes of the load vary widely; the usual fix is to add more masters or to split the slave pool in other ways.

We should monitor for this.

(For example, a step that normally takes <1s on a slave could take 30s+ to complete and start the next step; this extra time adds up fast.)
I've been using this code to submit master lag times to Graphite. Somebody should be able to use this to generate a coarser metric for use by Nagios, e.g. if the 50th percentile rises above 10s, we should get an alert.

#!/usr/bin/env python
import logging

import sqlalchemy as sa

log = logging.getLogger(__name__)

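def find_lag_since(db, build_id):
    """Return get_basedir step timings for every build after build_id."""
    # The duration of the get_basedir step is used as the master lag sample.
    # Unfinished steps are skipped (assumes endtime is NULL until a step ends).
    q = sa.text("""
        SELECT builds.id as build_id, masters.name as master, steps.starttime, steps.endtime FROM masters, builds, steps
        WHERE
            builds.master_id = masters.id AND
            steps.build_id = builds.id AND
            steps.name = 'get_basedir' AND
            steps.endtime IS NOT NULL AND
            builds.id > :build_id
            """)
    return db.execute(q, build_id=build_id)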

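def get_last_build_id(db, d):
    """Return the id of the earliest build that started on or after d."""
    q = sa.text("SELECT id FROM builds WHERE starttime >= :d ORDER BY starttime asc limit 1")
    row = db.execute(q, d=d).fetchone()
    if row is None:
        raise ValueError("no builds found starting on or after %s" % d)
    return row[0]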

def main():
    import config
    from build_times import GraphiteSubmitter, td2s, dt2ts

    logging.basicConfig(format="%(asctime)s - %(message)s", level=logging.DEBUG)
    db = sa.create_engine("mysql://foobar")

    log.debug("getting last_build_id")
    last_build_id = get_last_build_id(db, "2014-05-01")

    g = GraphiteSubmitter("graphitehost", 2003, config.graphite_api_key)
    log.debug("getting lag")
    for row in find_lag_since(db, last_build_id):
        d = td2s(row.endtime - row.starttime)
        t = dt2ts(row.starttime)
        g.submit("masterlag.%s" % row.master, d, t)

if __name__ == '__main__':
    main()
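
For the Nagios side, something along these lines might work as a starting point. It is only a sketch: it takes a list of lag samples in seconds (for example, the d values computed above), computes a nearest-rank percentile, and maps it to the standard Nagios plugin exit codes. The function names, the 30s critical threshold, and the message format are assumptions for illustration; only the "50th percentile above 10s" rule comes from this bug.

```python
import math

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, int(math.ceil(pct / 100.0 * len(ordered))))
    return ordered[rank - 1]


def check_master_lag(lags, warn=10.0, crit=30.0, pct=50):
    """Return (exit_code, message) for a list of lag samples in seconds.

    warn=10.0 follows the suggestion in this bug; crit=30.0 is an
    assumed value chosen only for illustration.
    """
    if not lags:
        return UNKNOWN, "UNKNOWN - no master lag samples"
    value = percentile(lags, pct)
    if value > crit:
        return CRITICAL, "CRITICAL - p%d master lag %.1fs" % (pct, value)
    if value > warn:
        return WARNING, "WARNING - p%d master lag %.1fs" % (pct, value)
    return OK, "OK - p%d master lag %.1fs" % (pct, value)
```

A wrapper script would print the message and call sys.exit() with the returned code, running the check per master so that one overloaded master is not hidden by the others.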
(Assignee)

Updated

a year ago
Component: Tools → General
Product: Release Engineering → Release Engineering
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED