Closed Bug 1212993 Opened 9 years ago Closed 8 years ago

Reduce hidden Buildbot lag

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: catlee)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 3 obsolete files)

This is discussed in [1]; it is a side effect of overloaded masters being unable to hand the next step to machines that are waiting for one.

* We can reduce logging for Buildbot jobs, which reduces the lag
* We can increase the number of masters
* We can reduce the number of steps in ScriptFactory
* We can move jobs to TaskCluster to reduce load on the masters

The current way to determine the cost this has on our systems is to analyze every single Buildbot log for a day.

Another option *might* be to query the Buildbot databases.


[1] https://groups.google.com/d/msg/mozilla.release.engineering/wZRfbOwdc54/hqcR5cNuBgAJ
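The log-analysis approach above can be sketched as a small parser. This is a hypothetical sketch, not the actual tooling; it assumes the log format quoted later in this bug (worker-side `elapsedTime=` lines followed by the master's `========= Finished ... (elapsed: N secs)` footer):

```python
import re

# Worker-reported command time, e.g. "elapsedTime=0.301000"
ELAPSED_RE = re.compile(r"^elapsedTime=(?P<secs>[\d.]+)")
# Master-reported step footer, e.g.
# "========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at ...) ========="
FINISHED_RE = re.compile(
    r"^========= Finished .*elapsed: (?:(?P<mins>\d+) mins?, )?(?P<secs>\d+) secs?"
)

def step_lags(log_lines):
    """Yield (worker_time, master_time, lag) per step, in seconds."""
    worker_time = None
    for line in log_lines:
        m = ELAPSED_RE.match(line)
        if m:
            worker_time = float(m.group("secs"))
            continue
        m = FINISHED_RE.match(line)
        if m and worker_time is not None:
            master_time = int(m.group("mins") or 0) * 60 + int(m.group("secs"))
            yield worker_time, master_time, master_time - worker_time
            worker_time = None
```

The per-step lag is then the difference between the master's elapsed time and the worker's `elapsedTime`.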
###
There are some interesting things in this pre-mh time:

elapsedTime=0.301000
========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:43.620746) =========

Notice that elapsed time is 42s while the command itself thinks it only took
0.3s. This is a symptom of master lag, which we track an approximation of
here:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana2/dashboard/db/buildbot-masters

The "lag" is time between a step completing on the worker and the master
responding to that. If this gets too high, it means we need to add more
masters. 

###
Could you elaborate on how each of those data points is calculated?
From looking at that graph I would believe that we're OK; however, from
looking at just one job we can see more lag than we want.

I think a different dashboard would be needed to determine how much lag
is introduced for every pool per day.

What are options to improve lag? I can think of these:
* more masters
* move to taskcluster

Is the lag mainly dependent on number of slaves running a job on a master?
Or is it due to the number of builders?

From this specific job, this is how much time was wasted per step [1]
(only significant waste shown):
* rm -> 42s - 0.3s = 41.7s
* rm -> 34s - 3.3s = 30.7s
* bash -> 8s - 2.2s = 5.8s
* mh -> 620s - 601s = 19s

The lag for this job was ~97 secs -> about 1.6 mins out of 13 minutes,
or roughly 12% of the wall time.
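A quick check of those figures, taking master-reported elapsed time minus worker-reported `elapsedTime` for each step:

```python
# (step, master-reported elapsed, worker-reported elapsedTime), in seconds,
# taken from the log excerpt quoted below in this bug
steps = [
    ("rm",    42,   0.3),
    ("rm",    34,   3.3),
    ("bash",   8,   2.2),
    ("mh",   620, 601.0),
]

lags = [master - worker for _, master, worker in steps]
total = sum(lags)        # ~97.2 seconds of lag
wall = 13 * 60           # ~13-minute job
share = total / wall     # ~12% of wall time
```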
###
From the thread:
[1]
========= Started set props: basedir (results: 0, elapsed: 1 secs) (at
2015-09-25 07:30:57.464401) =========
elapsedTime=0.548000
========= Finished set props: basedir (results: 0, elapsed: 1 secs) (at
2015-09-25 07:30:58.857526) =========
========= Started 'rm -rf ...' (results: 0, elapsed: 42 secs) (at
2015-09-25 07:31:01.253640) =========
elapsedTime=0.301000
========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at
2015-09-25 07:31:43.620746) =========
========= Started 'bash -c ...' (results: 0, elapsed: 3 secs) (at
2015-09-25 07:31:43.621910) =========
elapsedTime=0.752000
========= Finished 'bash -c ...' (results: 0, elapsed: 3 secs) (at
2015-09-25 07:31:47.176730) =========
========= Started 'rm -rf ...' (results: 0, elapsed: 34 secs) (at
2015-09-25 07:31:47.177061) =========
elapsedTime=3.303000
========= Finished 'rm -rf ...' (results: 0, elapsed: 34 secs) (at
2015-09-25 07:32:21.989463) =========
========= Started 'bash -c ...' (results: 0, elapsed: 8 secs) (at
2015-09-25 07:32:21.989924) =========
elapsedTime=2.181000
========= Finished 'bash -c ...' (results: 0, elapsed: 8 secs) (at
2015-09-25 07:32:30.760760) =========
========= Started 'c:/mozilla-build/python27/python -u ...' (results: 0,
elapsed: 10 mins, 20 secs) (at 2015-09-25 07:33:05.324682) =========
elapsedTime=601.379000
========= Finished 'c:/mozilla-build/python27/python -u ...' (results:
0, elapsed: 10 mins, 20 secs) (at 2015-09-25 07:43:25.627777) =========
========= Started set props: build_url (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:25.628616) =========
elapsedTime=0.102000
========= Finished set props: build_url (results: 0, elapsed: 0 secs)
(at 2015-09-25 07:43:26.238286) =========
========= Started 'rm -f ...' (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:26.238635) =========
elapsedTime=0.101000
========= Finished 'rm -f ...' (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:26.353214) =========
Blocks: 1213004
We could probably get rid of the 'get basedir' step, and have that hardcoded in buildbot instead. That almost never changes.
Assignee: nobody → catlee
this outputs lines like this:

> program finished with exit code 0
> elapsedTime=0.562000
> basedir: 'C:\\slave\\test'
> ========= master_lag: 8.15 =========
> ========= Finished set props: basedir (results: 0, elapsed: 8 secs) (at 2015-10-27 10:47:49.046646) =========

and at the end of the log:

> ========= Total master_lag: 51.15 =========
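Summing those markers back out of a log can be sketched as follows; this is a hypothetical helper, assuming exactly the `========= master_lag: N =========` format shown above:

```python
import re

# Matches the per-step marker, e.g. "========= master_lag: 8.15 ========="
LAG_RE = re.compile(r"^========= master_lag: (?P<lag>[\d.]+) =========")

def total_master_lag(log_lines):
    """Sum the per-step master_lag markers emitted into the log."""
    return sum(float(m.group("lag"))
               for m in map(LAG_RE.match, log_lines) if m)
```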
Blocks: 1220269
Comment on attachment 8681414 [details] [diff] [review]
put master lag in the log, and report to statsd if available

Review of attachment 8681414 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm

as you mentioned in IRC, need to add statsd to master-pip.txt
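For reference, the statsd side could look roughly like this. This is a hypothetical sketch, not the actual attachment; `report_lag` and the metric name are made up, and it assumes the `statsd` Python package's `StatsClient` API:

```python
try:
    import statsd  # optional dependency (hence the master-pip.txt change)
except ImportError:
    statsd = None

def report_lag(step_name, lag_seconds, host='localhost', port=8125):
    """Report a step's master lag to statsd, if the library is available."""
    if statsd is None:
        return False
    client = statsd.StatsClient(host, port)
    # statsd timings are expressed in milliseconds
    client.timing('buildbot.master_lag.%s' % step_name, lag_seconds * 1000.0)
    return True
```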
Attachment #8681414 - Flags: review?(rail) → review+
Attachment #8681483 - Flags: review?(jwatkins)
Depends on: 1220296
Attachment #8681483 - Flags: review?(jwatkins)
Attachment #8681483 - Flags: review+
Attachment #8681483 - Flags: checked-in+
Something is a little wonky according to graphite. I've seen some spikes approaching "17.5G" seconds, or approximately 554 years. I haven't found any logs containing this yet, so this could be either a log parsing bug, or a statsd/collectd/graphite bug.
Here is a query looking at the past week [1]; the *average* is dubious because it only considers lag over 20 sec. Note that some lag is negative due to rounding and inter-machine clock skew. Query [2] shows URLs for lag over 10 min. It is best to *Copy* it into a spreadsheet; that makes it easier to cut-and-paste the URLs.

[1] http://activedata.allizom.org/tools/query.html#query_id=jkzXsTFF
[2] http://activedata.allizom.org/tools/query.html#query_id=+0u3ebNw
Note, I would have put the lag calculation in the `select` clause; but the query translator is broken in this instance.  I have a TODO [1] to make a test and fix it.

[1] https://github.com/klahnakoski/ActiveData/blob/b7467aeea3f2bbe2238b97c2ecca4c935bef166d/tests/test_deep_ops.py#L1079
Depends on: 1220156
Here is a query to show lags over 60sec.

> {
>     "from":"jobs.action.timings",
>     "select":[
>         "builder.step",
>         "builder.duration",
>         "builder.elapsedTime",
>         {
>             "name":"lag",
>             "value":{"sub":["builder.duration","builder.elapsedTime"]}
>         },
>         "run.logurl"
>     ],
>     "where":{"and":[
>         {"exists":"builder.elapsedTime"},
>         {"gt":[
>             {"subtract":["builder.duration","builder.elapsedTime"]},
>             60
>         ]}
>     ]},
>     "limit":100
> }

I am recording it here because this only works on my dev machine; I will be pushing to production on Monday.
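The query's semantics can also be expressed over plain records in Python. A sketch only; the field names (`builder.duration`, `builder.elapsedTime`, `run.logurl`) come from the query above, and the record shape is assumed:

```python
def steps_with_lag_over(records, threshold=60):
    """Keep steps whose (builder.duration - builder.elapsedTime)
    exceeds `threshold` seconds, mirroring the ActiveData query."""
    out = []
    for r in records:
        b = r.get("builder", {})
        if b.get("elapsedTime") is None:
            continue
        lag = b["duration"] - b["elapsedTime"]
        if lag > threshold:
            out.append({
                "step": b.get("step"),
                "duration": b["duration"],
                "elapsedTime": b["elapsedTime"],
                "lag": lag,
                "logurl": r.get("run", {}).get("logurl"),
            })
    return out
```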
I pushed to production. This query [1] reports lag of over 60 sec in the past month.

[1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR
(In reply to Kyle Lahnakoski [:ekyle] from comment #13)
> I pushed to production.  This query [1] reports over lag of over 60sec in
> past month.
> 
> [1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR

There's a lot of data here, but I'm not sure what we need in order to make this actionable. Armen, do you have any suggestions?
Flags: needinfo?(armenzg)
catlee, when will the next reboot of the masters be?
Filed bug 1225475 for Windows 8 backlogs.

I can't make sense of the data.
I also wasn't able to paste it into a spreadsheet.

What would be useful is a total lag per job, summed up and grouped by master.
A master is what creates the lag; the machines/slaves connected to it are the ones showing its effects.
If we could track the lag that each master has, that would be great.
Flags: needinfo?(armenzg)
We track total lag per master in graphite.
Attached image Buildbot Lag (nov 10)
ActiveData has the same. Here is the 95th percentile lag from the past week, in four-hour increments [1]. Of course, a grid of numbers is inhumane, so I charted it.

[1] http://activedata.allizom.org/tools/query.html#query_id=hUrIHBEl
Attachment #8738146 - Flags: review?(rail)
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/1-2/
Attachment #8738145 - Attachment description: MozReview Request: WIP → MozReview Request: Bug 1212993 - remove get basedir step
Attachment #8738146 - Attachment is obsolete: true
Attachment #8738146 - Flags: review?(rail)
Attachment #8738145 - Flags: review?(rail)
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step

https://reviewboard.mozilla.org/r/44359/#review41313

Removals LGTM. I'm not sure if `validateBuilders()` will work as expected, though.

::: misc.py:2980
(Diff revision 2)
>      return names
>  
>  
> +def validateBuilders(builders):
> +    for b in builders:
> +        if isinstance(b['factory'], ScriptFactory):

This function sounds like `validateScriptFactoryBuilders` to me. Maybe replace the `if` block with
```
if not isinstance(b['factory'], ScriptFactory):
    continue
```
and dedent the block below it?

::: misc.py:3006
(Diff revision 2)
> +            else:
> +                rootdir = '/builds/slave'
> +                basedir = '%s/%s' % (rootdir, slavebuilddir)
> +
> +            b['properties']['basedir'] = basedir
> +            assert 'basedir' in b['properties'], b['name']

This line looks weird to me: the `assert` will pass in any situation, because the line above ensures `b['properties']` contains `basedir`.

The only places where this function may raise exceptions are where you access `b['builddir']` and `b['properties']['platform']`, so effectively it only validates that those are set. Not sure if that is the intent.
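Putting the review suggestions together, the function might end up looking like this. A hypothetical reconstruction, not the landed patch: the `ScriptFactory` class is a stand-in, and the platform-specific branches other than the posix one shown in the diff are elided:

```python
class ScriptFactory(object):
    """Stand-in for buildbotcustom's ScriptFactory, so this sketch runs."""

def validateScriptFactoryBuilders(builders):
    """Set a 'basedir' property on every ScriptFactory builder."""
    for b in builders:
        if not isinstance(b['factory'], ScriptFactory):
            continue
        # These lookups double as validation: a missing 'builddir' or
        # 'platform' raises KeyError at config time rather than at runtime.
        slavebuilddir = b['builddir']
        b['properties']['platform']
        # ... other platform branches elided; the posix case from the diff:
        rootdir = '/builds/slave'
        b['properties']['basedir'] = '%s/%s' % (rootdir, slavebuilddir)
```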
Attachment #8738145 - Flags: review?(rail)
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/2-3/
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/1-2/
Attachment #8739524 - Attachment description: MozReview Request: Address review comments → MozReview Request: Bug 1212993 - remove get basedir step
Attachment #8738145 - Attachment is obsolete: true
Attachment #8738145 - Flags: review?(rail)
Attachment #8739524 - Flags: review?(rail)
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

https://reviewboard.mozilla.org/r/45303/#review41799
Attachment #8739524 - Flags: review?(rail) → review+
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/2-3/
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/3-4/
Attachment #8739524 - Attachment description: MozReview Request: Bug 1212993 - remove get basedir step → MozReview Request: Bug 1212993 - remove get basedir step r=rail
Attachment #8741006 - Attachment description: MozReview Request: Make sure we add properties to the project objects → MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail
Attachment #8741006 - Flags: review?(rail)
Comment on attachment 8741006 [details]
MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/46111/diff/1-2/
Attachment #8741007 - Attachment is obsolete: true
Comment on attachment 8741006 [details]
MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail

https://reviewboard.mozilla.org/r/46111/#review42629
Attachment #8741006 - Flags: review?(rail) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General