Closed
Bug 1212993
Opened 9 years ago
Closed 9 years ago
Reduce hidden Buildbot lag
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: catlee)
References
(Blocks 1 open bug)
Details
Attachments
(5 files, 3 obsolete files)
7.83 KB,
patch
|
rail
:
review+
|
Details | Diff | Splinter Review |
928 bytes,
patch
|
catlee
:
review+
catlee
:
checked-in+
|
Details | Diff | Splinter Review |
327.94 KB,
image/png
|
Details | |
58 bytes,
text/x-review-board-request
|
rail
:
review+
rail
:
checked-in-
|
Details |
58 bytes,
text/x-review-board-request
|
rail
:
review+
|
Details |
This is discussed in [1]; it is a side effect of overloaded masters not being able to give next step to the machines (which are waiting).
* We can reduce logging for the Buildbot jobs which improves the lag
* We can increase the number of masters
* We can reduce the number of steps for ScriptFactory
* Move jobs to TC to reduce load on masters
The current way to determine the cost this has in our systems is by analyzing every single Buildbot log for a day.
Another option *might* be to query the Buildbot databases
[1] https://groups.google.com/d/msg/mozilla.release.engineering/wZRfbOwdc54/hqcR5cNuBgAJ
###
There are some interesting things in this pre-mh time:
elapsedTime=0.301000
========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:43.620746) =========
Notice that elapsed time is 42s while the command itself thinks it only too
0.3s. This is a symptom of master lag, which we track an approximation of
here:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana2/dashboard/db/buildbot-masters
The "lag" is time between a step completing on the worker and the master
responding to that. If this gets too high, it means we need to add more
masters.
###
Could you elaborate how each of those data points are calculated?
From looking at that graph I would believe that we're OK, however, in
only [looking at] one job I believe we can see more lag than wanted.
I think a different dashboard would be needed to determine how much lag
is introduced for every pool per day.
What are options to improve lag? I can think of these:
* more masters
* move to taskcluster
Is the lag mainly dependent on number of slaves running a job on a master?
Or is it due to the number of builders?
>From this specific job I can see that this much was wasted per step [1]
(only significant waste):
* rm -> 42s - 0.3s = 41.7s
* rm -> 34s - 3.3s = 30.7s
* bash -> 8s - 2.2s = 5.8s
* mh -> 620s - 601s = 19s
The lag for this job was ~97 secs -> 1.37 mins out of 13 minutes.
Approximately 10% of the wall time.
###
Reporter | ||
Comment 1•9 years ago
|
||
From the thread:
[1]
========= Started set props: basedir (results: 0, elapsed: 1 secs) (at
2015-09-25 07:30:57.464401) =========
elapsedTime=0.548000
========= Finished set props: basedir (results: 0, elapsed: 1 secs) (at
2015-09-25 07:30:58.857526) =========
========= Started 'rm -rf ...' (results: 0, elapsed: 42 secs) (at
2015-09-25 07:31:01.253640) =========
elapsedTime=0.301000
========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at
2015-09-25 07:31:43.620746) =========
========= Started 'bash -c ...' (results: 0, elapsed: 3 secs) (at
2015-09-25 07:31:43.621910) =========
elapsedTime=0.752000
========= Finished 'bash -c ...' (results: 0, elapsed: 3 secs) (at
2015-09-25 07:31:47.176730) =========
========= Started 'rm -rf ...' (results: 0, elapsed: 34 secs) (at
2015-09-25 07:31:47.177061) =========
elapsedTime=3.303000
========= Finished 'rm -rf ...' (results: 0, elapsed: 34 secs) (at
2015-09-25 07:32:21.989463) =========
========= Started 'bash -c ...' (results: 0, elapsed: 8 secs) (at
2015-09-25 07:32:21.989924) =========
elapsedTime=2.181000
========= Finished 'bash -c ...' (results: 0, elapsed: 8 secs) (at
2015-09-25 07:32:30.760760) =========
========= Started 'c:/mozilla-build/python27/python -u ...' (results: 0,
elapsed: 10 mins, 20 secs) (at 2015-09-25 07:33:05.324682) =========
elapsedTime=601.379000
========= Finished 'c:/mozilla-build/python27/python -u ...' (results:
0, elapsed: 10 mins, 20 secs) (at 2015-09-25 07:43:25.627777) =========
========= Started set props: build_url (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:25.628616) =========
elapsedTime=0.102000
========= Finished set props: build_url (results: 0, elapsed: 0 secs)
(at 2015-09-25 07:43:26.238286) =========
========= Started 'rm -f ...' (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:26.238635) =========
elapsedTime=0.101000
========= Finished 'rm -f ...' (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:26.353214) =========
Assignee | ||
Comment 2•9 years ago
|
||
We could probably get rid of the 'get basedir' step, and have that hardcoded in buildbot instead. That almost never changes.
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → catlee
Assignee | ||
Comment 3•9 years ago
|
||
Attachment #8681414 -
Flags: review?(rail)
Assignee | ||
Comment 4•9 years ago
|
||
this outputs lines like this:
> program finished with exit code 0
> elapsedTime=0.562000
> basedir: 'C:\\slave\\test'
> ========= master_lag: 8.15 =========
> ========= Finished set props: basedir (results: 0, elapsed: 8 secs) (at 2015-10-27 10:47:49.046646) =========
and at the end of the log:
> ========= Total master_lag: 51.15 =========
Comment 5•9 years ago
|
||
Comment on attachment 8681414 [details] [diff] [review]
put master lag in the log, and report to statsd if available
Review of attachment 8681414 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm
as you mentioned in IRC, need to add statsd to master-pip.txt
Attachment #8681414 -
Flags: review?(rail) → review+
Assignee | ||
Comment 6•9 years ago
|
||
Attachment #8681483 -
Flags: review?(jwatkins)
Assignee | ||
Updated•9 years ago
|
Attachment #8681483 -
Flags: review?(jwatkins)
Attachment #8681483 -
Flags: review+
Attachment #8681483 -
Flags: checked-in+
Assignee | ||
Comment 8•9 years ago
|
||
Something is a little wonky according to graphite. I've seen some spikes approaching "17.5G" seconds, or approximately 554 years. I haven't found any logs containing this yet, so this could be either a log parsing bug, or a statsd/collectd/graphite bug.
Comment 9•9 years ago
|
||
Here is a query looking at the past week [1]; the *average* is dubious because it only considers lag over 20sec. Note that some lag is negative due to rounding and inter-machine clock skew. Query [2] shows URLs for lag over 10min. It is best you *Copy* to a spreadsheet: It will be easier to cut-and-paste the URLs.
[1] http://activedata.allizom.org/tools/query.html#query_id=jkzXsTFF
[2] http://activedata.allizom.org/tools/query.html#query_id=+0u3ebNw
Comment 10•9 years ago
|
||
Note, I would have put the lag calculation in the `select` clause; but the query translator is broken in this instance. I have a TODO [1] to make a test and fix it.
[1] https://github.com/klahnakoski/ActiveData/blob/b7467aeea3f2bbe2238b97c2ecca4c935bef166d/tests/test_deep_ops.py#L1079
Comment 11•9 years ago
|
||
Comment 12•9 years ago
|
||
Here is a query to show lags over 60sec.
> {
> "from":"jobs.action.timings",
> "select":[
> "builder.step",
> "builder.duration",
> "builder.elapsedTime",
> {
> "name":"lag",
> "value":{"sub":["builder.duration","builder.elapsedTime"]}
> },
> "run.logurl"
> ],
> "where":{"and":[
> {"exists":"builder.elapsedTime"},
> {"gt":[
> {"subtract":["builder.duration","builder.elapsedTime"]},
> 60
> ]}
> ]},
> "limit":100
> }
I am recording it here because this only works on my dev machine; I will be pushing to production on Monday
Comment 13•9 years ago
|
||
I pushed to production. This query [1] reports over lag of over 60sec in past month.
[1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR
Comment 14•9 years ago
|
||
(In reply to Kyle Lahnakoski [:ekyle] from comment #13)
> I pushed to production. This query [1] reports over lag of over 60sec in
> past month.
>
> [1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR
There's a lot of data here, but I'm not sure what we need in order to make this actionable. Armen, do you have any suggestions?
Flags: needinfo?(armenzg)
Reporter | ||
Comment 15•9 years ago
|
||
catlee, when will the next reboot of the masters be?
Filed bug 1225475 for Windows 8 backlogs.
I can't make sense of the data.
I also failed to be able to paste it into a spreadsheet.
What would be useful, if we could have a total lag per job and add them up grouped by masters.
A master is what creates the lag and the machines/slaves connected to it are the ones showing the effects of it.
If we could track the lag that each master has it would be great.
Flags: needinfo?(armenzg)
Assignee | ||
Comment 16•9 years ago
|
||
We track total lag per master in graphite.
Assignee | ||
Comment 17•9 years ago
|
||
Comment 18•9 years ago
|
||
ActiveData has the same; Here is the 95 percentile lag from the past week, in four hour increments [1]. Of course, a grid of numbers is inhumane, so I charted it.
http://activedata.allizom.org/tools/query.html#query_id=hUrIHBEl
Assignee | ||
Comment 19•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/44359/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/44359/
Assignee | ||
Comment 20•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/44361/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/44361/
Assignee | ||
Updated•9 years ago
|
Attachment #8738146 -
Flags: review?(rail)
Assignee | ||
Comment 21•9 years ago
|
||
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/1-2/
Attachment #8738145 -
Attachment description: MozReview Request: WIP → MozReview Request: Bug 1212993 - remove get basedir step
Assignee | ||
Updated•9 years ago
|
Attachment #8738146 -
Attachment is obsolete: true
Attachment #8738146 -
Flags: review?(rail)
Assignee | ||
Updated•9 years ago
|
Attachment #8738145 -
Flags: review?(rail)
Comment 22•9 years ago
|
||
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step
https://reviewboard.mozilla.org/r/44359/#review41313
Removals LGTM. I' not sure if `validateBuilders()` will work as expected though.
::: misc.py:2980
(Diff revision 2)
> return names
>
>
> +def validateBuilders(builders):
> + for b in builders:
> + if isinstance(b['factory'], ScriptFactory):
This function sounds like `validateScriptFactoryBuilders` to me. Maybe replace the `if` block with
```if not isinstance(b['factory'], ScriptFactory):
continue
```
and dedent the block below it?
::: misc.py:3006
(Diff revision 2)
> + else:
> + rootdir = '/builds/slave'
> + basedir = '%s/%s' % (rootdir, slavebuilddir)
> +
> + b['properties']['basedir'] = basedir
> + assert 'basedir' in b['properties'], b['name']
This line looks weird to me: `assert` will work in any situatiation, because the line above makes sure `b['properties']` contains `basedir`.
The only places where this function may raise exceptions are where you access `b['builddir']` and `b['properties']['platform']`, so basically it only validates if those are set. Not sure if this is the intent.
Attachment #8738145 -
Flags: review?(rail)
Assignee | ||
Comment 23•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/45303/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/45303/
Attachment #8738145 -
Flags: review?(rail)
Assignee | ||
Comment 24•9 years ago
|
||
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/2-3/
Assignee | ||
Comment 25•9 years ago
|
||
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/1-2/
Attachment #8739524 -
Attachment description: MozReview Request: Address review comments → MozReview Request: Bug 1212993 - remove get basedir step
Assignee | ||
Updated•9 years ago
|
Attachment #8738145 -
Attachment is obsolete: true
Attachment #8738145 -
Flags: review?(rail)
Assignee | ||
Updated•9 years ago
|
Attachment #8739524 -
Flags: review?(rail)
Comment 26•9 years ago
|
||
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail
https://reviewboard.mozilla.org/r/45303/#review41799
Attachment #8739524 -
Flags: review?(rail) → review+
Assignee | ||
Comment 27•9 years ago
|
||
https://hg.mozilla.org/build/buildbotcustom/rev/493648951b51cf9019516d37f2268c77089bb87a
Bug 1212993 - remove get basedir step r=rail
Comment 28•9 years ago
|
||
In production: https://hg.mozilla.org/build/buildbotcustom/rev/493648951b51
Comment 29•9 years ago
|
||
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail
Had to backout: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=5d944b43c2173abb3426503ced62074ba739e112
Attachment #8739524 -
Flags: checked-in-
Assignee | ||
Comment 30•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/46111/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/46111/
Assignee | ||
Comment 31•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/46113/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/46113/
Assignee | ||
Comment 32•9 years ago
|
||
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/2-3/
Assignee | ||
Comment 33•9 years ago
|
||
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/3-4/
Attachment #8739524 -
Attachment description: MozReview Request: Bug 1212993 - remove get basedir step → MozReview Request: Bug 1212993 - remove get basedir step r=rail
Attachment #8741006 -
Attachment description: MozReview Request: Make sure we add properties to the project objects → MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail
Attachment #8741006 -
Flags: review?(rail)
Assignee | ||
Comment 34•9 years ago
|
||
Comment on attachment 8741006 [details]
MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/46111/diff/1-2/
Assignee | ||
Updated•9 years ago
|
Attachment #8741007 -
Attachment is obsolete: true
Comment 35•9 years ago
|
||
Comment on attachment 8741006 [details]
MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail
https://reviewboard.mozilla.org/r/46111/#review42629
Attachment #8741006 -
Flags: review?(rail) → review+
Assignee | ||
Comment 36•9 years ago
|
||
https://hg.mozilla.org/build/buildbotcustom/rev/ac0b9905e9d3fd825306d8d5891e95f463b9c301
Bug 1212993 - remove get basedir step r=rail
Assignee | ||
Updated•9 years ago
|
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•