Closed Bug 1212993 Opened 9 years ago Closed 9 years ago

Reduce hidden Buildbot lag

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: catlee)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 3 obsolete files)

This is discussed in [1]; it is a side effect of overloaded masters not being able to give next step to the machines (which are waiting). * We can reduce logging for the Buildbot jobs which improves the lag * We can increase the number of masters * We can reduce the number of steps for ScriptFactory * Move jobs to TC to reduce load on masters The current way to determine the cost this has in our systems is by analyzing every single Buildbot log for a day. Another option *might* be to query the Buildbot databases [1] https://groups.google.com/d/msg/mozilla.release.engineering/wZRfbOwdc54/hqcR5cNuBgAJ ### There are some interesting things in this pre-mh time: elapsedTime=0.301000 ========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:43.620746) ========= Notice that elapsed time is 42s while the command itself thinks it only too 0.3s. This is a symptom of master lag, which we track an approximation of here: https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana2/dashboard/db/buildbot-masters The "lag" is time between a step completing on the worker and the master responding to that. If this gets too high, it means we need to add more masters. ### Could you elaborate how each of those data points are calculated? From looking at that graph I would believe that we're OK, however, in only [looking at] one job I believe we can see more lag than wanted. I think a different dashboard would be needed to determine how much lag is introduced for every pool per day. What are options to improve lag? I can think of these: * more masters * move to taskcluster Is the lag mainly dependent on number of slaves running a job on a master? Or is it due to the number of builders? >From this specific job I can see that this much was wasted per step [1] (only significant waste): * rm -> 42s - 0.3s = 41.7s * rm -> 34s - 3.3s = 30.7s * bash -> 8s - 2.2s = 5.8s * mh -> 620s - 601s = 19s The lag for this job was ~97 secs -> 1.37 mins out of 13 minutes. Approximately 10% of the wall time. ###
From the thread: [1] ========= Started set props: basedir (results: 0, elapsed: 1 secs) (at 2015-09-25 07:30:57.464401) ========= elapsedTime=0.548000 ========= Finished set props: basedir (results: 0, elapsed: 1 secs) (at 2015-09-25 07:30:58.857526) ========= ========= Started 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:01.253640) ========= elapsedTime=0.301000 ========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:43.620746) ========= ========= Started 'bash -c ...' (results: 0, elapsed: 3 secs) (at 2015-09-25 07:31:43.621910) ========= elapsedTime=0.752000 ========= Finished 'bash -c ...' (results: 0, elapsed: 3 secs) (at 2015-09-25 07:31:47.176730) ========= ========= Started 'rm -rf ...' (results: 0, elapsed: 34 secs) (at 2015-09-25 07:31:47.177061) ========= elapsedTime=3.303000 ========= Finished 'rm -rf ...' (results: 0, elapsed: 34 secs) (at 2015-09-25 07:32:21.989463) ========= ========= Started 'bash -c ...' (results: 0, elapsed: 8 secs) (at 2015-09-25 07:32:21.989924) ========= elapsedTime=2.181000 ========= Finished 'bash -c ...' (results: 0, elapsed: 8 secs) (at 2015-09-25 07:32:30.760760) ========= ========= Started 'c:/mozilla-build/python27/python -u ...' (results: 0, elapsed: 10 mins, 20 secs) (at 2015-09-25 07:33:05.324682) ========= elapsedTime=601.379000 ========= Finished 'c:/mozilla-build/python27/python -u ...' (results: 0, elapsed: 10 mins, 20 secs) (at 2015-09-25 07:43:25.627777) ========= ========= Started set props: build_url (results: 0, elapsed: 0 secs) (at 2015-09-25 07:43:25.628616) ========= elapsedTime=0.102000 ========= Finished set props: build_url (results: 0, elapsed: 0 secs) (at 2015-09-25 07:43:26.238286) ========= ========= Started 'rm -f ...' (results: 0, elapsed: 0 secs) (at 2015-09-25 07:43:26.238635) ========= elapsedTime=0.101000 ========= Finished 'rm -f ...' (results: 0, elapsed: 0 secs) (at 2015-09-25 07:43:26.353214) =========
Blocks: 1213004
We could probably get rid of the 'get basedir' step, and have that hardcoded in buildbot instead. That almost never changes.
Assignee: nobody → catlee
this outputs lines like this: > program finished with exit code 0 > elapsedTime=0.562000 > basedir: 'C:\\slave\\test' > ========= master_lag: 8.15 ========= > ========= Finished set props: basedir (results: 0, elapsed: 8 secs) (at 2015-10-27 10:47:49.046646) ========= and at the end of the log: > ========= Total master_lag: 51.15 =========
Blocks: 1220269
Comment on attachment 8681414 [details] [diff] [review] put master lag in the log, and report to statsd if available Review of attachment 8681414 [details] [diff] [review]: ----------------------------------------------------------------- lgtm as you mentioned in IRC, need to add statsd to master-pip.txt
Attachment #8681414 - Flags: review?(rail) → review+
Attachment #8681483 - Flags: review?(jwatkins)
Depends on: 1220296
Attachment #8681483 - Flags: review?(jwatkins)
Attachment #8681483 - Flags: review+
Attachment #8681483 - Flags: checked-in+
Something is a little wonky according to graphite. I've seen some spikes approaching "17.5G" seconds, or approximately 554 years. I haven't found any logs containing this yet, so this could be either a log parsing bug, or a statsd/collectd/graphite bug.
Here is a query looking at the past week [1]; the *average* is dubious because it only considers lag over 20sec. Note that some lag is negative due to rounding and inter-machine clock skew. Query [2] shows URLs for lag over 10min. It is best you *Copy* to a spreadsheet: It will be easier to cut-and-paste the URLs. [1] http://activedata.allizom.org/tools/query.html#query_id=jkzXsTFF [2] http://activedata.allizom.org/tools/query.html#query_id=+0u3ebNw
Note, I would have put the lag calculation in the `select` clause; but the query translator is broken in this instance. I have a TODO [1] to make a test and fix it. [1] https://github.com/klahnakoski/ActiveData/blob/b7467aeea3f2bbe2238b97c2ecca4c935bef166d/tests/test_deep_ops.py#L1079
Depends on: 1220156
Here is a query to show lags over 60sec. > { > "from":"jobs.action.timings", > "select":[ > "builder.step", > "builder.duration", > "builder.elapsedTime", > { > "name":"lag", > "value":{"sub":["builder.duration","builder.elapsedTime"]} > }, > "run.logurl" > ], > "where":{"and":[ > {"exists":"builder.elapsedTime"}, > {"gt":[ > {"subtract":["builder.duration","builder.elapsedTime"]}, > 60 > ]} > ]}, > "limit":100 > } I am recording it here because this only works on my dev machine; I will be pushing to production on Monday
I pushed to production. This query [1] reports over lag of over 60sec in past month. [1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR
(In reply to Kyle Lahnakoski [:ekyle] from comment #13) > I pushed to production. This query [1] reports over lag of over 60sec in > past month. > > [1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR There's a lot of data here, but I'm not sure what we need in order to make this actionable. Armen, do you have any suggestions?
Flags: needinfo?(armenzg)
catlee, when will the next reboot of the masters be? Filed bug 1225475 for Windows 8 backlogs. I can't make sense of the data. I also failed to be able to paste it into a spreadsheet. What would be useful, if we could have a total lag per job and add them up grouped by masters. A master is what creates the lag and the machines/slaves connected to it are the ones showing the effects of it. If we could track the lag that each master has it would be great.
Flags: needinfo?(armenzg)
We track total lag per master in graphite.
Attached image Buildbot Lag (nov 10)
ActiveData has the same; Here is the 95 percentile lag from the past week, in four hour increments [1]. Of course, a grid of numbers is inhumane, so I charted it. http://activedata.allizom.org/tools/query.html#query_id=hUrIHBEl
Attachment #8738146 - Flags: review?(rail)
Comment on attachment 8738145 [details] MozReview Request: Bug 1212993 - remove get basedir step Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/1-2/
Attachment #8738145 - Attachment description: MozReview Request: WIP → MozReview Request: Bug 1212993 - remove get basedir step
Attachment #8738146 - Attachment is obsolete: true
Attachment #8738146 - Flags: review?(rail)
Attachment #8738145 - Flags: review?(rail)
Comment on attachment 8738145 [details] MozReview Request: Bug 1212993 - remove get basedir step https://reviewboard.mozilla.org/r/44359/#review41313 Removals LGTM. I' not sure if `validateBuilders()` will work as expected though. ::: misc.py:2980 (Diff revision 2) > return names > > > +def validateBuilders(builders): > + for b in builders: > + if isinstance(b['factory'], ScriptFactory): This function sounds like `validateScriptFactoryBuilders` to me. Maybe replace the `if` block with ```if not isinstance(b['factory'], ScriptFactory): continue ``` and dedent the block below it? ::: misc.py:3006 (Diff revision 2) > + else: > + rootdir = '/builds/slave' > + basedir = '%s/%s' % (rootdir, slavebuilddir) > + > + b['properties']['basedir'] = basedir > + assert 'basedir' in b['properties'], b['name'] This line looks weird to me: `assert` will work in any situatiation, because the line above makes sure `b['properties']` contains `basedir`. The only places where this function may raise exceptions are where you access `b['builddir']` and `b['properties']['platform']`, so basically it only validates if those are set. Not sure if this is the intent.
Attachment #8738145 - Flags: review?(rail)
Comment on attachment 8738145 [details] MozReview Request: Bug 1212993 - remove get basedir step Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/2-3/
Comment on attachment 8739524 [details] MozReview Request: Bug 1212993 - remove get basedir step r=rail Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/1-2/
Attachment #8739524 - Attachment description: MozReview Request: Address review comments → MozReview Request: Bug 1212993 - remove get basedir step
Attachment #8738145 - Attachment is obsolete: true
Attachment #8738145 - Flags: review?(rail)
Attachment #8739524 - Flags: review?(rail)
Comment on attachment 8739524 [details] MozReview Request: Bug 1212993 - remove get basedir step r=rail https://reviewboard.mozilla.org/r/45303/#review41799
Attachment #8739524 - Flags: review?(rail) → review+
Comment on attachment 8739524 [details] MozReview Request: Bug 1212993 - remove get basedir step r=rail Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/2-3/
Comment on attachment 8739524 [details] MozReview Request: Bug 1212993 - remove get basedir step r=rail Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/3-4/
Attachment #8739524 - Attachment description: MozReview Request: Bug 1212993 - remove get basedir step → MozReview Request: Bug 1212993 - remove get basedir step r=rail
Attachment #8741006 - Attachment description: MozReview Request: Make sure we add properties to the project objects → MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail
Attachment #8741006 - Flags: review?(rail)
Comment on attachment 8741006 [details] MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail Review request updated; see interdiff: https://reviewboard.mozilla.org/r/46111/diff/1-2/
Attachment #8741007 - Attachment is obsolete: true
Comment on attachment 8741006 [details] MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail https://reviewboard.mozilla.org/r/46111/#review42629
Attachment #8741006 - Flags: review?(rail) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: