Closed Bug 1212993 Opened 9 years ago Closed 8 years ago

Reduce hidden Buildbot lag

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: catlee)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 3 obsolete files)

This is discussed in [1]; it is a side effect of overloaded masters being unable to hand the next step to machines that are waiting for one.

* We can reduce logging for Buildbot jobs, which reduces the lag
* We can increase the number of masters
* We can reduce the number of steps in ScriptFactory
* We can move jobs to TaskCluster to reduce load on the masters

The current way to determine the cost this has on our systems is to analyze every single Buildbot log for a day.

Another option *might* be to query the Buildbot databases.


[1] https://groups.google.com/d/msg/mozilla.release.engineering/wZRfbOwdc54/hqcR5cNuBgAJ
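The log-analysis approach above can be sketched as a small parser. This is a hypothetical sketch, not the actual tooling; it assumes the log format quoted later in this bug (worker-side `elapsedTime=` lines followed by the master's `========= Finished ... (elapsed: N secs)` footer):

```python
import re

# Worker-reported command time, e.g. "elapsedTime=0.301000"
ELAPSED_RE = re.compile(r"^elapsedTime=(?P<secs>[\d.]+)")
# Master-reported step footer, e.g.
# "========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at ...) ========="
FINISHED_RE = re.compile(
    r"^========= Finished .*elapsed: (?:(?P<mins>\d+) mins?, )?(?P<secs>\d+) secs?"
)

def step_lags(log_lines):
    """Yield (worker_time, master_time, lag) per step, in seconds."""
    worker_time = None
    for line in log_lines:
        m = ELAPSED_RE.match(line)
        if m:
            worker_time = float(m.group("secs"))
            continue
        m = FINISHED_RE.match(line)
        if m and worker_time is not None:
            master_time = int(m.group("mins") or 0) * 60 + int(m.group("secs"))
            yield worker_time, master_time, master_time - worker_time
            worker_time = None
```

The per-step lag is then the difference between the master's elapsed time and the worker's `elapsedTime`.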
###
There are some interesting things in this pre-mh time:

elapsedTime=0.301000
========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at 2015-09-25 07:31:43.620746) =========

Notice that elapsed time is 42s while the command itself thinks it only took
0.3s. This is a symptom of master lag, which we track an approximation of
here:
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana2/dashboard/db/buildbot-masters

The "lag" is time between a step completing on the worker and the master
responding to that. If this gets too high, it means we need to add more
masters. 

###
Could you elaborate on how each of those data points is calculated?
From looking at that graph I would believe that we're OK; however, from
looking at just one job we can see more lag than we want.

I think a different dashboard would be needed to determine how much lag
is introduced for every pool per day.

What are options to improve lag? I can think of these:
* more masters
* move to taskcluster

Is the lag mainly dependent on number of slaves running a job on a master?
Or is it due to the number of builders?

From this specific job, this is how much time was wasted per step [1]
(only significant waste shown):
* rm -> 42s - 0.3s = 41.7s
* rm -> 34s - 3.3s = 30.7s
* bash -> 8s - 2.2s = 5.8s
* mh -> 620s - 601s = 19s

The lag for this job was ~97 secs -> about 1.6 mins out of 13 minutes,
or roughly 12% of the wall time.
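A quick check of those figures, taking master-reported elapsed time minus worker-reported `elapsedTime` for each step:

```python
# (step, master-reported elapsed, worker-reported elapsedTime), in seconds,
# taken from the log excerpt quoted below in this bug
steps = [
    ("rm",    42,   0.3),
    ("rm",    34,   3.3),
    ("bash",   8,   2.2),
    ("mh",   620, 601.0),
]

lags = [master - worker for _, master, worker in steps]
total = sum(lags)        # ~97.2 seconds of lag
wall = 13 * 60           # ~13-minute job
share = total / wall     # ~12% of wall time
```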
###
From the thread:
[1]
========= Started set props: basedir (results: 0, elapsed: 1 secs) (at
2015-09-25 07:30:57.464401) =========
elapsedTime=0.548000
========= Finished set props: basedir (results: 0, elapsed: 1 secs) (at
2015-09-25 07:30:58.857526) =========
========= Started 'rm -rf ...' (results: 0, elapsed: 42 secs) (at
2015-09-25 07:31:01.253640) =========
elapsedTime=0.301000
========= Finished 'rm -rf ...' (results: 0, elapsed: 42 secs) (at
2015-09-25 07:31:43.620746) =========
========= Started 'bash -c ...' (results: 0, elapsed: 3 secs) (at
2015-09-25 07:31:43.621910) =========
elapsedTime=0.752000
========= Finished 'bash -c ...' (results: 0, elapsed: 3 secs) (at
2015-09-25 07:31:47.176730) =========
========= Started 'rm -rf ...' (results: 0, elapsed: 34 secs) (at
2015-09-25 07:31:47.177061) =========
elapsedTime=3.303000
========= Finished 'rm -rf ...' (results: 0, elapsed: 34 secs) (at
2015-09-25 07:32:21.989463) =========
========= Started 'bash -c ...' (results: 0, elapsed: 8 secs) (at
2015-09-25 07:32:21.989924) =========
elapsedTime=2.181000
========= Finished 'bash -c ...' (results: 0, elapsed: 8 secs) (at
2015-09-25 07:32:30.760760) =========
========= Started 'c:/mozilla-build/python27/python -u ...' (results: 0,
elapsed: 10 mins, 20 secs) (at 2015-09-25 07:33:05.324682) =========
elapsedTime=601.379000
========= Finished 'c:/mozilla-build/python27/python -u ...' (results:
0, elapsed: 10 mins, 20 secs) (at 2015-09-25 07:43:25.627777) =========
========= Started set props: build_url (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:25.628616) =========
elapsedTime=0.102000
========= Finished set props: build_url (results: 0, elapsed: 0 secs)
(at 2015-09-25 07:43:26.238286) =========
========= Started 'rm -f ...' (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:26.238635) =========
elapsedTime=0.101000
========= Finished 'rm -f ...' (results: 0, elapsed: 0 secs) (at
2015-09-25 07:43:26.353214) =========
Blocks: 1213004
We could probably get rid of the 'get basedir' step, and have that hardcoded in buildbot instead. That almost never changes.
Assignee: nobody → catlee
this outputs lines like this:

> program finished with exit code 0
> elapsedTime=0.562000
> basedir: 'C:\\slave\\test'
> ========= master_lag: 8.15 =========
> ========= Finished set props: basedir (results: 0, elapsed: 8 secs) (at 2015-10-27 10:47:49.046646) =========

and at the end of the log:

> ========= Total master_lag: 51.15 =========
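Summing those markers back out of a log can be sketched as follows; this is a hypothetical helper, assuming exactly the `========= master_lag: N =========` format shown above:

```python
import re

# Matches the per-step marker, e.g. "========= master_lag: 8.15 ========="
LAG_RE = re.compile(r"^========= master_lag: (?P<lag>[\d.]+) =========")

def total_master_lag(log_lines):
    """Sum the per-step master_lag markers emitted into the log."""
    return sum(float(m.group("lag"))
               for m in map(LAG_RE.match, log_lines) if m)
```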
Blocks: 1220269
Comment on attachment 8681414 [details] [diff] [review]
put master lag in the log, and report to statsd if available

Review of attachment 8681414 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm

as you mentioned in IRC, need to add statsd to master-pip.txt
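For reference, the statsd side could look roughly like this. This is a hypothetical sketch, not the actual attachment; `report_lag` and the metric name are made up, and it assumes the `statsd` Python package's `StatsClient` API:

```python
try:
    import statsd  # optional dependency (hence the master-pip.txt change)
except ImportError:
    statsd = None

def report_lag(step_name, lag_seconds, host='localhost', port=8125):
    """Report a step's master lag to statsd, if the library is available."""
    if statsd is None:
        return False
    client = statsd.StatsClient(host, port)
    # statsd timings are expressed in milliseconds
    client.timing('buildbot.master_lag.%s' % step_name, lag_seconds * 1000.0)
    return True
```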
Attachment #8681414 - Flags: review?(rail) → review+
Attachment #8681483 - Flags: review?(jwatkins)
Depends on: 1220296
Attachment #8681483 - Flags: review?(jwatkins)
Attachment #8681483 - Flags: review+
Attachment #8681483 - Flags: checked-in+
Something is a little wonky according to graphite. I've seen some spikes approaching "17.5G" seconds, or approximately 554 years. I haven't found any logs containing this yet, so this could be either a log parsing bug, or a statsd/collectd/graphite bug.
Here is a query looking at the past week [1]; the *average* is dubious because it only considers lag over 20 sec. Note that some lag is negative due to rounding and inter-machine clock skew. Query [2] shows URLs for lag over 10 min. It is best to *Copy* it into a spreadsheet; that makes it easier to cut-and-paste the URLs.

[1] http://activedata.allizom.org/tools/query.html#query_id=jkzXsTFF
[2] http://activedata.allizom.org/tools/query.html#query_id=+0u3ebNw
Note, I would have put the lag calculation in the `select` clause; but the query translator is broken in this instance.  I have a TODO [1] to make a test and fix it.

[1] https://github.com/klahnakoski/ActiveData/blob/b7467aeea3f2bbe2238b97c2ecca4c935bef166d/tests/test_deep_ops.py#L1079
Depends on: 1220156
Here is a query to show lags over 60sec.

> {
>     "from":"jobs.action.timings",
>     "select":[
>         "builder.step",
>         "builder.duration",
>         "builder.elapsedTime",
>         {
>             "name":"lag",
>             "value":{"sub":["builder.duration","builder.elapsedTime"]}
>         },
>         "run.logurl"
>     ],
>     "where":{"and":[
>         {"exists":"builder.elapsedTime"},
>         {"gt":[
>             {"subtract":["builder.duration","builder.elapsedTime"]},
>             60
>         ]}
>     ]},
>     "limit":100
> }

I am recording it here because this only works on my dev machine; I will be pushing to production on Monday.
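The query's semantics can also be expressed over plain records in Python. A sketch only; the field names (`builder.duration`, `builder.elapsedTime`, `run.logurl`) come from the query above, and the record shape is assumed:

```python
def steps_with_lag_over(records, threshold=60):
    """Keep steps whose (builder.duration - builder.elapsedTime)
    exceeds `threshold` seconds, mirroring the ActiveData query."""
    out = []
    for r in records:
        b = r.get("builder", {})
        if b.get("elapsedTime") is None:
            continue
        lag = b["duration"] - b["elapsedTime"]
        if lag > threshold:
            out.append({
                "step": b.get("step"),
                "duration": b["duration"],
                "elapsedTime": b["elapsedTime"],
                "lag": lag,
                "logurl": r.get("run", {}).get("logurl"),
            })
    return out
```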
I pushed to production. This query [1] reports lag of over 60 sec in the past month.

[1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR
(In reply to Kyle Lahnakoski [:ekyle] from comment #13)
> I pushed to production.  This query [1] reports over lag of over 60sec in
> past month.
> 
> [1] http://activedata.allizom.org/tools/query.html#query_id=+eYDIYkR

There's a lot of data here, but I'm not sure what we need in order to make this actionable. Armen, do you have any suggestions?
Flags: needinfo?(armenzg)
catlee, when will the next reboot of the masters be?
Filed bug 1225475 for Windows 8 backlogs.

I can't make sense of the data.
I also wasn't able to paste it into a spreadsheet.

What would be useful is a total lag per job, summed up and grouped by master.
A master is what creates the lag; the machines/slaves connected to it are the ones showing its effects.
If we could track the lag that each master has, that would be great.
Flags: needinfo?(armenzg)
We track total lag per master in graphite.
Attached image Buildbot Lag (nov 10)
ActiveData has the same. Here is the 95th percentile lag from the past week, in four-hour increments [1]. Of course, a grid of numbers is inhumane, so I charted it.

[1] http://activedata.allizom.org/tools/query.html#query_id=hUrIHBEl
Attachment #8738146 - Flags: review?(rail)
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/1-2/
Attachment #8738145 - Attachment description: MozReview Request: WIP → MozReview Request: Bug 1212993 - remove get basedir step
Attachment #8738146 - Attachment is obsolete: true
Attachment #8738146 - Flags: review?(rail)
Attachment #8738145 - Flags: review?(rail)
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step

https://reviewboard.mozilla.org/r/44359/#review41313

Removals LGTM. I'm not sure if `validateBuilders()` will work as expected, though.

::: misc.py:2980
(Diff revision 2)
>      return names
>  
>  
> +def validateBuilders(builders):
> +    for b in builders:
> +        if isinstance(b['factory'], ScriptFactory):

This function sounds like `validateScriptFactoryBuilders` to me. Maybe replace the `if` block with
```
if not isinstance(b['factory'], ScriptFactory):
    continue
```
and dedent the block below it?

::: misc.py:3006
(Diff revision 2)
> +            else:
> +                rootdir = '/builds/slave'
> +                basedir = '%s/%s' % (rootdir, slavebuilddir)
> +
> +            b['properties']['basedir'] = basedir
> +            assert 'basedir' in b['properties'], b['name']

This line looks weird to me: the `assert` will pass in any situation, because the line above ensures `b['properties']` contains `basedir`.

The only places where this function may raise exceptions are where you access `b['builddir']` and `b['properties']['platform']`, so effectively it only validates that those are set. Not sure if that is the intent.
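Putting the review suggestions together, the function might end up looking like this. A hypothetical reconstruction, not the landed patch: the `ScriptFactory` class is a stand-in, and the platform-specific branches other than the posix one shown in the diff are elided:

```python
class ScriptFactory(object):
    """Stand-in for buildbotcustom's ScriptFactory, so this sketch runs."""

def validateScriptFactoryBuilders(builders):
    """Set a 'basedir' property on every ScriptFactory builder."""
    for b in builders:
        if not isinstance(b['factory'], ScriptFactory):
            continue
        # These lookups double as validation: a missing 'builddir' or
        # 'platform' raises KeyError at config time rather than at runtime.
        slavebuilddir = b['builddir']
        b['properties']['platform']
        # ... other platform branches elided; the posix case from the diff:
        rootdir = '/builds/slave'
        b['properties']['basedir'] = '%s/%s' % (rootdir, slavebuilddir)
```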
Attachment #8738145 - Flags: review?(rail)
Comment on attachment 8738145 [details]
MozReview Request: Bug 1212993 - remove get basedir step

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44359/diff/2-3/
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/1-2/
Attachment #8739524 - Attachment description: MozReview Request: Address review comments → MozReview Request: Bug 1212993 - remove get basedir step
Attachment #8738145 - Attachment is obsolete: true
Attachment #8738145 - Flags: review?(rail)
Attachment #8739524 - Flags: review?(rail)
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

https://reviewboard.mozilla.org/r/45303/#review41799
Attachment #8739524 - Flags: review?(rail) → review+
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/2-3/
Comment on attachment 8739524 [details]
MozReview Request: Bug 1212993 - remove get basedir step r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/45303/diff/3-4/
Attachment #8739524 - Attachment description: MozReview Request: Bug 1212993 - remove get basedir step → MozReview Request: Bug 1212993 - remove get basedir step r=rail
Attachment #8741006 - Attachment description: MozReview Request: Make sure we add properties to the project objects → MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail
Attachment #8741006 - Flags: review?(rail)
Comment on attachment 8741006 [details]
MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/46111/diff/1-2/
Attachment #8741007 - Attachment is obsolete: true
Comment on attachment 8741006 [details]
MozReview Request: Make sure we add properties to other ScriptFactory instances r=rail

https://reviewboard.mozilla.org/r/46111/#review42629
Attachment #8741006 - Flags: review?(rail) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General