Closed Bug 1009880 (too-many-builders) - Opened 8 years ago - Closed 5 years ago

linux64 test master reconfigs are extremely slow and masters stop accepting new jobs mid-reconfig

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: jlund, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1948] )

No description provided.
This could be due to two separate reconfigs that overlapped: the masters did not finish reconfiguring before being asked to reconfig again.

catlee discovered this to be the case with bm67 and bm51:
[[14:23:11]] <     catlee> | 2014-05-13 12:43:15-0700 [-] configuration update complete
[[14:23:14]] <     catlee> | 2014-05-13 12:44:13-0700 [-] configuration update complete
[[14:23:15]] <     catlee> | 2014-05-13 12:44:15-0700 [-] configuration update complete

I have started a graceful_restart for bm67 and bm51 and am looking into other masters that may be in a similar situation.

Could this be due to anything else?
Trees closed at 2014-05-13 15:06:55 until we can figure this out.
The two reconfigs seem to have been triggered within an hour of each other today.

Masters with overlapping reconfigs have very, very long twistd.log files; the reconfig activity alone can span twistd.log.[2-10+].

I looked at all the linux test masters as they take the longest and thus have the highest probability of overlap.

I found 6 that overlap: bm 51, 52, 53, 54, 67, and 68

I have started a graceful_restart of 51, 52, and 67. I am holding off on doing the other three until these complete unless there are other suggestions.
=== update ===

completed:
bm 51 and 67 have been restarted, re-enabled, and are taking jobs

in progress:
bm 52 is still restarting
bm 53 has just started a graceful_restart

waiting for other masters to complete:
bm 54 and 68
Pending count's back below 2000, so I just reopened the trees.
- all 6 masters have been restarted.
- only 168 builds are currently pending
- trees are open

for above reasons, I'm resolving this bug.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Created bug 1010126 to prevent this from happening in the future.
I don't think this was related to overlapping reconfigs. We had the linux64 masters take a very long time again today with only one reconfig going. I suspect that Rail's addition of a bunch of new test machine names has something to do with this: https://hg.mozilla.org/build/buildbot-configs/rev/84f9cd267546

Until this is fixed, we should probably avoid reconfiging more than one linux64 test master at the same time. Something like this for reconfigs is probably helpful:
python manage_masters.py -f production-masters.json -R build -R try -R scheduler -j32 reconfig
python manage_masters.py -f production-masters.json -M linux32 -M windows -M tegra -M panda -M macosx -j32 reconfig
python manage_masters.py -f production-masters.json -M linux64 -j1 reconfig
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: some buildbot masters are doing too much work due to overlapping reconfigs - pending at 5085 → linux64 test master reconfigs are extremely slow and masters stop accepting new jobs mid-reconfig
Maybe a flag to allow parallel reconfigs up to -j N, but limited to X per master type (-l X?)?
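Rough sketch of how a per-type limit could sit on top of the overall -j parallelism; run_reconfig() and the master dicts below are hypothetical placeholders, not manage_masters.py's actual interface:

# Hypothetical sketch only: cap overall parallelism at -j N and per-master-type
# parallelism at -l X. run_reconfig() and the master dicts are stand-ins, not
# manage_masters.py's real API.
import threading
from concurrent.futures import ThreadPoolExecutor

def run_reconfig(master):
    # Placeholder for the real per-master reconfig work.
    print("reconfiguring %s" % master["name"])

def reconfig_all(masters, global_jobs=32, per_type_limit=1):
    # One semaphore per master type (e.g. "linux64") caps concurrency within
    # that type while other types still run in parallel.
    limits = {m["type"]: threading.Semaphore(per_type_limit) for m in masters}

    def limited(master):
        with limits[master["type"]]:
            run_reconfig(master)

    with ThreadPoolExecutor(max_workers=global_jobs) as pool:
        for future in [pool.submit(limited, m) for m in masters]:
            future.result()  # re-raise any reconfig failure

e.g. reconfig_all(masters, global_jobs=32, per_type_limit=1) would keep the linux64 test masters serialized while everything else still runs 32-wide.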
The problem now seems worse than described in comment #8 -- even after completing their reconfig, new jobs were not being scheduled. We may currently be in a state where we can't reconfig linux64 test masters - I'm not sure. Marking as critical because this needs to be investigated further.
Severity: normal → critical
Ironically, my tmux script made this distinction so it would have been easy to hack there. :)

It sounds like there are two problems here:

1) The ability to limit the parallelism for a particular master type. Both comment #8 and comment #9 suggest solutions here.

2) Prevent overlapping reconfigs on a single master. I think manage_masters.py should use a per-master flag file or at least check to see if a reconfig process is already running. The question remains what we should do when we do find a running reconfig: do we fail out immediately, wait for the in-progress reconfig to finish and then reconfig again immediately, or something else?
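For problem 2, a minimal sketch of a per-master lock file (the lock path and the do_reconfig callable are hypothetical). This variant fails out immediately if a reconfig is already running; a real version would also need to handle stale locks left behind by a crashed reconfig:

# Hypothetical per-master lock so two reconfigs can't overlap; the lock path
# and the do_reconfig callable are illustrative, not existing tooling.
import errno
import os

def reconfig_once(master_name, do_reconfig):
    lock_path = "/tmp/reconfig-%s.lock" % master_name
    try:
        # O_EXCL makes the open fail if another reconfig already holds the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except OSError as e:
        if e.errno == errno.EEXIST:
            raise RuntimeError("reconfig already in progress for %s" % master_name)
        raise
    try:
        os.write(fd, ("%d\n" % os.getpid()).encode())
        do_reconfig(master_name)
    finally:
        os.close(fd)
        os.remove(lock_path)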
(In reply to Ben Hearsum [:bhearsum] from comment #10)
> The problem now seems worse than described in comment #8 -- even after
> completing their reconfig, new jobs were not being scheduled. We may
> currently be in a state where we can't reconfig linux64 test masters - I'm
> not sure. Marking as critical because this needs to be investigated further.

Can I assume that these masters still work if gracefully shut down and restarted? If that's the case, we should be working on rolling restart code for masters ASAP.
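A rough shape for such rolling restarts, assuming hypothetical restart/health-check callables rather than any existing tooling:

# Hypothetical rolling-restart loop: restart a small batch of masters at a
# time and wait until they're taking jobs again before moving on, so the pool
# as a whole keeps accepting work. The callables are stand-ins.
import time

def rolling_restart(masters, start_graceful_restart, is_taking_jobs,
                    batch_size=1, poll_seconds=60):
    for i in range(0, len(masters), batch_size):
        batch = masters[i:i + batch_size]
        for master in batch:
            start_graceful_restart(master)
        while not all(is_taking_jobs(master) for master in batch):
            time.sleep(poll_seconds)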
(In reply to Ben Hearsum [:bhearsum] from comment #8)
> I don't think this was related to overlapping reconfigs. We had the linux64
> masters take a very long time again today with only one reconfig going. I
> suspect that Rail's adding of a bunch of new test machines names has
> something to do with this:
> https://hg.mozilla.org/build/buildbot-configs/rev/84f9cd267546
> 
> Until this is fixed, we should probably avoid reconfiging more than one
> linux64 test master at the same time. Something like this for reconfigs is
> probably helpful:
> python manage_masters.py -f production-masters.json -R build -R try -R
> scheduler -j32 reconfig
> python manage_masters.py -f production-masters.json -M linux32 -M windows -M
> tegra -M panda -M macosx -j32 reconfig
> python manage_masters.py -f production-masters.json -M linux64 -j1 reconfig

I tried this.  The linux64 reconfig takes a long time per master; so much so that it appeared the masters would still be reconfiguring well past my EOD if I continued reconfiguring them serially.

I switched to graceful restarts.  bm51 stopped but failed to restart automatically [!] (I restarted it manually 10 min after it stopped); bm52 restarted successfully, but not until 3 hours after I started the graceful restart process.  With 7 more linux64 masters to go, and 3 hours per master, I'm hesitant to keep going past my EOD.

Are we using all the new spot slaves?  If not, we could try removing them to see if that speeds up reconfigs again.  Or we could try moving the masters to faster instances, or splitting the linux64 slave_platforms further.
bm52 ended up with 2 buildbot processes started in the exact same minute =\
nagios was throwing errors.

I didn't want to leave things broken, so I killed the idle-looking one (the one with less TIME in ps -ef).  twistd.pid went away, so I recreated it with the running buildbot pid.  twistd.log is still being populated, so we may be good.
Recent changes in buildbot-configs, and their effect on the builder count in a copy of bm51-tests1-linux64:

count	rev	        comment
5102	68c34c5999d5	back out gaia-ui
5173	b1a2480965e0	add gaia-ui; add web-platform on win cedar
5102	50f3140e4392	fix Android 2.3 emulators
5102	ce932691fbed	Android 4.0 debug on cedar, android 2.3 emulator
4759	2c71a560b78d	build_space for b2g debug, t'bird
4759	f7aaa48929ad	last friday as baseline

So +343 at the start of this week.
I tried out a sampling profiler (https://github.com/bos/statprof.py/blob/master/statprof.py) from bm51's manhole, with the following results:

>>> import statprof
>>> statprof.start()
>>> # wait a bit
>>> statprof.stop()
>>> statprof.display()
  %   cumulative      self          
 time    seconds   seconds  name    
 30.24     60.02     60.02  builder.py:735:attached
  9.12     18.10     18.10  builder.py:806:detached
  7.43     23.79     14.75  buildslave.py:407:canStartBuild
  5.34     10.60     10.60  builder.py:785:detached
  3.57      7.08      7.08  builder.py:71:isBusy
  2.12      4.22      4.22  master.py:266:getBuildersForSlave
  2.05      4.06      4.06  banana.py:175:dataReceived
  1.33     38.17      2.64  master.py:261:slaveLost
  1.14      2.26      2.26  banana.py:289:_encode
  1.06      2.10      2.10  banana.py:30:int2b128
  0.99      1.96      1.96  builder.py:70:isBusy
  0.87     12.04      1.74  banana.py:296:_encode
  0.86      1.70      1.70  builder.py:734:attached
  0.81      1.60      1.60  buildslave.py:406:canStartBuild

We're spending a *TON* of time in attached/detached/canStartBuild. All of these are O(number_of_builders), so as we add more builders, they get worse.
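To make the scaling concrete, an illustrative snippet (not Buildbot's actual source) of the kind of work that happens on every attach/detach:

# Illustrative only, not Buildbot's code: on every slave connect/disconnect the
# master effectively walks the full builder list to work out which builders
# that slave serves. With ~5100 builders, each attach/detach does thousands of
# checks, and a reconfig-triggered wave of reconnects multiplies that by the
# number of slaves.
def builders_for_slave(slavename, builders):
    return [b for b in builders if slavename in b.slavenames]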

Nick also plotted time spent for the get_basedir step for buildbot-master51: http://people.mozilla.org/~nthomas/builddir%20lag.html. This shows the total amount of time for the master to tell a slave to run `pwd` and report the results. Anything over 0.1s is a bad sign.

A few things we could do:
- Add more masters. We'll have proportionately fewer connects/disconnects as we spread the load around.
- Remove builders by disabling branches or tests.
- Split up the masters (again) so that each one has a subset of the builders...this could be tricky since slaves would get different builders depending on which master they connect to.
- Consolidate builders (e.g. have different chunks distinguished by properties rather than builders). This is definitely not a short-term fix.
(correction, the above profiling was done on bm67)
Depends on: 1010674
We could also change the slave definitions so that only use1 slaves are listed on use1 masters. That removes the possibility of cross-region connections, but it turns out we've already got slavealloc set up like that - all the use1 masters were disabled early this week and ../gettac returned 'no allocation available' for a use1 slave.
I had an idea that I shared with coop (I believe I shared with catlee in the past).

This is a long-term solution that might take a bit to get right so it might be appropriate for this problem.
The idea is to reduce the # of builders, which IIUC could be the root problem. Correct me if I got it wrong.

We can move to generic builders - only one per platform. It remains to be proven that this would work at scale.

If our builders needed these properties to start:
* branch: m-c
* builder_name: reftest-1
* script_repo: hg.m.o/build/mozharness
* script_name: desktop_tests.py
* script_params: --param1 meh

This would remove the need to have that many builders.
It would mean that one builder could spawn all the jobs for one build.

Another approach would be to trigger the builder with:
* branch: m-c
* builder_name: reftest-1
* all_the_info_you_need: URL

Such a URL would contain all the info needed to run the job.
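To make that concrete, here is a rough master-side sketch of one generic builder per platform driven entirely by properties; the property names and wiring are hypothetical, not an existing config:

# Hypothetical sketch of a single generic test builder per platform that gets
# branch/script_repo/script_name/script_params as build properties instead of
# having one builder per test chunk. Property names are illustrative.
from buildbot.process.factory import BuildFactory
from buildbot.process.properties import WithProperties
from buildbot.steps.shell import ShellCommand

generic_test_factory = BuildFactory()
generic_test_factory.addStep(ShellCommand(
    name="run_test_script",
    command=WithProperties(
        "hg clone %(script_repo)s scripts && "
        "python scripts/%(script_name)s %(script_params)s"),
))

# One builder per platform; the scheduler (or sendchange) supplies the
# properties per job, so adding a new test suite doesn't add a builder.
generic_linux64_tests = {
    "name": "generic-linux64-test",
    "slavenames": [],  # filled from the usual slave lists
    "builddir": "generic-linux64-test",
    "factory": generic_test_factory,
}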
(In reply to Armen Zambrano (at TRIBE on 14th/15th) [:armenzg] (EDT/UTC-4) from comment #19)
> I had an idea that I shared with coop (I believe I shared with catlee in the
> past).
> 
> This is a long-term solution that might take a bit to get right so it might
> be appropriate for this problem.
> This is to reduce # of builders which IIUC could be the root problem.
> Correct me if I got it wrong.
> 
> We can move to generic builders. Only one per platform. It is to be proven
> that it would work at scale.
> 
> If our builders needed these properties to start:
> * branch: m-c
> * builder_name: reftest-1
> * script_repo: hg.m.o/build/mozharness
> * script_name: desktop_tests.py
> * script_params: --param1 meh
> 
> This would make it easy to not have the need to have that many builders.
> This would mean that one builder would spur all the jobs for one build.
> 
> Another approach would be to trigger the builder with:
> * branch: m-c
> * builder_name: reftest-1
> * all_the_info_you_need: URL
> 
> Such URL would contain all the info needed to run the job.

I guess we'd have to be careful about different builders sharing the same working directory on the slave (and consider the impact on clobbers). It might also make it difficult to compare job results against previous runs of the same builder (i.e. typically in the buildbot web interface I look at the list of historical builds for a given builder to compare successful and unsuccessful runs - now all builders would have the same builder name, so it could become more difficult to isolate the builds that belong to the same job).

I guess also anywhere else where we rely on the builder name, we might have to do some refactoring, since the builder name won't uniquely identify the job anymore (e.g. email notifications, pulse notifications?).
Yeah, that's a good idea. It's what I meant by "consolidate builders" in comment 16.

For tests (where we have far more builders), we already share working directories, so this should be easier. Getting TBPL to handle these could be tricky though.
(In reply to Chris AtLee [:catlee] from comment #21)
> For tests (where we have far more builders), we already share working
> directories, so this should be easier. Getting TBPL to handle these could be
> tricky though.

I don't think it's practical (or worth the time, given TBPL EOL later this year) to make these changes - as it would require substantial changes to the TBPL architecture.

For treeherder, it would mean changing the ETL to pull various properties from new fields in builds-4hr - rather than deriving them from the buildername at point of ingestion. This is desirable anyway (the buildername regexp is a monstrosity that we all want to go away), so seems like a reasonable plan once TBPL is EOL.
s/to make these changes/to make these changes to TBPL/
For TBPL we could have the thing that generates builds-4hr lie about the buildername for TBPL's sake. I don't know what else this would break though :)
Alias: too-many-builders
Depends on: 1011488
Blocks: 1007929
Depends on: 1014318
Component: Buildduty → General Automation
QA Contact: bugspam.Callek → catlee
not critical now, but we're still living on the edge
Severity: critical → major
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1948]
Status: REOPENED → RESOLVED
Closed: 8 years ago → 5 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General