Closed Bug 563458 Opened 14 years ago Closed 14 years ago

Load on test-master01/test-master02 is too high

Categories

(Release Engineering :: General, defect, P1)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

(Whiteboard: [talos][unittest][buildslaves][buildbot])

Attachments

(8 files)

Since switching on unittests on try, we've started to overload the talos master.  This was fixed on the build masters by deploying a slave-side patch that buffers log output (see the rough sketch after the options below).  Unfortunately the talos slaves are running an older version of buildbot where the patch doesn't apply cleanly.

A few options:

* Port slave buffering to this version of buildbot

* Start investigating buildbot 0.8.0 (which has this patch) now

* Split up the masters so different platforms run on different masters
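For context, the slave-side buffering works roughly like this (an illustrative sketch only; the class name, the thresholds and the send_to_master callable are made up here and are not the actual patch):

 import time

 class BufferingLogSender:
     # Collect log chunks and flush them to the master in batches,
     # instead of sending one network message per chunk/line.
     def __init__(self, send_to_master, max_bytes=16*1024, max_interval=5.0):
         self.send_to_master = send_to_master  # callable doing the real upload
         self.max_bytes = max_bytes
         self.max_interval = max_interval
         self.buffer = []
         self.buffered_bytes = 0
         self.last_flush = time.time()

     def add_chunk(self, chunk):
         self.buffer.append(chunk)
         self.buffered_bytes += len(chunk)
         if (self.buffered_bytes >= self.max_bytes or
                 time.time() - self.last_flush >= self.max_interval):
             self.flush()

     def flush(self):
         if self.buffer:
             self.send_to_master("".join(self.buffer))
             self.buffer = []
             self.buffered_bytes = 0
         self.last_flush = time.time()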
We've mostly upgraded our test slaves to buildbot 0.8.0, and split out the slaves by platform to three buildbot 0.7.10p1 masters. catlee is working on moving the masters to buildbot 0.8.0 in bug 568568. 

But we are still having big issues with those three masters getting seriously overloaded and a lot of jobs getting backed up. Disabling talos on try will help (bug 579573) but there will still be unit tests to process. I've noticed that once a master starts to get behind it tends to get worse and worse, until we reach some tipping point and we rapidly process lots of jobs. This suggests a pathological failure mode in buildbot.

For example, test-master01 currently has 600+ Snow Leopard and 500+ Fedora 12x64 jobs pending. It also claims to have only a small proportion of its slaves connected, and if you look at a random selection of slave twistd.log files you find things like:
 2010-07-26 20:55:28-0700 [Broker,client] message from master: attached
 2010-07-26 20:59:44-0700 [Broker,client] lost remote
 2010-07-26 21:02:37-0700 [Broker,client] message from master: attached
 2010-07-26 21:04:59-0700 [Broker,client] lost remote
 2010-07-26 21:09:04-0700 [Broker,client] message from master: attached
 2010-07-26 21:10:47-0700 [Broker,client] lost remote
 2010-07-26 21:15:53-0700 [Broker,client] message from master: attached
 2010-07-26 21:17:59-0700 [Broker,client] lost remote
 2010-07-26 21:30:13-0700 [Broker,client] message from master: attached
 2010-07-26 21:32:05-0700 [Broker,client] lost remote
I grepped to get that output, but confirmed no jobs are happening in between. The conclusion is that slaves are only connected for a couple of minutes before falling off for several minutes.
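For reference, a rough script that does the same check as the grep above (illustrative only; it assumes the twistd.log timestamp format shown above and just prints how long each attached/lost-remote interval lasted):

 from datetime import datetime

 # Print how long each "attached" -> "lost remote" interval lasted,
 # based on the timestamps in a slave's twistd.log.
 attached_at = None
 with open("twistd.log") as log:
     for line in log:
         line = line.lstrip()
         try:
             stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
         except ValueError:
             continue
         if "message from master: attached" in line:
             attached_at = stamp
         elif "lost remote" in line and attached_at is not None:
             print("connected for %s" % (stamp - attached_at))
             attached_at = None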

I would hazard a guess that the master gets busy with genuine jobs, then slaves start dropping the connection (slave initiates this?). The disconnect/reconnect causes more work for the master, so more slaves are likely to fall off, and it all snowballs until the master is seriously behind and barely processing anything.
I disconnected most of talos-r3-snow-04 through 28 just before the steep downward drop of the linux64 line. Unknown if that created enough breathing room or disconnected a slave in a bad state, but it certainly perked the master up. Those slaves are now reconnected.

The big drops are from merging jobs on non-try branches; there will now be a slower decline as the try jobs are run individually.
Summary: Load on talos-master is too high → Load on talos-master02/test-master01/test-master02 is too high
* bug 579573 has landed (reduces load)
* bug 557918 has landed enabling unit tests for 10.5 & Fedora (test-master01) and Win7 (talos-master02)

Out of all three masters, talos-master02 is the only one in good condition.

What you did in comment 2 is very interesting.
Depends on: 568568
I hacked this together from a few data sources. The number of running builds comes from the status db, after I got a bit closer to generating plots from there. Pending builds come from the waterfall scraper I've been running for a while; note that this data uses the right-hand y-axis.

And the new one, slave ping failures, is generated from the master's twistd.log, grepping for lines like this:
 2010-07-29 12:51:04-0700 [Broker,696,10.250.51.68] I tried to tell the slave that the build <Build Rev3 Fedora 12x64 tryserver opt test crashtest> started, but remote_startBuild failed: slave ping failed

I've binned these errors into 5-minute periods. I think we hit this when the slave decides the master isn't responding, so it disconnects and reconnects. Lots of slaves were doing this: connected for a couple of minutes, disconnected for 5 or so.
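Roughly how the binning can be done (an illustrative sketch, not the exact script used to generate the attached graph):

 import re
 from collections import Counter
 from datetime import datetime

 # Count "slave ping failed" lines from the master's twistd.log in 5-minute bins.
 bins = Counter()
 stamp_re = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}")
 with open("twistd.log") as log:
     for line in log:
         if "slave ping failed" not in line:
             continue
         m = stamp_re.match(line)
         if not m:
             continue
         t = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M")
         t = t.replace(minute=(t.minute // 5) * 5)  # round down to the 5-minute bin
         bins[t] += 1

 for t in sorted(bins):
     print("%s %d" % (t.strftime("%Y-%m-%d %H:%M"), bins[t]))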

There is also a CPU graph, which I attached above.

Interpretation of the graphs:
* the CPU load became high quite rapidly at ~ 12:45, corresponding to an increase in the number of jobs that were running
* a large number of jobs must have been submitted to cause the master to run lots of jobs and rapidly get behind. Should try to trace this
* the master averaged 70 running jobs between 1 and 3pm, with the number of slave ping errors gradually increasing
* after 3pm the ping error rate increases rapidly, and the number of jobs running drops. This may have been triggered by another influx of incoming jobs or a critical number of slaves in disconnect/reconnect cycles
* everything goes to hell in a hand basket
*
* distinct lack of profit
* around 6:15 I started disconnecting about 15 snow leopard slaves, and around 7pm disconnected 15 fedora64 slaves; enough to get the slave ping errors under control
* the pending queue slowly comes under control over the next few hours
* all slaves reconnected

My feeling is that when the master gets busy with handling logs, incoming jobs, etc., it starts to become unresponsive to slaves. The additional load from disconnecting/reconnecting slaves causes more disconnections, and so on until the master is spiralling to its doom. The current master VMs in Castro can only handle 60 concurrent jobs, so having 100 slaves on them is asking for trouble at busy times. 

--> We need grunty hardware for the masters, or at least running a master per platform. <--   (Yes I know there's no capacity in Castro for this)

catlee looked at the log sizes per job, and found that jsreftest is very verbose. I'll leave him to add more details on that. He's also got bug 583183 to reduce the load from bzipping the logs, ideally by not compressing separately for emailing tinderbox and for storing to disk.
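The "compress once" idea could look something like this (a hypothetical sketch; compress_log_once and send_to_tinderbox are made-up names, not anything in our buildbot code):

 import bz2

 def compress_log_once(log_text, log_path, send_to_tinderbox):
     # Bzip the log a single time and reuse the compressed bytes for both
     # the on-disk copy and the tinderbox mail, instead of compressing twice.
     compressed = bz2.compress(log_text.encode("utf-8"))
     with open(log_path + ".bz2", "wb") as f:
         f.write(compressed)
     send_to_tinderbox(compressed)  # e.g. base64-encode and attach to the mail
     return compressed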
On the slaves, a slave ping failure looks like:
[attach to master]
2010-08-02 17:53:35-0700 [Broker,client] SlaveBuilder.remote_print(Rev3 Fedora 12 electrolysis debug test mochitests-1/5): message from master: attached
2010-08-02 17:54:22-0700 [Broker,client] SlaveBuilder.remote_print(Rev3 Fedora 12 mozilla-central debug test reftest): message from master: ping
2010-08-02 17:54:22-0700 [Broker,client] SlaveBuilder.remote_ping(<SlaveBuilder 'Rev3 Fedora 12 mozilla-central debug test reftest' at 142290220>)
2010-08-02 17:54:38-0700 [Broker,client] lost remote

2010-08-02 17:54:38-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x879b0cc> will retry in 2 seconds
2010-08-02 17:54:38-0700 [Broker,client] Stopping factory <buildbot.slave.bot.BotFactory instance at 0x87935ec>
2010-08-02 17:54:40-0700 [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x87935ec>
2010-08-02 17:55:38-0700 [Broker,client] message from master: attached
There are several slaves disconnected from test-master01 and 02 right now to try to prevent those masters getting into broken states when overloaded. We had another episode today due to freeze day + try + branch approvals. Not merging try requests (bug 582734) is making this problem more of an everyday occurrence, though.

The slaves are:
* tm01:
 * talos-r3-snow-0nn:  6,24,29,30,40,41,44,46,47,50,52
 * talos-r3-fed64-0nn: 33,34,35,40,42,43,46,47,48,52,53
* tm02:
 * talos-r3-leopard-0nn: 3,12,16,17,18,22,26,27,28,29,31,39,41,44,52
 * talos-r3-fed-0nn: 15,17,20,21,29,30,33,38,41,42,46,49

That's 22 and 27 slaves respectively, which leaves a bit more than the ~70 we can cope with when the masters have large pending counts (which seems to be worse than 80 jobs running with a small pending count).
We've got lots of ix machines in Castro. Can we re-purpose one or two of them as masters? We could get that up in pretty short order if we used one which already has Linux on it.
That's definitely an option, but... the load has improved since upgrading to buildbot 0.8.0 today, with 60-80% CPU usage for masters that are processing jobs as fast as they can. I've reconnected some slaves, but left 6 or 7 from each platform disconnected:
* tm01:
 * talos-r3-snow-0nn:  24,29,41,44,50,52
 * talos-r3-fed64-0nn: 43,46,47,48,52,53
* tm02:
 * talos-r3-leopard-0nn: 28,29,31,39,41,44,52
 * talos-r3-fed-0nn: 33,38,41,42,46,49
Load was still fine so I reconnected the last of the slaves.
Memory usage on test-master{01,02} has now passed 3GB of RAM and they are using swap.
The wait times today were pretty bad:
* 6712 jobs with 46.03%
while last week we had more jobs but a better percentage:
* 10715 jobs with 52.49%

CPU did not look that bad in either of them.
Summary: Load on talos-master02/test-master01/test-master02 is too high → Load on test-master01/test-master02 is too high
The increase in cache usage is associated with the increase in pending jobs.

catlee is debugging the memory issue and we will restart the masters since they won't lose the pending queue.
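For tracking the growth between restarts, even something as simple as sampling the master's RSS from /proc every few minutes would do (a hypothetical sketch, not what catlee is actually using):

 import time

 def master_rss_kb(pid):
     # Read the resident set size (VmRSS) of a process from /proc, in kB.
     with open("/proc/%d/status" % pid) as f:
         for line in f:
             if line.startswith("VmRSS:"):
                 return int(line.split()[1])
     return None

 MASTER_PID = 12345  # hypothetical; read from the master's twistd.pid in practice
 while True:
     rss = master_rss_kb(MASTER_PID)
     with open("master-rss.log", "a") as out:
         out.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), rss))
     time.sleep(300)  # sample every 5 minutes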
Assignee: nobody → catlee
Priority: P3 → P1
Had to restart buildbot on test-master02, it had become unresponsive.
(In reply to comment #14)
> Had to restart buildbot on test-master02, it had become unresponsive.
So from the last reboot on the 4th to death was 6 days.
We will probably get close to this state again next week.
Blocks: 580410
and restarted test-master01 as well.

Both test-master01 and test-master02 have this change, which should help: http://hg.mozilla.org/build/buildbot/rev/7348713b55c5
Attached image io_wait of Aug. 10th
catlee, your fix seems to have made the memory usage growth closer to flat (at least on test-master01), unlike the steep growth seen after the reboot of the 4th (I will need one more day to have a screenshot that shows this clearly).

The attached screenshot shows something I had noticed previously and that it seems we have been hitting for the first time:
the CPU usage shows "io_wait", which is now almost imperceptible.
The state of pending jobs is now back to normal. If you look back 24 hours, we had more than 250 pending jobs for each of the 4 platforms on tm{01-02}.

We are now down to 0 pending jobs. The queue went from hundreds of jobs (3 platforms were competing to see if they could pass 800 pending jobs) down to 0 in 10 hours, a decrease of 200-300 jobs per hour.
Note that the tree was apparently closed due to bug 586179, according to https://wiki.mozilla.org/Tree_Closures.
This is the last of my screenshots.
We can see in it that since the reboot of the 4th the memory usage kept on growing until we hit the ceiling.

Since catlee's fix the memory usage has stayed almost flat.
CPU usage on all 3 masters is good. I believe the only master without the patch is talos-master02, but its growth is not as steep as it was on the other two masters (so we have quite some time to deploy the fix).

Do you guys think we could close this bug after a few more days without issues?
Looks good enough to me.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering