Closed Bug 563458 Opened 14 years ago Closed 14 years ago

Load on test-master01/test-master02 is too high

Categories

(Release Engineering :: General, defect, P1)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

(Whiteboard: [talos][unittest][buildslaves][buildbot])

Attachments

(8 files)

Since switching on unittests on try, we've started to overload the talos master.  This was fixed on the build masters by deploying a slave-side patch that buffers log output (see the rough sketch after the options below).  Unfortunately the talos slaves are running an older version of buildbot where the patch doesn't apply cleanly.

A few options:

* Port slave buffering to this version of buildbot

* Start investigating buildbot 0.8.0 (which has this patch) now

* Split up the masters so different platforms run on different masters
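For context, the slave-side buffering works roughly like this (an illustrative sketch only; the class name, the thresholds and the send_to_master callable are made up here and are not the actual patch):

 import time

 class BufferingLogSender:
     # Collect log chunks and flush them to the master in batches,
     # instead of sending one network message per chunk/line.
     def __init__(self, send_to_master, max_bytes=16*1024, max_interval=5.0):
         self.send_to_master = send_to_master  # callable doing the real upload
         self.max_bytes = max_bytes
         self.max_interval = max_interval
         self.buffer = []
         self.buffered_bytes = 0
         self.last_flush = time.time()

     def add_chunk(self, chunk):
         self.buffer.append(chunk)
         self.buffered_bytes += len(chunk)
         if (self.buffered_bytes >= self.max_bytes or
                 time.time() - self.last_flush >= self.max_interval):
             self.flush()

     def flush(self):
         if self.buffer:
             self.send_to_master("".join(self.buffer))
             self.buffer = []
             self.buffered_bytes = 0
         self.last_flush = time.time()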
We've mostly upgraded our test slaves to buildbot 0.8.0, and split out the slaves by platform to three buildbot 0.7.10p1 masters. catlee is working on moving the masters to buildbot 0.8.0 in bug 568568. 

But we are still having big issues with those three masters getting seriously overloaded and a lot of jobs getting backed up. Disabling talos on try will help (bug 579573) but there will still be unit tests to process. I've noticed that once a master starts to get behind it tends to get worse and worse, until we reach some tipping point and we rapidly process lots of jobs. This suggests a pathological failure mode in buildbot.

For example, test-master01 currently has 600+ Snow Leopard and 500+ Fedora 12x64 jobs pending. It also claims to have only a small proportion of its slaves connected, and if you look at a random selection of slave twistd.log files you find things like:
 2010-07-26 20:55:28-0700 [Broker,client] message from master: attached
 2010-07-26 20:59:44-0700 [Broker,client] lost remote
 2010-07-26 21:02:37-0700 [Broker,client] message from master: attached
 2010-07-26 21:04:59-0700 [Broker,client] lost remote
 2010-07-26 21:09:04-0700 [Broker,client] message from master: attached
 2010-07-26 21:10:47-0700 [Broker,client] lost remote
 2010-07-26 21:15:53-0700 [Broker,client] message from master: attached
 2010-07-26 21:17:59-0700 [Broker,client] lost remote
 2010-07-26 21:30:13-0700 [Broker,client] message from master: attached
 2010-07-26 21:32:05-0700 [Broker,client] lost remote
I grepped to get that output, but confirmed no jobs are happening in between. The conclusion is that slaves are only connected for a couple of minutes before falling off for several minutes.
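For reference, a rough script that does the same check as the grep above (illustrative only; it assumes the twistd.log timestamp format shown above and just prints how long each attached/lost-remote interval lasted):

 from datetime import datetime

 # Print how long each "attached" -> "lost remote" interval lasted,
 # based on the timestamps in a slave's twistd.log.
 attached_at = None
 with open("twistd.log") as log:
     for line in log:
         line = line.lstrip()
         try:
             stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
         except ValueError:
             continue
         if "message from master: attached" in line:
             attached_at = stamp
         elif "lost remote" in line and attached_at is not None:
             print("connected for %s" % (stamp - attached_at))
             attached_at = None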

I would hazard a guess that the master gets busy with genuine jobs, then slaves start dropping the connection (slave initiates this?). The disconnect/reconnect causes more work for the master, so more slaves are likely to fall off, and it all snowballs until the master is seriously behind and barely processing anything.
I disconnected most of talos-r3-snow-04 through 28 just before the steep downward drop of the linux64 line. Unknown if that created enough breathing room or disconnected a slave in a bad state, but it certainly perked the master up. Those slaves are now reconnected.

The big drops are from merging jobs on non-try branches; there will now be a slower decline as the try jobs are run individually.
Summary: Load on talos-master is too high → Load on talos-master02/test-master01/test-master02 is too high
* bug 579573 has landed (reduces load)
* bug 557918 has landed enabling unit tests for 10.5 & Fedora (test-master01) and Win7 (talos-master02)

Out of all three masters, talos-master02 is the only one in good condition.

What you did in comment 2 is very interesting.
Depends on: 568568
I hacked this together from a few data sources. The number of running builds comes from the status db, after I got a bit closer to generating plots from there. Pending builds come from the waterfall scraper I've been running for a while; note that this data uses the right-hand y-axis.

And the new one, slave ping failures, is generated from the master's twistd.log, grepping for lines like this:
 2010-07-29 12:51:04-0700 [Broker,696,10.250.51.68] I tried to tell the slave that the build <Build Rev3 Fedora 12x64 tryserver opt test crashtest> started, but remote_startBuild failed: slave ping failed

I've binned these errors into 5-minute periods. I think we hit this when the slave decides the master isn't responding, so it disconnects and reconnects. Lots of slaves were doing this: connected for a couple of minutes, disconnected for 5 or so.
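Roughly how the binning can be done (an illustrative sketch, not the exact script used to generate the attached graph):

 import re
 from collections import Counter
 from datetime import datetime

 # Count "slave ping failed" lines from the master's twistd.log in 5-minute bins.
 bins = Counter()
 stamp_re = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}")
 with open("twistd.log") as log:
     for line in log:
         if "slave ping failed" not in line:
             continue
         m = stamp_re.match(line)
         if not m:
             continue
         t = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M")
         t = t.replace(minute=(t.minute // 5) * 5)  # round down to the 5-minute bin
         bins[t] += 1

 for t in sorted(bins):
     print("%s %d" % (t.strftime("%Y-%m-%d %H:%M"), bins[t]))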

There is also a CPU graph, which I attached above.

Interpretation of the graphs:
* the CPU load became high quite rapidly at ~ 12:45, corresponding to an increase in the number of jobs that were running
* a large number of jobs must have been submitted to cause the master to run lots of jobs and rapidly get behind. Should try to trace this
* the master averaged 70 running jobs between 1 and 3pm, with the number of slave ping errors gradually increasing
* after 3pm the ping error rate increases rapidly, and the number of jobs running drops. This may have been triggered by another influx of incoming jobs or a critical number of slaves in disconnect/reconnect cycles
* everything goes to hell in a hand basket
*
* distinct lack of profit
* around 6:15 I started disconnecting about 15 snow leopard slaves, and around 7pm disconnected 15 fedora64 slaves; enough to get the slave ping errors under control
* the pending queue slowly comes under control over the next few hours
* all slaves reconnected

My feeling is that when the master gets busy with handling logs, incoming jobs, etc., it starts to become unresponsive to slaves. The additional load from disconnecting/reconnecting slaves causes more disconnections, and so on until the master is spiralling to its doom. The current master VMs in Castro can only handle 60 concurrent jobs, so having 100 slaves on them is asking for trouble at busy times. 

--> We need grunty hardware for the masters, or at least running a master per platform. <--   (Yes I know there's no capacity in Castro for this)

catlee looked at the log sizes per job, and found that jsreftest is very verbose. I'll leave him to add more details on that. He's also got bug 583183 to reduce the load from bzipping the logs, ideally by not compressing separately for emailing tinderbox and for storing to disk.
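The "compress once" idea could look something like this (a hypothetical sketch; compress_log_once and send_to_tinderbox are made-up names, not anything in our buildbot code):

 import bz2

 def compress_log_once(log_text, log_path, send_to_tinderbox):
     # Bzip the log a single time and reuse the compressed bytes for both
     # the on-disk copy and the tinderbox mail, instead of compressing twice.
     compressed = bz2.compress(log_text.encode("utf-8"))
     with open(log_path + ".bz2", "wb") as f:
         f.write(compressed)
     send_to_tinderbox(compressed)  # e.g. base64-encode and attach to the mail
     return compressed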
On the slaves, a slave ping failure looks like:
[attach to master]
2010-08-02 17:53:35-0700 [Broker,client] SlaveBuilder.remote_print(Rev3 Fedora 12 electrolysis debug test mochitests-1/5): message from master: attached
2010-08-02 17:54:22-0700 [Broker,client] SlaveBuilder.remote_print(Rev3 Fedora 12 mozilla-central debug test reftest): message from master: ping
2010-08-02 17:54:22-0700 [Broker,client] SlaveBuilder.remote_ping(<SlaveBuilder 'Rev3 Fedora 12 mozilla-central debug test reftest' at 142290220>)
2010-08-02 17:54:38-0700 [Broker,client] lost remote

2010-08-02 17:54:38-0700 [Broker,client] <twisted.internet.tcp.Connector instance at 0x879b0cc> will retry in 2 seconds
2010-08-02 17:54:38-0700 [Broker,client] Stopping factory <buildbot.slave.bot.BotFactory instance at 0x87935ec>
2010-08-02 17:54:40-0700 [-] Starting factory <buildbot.slave.bot.BotFactory instance at 0x87935ec>
2010-08-02 17:55:38-0700 [Broker,client] message from master: attached
There are several slaves disconnected from test-master01 and 02 right now to try to prevent those masters getting into broken states when overloaded. We had another episode today due to freeze day + try + branch approvals. Not merging try requests (bug 582734) is making this problem more of an everyday occurrence, though.

The slaves are:
* tm01:
 * talos-r3-snow-0nn:  6,24,29,30,40,41,44,46,47,50,52
 * talos-r3-fed64-0nn: 33,34,35,40,42,43,46,47,48,52,53
* tm02:
 * talos-r3-leopard-0nn: 3,12,16,17,18,22,26,27,28,29,31,39,41,44,52
 * talos-r3-fed-0nn: 15,17,20,21,29,30,33,38,41,42,46,49

That's 22 and 27 slaves respectively, which leaves a bit more than the ~70 we can cope with when the masters have large pending counts (which seems to be worse than 80 jobs running with a small pending count).
We've got lots of ix machines in Castro. Can we re-purpose one or two of them as masters? We could get that up in pretty short order if we used one which already has Linux on it.
That's definitely an option, but... the load has improved since upgrading to buildbot 0.8.0 today, with 60-80% CPU usage for masters that are processing jobs as fast as they can. I've reconnected some slaves, but left 6 or 7 from each platform disconnected:
* tm01:
 * talos-r3-snow-0nn:  24,29,41,44,50,52
 * talos-r3-fed64-0nn: 43,46,47,48,52,53
* tm02:
 * talos-r3-leopard-0nn: 28,29,31,39,41,44,52
 * talos-r3-fed-0nn: 33,38,41,42,46,49
Load was still fine so I reconnected the last of the slaves.
Memory usage on test-master{01,02} has now passed 3GB of RAM and they are using swap.
The wait times today were pretty bad:
* 6712 jobs with 46.03%
while last week we had more jobs but a better percentage:
* 10715 jobs with 52.49%

CPU did not look that bad in either of them.
Summary: Load on talos-master02/test-master01/test-master02 is too high → Load on test-master01/test-master02 is too high
The increase in cache usage is associated with the increase in pending jobs.

catlee is debugging the memory issue and we will restart the masters since they won't lose the pending queue.
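For tracking the growth between restarts, even something as simple as sampling the master's RSS from /proc every few minutes would do (a hypothetical sketch, not what catlee is actually using):

 import time

 def master_rss_kb(pid):
     # Read the resident set size (VmRSS) of a process from /proc, in kB.
     with open("/proc/%d/status" % pid) as f:
         for line in f:
             if line.startswith("VmRSS:"):
                 return int(line.split()[1])
     return None

 MASTER_PID = 12345  # hypothetical; read from the master's twistd.pid in practice
 while True:
     rss = master_rss_kb(MASTER_PID)
     with open("master-rss.log", "a") as out:
         out.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), rss))
     time.sleep(300)  # sample every 5 minutes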
Assignee: nobody → catlee
Priority: P3 → P1
Had to restart buildbot on test-master02, it had become unresponsive.
(In reply to comment #14)
> Had to restart buildbot on test-master02, it had become unresponsive.
So from the last reboot on the 4th to death was 6 days.
We will probably get close to this state again next week.
Blocks: 580410
and restarted test-master01 as well.

Both test-master01 and test-master02 have this change, which should help: http://hg.mozilla.org/build/buildbot/rev/7348713b55c5
Attached image io_wait of Aug. 10th
catlee, your fix seems to have made the memory usage growth closer to flat (at least on test-master01), unlike the steep growth seen after the reboot of the 4th (I will need one more day to have a screenshot that shows this clearly).

The attached screenshot shows something I had noticed previously and that it seems we have been hitting for the first time:
the CPU usage shows "io_wait", which is now almost imperceptible.
The state of pending jobs is now back to normal. If you look back 24 hours, we had more than 250 pending jobs for each of the 4 platforms on tm{01-02}.

We are now down to 0 pending jobs. The queue went from hundreds of jobs (3 platforms were competing to see if they could pass 800 pending jobs) down to 0 in 10 hours, a decrease of 200-300 jobs per hour.
Note that the tree was apparently closed due to bug 586179, according to https://wiki.mozilla.org/Tree_Closures.
This is the last of my screenshots.
We can see in it that since the reboot of the 4th the memory usage kept on growing until we hit the ceiling.

Since catlee's fix the memory usage has stayed almost flat.
CPU usage on all 3 masters is good. I believe the only master without the patch is talos-master02, but its growth is not as steep as it was on the other two masters (so we have quite some time to deploy the fix).

Do you guys think we could close this bug after a few more days without issues?
Looks good enough to me.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering