Closed Bug 649734 Opened 9 years ago Closed 9 years ago

Determine the cause of & fix the increased testpool wait times

Categories

(Release Engineering :: General, defect, P2)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

Attachments

(3 files)

Something has changed after the 1st of April since it was the last time we had a normal wait time:
> Wait: 8242/83.96% (testpool)
April 3rd was the first day that things were wrong:
> Wait: 3079/74.31% (testpool) 

catlee says that it could be the increase of # of builders:
> One possibility is that we've added a lot more branches recently, meaning we
> have a lot more schedulers and builders.
>
> We currently have 4364 builders and 1646 schedulers. On April 1st we had 3006
> builders and 1121 schedulers, so we've grown by about 50%.

catlee has initially patched buildbot to get some logging.
Comment on attachment 525749 [details] [diff] [review]
[buildbot] patch to get some logging

Let's pick a test master to land this on. How about buildbot-master2:tests_master5? I'm avoiding buildbot-master1 because we know it has a slow drive.
Attachment #525749 - Flags: review?(dustin)
Not sure if it has value but I have gathered this information at 13:50 PDT
pm01 top - 13:52:03 up 88 days, 14:39,  2 users,  load average: 4.61, 3.44, 3.10
pm02 top - 13:49:04 up 464 days,  4:01,  1 user,  load average: 0.02, 0.08, 0.12
pm03 top - 13:49:13 up 88 days, 14:32,  2 users,  load average: 0.81, 0.77, 0.73
bm1  top - 13:49:22 up 19 days,  2:55,  1 user,   load average: 0.91, 1.39, 1.65
bm2  top - 13:49:31 up 116 days, 23:45,  1 user,  load average: 1.68, 1.04, 0.95
bm3  top - 13:49:41 up 82 days,  1:25,  1 user,   load average: 0.05, 0.02, 0.00
bm5  top - 13:49:53 up 27 days,  5:35,  1 user,   load average: 0.03, 0.37, 0.48

The only masters with high load are pm01 & bm2.
Attachment #525749 - Flags: review?(dustin) → review+
Landed on buildbot-master2:tests_master5, doing a graceful restart right now.
Summary: figure out the bad wait times we are seeing since the beginning of April → Determine the cause of & fix the increased testpool wait times
bug 648665 is certainly responsible for some slowness in the mac testing silos.
See Also: → 648665
This works on my laptop and on staging master. 

Takes about 0.10s to sort 147 builders on my staging master with this method vs. 14s to sort 741 builders with the existing method.
Attachment #525917 - Flags: review?(dustin)
Comment on attachment 525917 [details] [diff] [review]
Faster builder sorting

I like it, replete with a logged metric!
Attachment #525917 - Flags: review?(dustin) → review+
catlee is up to something.
The time a step takes on the slave (e.g. 1.3 secs) VS how long it takes the master to consider the step finished (e.g. 7 secs) and move on is different.
The problem seems to be that master is busy doing "stuff".

It seems that on buildbot-master1 happens the most.
We are looking into finishing buildbot-master4 and buildbot-master6 and move away from buildbot-master1.

mysql> select masters.url, count(*) from masters, steps, builds, builders where builds.master_id = masters.id and builds.builder_id = builders.id and steps.build_id = builds.id and steps.name = 'get_buildername' and (unix_timestamp(steps.endtime) - unix_timestamp(steps.starttime)) > 5  and builds.starttime >= '2011-04-01' group by masters.url;
+-----------------------------------------------------+----------+
| url                                                 | count(*) |
+-----------------------------------------------------+----------+
| http://buildbot-master1.build.scl1.mozilla.com:8010 |       26 | 
| http://buildbot-master1.build.scl1.mozilla.com:8012 |      278 | 
| http://buildbot-master2.build.scl1.mozilla.com:8010 |        1 | 
| http://buildbot-master2.build.scl1.mozilla.com:8011 |       14 | 
| http://buildbot-master2.build.scl1.mozilla.com:8012 |       11 | 
+-----------------------------------------------------+----------+
On January the only misbehaving was buildbot-master1

mysql> select masters.url, count(*) from masters, steps, builds, builders where builds.master_id = masters.id and builds.builder_id = builders.id and steps.build_id = builds.id and steps.name = 'get_buildername' and (unix_timestamp(steps.endtime) - unix_timestamp(steps.starttime)) > 5  and builds.starttime >= '2011-01-01' and builds.starttime <= '2011-02-01' group by masters.url;
+-----------------------------------------------------+----------+
| url                                                 | count(*) |
+-----------------------------------------------------+----------+
| http://buildbot-master1.build.scl1.mozilla.com:8010 |       10 | 
| http://buildbot-master1.build.scl1.mozilla.com:8011 |      179 | 
| http://buildbot-master1.build.scl1.mozilla.com:8012 |      183 | 
+-----------------------------------------------------+----------+
I was trying to figure out the averages for March VS April grouped by master url for a given test suite.
I don't see anything significantly wrong.
I might be choosing the wrong test suite to look at or I should exclude weekends to not soften the averages.
Maybe I should group by # of jobs finished less than 1500, between 1500 and 1650 and over 1650.
I should go back to my statistic notes.

mysql> select masters.url, AVG(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime)) AS average from builders, builds, masters where builds.master_id = masters.id and builders.id = builds.builder_id and builders.name LIKE "mozilla-central_xp_test-xpcshell%" and (endtime-starttime)>0 AND builds.starttime >= '2011-03-01' AND builds.starttime <= '2011-03-31'  GROUP BY builds.master_id ORDER BY masters.url;
+-----------------------------------------------------+-----------+
| url                                                 | average   |
+-----------------------------------------------------+-----------+
| http://buildbot-master1.build.scl1.mozilla.com:8011 | 1662.7857 | 
| http://buildbot-master1.build.scl1.mozilla.com:8012 | 1567.6200 | 
| http://buildbot-master2.build.scl1.mozilla.com:8011 | 1551.3571 | 
| http://buildbot-master2.build.scl1.mozilla.com:8012 | 1563.6267 | 
| http://buildbot-master5.build.mozilla.org:8011      | 1541.6304 | 
| http://buildbot-master5.build.scl1.mozilla.com:8011 | 1561.3333 | 
+-----------------------------------------------------+-----------+
6 rows in set (2.33 sec)

mysql> select masters.url, AVG(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime)) AS average from builders, builds, masters where builds.master_id = masters.id and builders.id = builds.builder_id and builders.name LIKE "mozilla-central_xp_test-xpcshell%" and (endtime-starttime)>0 AND builds.starttime >= '2011-04-01'  GROUP BY builds.master_id ORDER BY masters.url;
+-----------------------------------------------------+-----------+
| url                                                 | average   |
+-----------------------------------------------------+-----------+
| http://buildbot-master1.build.scl1.mozilla.com:8012 | 1585.4082 | 
| http://buildbot-master2.build.scl1.mozilla.com:8011 | 1640.0000 | 
| http://buildbot-master2.build.scl1.mozilla.com:8012 | 1609.7576 | 
| http://buildbot-master5.build.mozilla.org:8011      | 1528.8571 | 
+-----------------------------------------------------+-----------+
4 rows in set (1.27 sec)
(In reply to comment #9)
> I was trying to figure out the averages for March VS April grouped by master
> url for a given test suite.
> I don't see anything significantly wrong.
> I might be choosing the wrong test suite to look at or I should exclude
> weekends to not soften the averages.
> Maybe I should group by # of jobs finished less than 1500, between 1500 and
> 1650 and over 1650.
> I should go back to my statistic notes.
> 
> mysql> select masters.url,
> AVG(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime)) AS average
> from builders, builds, masters where builds.master_id = masters.id and
> builders.id = builds.builder_id and builders.name LIKE
> "mozilla-central_xp_test-xpcshell%" and (endtime-starttime)>0 AND
> builds.starttime >= '2011-03-01' AND builds.starttime <= '2011-03-31'  GROUP BY
> builds.master_id ORDER BY masters.url;
> +-----------------------------------------------------+-----------+
> | url                                                 | average   |
> +-----------------------------------------------------+-----------+
> | http://buildbot-master1.build.scl1.mozilla.com:8011 | 1662.7857 | 
> | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1567.6200 | 
> | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1551.3571 | 
> | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1563.6267 | 
> | http://buildbot-master5.build.mozilla.org:8011      | 1541.6304 | 
> | http://buildbot-master5.build.scl1.mozilla.com:8011 | 1561.3333 | 
> +-----------------------------------------------------+-----------+
> 6 rows in set (2.33 sec)
> 
> mysql> select masters.url,
> AVG(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime)) AS average
> from builders, builds, masters where builds.master_id = masters.id and
> builders.id = builds.builder_id and builders.name LIKE
> "mozilla-central_xp_test-xpcshell%" and (endtime-starttime)>0 AND
> builds.starttime >= '2011-04-01'  GROUP BY builds.master_id ORDER BY
> masters.url;
> +-----------------------------------------------------+-----------+
> | url                                                 | average   |
> +-----------------------------------------------------+-----------+
> | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1585.4082 | 
> | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1640.0000 | 
> | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1609.7576 | 
> | http://buildbot-master5.build.mozilla.org:8011      | 1528.8571 | 
> +-----------------------------------------------------+-----------+
> 4 rows in set (1.27 sec)

What does January look like?
Assignee: nobody → catlee
Priority: -- → P2
dustin marked buildbot-master1 not to take any slaves since the slave allocator is now fully functional.

Once buildbot-master4 and buildbot-master6 are up and running the situation should get better as we would off load to them.

I will look into the averages since January and others on Monday.
Priority: P2 → --
Duplicate test jobs caused by bug 592244 may be a factor here.
Priority: -- → P2
Depends on: 592244
The attachment can show slightly that some test jobs on some days can run slower on some masters compared to others. This is not easy to notice.

The rest of the comment is the same previous query but for January for XP xpshell and Fedora xpcshell without revealing anything significant.
 
mysql> select MONTH(builds.starttime), masters.url, AVG(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime)) AS average from builders, builds, masters where builds.master_id = masters.id and builders.id = builds.builder_id and builders.name LIKE "mozilla-central_xp_test-xpcshell%" and (endtime-starttime)>0 AND builds.starttime >= '2011-01-01' AND builds.endtime <= '2011-04-19'  GROUP BY builds.master_id, MONTH(builds.starttime) ORDER BY masters.url, MONTH(builds.starttime);
+-------------------------+-----------------------------------------------------+-----------+
| MONTH(builds.starttime) | url                                                 | average   |
+-------------------------+-----------------------------------------------------+-----------+
|                       1 | http://buildbot-master1.build.scl1.mozilla.com:8011 | 1554.2632 | 
|                       2 | http://buildbot-master1.build.scl1.mozilla.com:8011 | 1534.1368 | 
|                       3 | http://buildbot-master1.build.scl1.mozilla.com:8011 | 1662.7857 | 
|                       1 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1586.9744 | 
|                       2 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1556.8767 | 
|                       3 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1572.5238 | 
|                       4 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1597.6364 | 
|                       1 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1619.2349 | 
|                       2 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1585.0909 | 
|                       3 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1558.1290 | 
|                       4 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1630.6957 | 
|                       1 | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1573.2182 | 
|                       2 | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1551.8579 | 
|                       3 | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1563.6154 | 
|                       4 | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1607.8140 | 
|                       3 | http://buildbot-master5.build.mozilla.org:8011      | 1538.1429 | 
|                       4 | http://buildbot-master5.build.mozilla.org:8011      | 1540.7170 | 
|                       3 | http://buildbot-master5.build.scl1.mozilla.com:8011 | 1561.3333 | 
+-------------------------+-----------------------------------------------------+-----------+

I also don't see anything significant for Fedora's xpcshell:

mysql> select MONTH(builds.starttime), masters.url, AVG(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime)) AS average from builders, builds, masters where builds.master_id = masters.id and builders.id = builds.builder_id and builders.name LIKE "mozilla-central_fedora_test-xpcshell%" and (endtime-starttime)>0 AND builds.starttime >= '2011-01-01' AND builds.endtime <= '2011-04-19'  GROUP BY builds.master_id, MONTH(builds.starttime) ORDER BY masters.url, MONTH(builds.starttime);
+-------------------------+-----------------------------------------------------+-----------+
| MONTH(builds.starttime) | url                                                 | average   |
+-------------------------+-----------------------------------------------------+-----------+
|                       1 | http://buildbot-master1.build.scl1.mozilla.com:8011 | 1013.1086 | 
|                       2 | http://buildbot-master1.build.scl1.mozilla.com:8011 | 1015.7333 | 
|                       3 | http://buildbot-master1.build.scl1.mozilla.com:8011 | 1080.7917 | 
|                       1 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1035.4889 | 
|                       2 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1019.5299 | 
|                       3 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1011.7000 | 
|                       4 | http://buildbot-master1.build.scl1.mozilla.com:8012 | 1017.2593 | 
|                       1 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1028.2653 | 
|                       2 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1015.7714 | 
|                       3 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1015.6203 | 
|                       4 | http://buildbot-master2.build.scl1.mozilla.com:8011 | 1166.8824 | 
|                       1 | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1001.2192 | 
|                       2 | http://buildbot-master2.build.scl1.mozilla.com:8012 |  998.2609 | 
|                       3 | http://buildbot-master2.build.scl1.mozilla.com:8012 |  982.4091 | 
|                       4 | http://buildbot-master2.build.scl1.mozilla.com:8012 | 1038.7273 | 
|                       3 | http://buildbot-master5.build.mozilla.org:8011      | 1003.0175 | 
|                       4 | http://buildbot-master5.build.mozilla.org:8011      | 1111.4242 | 
|                       3 | http://buildbot-master5.build.scl1.mozilla.com:8011 | 1007.1613 | 
+-------------------------+-----------------------------------------------------+-----------+
Depends on: 652962
After a lot of asking around I have determined that:
* We have move most slaves (>95% of non-w7 & non-w7x64 slaves) connected to buildbot-master5 because buildbot-master{1,2} have disk issues
** This means that we have one master talking with 5/7 of all of our test pool capacity (plus slaves for build and try slaves)
** This also means that the steps run by slaves are super super overheaded (a 'pwd' command can take 40 secs)

Here is the list of things that can be done to improve the situation:
* move non-test slaves away from buildbot-master5
* move a small portion of slaves back to buildbot-master{1,2}
* once buildbot-master{04,06} are set up move slaves back to it
* check to which master are the Win7 slaves talking to

I am going to tackle #1 for now.

Things started going worst for this master around midday on Tuesday:
http://ganglia1.build.scl1.mozilla.com/ganglia/graph.php?g=mem_report&z=large&c=RelEngSCL1&h=buildbot-master5.build.scl1.mozilla.com&m=load_one&r=week&s=descending&hc=4&mc=2&st=1304016526

This is the current situation:
* bm1  -- slow drive (bug 644991)
* bm2  -- slow drive (bug 652962)
* bm04 (being set up)
* bm5  -- very bad load
* bm06 (being set up)

There are currently this many slaves (not counting win7 & win7x64) connected to each test master (according to slavealloc):
* bm1 - 0 test slaves
* bm2:8011 (tm5) - 3 test slaves (probably they are sick slaves as they should have moved)
* bm2:8012 (tm6) - 1 test slaves (probably they are sick slaves as they should have moved)
* bm5:8012 (tm3) - 246 test slavealloc managed slaves

[1] http://ganglia1.build.scl1.mozilla.com/ganglia/?r=day&c=RelEngSCL1&h=buildbot-master1.build.scl1.mozilla.com&mc=2
[2] http://ganglia1.build.scl1.mozilla.com/ganglia/?c=RelEngSCL1&h=buildbot-master2.build.scl1.mozilla.com&m=load_one&r=hour&s=descending&hc=4&mc=2
[3] http://ganglia3.build.mtv1.mozilla.com/ganglia/?m=load_one&r=hour&s=descending&c=RelEngMTV1&h=buildbot-master3.build.mozilla.org&sh=1&hc=4&z=small
[4] http://ganglia1.build.scl1.mozilla.com/ganglia/?c=RelEngSCL1&h=buildbot-master04.build.scl1.mozilla.com&m=load_one&r=hour&s=descending&hc=4&mc=2
[5] http://ganglia1.build.scl1.mozilla.com/ganglia/?r=week&c=RelEngSCL1&h=buildbot-master5.build.scl1.mozilla.com&mc=2
[6] http://ganglia1.build.scl1.mozilla.com/ganglia/?c=RelEngSCL1&h=buildbot-master06.build.scl1.mozilla.com&m=load_one&r=hour&s=descending&hc=4&mc=2
Depends on: 653539
Depends on: 653555
Masters bm04-tests1 and bm06-tests1 have been added.

Since the beginning of this bug slaves have been moved away from masters buildbot-master0{1,2} (I think it has been during this time range) since they have bad hard disks.

We now have 3 tests masters up (tm3, bm04-tests1 & bm06-tests1).

We have also been reporting wait times incorrectly for every masters that has been added since Sept. 28th [1].

[1] http://hg.mozilla.org/build/buildapi/rev/3d844b4e65c7

########

I have also run these two queries [1] & [2].

The 2nd one does not reveal anything relevant (I probably wrote it incorrectly) but the first one reveals that for the builder "mozilla-central_xp_test-xpcshell" on the 27th we had longer tests running on buildbot-master5.
Instead of around ~1600 seconds it took more than 2500 seconds.
Adding the two new masters have brought this back down.

** [1] **
mysql> select day(builds.starttime) as day , masters.url, AVG(unix_timestamp(builds.endtime)-unix_timestamp(builds.starttime)) AS average from builders, builds, masters where builds.master_id = masters.id and builders.id = builds.builder_id and builders.name LIKE "mozilla-central_xp_test-xpcshell%" and (endtime-starttime)>0 AND builds.starttime >= '2011-04-25' AND builds.starttime <= '2011-04-30'  GROUP BY builds.master_id, day(builds.starttime)  ORDER BY day(builds.starttime);
+------+------------------------------------------------------+-----------+
| day  | url                                                  | average   |
+------+------------------------------------------------------+-----------+
|   25 | http://buildbot-master2.build.scl1.mozilla.com:8012  | 1595.0000 | 
|   25 | http://buildbot-master5.build.mozilla.org:8011       | 1692.7500 | 
|   25 | http://buildbot-master1.build.scl1.mozilla.com:8012  | 1534.3333 | 
|   25 | http://buildbot-master2.build.scl1.mozilla.com:8011  | 1659.5000 | 
|   26 | http://buildbot-master2.build.scl1.mozilla.com:8012  | 1546.5000 | 
|   26 | http://buildbot-master5.build.mozilla.org:8011       | 1726.5000 | 
|   26 | http://buildbot-master2.build.scl1.mozilla.com:8011  | 1611.6000 | 
|   26 | http://buildbot-master1.build.scl1.mozilla.com:8012  | 1614.6667 | 
|   27 | http://buildbot-master1.build.scl1.mozilla.com:8012  | 2669.6000 | 
|   27 | http://buildbot-master5.build.mozilla.org:8011       | 2582.7500 | 
|   28 | http://buildbot-master04.build.scl1.mozilla.com:8201 | 1541.6667 | 
|   28 | http://buildbot-master5.build.mozilla.org:8011       | 1975.1429 | 
|   28 | http://buildbot-master06.build.scl1.mozilla.com:8201 | 1600.5000 | 
|   29 | http://buildbot-master04.build.scl1.mozilla.com:8201 | 1618.5000 | 
|   29 | http://buildbot-master06.build.scl1.mozilla.com:8201 | 1625.0000 | 
|   29 | http://buildbot-master5.build.mozilla.org:8011       | 1659.5714 | 
+------+------------------------------------------------------+-----------+
16 rows in set (0.54 sec)

** [2] **
mysql> select masters.url, count(*) from masters, steps, builds, builders where builds.master_id = masters.id and builds.builder_id = builders.id and steps.build_id = builds.id and steps.name = 'get_buildername' and (unix_timestamp(steps.endtime) - unix_timestamp(steps.starttime)) > 5  and builds.starttime >= '2011-04-25' and builds.starttime <= '2011-04-29' group by masters.url;
+-----------------------------------------------------+----------+
| url                                                 | count(*) |
+-----------------------------------------------------+----------+
| http://buildbot-master2.build.scl1.mozilla.com:8010 |        1 | 
| http://buildbot-master2.build.scl1.mozilla.com:8011 |        9 | 
| http://buildbot-master2.build.scl1.mozilla.com:8012 |        4 | 
+-----------------------------------------------------+----------+
3 rows in set (15.13 sec)
Depends on: 653729
Depends on: 648665, 650801
Status update:
* Bug 592244 - race condition in schedulerdb
** this will help with the jobs that get run on more than one master (duplicate efforts)
* Bug 648665 - Some slaves do not reboot - twistd.pid still exists
** this fixes the slaves that would go MIA after failing to start buildbot (hence reducing pool capacity)
* Bug 650801 - Cannot allocate memory errors on buildbot masters
** this fixes the excessive memory usage.
** I am not sure how to draw the line on how it affects the wait times (besides avoiding our masters to reach OOM) but worth mention it as it bring stability to our masters
* bug 644991 and bug 652962 are the two masters that we moved away from for having disk issues.
** having all slaves on buildbot-master5 was slowing down test runs. We now run tests on 3 test masters.
* Bug 653729 - Fix wait times report
** This bug has been preempting us from getting real wait times stats

I believe there is no specific action left for this bug besides dealing with the dependent bugs.
I will keep on watching on the wait times reports and see if we really are back to normal sane days.
Anything left to do here?
Assignee: catlee → nobody
(In reply to comment #18)
> Anything left to do here?

It was quite a ride but we are now in better shape.

Thanks catlee! Your forensic touch on memory leaks saved us once more :)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 668461
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.