764713 - The Windows test pending count is too damn high!

Reporter

Description

•

13 years ago

It's been bad for several days, but when someone just mentioned 2000 pending at 11pm and I checked, the backlog for Window tests on try is now over 24 hours ("Waiting for... 1 day, 3:31:02"). That's at least critical, because nobody is going to wait more than 24 hours for try results on something where they've already got review, they're just going to push it and let inbound or central be their tryserver. This morning's wait time mail said 35636 jobs/58.97%, 4019 Win7 jobs of which 26% waited 90+ minutes, 3752 WinXP jobs of which 27% waited 90+ minutes. A recent brutal day, May 31st, was 40972 jobs/56.75%, but 3963 Win7 jobs had 20% 90+ minutes and 3780 WinXP jobs had 9% 90+ minutes. Part of the Win7 problem is probably because a bunch of slaves are down for dongling, but I wouldn't have guessed that was enough to make us so much worse; no idea what's wrong with WinXP, though it does have a history of having large numbers of slaves just go AWOL.

Nick Thomas [:nthomas] (UTC+12)

Comment 1

•

13 years ago

A bakers dozen of talos-r3-xp have been rebooted back into service. Except for talos-r3-xp068 they had failed to reboot after a job sometime over the last few days. There are 6 less talos-r4-w7 slaves at the moment due to bug 710233. None of the three masters (bm15/16/23) have obvious issues. Their CPU load and memory are within historical norms from the last few weeks (from ganglia). The buildbot master process is monopolizing almost a whole CPU but they're dual CPU VMs and I don't see any lag in getting a new step as one finishes. Talos was turned on for IonMonkey about a week ago (bug 762081) and they've been pretty active today. Overall the number of pushes in the last 24 hours (237) is high but not a record (6 other days in the last month we've gotten at least that high).

Armen [:armenzg]

Comment 2

•

13 years ago

We have cpu wio which has always been problematic and perhaps tcp_timewait is another indicator. Perhaps jobs are taking long to be scheduled. NOTE: Increasing the time ranges from one hour smooths the peaks. I also see a high ratio of slaves per Windows master (50.7 slaves/master). I wonder how the IO of the kvms is performing. http://ganglia1.build.scl1.mozilla.com/ganglia/?r=hour&cs=&ce=&c=RelEngSCL1&h=buildbot-master15.build.scl1.mozilla.com&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS_|_disk_|_load_|_memory_|_process http://ganglia1.build.scl1.mozilla.com/ganglia/?r=hour&cs=&ce=&c=RelEngSCL1&h=buildbot-master16.build.scl1.mozilla.com&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS_|_disk_|_load_|_memory_|_process http://ganglia1.build.scl1.mozilla.com/ganglia/?r=hour&cs=&ce=&c=RelEngSCL1&h=buildbot-master23.build.scl1.mozilla.com&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS_|_disk_|_load_|_memory_|_process

Armen [:armenzg]

Comment 3

•

13 years ago

Adding buildduty and relops point of contact.

Chris AtLee [:catlee]

Comment 4

•

13 years ago

I'm not sure there's actually anything wrong with the masters themselves, I don't see evidence that high master load is causing steps to take longer. I suspect we're simply hitting the limits of what our pool of machines can handle. We've been turning on more tests and branches and doing more and more try pushes without increasing the capacity of the test pool.

Armen [:armenzg]

Comment 5

•

13 years ago

Was there also once an issue that jobs were taking long to be scheduled? On another note, I have noticed that the masters take long to load up jobs and builders. You are likely right nevertheless. Here is some data I have gathered that might or might not be useful: If we look at yearly view of bm15/bm16 we can see there is a continues growth of cpu wio since November (remember that yearly view smooths the graph a lot): http://cl.ly/HNNu http://cl.ly/HOad FTR the cpu wio for this host was very very good back in August. I wonder how we could see how our kvm setup is performing wrt to IO. Back in November we had around 20K test jobs and less than 2K build/try jobs. These days we have around 40K test jobs and around 3K build/try jobs and. [1] Wait time report for testpool for jobs submitted between Tue, 29 Nov 2011 00:00:00 -0800 (PST) and Wed, 30 Nov 2011 00:00:00 -0800 (PST) Total Jobs: 21996 Wait time report for trybuildpool for jobs submitted between Tue, 29 Nov 2011 00:00:00 -0800 (PST) and Wed, 30 Nov 2011 00:00:00 -0800 (PST) Total Jobs: 849 Wait time report for buildpool for jobs submitted between Tue, 29 Nov 2011 00:00:00 -0800 (PST) and Wed, 30 Nov 2011 00:00:00 -0800 (PST) Total Jobs: 1101 [2] Wait time report for testpool for jobs submitted between Wed, 13 Jun 2012 00:00:00 -0700 (PDT) and Thu, 14 Jun 2012 00:00:00 -0700 (PDT) Total Jobs: 38614 Wait time report for trybuildpool for jobs submitted between Wed, 13 Jun 2012 00:00:00 -0700 (PDT) and Thu, 14 Jun 2012 00:00:00 -0700 (PDT) Total Jobs: 1057 Wait time report for buildpool for jobs submitted between Wed, 13 Jun 2012 00:00:00 -0700 (PDT) and Thu, 14 Jun 2012 00:00:00 -0700 (PDT) Total Jobs: 1943

Aki Sasaki (not active)

Updated

•

12 years ago

Whiteboard: [buildduty]

Ed Morley [:emorley]

Comment 7

•

12 years ago

There are pending Windows XP tests on inbound as far back as: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&onlyunstarred=1&rev=878c00396d62 ...which finished the windows build at 05:05 PDT. This makes trying to sheriff really difficult & results in people getting irritated about how long we have to wait for pgo green on all platforms, before I can do an inbound -> mozilla-central merge. Is there something specific causing this of late? Or is it a more general case of "we need more windows machines asap"? If the situation keeps on getting this bad, we're going to have to have daily tree closures to let things catch up...

Armen [:armenzg]

Comment 8

•

12 years ago

I'm trying to get more Windows 7 and XP machines by re-purposing 80% of the Leopard pool since we're dropping the support. On the other side I believe there might be an issue from looking at http://build.mozilla.org/builds/last-job-per-slave.html It seems that there are a lot of xp slaves not taking jobs for 9-11 hours. I have poked kmoir_buildduty to help me look into this. I think today it might be related to the OPSI master (bug 774602). Still to be proven.

Ed Morley [:emorley]

Comment 9

•

12 years ago

Thank you for looking into this :-)

Armen [:armenzg]

Comment 10

•

12 years ago

I have rebooted most of the 75 WinXP slaves. I would say more than 60% were not taking jobs since 1-2AM PDT. We should be back to 90% assuming that I did not miss too many. I have told kmoir to look into remaining slaves when http://build.mozilla.org/builds/last-job-per-slave.html has more updated information. Let's see in 30minutes how http://build.mozilla.org/builds/pending/pending.html starts trending.

Ed Morley [:emorley]

Comment 11

•

12 years ago

The situation is looking a bit better now, though still quite high levels of winxp/win7 try test jobs, hopefully just residual. Isn't this something that nagios should be set up to warn on? Looking at http://build.mozilla.org/builds/last-job-per-slave.html there seem to be many more machines that don't have notes, that have not performed a job recently. Is this expected? It's a shame there is no filtering mechanism on that page to exclude machines that are expected to be inactive. The linux compile graphs on http://build.mozilla.org/builds/pending/pending.html didn't look very happy earlier either, are any of the yellow/red machines on last-job-per-slave expected to be running but not?

Chris Cooper [:coop] (he/him)

Comment 12

•

12 years ago

(In reply to Ed Morley [:edmorley] from comment #11) > Isn't this something that nagios should be set up to warn on? Looking at > http://build.mozilla.org/builds/last-job-per-slave.html there seem to be > many more machines that don't have notes, that have not performed a job > recently. Is this expected? It's a shame there is no filtering mechanism on > that page to exclude machines that are expected to be inactive. I'll try to find some time to break out known staging/dev slaves into their own section in last-job-per-slave.html. You're right, we shouldn't be confusing the results with those at all, and it's particularly confusing (for me at least) with the tegras. I've filed bug 775073 for this.

Ed Morley [:emorley]

Comment 13

•

12 years ago

That's great - thank you :-)

Ed Morley [:emorley]

Updated

•

12 years ago

Blocks: 772458

Ed Morley [:emorley]

Updated

•

12 years ago

Whiteboard: [buildduty] → [buildduty][sheriff-want]

Chris AtLee [:catlee]

Comment 14

•

12 years ago

Anything left to do here?

Ed Morley [:emorley]

Comment 15

•

12 years ago

(In reply to Chris AtLee [:catlee] from comment #14) > Anything left to do here? The windows pending test count is still regularly high enough to cause Try backlog, so I don't think we can call this fixed yet.

Ed Morley [:emorley]

Comment 16

•

12 years ago

(In reply to Ed Morley [:edmorley] from comment #15) > (In reply to Chris AtLee [:catlee] from comment #14) > > Anything left to do here? > > The windows pending test count is still regularly high enough to cause Try > backlog, so I don't think we can call this fixed yet. For example: Pending test(s) @ Aug 15 02:15:03 win7 (57) 39 release-mozilla-beta 9 mozilla-inbound 9 fx-team winxp (35) 31 release-mozilla-beta 4 mozilla-inbound Pending test(s) @ Aug 15 02:15:03 win7 (406) 406 try winxp (453) 453 try

bhearsum@mozilla.com (:bhearsum)

Comment 17

•

12 years ago

AFAIK there aren't any acute issues left here, so this isn't a buildduty concern at this point. It's also not a Machine Management issue, because it's not a specific-slave issue. Moving back to the main RelEng component for re-triage.

Component: Release Engineering: Machine Management → Release Engineering

QA Contact: armenzg

Whiteboard: [buildduty][sheriff-want] → [sheriff-want]

Ed Morley [:emorley]

Updated

•

12 years ago

Depends on: 794987

Ed Morley [:emorley]

Updated

•

12 years ago

Depends on: 794420

Armen [:armenzg]

Comment 18

•

12 years ago

I don't know what correct resolution to give to this but for now the only things we can do are: * get into action slaves that are hung * find tests to disable * keep on working to move away from the minis (working on it) * re-purpose rev3 machines IMHO there's no point of keeping the bug open.

Phil Ringnalda (:philor)

Reporter

Comment 19

•

12 years ago

Fixed by adding what slaves were available and getting serious about rebooting the constantly hanging WinXP slaves. I will, however, be filing a clone of this bug soon, for Linux32 - it's perpetually around 10-12 hours, with no real reason to believe we won't shove it up to a 24 hour backlog.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: mozilla.org → Release Engineering