Closed Bug 764713 Opened 13 years ago Closed 12 years ago

The Windows test pending count is too damn high!

Categories

(Release Engineering :: General, defect)

x86
Windows 7
defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

(Whiteboard: [sheriff-want])

It's been bad for several days, but when someone just mentioned 2000 pending at 11pm and I checked, the backlog for Windows tests on try is now over 24 hours ("Waiting for... 1 day, 3:31:02"). That's at least critical, because nobody is going to wait more than 24 hours for try results on something where they've already got review; they're just going to push it and let inbound or central be their tryserver. This morning's wait time mail said 35636 jobs/58.97%: 4019 Win7 jobs, of which 26% waited 90+ minutes, and 3752 WinXP jobs, of which 27% waited 90+ minutes. A recent brutal day, May 31st, was 40972 jobs/56.75%, but then only 20% of 3963 Win7 jobs and 9% of 3780 WinXP jobs waited 90+ minutes. Part of the Win7 problem is probably because a bunch of slaves are down for dongling, but I wouldn't have guessed that was enough to make us so much worse; no idea what's wrong with WinXP, though it does have a history of having large numbers of slaves just go AWOL.
A baker's dozen of talos-r3-xp slaves have been rebooted back into service. Except for talos-r3-xp068, they had failed to reboot after a job sometime over the last few days. There are 6 fewer talos-r4-w7 slaves at the moment due to bug 710233. None of the three masters (bm15/16/23) have obvious issues: their CPU load and memory are within historical norms from the last few weeks (from ganglia). The buildbot master process is monopolizing almost a whole CPU, but they're dual-CPU VMs and I don't see any lag in starting a new step as one finishes. Talos was turned on for IonMonkey about a week ago (bug 762081) and those jobs have been pretty active today. Overall the number of pushes in the last 24 hours (237) is high but not a record (6 other days in the last month we've gotten at least that high).
Adding buildduty and relops point of contact.
I'm not sure there's actually anything wrong with the masters themselves, I don't see evidence that high master load is causing steps to take longer. I suspect we're simply hitting the limits of what our pool of machines can handle. We've been turning on more tests and branches and doing more and more try pushes without increasing the capacity of the test pool.
Was there also once an issue where jobs were taking a long time to be scheduled? On another note, I have noticed that the masters take a long time to load up jobs and builders. You are likely right nevertheless. Here is some data I have gathered that might or might not be useful.

If we look at the yearly view of bm15/bm16 we can see there is continuous growth of CPU wio since November (remember that the yearly view smooths the graph a lot):
http://cl.ly/HNNu
http://cl.ly/HOad
FTR the CPU wio for this host was very, very good back in August. I wonder how we could see how our KVM setup is performing wrt IO.

Back in November we had around 20K test jobs and less than 2K build/try jobs. These days we have around 40K test jobs and around 3K build/try jobs.

[1] Wait time reports for jobs submitted between Tue, 29 Nov 2011 00:00:00 -0800 (PST) and Wed, 30 Nov 2011 00:00:00 -0800 (PST):
* testpool - Total Jobs: 21996
* trybuildpool - Total Jobs: 849
* buildpool - Total Jobs: 1101

[2] Wait time reports for jobs submitted between Wed, 13 Jun 2012 00:00:00 -0700 (PDT) and Thu, 14 Jun 2012 00:00:00 -0700 (PDT):
* testpool - Total Jobs: 38614
* trybuildpool - Total Jobs: 1057
* buildpool - Total Jobs: 1943
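As a back-of-the-envelope comparison of the two reports above (just a sketch using the totals quoted there, nothing here queries the masters):

# Rough per-pool growth between the Nov 2011 and Jun 2012 wait time reports.
nov_2011 = {"testpool": 21996, "trybuildpool": 849, "buildpool": 1101}
jun_2012 = {"testpool": 38614, "trybuildpool": 1057, "buildpool": 1943}

for pool in sorted(nov_2011):
    growth = (jun_2012[pool] - nov_2011[pool]) / float(nov_2011[pool]) * 100
    print("%-12s %5d -> %5d jobs (+%.0f%%)" % (pool, nov_2011[pool], jun_2012[pool], growth))

That works out to roughly +76% for testpool and buildpool, and +25% for trybuildpool.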
Whiteboard: [buildduty]
There are pending Windows XP tests on inbound as far back as: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&onlyunstarred=1&rev=878c00396d62 ...which finished the Windows build at 05:05 PDT. This makes sheriffing really difficult, and results in people getting irritated about how long we have to wait for PGO to go green on all platforms before I can do an inbound -> mozilla-central merge. Is there something specific causing this of late, or is it the more general case of "we need more Windows machines asap"? If the situation keeps getting this bad, we're going to have to have daily tree closures to let things catch up...
I'm trying to get more Windows 7 and XP machines by re-purposing 80% of the Leopard pool, since we're dropping support for it. On another note, looking at http://build.mozilla.org/builds/last-job-per-slave.html I believe there might be an issue: it seems that a lot of XP slaves have not taken jobs for 9-11 hours. I have poked kmoir_buildduty to help me look into this. I think today it might be related to the OPSI master (bug 774602). Still to be proven.
Thank you for looking into this :-)
I have rebooted most of the 75 WinXP slaves. I would say more than 60% had not taken jobs since 1-2 AM PDT. We should be back to 90% capacity, assuming that I did not miss too many. I have asked kmoir to look into the remaining slaves once http://build.mozilla.org/builds/last-job-per-slave.html has more up-to-date information. Let's see in 30 minutes how http://build.mozilla.org/builds/pending/pending.html starts trending.
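Something along these lines would flag the hung slaves automatically instead of eyeballing the page (just a sketch, assuming a slave -> last-job-end-time mapping scraped from last-job-per-slave.html; the slave names and timestamps below are made up):

from datetime import datetime, timedelta

# Sketch only: in reality last_job would be scraped from
# last-job-per-slave.html; these entries are made-up examples.
last_job = {
    "talos-r3-xp-001": datetime(2012, 7, 17, 1, 45),   # nothing since ~1:45 AM
    "talos-r3-xp-002": datetime(2012, 7, 17, 9, 30),
}

cutoff = datetime.now() - timedelta(hours=6)
stale = sorted(name for name, last in last_job.items() if last < cutoff)
for name in stale:
    print("%s last took a job at %s -- needs a kick" % (name, last_job[name]))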
The situation is looking a bit better now, though there are still quite high levels of pending winxp/win7 try test jobs; hopefully just residual backlog. Isn't this something that nagios should be set up to warn on? Looking at http://build.mozilla.org/builds/last-job-per-slave.html there seem to be many more machines that don't have notes, that have not performed a job recently. Is this expected? It's a shame there is no filtering mechanism on that page to exclude machines that are expected to be inactive. The linux compile graphs on http://build.mozilla.org/builds/pending/pending.html didn't look very happy earlier either; are any of the yellow/red machines on last-job-per-slave expected to be running but currently not?
(In reply to Ed Morley [:edmorley] from comment #11)
> Isn't this something that nagios should be set up to warn on? Looking at
> http://build.mozilla.org/builds/last-job-per-slave.html there seem to be
> many more machines that don't have notes, that have not performed a job
> recently. Is this expected? It's a shame there is no filtering mechanism on
> that page to exclude machines that are expected to be inactive.

I'll try to find some time to break out known staging/dev slaves into their own section in last-job-per-slave.html. You're right, we shouldn't be confusing the results with those at all, and it's particularly confusing (for me at least) with the tegras. I've filed bug 775073 for this.
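Roughly the kind of split I mean (just a sketch; the prefix list below is illustrative, not the real staging/dev inventory):

# Sketch: keep known staging/dev slaves out of the production table when
# rendering last-job-per-slave.html.  The prefix list is illustrative only.
STAGING_PREFIXES = ("staging-", "dev-", "preproduction-")

def partition_slaves(slave_names):
    production, staging = [], []
    for name in slave_names:
        (staging if name.startswith(STAGING_PREFIXES) else production).append(name)
    return production, staging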
That's great - thank you :-)
Blocks: 772458
Whiteboard: [buildduty] → [buildduty][sheriff-want]
Anything left to do here?
(In reply to Chris AtLee [:catlee] from comment #14)
> Anything left to do here?

The Windows pending test count is still regularly high enough to cause Try backlog, so I don't think we can call this fixed yet.
(In reply to Ed Morley [:edmorley] from comment #15)
> (In reply to Chris AtLee [:catlee] from comment #14)
> > Anything left to do here?
>
> The Windows pending test count is still regularly high enough to cause Try
> backlog, so I don't think we can call this fixed yet.

For example:

Pending test(s) @ Aug 15 02:15:03
win7 (57)
  39 release-mozilla-beta
   9 mozilla-inbound
   9 fx-team
winxp (35)
  31 release-mozilla-beta
   4 mozilla-inbound

Pending test(s) @ Aug 15 02:15:03
win7 (406)
  406 try
winxp (453)
  453 try
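For reference, a breakdown like the one above is easy to generate from a list of pending jobs (a sketch only, assuming the pending jobs are available as (platform, branch) pairs, e.g. from the data behind pending.html; this is not the real reporting script):

from collections import Counter

# Sketch: `pending` would come from the scheduler DB or the pending-counts
# page; these (platform, branch) pairs are made up.
pending = [("win7", "try"), ("win7", "try"), ("winxp", "mozilla-inbound")]

per_platform = Counter(platform for platform, _ in pending)
per_branch = Counter(pending)
for platform, total in per_platform.most_common():
    print("%s (%d)" % (platform, total))
    for (plat, branch), n in per_branch.most_common():
        if plat == platform:
            print("  %d %s" % (n, branch))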
AFAIK there aren't any acute issues left here, so this isn't a buildduty concern at this point. It's also not a Machine Management issue, because it's not a specific-slave issue. Moving back to the main RelEng component for re-triage.
Component: Release Engineering: Machine Management → Release Engineering
QA Contact: armenzg
Whiteboard: [buildduty][sheriff-want] → [sheriff-want]
Depends on: 794987
Depends on: 794420
I don't know what the correct resolution for this is, but for now the only things we can do are:
* get hung slaves back into action
* find tests to disable
* keep working on moving away from the minis (in progress)
* re-purpose rev3 machines

IMHO there's no point in keeping the bug open.
Fixed by adding what slaves were available and getting serious about rebooting the constantly hanging WinXP slaves. I will, however, be filing a clone of this bug soon, for Linux32 - it's perpetually around 10-12 hours, with no real reason to believe we won't shove it up to a 24 hour backlog.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering