Closed Bug 999558 Opened 11 years ago Closed 11 years ago

high pending for ubuntu64-vm try test jobs on Apr 22 morning PT

Categories:
(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking:
(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: jlund, Assigned: Callek)

Details

A higher spike than normal was seen in try test pending jobs. Normally the OS X 10.8 pending counts sit higher than the rest, but ubuntu64-vm took the lead with a peak of ~700 pending at around 10:30 PT. Nagios was alerting from cruncher as well:

Tue 09:24:22 PDT [4615] cruncher.srv.releng.scl3.mozilla.com:Pending builds is WARNING: WARNING Pending Builds: 2828
I'd guess (from observation, not actual knowledge) that we have a limit of ~400 linux32 test instances and ~600 linux64 test instances running, but that's no longer the correct split for our thousand instances now that we've moved the b2g emulator tests off the Fedora slaves and onto AWS. Right now there's 1 pending linux32 job and 1737 pending linux64 jobs.
Or perhaps not: linux32 is staying at 1 pending while having gone up to 488 running; linux64 is at 1891 pending with the same 601 running it had before, so maybe it's less "we have the wrong split between them" and more "only running 601 of the 984 linux64 test slaves is too few."
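A minimal back-of-the-envelope sketch of that arithmetic, using only the numbers quoted in the comments above (illustrative Python, not RelEng tooling):

# Numbers taken from the comment above; the point is how much linux64
# capacity sits unused while jobs pend.
linux64_slaves = 984     # total linux64 test slaves
linux64_running = 601    # currently running
linux64_pending = 1891   # jobs waiting

idle = linux64_slaves - linux64_running
print(f"{idle} linux64 slaves unused while {linux64_pending} jobs pend")
# -> 383 linux64 slaves unused while 1891 jobs pend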
So we limit ourselves to 600: http://hg.mozilla.org/build/cloud-tools/file/default//configs/watch_pending.cfg#l70. I wonder if we will have to revisit that number sooner rather than later, given the recent pending trend. Currently we have 1315 ubuntu64-vm try test jobs pending; I will continue to track numbers here.
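For context, the cap being referenced is a per-instance-type limit on how many spot instances the pending watcher will keep running. A minimal sketch of that kind of check, assuming hypothetical names (MAX_INSTANCES, instances_to_start, "tst-linux64-spot"); this is not the real watch_pending.cfg format or cloud-tools code:

# Hypothetical sketch of a per-type instance cap; names and structure are
# illustrative assumptions, not the actual cloud-tools configuration.
MAX_INSTANCES = {
    "tst-linux64-spot": 600,   # the 600 limit referenced above
    "tst-linux32-spot": 400,   # assumed per the earlier comment
}

def instances_to_start(instance_type, pending, running):
    """Start more instances only while under the configured cap."""
    cap = MAX_INSTANCES.get(instance_type, 0)
    if running >= cap:
        return 0
    # never start more than the remaining headroom, or more than pending jobs
    return min(cap - running, pending)

print(instances_to_start("tst-linux64-spot", pending=1891, running=601))
# -> 0: once the cap is hit, nothing more starts; bumping that cap is the
#    change under discussion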
Doing a diff of our try builders from April 8th until now yields: https://pastebin.mozilla.org/4938674. It looks like we added a bunch of new tests by splitting suites up (Bug 819963 - Split up mochitest-bc on desktop into ~30 minute chunks); not sure if this is what is hurting us.

Also from IRC:

<jlund|build> catlee-away: rail - do you think we should bump the 600 limit or do you have any suggestions for optimizing? something along the lines of philor's suggestions
<rail> jlund|buildduty: I'm not sure if we have all those in DNS/slavealloc/buildbot
<jlund|build> well, actually, I should probably figure out why the sudden bump all week compared to last week.
<rail> moar tests?
<philor> is it really compared to last week, or compared to the week before last?
<jlund|build> philor: I'm probably off. Yeah, I actually looked at the first week of April, not last week.
<philor> oh, the b2g reftest move actually was last week, though they ran side-by-side for a while
<philor> but we went from 10 chunks on Fedora to 15 (!) chunks on AWS, because it was such a slowdown that even 15 still made the overall time worse
<philor> so that's a chunk of load, and a make-the-count-worse, for anybody who builds b2g
<jlund|build> ok. so moar tests. so we either 1) remove tests 2) add more slaves 3) take the hit in wait times
<philor> much as I'd like it to be the case that "pending numbers don't necessarily mean wait times", the back edge of 10.8 is at 10.3 hours, and the back edge of linux64 is at 8.75, so even though 2700 is still better than 900, it's not better by much
<philor> and there's no way out of this, like there is for 10.8 where you can pick another OS, or go with the 10.6 results and kill the 10.8 tests
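To make the rechunking point concrete: going from 10 to 15 chunks adds 5 jobs per push for that suite regardless of per-chunk runtime. A trivial arithmetic sketch, with a made-up pushes-per-day figure that is an assumption, not a number from this bug:

# Illustrative only: b2g_pushes_per_day is an assumed figure, not from this bug.
b2g_pushes_per_day = 300
old_chunks, new_chunks = 10, 15
extra_jobs = b2g_pushes_per_day * (new_chunks - old_chunks)
print(f"~{extra_jobs} extra b2g reftest jobs per day from rechunking alone")
# -> ~1500 extra b2g reftest jobs per day from rechunking alone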
Assignee: nobody → bugspam.Callek
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Tools → General