Closed Bug 586418 Opened 14 years ago Closed 10 years ago

merge test suites for unit-test runs to balance load

Categories

(Release Engineering :: General, enhancement, P4)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: catlee, Unassigned)

References

Details

(Whiteboard: [reporting][optimization][buildfaster:p2])

Attachments

(6 files, 2 obsolete files)

E.g., win32 try opt mochitests spend only about 50% of their time actually running tests.  The rest is downloading, unpacking, rebooting, etc.
OS: Linux → All
Hardware: x86_64 → All
Yeah, could be some big wins here for sure.

(In some cases, it's more like 30% of their time; we saw 4:28 cleaning up, 1:30 unpacking, 4:24 running the test and probably 4 mins rebooting.)
Severity: normal → enhancement
Priority: -- → P4
Whiteboard: [reporting][optimization]
Could part of this be reduced by creating smaller test packages?
firefox-version-test-xpcshell.zip
firefox-version-test-jsreftest.zip
firefox-version-test-reftest.zip
firefox-version-test-crashtest.zip
firefox-version-test-mochitest.zip
firefox-version-test-mochitest-other.zip

I know each of these would need to have the binaries and certificates in the .zip file, but it would reduce the total download time and unpacking time.  

In Shaver's comment above, '4 mins rebooting' seems like a problem.  Maybe the 4 minutes includes the full reboot and reconnecting to buildbot.

Also 4:28 cleaning up seems like a very long time.  That is longer than the test run!  What happens in a cleanup step?  I imagine 'uninstall Firefox, remove test files, and some buildbot stuff'.
Attached file crashtest
In my opinion we should try to optimize win32 first, since I have seen the worst overall setup/teardown times on that platform. I believe the optimizations for win32 will benefit the other platforms as well, and focusing on win32 will keep us from having to look in too many different places.

That said, I have opened each of the test suites for win7, and I am going to analyze the one that took the shortest and then the one that took the longest.

Note that the time to download the builds, tests and symbols can vary depending on who hits the cache first.

This is the breakdown of a crashtest run, which was the shortest one:
* total - 10 mins, 34 secs
* actual test - 1 min, 55 secs
* clobber of previous run - 4 mins, 53 secs (*ouch^n!*)
* unpack tests - 1 min, 53 secs (*ouch!* - jmaher's proposal would help here)
* clobber tools - 31 secs (this can be easily optimized)
* download builds, tests & symbols - 4 secs (probably another job hit the cache first)
* unpack build - 4 secs
Attached file mochitest 2/5
This log took the longest.

The breakdown for mochitest 2/5 is:
* total - 27 mins, 49 secs
* actual test - 13 mins, 9 secs
* clobber of previous run - 6 mins, 38 secs (*ouch^n!*)
* unpack of tests - 2 mins, 29 secs (*oooff*)
* download steps - (builds, tests, symbols - 36 secs/ 2 mins, 31 secs / 54 secs)
* clobber of tools - 42 secs
This shows clearly that we waste a lot of time in the following (sorted by awfulness):
* clobber of previous run (related -> bug 583129)
* unpack tests (jmaher's proposal makes a lot of sense; the unpacking is more concerning than the download)
* clobber of tools - (related -> bug 590969)
Depends on: 583129, 590969
(In reply to comment #5)
> * unpack tests (jmaher's proposal makes a lot of sense; the unpacking is more
> concerning than the download)

If that is the case, could we use knowledge of the harness and the files it requires to selectively extract files instead of creating extra test archives?  Adding a bunch of test archives would complicate mobile tests quite a bit with the current master config + factories.
To just unzip the xpcshell tests, I did this on my Linux box:
unzip firefox-4.0b5pre.en-US.linux-i686.tests -x jsreftest/* mochitest/* mozmill/* reftest/* 

This could be done in Python:
tests = ['jsreftest', 'reftest', 'xpcshell', 'mozmill', 'mochitest']
tests.remove('xpcshell')

unzipcmd = 'unzip firefox-4.0b5pre.en-US.linux-i686.tests -x'
for test in tests:
  unzipcmd += ' ' + test + '/*'


I know a list comprehension could make this more polished and we would need an argument list for subprocess.Popen, but this should give a general idea of how this could be solved.
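For illustration, here is a slightly more fleshed-out version of the sketch above that actually invokes unzip through subprocess; the archive name and suite list are just the placeholders from the example, not anything the factories currently use:

import subprocess

ALL_SUITES = ['jsreftest', 'reftest', 'xpcshell', 'mozmill', 'mochitest']

def unpack_suite(tests_zip, suite):
    # exclude every suite directory except the one this job will run;
    # bin/ and certs/ come along because they are never excluded
    excludes = [s + '/*' for s in ALL_SUITES if s != suite]
    subprocess.check_call(['unzip', '-o', tests_zip, '-x'] + excludes)

unpack_suite('firefox-4.0b5pre.en-US.linux-i686.tests', 'xpcshell')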

Thanks :jhford, that is a much better idea than mine for achieving the same result.
btw, is there a link to the clobber script?
OK, so in that file there is a urlopen(url).read() call:
http://mxr.mozilla.org/build/source/tools/clobberer/clobberer.py#102

Can this information be derived and cached somewhere locally?  It might not take much time, but it seems like a potential perf hit.

Really, I think the problem would be in the Python code that iterates through the directory and deletes files.  We might be able to avoid this whole issue if we only unpack the files we care about (comment #7); otherwise, maybe a ShellCommand or a custom native app could clean the file system?  I would need to play with this.  Also, is this clobberer.py script called more than once for this whole step?
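To make the "cache it locally" idea concrete, here is a rough sketch of the kind of cache meant here; the cache path, expiry, and URL handling are made up for illustration and are not part of clobberer.py, and whether slightly stale data is acceptable is exactly the open question:

import os, time, urllib2

CACHE_FILE = 'clobberer_response.cache'
MAX_AGE = 15 * 60  # seconds; re-fetch at most every 15 minutes

def get_clobber_data(url):
    # reuse the last response if it is recent enough, so the clobber step
    # does not pay for a network round trip on every run
    if (os.path.exists(CACHE_FILE)
            and time.time() - os.path.getmtime(CACHE_FILE) < MAX_AGE):
        return open(CACHE_FILE).read()
    data = urllib2.urlopen(url).read()
    open(CACHE_FILE, 'w').write(data)
    return data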
We could have the ShellCommand run one really long command line that contains the entire shell command, but I don't know if that is the right thing to do.

There is another script that is needed, I think, for unpacking the DMGs on Mac.  Hitting the hgweb interface for those files, which is what Talos does, is probably best.

It would be great if talos and unittest could share the same setup and teardown so that a fix for one is also a fix for the other :)
It would be nice to do an experiment comparing the regular unpack with the selective unpack (comment #7) and see if the clobber time is reduced as a result.

Another crazy option for the tests and related files is to set them up with rsync, so we could rsync to just update the changed binaries.  That would be fast and we wouldn't have to delete them all :)
Are we talking about clobbering the specific branch/specific test directory, or are we talking about clobbering to clean up enough disk space (other branches, other test dirs)?

While I love rsync, keeping a centrally unpacked set of all tests from all builds from all platforms that might be running tests seems unmaintainable.
I would bet on the selective unpacking resulting in a quicker clobber.  

Rsyncing the tests is an interesting option, but one that might not be as scalable.  We currently have an HTTP proxy for the test archives for devices that are located in MV.  Because rsync doesn't go over HTTP, it would require a dedicated rsync machine to be set up to act as a proxy, which adds extra complexity.  And because rsync would have to touch every file, it might end up taking a while.
I am not 100% clear on the clobber step.  I assume the clobber step is performed on the test machine, not the build machine, to free up disk space and to avoid stale files from affecting a subsequent run on the same machine.

Yeah, the rsync idea is half baked... I just thought it might spur other ideas to make it fully baked, or some other out-of-the-box idea.  One way this could work is that the objdir (or $(objdir)/dist/test-package-stage) could be moved to a buildid folder on the build machine and the 10 different test clients could sync to that.  I think that what :jhford brings up about rsync touching every file might make the whole unpack vs. rsync comparison a wash.
(In reply to comment #15)
> I am not 100% clear on the clobber step.  I assume the clobber step is
> performed on the test machine, not the build machine to free up disk space and
> to avoid stale files from affecting a subsequent run on the same machine.

Would there be any advantage to running this step asynchronously?
Unit tests don't use the clobberer.  When we say 'clobber' it could either mean rm -rf <important file/folder> or running the clobberer on a build machine.

On unittest machines at least, we do a straight rm -rf <tools> ; rm -rf <build>
Wow, so if we are doing an actual shell command to remove directories on a test machine for the 'clobber' step, I think the selective unzip will solve a lot of the problems.

We could get more advanced for mochitest-other, since we won't need the mochitest/browser/*, mochitest/a11y/* and mochitest/chrome/* folders for M1-5.
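As a concrete illustration of that idea, extending the exclusion list from comment 7 for a plain mochitest (M1-5) run might look like the following; the archive name is a placeholder and the directory names are the ones mentioned above, not what the factories currently do:

# directories a plain mochitest (M1-5) run could skip at unpack time,
# on top of excluding the other suites entirely
extra_excludes = ['mochitest/browser/*', 'mochitest/a11y/*', 'mochitest/chrome/*']
other_suites = ['jsreftest/*', 'reftest/*', 'xpcshell/*', 'mozmill/*']

unzip_cmd = (['unzip', '-o', 'firefox-4.0b5pre.en-US.linux-i686.tests', '-x']
             + other_suites + extra_excludes)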
We also have to clobber the downloaded build, I believe; the stuff proposed in bug 525037 would improve the teardown time, as well as reduce I/O variability in our Talos results and let us put the working area on the fastest part of the platters.
Assignee: nobody → salbiz
I've put together some queries to try to identify the worst offenders for this bug. The attached file lists the 100 builds submitted since July with the lowest ratio of time spent on actual test runs to total time spent on the build. A couple of observations from this file:

- win7-opt-u-crashtest looks like a consistent worst offender
- w764-opt-u-xpcshell: although the entire run usually takes less than 5 minutes, the time spent running the test suite is <5s.
(In reply to comment #20)
> - w764-opt-u-xpcshell: although the entire run usually takes less than 5
> minutes, the time spent running the test suite is <5s.

That doesn't sound right.
Attachment #470586 - Attachment mime type: application/octet-stream → text/plain
Syed, does your query exclude jobs which failed with result = 2?  It might even make sense to only look at runs that were green (result = 0).
One more result set from my investigation, this time grouped by builder, with the aggregated average total time and the average time for each test-run step. Again, win7-crashtest seems to be the worst offender in this area.
(In reply to comment #22)
> Syed, does your query exclude jobs which failed with result = 2?  It might even
> make sense to only look at runs that were green (result = 0).

Yes, only builds passing with either result (0, 1) are screened. The xpcshell cases seemed to be isolated to mid-July and can probably be safely ignored.
There's a lot of good data here, but I'm not really able to glean a direction for what to try next.  From most of the comments, it sounds like having a separate zip file for each test type should help quite a bit.  I'd believe that, as the test package is rather huge. Should we put together a patch to try this?

I'll volunteer to help out with that, but who's going to own the buildbot changes on the other side of the coin for this experiment? (and is that a set of changes you're comfortable making on the buildbot side?)
(In reply to comment #4)

> * download steps - (builds, tests, symbols - 36 secs/ 2 mins, 31 secs / 54

Ugh, seems like our network is a huge bottleneck here. That's like 300K/s (tests are ~40MB). Seems like it should be possible to make this absurdly fast.

Splitting the test file might be possible, but we'd still be boned on downloading the build and symbols. +1 on selective extraction, that sounds like the better fix (and could be done in parallel with improving download speeds).
I think we need to be careful with download times. The machines running these tests have a squid proxy between them and the files. The first download will be slower, but hopefully still several Mbps. Other test runs using the same files should be insanely fast because they hit the cache; I've seen 1 or 2 seconds in the past.
(In reply to comment #26)
> (In reply to comment #4)
> 
> > * download steps - (builds, tests, symbols - 36 secs/ 2 mins, 31 secs / 54
> 
> Ugh, seems like our network is a huge bottleneck here. That's like 300K/s
> (tests are ~40MB). Seems like it should be possible to make this absurdly fast.
> 
> Splitting the test file might be possible, but we'd still be boned on
> downloading the build and symbols. +1 on selective extraction, that sounds like
> the better fix (and could be done in parallel with improving download speeds).

Is it possible to run the tests directly out of the .zip file?  e.g. extract just the scripts required to start running the tests, and then the tests themselves know how to read the .zip file.
In comment 7, I have some sample code to extract just a subset of tests from the .zip file.  I am not sure if there is a method for running tests from the zip file directly.  Most of the tests require a webserver to be set up, and that webserver needs to access the files.  I know that in some of the work I am doing on bug 543800 (run chrome style tests from a .jar package) we can run tests from inside a .jar.

We could modify the script in comment 7 to extract all the test harness related code as well as the tests we care about.  We could also experiment with just extracting the core harness bits (about 30 files) and running the rest from a .jar (really a .zip) on the webserver end.
Hey folks,  so I read this today, and I did some experiments on my boxes/vms. On mac, linux, and windows just doing unpack and delete of the tests, I found that unpacking the entire thing averaged about 25s and deleting the entire thing averaged about 7 seconds.  Breaking it down into subgroups significantly reduced the unpack and deletion time.  The only large unpack time was jsreftest at 4s.  All the deletion times averaged below 1s, except jsreftest which was between 1 and 2s.

So, I thought I'd throw together a patch as it takes some knowledge of what the tests need in order to properly unpack them.  All the tests need the bin/ and certs/ directories.  jsreftest needs both jsreftest/ and reftest/.  The others are pretty straight forward.

I'm not sure if I've done the integration stuff in packaged UnittestPackagedBuildFactory correctly. But if someone can correct this and run it on staging, we could see what impact selective unpacking would have.
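To make the mapping described above concrete, here is a minimal sketch of which directories each suite needs from the tests package, in inclusion form; the actual attached patch expresses the same thing with unzip -x exclusions, and the suite names here are illustrative:

# per this comment: every suite needs bin/ and certs/, jsreftest also needs
# reftest/, and the remaining suites just need their own directory
SUITE_DIRS = {
    'mochitest': ['mochitest/*'],
    'reftest':   ['reftest/*'],
    'jsreftest': ['jsreftest/*', 'reftest/*'],
    'xpcshell':  ['xpcshell/*'],
}

def unpack_cmd(tests_zip, suite):
    dirs = SUITE_DIRS.get(suite, [suite + '/*'])
    return ['unzip', '-o', tests_zip, 'bin/*', 'certs/*'] + dirs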
Attachment #471771 - Flags: feedback?(catlee)
(In reply to comment #30)
> Created attachment 471771 [details] [diff] [review]
> Selectively unpack the tests
> 
> Hey folks,  so I read this today, and I did some experiments on my boxes/vms.
> On mac, linux, and windows just doing unpack and delete of the tests, I found
> that unpacking the entire thing averaged about 25s and deleting the entire
> thing averaged about 7 seconds.  Breaking it down into subgroups significantly
> reduced the unpack and deletion time.  The only large unpack time was jsreftest
> at 4s.  All the deletion times averaged below 1s, except jsreftest which was
> between 1 and 2s.
> 
> So, I thought I'd throw together a patch as it takes some knowledge of what the
> tests need in order to properly unpack them.  All the tests need the bin/ and
> certs/ directories.  jsreftest needs both jsreftest/ and reftest/.  The others
> are pretty straight forward.
> 
> I'm not sure if I've done the integration stuff in packaged
> UnittestPackagedBuildFactory correctly. But if someone can correct this and run
> it on staging, we could see what impact selective unpacking would have.

This is awesome, thanks a lot!

Is there a reason you're excluding particular patterns instead of including what you know you want?  i.e. 'unzip -o filename bin certs mochitest' for mochitest suites?
catlee, I assume this is a side effect of my comment 7 where I indicated the unzip -x feature.  That was my 10 minutes of looking into unzip and how to use it.  I suspect there is no reason we need the -x.

One thing that sort of concerns me is the numbers that ctalbert mentioned in his comment.  His initial cleanup time was 7 seconds and we identified this clobber step as 5+ minutes.  Is it possible that we are overlooking something else?  

I know 1 second here and there adds up, so no complaints from me.
(In reply to comment #32)
> catlee, I assume this is a side effect of my comment 7 where I indicated the
> unzip -x feature.  That was my 10 minutes of looking into unzip and how to use it.  I
> suspect there is no reason we need the -x.
> 
> One thing that sort of concerns me is the numbers that ctalbert mentioned in
> his comment.  His initial cleanup time was 7 seconds and we identified this
> clobber step as 5+ minutes.  Is it possible that we are overlooking something
> else?  
> 
> I know 1 second here and there adds up, so no complaints from me.

Maybe differences in hardware?  On a randomly selected Win7 slave, I get these times for various steps:

cleanup: 5 minutes
unpack tests: 2 minutes
jsreftests: 7 minutes
Based on some investigation of the average time taken to run these jobs versus the time spent actually running tests, I've proposed a reshuffling of how some of the suites are split up. Since the actual test components of a11y and scroll take very little time (a11y <1 min on average, scroll ~3 min), we can shave a few minutes of setup/teardown by rolling them in with the chrome suite. Similarly, crashtest usually only runs tests for ~4 min, so I propose rolling it into jsreftest in opt only (the combined suite takes ~12 min on average for optimized builds).
Based on :catlee's feedback, I've made a couple of changes to the proposed patch, the main one being the naming of the merged suites. Since a11y tests don't run on Mac, I've named the new merged suites chrome (consisting of the old chrome, scroll and a11y suites) and chrome_mac (consisting of the old chrome and scroll suites).
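Purely as an illustration of the proposed grouping (this is not the actual buildbot-configs syntax, and the suite names are just the ones discussed above):

# merged jobs per this proposal -- illustrative only
MERGED_SUITES = {
    'chrome':     ['chrome', 'a11y', 'scroll'],   # a11y does not run on mac
    'chrome_mac': ['chrome', 'scroll'],
    'jsreftest':  ['jsreftest', 'crashtest'],     # opt builds only
}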
Attachment #471834 - Attachment is obsolete: true
Attachment #471846 - Flags: review?
Attachment #471846 - Flags: review? → review?(catlee)
I don't think we should change the names of the suites -- it'll just be confusing to everyone, make it hard to consistently track intermittent orange, and we'll have to confuse things again every time we rebalance.  In addition, as you've seen, we may want a different balance on opt and debug, or on different platforms.  Instead, I think that these combined jobs should just be a scheduling artifact, and that they should report each of the results independently.

(In reply to comment #32)
> One thing that sort of concerns me is the numbers that ctalbert mentioned in
> his comment.  His initial cleanup time was 7 seconds and we identified this
> clobber step as 5+ minutes.  Is it possible that we are overlooking something
> else?  

Cleanup is filesystem grinding, so very sensitive to what's in the kernel's cache.  If you aren't rebooting before the cleanup, you probably have those directories still in cache from unpacking them.  The cleanup we do on the slaves is the worst possible case: totally cold cache from a reboot.

This points to two things that can probably win huge:

- make it not be filesystem operations, by putting the working area on a separate drive or partition, and then mke2fs/format/etc.-ing our way to victory.  A ramdisk might also win large here.  There's a separate bug on this, which outlines other great advantages to doing so, but I've advocated for this enough for people to understand why I think it's so valuable.

- doing cleanup after the test run!  This means that we're not using time in which someone is waiting for the test run (since we've already reported), but also gives the advantage of the hot cache.  You can continue to do a cleanup before the run to "make sure" (probably you want to log a warning somewhere if there's anything found!), but in most cases this will be a no-op.  This is probably the biggest payoff for smallest effort available in terms of reducing cleanup overhead.
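Not RelEng's actual factory code, but a rough sketch, under the assumption that the cleanup is just a ShellCommand in a buildbot factory, of what moving it after the test step could look like:

from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

f = BuildFactory()
# ... download, unpack and run-tests steps would be added here ...

# do the expensive rm -rf after the test step, while the directory entries
# are still hot in the kernel's cache; a cheap pre-run check can stay in
# place to warn if anything was left behind
f.addStep(ShellCommand(
    name='post_run_cleanup',
    command=['rm', '-rf', 'build', 'tools'],
    alwaysRun=True,          # clean up even if the test step failed
    flunkOnFailure=False,    # a cleanup hiccup shouldn't turn the run red
))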
(In reply to comment #36)
> I don't think we should change the names of the suites -- it'll just be
> confusing to everyone, make it hard to consistently track intermittent orange,
> and we'll have to confuse things again every time we rebalance.  In addition,
> as you've seen, we may want a different balance on opt and debug, or on
> different platforms.  Instead, I think that these combined jobs should just be
> a scheduling artifact, and that they should report each of the results
> independently.

This doesn't change the name of the suites, just which column they show up on Tinderbox.  They still get reported separately.

> (In reply to comment #32)
> > One thing that sort of concerns me is the numbers that ctalbert mentioned in
> > his comment.  His initial cleanup time was 7 seconds and we identified this
> > clobber step as 5+ minutes.  Is it possible that we are overlooking something
> > else?  
> 
> Cleanup is filesystem grinding, so very sensitive to what's in the kernel's
> cache.  If you aren't rebooting before the cleanup, you probably have those
> directories still in cache from unpacking them.  The cleanup we do on the
> slaves is the worst possible case: totally cold cache from a reboot.
> 
> This points to two things that can probably win huge:
> 
> - make it not be filesystem operations, by putting the working area on a
> separate drive or partition, and then mke2fs/format/etc.-ing our way to
> victory.  A ramdisk might also win large here.  There's a separate bug on this,
> which outlines other great advantages to doing so, but I've advocated for this
> enough for people to understand why I think it's so valuable.

Yeah, I'm sure using a ramdisk would be a huge win.  I'm marginally worried that the extra memory usage could bump performance results.

> - doing cleanup after the test run!  This means that we're not using time in
> which someone is waiting for the test run (since we've already reported), but
> also gives the advantage of the hot cache.  You can continue to do a cleanup
> before the run to "make sure" (probably you want to log a warning somewhere if
> there's anything found!), but in most cases this will be a no-op.  This is
> probably the biggest payoff for smallest effort available in terms of reducing
> cleanup overhead.

Great idea, filed as bug 593384.  For talos runs, yes, we have already reported results.  For unittest runs though, the results are contained only in the log, which isn't sent until the job is finished.
(In reply to comment #37)
> > 
> > Cleanup is filesystem grinding, so very sensitive to what's in the kernel's
> > cache.  If you aren't rebooting before the cleanup, you probably have those
> > directories still in cache from unpacking them.  The cleanup we do on the
> > slaves is the worst possible case: totally cold cache from a reboot.
> > 
> > This points to two things that can probably win huge:
> > 
> > - make it not be filesystem operations, by putting the working area on a
> > separate drive or partition, and then mke2fs/format/etc.-ing our way to
> > victory.  A ramdisk might also win large here.  There's a separate bug on this,
> > which outlines other great advantages to doing so, but I've advocated for this
> > enough for people to understand why I think it's so valuable.
> 
> Yeah, I'm sure using a ramdisk would be a huge win.  I'm marginally worried
> that the extra memory usage could bump performance results.
> 

+1 on ramdisk usage.

> > - doing cleanup after the test run!  This means that we're not using time in
> > which someone is waiting for the test run (since we've already reported), but
> > also gives the advantage of the hot cache.  You can continue to do a cleanup
> > before the run to "make sure" (probably you want to log a warning somewhere if
> > there's anything found!), but in most cases this will be a no-op.  This is
> > probably the biggest payoff for smallest effort available in terms of reducing
> > cleanup overhead.
> 
> Great idea, filed as bug 593384.  For talos runs, yes, we have already reported
> results.  For unittest runs though, the results are contained only in the log,
> which isn't sent until the job is finished.

I'm sure someone has thought about this; however, is there any reason why we would want to keep all of this around after the run (say, when there is a failure, so that we can try to diagnose it)? Otherwise I'm a big fan of automated tests that:
1) set up what they need (don't assume it's already done)
2) execute the test
3) report results
4) clean up after themselves (this might mean moving the stuff to someplace safe for later investigation)
(In reply to comment #37)
> Yeah, I'm sure using a ramdisk would be a huge win.  I'm marginally worried
> that the extra memory usage could bump performance results.

TBH, I wouldn't bother with the ramdisk until we saw what we gained from putting it on a fresh filesystem.
Comment on attachment 471771 [details] [diff] [review]
Selectively unpack the tests

I like it!  I think using file inclusion instead of exclusion is clearer.  We'll test to make sure.
Attachment #471771 - Flags: feedback?(catlee) → feedback+
(In reply to comment #40)
> Comment on attachment 471771 [details] [diff] [review]
> Selectively unpack the tests
> 
> I like it!  I think using file inclusion instead of exclusion is clearer. 
> We'll test to make sure.

I'll take a closer look at testing the proposed patch. Filed under bug 594018
The patch was tested in staging; the combined jobs are still not too long (well under 20 min), so this should shave off several minutes of extraneous setup/teardown.
Attachment #471846 - Attachment is obsolete: true
Attachment #473040 - Flags: review?(catlee)
Attachment #471846 - Flags: review?(catlee)
(In reply to comment #37)
> This doesn't change the name of the suites, just which column they show up on
> Tinderbox.  They still get reported separately.

Until we take tinderbox out of the equation, this will still cause developer confusion, because it will change the number of builds shown on TBPL. Currently, if you get orange on crashtest, you'll have an orange Co or Cd result on TBPL. If you combine crashtest+jsreftest, you'll now have one single result containing both. We should figure out how to not change that display to make this work.
I was under the impression that combining crashtest and jsreftest would only affect the machine running them, not the reporting.  Although mochitest-other reports as a single value to TBPL, so we might be in the same boat.
Depends on: 594415
Attachment #473040 - Flags: review?(catlee) → review+
Attachment #471771 - Flags: feedback+
Since the selective unpacking has landed, the only outstanding thing to be done for this bug is to merge the smaller test suites, which will land once bug 594415 gets fixed so that the reporting on TBPL is not affected. Changing the bug summary to reflect this.
Summary: Identify builders that are spending most of their time doing setup/teardown → merge test suites for unit-test runs to balance load
Assignee: s.s.albiz → nobody
Blocks: 661585
No longer depends on: 583129
No longer depends on: 590969
Whiteboard: [reporting][optimization] → [reporting][optimization][buildfaster:p2]
Product: mozilla.org → Release Engineering
We've done tons of re-arranging of test suites since this was last touched. Is this bug still useful?
Flags: needinfo?(catlee)
There's certainly still demand - look at the way androidx86 tests were implemented, as a glob of unrelated tests crammed together in a single run, reporting in a lump, because nobody ever figured out the reporting part of this. We also have things like the talos "other" suite, where the job takes 6 minutes and the actual runtime is less than 3 minutes, which would benefit if this existed.
Agreed, this is something we need to keep an eye on. I'm not sure this bug is the best way to track it, but we don't have anything else right now.
Flags: needinfo?(catlee)
Let's file specific bugs as we recognize a need.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME