mgerva, do we need a longer timeout on the mozharness step in the buildbot factory ? Windows isn't very speedy at purging.
(In reply to Nick Thomas [:nthomas] from comment #2) > mgerva, do we need a longer timeout on the mozharness step in the buildbot > factory ? Windows isn't very speedy at purging. This will help for sure. Another option is to set purge_maxage to a larger value (30 days?) and maybe we could find what operation takes 1200 seconds and force it to write some log messages too.
Created attachment 8592202 [details] [diff] [review] [mozharness] bug 1151218 - rewrite the purge_builds.py script.patch I've tried to update the first line of the external_tools/purge_builds.py script with #!/usr/bin/env python -u but I am still getting the timeout error. I don't see any advantage about using a sub process call to a python script from python itself (at least for this particular bug) I think cleanest solution would be to to rewrite the purge() function provided by purge_builds.py and call it directly without passing through the subprocess (command_run) call. Here's a first patch (30kB!!) that allows us to have the full control of the purge output but it also requires some non trivial changes: * some methods (is_*, retry, rmtree, _rmtree_windows), living inside the ScriptMixin, have been moved to the script.py module level, so they can be used without need to instantiate a ScriptMixin object. Moved methods are still available to the ScriptMixin via wrappers. There are also few other methods that can be moved moved aschmod(), mkdir_p(), ... but it's outside the scope of this patch * logging outside mixins is achieved via log = logging.getLogger(__name__); + The only issue here is that log.fatal raises SystemExit in mozharness so there are some unaesthetic "if log_level == FATAL; raise SystemExit" * except for logging, the only method using 'self' is _rmtree_windows(), it calls run_command(...) twice. These calls can be replaced by simple subprocess.Popen() (command_run is doing nothing special here) * There are new methods in diskutils to find out .hg directories and directories older than "n days". * added tests: diskutils.py has now a coverage of 91% and all tox tests are fine. This patch includes a first draft of the purge/purge_hg_share methods they have the same interface as the methods from external_tools/purge_builds.py I am going to do some builds on ash to see if there are some unexpected edge cases our tests do not cover. Working on this patch, I have discovered that in our repositories (mozharness and tools) we have created at least 3 functions/methods to delete directories: rmdir_recursive, rmtree and rmdirRecursive (each one has also a windows specific version). I have also noticed that purge_builds.py script provided by mozharenss and the one provided by tools are slightly different (tools one is more recent)
catlee and I were discussing this bug today, and found that en-US builds are launched like this: ========= Started 'c:/mozilla-build/python.... 'c:/mozilla-build/python27/python' '-u' 'scripts/scripts/fx_desktop_build.py' '--config' 'builds/releng_base_windows_64_builds.py' '--config' 'balrog/production.py' '--branch' 'mozilla-central' '--build-pool' 'production' '--enable-nightly' in dir c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\. (timeout 10800 secs) (maxTime 19800 secs) while l10n is doing: ========= Started 'c:/mozilla-build/python.... 'c:/mozilla-build/python27/python' '-u' 'scripts/scripts/desktop_l10n.py' '--branch-config' 'single_locale/mozilla-central.py' '--platform-config' 'single_locale/win64.py' '--environment-config' 'single_locale/production.py' '--balrog-config' 'balrog/production.py' '--total-chunks' '3' '--this-chunk' '3' in dir c:\builds\moz2_slave\m-cen-w64-l10n-ntly-3-00000000\. (timeout 1200 secs) Note the difference in buildbot timeouts. We don't see problems purging in en-US builds because of the much larger values (which are actually for linking xul.dll and PGO respectively), and we could set values in l10n to get through the purging. re actually fixing the purge_builds.py output to not buffer There are two code paths in run_command(), depending on if output_timeout is specified - see http://hg.mozilla.org/build/mozharness/file/2b5ee9d274e1/mozharness/base/script.py#l721. It's *possible* that ProcessHandler may disable buffering where a simple Popen doesn't, and would be worth trying (ie by hack PurgeMixin).
We have bumped the timeout in bug 740142, see comment 220 and comment 221, we are not getting this class of errors anymore.