Closed Bug 1151218 Opened 9 years ago Closed 9 years ago

l10n repacks are intermittently broken on m-c by "command timed out: 1200 seconds without output, attempting to kill" during clobber step

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#?job_id=1284322&repo=mozilla-central

07:12:09 INFO - Running command: ['python', 'c:\\builds\\moz2_slave\\m-cen-w64-ntly-l10n-2_3-000000\\scripts\\external_tools\\purge_builds.py', '-s', '12', '--max-age', '14', '--not', 'info', '--not', 'rel-*', '--not', 'tb-rel-*', 'c:\\builds\\moz2_slave']
07:12:09 INFO - Copy/paste: python c:\builds\moz2_slave\m-cen-w64-ntly-l10n-2_3-000000\scripts\external_tools\purge_builds.py -s 12 --max-age 14 --not info --not rel-* --not tb-rel-* c:\builds\moz2_slave
07:12:09 INFO - Using env: {'HG_SHARE_BASE_DIR': 'c:/builds/hg-shared',
07:12:09 INFO - 'PATH': 'C:\\mozilla-build\\python27;C:\\mozilla-build\\buildbotve\\scripts;C:\\mozilla-build\\msys\\local\\bin;c:\\mozilla-build\\wget;c:\\mozilla-build\\7zip;c:\\mozilla-build\\blat261\\full;c:\\mozilla-build\\python;c:\\mozilla-build\\svn-win32-1.6.3\\bin;c:\\mozilla-build\\upx203w;c:\\mozilla-build\\emacs-22.3\\bin;c:\\mozilla-build\\info-zip;c:\\mozilla-build\\nsis-2.22;c:\\mozilla-build\\nsis-2.33u;c:\\mozilla-build\\nsis-2.46u;c:\\mozilla-build\\wix-351728;c:\\mozilla-build\\hg;c:\\mozilla-build\\python\\Scripts;c:\\mozilla-build\\kdiff3;c:\\mozilla-build\\yasm;.;C:\\mozilla-build\\msys\\local\\bin;C:\\mozilla-build\\msys\\mingw\\bin;C:\\mozilla-build\\msys\\bin;c:\\windows\\system32;c:\\windows;c:\\windows\\System32\\Wbem;c:\\windows\\System32\\WindowsPowerShell\\v1.0\\;c:\\mozilla-build;c:\\mozilla-build\\python27;c:\\mozilla-build\\python27\\Scripts;C:\\mozilla-build\\msys\\bin;c:\\mozilla-build\\vim\\vim72;c:\\mozilla-build\\wget;c:\\mozilla-build\\info-zip;c:\\CoreUtils\\bin;c:\\mozilla-build\\buildbotve\\scripts;c:\\Program Files (x86)\\Microsoft SQL Server\\100\\Tools\\Binn\\;c:\\Program Files\\Microsoft SQL Server\\100\\Tools\\Binn\\;c:\\Program Files\\Microsoft SQL Server\\100\\DTS\\Binn\\;c:\\Program Files (x86)\\Windows Kits\\8.0\\Windows Performance Toolkit\\;c:\\Program Files (x86)\\Windows Kits\\8.1\\Windows Performance Toolkit\\;c:\\Program Files\\Microsoft SQL Server\\110\\Tools\\Binn\\;c:\\Program Files (x86)\\Microsoft SDKs\\TypeScript\\1.0\\;c:\\mozilla-build\\hg;c:\\mozilla-build\\moztools-x64\\bin;c:\\mozilla-build\\vim\\vim72'}
command timed out: 1200 seconds without output, attempting to kill
07:32:12 ERROR - Return code: 1
07:32:12 FATAL - failed to purge builds
07:32:12 FATAL - Running post_fatal callback...
07:32:12 FATAL - Exiting 2
mgerva, do we need a longer timeout on the mozharness step in the buildbot factory ? Windows isn't very speedy at purging.
(In reply to Nick Thomas [:nthomas] from comment #2)
> mgerva, do we need a longer timeout on the mozharness step in the buildbot
> factory ? Windows isn't very speedy at purging.

This will help for sure. Another option is to set purge_maxage to a larger value (30 days?) and maybe we could find what operation takes 1200 seconds and force it to write some log messages too.
I've tried to update the first line of the external_tools/purge_builds.py script with

#!/usr/bin/env python -u

but I am still getting the timeout error.
I don't see any advantage about using a sub process call to a python script from python itself (at least for this particular bug)

I think cleanest solution would be to to rewrite the purge() function provided by purge_builds.py and call it directly without passing through the subprocess (command_run) call. 

Here's a first patch (30kB!!) that allows us to have the full control of the purge output but it also requires some non trivial changes:

* some methods (is_*, retry, rmtree, _rmtree_windows),  living inside the ScriptMixin, have been moved to the script.py module level, so they can be used without need to instantiate a ScriptMixin object.  Moved methods are still available to the ScriptMixin via wrappers.
There are also few other methods that can be moved moved aschmod(), mkdir_p(), ... but it's outside the scope of this patch

* logging outside mixins is achieved via log = logging.getLogger(__name__);
  + The only issue here is that log.fatal raises SystemExit in mozharness so there are some unaesthetic "if log_level == FATAL; raise SystemExit"

* except for logging, the only method using 'self' is _rmtree_windows(), it calls run_command(...) twice. These calls can be replaced by simple subprocess.Popen() (command_run is doing nothing special here)

* There are new methods in diskutils to find out .hg directories and directories older than "n days".

* added tests: diskutils.py has now a coverage of 91% and all tox tests are fine.

This patch includes a first draft of the purge/purge_hg_share methods they have the same interface as the  methods from external_tools/purge_builds.py 

I am going to do some builds on ash to see if there are some unexpected edge cases our tests do not cover.

Working on this patch, I have discovered that in our repositories (mozharness and tools) we have created at least 3 functions/methods to delete directories: rmdir_recursive, rmtree  and rmdirRecursive (each one has also a windows specific version). I have also noticed that purge_builds.py script provided by mozharenss and the one provided by tools are slightly different (tools one is more recent)
Attachment #8592202 - Flags: feedback?(catlee)
catlee and I were discussing this bug today, and found that en-US builds are launched like this:
========= Started 'c:/mozilla-build/python....
'c:/mozilla-build/python27/python' '-u' 'scripts/scripts/fx_desktop_build.py' '--config' 'builds/releng_base_windows_64_builds.py' '--config' 'balrog/production.py' '--branch' 'mozilla-central' '--build-pool' 'production' '--enable-nightly'
 in dir c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\. (timeout 10800 secs) (maxTime 19800 secs)

while l10n is doing:
========= Started 'c:/mozilla-build/python....
'c:/mozilla-build/python27/python' '-u' 'scripts/scripts/desktop_l10n.py' '--branch-config' 'single_locale/mozilla-central.py' '--platform-config' 'single_locale/win64.py' '--environment-config' 'single_locale/production.py' '--balrog-config' 'balrog/production.py' '--total-chunks' '3' '--this-chunk' '3'
 in dir c:\builds\moz2_slave\m-cen-w64-l10n-ntly-3-00000000\. (timeout 1200 secs)

Note the difference in buildbot timeouts. We don't see problems purging in en-US builds because of the much larger values (which are actually for linking xul.dll and PGO respectively), and we could set values in l10n to get through the purging.

re actually fixing the purge_builds.py output to not buffer
There are two code paths in run_command(), depending on if output_timeout is specified - see http://hg.mozilla.org/build/mozharness/file/2b5ee9d274e1/mozharness/base/script.py#l721. It's *possible* that ProcessHandler may disable buffering where a simple Popen doesn't, and would be worth trying (ie by hack PurgeMixin).
We have bumped the timeout in bug 740142, see comment 220 and comment 221, we are not getting this class of errors anymore.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attachment #8592202 - Flags: feedback?(catlee)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: