Closed
Bug 1305485
Opened 8 years ago
Closed 8 years ago
Windows TaskCluster workers should use `hg robustcheckout`, not put data on c:\
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(firefox52 fixed)
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
firefox52 | --- | fixed |
People
(Reporter: gps, Assigned: grenade)
References
Details
Attachments
(1 file)
There are a few major issues with hg performance on TaskCluster Windows workers currently: 1) Data on C:\ makes I/O slow per bug 1305174 2) Not using a shared / pooled storage repo 3) May not be purging/updating working copy robustly #1 can be fixed by moving storage to z:\ and performing a clone at task time, not at AMI bake time. #2 and #3 can be fixed by using `hg robustcheckout`. See https://hg.mozilla.org/mozilla-central/file/default/testing/docker/recipes/run-task for how `hg robustcheckout` should be used. Essentially, automation should invoke the following for ideal Mercurial interaction: hg robustcheckout \ --sharebase z:\hg-shared \ # Tells Mercurial to pool storage --purge \ # To delete untracked files from the source checkout (optional) --upstream https://hg.mozilla.org/mozilla-unified \ # Likely value --revision <sha1> \ https://hg.mozilla.org/path/to/repo \ z:\path\to\checkout The fingerprint for hg.mozilla.org should also be obtained from a TC secret and fed into the commandline (for now - I may move fingerprint lookup into `hg robustcheckout`). The following should also exist in the global mercurial.ini: # By default the progress bar starts after 3s and updates every 0.1s. We # change this so it shows and updates every 1.0s. # We also tell progress to assume a TTY is present so updates are printed # even if there is no known TTY. [progress] delay = 1.0 refresh = 1.0 assume-tty = true [web] cacerts = /path/to/cacerts-bundle-file [extensions] robustcheckout = /path/to/robustcheckout.py You can copy robustcheckout from https://hg.mozilla.org/hgcustom/version-control-tools/file/tip/hgext/robustcheckout/__init__.py into the base AMI or Puppet so it is available to the base image.
Assignee | ||
Comment 1•8 years ago
|
||
robust checkout added to tc win builders: https://github.com/mozilla-releng/OpenCloudConfig/commit/8366c5700159ae8072f298f724b4d7f8002887d4
Comment hidden (mozreview-request) |
Reporter | ||
Comment 3•8 years ago
|
||
mozreview-review |
Comment on attachment 8794966 [details] Bug 1305485 - use robustcheckout with ephemeral storage (tc windows); https://reviewboard.mozilla.org/r/81162/#review79772 I abbreviated the logic a bit in my Bugzilla comment. Please look at the logic inside testing/docker/recipes/run-task:vcs_checkout() and duplicate that. I'm fine deferring the hg.mo fingerprint setting to a follow-up.
Attachment #8794966 -
Flags: review?(gps) → review-
Assignee | ||
Comment 4•8 years ago
|
||
robust checkout shared folder and permissions on y: https://github.com/mozilla-releng/OpenCloudConfig/commit/a41b2c9df1641b2776a283324acc78675a0ed251
Reporter | ||
Comment 5•8 years ago
|
||
(In reply to Rob Thijssen (:grenade - GMT) from comment #4) > robust checkout shared folder and permissions on y: > https://github.com/mozilla-releng/OpenCloudConfig/commit/ > a41b2c9df1641b2776a283324acc78675a0ed251 You should probably vendor robustcheckout.py into that repo. Or, you should pin the revision used. e.g. https://hg.mozilla.org/hgcustom/version-control-tools/raw-file/af7287700528/hgext/robustcheckout/__init__.py. The benefit to vendoring is if hg.mozilla.org goes down, you'll still be able to generate AMIs. The benefit to pinning the revision is if we screw up the file in version-control-tools, that mistake won't get propagated to Windows workers. Also, if hg.mozilla.org is compromised or you aren't doing certificate checking properly in that AMI setup script, a malicious file executed on every build could get executed. Moral of the story: vendoring reduces your exposure to down services and mitigates attack surface area.
Comment hidden (mozreview-request) |
Reporter | ||
Comment 7•8 years ago
|
||
mozreview-review |
Comment on attachment 8794966 [details] Bug 1305485 - use robustcheckout with ephemeral storage (tc windows); https://reviewboard.mozilla.org/r/81162/#review79778 I /think/ this will work. I'm anxious to see a Try result with this. I'm not sure we've even verified `hg robustcheckout` works on Windows. Please ping me when you have a Try push so I can verify this is doing the right thing. I'm also curious about the performance!
Attachment #8794966 -
Flags: review?(gps) → review+
Comment 8•8 years ago
|
||
Hidden on all trees, so you'll need to click the "Excluded Jobs" in the grey bar at the top of treeherder to see them.
Assignee | ||
Comment 9•8 years ago
|
||
gps: it almost works. here's the output from a try push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff69a411ba3e09723fcfea64041709fa13c148c8&exclusion_profile=false&selectedJob=28093375): [taskcluster 2016-09-27T09:37:59.319Z] Worker Type settings: [taskcluster 2016-09-27T09:38:04.828Z] { [taskcluster 2016-09-27T09:38:04.828Z] "aws": { [taskcluster 2016-09-27T09:38:04.828Z] "ami-id": "ami-48d7935f", [taskcluster 2016-09-27T09:38:04.828Z] "availability-zone": "us-east-1d", [taskcluster 2016-09-27T09:38:04.828Z] "instance-id": "i-0f42b93583ff186d5", [taskcluster 2016-09-27T09:38:04.828Z] "instance-type": "c3.2xlarge", [taskcluster 2016-09-27T09:38:04.828Z] "public-ipv4": "54.227.52.26" [taskcluster 2016-09-27T09:38:04.828Z] }, [taskcluster 2016-09-27T09:38:04.828Z] "generic-worker": { [taskcluster 2016-09-27T09:38:04.828Z] "go-arch": "amd64", [taskcluster 2016-09-27T09:38:04.828Z] "go-os": "windows", [taskcluster 2016-09-27T09:38:04.828Z] "go-version": "go1.7", [taskcluster 2016-09-27T09:38:04.828Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v5.2.0", [taskcluster 2016-09-27T09:38:04.828Z] "version": "5.2.0" [taskcluster 2016-09-27T09:38:04.828Z] }, [taskcluster 2016-09-27T09:38:04.828Z] "machine-setup": { [taskcluster 2016-09-27T09:38:04.828Z] "maintainer": "rthijssen@mozilla.com", [taskcluster 2016-09-27T09:38:04.828Z] "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/76edb6b28de33f1569425ae51ce912e41487b74d/userdata/Manifest/win2012.json" [taskcluster 2016-09-27T09:38:04.828Z] } [taskcluster 2016-09-27T09:38:04.828Z] } [taskcluster 2016-09-27T09:38:04.828Z] === Task Starting === [taskcluster 2016-09-27T09:38:04.829Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision ff69a411ba3e09723fcfea64041709fa13c148c8 https://hg.mozilla.org/try/ .\build\src Z:\task_1474968877>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision ff69a411ba3e09723fcfea64041709fa13c148c8 https://hg.mozilla.org/try/ .\build\src ensuring https://hg.mozilla.org/try/@ff69a411ba3e09723fcfea64041709fa13c148c8 is available at .\build\src (cloning from upstream repo https://hg.mozilla.org/mozilla-unified) Traceback (most recent call last): File "mercurial\dispatch.pyc", line 204, in _runcatch File "mercurial\dispatch.pyc", line 880, in _dispatch File "mercurial\dispatch.pyc", line 637, in runcommand File "mercurial\dispatch.pyc", line 1010, in _runcommand File "mercurial\dispatch.pyc", line 971, in checkargs File "mercurial\dispatch.pyc", line 877, in <lambda> File "mercurial\util.pyc", line 1036, in check File "C:/mozilla-build/robustcheckout.py", line 149, in robustcheckout File "C:/mozilla-build/robustcheckout.py", line 229, in _docheckout File "mercurial\hg.pyc", line 439, in clone File "mercurial\hg.pyc", line 175, in peer File "mercurial\hg.pyc", line 152, in _peerorrepo File "mercurial\httppeer.pyc", line 297, in instance File "mercurial\httppeer.pyc", line 77, in _fetchcaps File "mercurial\httppeer.pyc", line 206, in _call File "mercurial\httppeer.pyc", line 153, in _callstream File "urllib2.pyc", line 429, in open File "urllib2.pyc", line 447, in _open File "urllib2.pyc", line 407, in _call_chain File "mercurial\url.pyc", line 375, in https_open File "mercurial\keepalive.pyc", line 264, in do_open URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)> abort: error: EOF occurred in violation of protocol (_ssl.c:590) [taskcluster 2016-09-27T09:41:39.107Z] Exit Code: 0 [taskcluster 2016-09-27T09:41:39.107Z] === Task Finished === [taskcluster 2016-09-27T09:41:39.107Z] Task Duration: 3m34.279s
Assignee | ||
Comment 10•8 years ago
|
||
actually, that may have just been spectacularly bad luck. a subsequent run (no change to worker/ami, but different instance) performed the robustcheckout without issue... https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff69a411ba3e09723fcfea64041709fa13c148c8&exclusion_profile=false&selectedJob=28093782
Comment hidden (mozreview-request) |
Reporter | ||
Comment 12•8 years ago
|
||
Looks like the clone is 2-3x faster. It could be faster if there were more IOPS. But this is ephemeral storage and I don't think you have any control over IOPS there. Still promising.
Comment 13•8 years ago
|
||
Note, there is support for task-defined caches as of generic worker 5.4.0. It isn't heavily tested yet, and the docs in the review phase of a PR at the moment. What this gets you is some more control over who can use your cache, rather than any task that runs on the worker...
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 16•8 years ago
|
||
patch content unchanged, just updated for latest m-c merge compatibility and retested on tc win builders with hg 3.9.1 updates applied.
Comment 17•8 years ago
|
||
Pushed by cbook@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/68be44803595 use robustcheckout with ephemeral storage (tc windows); r=gps
Keywords: checkin-needed
Assignee | ||
Comment 18•8 years ago
|
||
seems to be some file lock issues we need to add wait/retry to robustcheckout for. [taskcluster 2016-10-13T07:49:03.459Z] Worker Type settings: [taskcluster 2016-10-13T07:49:03.471Z] { [taskcluster 2016-10-13T07:49:03.471Z] "aws": { [taskcluster 2016-10-13T07:49:03.471Z] "ami-id": "ami-431f5723", [taskcluster 2016-10-13T07:49:03.471Z] "availability-zone": "us-west-1b", [taskcluster 2016-10-13T07:49:03.471Z] "instance-id": "i-01c9e735154d130a7", [taskcluster 2016-10-13T07:49:03.471Z] "instance-type": "c3.2xlarge", [taskcluster 2016-10-13T07:49:03.471Z] "public-ipv4": "54.153.58.136" [taskcluster 2016-10-13T07:49:03.471Z] }, [taskcluster 2016-10-13T07:49:03.471Z] "generic-worker": { [taskcluster 2016-10-13T07:49:03.472Z] "go-arch": "amd64", [taskcluster 2016-10-13T07:49:03.472Z] "go-os": "windows", [taskcluster 2016-10-13T07:49:03.472Z] "go-version": "go1.7", [taskcluster 2016-10-13T07:49:03.472Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v5.2.0", [taskcluster 2016-10-13T07:49:03.472Z] "version": "5.2.0" [taskcluster 2016-10-13T07:49:03.472Z] }, [taskcluster 2016-10-13T07:49:03.472Z] "machine-setup": { [taskcluster 2016-10-13T07:49:03.472Z] "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/593a69f40311e4b9fc3b3e9213a9126fa307c11f/userdata/Manifest/win2012.json" [taskcluster 2016-10-13T07:49:03.472Z] } [taskcluster 2016-10-13T07:49:03.472Z] } [taskcluster 2016-10-13T07:49:03.472Z] === Task Starting === [taskcluster 2016-10-13T07:49:03.473Z] Executing command 0: "c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7480f3d08b18ee0d347a944020509c2d57bb5cf0 https://hg.mozilla.org/integration/autoland/ .\build\src Z:\task_1476344616>"c:\Program Files\Mercurial\hg.exe" robustcheckout --sharebase y:\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 7480f3d08b18ee0d347a944020509c2d57bb5cf0 https://hg.mozilla.org/integration/autoland/ .\build\src ensuring https://hg.mozilla.org/integration/autoland/@7480f3d08b18ee0d347a944020509c2d57bb5cf0 is available at .\build\src (cloning from upstream repo https://hg.mozilla.org/mozilla-unified) (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29) searching for changes adding changesets Traceback (most recent call last): File "mercurial\dispatch.pyc", line 204, in _runcatch File "mercurial\dispatch.pyc", line 880, in _dispatch File "mercurial\dispatch.pyc", line 637, in runcommand File "mercurial\dispatch.pyc", line 1010, in _runcommand File "mercurial\dispatch.pyc", line 971, in checkargs File "mercurial\dispatch.pyc", line 877, in <lambda> File "mercurial\util.pyc", line 1036, in check File "C:/mozilla-build/robustcheckout.py", line 149, in robustcheckout File "C:/mozilla-build/robustcheckout.py", line 229, in _docheckout File "mercurial\hg.pyc", line 496, in clone File "mercurial\hg.pyc", line 385, in clonewithshare File "mercurial\exchange.pyc", line 1186, in pull File "mercurial\exchange.pyc", line 1327, in _pullbundle2 File "mercurial\bundle2.pyc", line 355, in processbundle File "mercurial\bundle2.pyc", line 432, in _processpart File "mercurial\bundle2.pyc", line 1361, in handlechangegroup File "mercurial\changegroup.pyc", line 316, in apply File "mercurial\revlog.pyc", line 1572, in addgroup File "mercurial\changelog.pyc", line 138, in _delay File "mercurial\changelog.pyc", line 78, in __init__ File "mercurial\store.pyc", line 480, in __call__ File "mercurial\scmutil.pyc", line 588, in __call__ File "mercurial\windows.pyc", line 126, in posixfile IOError: [Errno 13] y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\00changelog.i: Access is denied abort: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg/store\00changelog.i: Access is denied [taskcluster 2016-10-13T07:49:05.712Z] Exit Code: 0 [taskcluster 2016-10-13T07:49:05.712Z] === Task Finished === [taskcluster 2016-10-13T07:49:05.712Z] Task Duration: 2.2399978s
Comment 19•8 years ago
|
||
had to back this out in coordination with Rob after we had some failures like https://treeherder.mozilla.org/logviewer.html#?job_id=4994445&repo=autoland in the following push after this change landed
Flags: needinfo?(rthijssen)
Comment 20•8 years ago
|
||
Backout by cbook@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6cb3fe45e731 Backed out changeset 68be44803595 for further investigations after failures when this landed
Assignee | ||
Comment 21•8 years ago
|
||
gps, i think the actual filesystem access is good. it's set here: https://github.com/mozilla-releng/OpenCloudConfig/blob/593a69f40311e4b9fc3b3e9213a9126fa307c11f/userdata/Manifest/gecko-1-b-win2012.json#L1041-L1057 several try pushes succeeded without issue as well as the autoland merge (https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=68be448035953cf34c99cc642374ea54754c86e2). it was a subsequent autoland commit that barfed when an instance (i-01c9e735154d130a7) which had already performed a successful robustcheckout on the preceding task attempted the next robustcheckout. so i guess at that point y:\hg-shared had been created already. i suspect the error above is just a race condition or file lock issue that could be fixed with a wait/retry loop in robustcheckout. do you have any suggestions?
Flags: needinfo?(rthijssen) → needinfo?(gps)
Reporter | ||
Comment 22•8 years ago
|
||
I don't think it is a race condition or file lock issue, as the stack indicates the permissions failure occurs when opening a file in append mode. This is after a write lock has been acquired. I think this is your run of the mill file permissions issue. The file in question (.hg/store/00changelog.i) isn't writable by the 2nd task. IIRC TaskCluster on Windows creates a new user for each task. So my guess is this new user doesn't have write permissions on hg files created by a previous user. Someone will have to poke at the filesystem of a worker to see what permissions in y:\hg-shared look like after a task performs an initial clone.
Flags: needinfo?(gps)
Reporter | ||
Comment 23•8 years ago
|
||
needinfo grenade to grab more filesystem details from Windows worker. At this point, I just want details. We can discuss options once we see what's happening.
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 24•8 years ago
|
||
to check permissions after initial clone, i modified a task (https://tools.taskcluster.net/task-inspector/#O306m3mcQlKQTIzLS-Ft3Q/) from one of the earlier try commits, using tc task inspector, to perform the clone then run: `icacls y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store` the output was: [taskcluster 2016-10-17T12:20:36.972Z] Executing command 1: icacls y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store Z:\task_1476704133>icacls y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store BUILTIN\Administrators:(I)(OI)(CI)(F) NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F) I-0606177769CDC\task_1476704133:(I)(F) CREATOR OWNER:(I)(OI)(CI)(IO)(F) BUILTIN\Users:(I)(OI)(CI)(RX) BUILTIN\Users:(I)(CI)(AD) BUILTIN\Users:(I)(CI)(WD) Successfully processed 1 files; Failed processing 0 files [taskcluster 2016-10-17T12:20:37.193Z] Exit Code: 0
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 25•8 years ago
|
||
problem was indeed a run of the mill file permissions issue. the access rights for y:\hg-shared were not correctly applied during the dsc run on windows worker types due to a syntax error in the config. i reran the test today on the corrected environment and got acls that include the "everyone (f)" acl: y:\hg-shared\8ba995b74e18334ab3707f27e9eb8f4e37ba3d29\.hg\store Everyone:(I)(OI)(CI)(F) BUILTIN\Administrators:(I)(OI)(CI)(F) NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F) I-0EA2E9221E801\task_1476860601:(I)(F) CREATOR OWNER:(I)(OI)(CI)(IO)(F) BUILTIN\Users:(I)(OI)(CI)(RX) BUILTIN\Users:(I)(CI)(AD) BUILTIN\Users:(I)(CI)(WD) https://tools.taskcluster.net/push-inspector/#/O306m3mcQlKQTIzLS-Ft3Q/GnvrY6q7Sw-Rcodw2Thb-w?_k=e4siyj please can we re-land the patch (environment is now corrected)?
Keywords: checkin-needed
Comment 27•8 years ago
|
||
Pushed by gszorc@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d99d0f49ac4a use robustcheckout with ephemeral storage (tc windows); r=gps
Keywords: checkin-needed
Reporter | ||
Comment 28•8 years ago
|
||
The logs don't seem to be updating when running `hg robustcheckout`. This is likely a combination of: a) progress extension not properly configured b) need to use PYTHONUNBUFFERED=1 to force Python/Mercurial to not buffer stdout
Reporter | ||
Comment 29•8 years ago
|
||
progress is working. It looks like the live.log streaming isn't working. The output shows up once the task has completed. There is a problem with unordered stdout/stderr. For that, we'll need PYTHONUNBUFFERED=1. I may just fix that in robustcheckout.py since multiple consumers have this problem.
Comment 30•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/d99d0f49ac4a
Assignee | ||
Comment 31•8 years ago
|
||
PYTHONUNBUFFERED set on all gecko windows worker types https://github.com/mozilla-releng/OpenCloudConfig/commit/333c9a4058bee679c7131cfc587e2cfa3603970f
Assignee | ||
Comment 32•8 years ago
|
||
note that there was an issue on try workers (gecko-1-b-win2012) not getting hg-shared permissions set due to dsc being switched off on worker spot instances. level 2 and 3 repos (like m-c, autoland, etc) are not affected because they use different workers (gecko-2-b-win2012 and gecko-3-b-win2012). i've corrected the problem here: https://github.com/mozilla-releng/OpenCloudConfig/commit/33f01c828ef6ab25bcf5258eba09e8ef806768e1 and am waiting for the ami rebuild in order to fix tc win builds on try in an hour or so.
Reporter | ||
Comment 33•8 years ago
|
||
(In reply to Rob Thijssen (:grenade - GMT) from comment #31) > PYTHONUNBUFFERED set on all gecko windows worker types > https://github.com/mozilla-releng/OpenCloudConfig/commit/ > 333c9a4058bee679c7131cfc587e2cfa3603970f Can you please back this out? Setting PYTHONUNBUFFERED globally will affect all Python processes, including thousands of Python processes invoked by the build system. While we do want many processes to be unbuffered, making processes fully unbuffered will result in the process handling stdout/stderr to process a lot more write() calls. This can result in significant overhead. We're currently hacking PYTHONUNBUFFERED into `hg robustcheckout` calls in a number of places. I'll just modify the robustcheckout extension to unbuffer.
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 34•8 years ago
|
||
done: https://github.com/mozilla-releng/OpenCloudConfig/commit/e5382d08d66e1c455da4f7a0d74af01575bbb5c9
Flags: needinfo?(rthijssen)
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•